What's in a Name?
This week, XML-Deviant looks at an XML-DEV discussion on the best practices for identifying XML resources, then wonders why more developers aren't taking advantage of entity management systems.
Correctly naming resources and objects is widely regarded as one of the most difficult problems in computing (another being caching). As the saying goes, any problem in computing can be solved by adding another level of indirection. One step toward solving naming problems is to add indirection by separating the name of the resource from its address. This is a common pattern, which we see in a number of areas from pointers in C to Persistent URLs (PURLs) on the web.
XML 1.0 offers a separation between the naming and addressing of resources or entities referred to in XML documents. Broadly speaking, a SYSTEM identifier gives the address of an actual resource, which is dereferenced to retrieve the entity in question. A PUBLIC identifier simply gives a name for the required resource. It says nothing about where that resource may be found.
Of course life isn't really that simple, and it's likely that some readers are already objecting. The short but heated XML-URI debate earlier this year testifies to the disagreement on this issue. A SYSTEM identifier is specified as a URI, which can just as easily be a Uniform Resource Name (URN) as the more commonly found URL. A URN is more like a PUBLIC identifier, as it simply names the resource in question. Yet there is still no widely deployed means of using URNs.
This glosses over disagreements about whether a URI is actually a name or address, a completely different debate. For most purposes, this distinction is probably the most useful: A SYSTEM identifier is an address, a PUBLIC identifier is a name.
We've covered some of these issues previously in the XML-Deviant (see "Filling In The Gaps"), when a discussion about identifiers took place on XML-DEV back in April. The advice given then was to provide PUBLIC identifiers with your documents and maintain a local catalog of identifiers (i.e., store the addresses associated with those PUBLIC names). These cataloging facilities are often referred to as "entity management systems," as they can do more than just provide a look-up table of names and addresses.
Always keeping a keen eye on interoperability issues, Simon St. Laurent observed that many parsers fail if they cannot properly dereference a SYSTEM identifier.
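As a rough illustration of the look-up table at the heart of such a catalog, here is a minimal Python sketch. The `CATALOG` contents, the local path, and the `resolve` function are hypothetical names invented for this example; real entity managers read their mappings from catalog files and do considerably more. The whitespace normalization follows the matching rule XML 1.0 gives for public identifiers.

```python
from typing import Optional

# Hypothetical catalog: PUBLIC identifier (a name) -> local address.
CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "/usr/local/share/dtd/xhtml1-strict.dtd",
}


def resolve(public_id: Optional[str], system_id: Optional[str]) -> Optional[str]:
    """Return the address to fetch for an entity.

    A known PUBLIC identifier wins; otherwise fall back to the
    SYSTEM identifier supplied by the document.
    """
    if public_id is not None:
        # XML 1.0 requires whitespace in public identifiers to be
        # normalized before a match is attempted.
        normalized = " ".join(public_id.split())
        if normalized in CATALOG:
            return CATALOG[normalized]
    return system_id
```

With this arrangement the SYSTEM identifier in a document can vary, or even point at an unreachable host, and a known PUBLIC name still leads to the local copy.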
SYSTEM identifiers, or more properly, the SystemLiteral which contains the content of the SYSTEM identifier, are defined as URIs, conforming to RFC 2396. These URIs are "meant to be dereferenced to obtain input for the XML processor to construct the entity's replacement text."
In common practice, that's meant using URLs, typically HTTP-based URLs. Validating (and some non-validating) XML parsers tend to report errors when they can't retrieve the content referenced by a SystemLiteral, since effectively it means that they can't validate the document.
SYSTEM identifiers are, therefore, a possible failure point in your XML application. Norm Walsh recommended using an entity management system together with PUBLIC identifiers to improve robustness.
At the very least, you should use a PUBLIC identifier as well since that allows an entity manager to do the right thing even in the presence of varying system identifiers.
My hope is that XML parsers will make sure that they have entity resolvers that allow the local parser to match URIs used in the parsing process, thus ensuring that parsers don't need access to a network in order to be able to work. It seems kind of problematic to me to require that your parser is part of a network in order to use the DTD that you have locally...
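The kind of entity resolver described here can be sketched with the SAX interfaces in Python's standard library. This is only an illustration under stated assumptions: the `LOCAL_ENTITIES` table, the `-//Example//TEXT Notice//EN` identifier, the `example.invalid` URL, and the class and function names are all invented for the example. The resolver answers from local content when it recognizes the PUBLIC name, so the unreachable SYSTEM address is never contacted.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical local store: PUBLIC identifier -> entity content kept locally.
LOCAL_ENTITIES = {
    "-//Example//TEXT Notice//EN": b"<notice>Served from the local catalog.</notice>",
}

# The SYSTEM identifier deliberately points at a host that cannot resolve.
DOCUMENT = b"""<!DOCTYPE doc [
  <!ENTITY notice PUBLIC "-//Example//TEXT Notice//EN"
                         "http://example.invalid/notice.xml">
]>
<doc>&notice;</doc>"""


class CatalogResolver(EntityResolver):
    """Resolve entities by PUBLIC name, falling back to the SYSTEM address."""

    def __init__(self):
        self.requests = []  # (publicId, systemId) pairs the parser asked for

    def resolveEntity(self, publicId, systemId):
        self.requests.append((publicId, systemId))
        content = LOCAL_ENTITIES.get(publicId)
        if content is None:
            return systemId  # unknown name: let the parser fetch the address
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(content))
        return source


class TextCollector(ContentHandler):
    """Collect all character data seen in the document."""

    def __init__(self):
        super().__init__()
        self.text = []

    def characters(self, content):
        self.text.append(content)


def parse_offline(document):
    resolver = CatalogResolver()
    collector = TextCollector()
    parser = xml.sax.make_parser()
    # External general entities are off by default in current Python versions.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(resolver)
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(document))
    return "".join(collector.text), resolver.requests
```

Calling `parse_offline(DOCUMENT)` returns the text of the locally held entity along with the identifiers the parser requested, without any network access.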
Potential failures therefore encompass not only incorrect identifiers but also the possibility that a resource is unavailable. The Internet operates on a best-effort basis. Is this really acceptable for a mission-critical XML application?
Free tools for entity management have been available for some time, as we have previously reported. Yet few developers seem to use them.
What approach should you take to achieve the greatest degree of interoperability: SYSTEM or PUBLIC identifiers? Simon St. Laurent advocated using both as best practice.
This is the approach the W3C takes with XHTML. I'd suggest this makes more sense than the alternatives, since the PUBLIC identifier allows processors which support entity resolution to use it (as Mozilla does with XHTML) but provides a canonical URL which developers who've never heard of entity resolvers (lots of them) can still use.
...For DTDs and schemas, resolvability really matters. I'd stick to the combination of a public identifier and a 'guaranteed' URI..., but make clear that the public identifier is the critical piece and the SystemLiteral is only provided for backup.
Among other things, that would let people without entity resolvers point to local URLs while still identifying the document with the right PUBLIC identifier.
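To make the XHTML convention concrete, the sketch below reads both identifiers out of a document type declaration using Python's `xml.dom.minidom`. The DOCTYPE shown is the familiar XHTML 1.0 Strict declaration; the `identifiers` helper and the sample document body are invented for illustration.

```python
from xml.dom import minidom

# An XHTML 1.0 Strict document carrying both a PUBLIC name and a SYSTEM address.
XHTML_DOCUMENT = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body><p>Hello</p></body>
</html>"""


def identifiers(document):
    """Return (public_id, system_id) from the document type declaration."""
    doctype = minidom.parseString(document).doctype
    return doctype.publicId, doctype.systemId
```

A processor with an entity resolver can key on the first value, while one without can still fall back to the second.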
One might wonder why these kinds of problems aren't surfacing daily. It may be that the lack of runtime validation in many applications means that remote resources are not being retrieved. Tim Bray noted that retrieving DTDs and schemas is an infrequent operation.
...across the universe of XML processing, the proportion of times that the DTD or schema actually gets fetched is pretty small; for example, your average XHTML agent is not going to go chasing after DTDs in the course of displaying web pages, and your average b2b code probably doesn't do a lot of DTD munging.
In Bray's opinion, SYSTEM identifiers are the better option, although this doesn't preclude the use of entity management systems: they just naturally grow to encompass caching of retrieved resources, etc. For example, W3C XML Schema includes a schemaLocation attribute which
...provides hints from the author to a processor regarding the location of schema documents.
...Note that the schemaLocation is only a hint and some processors and applications will have reasons to not use it; For example, an HTML editor may have a built-in HTML schema.
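Because schemaLocation is only a hint, a processor is free to read it and then decide for itself where to look. A minimal sketch of the reading half, assuming the attribute's usual form of whitespace-separated namespace/location pairs; the `orders` vocabulary, the `example.invalid` URLs, and the `schema_hints` helper are invented for this example:

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical instance document carrying a schemaLocation hint.
INSTANCE = """<order xmlns="http://example.invalid/orders"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://example.invalid/orders
                           http://example.invalid/schemas/orders.xsd"/>"""


def schema_hints(document):
    """Return {namespace: hinted schema location} from the root element."""
    root = ET.fromstring(document)
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    # Tokens alternate: namespace name, then a location hint for it.
    return dict(zip(tokens[0::2], tokens[1::2]))
```

An application with a built-in or locally cataloged schema for a namespace could simply ignore the hinted location and use its own copy.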
If and when generic XML browsers start appearing, we may see these issues occurring more frequently: such applications will have to flexibly handle new document types, namespaces, and schemas as they are delivered, most of which aren't likely to be built-in. A properly-layered XML application will allow an entity management system to be plugged in to support retrieval of any required resource.