Filling in the Gaps
April 12, 2000
If there's been an underlying theme to XML-Deviant over the last few weeks, it's been about feeding back real-world experience into the standards process. The XML community has begun highlighting gaps in the XML standards framework that need filling. This week's column is no different -- XML-DEV has again provided a forum for highlighting problems. The results have been interesting, and range from taking steps to improve interoperability to calling for a new version of XML.
... if you use public identifiers within your own organization, that's perfectly OK, but if you want to interchange XML documents with anybody external, they have the right to demand, and you have the obligation to provide, a working system identifier (URI) for each external entity.
The above is a note, "Public Identifiers Are Non-Portable," taken from the Annotated XML Specification. It outlines the obligation for document authors to provide useful System Identifiers in their documents. A System Identifier is a URI: practically speaking, this is generally a file path or a URL. System Identifiers are used to specify the location of a DTD, external entities, etc. A Public Identifier simply names an external resource; it doesn't define how to locate it. XML provides support for Public Identifiers but doesn't define their format, or how they are meant to be used.
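The distinction is visible in a typical XHTML DOCTYPE declaration, where both identifiers appear side by side:

```xml
<!-- The PUBLIC identifier names the DTD; the SYSTEM identifier
     (the URL) tells a parser where it can actually be fetched. -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
```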
If you think about it for a while, you'll realize that there's a big problem here. To fulfill your obligation as an XML document author you must provide a working system identifier for your documents. If you decide to use a file identifier, then you've tied your document to a particular directory structure. This has obvious portability problems. If you use a URL then you gain portability, but at the cost of requiring your users to have an open Internet connection if they wish to validate or manipulate your document's content.
Peter Murray-Rust encountered the problem whilst processing XHTML documents, and observed that
... the file I have created can only be processed as XML if:
(a) I am connected online
(b) the W3C maintain for all time a means of dereferencing either the [Public Identifier] or the URL
This is an interoperability problem that will affect anyone exchanging XML documents across the Web. Luckily it's a problem with a known solution, stemming from the SGML world. John Aldridge pointed the way:
There's a perfectly good answer to this -- dereference the PUBLIC identifier in some local catalogue. Unfortunately, XML 1.0 makes this optional; and parsers don't do it.
Last time I checked, schemas have just the same problem -- they provide a syntax for looking up schema definitions other than by dereferencing a URL, but don't mandate that processors provide a mechanism for using it.
There was general consensus on XML-DEV that this was a shortcoming that needed fixing. David Megginson observed that SAX includes a hook that allows a catalog-based mechanism to be developed:
I know that it's not a general-purpose solution or even a particularly good one, but that's one of the reasons we provided the EntityResolver in SAX1.
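As a rough sketch of how that hook can be used (the class and the mapping table here are invented for illustration, and are not taken from any of the catalog packages discussed below), an application can install an EntityResolver that consults a local table of public identifiers before the parser ever touches the network:

```java
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// A minimal catalog-like resolver: maps public identifiers to local
// files, and falls back to the parser's default behaviour (by
// returning null) when no mapping exists.
public class LocalCatalogResolver implements EntityResolver {

    private final Map<String, String> catalog = new HashMap<>();

    public void addMapping(String publicId, String localPath) {
        catalog.put(publicId, localPath);
    }

    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        String localPath = catalog.get(publicId);
        if (localPath != null) {
            InputSource source = new InputSource(new FileReader(localPath));
            source.setPublicId(publicId);
            source.setSystemId(systemId);
            return source;
        }
        return null; // let the parser dereference the system identifier itself
    }
}
```

The resolver is registered once, with `parser.setEntityResolver(new LocalCatalogResolver())`, and is consulted for every external entity the parser decides to read.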
Some parsers already provide this feature. Gopinath M.R. showed that the Xerces parser uses catalogs. Xerces' support for catalogs is based on John Cowan's draft XMLCatalog specification, which is itself derived from the SGML Open Catalogs (OASIS TR9401) standard. In a moment of synchronicity, Paul Grosso and Norman Walsh were able to leap to the rescue, announcing open source Java classes for processing both types of catalog:
These Java classes implement the OASIS Entity Management Catalog format as well as an XML Catalog format for resolving XML public identifiers into accessible files or resources on a user's system or throughout the Web. These classes can easily be incorporated into most Java-based XML processors, thereby giving the users of these processors all the benefits of public identifier use.
The classes are currently being submitted to OASIS. Norman Walsh has written an excellent article, "If You Can Name It, You Can Claim It!", which discusses how the classes can be integrated into Java code. This is a first step towards increasing the portability of Public Identifiers.
This success aside, it soon became clear that the issue of parser behavior in response to particular XML features is a much wider problem.
It's a commonly held perception that when you process an XML document you only need a DTD when you're using a validating parser. However, this glosses over some awkward details that can cause validating and non-validating parsers to have very different interpretations of your document. In fact, the situation is worse, as you can't even guarantee that two non-validating parsers have the same view of your data. It all depends on how the parser follows the XML 1.0 specification.
A number of features in the specification are optional, with additional leeway granted to non-validating parsers. One example of this flexibility is the retrieval of external entities: a non-validating parser is not obliged to retrieve them. If it does, it may also have to retrieve the DTD in which the entity is declared; if it doesn't, the entity simply won't be expanded during parsing.
If a skipped entity contains significant content, the parser presents only a partial view of the data. This again poses problems for interoperability, as two users of the same XML document may see very different views of it. (And this is aside from the issue of how the entity is catalogued and identified.)
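The effect is easy to reproduce with a SAX2-style parser that lets entity handling be toggled. In this sketch (the class is invented for illustration; the feature URI is SAX2's standard name for the behaviour, though whether a given parser honours it is implementation-dependent), the same document yields different character content depending on the setting:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: SAX2 exposes external entity retrieval as a named
// feature, so two conformant configurations can legitimately present
// different views of the same document.
public class EntityBehaviourDemo {

    public static String textContent(String xml, boolean expandExternal)
            throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
                                           .newSAXParser()
                                           .getXMLReader();
        reader.setFeature(
            "http://xml.org/sax/features/external-general-entities",
            expandExternal);

        final StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

With the feature disabled, the parser reports the unexpanded entity via `skippedEntity` and the referenced text never reaches the application.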
Peter Murray-Rust was again able to illustrate the problem:
... a good example of the problem is the use of external entities in SVG with the Adobe plugin. [The plugin is excellent - it's just that whatever machinery it uses doesn't expand external entities. Presumably even though it may be used with a DTD, this doesn't trigger the activity.] If you include your picture as entities it doesn't get displayed!
David Megginson observed that the safest way to distribute documents is to normalize their content:
The safest approach seems to be to distribute XML documents normalized (all external general entities expanded and all default attribute values filled in) and without a DOCTYPE declaration.
Obviously, this limits the possibilities for allowing an XML document to reference external information, as well as potentially introducing some data management problems. For example, if you normalize your documents to include some standard boilerplate text, what happens when you want to revise the wording?
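One rough way to produce such a normalized copy is a JAXP identity transform, since the parse step expands general entities and fills in attribute defaults declared in the DTD. (The class name here is invented; note also that whether the serializer re-emits a DOCTYPE varies between JAXP implementations, so a production version may need to strip it explicitly.)

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch of Megginson's advice: parse the document (expanding entities
// and applying defaulted attributes), then serialize the result so the
// distributed copy no longer depends on its DTD being retrievable.
public class Normalizer {

    public static String normalize(String xml) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new StreamSource(new StringReader(xml)),
                           new StreamResult(out));
        return out.toString();
    }
}
```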
Tim Bray concluded that in this case, the XML 1.0 specification should have included stricter requirements on processors:
My conclusion is that XML 1.0 should have made it compulsory for processors, if they read external entities, to allow this behavior to be selected and disabled. And since every remotely-plausible XML parser turns out to be able to read external entities, we might as well have made that ability compulsory.
The debate continued by exploring ways of resolving the issue. Peter Murray-Rust suggested that an effort be made to identify all the "gray areas" in the specification:
My invitation is for someone or some group to describe exhaustively what the problem(s) actually are. It might be that we can then all agree on appropriate behaviour under every combination. In that case the document would consist of a definition of conformance. Presumably a given parser could have a label stating its conformance to this document. And when XML-based software was produced which included a parser (and most will :-) then the behaviour of the parser should be stated.
Murray-Rust later refined his suggestion into an action plan:
There must be a relatively small finite number of combinations of parser behaviour. What I suggest we tackle is:
- an exhaustive examination of all parser behaviours consistent with the XML1.0 spec
- a clear tabulation and labeling of these
- the requirement that a parser announce which of these behaviours it supports
- the ability to select this behaviour
- a means for the author of a document to communicate which of these behaviours she expects the receiver to use in their parser.
Many developers were in agreement; the general solution was seen as some type of "Feature Manifest" that states which XML 1.0 features a given XML document uses. Simon St. Laurent observed that his XML Processing Description Language (XPDL) already provides some of these features:
XPDL describes a larger number of possible problems, and defines a format that resides outside of the actual document, but a lot of these issues - parameter entities, attribute defaulting, etc. - are already in there.
Thomas Passin was quick to draw a distinction between a feature manifest and parser configuration:
... a statement by a document that it requires a certain feature is DIFFERENT from a features manifest for a parser that lets you find features and perhaps turn them on or off. One is a property of a document, the other is a behavior of a piece of software.
It seems likely that these two concepts could be coordinated, or at least informed by each other. For example, they should use the same names for equivalent features, like "external-entities".
The issue is so far unresolved, although some progress is being made. Passin has already abstracted the relevant material from the XML 1.0 specification. Exactly how a Feature Manifest might be implemented -- as a separate declarative document or as a processing instruction -- has yet to be decided.
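Purely as an illustration of the processing-instruction route (the PI target and pseudo-attribute names below are invented; no such manifest was ever standardized), a document might declare its expectations up front:

```xml
<?xml version="1.0"?>
<!-- Hypothetical feature manifest: tells a receiving parser which
     optional XML 1.0 behaviours this document relies on. -->
<?xml-features external-entities="required"
               attribute-defaulting="required"
               parameter-entities="not-used"?>
<doc>...</doc>
```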
Michael Champion invited further debate on the relative merits of particular mechanisms, drawing links between feature manifests, schemas, and packaging:
... what I'd really like to see is a discussion of the relative merits of a Feature Manifest coded in a PI in an instance, in elements/attributes in a wrapper/package, and in the schema.
(Some discussion on XML Packaging was reported in an earlier XML Deviant, "Good Things Come in Small Packages.") It will be interesting to see how this progresses. Peter Murray-Rust seems particularly tenacious in wishing to solve the problem. His last success was in spurring on the early development of SAX, so watch this space!
The other aspect of this discussion involves the first product of the SML-DEV group. The SML-DEV mailing list is dedicated to discussion of simplification of the XML specification and its attendant standards. The group recently presented their ideas at the XTech conference. Their first public draft document, "Common XML," is a usage guide for XML developers wishing to maximize the interoperability of their XML data.
Common XML consists of a central "core" of features (which can be used safely) and a set of "extended" features (which include a number of drawbacks or caveats). The aim is to allow a developer to make an informed decision upon whether they should use individual features. The document also provides a useful starting point for developers approaching XML for the first time, as it highlights the "important bits."
Rick Jelliffe considered that neither Common XML nor a Feature Manifest would adequately address the core issues. Believing it was time for a rationalization of XML, he issued a Call for XML 1.1:
I am all for plurality and competition; but only in a manageable framework. This either requires some "features manifest" system or a rationalization of XML. I suggest that a rationalization would meet user's legitimate expectations better, and require less change to the XML Spec.
Jelliffe's message was quite lengthy, but to summarize: he believed that the subtleties of the XML specification, along with additional layered standards like XBase and XInclude, are defeating user expectations, and that rationalizing the framework would bring XML "back on track":
"Basic XML 1.1" would in fact be more complicated than the base-level XML 1.0 as it was originally released (no DTD parsing, but xml:include replaces entities and the namespace and xml:base support would be needed too.) But it would seem more simpler for newbies and put XML back on track. While I do not agree that there is a groundswell of opinion for a minimal XML, I have taught XML classes to hundreds of people now, and the consistent expectation is that there should be two modes: declarations or no declarations.
Michael Champion agreed that there were two clear options -- a Features Manifest or a rationalization effort -- but was more pessimistic about the likelihood of change:
... I strongly favor the "features manifest" solution. "Rationalization" would be an option only if the W3C dropped everything and focused on this problem, vendors sacrificed backwards compatibility for interoperability, implementors chose full conformance over performance and time-to-market...
Champion concluded that a features manifest is the only viable solution:
... the only realistic alternative is a way of negotiating contracts between producers and consumers of XML as to which of the character encoding schemes, XML features, and related standards one must support [in] order to participate in some XML sub-community. That is not as nice a vision as the one of universal interoperability that guided XML 1.0 and Rick Jelliffe's XML 1.1 proposal, but it seems far more achievable to me.
This debate is far from over, and is addressing some important issues. It seems likely that at some stage the XML specification will have to be revised, if only to address the rapid changes experienced in the last 12 months.