Truth in Advertising

March 12, 2003

In this week's column I will focus on two of the bread-and-butter issues of the XML development community: XML subsetting and XML namespace management. While both of these issues are among the permanent topics of conversation (that is, "permathreads") on the XML-DEV mailing list, this time around there are some interesting wrinkles that make reviewing the conversations worthwhile. The subsetting issue, in particular, raises interesting questions about the degree to which a restricted XML profile makes sense in resource-constrained computing environments, as well as the best way to go about documenting and specifying such a subset.

J2ME, JSR 172, and Subsetting XML

According to Elliotte Rusty Harold, Sun's J2ME Web Services Specification (JSR 172) proposes a subset of "JAXP, XML, and SAX" which has several failings, not least of which is that it is "not suitable for generic XML processing". JSR 172 allows, Harold said, "parsers to throw a SAXParseException when encountering a document type declaration, and to not support non-predefined entity references". In Harold's estimation, this is a result of the decision taken by SOAP's architects to "forbid the internal DTD subset".

The points to keep at the forefront in such cases concern interoperability: does subsetting XML hurt interoperability per se, does it matter who or what is employing the subset, does it matter what kind of subset is being used or proposed, and so on. The issue here is not so much that Sun has proposed a subset of XML for use in resource-constrained applications such as mobile phones. Rather, the issue is the more serious one of calling a thing an XML parser when it complains at seeing a document type declaration; or, more to the point, of calling a document a specification of an XML parser when it allows conformant implementations to reject a document type declaration.

That JSR 172 proposes a subset of XML is to be expected, after all. As Mike Champion pointed out, "Anyone following sml-dev three years ago would not be surprised to hear that vendors are subsetting XML for mobile, data-oriented applications. Where's the 'sin' here? What's a cellphone supposed to do with an external entity reference, or a notation declaration?"

By way of responding to Champion's question, Harold suggested that the real fault is the way JSR 172 allows parsers to handle (better: to refuse to handle) the document type declaration:

The sin is in forbidding the document type declaration. If they choose not to load any external entities, then that's blessed by XML. However, this does not give them freedom to reject documents that contain such things, or to drop out lexical features of XML such as default attribute values and internal entities declared in the internal DTD subset.
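
To make concrete what Harold means by lexical features declared in the internal DTD subset, here is a minimal sketch in Python. It is not taken from the XML-DEV thread: the document, the entity name `author`, and the function name `read_note` are invented for illustration, and Python's standard-library parser merely stands in for any full XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. Its internal DTD subset declares one internal
# entity and one default attribute value.
DOC = """<!DOCTYPE note [
  <!ENTITY author "E. R. Harold">
  <!ATTLIST note lang CDATA "en">
]>
<note>Posted by &author;</note>"""


def read_note(document):
    """Parse the document and return (text, lang) for the root element."""
    root = ET.fromstring(document)
    # The text has the entity reference already replaced, and the lang
    # attribute is supplied from the ATTLIST default even though the
    # <note> start tag never mentions it.
    return root.text, root.get("lang")


if __name__ == "__main__":
    print(read_note(DOC))
```

A parser that is allowed to reject the document type declaration could not report either value, so an application reading this document would see different data depending on which parser it happened to be running on.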

But surely we want XML-consuming and XML-producing applications that operate in resource-constrained environments, in which case it may make sense to exchange XML data that is also constrained to a specific subset of the full XML specification? There seems to be consensus that XML on, say, mobile phones is a good thing. But, as Harold again pointed out, those decisions should be reflected and implemented at the application level and not by crippling XML parsers (at least not, one presumes, while continuing to pretend they are still XML parsers):

This is not a decision that should be made at the parser level though. Parsers do need to process documents that contain document type declarations. No one should ship a parser that simply gives up when it encounters a document type declaration.

An application such as SOAP may decide it doesn't want to accept document type declarations, and reject documents that contain them, perhaps to avoid the billion laughs attack, perhaps for other reasons. I still think that's a bad idea, but it's not nearly as bad an idea as what's happening in JSR 172. This is turning up the subsetting a notch. Now the parser is making the decision to reject documents that contain document type declarations rather than the application using the parser. SOAP's mistake only affects SOAP. This affects everybody using that parser for any application.

In other words, insofar as JSR 172 specifies the Sun-blessed XML parser for the J2ME platform, it is inappropriate to push application and domain-specific subsetting questions down into the core XML parser for an entire platform. That is, as Harold suggests, a harmful way to make decisions about restricted profiles of XML.
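
The distinction between a parser-level and an application-level decision can be sketched with a SAX-style API. The following is an illustration under stated assumptions, not the JSR 172 API or any SOAP toolkit: it uses Python's standard `xml.sax` module, and the names `RejectDoctype`, `DoctypeForbidden`, `TextCollector`, and `parse` are hypothetical. The parser itself stays a full XML parser; the application opts in to rejecting document type declarations by registering a lexical handler.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class DoctypeForbidden(Exception):
    """Raised by the application's policy, not by the parser."""


class RejectDoctype:
    """Lexical handler implementing a hypothetical SOAP-like policy."""

    def startDTD(self, name, public_id, system_id):
        raise DoctypeForbidden("DOCTYPE for <%s> not accepted here" % name)

    # The remaining lexical events are of no interest to this policy.
    def endDTD(self):
        pass

    def comment(self, content):
        pass

    def startCDATA(self):
        pass

    def endCDATA(self):
        pass


class TextCollector(ContentHandler):
    """Collects character data so the caller can inspect the result."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse(document, forbid_doctype):
    """Parse a document; only the caller decides whether DOCTYPE is allowed."""
    reader = xml.sax.make_parser()
    handler = TextCollector()
    reader.setContentHandler(handler)
    if forbid_doctype:
        reader.setProperty(property_lexical_handler, RejectDoctype())
    reader.parse(io.StringIO(document))
    return "".join(handler.chunks)
```

With this arrangement a generic application calls `parse(doc, forbid_doctype=False)` and gets internal entities expanded as usual, while a SOAP-like application calls `parse(doc, forbid_doctype=True)` and receives `DoctypeForbidden` for the same input. The restriction affects only the application that asked for it.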

There was some indication that the team responsible for JSR 172 is open to criticism from the XML development community. As Norm Walsh said,

I've spoken to the folks working on JSR 172 and I think they understand the distinctions to which Elliotte Rusty Harold alludes. They're building a SOAP processor for devices with a code footprint of something like 25kb. (*kilo*bytes). I think there's room for their spec to be clearer about the decisions they've made, why they've made them, and the ways in which the API they're exposing is intended to be used. And I think they're going to make those changes.

While the comment and review period for JSR 172 extends to 22 March, it's good to see that the people responsible for it are responsive to feedback from the community.

Because of XML's ubiquity (which is similar to Linux's ubiquity: they're both used everywhere from tiny handheld devices to big mainframes), it's not practical or smart (or even necessarily possible) to restrict it from being used in a wide range of applications and environments. But the truth-in-labeling issue is very important. If an application says that it is an XML parser, it should be able to parse XML, according to the W3C's XML Recommendation. As the XML-DEV discussion pointed out, that some XML parsers contain bugs which make them unable to parse every XML document in every situation is simply a fact of computer programming life. That JSR 172, a J2ME specification for XML parsers, subsets XML in such a way as to allow conformant implementations to reject well-formed XML is a very different and more troubling problem.

As Norm Walsh said, in the message of his I quoted earlier, the issue of profiles of XML is one that is being addressed by the W3C's TAG. And at the forefront of that issue is truth-in-labeling:

Perhaps the right answer is simply to say that a processor for the subset of XML defined by "foo" should be called a "foo processor" and not an XML processor.

The argument that "foo" isn't XML probably isn't very interesting from a purely practical standpoint. But maybe we can get everyone to agree to call a spade a spade.

This conversation, which took place in more than 500 messages over 3 weeks, is worth careful, detailed review if you are contemplating building a restricted profile (i.e., a "subset") of XML, whether for resource-constrained environments or for other specific applications.

Can the World Have a Namespace Registry?

Turning from a permathread to the ur-permathread, Jeff Lowery suggested a centralized registry (think DNS, domain name registrars, but for a simpler dataset) for namespace prefixes. As Lowery said,

The advantage of a registry is that prefixed names become universal names when prefixes are registered. There are no scope issues. The primary disadvantage of registration is that there will be a prefix rush. I don't see a dependency on access to the registry at parse time, unless there are resources to be associated with the prefix (such as a URI to a RDDL doc) that the parser needs.

The other disadvantages are all the well-known, primarily political and social, ones associated with centralized data registries. (And the troubles which come with any rush to register names should never be underestimated.) Building such a registry is, Daniel Veillard suggested,

a perfect way to prepare a robbery [of] the enslaved masses. Prepare your checks! Even if someone promises you whatever freedom now, the capitalist temptation of making money on any central authority is just impossible to resist in the long term.

No way [am] I going to accept the idea of coupling parsing of XML resources to the access of a registry outside the encoding and character ones. Parsing XML does not even require DNS resolving unless one want to access remote resources (and in general you try to avoid that when deploying).
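
For contrast with the registry proposal, here is a minimal sketch of how prefixes are resolved today, using only declarations inside the document and no outside lookup. The example URIs, the `invoice` element, and the function name `universal_name` are invented for illustration; Python's standard library is used here only as a convenient namespace-aware parser.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: different prefixes, same namespace URI.
DOC_A = '<a:invoice xmlns:a="http://example.com/billing"/>'
DOC_B = '<b:invoice xmlns:b="http://example.com/billing"/>'
# Same prefix as DOC_A, but bound to a different namespace URI.
DOC_C = '<a:invoice xmlns:a="http://example.org/shipping"/>'


def universal_name(document):
    """Return the root element's name in {namespace-uri}local-name form."""
    # The prefix is resolved from the xmlns declaration in the document
    # itself; nothing outside the document is consulted.
    return ET.fromstring(document).tag


if __name__ == "__main__":
    for doc in (DOC_A, DOC_B, DOC_C):
        print(universal_name(doc))
```

`DOC_A` and `DOC_B` yield the same universal name despite their different prefixes, while `DOC_A` and `DOC_C` yield different names despite sharing the prefix `a`. That scoping is what a registry would replace with globally fixed prefixes, and it is also why no registry access is needed at parse time under the current rules.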

Simon St. Laurent seemed to support the idea, in principle, even while noting that it's probably too late in practice to implement it. He also offered a gloss of Veillard's estimation of the tendency to profit from centralized registries. In response to Veillard's claim that the temptation to privatize and profit from such a registry would be irresistible, St. Laurent said,

In the current climate of "everything must be privatized", sure. We've seen how much damage business can do to an infrastructure while still letting it breathe. "Just impossible to resist in the long term" seems like far too great a stretch, however. I still don't take "capitalist temptation" to be inevitable.

I'm tempted to offer my own retort to St. Laurent, one which suggests that while he clearly doesn't find the "capitalist temptation" to be inevitable, there are many who are, one might say, less able to resist such a temptation. In a world where ICANN is the spectacular failure that it is, the possibilities of running a namespace registry indefinitely, and for the general benefit of all, that is, as a non-profit enterprise, seem very dim indeed. This may well be generally conceded among members of the XML development community, at least if St. Laurent's comment is at all representative:

If ICANN is in fact involved, I heartily agree [that a namespace prefix registry would be ruinous]. ICANN's catastrophic stupidity and cupidity don't by themselves mean that registries are a bad idea, provided that their keepers are actively (and genuinely) regulated.

While a centralized registry for namespace prefixes may be an ideal solution (though it has technical detractors who make good points), any such registry must exist in the actual world, which is full of political currents and climates in which a registry would be at best tenuous and fragile. Those simply aren't good properties in centralized registries, which need to be bulletproof and tamper-resistant. Even a modest registry needs an operating budget, and a public registry needs regulation and oversight. Given the continuing morass of IT spending and global economies generally, it's not clear one could arrange public monies to fund it. And given the climates of deregulation and privatization to which St. Laurent alludes, it's not clear, if the monies were arranged, that competent and relevant oversight could be arranged.

Let's set aside the complexities of a namespace registry: the XML development community has enough trouble securing a stable host for the XML-DEV mailing list, which seems to be undergoing subscriber problems. Enough high-profile XML "stars" -- such as Tim Bray and Lauren Wood -- have been getting silently unsubscribed, for reasons no one seems to know, that there is a growing movement afoot to relocate the list to a new host. Even when XML-DEV repeats itself, endlessly, there's just enough newness each time around to keep things interesting.