Misconceive Early, Misconceive Often

August 4, 2004

While XML's noblest and best are away directly injecting their brains with mind-altering drugs at Extreme Markup, there's still plenty of food for thought for the rest of us from last week's XML mailing lists and weblogs.

Yes, weblogs. Although this column has traditionally been dedicated to mailing lists, I will in the future bring in debates of interest where they occur in weblogs as well, as they form just as much a part of today's XML developer debate as mailing lists.

In the vague struggle to find a unifying theme for this week's reports, I note that misconceptions and mistrends established early in a technology's lifetime tend to hang around and cause confusion for an uncomfortably long time. In XML's case I focus on the text/xml content type and, in RDF's case, on the notion that everything must have a URI.

Mark Pilgrim: How Wrong Can He Be?

In the most recent installment of his column for XML.com, Mark Pilgrim made the case that XML on the Web has failed. Put briefly, his contention is that XML delivered by web servers is a mess, due to disrespect of content types and character sets. Pilgrim concludes:

Multiple specifications and circumstances conspire to force 90% of the world into publishing XML as ASCII. But further widespread bugginess counteracts this accidental conspiracy, and allows XML to fulfill its first promise ("publish in any character encoding") at the expense of the second ("draconian error-handling will save us").

All else aside, when an English-speaking American talks about "ASCII" and "90% of the world" you know you're in for trouble. In the comments at the bottom of Pilgrim's article are responses enough, including some from Liam Quin, W3C's XML Activity Lead, and Murata Makoto. The latter is co-author of RFC 3023, which defines content types for XML, and a new Internet Draft that goes even further and deprecates the troublesome text/xml content type. Murata-san boils down Pilgrim's article to the pithy observation that "I18N is hard."

However, it's not those comments on which I wish to focus for now. (Loyal XML.com follower as you are, dear reader, you will have seen those already.) Rick Jelliffe was so moved by Pilgrim's piece that he wrote a lengthy rebuttal in his weblog, declaring the "usually excellent" Pilgrim to be "wrong in every way."

... he uncovers the horrible fact that XML is unreliable when used unreliably. Knock me down with a feather! Mr. Pilgrim could correct his article by substituting "text/xml" for "xml" in most places ...

Indeed it is the pernicious text/xml content type that lies at the heart of the problem. Jelliffe offers some rebuttals to Pilgrim's point, and the first of these is a reiteration that "XML should not be served as text/xml", along with some suitable historical justification of that. It is a shame that somehow the message didn't get through, and I personally suspect that the root of this text/xml content type proliferation can be found in the rendering functionality offered by web browsers to documents with this content type. Yet before we blame the browsers too much, consider also what they had to work with.

Now Jelliffe picks up on the ASCII point -- it's simply too good to miss, after all.

If I sampled a Japanese aggregator and then said, "because none of the sources use ASCII, ASCII has failed!" people would be surprised. Of course when Mark looks at a site with almost entirely U.S. sources, he finds a very high incidence of ASCII data: this has absolutely nothing to do with the characteristics of XML.

Jelliffe finally lands the all-too-obvious punch invited by generalizing RSS as a bellwether of XML:

That 20% of RSS is badly tagged is hardly surprising, and no reflection on XML either ... Putting aside the bogosity of making a conclusion about all XML from one application (RSS), XML simply has never promised that you can publish reliably in any arbitrary encoding.

In the rest of his weblog entry, Jelliffe describes how he and the specifiers of XML 1.0 went to great lengths to solve the issues around encoding. It is definitely worth a read. As Jelliffe notes, Pilgrim is himself a controversialist, and watching the rebuttals as a bystander is enjoyable too. However, this discussion is worth underlining for the tablet-of-stone edict that must by now be painfully obvious. It can be found in Jelliffe's summing up (emphasis mine):

... you can lead a horse to water, but you cannot make it drink. Use application/xml not text/xml. If you don't know what encoding your system uses, and so you don't know what encoding declaration to use, force it to UTF-8 and remove guesswork from the equation.
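
Jelliffe's advice can be sketched in a few lines of Python. This is only an illustration, not anything from his weblog: the helper name serialize_utf8 is hypothetical, and it assumes Python 3.8 or later, where ElementTree's tostring accepts xml_declaration. The document is serialized as UTF-8 with a matching encoding declaration and labeled application/xml, so the HTTP header and the document agree.

```python
import xml.etree.ElementTree as ET


def serialize_utf8(root):
    """Return (headers, body) for an XML response forced to UTF-8.

    Hypothetical helper for illustration only.
    """
    # Serialize to bytes as UTF-8 and write an explicit XML declaration,
    # so the encoding is stated inside the document itself.
    body = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    # Label the response application/xml (not text/xml) and repeat the
    # same charset, so header and declaration cannot disagree.
    headers = {"Content-Type": "application/xml; charset=utf-8"}
    return headers, body


if __name__ == "__main__":
    title = ET.Element("title")
    title.text = "日本語"
    headers, body = serialize_utf8(title)
    print(headers["Content-Type"])
    print(body.decode("utf-8"))
```

How the headers and body reach the client depends on the server or framework in use; the point is only that both the media type and the encoding are chosen deliberately rather than left to defaults.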

IFPs Are the New URIs?

An interesting discussion on the RDF Interest list this week revolves around a three-letter acronym that may be new to some. Inverse functional properties, or IFPs, are those for which any given value belongs to at most one subject, so the value uniquely identifies whatever has it. For instance, one's social security number is an IFP: in principle, no two individuals can possess the same SSN.

IFPs have found practical use in the FOAF project as a way of identifying people. RDF's usual convention for identifiers is the URI, but it is obvious that there are things that do not have URIs (such as people) and no global way of ever agreeing on a URI scheme for them. That point alone has been enough to put many people off RDF. However, in such cases, reference-by-description using IFPs seems to provide a good solution.
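
To make reference-by-description concrete, here is a minimal Python sketch of merging descriptions that share an IFP value. It is an illustration under stated assumptions, not FOAF's or any RDF toolkit's actual code: descriptions are plain dictionaries rather than RDF graphs, and the property names "mbox" and "ssn" and the function name smush are hypothetical.

```python
from collections import defaultdict

# Hypothetical property names treated as inverse functional in this sketch.
IFPS = {"mbox", "ssn"}


def smush(descriptions, ifps=IFPS):
    """Merge descriptions that share a value for any inverse functional property.

    Each description is a dict mapping a property name to a set of values,
    standing in for an unnamed (blank) node. Returns one merged dict per
    inferred individual.
    """
    # Union-find over description indexes, so merges are transitive.
    parent = list(range(len(descriptions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Remember which description first used each (property, value) pair.
    first_seen = {}
    for i, desc in enumerate(descriptions):
        for prop in ifps:
            for value in desc.get(prop, ()):
                key = (prop, value)
                if key in first_seen:
                    # Same IFP value: both descriptions denote one individual.
                    parent[find(i)] = find(first_seen[key])
                else:
                    first_seen[key] = i

    # Pool all property values of each group into a single description.
    merged = defaultdict(lambda: defaultdict(set))
    for i, desc in enumerate(descriptions):
        for prop, values in desc.items():
            merged[find(i)][prop] |= set(values)
    return [dict(props) for props in merged.values()]


if __name__ == "__main__":
    people = [
        {"mbox": {"mailto:alice@example.org"}, "name": {"Alice"}},
        {"mbox": {"mailto:alice@example.org"}, "homepage": {"http://example.org/alice"}},
        {"mbox": {"mailto:bob@example.org"}, "name": {"Bob"}},
    ]
    for person in smush(people):
        print(person)
```

None of the three descriptions carries a URI of its own; the first two are merged purely because they share a mailbox value.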

On RDF Interest, Phil Dawes asks if inverse functional properties are the new URI, and what this may mean:

I can see an instant benefit in doing this -- end users don't need to worry about the problems of minting URIs, maintaining them, etc.

Is this the way RDF is going -- URIs for the schema, BNodes with InverseFunctional properties for the instance data? What are the consequences?

Dan Brickley, co-creator of the FOAF project, provides an answer and an interesting timeline on how reference-by-description techniques have emerged in RDF processing:

FOAF encouraged an emphasis on reference-by-description techniques. OWL subsequently gave us a way of expressing simple reference-by-description strategies in a machine-readable way.

Rob McCool and Guha in their TAP work take a similar line, advocating reference-by-description as a useful strategy for merging web data. http://tap.stanford.edu/tap/rbd.html http://tap.stanford.edu/sw002.html.

Brickley also relates reference-by-description to the cult of URIs:

I think in the early days of RDF there was something of a fairytale quality to the way URIs were perceived -- basically a myth that all interesting and description-worthy things will have well-known URIs. FOAF and reference-by-description in general shouldn't be taken as an attack on URIs as such, but as advocacy that other techniques are useful too ...

Of course, as the text/xml trouble shows us, early misconceptions and attitudes persist for a long time.

Births, Deaths, Marriages

The latest announcements from XML-DEV.

OASIS TC Call for Participation: Int'l Health Continuum TC: A new OASIS technical committee is being formed to address standards across the healthcare world, promoting best practices, cooperation, and adoption of OASIS specifications.
Public Review of OASIS WSS TC spec: Two web-services security drafts are now available for public review: REL Token Profile and SAML Token Profile.
XQuery Tutorial on YukonXML.com: Here's an XQuery tutorial on a Microsoft XML-focused site. Except, of course, you need to use SAXON to get the latest XQuery implementation...
TagSoup 0.9.5: The latest release of John Cowan's HTML-in-the-wild SAX-style parser. Includes bug fixes.

Scrapings

The truth about "XML certification" ... Len does it like they do on the Discovery Channel ... mails to XML-DEV last week 67, Len rating 9% (sadly depleted) ... bad signs in the Times ... advice to Adf Dsfa, never mind the validation, just bash the keyboard in frustration.