XML: A Disruptive Technology

June 21, 2000

Table of Contents

• XML's Impact on Current Internet Technologies
   ·HTTP
   ·MIME Types
   ·URIs
   ·Security and Infrastructure Issues
   ·HTML
• XML's Impact on Upcoming Internet Technologies
• Is It Time to Rebuild?

For all the talk of XML changing the world, I find more and more that XML is actually changing the Web technology on which it was meant to build. I have long hoped that XML would make an impact on the broader world, acting as a disruptive technology that improved all kinds of data exchanges. Yet it is also worth looking at what XML is doing to the Web itself and to Web technologies, and what it might be able to contribute to the Web and Internet infrastructures over the long term.

XML's Impact on Current Internet Technologies

While XML's developers definitely intended XML to be "SGML for the Web"--a simplified version that could fit easily into existing Internet and Web infrastructures--it's becoming clear that XML demands "adaptive reuse" of existing infrastructure, rather than the "simple reuse" I think many people would have preferred. XML's generic syntax allows it to take advantage of XML-specific infrastructures, from parsers to repositories, while its chameleon-like ability to carry any vocabulary is putting it into situations that go far beyond the typical delivery models used by Web and Internet applications.

This isn't good news for the many people who have already deployed infrastructure and have settled expectations about how that infrastructure should work. XML could be a serious problem for much of the existing Internet, as developers push the envelope on technologies that weren't built to host flexible content. Existing protocols work very well, up to a point, and at that point the troubles begin.

I'd like to illustrate my point with a few examples--we'll look at MIME content-type identifiers, URIs, HTTP, HTML, and some general security and infrastructure issues.

HTTP

The most obvious battle over adaptive reuse, and the one that likely has the most dollars at stake, is the current battle over HTTP. The arrival of XML-RPC--and particularly SOAP--has meant that developers are treating what was once a publishing protocol, with limited two-way communications, as a full-blown peer-to-peer communications protocol. While these protocols continue to use HTTP to transport documents, the contents of those documents have been given new status in the processing of requests and responses.
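To make that shift concrete, here is a minimal sketch in Python of what an XML-RPC call looks like on the wire: an ordinary HTTP POST whose body is an XML document naming a procedure and its arguments. The method name (examples.add), host (rpc.example.com), and path (/RPC2) are assumptions chosen for illustration, and nothing is actually sent over the network.

```python
import xmlrpc.client

# Hypothetical method name and arguments, used only for illustration.
payload = xmlrpc.client.dumps((41, 1), methodname="examples.add")


def describe_http_request(body, host="rpc.example.com", path="/RPC2"):
    """Return the text of an HTTP POST that would carry the XML-RPC body.

    The host and path are placeholders; no connection is opened.
    """
    data = body.encode("utf-8")
    headers = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: text/xml",
        f"Content-Length: {len(data)}",
    ]
    # Headers, a blank line, then the XML document as the entity body.
    return "\r\n".join(headers) + "\r\n\r\n" + body


request_text = describe_http_request(payload)

# The receiver recovers the procedure name and arguments from the document,
# not from the URI or the HTTP method.
params, method = xmlrpc.client.loads(payload)
```

From HTTP's point of view this is just a POST of a text/xml document; the meaning of the request lives entirely inside the document, which is the change the critics object to.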

This reuse has generated protest from a number of different quarters. Keith Moore, long-time IETF participant and (now-retired) IESG member, has written an Internet Draft, "On the use of HTTP as a Substrate for Other Protocols," setting rules for HTTP reuse that would more or less bar XML-RPC and SOAP in their current forms, requiring them to find new protocol names and ports.

Opposition to the extension of HTTP for remote procedure calls and object exchange generally rests on a number of claims that such usage breaks the understood rules for HTTP, even if the implementations conform strictly to the rules laid out for HTTP in the HTTP 1.1 specification. Effectively, some URIs that purport to be HTTP no longer accept "ordinary" HTTP requests, and support only a subset of the HTML forms-based GET and POST approaches to transmitting information from client to server.

It isn't clear that these protests will be heard--SOAP and XML-RPC don't have formal working groups presently at either the IETF or the W3C (although the W3C has discussed a protocol activity and sponsored a mailing list and several events). These concerns aren't limited to HTTP, either--SMTP, FTP, and nearly any other transport protocol capable of carrying XML information could receive similar reuse.

MIME Content-type Identifiers

After an enormous amount of rather difficult discussion (in which I participated), the ietf-xml-mime-types mailing list settled on a suffix (+xml) indicating that a specific media type, whatever else it may be, uses an XML syntax.

XML opens up new generic processing possibilities that weren't expected when MIME was originally created. Trying to integrate those possibilities with the existing MIME infrastructure is hard. MIME types are widely deployed, relatively well understood, and remarkably limited at the same time.

It is difficult to express simultaneously that an SVG document, for instance, is an image, an XML document, and uses the SVG vocabulary. Some processors may only care that it is an image, others (search engines, transformation tools, etc.) may care that it uses XML syntax, and still others may care that it is specifically an SVG document.

MIME only provides two layers of identifiers, while XML opens up the possibility of three or perhaps even more. While other approaches, notably content-negotiation, are capable of more sophisticated identification, they are still in development, rarely deployed, and require additional overhead.

By using a "+xml" suffix on MIME content-type identifiers, format creators can identify the XML syntax within formats of any kind without breaking the existing MIME infrastructure. Processors looking for a two-level identifier still receive a two-level identifier, while processors looking for further information about the use of XML syntax can look for the suffix. It's a convention, not a fundamental change to the structure of MIME types, but reaching consensus on it was difficult, and it is still a draft.
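A short Python sketch shows how a processor might read the convention. The function name and the returned dictionary are my own invention for illustration, not part of any draft, and the sample media types are examples only. The sketch also ignores types such as text/xml and application/xml, which use XML syntax without carrying the suffix.

```python
def describe_media_type(content_type):
    """Split a media type into its two MIME levels plus an XML-syntax flag.

    Hypothetical helper: it only checks for the "+xml" suffix convention.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    essence = content_type.split(";", 1)[0].strip().lower()
    top_level, _, subtype = essence.partition("/")
    return {
        "type": top_level,                       # first MIME level
        "subtype": subtype,                      # second MIME level
        "xml_syntax": subtype.endswith("+xml"),  # the suffix convention
    }


svg = describe_media_type("image/svg+xml")
png = describe_media_type("image/png")
```

A processor that only cares about images reads the "type" field and never looks further, while a generic XML tool checks "xml_syntax" and can ignore the rest.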

The draft for the next round of XML MIME types now provides a set of rules for identifying specific vocabularies that use XML syntax, as well as an appendix explaining why the suffix is useful, necessary, and interoperable.

URIs

Uniform Resource Identifiers (URIs) didn't have very much to do with XML originally, except as a syntax for retrieving resources like DTDs and external entities in SYSTEM identifiers. Those uses of URIs have been mostly uncontroversial, and have tended to dominate the alternative that XML also provides, public identifiers.

The more abstract usage of URIs as namespace identifiers has proven distinctly more complex and controversial (though so has almost everything in namespaces). While the usage of URIs (or at least the commonly used URL subset of URIs) is well understood in retrieval contexts, the usage of URIs in identification contexts is much foggier. For example, the rules used for HTTP 1.1 URL retrieval and comparison for caching purposes (absolutize, then ignore case in scheme and host names) may or may not be appropriate to URLs built on the HTTP protocol in XML namespaces.
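The gap between the two contexts is easy to demonstrate. The Python sketch below compares the same pair of URIs under a simplified version of the retrieval-style rule described above (absolutize, then ignore case in scheme and host) and under a literal character-for-character comparison, which is one candidate rule for namespace names. The function names, base URI, and example.com addresses are assumptions for illustration, and lowercasing the whole authority component is a simplification.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def http_style_key(uri, base):
    """Absolutize against base, then lowercase the scheme and authority."""
    absolute = urljoin(base, uri)
    parts = urlsplit(absolute)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),  # simplification: lowercases the whole authority
        parts.path,
        parts.query,
        parts.fragment,
    ))


def same_for_retrieval(a, b, base):
    """Retrieval-style comparison, roughly as a cache might apply it."""
    return http_style_key(a, base) == http_style_key(b, base)


def same_as_literal_strings(a, b):
    """Literal comparison: identical only if every character matches."""
    return a == b


BASE = "http://www.example.com/docs/order.xml"  # hypothetical document URI
first = "http://www.example.com/ns/invoice"
second = "HTTP://WWW.EXAMPLE.COM/ns/invoice"
```

Here same_for_retrieval(first, second, BASE) is true while same_as_literal_strings(first, second) is false, so two documents could be using "the same" namespace or two different ones depending on which rule a processor picks. Relative references make matters worse, because the retrieval-style answer then depends on the base URI of each document.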

When URIs first appeared, they (generally) referred to directly retrievable resources. The shift--in terminology, if not so much in practice--from Locators (URLs) to Identifiers (URIs) made this a bit more abstract. Not nearly as abstract, however, as the usage assigned them by the Namespaces in XML spec, which makes clear that they don't need to point to anything specific. The possibility that they might point to something has raised even more questions about namespaces, and there appears to be no consensus on the question of how to process relative URI references used as namespaces.

Public identifiers, an alternative to URIs, may be on the rise as well, as the Internet infrastructure provided by URIs isn't always available or reliable. Arbortext's release of some public-identifier processing tools makes it easier to use public identifiers. The XHTML 1.1 specification's inclusion of SGML Open Catalog Files also suggests that pure URI-based resolution isn't always the best practice.

Security and Infrastructure Issues

Even apart from the blurring of the data/instructions separation that some claim makes SOAP and its ilk dangerous, XML raises some new issues for both security and scale. David Megginson's keynote at XTech 2000 demonstrated that XML's reliance on external resources presents some significant new issues for document management and security, especially when such resources are shared.

More recently, Eric van der Vlist noted the potential impact of four HTTP requests to the W3C's servers every time a strictly conforming XHTML document passes through a vanilla XML 1.0 validating parser. While we hope they're set up for that kind of traffic, this could get interesting--and more so as XHTML 1.1's modularization work proceeds. This may prove to be an excellent reason to take advantage of the non-URI approach of using public identifiers to reference local copies of resources.
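As a rough sketch of that approach, the Python below maps a public identifier to a local copy and falls back to the system identifier (the URI) only when no local copy is known. The catalog is a plain dictionary rather than a real SGML Open catalog file, and the local path and function name are hypothetical.

```python
# Hypothetical local catalog: public identifier -> path of a local copy.
LOCAL_CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        "/usr/local/share/dtd/xhtml1-strict.dtd",
}


def resolve_external(public_id, system_id, catalog=LOCAL_CATALOG):
    """Prefer a local copy named by the public identifier.

    Falls back to the system identifier, which usually means a network
    fetch, when the public identifier is missing or unknown.
    """
    if public_id is not None and public_id in catalog:
        return catalog[public_id]
    return system_id


location = resolve_external(
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
)
```

With a lookup like this wired into a parser's entity resolution, validating an XHTML document need not generate any traffic to the W3C's servers.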

Developers can open new vistas in application flexibility by expressing how data should be processed through referencing external resources--like style sheets, packaging documents, DTDs, and schemas--rather than relying on internal application logic. Unfortunately, these dependencies apply new stresses to the existing architecture. Caching can solve some problems, but it may not solve all of them.

HTML

With XHTML, the W3C is applying XML to its foundation standard, HTML. XHTML is HTML expressed using an XML syntax, starting with a clean-up of HTML 4.0 (in XHTML 1.0) and moving on to modularization in XHTML 1.1. In the HTML Working Group Roadmap, the W3C announces that "W3C has no intention to extend HTML 4 as such. Instead, further work is focusing on a reformulation of HTML in XML."

Effectively, XML has ended the development of HTML as its own domain, and reduced it to the status of a vocabulary, albeit a widely-used and very important one. With CSS2, Cascading Style Sheets were described for both XML and HTML, and a sample XML+CSS style sheet provided for HTML 4.0. While a substantial portion of HTML's behavior--from image inclusion to hyperlinking--can't yet be described as the application of style sheets to an XML vocabulary, this appears to be the general direction of future HTML development.

While I don't think any one of these issues by itself is a problem, I think it's reasonable to step back and assess the impact XML is having on these tools, which once seemed relatively easy to understand. If XML is indeed "SGML for the Web," it's changing that Web by making new demands on the infrastructure.

XML's Impact on Upcoming Internet Technologies

XML's disruptive impact may extend into Internet technologies that aren't even really here yet. Although it's much harder to talk about impacts on things that aren't widely implemented (or even completely designed), it seems clear that the rush to XML is causing complications for other plans. The planners' dreams may be complicated by the existence of an "XML community" outside of any standards organization that has publicized various techniques (regarding namespaces in particular) as something akin to "best practices." Those best practices don't always align with the planners' dreams, as the battles on xml-uri@w3.org have demonstrated.

Tim Berners-Lee's vision of the Semantic Web may be most at stake in those discussions. Descriptions of the Semantic Web--both in W3C NOTEs and in his book, Weaving the Web--have focused on RDF as the key "semantic layer" for the Web, treating XML as a low-level format rather than a standalone set of tools. From that perspective, XML's easy acceptance of any provided vocabulary is a problem to be constrained, rather than an opportunity to be enjoyed.

Tim Bray notes, in his Annotated XML 1.0 specification, the following:

"While [the W3C] authorized Jon Bosak to found and run the activity, providing that he made no call on W3C resources, the W3C staff did not perceive that XML had the potential for really high impact."

XML may have surprised the W3C, forcing them to alter their focus and allocation of resources, taking them perhaps further away from the project of building a Semantic Web. Certainly the tie-in between XML and a Semantic Web has not been easy so far--Namespaces in XML, a document intended to build links between XML and the URI foundations described by RDF, has proven intensely controversial, and the role of URIs in RDF is an especially difficult point.

While the W3C has been planning the Semantic Web, the IETF has been working on a number of projects that may also be affected or perhaps derailed by the sudden rise of XML. The work of the content-negotiation (conneg) working group, notably A Syntax for Describing Media Feature Sets (RFC 2533), neither references nor uses XML.

It isn't entirely clear that the content negotiation systems the IETF is developing are a good fit for needs commonly encountered in XML, like figuring out which vocabularies are supported by given systems. While a URI-based escape hatch is available, the current difficulties with URIs in XML and the existence of larger numbers of XML documents relying on DOCTYPE identification (or no identification) of the vocabulary may limit this usage.

The IETF's content negotiation may eventually fit well into the W3C's Composite Capability/Preference Profiles (CC/PP) work (CC/PP Requirements), which both uses and describes XML, and which emerges from wireless work using XML.

Is It Time to Rebuild?

It's not clear that it's time to throw away existing infrastructure and start over with an XML foundation, but it may be time to at least consider that prospect. The Internet currently runs--usually quite well--on application protocols that were designed when the prospect of using characters containing more than 7 bits in any reliable way was considered exotic. HTTP and SMTP both use a very limited number of characters in very limited ways to transmit information. These may have generated efficiencies at some point, but they're also becoming limiting factors as the Internet grows.

It's not hard to imagine a revised version of HTTP or SMTP that uses XML vocabularies--even extensible XML vocabularies--in place of the current set of text headers. Transmitting binary data may continue to be a problem, but protocols that transmit a series of "documents" can keep the binary information out of the way of XML processing, much as binary information isn't used in the headers of these protocols today.
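Purely as a thought experiment, the Python sketch below recasts a request line and its text headers as a small XML document. The element and attribute names (request, header, method, target, name) are invented for this illustration and do not come from any specification or proposal.

```python
import xml.etree.ElementTree as ET


def headers_to_xml(method, target, headers):
    """Express a request line and its headers as one XML document.

    The vocabulary used here is hypothetical.
    """
    request = ET.Element("request", {"method": method, "target": target})
    for name, value in headers:
        header = ET.SubElement(request, "header", {"name": name})
        header.text = value
    return ET.tostring(request, encoding="unicode")


xml_text = headers_to_xml(
    "GET",
    "/index.html",
    [("Host", "www.example.com"), ("Accept-Language", "fr, en")],
)
```

Because the result is ordinary XML, header values could carry any characters XML allows, and new header types would be a matter of extending a vocabulary rather than revising a parser for a custom text format.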

In some ways, it's deeply ironic that SOAP and XML-RPC are reusing the HTTP protocol rather than starting fresh and building a simpler transport for XML information. The HTTP infrastructure is well-understood, incredibly widely deployed, and offers the promise of crossing firewall boundaries. Yet HTTP is also rather inefficient, subject to caching along the way, and not always the best choice for transmitting short disconnected messages.

A generic XML transmission protocol, something like the Extensible Protocol, or perhaps something optimized for shorter messages, seems like it could be very useful for a wide range of XML-based communication. A generic foundation might open the way for large-scale exchange of XML messages of many different varieties, using a shared and easily internationalized infrastructure.

On the other hand, reinventing the wheel can be painful. While I'd really like to get my e-mail using an XML-based protocol, SMTP, POP, and IMAP are the accepted standards. Gateways impose their own inefficiencies, and trying to convince large organizations to change over the core of their infrastructure is difficult--as the developers of IPv6 have found. While XML may be creating new demands for infrastructure, the development process is likely to grind through slow consensus-building, upgrading protocols and other standards over time, and hammering out agreement over how to integrate XML with the wider world of Internet standards.