SML: Simplifying XML

November 24, 1999

Robert E. La Quey

"One should not increase, beyond what is necessary, the number of entities required to explain anything."

William of Ockham (1285-1349)

Much has been written about the "XML Revolution", and the advantages of XML's readily implementable nature. It is clear from the origins of XML and the avowed goals of the W3C that simplicity is a primary driver. The abstract of the W3C XML 1.0 Recommendation states:

"The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML."

Of late, some developers have been thinking beyond this simplification of SGML into XML, and are pushing for an even simpler standard. Simple Markup Language (SML) is the newly coined name for a de facto stripped-down variant of XML being used by two groups of developers.

The first of these groups of SML users comprises those who believe that the revolution stopped too soon. The second, a much larger group, started using the core of XML and has no need for additional complexity.

The first group knows its XML history and believes that SGML revisionists added document-centric features that complicate XML beyond what is really needed for data-centric applications, both on and off the Web. In practice, many developers are already using a Simple Markup Language (essentially XML without DTDs) to build useful systems.

Simplifying XML

In a recent message to the XML developers' mailing list XML-dev, Don Park explored a definition of SML. He characterized it as a subset of Canonical XML (a standard form of XML being specified by the W3C), but having:

  • No Attributes
  • No Processing Instructions (PI)
  • No Document Type Declaration (DTD)
  • No non-character entity-references
  • No CDATA marked sections
  • Support for only UTF-8 character encoding
  • No optional features
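As a rough illustration of how mechanical most of these restrictions are, the sketch below uses Python's standard expat binding to flag the constructs on Park's list that a parser can see directly. The function name `sml_violations` and its messages are invented for this example; it is not part of any SML proposal, and it does not attempt to check entity references or the "no optional features" rule.

```python
import xml.parsers.expat


def sml_violations(document: bytes) -> list[str]:
    """Return a description of each listed SML restriction that `document` breaks.

    Hypothetical helper: covers attributes, processing instructions,
    DOCTYPE declarations, CDATA sections and a declared non-UTF-8 encoding.
    """
    problems: list[str] = []

    def on_start(name, attrs):
        if attrs:
            problems.append(f"attributes on <{name}>")

    def on_pi(target, data):
        problems.append(f"processing instruction <?{target}?>")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        problems.append("document type declaration")

    def on_cdata_start():
        problems.append("CDATA marked section")

    def on_xml_decl(version, encoding, standalone):
        # Only an explicitly declared encoding is examined here.
        if encoding is not None and encoding.upper() != "UTF-8":
            problems.append(f"encoding {encoding}")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.ProcessingInstructionHandler = on_pi
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartCdataSectionHandler = on_cdata_start
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(document, True)  # raises ExpatError if not well-formed
    return problems


print(sml_violations(b"<Author><FirstName>joe</FirstName></Author>"))
print(sml_violations(b'<Author id="1"><![CDATA[joe]]></Author>'))
```

An empty list means none of the checked constructs appeared. A document that is not well-formed XML raises an exception instead, which fits the idea of SML as a strict subset of XML rather than a separate language.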

The technical case against attributes is very strong. Kent Sievers has pointed out that "Two ways of representing the simplest of data (a name/value pair) has caused a fracture that has propagated through the DOM, DTDs, namespaces, queries, schemas, etc. and the higher it goes the more problems it is causing." The only argument for keeping attributes seems to be legalistic precedent, that is, "we made that mistake so long ago that we cannot fix it now."

Kent goes on to note:

"As evidence of this I give:

1) almost every other object oriented language in existence. = 'joe' is easy to understand and, since even an INT is an object and even an "=" is a function, is done entirely in the spirit of "elements only" and

2) the obvious nature of everyone's first XML tutorial in which they are typically shown something like <Author><FirstName>joe<FirstName/><Author/> and understand it completely."
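To make the "elements only" idea concrete, here is a minimal sketch, using Python's standard ElementTree module, of rewriting a name/value pair held in an attribute as a child element. The helper name `attributes_to_elements` is made up for this illustration, and the sketch ignores complications such as namespaced attribute names or a clash between an attribute and an existing child of the same name.

```python
import xml.etree.ElementTree as ET


def attributes_to_elements(element: ET.Element) -> None:
    """Rewrite `element` in place so each attribute becomes a leading child element."""
    # Convert the existing children first; the children created below
    # carry no attributes and need no further processing.
    for child in list(element):
        attributes_to_elements(child)
    # Insert in reverse so the new children keep the original attribute order.
    for name, value in reversed(list(element.attrib.items())):
        child = ET.Element(name)
        child.text = value
        element.insert(0, child)
    element.attrib.clear()


author = ET.fromstring('<Author FirstName="joe"/>')
attributes_to_elements(author)
print(ET.tostring(author, encoding="unicode"))
# prints: <Author><FirstName>joe</FirstName></Author>
```

The two forms carry the same name/value pair, and the element form needs only one mechanism to express it.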

Many developers who use a subset of XML -- using as much as needed to meet their application needs, but leaving out attributes, DTDs, processing instructions, and non-character entity-references -- don't understand the fuss. These users have been nicknamed Simpletons, as opposed to the docucentric DocHeads. Despite the sanguine position of the Simpletons, Park's message kicked off a thread of debate unusual in its length and diversity, even for the frequently verbose people who hang out on XML-dev.

So what is the big deal?

Well, for starters, as Michael Champion wrote, "What Don Park is doing is trying to get us to figure out what in XML is of 'fundamental significance' and what is merely a residue of the SGML legacy and impedes understanding, implementing, processing and using XML in the real world."

One measure of the success of XML is to be found in the law of unexpected consequences. Developers are using XML for all sorts of purposes that its creators did not consider. Those complications remaining in XML from the document-focused days of its origin are unneeded and unwanted by this new class of XML developer. From the widespread de facto usage of a simple markup language emerges a call for standardization of that language as a well-defined subset of XML.

The rapid spread of XML into areas other than Web publishing technology is a testimony to the advantages available to the development community from open standards and open source software. Thanks to the widespread availability of excellent XML parsers, especially James Clark's expat, a large number of developers have had a chance to experiment with XML in many ways that do not involve conventional documents. To further engage even the Desperate Perl Hacker, expat has been made available as a Perl module, XML::Parser, by no less than Larry Wall and Clark Cooper.

A typical quote from the mailing list for Jabber (an open-source ICQ-like messaging system) demonstrates this point:

"Anyone should compile the sample/elements.c file from the expat package and pipe something like






into it. That was an 'ahaaa' experience for me."

Tue Wennerberg
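For readers without a C compiler to hand, a similar experiment can be run through Python's standard expat binding. The sketch below is written in the spirit of that sample rather than as a port of it: it records each start tag, indented by nesting depth. The function name `element_outline` and the input document are hypothetical, since the XML quoted in the original message is not preserved here.

```python
import xml.parsers.expat


def element_outline(document: bytes) -> list[str]:
    """Return one line per start tag, indented two spaces per nesting level."""
    lines: list[str] = []
    depth = 0

    def on_start(name, attrs):
        nonlocal depth
        lines.append("  " * depth + name)
        depth += 1

    def on_end(name):
        nonlocal depth
        depth -= 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(document, True)
    return lines


# Hypothetical input, invented for this illustration.
sample = b"<message><to>joe</to><body>hello</body></message>"
print("\n".join(element_outline(sample)))
```

Two callbacks and a depth counter are enough to recover the tree structure, which is much of what a data-centric application needs from a parser.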

All over the world, developers have had this 'ahaaa' experience and are applying what is in fact SML to the solution of real problems that often have little or nothing to do with documents.

Examples of this data-centric usage of XML include:

  • XUL, the user interface language in Mozilla
  • Configuration files: many programs are finding XML a convenient way to store configuration data in a flexible, future-proof manner
  • Messaging: the above-mentioned Jabber messaging system uses XML as a base for its protocols. The XML-RPC and SOAP efforts use simple XML to achieve their ends.

This is without mentioning the vast efforts being made in e-business protocols which use a reduced-feature subset of XML. All of these data-centric applications just don't need to use the more complex machinery available in the full XML 1.0 specification.

This ubiquity of XML is well-put by Mark Birbeck: "Now, am I using XML? Well of course I am, but I am also using binary, electricity, RDBMS, C++, COM, blah, blah, blah. Sure, XML is still new enough for it to have its own mailing list—just like electricity would have done if it could have ... Eventually XML will be 'seen' by the equivalent of 'particle physicists' today. The rest of us mere mortals will use their parsers, editors, class libraries, protocols, and so on".

The motivation of the SML camp is to tie down this ubiquitous XML subset and make it as well-defined as XML is itself.

Reaction from the XML establishment

A number of the leading figures in the XML community have not been amused by the SML discussions. This is understandable—any call for experts to re-examine their basic assumptions, to go back to ground zero, is guaranteed to evoke strong reactions. Especially when it comes at a time when the experts thought they had already done just that. XML is, after all, a vast simplification of SGML.

A challenge to authority coming from a groundswell of outsiders is also seldom welcome: William of Ockham was forced to flee the papal court at Avignon for Germany because of his calls to simplify the accepted philosophy of "scholasticism" and his opposition to the temporal authority of the Pope.

Mark Birbeck, quoted above for noting the ubiquity of XML, counts himself among those against SML. He argues that the time for talking about XML variants is past. It is now time to move up the chain of abstraction. The advocates of SML do not disagree about the need to look upward, but do ask "Up from what base?" and "With what intellectual baggage?"

David Megginson, parser developer and leader of the initiative that created SAX, the Simple API for XML, is critical of the SML push. His perspective is that of a victorious XML revolutionary. "If we could go back in time, I'd be happy to argue that notations, unparsed entities, and some other junk be removed from XML 1.0," he said, "but it's too late now, and we won anyway."

Megginson argues that people can simply leave things out when using the parser. The underlying problem of an "admittedly slightly-pudgy (though not bloated) XML grammar" is not worth fixing, simply because another dialect of XML will only split the community and lead to further confusion.

Unfortunately, there is already a lot of confusion. The aspects of XML that are optional—in the sense that they are there, but you don't need to use them—create havoc for the beginner. Simon St. Laurent, author of several XML books and defender of the XML newbie, sympathizes with the simplifiers. "I'd much rather see developers start from the simplest possible base and build out than require them to understand all of the parts involved in XML 1.0".

This is a point of general agreement among the supporters of SML, who go on to point out that SML is evolving into a very well-defined and quite specific subset of XML. They ask: why should this cause the existing XML community any problems?

The SML perspective is not that SML is a watered-down SGML, despite its origins, but that it is an enabling technology for a vast array of applications, many of which are not yet even on the horizon. Its supporters see SML as something that could become as fundamental as binary and electricity. They ask us to examine the fundamentals and listen closely to William of Ockham's sage advice from 700 years ago, even at risk of offending the establishment.

Where does the SML movement go from here?

It seems likely that an SML specification will be developed by paring down the existing W3C XML 1.0 Recommendation. Perhaps this will be submitted to the W3C for discussion, but it is more likely to be short and simple enough to send directly to the XML-dev mailing list for a larger and more open discussion. It could be great fun and attract a new crowd to the debate. The SML specification needs to be widely promulgated to other venues for discussion as well.

Should a viable specification for SML emerge from such a discussion, then the creators of the specification will decide whether to take it to the W3C for review, or to the IETF as an Internet Draft.

Since a Simple Markup Language is already a reality in many applications, there is some urgency to setting the standard in place. Such a standard is not likely to be difficult to define, nor is there good reason for it to be particularly controversial, so it should not take long for it to appear. Expect to see the draft no later than mid-December and a standard by January. A nice way to start the new millennium (or end the old one, depending on your choice of standards).

The SML enthusiast can then start the new year with a well-defined standard that makes good use of existing XML tools. A clean slate is available to explore a wide range of issues and to build applications using the fundamental, almost molecular, structure inherent in SML.

Several paths can then be explored as we develop understanding and experience with XML fragments, schemas, XPath, XSLT and the like. The opportunity to leverage other powerful technologies, for example MIME in XML, will also be easy to explore with SML. Look for a period of serious experimentation and innovation as developers all over the world realize the power of SML.

Look, too, for continued scrutiny of each new XML-based technology. The Simpletons will continue to defend William of Ockham's razor -- even if a few cardinals and popes are offended.