Simplifying XML: MicroXML

June 3, 2017

Uche Ogbuji

The XML specification has many complex areas, largely for historical reasons, and the widely used XML Namespaces specification compounds the complexity. There has always been interest in simplifying XML at its bedrock layer, and a community group created MicroXML, a complete specification that reduces XML to around eight pages even while adding a data model, which is not part of XML 1.0. This second article of the Simplifying XML series introduces MicroXML, which is backwards compatible with XML 1.0 yet far simpler and more secure.

How well do you know the XML declaration, and how it differs from a processing instruction? What it means for an attribute to have no namespace prefix? What of the perils of cutting and pasting from documents that use namespaces? Do you remember the one case in which you are required to escape the “>” character in content? The different sorts of entities? How such entities can be used to construct very simple security attacks against most XML tools? If you are an XML expert, chances are you’ve learned these things several times over, and yet you still sometimes get tripped up by them.

Some of these and other pitfalls of XML came about for historical reasons, and others because solutions happened to be wrong-sized for the problems they targeted. What might XML look like if it were redesigned from scratch, losing as much historical baggage as practical, and meant to be understood in all its nuances by a complete newbie within hours? A group, including the author, set out to answer this question in 2012, aiming not only to produce a simplified syntax and data model for XML, but also to preserve backwards compatibility with full XML. The result was a compact specification for what was called MicroXML.

The MicroXML specification as it emerged from the community process is fewer than eight pages long, with around half of one page devoted to a built-in data model. We decided that including a data model was important. XML 1.0 did not include one, which encouraged the many and increasingly byzantine data models (Infoset, SOAP, XPath, XDM, etc.) that were created in separate specifications, often with quite different interpretations of core XML constructs. MicroXML’s core data model helps enforce simplicity and improves the likelihood of interoperability. This is not to say that data models could not or should not be developed as extensions to MicroXML, but at least the fundamentals would be consistent. Those fundamentals are elements, attributes, and character data, and MicroXML’s data model is stripped down to just enough data structures to cover them, built on three primitive types: character, list, and map. Comments are ignored, which makes sense since, properly used, they’re for human eyes rather than processing logic. The MicroXML data model does not make the mistake of wandering into any discussion of behavior or “API.” XML should always be about text data and minimal supporting metadata, and MicroXML brings it back to this ideal. As such, the specification even contains a completely specified JSON syntax for the data model.

We attained our goal of backwards compatibility: all MicroXML documents are also well-formed XML 1.0 documents. Many syntactic XML features, however, are excluded from MicroXML. It supports only the UTF-8 character encoding, as does I-JSON, the standard subset of JSON. It prohibits the XML declaration and the document-type declaration. No DTD features survive at all; the only references supported are the five predefined named character references (&lt;, &gt;, &amp;, &quot;, and &apos;) and numeric character references, and the latter must be in hexadecimal. There are no processing instructions or CDATA sections. Almost all other XML 1.0 constructs are permitted, but just by excluding these few features the format is radically simplified. Another result is the reduction of the security vulnerabilities implicit in the core format to near zero. This factor alone means you should consider MicroXML for applications taking input from network sources.

MicroXML also prohibits XML namespaces, by disallowing the colon in element and attribute names. As I mentioned in the previous article, namespaces are notoriously tricky and almost always cause more problems than they solve. The group decided to take a hard line against them in MicroXML.
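The exclusions described above can be approximated with a quick lint over raw document text. The following Python sketch is only a heuristic built on regular expressions, not a MicroXML parser and not anything defined by the specification: the names EXCLUDED and excluded_features are hypothetical, and the patterns can misfire, for instance on these character sequences appearing inside comments.

```python
import re

# Each entry pairs a description with a pattern for a construct that the
# article lists as excluded from MicroXML. Heuristic only: this scans raw
# text and does not parse it, so matches inside comments are false positives.
EXCLUDED = [
    ("XML declaration or processing instruction", re.compile(r"<\?")),
    ("document-type declaration", re.compile(r"<!DOCTYPE")),
    ("CDATA section", re.compile(r"<!\[CDATA\[")),
    ("decimal character reference", re.compile(r"&#[0-9]+;")),
    ("entity reference other than the five predefined ones",
     re.compile(r"&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z_][\w.-]*;")),
    ("colon in an element name", re.compile(r"</?[^\s<>/!?=]*:")),
    ("colon in an attribute name", re.compile(r"\s[^\s<>/=&;]+:[^\s<>/=&;]*\s*=")),
]


def excluded_features(text):
    """Return descriptions of excluded constructs that appear in text."""
    return [label for label, pattern in EXCLUDED if pattern.search(text)]


memo = (
    '<memo lang="en" date="2017-05-01">\n'
    'I <em>love</em> &#xB5;<!-- MICRO SIGN -->XML!<br/>\n'
    "It's so clean &amp; simple.</memo>"
)
legacy = (
    '<?xml version="1.0"?><!DOCTYPE a [<!ENTITY e "x">]>'
    '<x:a xmlns:x="urn:x"><![CDATA[hi]]>&#181;&e;</x:a>'
)

print(excluded_features(memo))    # the memo document triggers none of the checks
for label in excluded_features(legacy):
    print("found:", label)
```

A check like this might serve as a cheap first filter on input from network sources, with a real parser still doing the authoritative work.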

One area where MicroXML has taken a more lenient line than XML 1.0 is error handling. XML 1.0 requires the parser to fail immediately in the case of a fatal error. MicroXML, on the other hand, does not dictate how parsers handle errors at all. They can fail, they can attempt recovery, or they can respond in any other manner. This takes MicroXML off XML 1.0’s collision course with Postel’s Law (be conservative in what you send and liberal in what you accept).

The requirement for backward compatibility means there are still one or two dusty corners in MicroXML. For example, the “>” character must still be escaped if it follows “]]”. But with far fewer such idiosyncrasies, MicroXML gives users less to trip over.

The following MicroXML document (derived from one in the spec) includes examples of all syntactic features.

<memo lang="en" date="2017-05-01">
I <em>love</em> &#xB5;<!-- MICRO SIGN -->XML!<br/>
It's so clean &amp; simple.</memo>

The following is a representation of the document in the JSON syntax of the MicroXML data model.

[ "memo",
  { "date": "2017-05-01", "lang": "en" },
  [ "\nI ",
    ["em", {}, ["love"]],
    " \u00B5XML!",
    ["br", {}, []],
    "\nIt's so clean & simple."
  ]
]
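
The mapping from a parsed document to this JSON form is mechanical: each element becomes a three-item list of name, attribute map, and list of children, where a child is either a string or another element. Here is a minimal Python sketch of that mapping. It assumes the standard library's xml.etree.ElementTree as a stand-in parser, which is a full XML parser rather than a MicroXML one, and the function name to_data_model is hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def to_data_model(elem):
    """Return [name, attributes, children] for an ElementTree element."""
    children = []
    if elem.text:                      # character data before the first child
        children.append(elem.text)
    for child in elem:
        children.append(to_data_model(child))
        if child.tail:                 # character data following this child
            children.append(child.tail)
    return [elem.tag, dict(elem.attrib), children]


doc = (
    '<memo lang="en" date="2017-05-01">\n'
    'I <em>love</em> &#xB5;<!-- MICRO SIGN -->XML!<br/>\n'
    "It's so clean &amp; simple.</memo>"
)

# ElementTree's default parser discards comments, which matches the data
# model's rule that comments are ignored.
model = to_data_model(ET.fromstring(doc))
print(json.dumps(model, sort_keys=True, indent=2))
```

Because the comment is discarded, the text on either side of it is joined into one string, and the hexadecimal character reference arrives as the single character U+00B5.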

Regardless of the problems around XML’s complexity, which I discussed broadly in the previous article, it is well entrenched as it is, and much of the initial energy around innovation in XML has dissipated. This is not necessarily a bad thing, because a period of stability is a key stage in any technology’s long-term success, but it does mean that it is hard to rally the engines of standardization and widespread community adoption behind a reboot as significant as MicroXML. I recommend people look at MicroXML primarily as a well-considered subset of XML which can be used to ease development and improve processing performance. If you convert to MicroXML in the first stage of your processing pipeline, which can largely be accomplished with a variation on the XSLT identity transform, you gain the benefit of its refreshing simplicity.
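
As a rough illustration of such a first-stage conversion, here is a Python sketch that uses the standard library's ElementTree in place of XSLT. It is a swapped-in technique under stated assumptions, not the identity-transform variation itself: the names local_name and to_microxml are hypothetical, and the approach is lossy, since it discards namespace information and two attributes that differ only by namespace would collide.

```python
import xml.etree.ElementTree as ET


def local_name(name):
    """Drop ElementTree's '{namespace-uri}' prefix from a name, if present."""
    return name.rsplit("}", 1)[-1]


def to_microxml(xml_text):
    """Parse full XML and re-serialize it without the excluded constructs.

    Assumption: losing namespaces is acceptable for the pipeline at hand.
    """
    # The default parser drops comments and processing instructions, turns
    # CDATA sections into plain text, and resolves character references.
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = local_name(elem.tag)
        attrs = {local_name(key): value for key, value in elem.attrib.items()}
        elem.attrib.clear()
        elem.attrib.update(attrs)
    # encoding="unicode" returns a str with no XML declaration.
    return ET.tostring(root, encoding="unicode")


source = (
    '<?xml version="1.0"?>'
    '<x:a xmlns:x="urn:x" x:k="1"><?pi data?><![CDATA[1 < 2]]>&#181;</x:a>'
)
print(to_microxml(source))
```

The caller would still need to encode the result as UTF-8 when writing it out, and a stricter pipeline might reject namespaced input instead of flattening it.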

Related links