Simplifying XML

March 8, 2017

Uche Ogbuji

The XML technology stack carries a great deal of complexity that has not proven necessary, considering the profile of XML use in practice. XML users would benefit greatly from a round of simplification to improve the efficiency of processors and reduce hostility among mainstream developers. This is also key to preserving and perhaps boosting XML's relevance. This article is the first in a series advocating particular steps toward such simplification.

The shape of a technology’s success is often not what was imagined for it in the beginning. XML was envisioned as SGML for the Web, a way to present and process documents more richly than possible with HTML’s far lower coherence. The Web took the long-standing idea of multi-component documents and really brought it to life with embedded images, audio, video and entire apps. Early work on XML was geared to seize this opportunity, with namespaces, XLink, XML/DOM, XML Schema, etc.

This process of specification proliferation continued unabated for at least a decade, and has resulted in a sprawling XML technology stack. If we set aside specialized applications, even the most fundamental and universal aspects of XML have taken on a great deal of complexity in alternative and overlapping standards.

Rather than the anticipated wide-ranging usage, XML has instead found its relevance in a myriad of very specialized data applications. The ultimate result is that you can find a bit of XML almost everywhere, but it has truly dominated few areas of interest because of competition from lower-overhead binary formats and programmer-friendly serializations. To take an example of the former, there was a time when XML seemed set to become the backbone of RFID communications, but it ultimately lost out. JSON for Web application data exchange is the obvious example of the latter.

Given the narrow focus of most actual use scenarios for XML, much of the intricacy that’s crept into the XML stack does little to provide practical value. Take the example of XML namespaces. The original idea was that there would be a proliferation of multi-purpose, polyglot XML vocabularies being used in combination, i.e. multi-component documents. There is a bit of that out there, for example SVG and MathML, but the reality is that most XML documents use a single namespace. Nevertheless many documents are saddled with meaningless boilerplate such as the document’s main namespace declaration as well as one for rarely used XML Schema instance attributes, and other declarations which creep in through arbitrary copy/paste. XML namespaces are governed by surprisingly complicated specifications, with many nuances that trip up even experienced developers, yet they offer very little benefit for the vast majority of use cases.
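To make the namespace pitfall concrete, here is a minimal sketch using Python's standard library. The document, its element names and the example.com namespace URI are all invented for illustration. The document uses a single vocabulary yet carries the usual boilerplate declarations, and a perfectly natural lookup by element name silently finds nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-vocabulary document carrying typical boilerplate:
# a default namespace plus an XML Schema instance declaration it never uses.
CATALOG_NS = "http://example.com/ns/catalog"  # made-up URI for illustration
DOC = """<catalog xmlns="http://example.com/ns/catalog"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <item sku="a1">Widget</item>
  <item sku="b2">Gadget</item>
</catalog>"""

root = ET.fromstring(DOC)

# The default namespace applies to every unprefixed element,
# so the plain name a developer would expect matches nothing.
naive = root.findall("item")

# The lookup only works once the name is qualified with the namespace URI.
qualified = root.findall("{%s}item" % CATALOG_NS)

# Unprefixed attributes, by contrast, are not in the default namespace.
skus = [item.get("sku") for item in qualified]

print(len(naive), len(qualified), skus)
```

The asymmetry between elements, which inherit the default namespace, and unprefixed attributes, which do not, is one of the nuances that tends to trip up even experienced developers.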

There are certainly a few applications for which XML namespaces make sense, generally XML processing systems expressed in XML, of which XSLT is the foremost example. This represents a very small proportion of documents actually in use, and it would make sense for something like namespaces to be understood as a special case, rather than as a core concept to XML.

It's also worth noting there are well-served sub-ecosystems in the XML world. For example, MarkLogic has grown prodigiously as a company providing XML tools, though it does the same for JSON, etc. XML users who are committed to MarkLogic would find all the complexity less of a problem, given the handy UIs and support tools. In this case the use of XML in its most elaborate form becomes similar to the use of any other such technology within the support system of a large software vendor. Nevertheless, even a company as successful as MarkLogic only covers a small fraction of the XML user base, and any technology remains healthiest when served by a broad, diverse ecosystem of developers, from large companies to lone open source hackers.

In a perfect world the solution to all this would be for XML standards organizations to regroup, reflecting on the past couple of decades in practice, and come up with a rightsized XML stack that makes sense going forward. In reality, these standards organizations have also lost a lot of the energy that fueled the original proliferation of XML specifications. Of the many brilliant individual XML leaders from the early days, almost all have moved focus to entirely different technologies. Almost all the companies who sponsored the efforts of these leaders have moved on to different strategic initiatives, seeking competitive advantage elsewhere now that XML has lost its fairy sheen. There is also the question of legacy and technical inertia which makes it hard to find traction for such community-wide efforts at evolving XML technology.

For most XML users it makes sense to simplify matters. By starting with the basic concepts of elements, attributes and text, and standardizing around as little overhead as reasonable, we can significantly cut the costs of XML adoption and keep it relevant in areas where it is truly the best fit. XML would benefit from a reduction in costs around developer training and quality assurance. It would also benefit from avoiding the computing platform costs related to numerous code paths and memory-intensive data structures to address the spaghetti of potential idioms.
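As a rough illustration of how little machinery the basic concepts need, the following sketch processes a document that uses only elements, attributes and text. It is not the specific set of simplifications this series goes on to recommend; the document and its names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document limited to elements, attributes and text:
# no namespace declarations, no schema instance attributes.
DOC = """<catalog>
  <item sku="a1">Widget</item>
  <item sku="b2">Gadget</item>
</catalog>"""

root = ET.fromstring(DOC)

# Element names can be used exactly as written in the document.
items = {item.get("sku"): item.text for item in root.iter("item")}

print(items)
```

With nothing to qualify, the names in the code are the names in the document, which leaves less for a new developer to learn and fewer code paths to test.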

Having everyone select a different simplification of XML and XML technologies would not work, because the many different permutations would amount to the same overall complexity. In this series of articles I recommend a particular set of simplifications across the XML stack, with the goal of motivating a shared set of conventions to support those who are interested in reducing the costs of XML adoption.

Even though XML has not succeeded in exactly the way it was envisioned, it has injected into mainstream computer technology much-needed techniques from the document processing world. This is a very good thing, and one of the main reasons XML remains relevant, despite the many purely mechanistic arguments to eliminate it for the sake of efficiency. XML is deeply unpopular among mainstream developers, mostly because they’ve skinned their knees on its many idiosyncrasies. The best chance at maintaining XML’s relevance is to simplify everything about it. Read this series to learn just how.