Look Ma, No Tags
July 24, 2002
XML combines all the inefficiency of text-based formats with most of the unreadability of binary formats. -- Oren Tirosh, comp.lang.python
Silence is the mark of XML's ultimate success. The less people talk about XML, the more, and the more easily, they use it; and the more unremarkable using it becomes, the more it can be said to have won in the marketplace of ideas. Think of XML as one of the basic utilities of the Web, like electricity. In the most industrialized nations, unless electrical service is interrupted, people don't talk about it much. It's used just about everywhere, by just about everyone, with the literal, end-user ease of flipping a switch. And people rarely point to it as a distinguishing factor: it's not the deal-clincher in very many, if any, new home sales, for example.
Ordinariness is just one way of judging XML's success; other ways include the number of novel applications it provokes or abets, the horizontal ubiquity of its adoption, the number of competitors it makes obsolete. By each of these measures, XML is very successful.
Yet XML has inspired as many competitors as it has made obsolete, maybe more. There have been simplified or "core" XML alternatives (SML, discussed in XML.com's 1999 article "SML: Simplifying XML"), alternative syntaxes for XML (like SOX and SLiP), and even new competitors, including a very interesting one, YAML, which appears to be growing rather than fading away.
YAML -- short for "YAML Ain't Markup Language"; rhymes with "camel" -- popped up on my radar screen last week as a result of an interesting thread on comp.lang.python about the uses and abuses of XML, a conversation which I commend to your attention on its own merits.
In rummaging around for a plain, concise description of YAML, I kept stubbing my toe on a felt need to define it by referring to XML in some way. That was a mistake. YAML stands on its own very nicely, even if its most immediate point of contrast is XML. In other words, if there were no XML, there could still be a YAML, but it would have a different public face. If the XML world tends to get divided into data and documents, a distinction which is probably more pedagogically useful than it is necessarily true, YAML corresponds more to the data part of XML than to the document part. As the YAML specification puts it, "YAML is more closely targeted at messaging and native data structures" than at structured documents.
Accordingly, my plain, concise description of YAML is that it's a processing model and a wiki-markup-esque way to represent relatively arbitrary, high-level language data structures.
Easy to Read, Easy to Write
First, a word about "wiki-markup-esque": YAML's syntax is very lightweight, especially compared to XML's, and though the specification doesn't list wiki text as one of its influences, YAML visually evokes wiki markup. The specification explicitly credits RFC 822 for syntactical influence. One of the YAML designers, Clark Evans, said recently that "Ward Cunningham's WikiWiki is a very cool concept and I'm sure we borrowed from it sub-consciously". The important point is that YAML is lightweight. But what do I mean by lightweight?
Consider, for example, a data structure I've been using in a Python project -- an IRC bot which needs a configuration file full of bindings: that is, a collection which relates IRC events, methods or functions, and regular expressions. After being parsed, the configuration file is represented as a Python list of tuples. I can write this structure literally in Python as follows:
[ ("PRIVMSG", "newUri", "^http://.*"),
  ("PRIVMSG", "deleteUri", "^delete.*"),
  ("PRIVMSG", "randomUri", "^random.*") ]
Representing a list of tuples in this way in Python imposes a space overhead of about 34 characters: that is, I have to type 34 characters, to represent the structure, beyond the data itself. That's not too bad.
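The figure of 34 can be checked mechanically. The sketch below is only an illustration: it keeps the literal as a string, drops whitespace, and subtracts the characters that belong to the data itself. The names `literal` and `overhead` are mine, not part of the bot.

```python
import ast

# The literal from the text, kept as a string so its punctuation can be counted.
literal = '''[ ("PRIVMSG", "newUri", "^http://.*"),
  ("PRIVMSG", "deleteUri", "^delete.*"),
  ("PRIVMSG", "randomUri", "^random.*") ]'''

def overhead(source):
    """Count the non-whitespace characters in source that are not data."""
    bindings = ast.literal_eval(source)
    data_chars = sum(len(field) for binding in bindings for field in binding)
    visible_chars = sum(1 for ch in source if not ch.isspace())
    return visible_chars - data_chars

print(overhead(literal))  # brackets, parentheses, quotes, and commas
```

What is left over is 2 brackets, 6 parentheses, 18 quotation marks, and 8 commas.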
I can also represent this structure in plain text according to some informal Unix conventions about configuration files:
PRIVMSG newUri ^http://.*
PRIVMSG deleteUri ^delete.*
PRIVMSG randomUri ^random.*
For plain text, that's about as little space overhead as one could want, but it's also inflexible: data elements cannot have spaces, cannot span more than one line, and so on. But in some applications, it's all you really need, and it's trivial to serialize to or deserialize from disk, which is nice.
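To show how little code "trivial" means here, this is a minimal sketch of reading and writing that format in Python. The function names are hypothetical, and skipping blank lines and `#` comments is my assumption about the convention, not something the bot requires.

```python
def load_plain(text):
    """Parse one binding per line: event, method, and regex split on whitespace."""
    bindings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed convention: ignore blank lines and comments
        # Exactly three fields are expected; anything else raises ValueError.
        event, method, regex = line.split()
        bindings.append((event, method, regex))
    return bindings

def dump_plain(bindings):
    """Write the bindings back out, one per line."""
    return "".join(" ".join(binding) + "\n" for binding in bindings)

config = """\
PRIVMSG newUri ^http://.*
PRIVMSG deleteUri ^delete.*
PRIVMSG randomUri ^random.*
"""
bindings = load_plain(config)
```

The inflexibility shows up immediately: a regular expression containing a space splits into too many fields, and the unpacking fails with a `ValueError`.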
The XML space overhead is of course considerably higher (we'll conveniently ignore the XML setup boilerplate), and one rendering might look something like
<bindings>
  <binding>
    <ircEvent>PRIVMSG</ircEvent>
    <method>newUri</method>
    <regex>^http://.*</regex>
  </binding>
  <binding>
    <ircEvent>PRIVMSG</ircEvent>
    <method>deleteUri</method>
    <regex>^delete.*</regex>
  </binding>
  <binding>
    <ircEvent>PRIVMSG</ircEvent>
    <method>randomUri</method>
    <regex>^random.*</regex>
  </binding>
</bindings>
The other obvious rendering -- which might not work at all, given the restrictions on XML attribute values -- would look something like
<bindings>
  <binding ircEvent="PRIVMSG" method="newUri" regex="^http://.*" />
  ...
</bindings>
I'm too lazy to count the space overhead for either XML rendering, but it's definitely more than the literal Python structure or the Unixy config file. Of course, you get more from XML, since now the data can be called "self-describing", though in simple cases that tends to be of limited utility. With judicious use of comments, the XML and plain text versions can be equally self-describing.
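The overhead isn't only in the typing. Here is a minimal sketch of getting both XML renderings back into the list of tuples, using Python's standard `xml.etree.ElementTree` module. The variable and function names are mine, and I've trimmed the sample to two bindings to keep it short.

```python
import xml.etree.ElementTree as ET

element_form = """
<bindings>
  <binding>
    <ircEvent>PRIVMSG</ircEvent>
    <method>newUri</method>
    <regex>^http://.*</regex>
  </binding>
  <binding>
    <ircEvent>PRIVMSG</ircEvent>
    <method>deleteUri</method>
    <regex>^delete.*</regex>
  </binding>
</bindings>
"""

attribute_form = """
<bindings>
  <binding ircEvent="PRIVMSG" method="newUri" regex="^http://.*" />
  <binding ircEvent="PRIVMSG" method="deleteUri" regex="^delete.*" />
</bindings>
"""

FIELDS = ("ircEvent", "method", "regex")

def from_elements(xml_text):
    """Read each field from a child element of <binding>."""
    root = ET.fromstring(xml_text)
    return [tuple(b.findtext(name) for name in FIELDS)
            for b in root.findall("binding")]

def from_attributes(xml_text):
    """Read each field from an attribute of <binding>."""
    root = ET.fromstring(xml_text)
    return [tuple(b.get(name) for name in FIELDS)
            for b in root.findall("binding")]

print(from_elements(element_form) == from_attributes(attribute_form))
```

Either way, the program has to know the element or attribute names in advance, which is part of why "self-describing" buys so little in a case this simple.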
Finally, let's look at the same structure, a list of tuples, represented by YAML:
---
-
  - PRIVMSG
  - newUri
  - '^http://.*'
-
  - PRIVMSG
  - deleteUri
  - ^delete.*
-
  - PRIVMSG
  - randomUri
  - ^random.*
That's 12 visible characters of overhead (the dashes that mark sequence entries, setting aside the document separator and the one pair of quotes), plus a good number of newlines and spaces. (Note that YAML uses three dashes, ---, to separate documents within a file or stream, which the YAML spec considers a "series of disjoint directed graphs, each having a single root".) From the point of view of absolute space efficiency, YAML is not a radical improvement over XML or literal Python. But if you're really interested in absolute space savings, you're probably ready to sacrifice human readability anyway. What I mean by calling YAML "lightweight" is that from the standpoint of visual perspicuity and input concision, the YAML version is almost as lightweight as the plain text, and more lightweight than either XML or literal Python. In short, YAML is at least as easy to read as XML, and considerably easier to type by hand. And that counts for a great deal in many kinds of application, especially configuration files and the like.
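To make the shape of that document concrete, here is a hand-rolled sketch that writes a list of tuples as a YAML sequence of sequences. It is not a YAML library and handles nothing beyond this one shape; the names `quote` and `dump_yaml` are hypothetical. I've assumed every field is a single-line string and quoted all of them, which is more conservative than the example above.

```python
def quote(scalar):
    """Single-quote a string, doubling any embedded single quotes."""
    return "'" + scalar.replace("'", "''") + "'"

def dump_yaml(bindings):
    """Emit a block sequence whose entries are themselves block sequences."""
    lines = ["---"]  # document separator
    for binding in bindings:
        lines.append("-")  # one outer entry per tuple
        for field in binding:
            lines.append("  - " + quote(field))  # nested entry per field
    return "\n".join(lines) + "\n"

bindings = [
    ("PRIVMSG", "newUri", "^http://.*"),
    ("PRIVMSG", "deleteUri", "^delete.*"),
    ("PRIVMSG", "randomUri", "^random.*"),
]
print(dump_yaml(bindings), end="")
```

In practice you would reach for a real YAML implementation rather than this; the point is only that the nesting of the data and the indentation of the text line up one for one.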
Native Processing Model
As for YAML's other influences, section 1.2 of the specification generously exhibits them. Python fans will be happy to note that it uses whitespace as a block delimiter. It also steals ideas from MIME, HTML, XML and SOAP, including aliasing, application-specific types, and a namespace mechanism which is part Java package naming and part XML URI-based namespace naming. But perhaps the biggest influence on YAML is Perl -- which I, as a Python devotee, had to learn not to hold against it! -- especially in the way YAML conceptualizes data structures and types, which it distinguishes into scalars, like integers and strings, and collections, like hashes and arrays.
In addition to being more lightweight for reading and writing than XML, YAML has a different processing model, too. As is well known, XML's nested elements and attributes most fluently describe tree-shaped structures. YAML, by contrast, hews very closely to the data processing models of programming languages like Perl, Python, and Java, freely mixing sequence, mapping, and scalar types. As a result, YAML serialization fits typical programming language constructs more closely than XML, requiring neither mapping conventions nor DOM or DOM-like adaptations.
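What "mapping conventions" means is easiest to see in code. The sketch below turns an XML tree into native Python values using one possible convention, which I made up for illustration: leaf elements become strings, elements with children become dicts, and a repeated child tag becomes a list. The function name `to_native` is hypothetical, and attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET

def to_native(element):
    """Convert an element to str, dict, or list under one ad hoc convention."""
    children = list(element)
    if not children:
        return (element.text or "").strip()  # leaf: just its text
    result = {}
    for child in children:
        value = to_native(child)
        if child.tag not in result:
            result[child.tag] = value  # first occurrence: plain value
        elif isinstance(result[child.tag], list):
            result[child.tag].append(value)  # third and later occurrences
        else:
            result[child.tag] = [result[child.tag], value]  # second occurrence
    return result

two = ET.fromstring(
    "<bindings>"
    "<binding><method>newUri</method></binding>"
    "<binding><method>deleteUri</method></binding>"
    "</bindings>"
)
one = ET.fromstring(
    "<bindings><binding><method>newUri</method></binding></bindings>"
)
print(to_native(two))
print(to_native(one))
```

Under this convention two bindings come back as a list, but a single binding comes back as a bare dict. A different convention could settle that the other way, and the XML itself doesn't say which is right. A format whose own model already has sequences and mappings doesn't leave the question open.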
In short, the two leading selling points of YAML over XML are that it's more lightweight, and that it uses native processing models and data structures. The most serious drawbacks of YAML are that it isn't XML and that it isn't nearly as ubiquitous as XML: YAML is very well supported in Perl, but the support in Python, Java, and Ruby is still maturing, and there are rumors of a forthcoming libyaml in C, too. It bears repeating that ubiquity of tool support is not an absolute value; it is context-dependent and goal-specific. You may be able to sacrifice it for the sake of using YAML and securing its virtues, depending on what you need to do and where you need to do it.
A Profitable Coexistence?
There is a lot to YAML. The specification fits in one HTML document, but it is neither short nor simplistic. For example, if you're interested in YAML but circumstances prevent you from moving away from XML all at once or altogether, you might want to look at YAXML, which is the YAML conceptual model with XML's familiar syntax bolted on.
If you have been late to adopt XML as a matter of policy, or if you have been having second thoughts about its costs for your projects, YAML is definitely worth a long, serious look. You may well find that it imposes less overhead, both on the people who produce it by hand, and the people who program computers to produce and consume it. And even if YAML never becomes more than a niche tool, if you happen to occupy that niche you'll be happy to have it around.