XML.com: XML From the Inside Out

Goldilocks and SML

December 15, 1999

Goldilocks saw three bowls of porridge on the table. The first was too big and could only be eaten with a complicated tool: the recipe was there, but she couldn't quite understand it. "This SGML is too big!" she cried. The second was sweet but insubstantial: "This HTML is too simple," she lamented. Then she tried the third bowl: "Mmmm, this XML is just right...and the international flavor is sure to be a hit at my pajama party."

Not everyone is Goldilocks. Here is a prediction from XML-DEV in November 1997 (Simon uses "bozos" here ironically):

> From: Simon St.Laurent <SimonStL@classic.msn.com>
> ... XML is going to bring a lot of 'bozos' into the
> field of markup, people who care neither about the history nor the theory and
> just want to get things done. A different attitude and different needs will
> very likely increase the demands for XML to find its own voice.

Yes. And all the questions: "Are declarations good?" "Should we remove constants to headers, or allow inline declarations?" "Why isn't everything an element? Wouldn't that be simpler?" "Why can't we leave out these strings since they are not needed for parsing?" and so on.

Just because questions about simplifying XML are predictable and repetitive does not mean they are trivial, or that they are asked idly. In a fit of eye-rolling about the same issues coming up again and again, in November 1999 I posted a little hoax on XML-DEV announcing XML 2.0alpha:

I am happy to announce the release of the XML 2.0alpha specification. It has been created by using all the criticisms of XML over the last year.
A non-exhaustive list of its main features compared to XML 1.0 would include the following:

    * It only uses UTF-8 (e.g., some W3C people) and so gets rid of the
encoding header and numeric character references.
    * It gets rid of PIs (e.g., TimB-L).
    * It gets rid of parameter entities (e.g., some schema people).
    * It gets rid of DTDs (e.g., many of the same people).
    * It gets rid of notations (since no one knows what they are for still).
    * It gets rid of entities (since XLink will replace them).
    * It gets rid of the ' as a literal delimiter.
    * It gets rid of attributes (since all we need are elements).
    * It gets rid of IDs (since local names are old hat).
    * It gets rid of PUBLIC identifiers (since everything should be a URI).
    * It gets rid of CDATA sections (since they make grepping unsafe).
    * It gets rid of using the name of an element to key its type (i.e., you
have to use a namespace URI and the munged name).
    * It gets rid of elements and embedded markup (e.g., Ted Nelson).
    * It gets rid of chunks of Unicode through "early normalization" (e.g., the W3C I18n WG).
    * Because the only delimiter is the comment delimiter, the need for
&amp; and &gt; is removed; because the string <!-- can be represented as
<!<!-- -->-- there is no need for &lt; either.

This gives us quite a nice markup language: XML 2.0alpha, which consists of only
    * data
    * comments, at user option

I propose that we should all spend the next 100 years discussing this,
and that every W3C specification in the meantime should try to
influence the outcome by supporting only the subset of XML 1.0 they
like, until consensus is reached by people outside the original
developers of XML.

Indeed, wise implementors should delay until XML 3.0.  There is talk
that allowing all these characters poses internationalization problems,
so it is possible that only ASCII characters may be used in the future.
For WAI reasons, it may be that only visually distinct characters will be allowed: so XML 3.0 will consist only of one or more occurrences of the letters O and X. This will provide substantial benefits for compression and binary representation, as well as direct representation of certain games.

A few hours later, across the international date line, Don Park hit back—floating a proposal for just what I was knocking, calling it SML. In the future, I should just shut up! But no: the questions that the SMLers are asking are good. XML has been made with a defensible set of choices of language features (which is not to say that they are all the choices that I would have made); debate can clarify why the choices were made and what the alternatives are. A technology is engineered to the perceived tradeoffs of its time, and times change.

My impression of SML is not that it represents some conspiracy against the one true path of XML, but rather that it shows that some people's technical or aesthetic needs are not being met by XML. The rallying cry is simplicity, which is as excellent as motherhood, but the rationales seem wildly divergent or vague. It will be interesting to see what is ultimately produced from the effort: a syntax, a methodology, an API, an entity manager, some implementation techniques. There is lots of room in the world for innovative ideas.

Let's look at some specific areas for which XML is criticized.

Homogeneous or Curdled?

Some people say that we can get simplification by layering. Let us put aside the obvious answer that cutting the cake in a different way does not reduce its size. Instead, is it possible to devise a highly layered XML?

Here is a model of XML in 16 skinny layers. The XML 1.0 specification does not present itself using these layers, and I think it would be crazy to implement a system using them. But layering is possible: let us assume stream processing:

  1. Storage/Transportation: this layer is concerned with bits and bytes: HTTP and MIME for example, or file access.
  2. Compression: this layer is concerned with size reduction.
  3. Encoding handling: this layer handles the XML encoding header, converting the data to Unicode.
  4. Normalization: this layer (not part of current systems) copes with Unicode's inherited internal infelicities.
  5. Parameter entity handler: this layer reads parameter entities and expands them, like a partial cpp.
  6. DTD marked section handler: this layer ignores or includes marked sections, like a partial cpp.
  7. Entity expander: this layer reads general entities and expands them and numeric character references, like a partial cpp.
  8. Comment stripper: this layer strips comments, like a partial cpp.
  9. CDATA section handler: this layer converts CDATA sections to a form using &lt;, &amp; and &gt;.
  10. Server-side PI handler: some PIs we handle before parsing; they don't require context. Most server-side tags in server pages use some PI variant.
  11. Parser: we now have a language the size of SML—just elements, attributes, three built-in character references and PIs, but with various declarations; this layer constructs some programmatic representation of the data, perhaps a tree or a stream.
  12. Attribute defaulting: this layer adds default values.
  13. Namespace prefixing: this layer adds explicit namespaces to all elements.
  14. Validation: this stage checks the document against the types declared.
  15. PI handling: this stage processes the PIs.
  16. Element handling: this stage processes elements.
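To show how thin one of these layers can be, here is a sketch of layer 9, the CDATA section handler, written as a simple stream filter. This is my own illustrative Python, not from any XML library, and it omits the skipping of "<?"..."?>" regions that a fuller layer would add:

```python
def strip_cdata(text):
    """Convert CDATA sections to ordinary escaped text.

    The layer never builds a parse tree: it only scans for the
    "<![CDATA[" and "]]>" delimiters and escapes what lies between.
    """
    out, pos = [], 0
    while True:
        start = text.find("<![CDATA[", pos)
        if start == -1:                      # no more sections
            out.append(text[pos:])
            return "".join(out)
        end = text.find("]]>", start + 9)
        if end == -1:                        # unterminated: pass through
            out.append(text[pos:])
            return "".join(out)
        out.append(text[pos:start])
        body = text[start + 9:end]
        out.append(body.replace("&", "&amp;")
                       .replace("<", "&lt;")
                       .replace(">", "&gt;"))
        pos = end + 3
```

Downstream layers then see only ordinary escaped character data.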

That gives a very rich environment. None of these layers needs to contain a full XML parser: for example, the CDATA handler only needs to look for "<![" and "]]>" and to skip "<?" and "?>". So if lack of layerability is not a problem, let us turn 180 degrees and say that XML has too many layers. How do we evaluate "too many"?

Reductionists think that everything that can be removed from XML should be removed. First to go would be the minor features of XML—CDATA, comments and PIs. But these result in almost no reduction in complexity. Then go DTDs and entities, a biggy. Then goes internationalization. Then go attributes, numeric character references, and long end-tags. The language that is left is simple, something like:

data ::=  "<" name " " data* ">"  | "&lt;" | "&gt;" | "&amp;" | other-UTF-8
name ::= ASCII

This type of syntax is akin to s-expressions, familiar to and beloved by LISP programmers, and not dissimilar to Microsoft's RTF either. (However, it has consistently failed to be popular for writing.) At this stage, the softer reductionists will back off a bit, and the ones who want to keep compatibility with XML will win, due to the presence of infrastructure: long end-tags, numeric character references, and perhaps even attributes come back.
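As a rough illustration of how little machinery this stripped-down grammar needs, here is a hypothetical parser sketch (the function name and the tuple representation are my own invention, not part of any proposal):

```python
def parse(s, pos=0):
    """Parse the reductionist grammar:
         data ::= "<" name " " data* ">" | character data
    An element becomes a (name, children) tuple; text stays a string.
    The end-tag is a bare ">". Minimal error handling, by design.
    """
    nodes = []
    while pos < len(s) and s[pos] != ">":
        if s[pos] == "<":
            space = s.index(" ", pos + 1)          # name ends at first space
            name = s[pos + 1:space]
            children, pos = parse(s, space + 1)    # recurse into content
            pos += 1                               # consume the ">" end-tag
            nodes.append((name, children))
        else:
            ends = [i for i in (s.find("<", pos), s.find(">", pos)) if i != -1]
            nxt = min(ends) if ends else len(s)
            nodes.append(s[pos:nxt])
            pos = nxt
    return nodes, pos
```

For example, `parse("<doc <title Hello> world>")[0]` yields `[("doc", [("title", ["Hello"]), " world"])]`.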

So within XML we have so many layers, and the W3C is developing more as its members think of new applications: linking, datatyping, transformation, formatting, scripting, packaging, etc. The problem with treating low-level layers as optional rather than required is that it allows vendors to pick and choose which layers to provide.  What if one big browser company decided not to use namespaces, while another chose to include them? Or if one company decided that the comment layer was no good while another kept it?  This would bring us all the joys of the HTML market, where you have to create different pages for each browser many times. Now perhaps we will always have to transform to different pages, but why should we have to transform to different syntaxes? That is not simplicity. SML must be justified by a respect for plurality rather than by the pursuit of simplicity.

There is always a tendency to confuse a technology with its specification. This was also the demon with SGML, where the difficulty of ISO 8879 was often assumed to be intrinsic to SGML itself. I hope it is not a general requirement of a language specification to present its subject in an "as-layered-as-possible" form.

Fuzzy?

Without taking too much time on this, there is a claim that because attributes and elements are not conceptually different, there should be no separate way to mark them up. The answer I would give is twofold:

  • First, it is a natural pattern in elements to have a head containing metadata and a body containing data; this is true all the way from the top-level elements, such as html, down to the lowest level.
  • Second, elements often push function invocation in programs (a start-tag triggers a handler), while the handler typically pulls attribute values as it needs them: so there is clear pragmatic justification for a distinction being made. This distinction also applies to comments and processing instructions.

So there are general conceptual distinctions at play. Why should these be marked up using special delimiters rather than merely by reserving some special element names? There are several reasons, including readability, simplification of paths, simplification of content models, simplification of at least some kinds of programs, and so on. For example, if there are no attributes, imagine how one would handle IDs and xml:lang: a small simplification in the grammar, but a corresponding complication in client layers.
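As a small illustration (my own example, not taken from any SML proposal), compare an element that uses attributes with an attribute-free encoding that reserves special child element names:

```xml
<!-- with attributes: metadata sits in the tag's head -->
<para id="p1" xml:lang="en">Some text.</para>

<!-- without attributes: reserved element names take over -->
<para><id>p1</id><lang>en</lang>Some text.</para>
```

Every content model and every path expression must now allow for, and skip over, the reserved children; and in mixed content the position of the text suddenly becomes significant.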

But doesn't the element/attribute distinction complicate mapping data from databases which do not have this distinction? To go from a database to XML requires a mapping convention rather than a new syntax.  Removing attributes from XML would remove one question from the list of issues such a system must resolve, but it does not remove them all:  the flattening of data, the representation of relations, the order of records, the format of dates, the  mapping of case-insensitive names, etc.
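A minimal sketch of one such mapping convention (the function name and the rows-as-dicts representation are my assumptions, nothing standardized): each row becomes an element and each column a child element, sidestepping the element/attribute question entirely:

```python
def rows_to_xml(table, rows):
    """Serialize database rows (dicts) as XML under one fixed convention:
    each row -> a <row> element, each column -> a child element named
    after the column. Escaping covers the three delimiter characters."""
    def esc(value):
        return (str(value).replace("&", "&amp;")
                          .replace("<", "&lt;")
                          .replace(">", "&gt;"))
    lines = ["<%s>" % table]
    for row in rows:
        cells = "".join("<%s>%s</%s>" % (col, esc(val), col)
                        for col, val in row.items())
        lines.append("  <row>%s</row>" % cells)
    lines.append("</%s>" % table)
    return "\n".join(lines)
```

Even with attributes out of the picture, the convention must still settle record order, date formats, name case, and the rest of the list above.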

Sugary?

Even though I have used the term reductionist, perhaps some SMLers will prefer to see themselves as essentialists: stripping XML of its syntactic sugar down to its essential features.  But the essential feature of XML is that it is a markup language. It is not merely a language for computer-to-computer exchange—it also provides minimal features aimed at making life easier for direct readers and writers of data.

A markup language is not a data-modeling language. To look at justifications for syntactic and deeper features in XML, one cannot rely on statements like "my data needs this" or "this makes sense in my data model." There is no intrinsic reason to make one atomic piece of information an element or attribute from a data modeling point of view. But from the view of markup, of how a human reads and knows, there is every reason.

So I suspect that many of the proponents of SML really are not interested in making a markup language at all: they want a data representation language, or a data packaging language, or a data transport language. The XML strategy is to build on top of a solid, human-friendly basis. This focus on human factors has allowed SGML-family markup languages to thrive while others wither.

If visual distinctiveness is a primary human-factors requirement for a markup language, the various syntax alternatives within XML make sense:

  • the different delimiters for different kinds of tags
  • the alternative delimiters for attribute values
  • the provision of CDATA sections, which have delimiters that are very unlikely to be found in real data

Even the fact that we cannot use numeric character references in element names shows that the undesirability of unreadable names is considered greater than the desirability of universal transcodability: it is a tradeoff only justifiable by considering XML as a markup language rather than merely as a data interchange language.

Full of Indigestible Foreign Muck?

XML is a technology that provides sufficient internationalization for a fair and usable global infrastructure.

My job for the last year has been to try to encourage Western developers not to ignore the rest of the world. It is not just me trying to tell everyone what they should do; I have been employed (in a desperation move) by an Asian research institution that sees that, without an internationalized WWW, the opportunities for economic development towards an information economy are retarded. Backyard hacks are fun, but after they escape the backyard, they have different requirements.

Technology has an impact; if someone proposes a technology for the World Wide Web, it is fair game to have it examined for its predictable social effects. It is also fair game to have the values underlying it pointed out.

When internationalization is built into the infrastructure, the whole world becomes much simpler. For most non-Americans, and for people who work in scripts other than Western European alphabets, the "simpler" in SML would be rather ironic.

XML was designed in the hope that, by paying full attention to the current plurality of character encodings and by providing a reliable method to label these encodings, XML can act as a Trojan Horse to bring Unicode everywhere. It is naive to think that current and legacy encodings will go away without an off ramp. I suspect that this off ramp is one of the major attractions of XML for large corporations: the number of encodings supported by IBM's and Sun's XML parsers is testament to that.

Furthermore, it is not enough to merely allow different encodings. There must be both a mechanism to label the encoding when the operating system does not provide it (actually, I don't know if any operating system provides this) and a convention to resolve disputes in encodings as data passes through different network layers: for XML, the first is the XML header and the second is the rule that a higher-layer protocol has precedence over a lower. (The XML header has precedence over any markup in the document, and a MIME header has precedence over the XML header.)
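That precedence rule can be sketched in a few lines (a hypothetical helper; the names are mine):

```python
def effective_encoding(mime_charset=None, xml_declaration=None):
    """Resolve an XML entity's encoding by the layered rule described
    above: a charset from a higher-layer protocol (e.g. MIME) beats the
    encoding declaration in the XML header, which beats the UTF-8
    default (BOM auto-detection is ignored here for brevity)."""
    if mime_charset:
        return mime_charset
    if xml_declaration:
        return xml_declaration
    return "UTF-8"
```

A receiving system applies this once per entity, before any parsing begins.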

The kites flown for SML currently provide only UTF-8. This is five years ahead of its time, in my estimation. It is very desirable that people move towards UTF-8 and UTF-16 as a matter of priority; but SML is not the thing that will achieve it—XML will, by acting as a virus for Unicode alongside Java.

Whither SML?

Many of the current justifications offered for SML are unconvincing—the need for tiny SML for personal devices, electronic dictionaries, and WAP in phones. Or the need for even easier implementation, when so many public domain libraries and classes for XML are available.  Or how our lives will be simplified by having more choice, when we can already choose our own simplified XML. I am happy for dialects like SML to exist as part of a methodology giving a well-thought-out way of using or analyzing or implementing XML, but I think we lose out if it is promoted as an alternative syntax outside XML.  If the SMLers hatch some major difference in functionality and there are good reasons why this functionality should be expressed at the lexical level, and if XML cannot express it well, then there is certainly room for a non-XML grammar.  But many things that look different turn out to be yet another structure that namespaces, fixed attributes or PIs are fine for.

So what areas is an SML suited for?  Perhaps reverse engineering will give some clue:

  • If it is UTF-8 only, it is not practical for local use for much non-Western data.
  • If it allows other character sets but does not allow them to be unambiguously labeled, it is not suitable for transnational use.
  • If it does not include PIs, it is not suitable for server use (on the evidence that most server-side includes use special delimiters for what are in fact PIs).
  • If it does not include some mechanism for literal text, it is not suitable for direct data entry.
  • If it does not include syntactic distinction for the most common targets of tags (i.e., comments, elements, processing instructions, entity references), then people must introduce another layer straight away.
  • If it does not have basic attribute defaulting, it must be bundled with some transformation language; so it is best for recipient systems that know the defaults.

So I think that if SML has a future, it may be in the area of closed data transport and interprocess communication, where it is generated by an API and where human readers and writers do not touch it. But that is the very area that binary formats poach easily: some of the requirements may be better solved by more sophisticated entity management capabilities in MIME.

It is good to see some creativity and ingenuity at work. There is nothing wrong with that recipe.