XML.com: XML From the Inside Out


Parsing the Atom

April 25, 2001

This week XML-DEV has been considering some interesting twists on XML data processing, prompted by the use of regular expressions in the W3C XML Schema specification to define complex data types.


While the world may be increasingly surrounded by pointy brackets, the majority of data being exchanged isn't XML, and that will remain true for some time to come (if not always). Yet even where data has been generated as XML, there is a near-infinite variety of forms that markup can take. Hence the numerous initiatives to define horizontal and vertical XML standards; these schemas limit the acceptable forms of XML documents to those deemed suitable for particular application or business uses.

However, the core flexibility of XML -- the ability for anyone to quickly define and produce their own formats -- means that a variety of document types will evolve and coexist. There will never be a single blessed way to mark up a given piece of information, just acceptable forms for particular processing contexts.

Where does that leave the Desperate XML Hacker? How does she deal with the following variety of document fragments:

  <date>30th April 1972</date>

  <date day="30" month="04" year="1972">30th April 1972</date>

  <date><day>30</day><month>04</month><year>1972</year></date>

All these fragments are legal XML; they merely differ in the granularity of their markup. The last fragment is obviously the most granular, and the most convenient for processing with XSLT, for example. But what about this fragment of Scalable Vector Graphics (SVG) markup, with its path data packed into a single attribute value?

  <path d="M 100 100 L 300 100 L 200 300 z"/>

Or perhaps this example of CSS styling?

<a href="http://www.w3.org/"
          style="{color: #900}
          :link {background: #ff0}
          :visited {background: #fff}
          :hover {outline: thin red solid}
          :active {background: #00f}">...</a>

Not quite so amenable to processing, and certainly not with XSLT's limited string-handling capabilities. The only resort, then, is to fall back on application code to parse and process this information. Or perhaps not.
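To see what that application code might look like, here is a minimal sketch in Python (the language and the helper name `parse_style` are our choices for illustration, not anything from the discussion) that splits a compound style value into individually addressable atoms:

```python
import re

# The compound style attribute value from the fragment above.
style = ("{color: #900} "
         ":link {background: #ff0} "
         ":visited {background: #fff} "
         ":hover {outline: thin red solid} "
         ":active {background: #00f}")

# One rule: an optional pseudo-class followed by a brace-delimited block.
RULE = re.compile(r'(:\w+)?\s*\{([^}]*)\}')

def parse_style(value):
    """Split a compound CSS style value into (pseudo-class, property, value) atoms."""
    atoms = []
    for pseudo, body in RULE.findall(value):
        for declaration in body.split(';'):
            if declaration.strip():
                prop, _, val = declaration.partition(':')
                atoms.append((pseudo or None, prop.strip(), val.strip()))
    return atoms

for atom in parse_style(style):
    print(atom)    # e.g. (None, 'color', '#900')
```

Workable, but every application that touches this attribute must carry its own copy of these parsing rules -- which is precisely the interoperability problem discussed below.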

The second part of the XML Schema specification, Datatypes, describes the means to define simple datatypes in W3C XML Schemas. This goes a long way toward standardizing the legal forms of datatypes in XML documents; the specification itself defines many simple types, such as numbers and dates. An important aspect of these definitions is the use of regular expressions to define the legal forms of type values. The specification refers to the units of these regular expressions as atoms, and the expressions are used to validate the contents of elements declared to have specific types.

But what if we could do more than simply validate this content? What if we could use these regular expressions to extract atoms of data from element and attribute values? These are the questions that XML-DEV has been considering.
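As a sketch of the idea -- a hypothetical illustration, not anything defined by the schema specification -- the following Python fragment takes a date pattern of the kind found in the Datatypes specification and adds capture groups, so that the same expression both validates a value and fragments it into atoms:

```python
import re

# A schema-style pattern for ISO 8601 dates; the named groups are our
# addition, turning a validation-only expression into an extraction rule.
DATE = re.compile(r'^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$')

def atoms_of(value):
    """Validate a date value and, if it is legal, return its atoms."""
    match = DATE.match(value)
    if match is None:
        raise ValueError(f"not a legal date: {value!r}")
    return match.groupdict()

print(atoms_of("1972-04-30"))   # {'year': '1972', 'month': '04', 'day': '30'}
```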

Splitting the Atom

Simon St. Laurent began the discussion, unhappy with the current simple and complex types in W3C XML Schemas.

It seems as if regular expressions could be used not just for validation of typed content, but for fragmentation of typed molecules into smaller atoms. Instead of binding users to a particular (ISO 8601) date format, this approach would let users provide their own rules for fragmenting date strings into the parts we need for processing -- year, month, day, etc.

It would also open up the prospect of treating other compounds -- like the CSS style attribute, some of the path information in SVG, and various other places where the principle of one chunk, one string has been violated -- as a set of atoms which could themselves be validated and/or transformed and/or typed.

This leads to another kind of post-processing infoset, where the atoms are available as an ordered set of child nodes, but it seems like a promising road.

St. Laurent argued that interoperability would be increased by defining a mechanism to pull atoms of data out of "molecules" of information, rather than relying on sections of application code to process it.

[B]y making it possible to access that information in multiple environments, you preserve a lot of the interoperability that XML promises. Working with atoms lets us avoid a lot of not-very-portable application code.

St. Laurent was not alone. Steve Rosenberry suggested a similar mechanism to the Schema Working Group earlier in the year. Rosenberry was particularly interested in being able to define values that included units of measurement.

My motivation was to specify an attribute as a numeric value with units of measure attached to it for absolute clarity. (Ask NASA how important this might be. They lost a Mars probe because numbers had no units associated with it and one group assumed metric while the other group was specifying English.)

...Since the only guideline for when one should use attributes vs. elements is "It depends upon the application", I don't see this further structuring, parsing, and use of attributes as fundamentally wrong given that it has clear functionality for certain developers and users of XML.
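Rosenberry's measurement case can be sketched the same way. The pattern, the helper, and the set of permitted units below are all our assumptions for illustration; a real schema author would choose their own:

```python
import re

# A hypothetical pattern for a measurement "molecule": number plus unit.
MEASURE = re.compile(r'^(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)$')

# The units this sketch chooses to allow -- an assumption, not a standard.
KNOWN_UNITS = {"m", "cm", "mm", "in", "ft"}

def parse_measure(text):
    """Split a measurement into value and unit, rejecting unitless numbers."""
    match = MEASURE.match(text.strip())
    if match is None or match.group("unit") not in KNOWN_UNITS:
        raise ValueError(f"no recognized unit of measure in {text!r}")
    return float(match.group("value")), match.group("unit")

print(parse_measure("12.5 cm"))  # (12.5, 'cm')
```

Note that a bare "12.5" is rejected outright -- exactly the kind of ambiguity that Rosenberry's Mars probe anecdote warns against.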

This highlights another aspect to the debate. If there are several atoms of information within a given value, then surely they must be explicitly identified. Henry Thompson was one XML-DEV member with this viewpoint:

Strong disagreement (speaking personally). We have a way in XML to express compound objects -- it's called elements-and-attributes. The mistake, in my opinion, was giving in to the SQL people and having _any_ kind of date or time as simple types -- they should _all_ have gone in to the type library as complex types.

Thompson suggests here that many of the simple XML Schema types might better have been explicitly defined as complex types. Indeed, Thompson (again expressing a personal viewpoint) believes that the most granular form of markup is the correct approach.

While recognizing the utility of regular expression processing of information atoms in simple cases, Michael Brennan had some concerns over a possible trend toward sparsely marked-up data.

...I hope there is not going to be too much of a trend toward doing this sort of thing. In my mind, if a datatype has some structure to it, why not just make it a complex type and leverage XML syntax to convey that structure? Isn't that really the whole point of XML -- a standardized syntax for conveying structure?

Anders Tell had similar reservations.

That doesn't mean that standardizing [dates and times] is bad, but they should be standardized as compound types -- otherwise, it will encourage people to make other things into "simple" types with their own special parsing rules (For example, phone numbers, ZIP codes, UK Postcodes...

While philosophically in agreement with this viewpoint, St. Laurent believed that the trend toward data clumps was already in evidence, citing several examples. St. Laurent stated his goal as dealing with the variety of data already being produced:

I'm afraid the trend's already happened...

Given that situation, I'd like very much to have a means of breaking into different lexical forms representing such compounds without having to revert to full-scale XML Schema processing.

It's not so much that I want to encourage such things, but that they already exist and that I'm not especially impressed with current models for processing and handling them.

Expressing similar sentiments, Jonathan Borden highlighted the role of St. Laurent's proposal within a typical XML processing chain and suggested that the feature need not be limited to XML Schemas.

Generally I agree with this sentiment that markup is the best way to represent structured data. The problem exists with getting stuff into the proper form -- especially when the data you are handed isn't organized in an ideal fashion. I like to use a processing chain in such cases, and to the extent that regular expression matching/parsing can be integrated into an XML processing chain, we might use standard XML techniques such as XSLT transforms to "clean up" such data into a properly structured form.

I'm not sure I need this facility in XML Schema per se, what I would really like is an XSLT/XPath regular expression function to include variable bindings. Recursive parsing or character data in XSLT is a fairly ugly proposition at the moment, but something that is frequently needed in practical applications.
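The "clean up" step in a processing chain that Borden describes can be sketched as follows. This is our own illustration in Python (chosen precisely because, as Borden notes, recursive parsing of character data in XSLT 1.0 is an ugly proposition): a regular expression fragments a compound value and emits it as explicitly marked-up atoms for the rest of the chain to consume:

```python
import re
import xml.etree.ElementTree as ET

# Fragment a compound date value into child elements -- the "clean up"
# stage of a processing chain. The pattern is a hypothetical user-supplied
# fragmentation rule of the kind St. Laurent proposes.
DATE = re.compile(r'(?P<day>\d{1,2})\w*\s+(?P<month>\w+)\s+(?P<year>\d{4})')

def explode(element):
    """Replace an element's compound text with explicitly marked-up atoms."""
    match = DATE.search(element.text or "")
    if match:
        element.text = None
        for name, value in match.groupdict().items():
            ET.SubElement(element, name).text = value
    return element

date = ET.fromstring('<date>30th April 1972</date>')
print(ET.tostring(explode(date), encoding="unicode"))
# <date><day>30</day><month>April</month><year>1972</year></date>
```

Downstream of this step, standard XSLT transforms can operate on the date's parts as ordinary child nodes.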

Microparsing and Beyond

Use of regular expressions is only the beginning, although in most cases they're likely to hit the 80/20 sweet spot. Complex data molecules, such as XPath expressions and SVG paths, may require more complex parsing rules. Extracting information in this manner is often termed 'microparsing'. It is also possible to identify an alternative, more verbose XML syntax for such information that may be easier to manipulate with the available tools.

This is an interesting parallel to the recent effort to define an XML encoding for XPath. Robin Berjon also posted a simple example demonstrating the generation of SVG paths from an alternative path vocabulary. One can view this as a downward translation, or compression, of a more readable form into a succinct one. Obviously it is desirable that the compact format can also be uncompressed, or upwardly translated, back into its verbose equivalent. Len Bullard observed that these kinds of encoding have been a common feature of standardization efforts.

This points out something that recurs a lot and was noted often in the days before people were trained to think of markup languages as a single DTD or schema which everyone implements: the need for an up/down transformable language spec. This approach was well-understood and documented in the past.

For a standard language there should be:

  1. A form that is an up translation target. The maximum information form into which and out of which other forms can be created.
  2. Specifications for compressed format. These should be normatively expressed as the transform itself.

... SVG with minimized paths is a compressed form, one of the possible normative compressed forms which could be documented by transform.
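Bullard's up/down transformation pair can be sketched for a tiny subset of SVG path data. The verbose form below -- a list of (command, x, y) triples standing in for elements like a hypothetical <moveto x="100" y="100"/> -- is our own toy vocabulary, not Berjon's actual one:

```python
import re

# A toy verbose vocabulary: the "maximum information" up-translation target.
verbose = [("moveto", 100, 100), ("lineto", 300, 100), ("lineto", 200, 300)]

COMMANDS = {"moveto": "M", "lineto": "L"}
NAMES = {v: k for k, v in COMMANDS.items()}

def down_translate(steps):
    """Compress the verbose form into minimized SVG path data."""
    return " ".join(f"{COMMANDS[cmd]} {x} {y}" for cmd, x, y in steps)

def up_translate(d):
    """Uncompress path data back into the verbose form."""
    return [(NAMES[cmd], int(x), int(y))
            for cmd, x, y in re.findall(r'([ML])\s+(\d+)\s+(\d+)', d)]

path = down_translate(verbose)
print(path)                          # M 100 100 L 300 100 L 200 300
assert up_translate(path) == verbose # the round trip is lossless
```

The point of Bullard's recommendation is that the compressed form be documented normatively as the transform itself, so that the round trip is guaranteed rather than accidental.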

To conclude, it seems there may be some mileage in defining finer-grained utilities for manipulating data within XML markup. Although regular expressions have long been part of SGML and XML -- content models are essentially regular expressions over elements, and they form the basis of schema languages in general -- their formal addition to XML Schemas opens up some additional possibilities. Perl programmers may feel justifiably smug. C developers may wish to look at Hackerlab Rx-XML, a regular expression matcher that processes XML Schema regular expressions.

Whether the individual design decisions that have led to complex data formats within elements and attributes are themselves questionable is a moot point (and one previously debated on XML-DEV). The very real situation is that there are many varieties of data and markup to deal with, and it's always handy to have a few extra tools in the toolbox to handle them.