Hacking XML

September 15, 2004

O'Reilly Book Excerpts: XML Hacks

Author's note: Among my favorite hacks in XML Hacks are two that use an SGML package called SP to do some clever tricks. James Clark's SP is a free, open-source SGML-parser package that includes an SGML parser called nsgmls and an SGML-to-XML converter called sx. The first hack excerpted below is Hack #20, Sean McGrath's "Create Well-Formed XML with Minimal Manual Tagging Using an SGML Parser." This hack, which uses the SP SGML parser, shows you how to take a document that is minimally tagged and convert it to well-formed XML. Another ingenious hack is Hack #94, Rick Jelliffe's "From Wiki to XML, through SGML." This one was originally featured in an XML.com article in March 2004. It uses SHORTREF maps to STARTTAG and ENDTAG style entity references to convert a Wiki format to XML via SGML and SP tools. I'm sure you'll agree with me that these hacks are timesavers and downright fun to use.

Hack #20: Create Well-Formed XML with Minimal Manual Tagging Using an SGML Parser

The problem of converting plain text into basic, well-formed XML occurs over and over again in XML processing. As a general rule, I like to get data into XML as quickly as possible and leave it in XML for as long as possible (preferably forever). The sooner I can get data into XML, the sooner I can bring all my XML-processing tools and knowledge to bear on the data-processing challenges.

When the volume of markup to be created is small, hand-editing using one-off text editor macros is a powerful technique. For higher volumes of markup, a custom program is often the best way to go—Python, Ruby, and Perl, for example, all excel at this sort of work.

Sometimes, the quickest way to get data into XML is by combining judicious use of hand-edits and automatic addition of the markup required using an SGML parser. XML is a subset of a much larger markup technology standard known as SGML (ISO 8879:1986), which has been an international standard since 1986. SGML provides a variety of mechanisms, not found in XML, to minimize the amount of tagging required in documents. Collectively, these techniques are known as markup minimization features. By using an SGML parser to process text, it is possible to take advantage of the tag minimization features to automatically add markup and help create well-formed XML documents.

In these examples, we will use James Clark's SP SGML parser. You can download it from http://www.jclark.com/sp/. The examples in this hack assume that SP has been installed in the working directory for the book's files.

From HTML to XML

You may already be familiar with some of SGML's tag minimization capabilities, as they are used extensively in HTML. (HTML is an example of an SGML application—by far the most successful SGML application in the world.)

The most common tag minimization technique from SGML used in HTML is known as tag omission. Here is a small HTML document, min.html, which, thanks to SGML's tag omission features, is valid per the HTML DTD:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN">

<title>Hello World</title>

<p>Hello World

Note that numerous HTML tags that you normally see have been omitted from the document: there is no head element, no body element, and no html element. The end tag of the p element has also been omitted.

Using the nsgmls command-line application that ships with SP, we can parse this document against the HTML DTD on Windows using this command:

nsgmls -c pubtext/html.soc min.html >nul

Or on Unix by using this:

nsgmls -c pubtext/html.soc min.html >/dev/null

The -c command-line option is used to tell the parser where to find the HTML DTDs. These are shipped in the pubtext subdirectory of SP, which came with the archive of files for the book. I have redirected normal output to the null device in the examples. The fact that no errors are displayed on the screen tells us that, from an SGML perspective, the document is both well-formed and valid per the HTML DTD.

SP also ships with sx, a utility for converting documents from SGML to XML. Using the sx utility, we can now automatically add all the tags needed to make min.html a valid XML document. Run it on Windows or Unix like this:

sx -c pubtext/html.soc -xno-nl-in-tag min.html >min.xml

The -xno-nl-in-tag option tells sx not to add newlines inside the tags it creates. By default, sx breaks lines inside start tags to avoid producing very long lines of XML output; this option turns that behavior off. For a complete list of sx options, see doc/sx.htm in the SP distribution.

The resultant file, min.xml, is shown in Example 1 and is indented for clarity.

Example 1. min.xml

<?xml version="1.0"?>

<HTML VERSION="-//IETF//DTD HTML 2.0 Strict//EN" SDAFORM="Book">

 <HEAD>

  <TITLE SDAFORM="Ti">Hello World</TITLE>

 </HEAD>

 <BODY>

  <P SDAFORM="Para">Hello World</P>

 </BODY>

</HTML>

There are a total of ten tags in this document, of which sx has added seven automatically, while we only contributed three manually—a 70 percent savings on manual markup!
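
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the regular expression is a rough tag matcher that is adequate for this small example, not a general XML tokenizer, and the two strings are retyped from min.html and Example 1 rather than read from disk.

```python
import re

# Rough matcher for start and end tags (assumption: no ">" inside attribute values).
TAG = re.compile(r"</?[A-Za-z][^>]*>")

# What we typed by hand in min.html (the DOCTYPE line is not a tag).
hand_written = "<title>Hello World</title>\n<p>Hello World\n"

# What sx produced in Example 1, without the indentation.
generated = (
    '<HTML VERSION="-//IETF//DTD HTML 2.0 Strict//EN" SDAFORM="Book">'
    '<HEAD><TITLE SDAFORM="Ti">Hello World</TITLE></HEAD>'
    '<BODY><P SDAFORM="Para">Hello World</P></BODY></HTML>'
)

manual = len(TAG.findall(hand_written))  # tags typed by hand
total = len(TAG.findall(generated))      # tags in the sx output
added = total - manual                   # tags supplied by sx
print(manual, total, added, f"{100 * added // total}% added by sx")
```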

In addition to adding start and end tags as required, sx has also added attributes called SDAFORM and VERSION. These are examples of defaulted attribute values. Defaulting attribute values is a form of markup minimization that, unlike SGML's tag minimization, is included in the XML standard.
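
To see that attribute defaulting really does survive in XML, here is a minimal sketch using Python's expat binding. The tiny one-element document and its internal DTD subset are made up for illustration (only the SDAFORM value is borrowed from Example 1); the point is that the parser reports an attribute that never appears in the document instance.

```python
import xml.parsers.expat

# Hypothetical one-element document; the ATTLIST supplies a default for SDAFORM.
doc = """<?xml version="1.0"?>
<!DOCTYPE P [
  <!ELEMENT P (#PCDATA)>
  <!ATTLIST P SDAFORM CDATA "Para">
]>
<P>Hello World</P>"""

seen = {}

def start(name, attrs):
    # By default expat reports defaulted attributes alongside specified ones.
    seen[name] = dict(attrs)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.Parse(doc, True)
print(seen)
```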

Marking Up the Names of People

A common problem in XML data processing is dealing with the names of people. Many applications require that people's names be split into two parts—a family name and a given name. In the general case, doing this across all languages and cultures is very complex at best and impossible at worst. Even within a limited set of languages/cultures, the complexity of the problem rapidly manifests itself. Consider the following text file (names.txt), which contains the names of three people:

Asmar Hohsen  Mickey Joe Mac Entaggart   Javier Ausas Lopez de Castro

Splitting these names into their given name and surname component parts requires the application of complex rules, rules that are very difficult to explain to a computer. We can take advantage of our human ability to out-guess machines to get this data into an XML form quickly by using an SGML parser. The critical human interventions we need to make are:

Split the list into separate names using the whitespace information and our best guess as to where the boundaries lie.
Mark the point where the surname begins, changing the order of given name and surname, as needed.

Here is an SGML document created with the minimal amount of markup added. A Name tag is used to mark the start of each name, and an S tag is used to mark the point where a surname starts (names.sgml):

<!DOCTYPE Names SYSTEM "names.dtd">

<Name>Asmar  <S>Hohsen <Name>Mickey Joe <S>MacEntaggart 

<Name>Javier <S>Ausas Lopez de Castro

Now we need to create a DTD to describe the Names document type. In XML, it would look like this (namex.dtd):

<!ELEMENT Names (Name*)>

<!ELEMENT Name (F,S)>

<!ELEMENT F (#PCDATA)>

<!ELEMENT S (#PCDATA)>

To make it SGML compatible, we need to make a minor alteration (names.dtd):

<!ELEMENT Names o o (Name*)>

<!ELEMENT Name o o (F,S)>

<!ELEMENT F o o (#PCDATA)>

<!ELEMENT S o o (#PCDATA)>

Note the pair of lowercase o's (o o) between the element type name and the content model of each element type declaration. The o stands for omissible and indicates that documents may omit the start tag (first o) and end tag (second o).
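
To make the effect of those omissible tags concrete, here is a toy emulation in Python of the inference the parser performs for this one DTD. It is not how SP works internally and it handles nothing beyond the Name/F/S pattern; the function name infer_tags is hypothetical.

```python
from xml.sax.saxutils import escape

def infer_tags(minimal: str) -> str:
    """Expand '<Name>given <S>surname ...' into fully tagged XML.

    Toy stand-in for tag inference with names.dtd: each <Name> implies a
    following <F>, each <S> implies </F>, and the next <Name> (or the end
    of input) implies </S></Name>.
    """
    out = ["<names>"]
    # Everything before the first <Name> is ignored; each later chunk
    # holds "given <S>surname".
    for chunk in minimal.split("<Name>")[1:]:
        given, sep, surname = chunk.partition("<S>")
        if not sep:
            raise ValueError(f"missing <S> in {chunk!r}")
        out.append(f"<name><f>{escape(given)}</f><s>{escape(surname)}</s></name>")
    out.append("</names>")
    return "".join(out)

print(infer_tags("<Name>Mickey Joe <S>MacEntaggart <Name>Javier <S>Ausas Lopez de Castro"))
```

A real SGML parser derives these rules from the content models rather than having them hard-coded, which is why the DTD approach scales to document types a script like this would not.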

Now we can parse the document with the nsgmls utility to check for errors. On Windows, the command is:

nsgmls names.sgml >nul

On Unix, the command is:

nsgmls names.sgml >/dev/null

The fact that no error messages appear on the screen tells us that the document is well-formed and valid per names.dtd. Now we can proceed to use the sx utility to generate fully marked-up XML from this document. On Windows or Unix, the command is:

sx -x no-nl-in-tag -x lower names.sgml >names.xml

Note the addition of another -x switch with lower. This will produce tag names in lowercase. The resultant XML file is names.xml (Example 2), which is indented for clarity.

Example 2. names.xml

<?xml version="1.0"?>

<names>

 <name>

  <f>Asmar</f>

  <s>Hohsen </s>

 </name>

 <name>

  <f>Mickey Joe </f>

  <s>MacEntaggart   </s>

 </name>

 <name>

  <f>Javier </f>

  <s>Ausas Lopez de Castro</s>

 </name>

</names>

You can't do that with just plain old XML!
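
Once the names are in XML, ordinary XML tooling takes over. As a small sketch, the following Python reads a shortened, unindented copy of names.xml and prints each person surname first. The string is retyped here rather than loaded from the file, and the strip() calls are an assumption about what a consumer would want, since the whitespace typed in the source is carried into the output.

```python
import xml.etree.ElementTree as ET

# Shortened copy of names.xml (Example 2), without the indentation.
names_xml = (
    "<names>"
    "<name><f>Mickey Joe </f><s>MacEntaggart   </s></name>"
    "<name><f>Javier </f><s>Ausas Lopez de Castro</s></name>"
    "</names>"
)

root = ET.fromstring(names_xml)

# strip() drops the whitespace carried over from the source text.
people = [
    (name.findtext("s").strip(), name.findtext("f").strip())
    for name in root.iter("name")
]
for surname, given in people:
    print(f"{surname}, {given}")
```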

Hack #94: From Wiki to XML, through SGML

Wikis exploded onto the scene in the late 1990s but have been quieter recently. A Wiki is a site written using a very simplified syntax without tags (e.g., a blank line means start a new paragraph). The nice thing about Wiki formats is that they reduce the number of keystrokes needed to mark up a document (to the level of very simple HTML) to the same number as a nice, swanky, custom application needs. Today I am writing this in Mozilla Composer: in order to put in a link I need to type ^L to open the link editor, type the text, then Tab, then the URL, then Enter: three characters overhead. In a Wiki, I type [ and then the text, a > character followed by the URL, and then ]. Pretty much identical to Mozilla Composer, but with the advantage that one doesn't need to learn any special keystrokes or customize an editor to cope with them.

WikiWikiWeb (http://c2.com/cgi/wiki?WikiWikiWeb) provides a simple rich text format based on visual cues rather than HTML tags. This is convenient for hand editing and has the virtue that the raw text gives a rough indication of the formatted text. Wiki-like syntaxes are useful, for example, for creating documents in mail programs or PDAs, where terseness is important and there is no well-formedness checking or validation available.

SGML: A Language for Describing Wikis

Now that Wiki and XML are both fairly established, they give us nice end-points for suggesting where SGML fits in: SGML is a language for defining the syntaxes of Wiki-like languages and parsing them into HTML-like documents with missing tags, then filling in the gaps to reveal XML-like structured documents.

The idea is as old as the hills, so old that I expect it will be patented soon. SGML provided many facilities to support this kind of terse markup. XML removed support in the name of ease of implementation, but the strict mantra of "all structure should be explicitly tagged with elements" is not always the best answer.

This hack shows how to describe Wiki content using SGML. This technique converts a Wiki page into XML using the open source SP software from James Clark (http://www.jclark.com/sp/). Using this technique, you can provide your users with low-keystroke ways to send structured data with less opportunity for syntax errors and no reliance on the editing system at their end. The data does not need to be rich text—it could even be simple records of data.

An SGML Document Type for Wiki

The first step is to describe Wiki as an SGML document type, using HTML names where possible:

<!ENTITY % blocks     " p | head | bull1 | bull2 | bull3  | li1 | li2 | 

    li3 |  bq  " >

<!ENTITY % inlines     " b | i | bi | tt " >

<!ELEMENT page         o o ( keyword |%blocks; | pre | dl | hr )* >

<!ELEMENT ( dt | dd | pre | %blocks;  )

            o o  ( #PCDATA | %inlines; | link  )* >

<!ELEMENT ( text | ref | %inlines;   )

            o o ( #PCDATA )* >

   

<!ELEMENT dl         o o ( dt, dd? )+>

<!ELEMENT hr         o o EMPTY >

<!ELEMENT link         o o (text, ref? ) >

<!ELEMENT keyword    o o (#PCDATA) >

If you are used to XML DTDs, you can see this is a little terser. For example, the same element declaration can declare multiple element types, and you can use a parameter entity reference. The various - and o symbols, also unavailable in XML, describe whether the start and end tags can be implied by the parser (o means "omissible").

There are some differences in this document type from HTML. The top-level element is page. List items are not nested, but indented according to their number: bulln means a bullet list at indent level n. lin means a list with no bullet at that level. There is an element link instead of the a or img elements from HTML.

Which Wiki?

There are many Wiki dialects. In this hack, we use a fairly generic one that falls between the readily parseable WikiWorks dialect and the original Wiki, and is tweaked to be suitable for blogs. The fact that there are already many mutually incompatible dialects of Wiki should be no surprise: the optimal shortcuts for you depend on the kinds of text you have to deal with.

Let's first describe the various rules for the Wiki in rough English, and then we can figure out the declarations for them to get them into our SGML vocabulary:

Start-of-line followed by ---- means a hr
A blank line starts a new paragraph except in a pre
A start of line followed by a space means a pre
In a pre, a blank line or a line starting with a space continues the pre
Start-of-line followed by * means a bull1; start-of-line followed by ** means a bull2; and start-of-line followed by *** means a bull3
Start-of-line followed by # means an li1 indented list with no bullet or number; start-of-line followed by ## means an li2 indented list with no bullet or number; start-of-line followed by ### means an li3 indented list with no bullet or number
Start-of-line followed by a tab means a dl followed by a dt
In a dt element, a : starts a dd
Start-of-line followed by a tab, space, :, and another tab starts a bq
Start-of-line followed by # means a numbered list li1
'' starts and ends an i element (or end-of-line)
''' starts and ends a b element (or end-of-line)
'''' become a single quote
''''' starts and ends bold italic text (or end-of-line)
** starts and ends a tt section (when not in a pre)
[ and ] delimit a link
In a link, a > means the text to its left is a title and to the right is the reference; otherwise, the text is the reference
[[ to escape [ in normal text
Start-of-line followed by ! is a heading
Start-of-line followed by = is a keyword (metadata)
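
Before turning these rules into SGML declarations, it can help to see a handful of them as a plain lookup. The sketch below, in Python, covers only some of the line-start markers from the list above (pre, dl, and bq handling is left out), and it reports unmarked lines simply as "text" rather than deciding where paragraphs begin. The names LINE_RULES and classify are hypothetical.

```python
# A subset of the line-start markers above, mapped to element names.
LINE_RULES = {
    "----": "hr",
    "*": "bull1", "**": "bull2", "***": "bull3",
    "#": "li1", "##": "li2", "###": "li3",
    "!": "head",
    "=": "keyword",
}

def classify(line):
    """Return (element, remaining text), preferring the longest marker."""
    # Longest markers first, so "**" is not mistaken for "*" plus text.
    for marker in sorted(LINE_RULES, key=len, reverse=True):
        if line.startswith(marker):
            return LINE_RULES[marker], line[len(marker):]
    return "text", line

for line in ["!An Example Document", "=Wiki", "**with some kinds of nested list", "----"]:
    print(classify(line))
```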

Some Wiki dialects also allow tables and automatic link recognition. The first is left as an exercise to the reader; the second is more than SGML can handle. Once you have described your Wiki dialect in terms of SGML, you can easily add more shortcuts as needed.

Wiki as SGML

Now that we have a description of our Wiki dialect, and of the abstract grammar of a Wiki page, we need a way of parsing one in terms of the other.

SGML's short references provide just that. We can tell the parser to substitute a particular tag for a particular delimiter string. SGML provides four features to make this feasible:

You can recognize different delimiter strings in different contexts (using different delimiter maps).
An SGML parser is pretty smart about filling in the gaps for missing tags (tag implication).
SGML (like any decent text processing system) allows us to distinguish between end-of-line and start-of-line positions (let's call them RE and RS, for Record End and Record Start).
The parser adopts a longest-match-first approach to delimiter matching.
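
The longest-match-first point matters most for the runs of single quotes, where every shorter delimiter is a prefix of a longer one. Here is a rough Python sketch of that behavior, assuming the quote conventions of this Wiki dialect; it illustrates the matching order only and is not a model of SP's delimiter recognition. The names QUOTE_RUNS and quote_tokens are hypothetical.

```python
import re

# Quote-run lengths in this dialect and what each one stands for.
QUOTE_RUNS = {2: "i", 3: "b", 4: "literal-quote", 5: "bi"}

def quote_tokens(text):
    """Split text into plain strings and (kind, run) pairs for quote runs."""
    tokens = []
    # The five-quote form is tried first so a shorter form cannot claim
    # the start of a longer run.
    pieces = re.split(r"('{5}|'{4}|'{3}|'{2})", text)
    for index, piece in enumerate(pieces):
        if index % 2:            # odd positions hold the captured quote runs
            tokens.append((QUOTE_RUNS[len(piece)], piece))
        elif piece:              # skip empty strings between adjacent matches
            tokens.append(piece)
    return tokens

print(quote_tokens("But that is '''not''' ''all''!"))
```

If the alternatives were listed shortest first, three quotes would be read as an italic delimiter followed by a stray quote, which is the failure the longest-match rule prevents.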

First, we define entity references for all the start tags and some end tags. We will use these later. We use a special form of the entity reference that doesn't need to be reparsed, shown in Example 1. Examples 1 through 3 (including the fragments shown) are all in the file wiki.sgm.

Example 1. STARTTAG entity references

<!ENTITY bull1-s STARTTAG "bull1" > 

<!ENTITY bull2-s STARTTAG "bull2" > 

<!ENTITY bull3-s STARTTAG "bull3" > 

<!ENTITY li1-s STARTTAG "li1" > 

<!ENTITY li2-s STARTTAG "li2" > 

<!ENTITY li3-s STARTTAG "li3" >  

   

<!ENTITY p-s STARTTAG "p" > 

<!ENTITY pre-s STARTTAG "pre" > 

<!ENTITY bq-s STARTTAG "bq">

<!ENTITY hr-s STARTTAG "hr" >

<!ENTITY head-s STARTTAG "head" >

<!ENTITY key-s STARTTAG "keyword">

<!ENTITY key-e ENDTAG "keyword" >

   

<!ENTITY b-s STARTTAG "b"> 

<!ENTITY bi-s STARTTAG "bi"> 

<!ENTITY i-s STARTTAG "i"> 

<!ENTITY b-e ENDTAG "b"> 

<!ENTITY bi-e ENDTAG "bi"> 

<!ENTITY i-e ENDTAG "i"> 

<!ENTITY tt-s STARTTAG "tt">

<!ENTITY tt-e ENDTAG "tt">

   

<!ENTITY ref-s STARTTAG "ref"> 

<!ENTITY link-s STARTTAG "link"> 

<!ENTITY link-e ENDTAG "link"> 

   

<!ENTITY dt-s "</><dt>" >

<!ENTITY dd-s STARTTAG "dd" >

We also add a few funnies:

<!ENTITY fourQuot CDATA "'" >

<!ENTITY lsb CDATA "[">

Next we define maps (sets) of these in Example 2.

Example 2. Maps to entity references

<!SHORTREF imap

    "''" i-e 

    "&#RE" i-e >

<!SHORTREF bmap

    "'''" b-e 

    "&#RE" b-e >

<!SHORTREF bimap

    "'''''" bi-e 

    "&#RE" bi-e >

<!SHORTREF linkmap

    "]" link-e

    ">" ref-s >

<!SHORTREF dtmap

    ":" dd-s >

<!SHORTREF ttmap

    "**" tt-e 

    "&#RE;" tt-e >

<!SHORTREF keymap

    "&#RE;" key-e >

   

<!SHORTREF pagemap

    "&#RS;----" hr-s

   

    "&#RS;*" bull1-s

    "&#RS;**" bull2-s

    "&#RS;***" bull3-s

    "&#RS;#" li1-s

    "&#RS;##" li2-s

    "&#RS;###" li3-s

   

    "&#RS;&#RE;&#RS;*" bull1-s

    "&#RS;&#RE;&#RS;**" bull2-s

    "&#RS;&#RE;&#RS;***" bull3-s

    "&#RS;&#RE;&#RS;#" li1-s

    "&#RS;&#RE;&#RS;##" li2-s

    "&#RS;&#RE;&#RS;###" li3-s

   

    "&#RS;&#RE;&#RS;" p-s

    "&#RS;!" head-s

    "&#RS;=" key-s

    "&#RS;&#TAB;" dt-s

    "&#RS;&#TAB;&#SPACE;:" bq-s

    "&#RS;&#SPACE;" pre-s

   

    "&#RS;&#RE;&#RS;&#RE;&#RS;" p-s

    "&#RS;&#RE;&#RS;!" head-s

    "&#RS;&#RE;&#RS;&#TAB;" dt-s

    "&#RS;&#RE;&#RS;&#TAB;&#SPACE;:" bq-s

    "&#RS;&#RE;&#RS;&#SPACE;" pre-s

   

    "''" i-s

    "'''" b-s

    "''''" fourQuot

    "'''''" bi-s

    "**" tt-s

   

    "[" link-s

    "[[" lsb

>

   

<!-- the difference with pagemap is that a blank does not start a para -->

   

<!SHORTREF premap

    "&#RS;----" hr-s

   

    "&#RS;*" bull1-s

    "&#RS;**" bull2-s

    "&#RS;***" bull3-s

    "&#RS;#" li1-s

    "&#RS;##" li2-s

    "&#RS;###" li3-s

   

    "&#RS;&#RE;&#RS;*" bull1-s

    "&#RS;&#RE;&#RS;**" bull2-s

    "&#RS;&#RE;&#RS;***" bull3-s

    "&#RS;&#RE;&#RS;#" li1-s

    "&#RS;&#RE;&#RS;##" li2-s

    "&#RS;&#RE;&#RS;###" li3-s

   

    "&#RS;!" head-s

    "&#RS;&#TAB;" dt-s

    "&#RS;&#TAB;&#SPACE;:" bq-s

    "&#RS;&#SPACE;" pre-s

   

    "&#RS;&#RE;&#RS;!" head-s

    "&#RS;&#RE;&#RS;&#TAB;" dt-s

    "&#RS;&#RE;&#RS;&#TAB;&#SPACE;:" bq-s

    "&#RS;&#RE;&#RS;&#SPACE;" pre-s

   

    "''" i-s

    "'''" b-s

    "''''" fourQuot

    "'''''" bi-s

   

    "[" link-s

    "[[" lsb

>

There are some extra declarations to handle the common case of someone typing two blank lines. This will reduce the number of spurious elements with no content.

And finally, we define when each map is active (which delimiters get recognized in which elements) in Example 3.

Example 3. Define maps as active

<!USEMAP imap i>

<!USEMAP bmap b>

<!USEMAP bimap bi>

<!USEMAP ttmap tt>

<!USEMAP linkmap ( link | ref | text ) >

<!USEMAP dtmap dt >

<!USEMAP keymap keyword >

   

<!USEMAP pagemap ( page | %blocks; | dd ) >

<!USEMAP premap pre >
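
The combined effect of the SHORTREF and USEMAP declarations can be pictured as a two-level lookup: the open element selects a map, and the map says what a delimiter means there. The Python sketch below retypes only a few entries from the declarations above, and the (action, element) pairs are a simplification of the entity references they stand for; MAPS, USEMAP, and meaning are hypothetical names.

```python
# A few entries from the short reference maps: delimiter -> (action, element).
MAPS = {
    "pagemap": {"''": ("start", "i"), "'''": ("start", "b"), "[": ("start", "link")},
    "imap": {"''": ("end", "i")},
    "bmap": {"'''": ("end", "b")},
    "linkmap": {"]": ("end", "link"), ">": ("start", "ref")},
}

# Which map is active inside which element (a subset of the USEMAP declarations).
USEMAP = {
    "page": "pagemap", "p": "pagemap",
    "i": "imap", "b": "bmap",
    "link": "linkmap", "ref": "linkmap", "text": "linkmap",
}

def meaning(delimiter, open_element):
    """Return (action, element), or None if the delimiter is plain data here."""
    return MAPS[USEMAP[open_element]].get(delimiter)

print(meaning("''", "p"))     # starts an i element
print(meaning("''", "i"))     # ends the i element
print(meaning(">", "p"))      # not a delimiter in a paragraph
print(meaning(">", "link"))   # starts the ref part of a link
```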

How does it work? Example 4 is a Wiki document:

Example 4. A Wiki document (page.txt)

!An Example Document

=Wiki

=SGML

=XML

   

This is an 

example document.

   

*It has some

kind of list

**with some kinds of nested list

* and also

#some

##type of

###indentation

   

But that is '''not''' ''all''!

You can link by URL alone

[http://www.topologi.com], by name plus **URL**,

[Schematron>http://www.ascc.net/xml/schematron]

or by an existing name only 

[Schematron] (in the last case, the [[system] must fill

in the gap from a linkbase, so it mightn''''t work

the first time a document is link-indexed.)

----

 And here we have some preformatted text

which should be '''OK'''

   

And still ''should'' be preformatted.

----

!Now Another Head

   

    A term: a definition

   

    Another term: another definition

with wrapped text

     : This is supposed to be a block quote now

but...

   

I am not sure how useful it is.

   

And here is another paragraph.

Example 5 shows the document in XML (page.xml), after the text has been parsed as SGML and re-emitted as XML.

Example 5. page.xml

<?xml version="1.0"?>

<page><head>An Example Document

</head><keyword>Wiki</keyword><keyword>SGML</keyword><keyword>XML</keyword>

<p>This is an example document.</p>

<bull1>It has some kind of list</bull1>

<bull2>with some kinds of nested list</bull2>

<bull1> and also</bull1>

  <li1>some</li1>

  <li2>type of</li2>

  <li3>indentation</li3>

<p>But that is <b>not</b> <i>all</i>!

You can link by URL alone

<link><text>http://www.topologi.com</text></link>, by name plus 

<tt>URL</tt>,

<link><text>Schematron</text><ref>http://www.ascc.net/xml/schematron

</ref></link> or by an existing name only 

<link><text>Schematron</text></link> (in the last case, the [system] 

must fill in the gap from a linkbase, so it mightn't work

the first time a document is link-indexed.)

</p><hr/><pre>And here we have some preformatted text

which should be <b>OK</b>

And still <i>should</i> be preformatted.

</pre><hr/><head>Now Another Head

</head><dl><dt>A term</dt><dd> a definition

</dd><dt>Another term</dt><dd> another definition

with wrapped text

</dd></dl><bq> This is supposed to be a block quote now

but...

</bq><p>I am not sure how useful it is.

</p><p>And here is another paragraph.</p></page>

The following command line was used:

sx -wno-all -xno-nl-in-tag -xlower -xempty wiki.sgm page.txt > page.xml

When we parse the Wiki page, we will need to prepend the SGML declaration as well as the appropriate DOCTYPE declaration, which should say we are starting with a page element. Actually, we are going a little beyond strict SGML and relying on SP's particular error recovery to handle definition lists; however, the point is not that SGML could describe all Wikis but that it gets pretty close.

We're using SX, an SGML-to-XML converter. Part of the SP package, it is available as open source C++ code at the OpenJade Project or directly from James Clark's site (http://jclark.com/sp/), which includes premade binaries for Windows. Linux users may find their system already comes with SP: try the command man sx or man osx to check.

What this does not implement is that Wikis should allow & and < anywhere. In this Wiki dialect, use &amp; and &lt; to get them, or use a numeric character reference. (SGML does allow these delimiters to be remapped, but this confuses SX; in any case, having character references available is a net win.)
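
If the Wiki text comes from users who cannot be expected to escape these characters themselves, one possible workaround is to escape them before the text reaches sx. The Python sketch below uses numeric character references so that it does not depend on any entity declarations. It assumes the input contains no references the author typed on purpose, since those would be escaped too; the function name escape_for_sgml is hypothetical.

```python
def escape_for_sgml(wiki_text):
    """Replace & and < with numeric character references.

    The ampersand is handled first; otherwise the & in a freshly
    inserted &#60; would itself be escaped.
    """
    return wiki_text.replace("&", "&#38;").replace("<", "&#60;")

print(escape_for_sgml("if a < b && c"))
```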

Is SGML worth it? It all depends on your skills and preferences. If you were doing this in Java, you would need to alter the JavaCC grammar (if you used that), adjust the mapping functions to create the XML, and then adjust the XML's DTD when validating, which isn't necessarily less work at all. The SGML approach also has the benefit that DTDs can be written and maintained by technical people who are not programmers.

SGML's weak spot here is definitely the need to predeclare the short reference delimiters in the SGML declarations. Without that we could have an all-DTD solution, which would be easier and more fun.

—Rick Jelliffe