From Wiki to XML, through SGML

March 3, 2004

Wikis are nice for typing. XML is nice for processing. SGML is a standard compiler compiler language for specifying conversions from one to the other.

Something Wiki this Way comes

Wikis exploded onto the scene in the late 90s but have been quieter more recently. A wiki is a site written using a very simplified syntax without tags: a blank line means "start a new paragraph", and so on.

The nice thing about Wiki formats is that they reduce the number of keystrokes needed to mark up a document (to the level of very simple HTML) to the same number that a nice, swanky, custom application needs. Today I am writing this blog in Mozilla Composer: to put in a link I need to type ^L to open the link editor, type the text, then Tab, then the URL, then Enter: three characters of overhead. In a Wiki, I type [ then the text, > then the URL, and then ]. Pretty much identical from a Fitts's Law perspective, but with the advantage that one doesn't need to learn special keystrokes for different applications or customize any editor to cope with them.

WikiWikiWeb provides a simple rich text format based on visual cues rather than HTML tags. This is convenient for hand editing and has the virtue that the raw text gives a rough indication of the formatted text. Wiki-like syntaxes are useful, for example, for creating documents in mail programs or PDAs, where terseness is important and there is no WF-checking or validation available.

SGML: A Language for describing Wikis

Now that Wiki and XML are both fairly established in people's minds, they give us nice end-points for suggesting where SGML fits in: SGML is a language for defining the syntaxes of Wiki-like languages and parsing them into HTML-like documents with missing tags, then filling in the gaps to reveal XML-like structured documents.

These syntaxes may range from utterly Wiki-like (no angle-brackets) to utterly XML-like (only angle-brackets). (You could say that SGML is a standard for creating compiler compilers for a class of languages which are not describable using the COTS compiler compilers, such as yacc/lex.)

The idea is as old as the hills: so old that I expect it will be patented soon. SGML provided many facilities to support this kind of terse markup. XML removed support in the name of ease of implementation, but the puritan mantra of "all structure should be explicitly tagged with elements" is not always the best answer.

This article shows how to describe Wiki content using SGML. We use this technique to convert a Wiki page into XML using the open source SP software from James Clark. Using this technique, you can provide your users with low-keystroke ways to send structured data with less opportunity for syntax errors and no reliance on the editing system at their end. The data does not need to be rich text; it could even be records of data.

A Document Type for Wiki

The first step is to describe Wiki as a document type, using HTML names where possible:

<!ENTITY % blocks 	" p | head | bull1 | bull2 | bull3  | li1 | li2 | li3 |  bq  " >

<!ENTITY % inlines 	" b | i | bi | tt " >



<!ELEMENT page 		o o ( keyword |%blocks; | pre | dl | hr )* >

<!ELEMENT ( dt | dd | pre | %blocks;  )

			o o  ( #PCDATA | %inlines; | link  )* >

<!ELEMENT ( text | ref | %inlines;   )

			o o ( #PCDATA )* >



<!ELEMENT dl 		o o ( dt, dd? )+>

<!ELEMENT hr 		o o EMPTY >

<!ELEMENT link 		o o (text, ref? ) >

<!ELEMENT keyword	o o (#PCDATA) >

If you are used to XML DTDs, you can see this is a little terser: the same element declaration can declare multiple element types, and you can use a parameter entity reference. The various - and o symbols, also unavailable in XML, describe whether the start and end-tags can be implied by the parser: o means omissible.

There are some differences from HTML. The top-level element is <page>. List items are not nested, but indented according to their number: <bulln> means a bullet list item at indent level n, and <lin> means a list item with no bullet or number at indent level n. There is an element link instead of the html:a or html:img elements.

Which Wiki?

There are many Wiki dialects: we use a fairly generic one, between the readily-parseable WikiWorks dialect and the original wiki, tweaked to be suitable for blogs. The fact that there are already many mutually incompatible dialects of Wiki should be no surprise: which shortcuts are optimal for you depends on the kinds of text you have to deal with.

Let's first describe the various rules in rough English; then we can figure out the declarations for them:

start-of-line followed by ---- means <hr>
a blank line starts a new paragraph except in a <pre>
a start of line followed by a SPACE is a <pre>
in a pre, a blank line or a line starting with a space continues the pre
start-of-line followed by * means a <bull1>
start-of-line followed by ** means a <bull2>
start-of-line followed by *** means a <bull3>
start-of-line followed by # means a <li1>: indented list, no bullet or number
start-of-line followed by ## means a <li2>: indented list, no bullet or number
start-of-line followed by ### means a <li3>: indented list, no bullet or number
start of line followed by a TAB means a <dl><dt>
in a dt element, a : starts <dd>
start of line followed by a TAB, SPACE then : then TAB starts a <bq>
start-of-line followed by # means a numbered list <li1>
'' starts and ends an i element (or end-of-line)
''' starts and ends a b element (or end-of-line)
'''' become a single quote
''''' starts and ends bold italic text (or end-of-line)
** starts and ends a tt section (not in <pre>)
[ and ] delimit a link
in a link, a > means the text to its left is a title and the text to its right is the reference; otherwise the text is the reference
[[ to escape [ in normal text
start of line followed by ! is a heading
start of line followed by = is a keyword (metadata)
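
As a rough illustration of the link rule in the list above (and not part of the SGML solution itself), here is a minimal Python sketch. The function name parse_link is hypothetical; it returns a (text, ref) pair that mirrors the (text, ref?) content model of the link element, so a link with no > has no ref.

```python
from typing import Optional, Tuple


def parse_link(body: str) -> Tuple[str, Optional[str]]:
    """Split the text found between [ and ] into (text, ref).

    If the body contains '>', the part to its left is the title and
    the part to its right is the reference. Otherwise the whole body
    is the only content and ref is None.
    """
    text, sep, ref = body.partition(">")
    if not sep:
        return body, None
    return text, ref


print(parse_link("Schematron>http://www.ascc.net/xml/schematron"))
print(parse_link("http://www.topologi.com"))
```

Only the first > is treated as the separator here, which is an assumption of the sketch rather than something the rules spell out.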

Some Wiki dialects also allow tables and automatic link recognition. The first is left as an exercise to the reader; the second is more than SGML can handle. Once you have described your Wiki dialect in terms of SGML, you can easily add more shortcuts as needed.

Wiki as SGML

Now that we have a description of our Wiki dialect, and of the abstract grammar of a Wiki page, we need a way of parsing one in terms of the other.

SGML's short references provide just that: we can tell the parser "when you see this delimiter string, substitute this tag". SGML provides four features to make this feasible:

you can recognize different delimiter strings in different contexts (using different delimiter maps),
an SGML parser is pretty smart about filling in the gaps for missing tags (tag implication),
SGML (like any decent text processing system) allows us to distinguish between end-of-line and start-of-line positions (let's call them RE and RS, for Record End and Record Start), and
the parser adopts a longest-match-first approach to delimiter matching.
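
To make two of these features concrete, here is a heavily simplified Python sketch of context-specific delimiter maps combined with longest-match-first recognition. It is only a toy model under stated assumptions: the names MAPS, match_delimiter and substitute are hypothetical, the maps hold just a handful of inline delimiters, and a real SGML parser also switches maps as elements open and close and implies missing tags, which this sketch does not attempt.

```python
from typing import Dict, List

# Two tiny delimiter maps, loosely modelled on a page-level map and a
# link-level map. Each maps a delimiter string to its replacement.
MAPS: Dict[str, Dict[str, str]] = {
    "page": {"''": "<i>", "'''": "<b>", "''''": "'", "[": "<link>", "[[": "["},
    "link": {"]": "</link>", ">": "<ref>"},
}


def match_delimiter(text: str, pos: int, context: str) -> str:
    """Return the longest delimiter of the context's map found at pos, or ''."""
    candidates = [d for d in MAPS[context] if text.startswith(d, pos)]
    return max(candidates, key=len, default="")


def substitute(text: str, context: str) -> str:
    """Replace each recognized delimiter by the string its map assigns to it."""
    out: List[str] = []
    pos = 0
    while pos < len(text):
        delim = match_delimiter(text, pos, context)
        if delim:
            out.append(MAPS[context][delim])
            pos += len(delim)
        else:
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(substitute("mightn''''t", "page"))
print(substitute("a > b]", "link"))
```

Because the longest candidate wins, four quotes are never mistaken for two italic delimiters, and a > is only special while the link map is the active one.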

First we define entities for all the start-tags and some end-tags. We will use these later. We use a special form of entity declaration, with the STARTTAG and ENDTAG keywords, which doesn't need to be reparsed.

<!ENTITY bull1-s STARTTAG "bull1" > 

<!ENTITY bull2-s STARTTAG "bull2" > 

<!ENTITY bull3-s STARTTAG "bull3" > 

<!ENTITY li1-s STARTTAG "li1" > 

<!ENTITY li2-s STARTTAG "li2" > 

<!ENTITY li3-s STARTTAG "li3" >  



<!ENTITY p-s STARTTAG "p" > 

<!ENTITY pre-s STARTTAG "pre" > 

<!ENTITY bq-s STARTTAG "bq">

<!ENTITY hr-s STARTTAG "hr" >

<!ENTITY head-s STARTTAG "head" >

<!ENTITY key-s STARTTAG "keyword">

<!ENTITY key-e ENDTAG "keyword" >



<!ENTITY b-s STARTTAG "b"> 

<!ENTITY bi-s STARTTAG "bi"> 

<!ENTITY i-s STARTTAG "i"> 

<!ENTITY b-e ENDTAG "b"> 

<!ENTITY bi-e ENDTAG "bi"> 

<!ENTITY i-e ENDTAG "i"> 

<!ENTITY tt-s STARTTAG "tt">

<!ENTITY tt-e ENDTAG "tt">



<!ENTITY ref-s STARTTAG "ref"> 

<!ENTITY link-s STARTTAG "link"> 

<!ENTITY link-e ENDTAG "link"> 



<!ENTITY dt-s "</><dt>" >

<!ENTITY dd-s STARTTAG "dd" >

And also a few funnies:

<!ENTITY fourQuot CDATA "'" >

<!ENTITY lsb CDATA "[">

Next we define maps (sets) of these:

<!SHORTREF imap

	"''" i-e 

	"&#RE" i-e >

<!SHORTREF bmap

	"'''" b-e 

	"&#RE" b-e >

<!SHORTREF bimap

	"'''''" bi-e 

	"&#RE" bi-e >

<!SHORTREF linkmap

	"]" link-e

	">" ref-s >

<!SHORTREF dtmap

	":" dd-s >

<!SHORTREF ttmap

	"**" tt-e 

	"&#RE;" tt-e >

<!SHORTREF keymap

	"&#RE;" key-e >



<!SHORTREF pagemap

	"&#RS;----" hr-s



	"&#RS;*" bull1-s

	"&#RS;**" bull2-s

	"&#RS;***" bull3-s

	"&#RS;#" li1-s

	"&#RS;##" li2-s

	"&#RS;###" li3-s



	"&#RS;&#RE;&#RS;*" bull1-s

	"&#RS;&#RE;&#RS;**" bull2-s

	"&#RS;&#RE;&#RS;***" bull3-s

	"&#RS;&#RE;&#RS;#" li1-s

	"&#RS;&#RE;&#RS;##" li2-s

	"&#RS;&#RE;&#RS;###" li3-s



	"&#RS;&#RE;&#RS;" p-s

	"&#RS;!" head-s

	"&#RS;=" key-s

	"&#RS;&#TAB;" dt-s

	"&#RS;&#TAB;&#SPACE;:" bq-s

	"&#RS;&#SPACE;" pre-s



	"&#RS;&#RE;&#RS;&#RE;&#RS;" p-s

	"&#RS;&#RE;&#RS;!" head-s

	"&#RS;&#RE;&#RS;&#TAB;" dt-s

	"&#RS;&#RE;&#RS;&#TAB;&#SPACE;:" bq-s

	"&#RS;&#RE;&#RS;&#SPACE;" pre-s



	"''" i-s

	"'''" b-s

	"''''" fourQuot

	"'''''" bi-s

	"**" tt-s



	"[" link-s

	"[[" lsb

>



<!-- the difference from pagemap is that a blank line does not start a para, and there are no keyword or tt shortcuts -->



<!SHORTREF premap

	"&#RS;----" hr-s



	"&#RS;*" bull1-s

	"&#RS;**" bull2-s

	"&#RS;***" bull3-s

	"&#RS;#" li1-s

	"&#RS;##" li2-s

	"&#RS;###" li3-s



	"&#RS;&#RE;&#RS;*" bull1-s

	"&#RS;&#RE;&#RS;**" bull2-s

	"&#RS;&#RE;&#RS;***" bull3-s

	"&#RS;&#RE;&#RS;#" li1-s

	"&#RS;&#RE;&#RS;##" li2-s

	"&#RS;&#RE;&#RS;###" li3-s



	"&#RS;!" head-s

	"&#RS;&#TAB;" dt-s

	"&#RS;&#TAB;&#SPACE;:" bq-s

	"&#RS;&#SPACE;" pre-s



	"&#RS;&#RE;&#RS;!" head-s

	"&#RS;&#RE;&#RS;&#TAB;" dt-s

	"&#RS;&#RE;&#RS;&#TAB;&#SPACE;:" bq-s

	"&#RS;&#RE;&#RS;&#SPACE;" pre-s



	"''" i-s

	"'''" b-s

	"''''" fourQuot

	"'''''" bi-s



	"[" link-s

	"[[" lsb

>

There are some extra declarations to handle the common case of someone typing two blank lines: this will reduce the number of spurious elements with no content.

And, finally, we define when each map is active (which delimiters get recognized in which elements):

<!USEMAP imap i>

<!USEMAP bmap b>

<!USEMAP bimap bi>

<!USEMAP ttmap tt>

<!USEMAP linkmap ( link | ref | text ) >

<!USEMAP dtmap dt >

<!USEMAP keymap keyword >



<!USEMAP pagemap ( page | %blocks; | dd ) >

<!USEMAP premap pre >

Example

How does it work? The following is a Wiki document:

!An Example Document
=Wiki
=SGML
=XML

This is an 
example document.

*It has some
kind of list
**with some kinds of nested list
* and also
#some
##type of
###indentation

But that is '''not''' ''all''!
You can link by URL alone
[http://www.topologi.com], by name plus **URL**,
[Schematron>http://www.ascc.net/xml/schematron]
or by an existing name only 
[Schematron] (in the last case, the [[system] must fill
in the gap from a linkbase, so it mightn''''t work
the first time a document is link-indexed.)
----
 And here we have some preformatted text
which should be '''OK'''

And still ''should'' be preformatted.
----
!Now Another Head
	A term: a definition
	Another term: another definition
with wrapped text
	 : This is supposed to be a block quote now
but...

I am not sure how useful it is.

And here is another paragraph.

And here it is in XML, after the text has been parsed as SGML and re-emitted as XML.

<?xml version="1.0"?>
<page><head>An Example Document
</head><keyword>Wiki</keyword><keyword>SGML</keyword><keyword>XML</keyword><p>This is an 
example document.
</p><bull1>It has some
kind of list
</bull1><bull2>with some kinds of nested list
</bull2><bull1> and also
</bull1><li1>some
</li1><li2>type of
</li2><li3>indentation
</li3><p>But that is <b>not</b> <i>all</i>!
You can link by URL alone
<link><text>http://www.topologi.com</text></link>, by name plus <tt>URL</tt>,
<link><text>Schematron</text><ref>http://www.ascc.net/xml/schematron</ref></link>
or by an existing name only 
<link><text>Schematron</text></link> (in the last case, the [system] must fill
in the gap from a linkbase, so it mightn't work
the first time a document is link-indexed.)
</p><hr/><pre>And here we have some preformatted text
which should be <b>OK</b>
And still <i>should</i> be preformatted.
</pre><hr/><head>Now Another Head
</head><dl><dt>A term</dt><dd> a definition
</dd><dt>Another term</dt><dd> another definition
with wrapped text
</dd></dl><bq> This is supposed to be a block quote now
but...
</bq><p>I am not sure how useful it is.
</p><p>And here is another paragraph.</p></page>

The conversion was run with this command:

sx -wno-all -xno-nl-in-tag -xlower -xempty wiki.sgm eg.txt > eg.xml

When we parse the Wiki page, we will need to prepend the SGML declaration as well as the appropriate doctype declaration, which should say we are starting with a page element. Actually, we are going a little beyond strict SGML and relying on SP's particular error recovery to handle definition lists; but the point is not that SGML could describe all Wikis but that it goes pretty close.

We're using SX, an SGML-to-XML converter; part of the SP package, it is available as open source C++ code at the OpenJade Project or directly from James Clark's site, which includes pre-made binaries for Windows. Linux users may find their system already comes with SP: try the command man sx or man osx to check.
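
If you want to drive the conversion from a script, the following Python sketch wraps the same command line. The function names sx_command and convert are hypothetical, the options are copied unchanged from the command shown in this article, and the program name sx is an assumption about your installation (on systems where the tool is installed as osx, change the first list item).

```python
import subprocess
from typing import List


def sx_command(sgml_decl: str, document: str) -> List[str]:
    """Build the sx argument list, using the options from the article."""
    return ["sx", "-wno-all", "-xno-nl-in-tag", "-xlower", "-xempty",
            sgml_decl, document]


def convert(sgml_decl: str, document: str, output: str) -> None:
    """Run sx and write its standard output (the XML) to the output file.

    check=False is deliberate in this sketch: it does not treat a
    non-zero exit status as fatal, since the approach leans on the
    parser's error recovery.
    """
    with open(output, "w", encoding="utf-8") as out:
        subprocess.run(sx_command(sgml_decl, document), stdout=out, check=False)
```

Calling convert("wiki.sgm", "eg.txt", "eg.xml") would then correspond to the shell command, with the redirection handled by the stdout argument.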

What this does not implement is that Wikis should allow & and < anywhere. In this Wiki dialect, use &amp; and &lt; to get them, or a numeric character reference. (SGML does allow these delimiters to be remapped, but this confuses SX; in any case, having character references available is a net win.)
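
Where the text comes from people who cannot be expected to escape these two characters themselves, one option is a small preprocessing step before the parse. The Python sketch below is an assumption of mine rather than part of the setup described here: the function name escape_markup is hypothetical, and it uses numeric character references so that it does not depend on any entity declarations.

```python
def escape_markup(text: str) -> str:
    """Replace & and < with numeric character references.

    The ampersand must be replaced first, otherwise the '&' introduced
    by '&#60;' would itself be escaped.
    """
    return text.replace("&", "&#38;").replace("<", "&#60;")


print(escape_markup("a < b & c"))
```

The trade-off is that any reference an author typed on purpose gets escaped too, so this only suits input where references are never written by hand.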

Is SGML worth it?

It all depends on your skills and preferences. If you were doing this in Java, you would need to alter the JavaCC grammar (if you used that), adjust the mapping functions to create the XML, and adjust the XML's DTD when validating, which isn't necessarily less work at all. The SGML approach also has the benefit that DTDs can be written and maintained by technical people who are not programmers.

SGML's weak spot here is definitely the need to pre-declare the short reference delimiters in the SGML declaration. Without that we could have an all-DTD solution, which would be easier and more fun.

Download

Download the SGML declaration for Wiki markup, wiki.sgm.