Converting an SGML DTD to XML

July 8, 1998

by Norman Walsh

XML.com is pleased to welcome Norman Walsh as a regular columnist on the site. Norm is a Senior Application Analyst at ArborText, Inc., a developer of industrial-strength SGML authoring and publishing tools, though we need to say that the words here are his own, not those of his employer.

Norm brings a wealth of expertise in SGML, XML, text processing and digital publishing. His column, XML Q&A, will cover a variety of topics, dictated by you, the viewer. Please send your questions and suggestions for things you'd like to see covered to xmlqna@xml.com.

Q: How do I convert my SGML DTDs to XML? Should I do this?

A: XML is hot! That may be all the motivation you need to consider converting some or all of your existing SGML documents into XML. Other motivations include impending browser support for XML and new XML-aware tools that you'd like to be able to use. But if you want to maintain the control over your documents that an SGML DTD provides, you're going to have to convert not only your documents from SGML to XML, but your DTDs as well. This article provides an overview of some of the common problems encountered in converting an SGML DTD to XML, and provides suggestions about how best to work around these problems.

Making an SGML DTD XML compliant involves a number of changes. Many of these changes are straightforward and, for a large number of SGML DTDs, they will be fairly easy to accomplish. The issues to keep in mind are:

Case sensitivity. You have to make the case of all declarations match.
Minimization changes. XML doesn't allow most forms of SGML minimization so their specification has to be removed from the DTD.
Markup simplification. XML parsers are easier to write than SGML parsers because XML is simpler. A number of “typing shortcuts” have been removed. These changes (removal of name groups in declarations, moving comments outside of declarations, etc.) don't have any bearing on the semantics of the DTD.
Declared attribute types. XML allows far fewer declared attribute types.
Mixed content model changes. There are additional restrictions on mixed content models in XML.
Public identifiers. They're not well supported in XML; you must supply a system identifier for all external declarations.
Processing instruction syntax. It's slightly different.

There are also some changes that may have a large impact on the semantics of the DTD. Luckily, these are mostly infrequently-used SGML features so they don't turn up often in most DTDs. In XML DTDs, the following are not allowed:

Inclusions and Exclusions
The “&” connector in content models
CDATA and RCDATA content models
#CONREF and #SUBDOC attributes
SDATA entities

These are the most common issues. A complete list of all the differences between SGML and XML is available in a W3C note by James Clark.

Figure 1: An Example SGML DTD

The DTD below demonstrates a number of problems for XML conversion.

<!Element Doc - - (Title, (para|listing)+) +(IndexTerm)>
<!ELEMENT Para - - (Emphasis|#PCDATA|Cite|XRef)* +(Footnote)
    -- footnotes anywhere in para -->
<!ELEMENT (Emphasis|Cite) - - (#PCDATA)>
<!ATTLIST Cite
    Type (Book|Article|Other) Book
>
<!ELEMENT Footnote - O (#PCDATA|Para+) -(Footnote)
    -- no footnotes in footnotes -->
<!ELEMENT Title O - (#PCDATA|Emphasis)*>
<!ELEMENT Listing - - CDATA>
<!ATTLIST Listing
    ID ID #IMPLIED
    ColWidth NUMBER 80
>
<!ELEMENT IndexTerm - - (Prim, Sec)>
<!ATTLIST IndexTerm
    ID ID #IMPLIED
    Type (RangeStart|RangeEnd|Singular) "Singular"
    StartRef IDREF #CONREF -- points to RangeStart --
>
<!ELEMENT Prim - O (#PCDATA)>
<!ELEMENT Sec - O (#PCDATA)>
<!ELEMENT XRef - O EMPTY>
<!ATTLIST XRef
    LinkEnd IDREF #REQUIRED
>
<!ENTITY ldquo SDATA "[ldquo ]">
<!ENTITY rdquo SDATA "[rdquo ]">

As we move through the following sections, we'll look at the possible ways to convert this DTD to XML compliance. Note, however, that the fragments shown in each sample are not necessarily valid XML. In particular, only one change is shown in each section, even when the example contains more than one problem. The completed, XML-compliant DTD is shown in Figure 2.

Case Sensitivity

Case sensitivity in SGML is controlled by the SGML Declaration. (XML documents do not carry an SGML Declaration of their own, although there is a fixed SGML declaration for XML that makes SGML systems obey many of the conventions of XML.) If your existing SGML Declaration enforces case sensitivity (most do not), then this is a non-issue. If it doesn't, then you may have used mixed case in your DTD.

In the example above, “para” appears in the content model of Doc, but the element is declared as “Para”. In XML, the case of all of your element, attribute and entity names must match. Also, all of the declaration keywords must be upper-case.

Converting all of the names to lower-case is one common solution to this problem, although any consistent case will be equally valid. Remember that your authors will have to use the same case in their documents.

<!ELEMENT doc - - (title, (para|listing)+) +(indexterm)>

Minimization Changes

In most SGML DTDs, element declarations include tag omission characters (the “-”s and “O”s that follow the element names). These indicate whether the start- and end-tags can be omitted (O) or are required (-). In XML, both start- and end-tags are always required, so these characters must be removed from the DTD.

<!ELEMENT doc (title, (para|listing)+) +(indexterm)>

(It is also possible to replace the tag omission characters with a parameter entity. You might want to do this if you're trying to parameterize your DTD so that you can share declarations between your SGML DTD and your XML DTD.)
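
As a rough sketch of that parameter entity approach: the entity names `min.rr` and `min.ro` below are invented for illustration and do not come from the article's files. The SGML DTD declares them with the tag omission characters, the XML DTD declares them as empty strings, and the shared declarations reference them.

```dtd
<!-- In the SGML DTD: the entities expand to tag omission characters. -->
<!ENTITY % min.rr "- -">   <!-- start- and end-tag both required -->
<!ENTITY % min.ro "- O">   <!-- end-tag may be omitted -->

<!-- In the XML DTD: the same (hypothetical) entities expand to nothing. -->
<!ENTITY % min.rr "">
<!ENTITY % min.ro "">

<!-- Shared declarations, read after one of the two sets above. -->
<!ELEMENT doc  %min.rr; (title, (para|listing|indexterm)+)>
<!ELEMENT prim %min.ro; (#PCDATA)>
```

Keep in mind that XML only permits parameter entity references inside markup declarations in the external DTD subset, so shared declarations like these cannot live in a document's internal subset.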

Markup Simplification

In XML:

you cannot put comments inside markup declarations,
you cannot use name groups on element or attribute declarations, and
you must put quotes around default attribute values.

<!-- comments can be preserved outside of the declarations -->
<!ELEMENT para (emphasis|#PCDATA|cite|xref)* +(footnote)>
<!ELEMENT emphasis (#PCDATA)>
<!ELEMENT cite (#PCDATA)>
<!ATTLIST cite
    type (book|article|other) "book"
>

Declared Attribute Types

SGML has a number of declared attribute types that are not present in XML (NUMBER, NUTOKEN, etc.). The data typing in SGML was never robust enough to be really useful, so these aren't likely to be missed very much. Change them either to NMTOKEN, which restricts the attribute values to name characters, or to CDATA. A future schema specification (the future of DTDs) will almost certainly include stronger data-typing.

<!ELEMENT listing CDATA>
<!ATTLIST listing
    id ID #IMPLIED
    colwidth CDATA "80"
>

Mixed Content Models

Mixed content refers to content models that include #PCDATA. (Content models that contain only other elements are said to have “element content”.) To aid in parsing and to simplify SGML, XML enforces some additional constraints on mixed content models. Specifically, all mixed content models must have #PCDATA as the literal first token and must be optional, repeatable “or” groups (in other words, they must use only the “|” separator, no commas).

In most cases, this is a simple matter of reordering the elements:

 <!ELEMENT para (#PCDATA|emphasis|cite|xref)* +(footnote)>

If your DTD uses parameter entities to hold fragments of content models, the #PCDATA element must be factored out and added to the front of each complete content model or it must be the first element of the first parameter entity used in each mixed content model.
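
The two arrangements described above might look like the following sketch. The parameter entity names `inline` and `text` are hypothetical; the example DTD in Figure 1 does not use parameter entities.

```dtd
<!-- Option 1: keep #PCDATA out of the entity and write it first in each model. -->
<!ENTITY % inline "emphasis|cite|xref">
<!ELEMENT para (#PCDATA|%inline;)*>

<!-- Option 2: make #PCDATA the first token of the first entity used. -->
<!ENTITY % text "#PCDATA|emphasis">
<!ELEMENT title (%text;)*>
```

Option 1 is usually easier to maintain, because an entity that begins with #PCDATA can only ever be used at the front of a mixed content model.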

For some content models, the requirements on mixed content introduce semantic changes in the DTD. Consider the content model of footnote in Figure 1:

<!ELEMENT footnote (#PCDATA|para+)>

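This content model says that a footnote contains either #PCDATA or one or more para elements, but not both. Both of these are legal:

<footnote>some text</footnote>

<footnote><para>a paragraph</para>
<para>another paragraph</para></footnote>

But this is not:

<footnote>some text<para>some more text</para></footnote>

This is not a legal mixed content model in XML, so the either/or constraint cannot be carried over exactly as it is. You have several options: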

You can change the content model to a repeatable mixture of #PCDATA and paras, which will make the last example above legal.
You can remove #PCDATA from the content model.

You can add a new “wrapper” element inside the footnote:

<!ELEMENT footnote (shortfn|longfn)>
<!ELEMENT shortfn (#PCDATA)>
<!ELEMENT longfn (para+)>

Or you can split the footnote element into two elements: shortfootnote and longfootnote (which is effectively moving the wrapper up a level).

None of these solutions is perfect, and the one you choose will depend on the context and on your control over authors and tools. In this case, I'm going to broaden the content model of footnote and rely on my authors or my tools to catch the mixed content case that I consider invalid (unwrapped #PCDATA and paragraphs in the same footnote):

<!ELEMENT footnote (#PCDATA|para)*>

Public Identifiers

If your DTD uses public identifiers to locate external entities, you'll have to add system identifiers to all of the declarations that don't already include them. XML processors may, but are not required to, attempt to generate a URI from the public identifier (using a catalog, for example).
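
For example, a declaration that pulls in one of the ISO character entity sets by public identifier alone needs a system identifier added. This is only a sketch: the file name `iso-num.ent` is an assumption, and it would have to point at an XML-compatible version of the entity set.

```dtd
<!-- SGML: a public identifier alone is enough. -->
<!ENTITY % ISOnum PUBLIC
    "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN">

<!-- XML: a system identifier must follow the public identifier.
     The file name here is hypothetical. -->
<!ENTITY % ISOnum PUBLIC
    "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN"
    "iso-num.ent">
```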

Processing Instruction Syntax

Processing instructions are not frequently encountered in a DTD, but they can be used. In XML, processing instructions must begin with a single keyword (called the “PITarget” in the XML specification) and they end with “?>”, rather than simply “>”:

<?targetkeyword processing instruction data?>

Inclusions and Exclusions

If your SGML DTD uses inclusions or exclusions, removing them may be the hardest part of converting your DTD to XML. In brief:

Inclusions allow an element to occur anywhere inside another element (recursively). For example, the inclusion of footnote in para means that footnote can occur not only anywhere inside para, but also inside any element that occurs in para (e.g., emphasis inside para). This means that footnote can occur inside emphasis in para even though it cannot occur inside emphasis in title.
Exclusions have the opposite effect: they prevent an element from occurring anywhere inside another element (recursively), even if the element being excluded occurs in the content model of an interior element (either by inclusion or because it's listed in the content model proper). By excluding footnote from itself, a paragraph inside a footnote is prevented from containing another footnote even though it's legal for paras in other contexts to contain footnotes.

Removing inclusions and exclusions greatly simplifies the job of the XML parser, but it makes converting an SGML DTD to XML tricky. It may be very difficult (even impossible) to make an XML DTD that is structurally identical to an SGML DTD that uses inclusions or exclusions.

The only way forward is to factor the included elements into the content models where they need to be allowed, bearing in mind that this may make it possible to put them in contexts where they were formerly forbidden. For example, if we decide that it's necessary to allow footnote inside emphasis in a paragraph, we will have to add footnote to the content model of emphasis. This will have the side-effect that it will become possible to put footnotes in emphasis elements in titles. C'est la vie.

Simply removing the exclusions (which is basically the only thing you can do) will make the content models of some elements broader than they used to be. You'll have to rely on your tools and/or your authors to restrain themselves.

For this DTD, I'll factor indexterm into the body of the document and footnote into para:

<!ELEMENT doc (title, (para|listing|indexterm)+)>
<!ELEMENT para (#PCDATA|emphasis|cite|xref|footnote|indexterm)*>

SGML Exceptions and XML

The “&” Connector

The “&” connector is not allowed in XML. The meaning of a content model using the “&” connector is that all of the elements connected by “&” must occur exactly once but in any order.

You have two choices: pick a fixed order or allow them to be optional and repeated. Which you choose depends on the particular case, of course, but picking a fixed order probably provides the closest semantic match.
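
The example DTD in Figure 1 does not use the “&” connector, so the sketch below relies on made-up element names (`name`, `first`, `last`) to show what the two choices look like.

```dtd
<!-- SGML: first and last must each occur once, in either order. -->
<!ELEMENT name - - (first & last)>

<!-- XML, choice 1: pick a fixed order. -->
<!ELEMENT name (first, last)>

<!-- XML, choice 2: optional and repeatable, in any order and any number. -->
<!ELEMENT name (first|last)*>
```

The first choice is stricter than the SGML original, since it rejects documents that used the other order. The second is looser, since it also accepts missing or repeated elements.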

`CDATA` and `RCDATA` Content Models

Using CDATA and RCDATA declared content is not allowed in XML. Instead, you'll have to change the content models to be #PCDATA. From the perspective of the DTD, this is a minor change, but it may have some effect on your documents. In CDATA and RCDATA declared content, normal parsing rules are suspended so that “<” and “&” (in CDATA) may appear literally without being escaped. You can't do that in XML so you'll have to add a CDATA section around the element content.

<!ELEMENT listing (#PCDATA)>

`#CONREF` and `#SUBDOC` Attributes

Using #CONREF provides element content by ID reference and #SUBDOC allows you to include other document types inline. Neither is allowed in XML.

Most #CONREF attributes can be made simply #IMPLIED, although the content model of the element on which they occur may have to be loosened so that it can be empty.
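
For the indexterm element in the example DTD, the change looks roughly like this. Only the startref attribute is shown; the other attributes are left out for brevity.

```dtd
<!-- SGML (Figure 1): StartRef is #CONREF, so an element that
     specifies it has no content. -->
<!ELEMENT IndexTerm - - (Prim, Sec)>
<!ATTLIST IndexTerm
    StartRef IDREF #CONREF>

<!-- XML (Figure 2): startref is #IMPLIED and the content is optional. -->
<!ELEMENT indexterm (prim?, sec?)>
<!ATTLIST indexterm
    startref IDREF #IMPLIED>
```

The loosened content model also accepts an indexterm with neither content nor a startref, which the SGML DTD would have rejected.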

You'll just have to live without #SUBDOC. Note, however, that if a downstream processor (like a web browser) is only examining the XML that it receives for well formedness, you can freely mix document types in your published XML. But this may require namespaces and fairly sophisticated stylesheets to get appropriate presentation.

`SDATA` Entities

Most SDATA entities are used for character references. For XML, the appropriate thing to do is replace them with Unicode character references:

<!ENTITY ldquo "&#x201C;">
<!ENTITY rdquo "&#x201D;">

Figure 2: An Example XML DTD

The converted XML DTD is shown below.

<!ELEMENT doc (title, (para|listing|indexterm)+)>
<!ELEMENT para (#PCDATA|emphasis|cite|xref|footnote|indexterm)*>
<!ELEMENT emphasis (#PCDATA|footnote)*>
<!ELEMENT cite (#PCDATA)>
<!ATTLIST cite
    type (book|article|other) "book"
>
<!ELEMENT footnote (#PCDATA|para)*>
<!ELEMENT title (#PCDATA|emphasis)*>
<!ELEMENT listing (#PCDATA)>
<!ATTLIST listing
    id ID #IMPLIED
    colwidth CDATA "80"
>
<!-- startref points to rangestart -->
<!ELEMENT indexterm (prim?, sec?)>
<!ATTLIST indexterm
    id ID #IMPLIED
    type (rangestart|rangeend|singular) "singular"
    startref IDREF #IMPLIED
>
<!ELEMENT prim (#PCDATA)>
<!ELEMENT sec (#PCDATA)>
<!ELEMENT xref EMPTY>
<!ATTLIST xref
    linkend IDREF #REQUIRED
>
<!ENTITY ldquo "&#x201C;">
<!ENTITY rdquo "&#x201D;">

What About Your Documents?

Depending on your DTD, it may be possible to make a completely isomorphic XML DTD from your SGML DTD. In that case, you probably wouldn't have to change your documents very much except where necessary to make them valid XML (changing empty tags, adding system identifiers, etc.). A Perl script or a tool like sgmlnorm from the SP distribution can automate many of these changes.

But it's more likely that your XML DTD is a little bit different from your SGML DTD. These differences may have an impact on your documents. Figure 3 contains an example SGML document that is valid according to the example SGML DTD in Figure 1.

Figure 3. An SGML Document

<!doctype doc system "docsgml.dtd">
<doc><title>Test Document<indexterm>
<prim>Document</prim><sec>Test</sec></indexterm>
</title>
<para>
This is a test paragraph.<footnote>This is a
footnote.</footnote>
<indexterm id=iterm type=rangestart>
<prim>index</prim><sec>term</sec></indexterm>
</para>
<para>
The title of <emphasis>this</emphasis> article is
&ldquo;<cite type=article>Test Document</cite>&rdquo;.
</para>
<para>Here's a program listing:</para>
<listing id=l1>
int main(int argc, char **argv) {
  if (argc < 1) {
    ...
  }
  int *i = &argc;
  ...
</listing>
<indexterm startref=iterm type=rangeend>
<para>
The program in <xref linkend=l1> is meaningless.
</para>
</doc>

An equivalent document that is valid according to the XML DTD in Figure 2 is shown in Figure 4.

Figure 4. An XML Document

<?xml version='1.0'?>
<!DOCTYPE doc SYSTEM "docxml.dtd">
<doc><title>Test Document</title>
<indexterm><prim>Document</prim><sec>Test</sec></indexterm>
<para>
This is a test paragraph.<footnote>This is a
footnote.</footnote>
<indexterm id="iterm" type="rangestart">
<prim>index</prim><sec>term</sec></indexterm>
</para>
<para>
The title of <emphasis>this</emphasis> article is
&ldquo;<cite type="article">Test Document</cite>&rdquo;.
</para>
<para>Here's a program listing:</para>
<listing id="l1"><![CDATA[
int main(int argc, char **argv) {
  if (argc < 1) {
    ...
  }
  int *i = &argc;
  ...
]]></listing>
<indexterm startref="iterm" type="rangeend"/>
<para>
The program in <xref linkend="l1"/> is meaningless.
</para>
</doc>

Several changes made to the DTD in the course of making it XML compliant have had an impact on the document:

The indexterm has been moved out of the document title, where it was formerly allowed by inclusion.
A CDATA section has been added around the content of the listing, which used to have CDATA declared content.
The range-end indexterm now uses XML empty-tag syntax. Note, however, that this element is no longer empty in the traditional SGML sense; it merely has no content. It was empty in the SGML document because the startref attribute was #CONREF.

There are a number of invisible changes as well, having to do with the fact that some content models have become more broad. These changes have no impact on document conversion but will have an impact on future authoring. In particular, the broader range of possibilities will have to be considered in stylesheets and other tools that process the documents.

Checking Your Work

The only way to be sure of your changes is to test them with a validating XML parser. Two popular validating parsers are Tim Bray's Larval (a validating version of Lark) and DataChannel's DXP (a validating version of NXP). Recent versions of James Clark's SP also include some support for XML validation.

(The list of available XML tools is growing almost daily; watch XML.com for a list of other XML processors.)

Should I Do This?

That's a good question. Right now, there do not seem to be a lot of good reasons for converting DTDs from SGML to XML. If you've got a working SGML system, and your primary motivation for converting to XML is web delivery, it may make more sense to simply translate your SGML documents into well-formed XML documents before publishing them. This allows you to maintain the benefits of your current investment while allowing you to take advantage of XML.

However, if you really need or want to switch your DTD to XML, another important question is how soon do you need to do it? The XML Schema discussion is just starting and within a couple of years we may have a complete DTD replacement for XML that includes equivalent support for some of the SGML features lost in XML DTDs and support for new features like proper data typing.

Conclusion

In most cases it is possible to convert your SGML DTDs to XML. It's likely that you're going to have to make a few changes and this may have an impact on your documents. Depending on your environment and your needs, it may make more sense to keep your existing system, at least until your XML Schemas are cooked well enough to evaluate.

Files Used in This Article

For your convenience, the files used in this article are available separately as straight text:

docsgml.dtd: The SGML DTD.
docxml.dtd: The XML DTD.
doc.sgm: The SGML test document.
doc.xml: The XML test document.