From DTDs to Documents

September 27, 2000

The questions I've dealt with so far in this column have been on general topics like "Which parser should I use?" and "What is markup?" This month we'll delve into a couple of deeper subjects.

Q: Can you give some guidelines for how to create a good DTD?

A: I'll assume here that you've looked into using XML Schemas instead of DTDs and decided, for one reason or another, to stick with the latter. In that case, the first question you need to ask is whether an existing DTD fulfills your exact need or something close to it. Some good places to look for existing DTDs are

James Tauber's schema.net site.
The XML.org Registry, with a search page at http://registry.xml.org:2020/repository/ui/home.jsp.
xmlTree.

Or you can use a search engine such as Google, using search terms like "music DTD," if that's your interest, or "automotive DTD," "address book DTD," and so on.

Once you're satisfied there aren't existing DTDs you can adapt or use as-is, you'll need some guidelines for writing your own. Here are a few, all of which assume you've already taken the time to learn the syntax of DTDs. This list can't be considered comprehensive; entire shelves of books exist on the subject, especially if you include those on SGML DTDs.

Maintainability

First, you want your DTD to be easily maintainable. This means two things: legibility and usability. Legibility covers the physical characteristics of the DTD itself: liberal use of whitespace, for example, and comments (especially ones explaining the nasty little bits whose purpose you're likely to forget in six months). Usability means that the DTD is structured according to some logical design and modularized in some rational way -- without going to crazy extremes. A modularized DTD is one that you have, to the extent possible, broken into DTD fragments, which can then be included in a main DTD using external parameter entities. Such fragments are much easier to update and to share, and they can be assembled in all kinds of interesting ways, perhaps even by DTDs that you yourself don't devise.

Elements or attributes?

Second, take a stance on the elements-or-attributes question, decide why you've adopted that stance, then stick to it in your content models and ATTLIST declarations. Perhaps the authoritative source on this enduring question is Robin Cover's page of commentary and links. If nothing else, note Robin's use of the word "anathematize" -- a word appearing most commonly in the context of religious heretics. There is indeed a religious-war flavor to the enduring debate.

Versioning

Finally, decide how you're going to handle versioning. The issue arises whenever you add to, subtract from, or modify an existing DTD: later changes may break documents whose validity depends on an element or attribute (or other aspect of the DTD) that is no longer present, or is present in some different form. Your decision will hinge on whether you choose to eschew backward compatibility (and rename your DTD) or attempt to preserve compatibility for documents written against the old DTD.
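
If you take the rename route, the new name gives your tools something to key on. What follows is only a rough sketch in Python, assuming each incompatible revision of the DTD gets its own file name and that documents point to it in their DOCTYPE declaration. The file names memo-1.0.dtd and memo-2.0.dtd and the function dtd_version are hypothetical, invented for illustration.

```python
from xml.dom import minidom

# Hypothetical file names: each incompatible revision of the DTD
# is published under its own name.
KNOWN_DTDS = {
    "memo-1.0.dtd": "1.0",
    "memo-2.0.dtd": "2.0",
}


def dtd_version(xml_text):
    """Return the DTD version a document declares, or None if unknown."""
    doctype = minidom.parseString(xml_text).doctype
    if doctype is None or doctype.systemId is None:
        return None
    # The system identifier is the file name or URL given in the DOCTYPE.
    return KNOWN_DTDS.get(doctype.systemId)


old_doc = '<!DOCTYPE memo SYSTEM "memo-1.0.dtd"><memo/>'
new_doc = '<!DOCTYPE memo SYSTEM "memo-2.0.dtd"><memo/>'
```

Note that minidom does not validate; the sketch only reads which DTD a document claims to follow, so that older documents can be routed to the old DTD while newer ones use the renamed one.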

Q: Can I dynamically create MS Word and Adobe PDF files from XML source documents?

A: Both Microsoft Word and Adobe Acrobat are, of course, proprietary products. Like many proprietary products, they use binary encodings in their native storage formats for your document information. This is neither good nor evil, but it's completely at odds with XML's "everything is text" view of the world.

Converting a Word file or an Acrobat PDF to XML is a challenge I wouldn't wish on anyone -- I'm not even sure it's possible. But your question comes at the problem from the opposite angle: how do I incorporate an XML document's text contents into one of these binary-and-text formats?

XML to Word

There's been a certain amount of interest and debate about Word 2000's ability to store XML content. On its own, though, what Word 2000 stores as XML is document metadata: creation date, author, and so on.

All is not lost, though. First, if you're a Microsoft Visual C++ or Visual Basic aficionado, creating a Word document is certainly easy enough in its own right; adding a reference to msxml.dll provides your code with the XML smarts necessary to fill in the text-goes-here blanks and voilà, there's your Word document complete with XML contents.

If you're not a programmer (or just don't want to get into it), you can't do much about creating a Word document per se. However, you can take advantage of the fact that Word can read certain text-only formats that contain "formatting instructions" as well. The two most common formats like that are Rich Text Format (RTF) and of course HTML. XSLT can easily transform XML into either of those formats, as long as you understand their syntaxes. (Therein may lie the rub; RTF is notoriously under-documented, which may drive you to HTML.) Depending on your environment, you may be interested in XML2RTF, a Java bean suite developed by IBM's alphaWorks people. More information about XML2RTF is available from alphaWorks.
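
To give a feel for the XML-to-HTML route, here is a minimal sketch. It is not XSLT; it uses Python's standard-library ElementTree module as a stand-in to perform the same kind of transformation. The memo, title, and para element names and the function memo_to_html are made up for the example; your own vocabulary will differ.

```python
import xml.etree.ElementTree as ET


def memo_to_html(xml_text):
    """Turn a (hypothetical) <memo> document into a simple HTML page."""
    src = ET.fromstring(xml_text)

    html = ET.Element("html")
    body = ET.SubElement(html, "body")

    # The memo's <title> becomes the page heading.
    ET.SubElement(body, "h1").text = src.findtext("title", default="")

    # Each <para> child becomes an HTML paragraph.
    for para in src.findall("para"):
        ET.SubElement(body, "p").text = para.text or ""

    return ET.tostring(html, encoding="unicode", method="html")


sample = (
    "<memo>"
    "<title>Status</title>"
    "<para>All systems go.</para>"
    "<para>R&amp;D is on schedule.</para>"
    "</memo>"
)
html_text = memo_to_html(sample)
```

Save the resulting text in a file with an .html extension and Word should be able to open it like any other HTML page, at which point you can save it in Word's native format if you need to.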

XML to PDF

This task is easier than the first problem, thanks to the work of a substantial number of developers who've already tackled it.

Let's take a look at two existing products; both convert XSL formatting objects (FOs) to PDF. (Therein may lie another rub, of course, since the XSL-FO standard isn't final.) They are

FOP, originally developed by James Tauber but now under the aegis of the Apache Project. For more information, see http://xml.apache.org/fop/.
RenderX's XEP Rendering Engine (formerly known by the less enigmatic name FO2PDF). For more information, see http://www.renderx.com/FO2PDF.html.

Dynamically

Whether, or how, you can create either Word or Acrobat PDF files "dynamically" depends on what you mean by that word. If you choose the programming approach, well, anything is theoretically possible. (That's what I tell my clients. Of course, I immediately follow it up with, "On the other hand, some things just aren't worth doing.") If not, it depends on the capabilities of such packages as FOP and XEP and your own performance requirements.