DTDs, Industry Markup Languages, XSLT and Special Characters

March 28, 2001

Q: Can an XML document reference two or more DTDs?

A: Directly? No. To be validated against a DTD, an XML document can have only one document type declaration. (That's the portion of the document prolog starting "<!DOCTYPE...") And a document type declaration can reference only one external DTD. If you think about it, this makes perfect sense: One of the requirements of well-formedness (let alone validation) is that there be only one root element. So given one DTD which defines elements A, B, and C, and another which defines X, Y, and Z, how do you get a single root out of that?

That said, you may be able to do it indirectly.

Let's call the two DTDs I mentioned above start.dtd and end.dtd, respectively, and assume that the A and X elements (respectively) are their root elements. You could create your own DTD (let's call it combo.dtd) which includes all the declarations from start.dtd and end.dtd, by making combo.dtd look something like this:

<!ELEMENT combo (A, X)>
<!ENTITY % dtd_start SYSTEM "http://www.foo.com/start.dtd">
<!ENTITY % dtd_end SYSTEM "http://www.foo.com/end.dtd">
%dtd_start;
%dtd_end;

The two ENTITY declarations establish so-called external parameter entities, essentially pointers to other DTDs. (Of course, the URIs following the SYSTEM keyword need to point to those DTDs' actual locations. These can be relative URIs -- relative to the location of combo.dtd itself.) The declarations in the two DTDs are logically included in combo.dtd by the parameter entity references, %dtd_start; and %dtd_end;.

A document based on this composited DTD may use the A, B, C, X, Y, and Z elements (and their attributes, entities, notations, and so on) in any ways permitted by their respective original DTDs. There are, however, some caveats:

If you really want to mix the two vocabularies, any XML document pointing to combo.dtd must include as its root the combo element. You still won't be able to do things like make the A element a child of the Z element, though. If you charted your document's structure, there'd be the one root element, combo, and two completely independent branches off that: all the elements (and content models) from start.dtd in one, and all those from end.dtd in the other.
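The technique above works only when there are no overlapping or conflicting element declarations in the constituent DTDs. For example, if both start.dtd and end.dtd declare an element named M, a validating parser will report an error, because XML 1.0 does not allow an element type to be declared more than once. (Duplicate entity and attribute declarations are more forgiving: the first declaration encountered is binding, and later ones are ignored.)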
The declarations from start.dtd and end.dtd are automatically "included" in combo.dtd. Again, though, this is only a logical (not a physical) inclusion. What this means is that if start.dtd or end.dtd changes in any way, those changes will automatically roll out to any documents (or other DTDs) that reference combo.dtd. This may or may not be a good thing, depending on what those changes do; therefore, it may be best to reference only those DTDs over which you (or your organization) have control.
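
To picture the "two independent branches" caveat, here is a minimal sketch of a document that points to combo.dtd. The content models are assumptions made purely for illustration: the sketch supposes that start.dtd declares A as (B, C), that end.dtd declares X as (Y, Z), and that B, C, Y, and Z hold plain text. Your real DTDs will almost certainly differ.

```xml
<?xml version="1.0"?>
<!-- Assumes combo.dtd sits in the same directory as this document. -->
<!DOCTYPE combo SYSTEM "combo.dtd">
<combo>
  <!-- Branch 1: elements declared in start.dtd (hypothetical content model) -->
  <A>
    <B>first</B>
    <C>second</C>
  </A>
  <!-- Branch 2: elements declared in end.dtd (hypothetical content model) -->
  <X>
    <Y>third</Y>
    <Z>fourth</Z>
  </X>
</combo>
```

Note that A and X sit side by side under combo. Neither vocabulary's elements can appear inside the other's unless one of the original DTDs already allows it.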

Q: Is there a list of industry-specific markup languages (e.g., Chemical Markup Language, HR Markup language) available anywhere?

A: Congratulations on looking around first before building your own.

Yes, there are a number of such lists.

James Tauber's schema.net: This site has been around for a few years now. Not as complete as some of the others, it's nonetheless very easy to use and an excellent place to start without being overwhelmed. A menu of hyperlinked categories ("Education"; "Scientific/Technical"; etc.) down the left-hand side leads you to information about XML vocabularies in each problem domain.
XML.ORG (sponsored by the Organization for the Advancement of Structured Information Standards, OASIS): Like schema.net this resource is categorized, with its (alphabetically arranged) categories rather more granular than those on schema.net. For instance, under "A" XML.ORG lists Accounting, Advertising, Agriculture, Architecture and Construction, Astronomy and Space, Automotive, and Aviation and Aerospace.

You might also check Robin Cover's "XML Cover Pages" (also hosted by OASIS). Cover has his ear to the ground on apparently everything XML-related, and you will frequently find news of new markup languages here well in advance of their appearing at the aforementioned repositories.

Yet another approach is taken by Calaba's xmlTree. This is an index not of language names but of sites serving XML. On entry to the site, you select a keyword from several categories (Subject, Content Type, Language, Schema, or Location) and then drill down to specific sites within that category. What I like about this approach is its informality. While a vocabulary repository (like the ones linked to above) requires the language's proprietor to register the vocabulary, you can find many unregistered gems through the xmlTree interface. The site includes two special collections: a directory of RSS sites (Rich Site Summary markup language, providing free XML-based content syndication), and a directory of WML sites (serving up Wireless Markup Language documents on various topics).

Q: I want the value of an element to be passed to the HTML result from an XSLT transformation as is (including special characters: >, <, &, etc.). How do I do that?

A: Getting those literal characters into the result tree is one thing, and you've got a couple of options for doing so. The main trick -- or, rather, obstacle, in this case -- is that you're transforming to HTML.

If you were transforming to pure XML, that is, some dialect not intended for browsers to consume, you could wrap the text content of the appropriate result element(s) in CDATA sections. (Use the cdata-section-elements attribute of the xsl:output element for this.) Unfortunately, browsers don't treat CDATA sections reliably (or, in some cases, properly at all).
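
As a rough sketch of the CDATA approach, the stylesheet below asks the processor to serialize the text content of every example element in the result as a CDATA section. The element names doc, example, and source are hypothetical, chosen only for this illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Text children of <example> result elements are written as CDATA.
       "example" is a hypothetical element name. -->
  <xsl:output method="xml" cdata-section-elements="example"/>

  <xsl:template match="/">
    <doc>
      <!-- "source" is a hypothetical element in the input document -->
      <example><xsl:value-of select="//source"/></example>
    </doc>
  </xsl:template>

</xsl:stylesheet>
```

The value of cdata-section-elements is a whitespace-separated list, so several element names can be given at once.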

Alternatively, you can output the problematic text directly to the result tree by adding a disable-output-escaping="yes" attribute to either an xsl:text or an xsl:value-of element. This doesn't depend on the browser's ability to understand or, well, do anything special.
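
A minimal sketch of this second option follows. The snippet and p element names are hypothetical. Bear in mind that the XSLT 1.0 Recommendation does not require processors to support disabling output escaping, so treat this as something to test with your own processor rather than a guarantee.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- "snippet" is a hypothetical source element whose text should be
       copied to the output without escaping <, >, or &. -->
  <xsl:template match="snippet">
    <p>
      <xsl:value-of select="." disable-output-escaping="yes"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Given a source element such as <snippet>a &lt; b &amp;&amp; c</snippet>, a processor that honors the attribute writes the characters < and & literally into the p element instead of re-escaping them.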

However, this is one of those gray areas where browsers will probably let you get away with something you're not supposed to get away with. Those markup-related special characters are supposed to "break" a browser. (In practice, the > shouldn't create too many problems, but the < and & serve as wake-up calls to the processor: "Markup ahead!") If the browser respects XHTML, it will reject those literal characters just as any other good XML application will.

I've encountered few convincing reasons why someone must get those literal characters into an XSLT result tree. The best one arises when transforming to (X)HTML that includes a script element containing JavaScript code that uses the < character in Boolean tests. In this case, you might find the following solution useful. It shows the portion of the result tree that you want to create:

<script type="text/javascript">
//<![CDATA[
...
if (i<12) { }
...
//]]>
</script>

This gets around the problem of browsers not recognizing CDATA sections by enclosing the CDATA delimiters in JavaScript comments (see the "//" characters on those lines?). At the same time, it allows XML-aware browsers (or other XML-aware applications) to treat the CDATA section as it should.

(This ingenious solution isn't mine by the way. It's one of Simon St. Laurent's tips for working around limits in browser support for XHTML, taken from the XHTML-L mailing list archive.)
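
If you need to generate that script element from a stylesheet, one possible approach (a sketch, not part of the original tip) is to write the commented CDATA delimiters through xsl:text with output escaping disabled. The template name emit-script is hypothetical, and the technique assumes an XML output method and a processor that supports disable-output-escaping.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <!-- "emit-script" is a hypothetical name for this illustration -->
  <xsl:template name="emit-script">
    <script type="text/javascript">
      <!-- &lt; and &gt; keep the stylesheet well-formed; with escaping
           disabled they are written out as literal < and > characters. -->
      <xsl:text disable-output-escaping="yes">
//&lt;![CDATA[
if (i &lt; 12) { }
//]]&gt;
</xsl:text>
    </script>
  </xsl:template>

</xsl:stylesheet>
```

Each CDATA delimiter lands on its own line after the "//", so a browser's JavaScript engine sees two comments while an XML parser sees an ordinary CDATA section.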