Controlling the DOCTYPE and XML Declaration

September 4, 2002

Bob DuCharme

XSLT processors usually create result documents that are well-formed XML with a simple XML declaration at the top. They don't have to add that XML declaration, though; it's easy to suppress it. It's also easy to add one and control exactly what it shows, such as an encoding declaration or a declaration of the version of XML being used. Your result document can also include a document type declaration that specifies the DTD to which it conforms, which is necessary for your result document to be a valid XML document. This month we'll see how to add these.

XML Declarations

The XML declaration at the beginning of an XML document is not necessary, but it's the best way to say "this is definitely an XML document and here's the release of XML it conforms to." The following is typical:

<?xml version="1.0"?>

Note: Despite its beginning and ending question marks, an XML declaration is not a processing instruction; it's a separate kind of markup declaration. In fact, the XML specification explicitly prohibits the processing instruction target (the name right after a processing instruction's opening question mark) from being "xml" in any combination of upper- and lowercase letters, in order to prevent a processing instruction from being confused with an XML declaration.

An XSLT processor's default behavior is to add an XML declaration to the beginning of an XML document that it creates in the result tree. If your stylesheet includes an xsl:output element with a method value of "text" or "html", the XSLT processor doesn't consider the result tree's document to be XML, so it won't add an XML declaration. If method is "xml", or if the stylesheet has no xsl:output element (in which case "xml" is the default unless the result's root element is an html element in no namespace), the result is considered an XML document. To show the simplest case, we'll apply the simplest possible stylesheet

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0"/>

to this little document:

<test>Dagon his Name, Sea Monster</test>

The result, thanks to XSLT's built-in template rules, shows the element's character data with the XML declaration preceding it:

<?xml version="1.0" encoding="utf-8" ?>Dagon his Name, Sea Monster

Although an XML declaration is optional, when it is included it must have the version information. (As I write this, XML 1.1 is in Last Call status, so we'll soon have to start worrying about whether XML processors are aware of 1.1's new features.) In the example above, the XML declaration follows the version information with an encoding declaration that tells us how the characters in the document are encoded. While the XML specification considers an encoding declaration to be optional if the document is encoded as UTF-8 or UTF-16, the XSLT specification says that the XML declaration an XSLT processor writes should include one, and that the processor should use UTF-8 or UTF-16 if the stylesheet specifies no other encoding.

You can specify one yourself or change the version value by adding encoding and version attributes to an xsl:output element in your stylesheet. The encoding attribute actually does more than add an encoding declaration to the result document; it tells the XSLT processor to write out the result using that encoding. If you specify an encoding that the processor can't handle, it may signal an error; if it doesn't, the XSLT specification says that it should fall back to UTF-8 or UTF-16.

The following stylesheet adds an encoding declaration and version information to the result document.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output method="xml" version="1.1" encoding="utf-16"/>
</xsl:stylesheet>

This produces the following using the same input as the previous example (although it may not look right in text editors that can't handle UTF-16):

<?xml version="1.1" encoding="utf-16" ?>Dagon his Name, Sea Monster

That's just a toy example. The following slightly longer program is actually useful. It copies an XML document without changing anything, except that it writes out the result as a UTF-16 document:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <xsl:output encoding="utf-16"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

By changing the value of its encoding attribute, you can create a general-purpose stylesheet to copy an XML document with the copy being in any encoding that you want, as long as your XSLT processor supports that encoding.
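As a sketch of that idea, the variation below targets ISO-8859-1 instead of UTF-16. The choice of encoding is only an illustration, and it assumes that your XSLT processor supports ISO-8859-1 for output.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <!-- Assumption: the processor supports ISO-8859-1 for output. -->
  <xsl:output method="xml" encoding="iso-8859-1"/>

  <!-- Identity template: copy every attribute and node unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With an encoding like this one, which can't represent every Unicode character directly, the XSLT specification calls for characters in text and attribute values that fall outside the encoding to be written as character references, so the copy should still represent the same document.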

What if you don't want an XML declaration in the result of your transformation? For example, I rarely show them in the result of my examples because I want the examples to be as concise as possible. I suppress them by adding an omit-xml-declaration attribute to most of the sample stylesheets' xsl:output elements, like this:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
</xsl:stylesheet>

The output of this stylesheet applied to the earlier XML document is identical to the output created with the earlier stylesheet, minus the XML declaration:

Dagon his Name, Sea Monster
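The method value also plays a part here, as described earlier: with a method of "text", the processor doesn't treat the result as XML at all, so there is no declaration to suppress. A minimal sketch, meant to be applied to the same test document:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <!-- method="text": the result is not treated as XML,
       so no XML declaration is written. -->
  <xsl:output method="text"/>
</xsl:stylesheet>
```

For this input the result should look the same as the one above. The difference shows up with other content: the text method writes characters such as < and & as they are, while the xml method with omit-xml-declaration="yes" still escapes them as &lt; and &amp;.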

Valid XML Output: Including DOCTYPE Declarations

A valid XML document is one that has a document type (or "DOCTYPE") declaration and conforms to the DTD in that document type declaration. (Remember, an XML document with no DOCTYPE declaration isn't valid, but it can still be a legal XML document as long as it's well-formed. "Valid" is a technical term referring to the presence of and conformance to a DOCTYPE declaration.)

A DOCTYPE declaration can include DTD declarations as an internal DTD subset between square brackets, like this,

<!DOCTYPE chapter [
<!ELEMENT chapter (title,para+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
]>

or it can point to DTD declarations stored in a separate file, like this:

<!DOCTYPE chapter SYSTEM "../dtds/chapter.dtd">

The SYSTEM identifier tells the XML parser where to find the DTD file on the system. An optional PUBLIC identifier can specify another string for the parser to use when locating a DTD file. These usually use a string similar to the following, which avoids any system-specific information to make the document more portable across different systems:

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML//EN"
          "../dtds/chapter.dtd">

The XML parser should look up this PUBLIC identifier somewhere to find the exact location of the local copy of the DTD file. (There have been proposals for the format and location of the lookup table, but none has caught on enough to be a widespread standard in the XML world, so that "somewhere" has never been completely resolved. In fact, people are using PUBLIC identifiers less and less anyway.) If it can't find it, the parser uses the SYSTEM identifier following the PUBLIC identifier. In the example above, the SYSTEM identifier doesn't need the word "SYSTEM"; because a system identifier is required after a PUBLIC identifier, the XML parser knows what the second quoted string is.
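One of those lookup-table proposals is the OASIS XML Catalogs format. As an illustration only, and assuming a parser or resolver that has been configured to read such a catalog, an entry mapping the PUBLIC identifier above to a local file might look like this. The uri value is a hypothetical local path, not something any particular tool requires.

```xml
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Map the PUBLIC identifier to a local copy of the DTD.
       The uri value is a hypothetical path, resolved relative
       to the location of the catalog file. -->
  <public publicId="-//OASIS//DTD DocBook XML//EN"
          uri="dtds/docbookx.dtd"/>
</catalog>
```

A resolver that finds a matching entry would read the local file; one that finds no match falls back to the SYSTEM identifier, as described above.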

To create valid XML documents using XSLT, a stylesheet must add a DOCTYPE declaration to the result tree. Because a DOCTYPE declaration isn't an element or a processing instruction, standard methods for adding those to your result tree won't accomplish this. Instead, an XSLT processor knows that it must create a DOCTYPE declaration in your result document when it sees certain specialized attributes in an xsl:output element.

Two more xsl:output attributes let you add SYSTEM and PUBLIC identifiers to a DOCTYPE declaration in your result. If your xsl:output element has a doctype-system attribute, the XSLT processor adds a DOCTYPE declaration to the result tree with that attribute's value as its SYSTEM identifier. If it also has a doctype-public attribute, it adds this attribute's value to the result's DOCTYPE declaration as a PUBLIC identifier. (With the xml output method, an XSLT processor ignores a doctype-public attribute without an accompanying doctype-system attribute, because an XML document can't have a PUBLIC identifier without a SYSTEM identifier.)
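Here is a minimal sketch of the first case, with only a doctype-system attribute. The DTD path is borrowed from the SYSTEM example earlier, and the identity template is there just so that the result has a root element for the declaration to name.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <!-- Only a SYSTEM identifier: no doctype-public attribute. -->
  <xsl:output method="xml" doctype-system="../dtds/chapter.dtd"/>

  <!-- Identity template: copy every attribute and node unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Applied to a document whose root element is chapter, this should put a declaration equivalent to <!DOCTYPE chapter SYSTEM "../dtds/chapter.dtd"> ahead of the root element, although the exact whitespace and line breaks vary from processor to processor.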

The following example source document conforms to the DocBook DTD.

<chapter><title>Chapter 1</title>
  <para>More unexpert, I boast not: them let those</para>
  <para>Contrive who need, or when they need, not now.</para>
  <para>For while they sit contriving, shall the rest,</para>
  <para>Millions that stand in Arms, and longing wait</para>
</chapter>

The following stylesheet just copies it to the result tree. Because its xsl:output element includes both doctype-system and doctype-public attribute specifications, the result will include a DOCTYPE declaration with both of these identifiers.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

<xsl:output method="xml" doctype-system="../dtds/docbookx.dtd" 
     doctype-public="-//OASIS//DTD DocBook XML//EN"/> 

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

The stylesheet could have had different instructions after that xsl:output element to rearrange, rename, or delete the elements, or to perform any of the other XSLT tricks possible on the source tree's nodes as they're copied to the result tree. The DOCTYPE declaration added to the result tree would still look like the one produced by the stylesheet and input document above, as shown here:

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE chapter
  PUBLIC "-//OASIS//DTD DocBook XML//EN" "../dtds/docbookx.dtd">
<chapter><title>Chapter 1</title>
  <para>More unexpert, I boast not: them let those</para>
  <para>Contrive who need, or when they need, not now.</para>
  <para>For while they sit contriving, shall the rest,</para>
  <para>Millions that stand in Arms, and longing wait</para>
</chapter>

How does the XSLT processor know what to put for the document type (the "chapter" part in "DOCTYPE chapter")? It knows the root element of the document it's creating in the result tree, and that's what an XML document type is: the element that serves as the document's root element.
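To illustrate, here is a sketch of a stylesheet that renames the root element as it copies the document. The section element name and the section.dtd path are hypothetical, made up for this example.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <!-- Hypothetical DTD path, for illustration only. -->
  <xsl:output method="xml" doctype-system="../dtds/section.dtd"/>

  <!-- Rename the root chapter element to section (a hypothetical name). -->
  <xsl:template match="/chapter">
    <section>
      <xsl:apply-templates select="@*|node()"/>
    </section>
  </xsl:template>

  <!-- Copy everything else unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Run against the chapter document above, the result's DOCTYPE declaration should name section, not chapter, because section is the root element of the result.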

If the method attribute of the stylesheet's xsl:output element has a value of "text", then a DOCTYPE declaration for the result tree wouldn't make any sense, because a non-XML text file won't have any use for a DOCTYPE declaration. If method has a value of "html", a DOCTYPE declaration might make sense; some Web pages, especially XHTML documents, actually do conform to a DTD, so specifying doctype-system and doctype-public attribute values on such an xsl:output element can be useful.
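As a sketch of the "html" case, the following stylesheet uses the W3C's public and system identifiers for the HTML 4.01 Strict DTD. The mapping from the chapter document's elements to HTML elements is an assumption made for this illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     version="1.0">

  <xsl:output method="html"
       doctype-public="-//W3C//DTD HTML 4.01//EN"
       doctype-system="http://www.w3.org/TR/html4/strict.dtd"/>

  <!-- Illustrative mapping: chapter becomes a whole HTML page. -->
  <xsl:template match="chapter">
    <html>
      <head><title><xsl:value-of select="title"/></title></head>
      <body>
        <h1><xsl:value-of select="title"/></h1>
        <xsl:apply-templates select="para"/>
      </body>
    </html>
  </xsl:template>

  <!-- Each para becomes an HTML paragraph. -->
  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

With the html output method, the XSLT specification has the processor write the DOCTYPE declaration immediately before the first element of the result.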

    

The DOCTYPE declarations added this way can only point to external DTD files. XSLT offers no way to create a result tree DOCTYPE declaration with an internal DTD subset (that is, with DTD declarations between the square brackets, as shown in the first example earlier). The DTD named in your doctype-system attribute must have all the declarations that your document needs.

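This column has mentioned six of the ten attributes that XSLT 1.0 defines for the xsl:output element: method, version, encoding, omit-xml-declaration, doctype-system, and doctype-public. The other four (standalone, indent, cdata-section-elements, and media-type) are definitely worth exploring as you learn more ways to fine-tune your result documents.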