Processing Inclusions with XSLT
Using more than one file to create an XML document -- XML inclusion -- is a topic that inflames discussion lists, and one for which finding a general solution is like trying to square the circle.
In this article, we will show how customized parsers can expose a more complete document model through a SAX interface and help process compound documents with standard XML tools such as XSLT.
Where's the Problem?
Most XML APIs and standards focus on providing, in a convenient way, all the information needed to process and display the data embedded in XML documents.
Applied to document inclusions, this means that XML processors are required to replace the inclusion instruction with the content of the included resource. This is the best thing to do for common formatting and information extraction tasks, but it results in a loss of information that can be unacceptable when transforming these documents with XML tools.
This topic has been discussed a number of times on different mailing lists, and the feeling of many can be summarized by a post from Rick Geimer on the XSL List (in answer to a feature request from Dave Pawson to facilitate the generation of entity references with XSLT):
Entities in the internal subset are a feature of XML 1.0, and in my opinion, should be supported by XSLT. Until that happens, I don't plan to use XSLT for anything other than HTML creation, since it simply doesn't fit into our publishing model as a general XML-to-XML transformation tool.
This behavior is explicitly defined for external parsed entities in section 4.4.2 of the XML 1.0 recommendation:
An entity is included when its replacement text is retrieved and processed, in place of the reference itself, as though it were part of the document at the location the reference was recognized.
As a result, the replacement is done at parse time, and the information is lost by SAX parsers and is unavailable in the DOM and XPath data models.
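This loss can be observed with any SAX parser. Here is a minimal sketch using Python's stdlib SAX API (the feature name is Python-specific; the file names mirror the example used later in this article):

```python
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Record every element the parser reports.
class Recorder(ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

# A document pulling in an external parsed entity.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "tools.xml"), "w") as f:
    f.write("<tool/>")
with open(os.path.join(workdir, "main.xml"), "w") as f:
    f.write('<!DOCTYPE preparation [\n'
            '<!ENTITY include_tools SYSTEM "tools.xml">\n'
            ']>\n'
            '<preparation>&include_tools;</preparation>\n')

recorder = Recorder()
parser = xml.sax.make_parser()
parser.setContentHandler(recorder)
parser.setFeature(feature_external_ges, True)  # expand external entities
parser.parse(os.path.join(workdir, "main.xml"))

# The entity reference has been replaced by the content of tools.xml;
# nothing in the event stream records that an entity was ever involved.
print(recorder.names)  # ['preparation', 'tool']
```

The handler sees the merged content only; the reference itself leaves no trace.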
This behavior is likely to be adopted by the XInclude specification; section 3.3 of the 17 July 2000 Working Draft states that:
The acquired infoset is merged with the source infoset to create a new infoset by replacing the information items representing the include elements with information items in the acquired infoset. The include element, its attributes and any children, are not represented in the result infoset.
The only glimmer of hope in the W3C specifications seems to come from the Infoset. The 26 July 2000 Working Draft describes "an abstract data set which contains the useful information available from an XML document," which includes entity start and end markers -- but these are not required for core conformance.
A Possible Solution
Some parsers, such as XP, provide proprietary interfaces that can expose more information than the specification requires and than is available through a SAX interface. However, these extra features are not available to standard transformation tools.
Since most of these tools use a SAX interface, why not modify a SAX parser to expose more of the document structure through a standard SAX interface?
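As a hint that this is feasible, even Python's low-level expat bindings can already surface some of the hidden structure: entity declarations through EntityDeclHandler, and unexpanded entity references through the default handler. The str: element names below are only an illustration, anticipating the vocabulary introduced later in this article; expat also feeds other DTD text to the default handler, so we filter for references only:

```python
from xml.parsers.expat import ParserCreate

out = []

def entity_decl(name, is_param, value, base, system_id, public_id, notation):
    # Fired for each <!ENTITY ...> declaration in the internal subset.
    out.append('<str:externalEntityDefinition str:name="%s" str:systemId="%s"/>'
               % (name, system_id))

def start(name, attrs):
    out.append("<%s>" % name)

def end(name):
    out.append("</%s>" % name)

def default(data):
    # With no ExternalEntityRefHandler set, unexpanded entity references
    # reach the default handler literally, as "&name;".
    if data.startswith("&") and data.endswith(";"):
        out.append('<str:entity str:name="%s"/>' % data[1:-1])

p = ParserCreate()
p.EntityDeclHandler = entity_decl
p.StartElementHandler = start
p.EndElementHandler = end
p.DefaultHandler = default

p.Parse('<!DOCTYPE preparation ['
        '<!ENTITY include_tools SYSTEM "tools.xml">]>'
        '<preparation>&include_tools;</preparation>', True)
print("".join(out))
```

Here the entity declaration and the reference both survive as explicit events, which is exactly what a customized parser would stream out through SAX.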
The idea is to define an XML representation of the document structure for applications needing to work at this level.
The whole infoset could be represented in XML, just as it can be represented using an RDF Schema, as shown by Dan Connolly in the Infoset Working Draft. For the purposes of this article, I have used a simpler model, offering direct access to the document's elements and attributes.
I'll illustrate the XML vocabulary used to describe an XML document with a simple example taken from Karen Lease's presentation at XML Europe 2000 (this example could also have been taken from DocBook, which relies on external parsed entities to consolidate the chapters of a document):
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE preparation [
<!ENTITY include_tools SYSTEM "tools.xml">
<!ENTITY include_products SYSTEM "products.xml">
<!ENTITY include_parts SYSTEM "parts.xml">
]>
<!-- Simple example of file using external parsed entities -->
<preparation>
&include_tools;
&include_products;
&include_parts;
</preparation>

(main.xml)
To keep this example as simple as possible, we will assume that the three included files are the following:
<?xml version="1.0" encoding="iso-8859-1"?>
<tool/>

(tools.xml)
<?xml version="1.0" encoding="iso-8859-1"?>
<product/>

(products.xml)
<?xml version="1.0" encoding="iso-8859-1"?>
<part/>

(parts.xml)
Let's first try to apply the following XSLT transformation to this example:
<?xml version="1.0" encoding='iso-8859-1'?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
This transformation uses a template known as the "identity" template, and is supposed to perform a roundtrip, transforming a document into itself. However, the result of this transformation using XT is:
<?xml version="1.0" encoding="utf-8"?>
<preparation>
<tool/>
<product/>
<part/>
</preparation>
We see that, even if the canonical content of the document is unchanged, we have lost the encoding (changed from iso-8859-1 to utf-8), the doctype declaration with all its entity definitions, and a comment, while all the external parsed entity references have been replaced by their values.
This loss of information is not caused by the way the transformation has been written, but by the fact that this information does not survive a standard XML parse and is not available in the XPath data model.
Let's now imagine that we have a parser modified to send this extra information through standard SAX events. The event stream, translated into XML, would give:
<?xml version="1.0" encoding="utf-8"?>
<str:document xmlns:str="http://4xt.org/ns/xmlstructure">
  <str:prolog>
    <str:X-M-L-Decl str:version="1.0" str:encoding="iso-8859-1"/>
    <str:doctype str:name="preparation">
      <str:externalEntityDefinition str:name="include_tools" str:systemId="tools.xml"/>
      <str:externalEntityDefinition str:name="include_products" str:systemId="products.xml"/>
      <str:externalEntityDefinition str:name="include_parts" str:systemId="parts.xml"/>
    </str:doctype>
    <str:comment> Simple example of file using external parsed entities </str:comment>
  </str:prolog>
  <str:body>
    <preparation>
      <str:entity str:name="include_tools"/>
      <str:entity str:name="include_products"/>
      <str:entity str:name="include_parts"/>
    </preparation>
  </str:body>
  <str:epilog/>
</str:document>
A new namespace (http://4xt.org/ns/xmlstructure, aliased to str) is used to describe the structure of the XML document.
The document element is str:document, and its three children are the three parts of an XML document (prolog, body, and epilog).
In the prolog, we find the XML declaration (I have replaced xml, which is reserved, with X-M-L) and the doctype with its external entity definitions.
The comment present in this example is also part of the prolog, as it is found in the document before the document element.
In the body, we find the document element (preparation) available directly as an element in its own namespace (here, the default namespace). All the other elements and attributes would be found directly, as they appear in the original document.
The entity references are described by str:entity elements, which carry their names.
This representation could be used by any XML tool to access information usually hidden in an XML document, and to transform it. The roundtrip could be ensured by output handlers that analyze this model to write back a modified version of the document in its original form.
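Such an output handler can be sketched in a few lines. The Python below is only an illustration of the write-back idea, assuming the subset of the str: vocabulary shown above (attributes and character data inside the body are ignored for brevity; the real output handler would have to cover them too):

```python
import xml.etree.ElementTree as ET

STR = "http://4xt.org/ns/xmlstructure"

def s(name):
    return "{%s}%s" % (STR, name)

def serialize(elem):
    # Body elements are written back literally; str:entity elements
    # become entity references again.
    if elem.tag == s("entity"):
        return "&%s;" % elem.get(s("name"))
    children = "".join(serialize(child) for child in elem)
    if children:
        return "<%s>%s</%s>" % (elem.tag, children, elem.tag)
    return "<%s/>" % elem.tag

def write_back(structure_xml):
    # Walk the str:document model and emit the original syntax.
    root = ET.fromstring(structure_xml)
    out = []
    prolog = root.find(s("prolog"))
    decl = prolog.find(s("X-M-L-Decl"))
    if decl is not None:
        out.append('<?xml version="%s" encoding="%s"?>\n'
                   % (decl.get(s("version")), decl.get(s("encoding"))))
    doctype = prolog.find(s("doctype"))
    if doctype is not None:
        out.append("<!DOCTYPE %s [\n" % doctype.get(s("name")))
        for ent in doctype.findall(s("externalEntityDefinition")):
            out.append('<!ENTITY %s SYSTEM "%s">\n'
                       % (ent.get(s("name")), ent.get(s("systemId"))))
        out.append("]>\n")
    for comment in prolog.findall(s("comment")):
        out.append("<!--%s-->\n" % comment.text)
    body = root.find(s("body"))
    out.append(serialize(body[0]))
    return "".join(out)

structure = """\
<str:document xmlns:str="http://4xt.org/ns/xmlstructure">
  <str:prolog>
    <str:X-M-L-Decl str:version="1.0" str:encoding="iso-8859-1"/>
    <str:doctype str:name="preparation">
      <str:externalEntityDefinition str:name="include_tools"
                                    str:systemId="tools.xml"/>
    </str:doctype>
  </str:prolog>
  <str:body><preparation><str:entity str:name="include_tools"/></preparation></str:body>
  <str:epilog/>
</str:document>"""

print(write_back(structure))
```

Feeding it the model of our example restores the XML declaration, the doctype with its entity definitions, and the entity references themselves.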
If you are convinced (or even half convinced), why don't you give it a try?
I have written a customized version of Ælfred2 that parses a document into the model shown above. You can use it together with a specific driver class for XT, which invokes it, and an output method (StructXMLOutputHandler) that writes the document back into its original form.
Before you download the whole package, note that the features shown in this example are the only ones that have been implemented so far!
The structure shown above (through_struct.xml) is the actual result of transforming our example through the identity transformation (which does not use the StructXMLOutputHandler output method).
To make use of this output method, the xsl:output of the style sheet has to be updated to:
<?xml version="1.0" encoding='iso-8859-1'?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="java:StructXMLOutputHandler"
              xmlns:java="http://www.jclark.com/xt/java"/>
  <xsl:template match="@*|*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
The result is then:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE preparation [
<!ENTITY include_tools SYSTEM "tools.xml">
<!ENTITY include_products SYSTEM "products.xml">
<!ENTITY include_parts SYSTEM "parts.xml">
]>
<!-- Simple example of file using external parsed entities -->
<preparation>&include_tools;&include_products;&include_parts;
</preparation>

(roundtrip.xml)
We see that the result is identical to the original document, except for some whitespace and line feeds that were not captured in our document structure representation and are therefore not preserved.
Because the input (the parser) and the output (the output method) have been implemented separately, this hack has more potential uses. Used alone, the parser can let you generate (X)HTML documentation from XML documents: mentioning which encoding is used, giving doctype information, building tables of external parsed entity declarations and references, and copying the comments.
On the other side, the output method can be used alone (possibly with an XML layout file, as shown in "style-free XSLT") to generate documents containing external parsed entity definitions and references.
And, of course, used together, they allow us to transform compound documents: changing the name of an entity, changing the location of an included document, or simply transforming elements or attributes from these documents -- something a standard transformation cannot do without losing the document's structure, as we've shown above.
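Once the references are visible, renaming an entity becomes an ordinary tree transformation. A sketch on the structure model (using Python's ElementTree rather than XSLT, and assuming the str: vocabulary above):

```python
import xml.etree.ElementTree as ET

STR = "{http://4xt.org/ns/xmlstructure}"
ET.register_namespace("str", STR[1:-1])  # keep the str prefix on output

def rename_entity(structure_xml, old, new):
    # Rename an entity in its definition and in every reference --
    # impossible once the references have been expanded away.
    root = ET.fromstring(structure_xml)
    for elem in root.iter():
        if elem.tag in (STR + "entity", STR + "externalEntityDefinition") \
                and elem.get(STR + "name") == old:
            elem.set(STR + "name", new)
    return ET.tostring(root, encoding="unicode")

structure = ('<str:document xmlns:str="http://4xt.org/ns/xmlstructure">'
             '<str:prolog><str:doctype str:name="preparation">'
             '<str:externalEntityDefinition str:name="include_tools"'
             ' str:systemId="tools.xml"/></str:doctype></str:prolog>'
             '<str:body><preparation>'
             '<str:entity str:name="include_tools"/>'
             '</preparation></str:body><str:epilog/></str:document>')

renamed = rename_entity(structure, "include_tools", "include_all_tools")
print(renamed.count('str:name="include_all_tools"'))  # 2: definition and reference
```

Feeding the result to the output handler would then write back a document whose doctype and references both use the new entity name.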
The vocabulary shown in this example is not complete and doesn't yet cover all the information available in an XML document.
Defining such a vocabulary, based either on the one I have presented here or on one more conformant to the Infoset, would open other promising avenues -- especially where DTDs are concerned: transforming DTDs into other schema formats or input forms, and more generally retrieving information from DTDs and generating DTDs.
From a wider perspective, developing special parsers covering applications out of the scope of the mainstream parsers is probably a good way to keep simple things simple, while ensuring that everyone can find a solution for their problems.
Such a solution could be applied to other specifications (such as XInclude) which tend to restrict their field of application to keep things simple and find a consensus.
Acknowledgements and References
Many thanks to Karen Lease, whose presentation during XML Europe 2000 has been the starting point of my developments on the subject; to the SML mailing list, whose discussions about "colors" are a good tutorial for thinking differently about XML; to David Megginson and David Brownell for Ælfred2 -- so easy to read and hack -- and James Clark for XT.