XML.com: XML From the Inside Out

XML Canonicalization, Part 2

October 09, 2002

In the previous installment of this article, I introduced Canonical XML, and I discussed when and why you need to canonicalize an XML file. I also demonstrated a step-by-step process that results in the canonical form of an XML document.

In this second and final installment, I'll take the concept further and explain the canonicalization requirements of CDATA sections, processing instructions, comments, external entity references and XML document subsets.

Let's start with an example. Listing 1 is an XML file that contains, among other things, a CDATA section, comments, a processing instruction, and an external entity reference. The thirteen steps of part 1 are not sufficient to canonicalize it; we need to perform a few additional steps.

14. CDATA Sections

The canonical form requires all CDATA sections to be replaced with their equivalent PCDATA XML content. This is what we have done in Listing 2. If you compare the two listings, you will find that the CDATA section delimiters ("<![CDATA[" at the beginning and "]]>" at the end) have been deleted, and that the "<" character in the CDATA section of Listing 1 has been replaced with its equivalent escape sequence (&lt;) in Listing 2.
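As a rough illustration of this step, here is a minimal sketch using the canonicalize function from Python's standard library (Python 3.8 or later). Note the assumptions: that function implements the later Canonical XML 2.0 specification rather than the version discussed in this article, although the two treat CDATA sections the same way, and the one-element document is a hypothetical stand-in, not Listing 1.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

# Hypothetical one-element document; it is not one of the article's listings.
source = "<condition><![CDATA[price < 100]]></condition>"

# The CDATA delimiters disappear and "<" is escaped as character data.
canonical = canonicalize(source)
print(canonical)
```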

15. Processing Instructions

We need to normalize whitespace inside processing instructions. This means that the whitespace between the target and its data will be reduced to a single space (the #x20 character).

There is only one processing instruction in the XML file of Listing 2. The target in this processing instruction is xml-stylesheet, which is followed by the data string. Listing 3 is the same as Listing 2, except that all whitespace between the target and its data has been normalized.
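The same normalization can be observed with a small sketch, again assuming Python's standard-library canonicalize function (which follows Canonical XML 2.0, not the specification version covered here) and a hypothetical document whose processing instruction has several spaces after the target.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

# Hypothetical document: several spaces separate the PI target from its data.
source = '<?xml-stylesheet     href="style.css" type="text/css"?><doc/>'

canonical = canonicalize(source)

# Only a single space should remain between the target and the data string.
normalized_pi = '<?xml-stylesheet href="style.css" type="text/css"?>'
print(normalized_pi in canonical)
```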

16. External Entity References

Recall the section on entity references in part 1, where we demonstrated how to canonicalize parsed internal entity references. In a similar fashion, parsed external entity references also need to be replaced with the content they refer to, as shown in Listing 4.

17. Comments

The canonical XML specification allows both retaining and removing comments from an XML file. An XML canonicalization engine will receive a boolean parameter (flag) along with the XML file to be canonicalized, which will tell the canonicalization engine whether to include or exclude XML comments in the canonical form.

For example, Listing 5 shows the removal of comments from Listing 4 (canonical XML without comments).
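The boolean flag described above has a direct counterpart in Python's standard library, which makes for a convenient sketch. The assumptions here are that canonicalize (Python 3.8 or later, implementing Canonical XML 2.0) is an acceptable stand-in for the canonicalization engine, and that the short document is hypothetical rather than Listing 4.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

# Hypothetical fragment with one comment; it is not one of the article's listings.
source = "<booking><!-- internal note --><room>single</room></booking>"

# Comments are dropped by default and kept when with_comments=True.
without_comments = canonicalize(source)
with_comments = canonicalize(source, with_comments=True)

print(without_comments)
print(with_comments)
```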

We are now ready to apply the thirteen steps to Listing 5 (as described in part 1). The result is shown in Listing 6.

18. Document Subsets

XML document subsets or fragments (portions of complete XML files) are an interesting case. When we extract a portion from an XML file, we essentially separate a child node from its parent (call it an orphan node). This separation may invalidate the child's namespace context if that context was declared in a parent that has been omitted from the document subset.

The Canonical XML specification proposes a method to preserve the namespace context while extracting a document subset. However, there are application scenarios in which preserving the namespace context may create other problems. W3C has released a separate recommendation named Exclusive XML Canonicalization which deals with such scenarios.

The Canonical XML and Exclusive XML Canonicalization specifications differ only in whether the ancestor context is preserved or excluded.

Preserving the Ancestor Context

Have a look at Listing 7, which is a SOAP message. Let's assume we need to canonicalize the booking element in Listing 7 whose unitCharge attribute shows "50" as the value. The first step in doing this is to write an XPath expression that will extract the required document fragment from the XML file. While trying to identify which element I intend to canonicalize, I said "the booking element in Listing 7 whose unitCharge attribute shows '50' as the value". The equivalent XPath expression with the same meaning is

(//. | //@* | //namespace::*)[ancestor-or-self::bs:booking[@unitCharge="50"]]

(with namespace declaration xmlns:bs="http://www.FictitiousTourismInterface/BookingService")

This XPath expression will extract the required booking element from the XML file of Listing 7. The expression in parentheses (//. | //@* | //namespace::*) selects every node in the document: //. matches all nodes other than attribute and namespace nodes, //@* matches the attribute nodes, and //namespace::* matches the namespace nodes. The outer predicate in square brackets (ancestor-or-self::bs:booking) keeps only those nodes that are a booking element or have one as an ancestor, that is, the booking elements along with everything inside them. The inner predicate (@unitCharge="50") narrows this to the booking element whose unitCharge attribute has the value "50".
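To make the selection concrete, here is a minimal sketch in Python. It rests on some assumptions: the standard-library ElementTree module supports only a limited XPath subset and cannot evaluate the node-set expression above (it has no namespace axis), so the sketch merely locates the same booking element and walks its element descendants. The sample document is a hypothetical stand-in for Listing 7; only the bs namespace URI is taken from the article.

```python
import xml.etree.ElementTree as ET

NS = {"bs": "http://www.FictitiousTourismInterface/BookingService"}

# Hypothetical stand-in for Listing 7, reduced to two booking elements.
SAMPLE = """<bs:bookingPackage
    xmlns:bs="http://www.FictitiousTourismInterface/BookingService">
  <bs:booking unitCharge="50"><bs:room>single</bs:room></bs:booking>
  <bs:booking unitCharge="80"><bs:room>double</bs:room></bs:booking>
</bs:bookingPackage>"""

root = ET.fromstring(SAMPLE)

# Find the booking element whose unitCharge attribute is "50".
apex = root.find(".//bs:booking[@unitCharge='50']", NS)

# iter() yields the element itself followed by its descendant elements;
# attribute, text, and namespace nodes are not separate objects here.
subset = list(apex.iter())
for element in subset:
    print(element.tag, element.attrib)
```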

Listing 8 is a subset of Listing 7 and consists of the booking element. Some readers might be tempted at this point to apply the thirteen steps of part 1 to canonicalize Listing 8. However, there are a couple of problems that require additional processing before we can apply those thirteen steps:

  1. The namespace declarations for the bs and hs prefixes were made in the booking element's parent tag, which is not included in the document subset shown in Listing 8.

  2. The xml:lang attribute of the bookingPackage element of Listing 7 was applicable to all its children. This attribute is also missing in the document subset of Listing 8.

These problems clearly indicate that extracting document fragments should be accompanied by actions to preserve the namespace context and the effect of attributes from the xml: namespace. The Canonical XML specification requires the following measures to be taken while canonicalizing document subsets (in addition to all the requirements of canonicalizing complete XML files).

  1. Namespace declarations in the omitted ancestors of the document subset are included in the canonical form.

  2. Attributes in the xml namespace are also included in the canonical form, if they are not already present in the fragment being canonicalized.

These two steps are intended to conserve the ancestor context of a document subset. Have a look at Listing 9 (the required canonical form), which includes the four namespace declarations made in the ancestors of the booking element of Listing 7. Listing 9 also includes the xml:lang attribute. Also notice that the canonical form of the document subset does not have any line breaks (#xA); that is, the entire file appears on a single line.

Once the ancestor context has been included, the ordering of namespace declarations and attributes is the same as for canonicalizing the complete XML file.
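The two measures for preserving the ancestor context can be sketched in a few lines of Python using the standard-library minidom module. This is an illustration under stated assumptions, not a conforming canonicalizer: the inherited_context helper is a hypothetical name, the sample SOAP message is a stand-in for Listing 7 (only the bs namespace URI comes from the article; the env and hs URIs are invented), and the sketch only collects the inherited declarations without producing the canonical output.

```python
from xml.dom import minidom

BS = "http://www.FictitiousTourismInterface/BookingService"

# Hypothetical stand-in for Listing 7; the env and hs URIs are invented.
SOAP = f"""<env:Envelope xmlns:env="http://example.org/hypothetical-envelope">
  <env:Body>
    <bs:bookingPackage xml:lang="en" xmlns:bs="{BS}"
                       xmlns:hs="http://example.org/hypothetical-hotel">
      <bs:booking unitCharge="50"><hs:room>single</hs:room></bs:booking>
      <bs:booking unitCharge="80"><hs:room>double</hs:room></bs:booking>
    </bs:bookingPackage>
  </env:Body>
</env:Envelope>"""


def inherited_context(element):
    """Collect namespace declarations and xml: attributes from omitted ancestors."""
    own = dict(element.attributes.items())
    context = {}
    parent = element.parentNode
    while parent is not None and parent.nodeType == parent.ELEMENT_NODE:
        for name, value in parent.attributes.items():
            is_namespace = name == "xmlns" or name.startswith("xmlns:")
            is_xml_attr = name.startswith("xml:")
            if not (is_namespace or is_xml_attr):
                continue
            # The nearest ancestor wins, and attributes already present
            # on the fragment itself are never overridden.
            if name not in own and name not in context:
                context[name] = value
        parent = parent.parentNode
    return context


doc = minidom.parseString(SOAP)
apex = next(e for e in doc.getElementsByTagNameNS(BS, "booking")
            if e.getAttribute("unitCharge") == "50")
context = inherited_context(apex)
for name in sorted(context):
    print(name, "=", context[name])
```

A real canonicalizer would then write these inherited declarations on the apex element of the subset and order them together with the element's own attributes, as described above.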
