XML Canonicalization, Part 2
In the previous installment of this article, I introduced Canonical XML, and I discussed when and why you need to canonicalize an XML file. I also demonstrated a step-by-step process that results in the canonical form of an XML document.
In this second and final installment, I'll take the concept further and explain the canonicalization requirements of CDATA sections, processing instructions, comments, external entity references and XML document subsets.
Let's start with an example. Listing 1 is an XML file that contains, among other things, a CDATA section, comments, a processing instruction, and an external entity reference. The thirteen steps of part 1 are not sufficient to canonicalize it; we need to perform a few additional steps.
14. CDATA Sections
The canonical form requires all CDATA sections to be replaced with their equivalent PCDATA content. This is what we have done in Listing 2. If you compare the two listings, you will find that the CDATA section markup ("<![CDATA[" at the beginning and "]]>" at the end) has been deleted, and the "<" character in the CDATA section of Listing 1 has been replaced with its equivalent escape sequence (&lt;) in Listing 2.
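As a rough illustration of this step, the sketch below uses the canonicalize() function from Python's standard library (available from Python 3.8). Note that this function implements the later C14N 2.0 specification rather than the Canonical XML version discussed in this article, and the sample document is a made-up stand-in for Listing 1, so treat it as an approximation of the behavior described above.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical sample document (not Listing 1): a CDATA section holding a "<".
source = '<note><![CDATA[if (a < b) return;]]></note>'

# The CDATA markup is dropped and "<" is written as an escape sequence.
canonical = canonicalize(source)
print(canonical)  # <note>if (a &lt; b) return;</note>
```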
15. Processing Instructions
We need to normalize whitespace inside processing instructions. This means that the whitespace between the target and its data will be reduced to a single space (the #x20 character).
There is only one processing instruction in the XML file of Listing 2. The target in this processing instruction is xml-stylesheet, which is followed by the data string. Listing 3 is the same as Listing 2, except that all whitespace between the target and its data has been normalized.
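The same normalization can be observed with Python's standard-library canonicalize() function (Python 3.8 or later, implementing C14N 2.0). The stylesheet name and the doc element below are invented for the example and do not come from the listings.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: several spaces separate the target from its data.
source = '<?xml-stylesheet      href="style.xsl" type="text/xsl"?><doc/>'

canonical = canonicalize(source)

# The first output line holds the processing instruction, with a single
# space between the target (xml-stylesheet) and the data string.
pi_line = canonical.splitlines()[0]
print(pi_line)
```

Only the whitespace between the target and the data is collapsed in this sketch; whitespace inside the data string itself is left alone.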
16. External Entity References
Recall the section on entity references in part 1, where we demonstrated how to canonicalize parsed internal entity references. In a similar fashion, parsed external entity references also need to be replaced with the content they refer to, as shown in Listing 4.
17. Comments
The canonical XML specification allows both retaining and removing comments from an XML file. An XML canonicalization engine will receive a boolean parameter (flag) along with the XML file to be canonicalized, which will tell the canonicalization engine whether to include or exclude XML comments in the canonical form.
For example, Listing 5 shows the removal of comments from Listing 4 (canonical XML without comments).
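The boolean flag described above has a direct counterpart in Python's standard-library canonicalize() function, where it is called with_comments and defaults to False. The document below is a made-up example, and the function follows C14N 2.0, so this is only a sketch of the idea.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical sample document containing one comment.
source = '<doc><!-- internal note --><item>text</item></doc>'

# Default: canonical XML without comments.
without_comments = canonicalize(source)

# Flag set: canonical XML with comments.
with_comments = canonicalize(source, with_comments=True)

print(without_comments)  # <doc><item>text</item></doc>
print(with_comments)     # <doc><!-- internal note --><item>text</item></doc>
```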
We are now ready to apply the thirteen steps to Listing 5 (as described in part 1). The result is shown in Listing 6.
18. Document Subsets
XML document subsets or fragments (portions of complete XML files) are an interesting case. When we extract a portion from an XML file, we essentially separate a child node from its parent (call it an orphan node). This separation may invalidate the child's namespace context if that context was declared in a parent that has been omitted from the document subset.
The Canonical XML specification proposes a method to preserve the namespace context while extracting a document subset. However, there are application scenarios in which preserving the namespace context may create other problems. W3C has released a separate recommendation named Exclusive XML Canonicalization which deals with such scenarios.
The only difference between the Canonical XML and Exclusive XML Canonicalization specifications is whether the ancestor context is preserved or excluded.
Preserving the Ancestor Context
Have a look at Listing 7, which is a SOAP message. Let's assume we need to canonicalize the booking element in Listing 7 whose unitCharge attribute shows "50" as the value. The first step in doing this is to write an XPath expression that will extract the required document fragment from the XML file. While trying to identify which element I intend to canonicalize, I said "the booking element in Listing 7 whose unitCharge attribute shows '50' as the value". The equivalent XPath expression with the same meaning is
(//. | //@* | //namespace::*)[ancestor-or-self::bs:booking[@unitCharge="50"]]
(with namespace declaration xmlns:bs="http://www.FictitiousTourismInterface/BookingService")
This XPath expression will extract the required booking element from the XML file of Listing 7. The expression in the parentheses (//. | //@* | //namespace::*) selects every node of the XML file, including all element, attribute, and namespace nodes. The predicate in the outer pair of square brackets (ancestor-or-self::bs:booking) keeps only the booking elements along with their descendants, and the predicate in the inner pair of square brackets (@unitCharge="50") narrows the selection to the booking element whose unitCharge attribute has the value "50".
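To make the selection concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. ElementTree supports only a limited XPath subset (no ancestor-or-self or namespace axes), so it cannot evaluate the node-set expression above. Instead, the sketch locates the booking element that the expression is built around and walks its subtree. The SOAP message is a hypothetical stand-in for Listing 7: only the bs namespace URI is taken from the article, and the other URIs and element names are invented.

```python
import xml.etree.ElementTree as ET

BS_NS = "http://www.FictitiousTourismInterface/BookingService"

# Hypothetical stand-in for Listing 7; env and hs URIs are placeholders.
SOAP = """<env:Envelope xmlns:env="urn:example:envelope">
  <env:Body>
    <bs:bookingPackage xmlns:bs="http://www.FictitiousTourismInterface/BookingService"
                       xmlns:hs="urn:example:hotel-service" xml:lang="en">
      <bs:booking unitCharge="50"><hs:hotel>Seaside</hs:hotel></bs:booking>
      <bs:booking unitCharge="75"/>
    </bs:bookingPackage>
  </env:Body>
</env:Envelope>"""

root = ET.fromstring(SOAP)

# Find the booking element whose unitCharge attribute is "50".
apex = root.find(".//bs:booking[@unitCharge='50']", {"bs": BS_NS})

# The document subset is the apex element plus all of its descendants.
subset_tags = [element.tag for element in apex.iter()]
print(subset_tags)
```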
Listing 8 is a subset of Listing 7 and consists of the booking element. Some readers might be tempted at this point to apply the thirteen steps of part 1 to canonicalize Listing 8. However, there are a couple of problems that require additional processing before we can apply those thirteen steps:
The namespace declarations for the bs and hs prefixes were made in the booking element's parent tag, which is not included in the document subset shown in Listing 8.
The xml:lang attribute of the bookingPackage element of Listing 7 was applicable to all its children. This attribute is also missing in the document subset of Listing 8.
These problems clearly indicate that extracting document fragments should be accompanied by actions to preserve the namespace context and the effect of attributes from the xml: namespace. The Canonical XML specification requires the following measures to be taken while canonicalizing document subsets (in addition to all the requirements of canonicalizing complete XML files).
Namespace declarations in the omitted ancestors of the document subset are included in the canonical form.
Attributes in the xml namespace are also included in the canonical form, if they are not already present in the fragment being canonicalized.
These two steps are intended to conserve the ancestor context of a document subset. Have a look at Listing 9 (the required canonical form), which includes the four namespace declarations made in the ancestors of the booking element of Listing 7. Listing 9 also includes the xml:lang attribute. Also notice that the canonical form of document subsets does not have any line breaks (#xA), i.e., the entire file appears on the same line.
Once the ancestor context has been included, the ordering of namespace declarations and attributes is the same as for canonicalizing the complete XML file.
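The two measures for preserving the ancestor context can be sketched in Python as follows. This is not a canonicalization engine; it only gathers what would have to be copied onto the extracted element. The helper name ancestor_context and the sample SOAP message are hypothetical (only the bs namespace URI comes from the article), and the approach of tracking namespace events with the standard library's iterparse() is one possible way to do it, chosen here for brevity.

```python
import io
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
BS_NS = "http://www.FictitiousTourismInterface/BookingService"

# Hypothetical stand-in for Listing 7; env and hs URIs are placeholders.
SOAP = """<env:Envelope xmlns:env="urn:example:envelope">
  <env:Body>
    <bs:bookingPackage xmlns:bs="http://www.FictitiousTourismInterface/BookingService"
                       xmlns:hs="urn:example:hotel-service" xml:lang="en">
      <bs:booking unitCharge="50"><hs:hotel>Seaside</hs:hotel></bs:booking>
      <bs:booking unitCharge="75"/>
    </bs:bookingPackage>
  </env:Body>
</env:Envelope>"""


def ancestor_context(xml_text, apex_tag, attr_name, attr_value):
    """Return (in-scope namespaces, inherited xml: attributes) for the apex."""
    in_scope = []   # (prefix, uri) pairs currently in scope, innermost last
    xml_attrs = []  # xml: attributes of each open ancestor, outermost first
    events = ("start-ns", "end-ns", "start", "end")
    stream = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(stream, events):
        if event == "start-ns":
            in_scope.append(payload)
        elif event == "end-ns":
            in_scope.pop()
        elif event == "start":
            own = {name: value for name, value in payload.attrib.items()
                   if name.startswith("{" + XML_NS + "}")}
            if payload.tag == apex_tag and payload.get(attr_name) == attr_value:
                inherited = {}
                for attrs in xml_attrs:
                    inherited.update(attrs)    # nearer ancestors override
                for name in own:
                    inherited.pop(name, None)  # already present on the apex
                return dict(in_scope), inherited
            xml_attrs.append(own)
        else:  # "end": the element is closed, so its attributes go out of scope
            xml_attrs.pop()
    return None


namespaces, inherited = ancestor_context(
    SOAP, "{%s}booking" % BS_NS, "unitCharge", "50")
print(namespaces)
print(inherited)
```

For the sample message, namespaces holds the env, bs, and hs declarations made in the omitted ancestors, and inherited holds the xml:lang attribute of bookingPackage, which is the information the canonical form of the subset has to carry.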