XML.com: XML From the Inside Out

XML Canonicalization
by Bilal Siddiqui

4. Double quotes for Attribute values

In canonical form, attribute values must be delimited by double quotes only. Have a look at Listing 4, where the value of the name attribute of the product element is enclosed in single quotes. In Listing 5, we have replaced the single quotes around the name attribute value with double quotes.
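As a quick check of this rule, here is a minimal sketch that uses the canonicalize function from Python's standard xml.etree.ElementTree module (Python 3.8 or later). Two assumptions apply: that function implements Canonical XML 2.0 rather than the version discussed in this article, and the product element and its name value are made up for illustration rather than copied from Listing 4. For an input this simple, the two versions should agree.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: the attribute value is delimited by single quotes.
source = "<product name='Widget'>Sample</product>"

canonical = canonicalize(source)
print(canonical)  # <product name="Widget">Sample</product>
```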

5. Special Characters in Attribute Values and Character Content

When we replaced the single quotes with double quotes, we introduced a problem in Listing 5. It is no longer a well-formed XML file, because the value of the name attribute itself contains double quotes. To solve this problem, the Canonical XML specification requires that special characters in attribute values and character content be replaced with character or entity references (e.g. &quot; for a double quote in an attribute value). Listing 6 is the result of applying this rule to Listing 5.
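The escaping rule can be observed with a short sketch. It assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize function implements Canonical XML 2.0 (a later version than the one described here, but with the same escaping for this case). The product element and its content are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: a double quote inside a single-quoted attribute value,
# plus '<', '&' and a bare '>' in the character content.
source = """<product name='15" monitor'>4 &lt; 5 &amp; 6 > 2</product>"""

canonical = canonicalize(source)
print(canonical)
# <product name="15&quot; monitor">4 &lt; 5 &amp; 6 &gt; 2</product>
```

Note that the double quote is only escaped inside the attribute value, while the bare '>' is only escaped in the character content.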

6. Entity References

Listing 6 contains a DTD declaration, which defines an entity named testhistory. The testhistory entity is referenced by the comments element content.

Canonical XML requires that every entity reference be replaced with the content the entity represents (in Listing 6, the testhistory entity represents the string "Part has been tested according to the specified standards."). Listing 7 is the resulting XML file after the entity references in Listing 6 have been replaced.
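Entity replacement can be sketched as follows. The example assumes Python 3.8 or later and relies on the standard library's xml.etree.ElementTree.canonicalize, which implements Canonical XML 2.0 and whose underlying parser expands entities declared in the internal DTD subset. The testhistory entity and its text come from the article; the surrounding document is a simplified stand-in for Listing 6.

```python
from xml.etree.ElementTree import canonicalize

# Simplified stand-in for Listing 6: an internal entity referenced in content.
source = (
    '<!DOCTYPE comments [\n'
    '  <!ENTITY testhistory "Part has been tested according to the specified standards.">\n'
    ']>\n'
    '<comments>&testhistory;</comments>'
)

canonical = canonicalize(source)
print(canonical)
```

The printed result should contain the entity's text in place of the &testhistory; reference, with no DOCTYPE left in the output.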

7. Default Attributes

The DTD declaration in Listing 7 defines a default attribute named approved for each part element. None of the part tags in Listing 7 contains this attribute.

Canonical XML requires that default attributes be written out explicitly in the canonical form. Listing 8 is the result of adding the approved attribute, with its default value, to each part element in Listing 7.
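A minimal sketch of this rule, assuming Python 3.8 or later (xml.etree.ElementTree.canonicalize, a Canonical XML 2.0 implementation whose parser applies attribute defaults declared in the internal DTD subset). The default value "yes" and the id attributes are assumptions made for illustration; the article does not state the actual default.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical DTD: every part element gets approved="yes" unless it says otherwise.
source = (
    '<!DOCTYPE product [\n'
    '  <!ATTLIST part approved CDATA "yes">\n'
    ']>\n'
    '<product><part id="p1"/><part id="p2" approved="no"/></product>'
)

canonical = canonicalize(source)
print(canonical)
```

The first part element should gain approved="yes" in the output, while the second keeps its explicit approved="no".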

8. XML and DTD declarations

The canonical form contains neither the XML declaration nor the document type declaration, so both must be removed. Although we relied on the DTD while replacing entity references and adding default attributes, the XML and DTD declarations themselves need to be removed, as shown in Listing 9.
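The removal of both declarations can be checked with a small sketch. It assumes Python 3.8 or later and its xml.etree.ElementTree.canonicalize function (a Canonical XML 2.0 implementation, which treats these declarations the same way); the product document is hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical document with an XML declaration and a document type declaration.
source = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE product [\n'
    '  <!ELEMENT product (#PCDATA)>\n'
    ']>\n'
    '<product>Widget</product>'
)

canonical = canonicalize(source)
print(canonical)  # <product>Widget</product>
```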

9. White Space outside the Document Element

A Canonical XML document starts with the '<' character. This means that there should be no white space before the first node.
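As a brief illustration, the sketch below feeds a document surrounded by blank lines and spaces to Python's xml.etree.ElementTree.canonicalize (Python 3.8 or later, implementing Canonical XML 2.0, which handles this white space the same way). The product element is hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input with white space before and after the document element.
source = '\n\n   <product>Widget</product>\n\n'

canonical = canonicalize(source)
print(repr(canonical))  # '<product>Widget</product>'
```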

10. White Space in Start and End Elements

Start and end tags must have normalized white space in canonical form. This means there should be:

  • No white space between the left angle bracket ('<') and the element name in a start tag. Similarly, there should be no white space between the slash ('/') and the element name in an end tag.
  • A single #x20 character between the element name and the first attribute name, if present.
  • No white space before or after the equals sign in attribute-value pairs.
  • A single #x20 character between attribute-value pairs.
  • No white space following the closing double quote of the last attribute's value.
  • If there are no attributes, there should be no white space between the element name and the right angle bracket '>'.

Listing 10 is the result of normalizing white space in start and end elements of Listing 9.
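The white space rules listed above can be seen at work in the following sketch. It assumes Python 3.8 or later and uses xml.etree.ElementTree.canonicalize, which implements Canonical XML 2.0 but serializes tags the same way. The part element and its attributes are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input with extra spaces and a line break inside the tags.
source = """<part   id = "p1"
      color = "red"   >tested</part  >"""

canonical = canonicalize(source)
print(canonical)  # <part color="red" id="p1">tested</part>
```

The attributes also change places in the output, because the library applies the attribute ordering rule at the same time as it normalizes white space.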

11. Empty Elements

Canonical XML requires a start tag and end tag pair for every element, including empty elements. Therefore, all empty elements of the form <emptyElement/> need to be converted to <emptyElement></emptyElement>. Listing 11 shows the result of applying this rule to Listing 10.
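A minimal sketch of the empty element rule, assuming Python 3.8 or later (xml.etree.ElementTree.canonicalize, which implements Canonical XML 2.0 and expands empty elements in the same way). The parts and part elements are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input using the short form for empty elements.
source = '<parts><part/><part id="2" /></parts>'

canonical = canonicalize(source)
print(canonical)  # <parts><part></part><part id="2"></part></parts>
```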

12. Namespace Declarations

Listing 11 contains three namespace declarations: two in the product element and one in the second part element. Canonical XML requires that all namespace declarations be preserved unchanged (along with their namespace prefixes), except for superfluous ones, that is, declarations that have no effect on the namespace context of any node in the XML file.

The namespace declaration in the second part element in Listing 11 is superfluous. You can remove it from the element with no effect on the namespace context of any node in the file. That is why Listing 12 omits this namespace declaration while keeping the rest of Listing 11 unchanged.
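The sketch below shows a superfluous declaration being dropped. It assumes Python 3.8 or later and uses xml.etree.ElementTree.canonicalize. An important caveat: that function implements Canonical XML 2.0, which goes further than the rule described here and also omits declarations for namespaces that no element or attribute name uses, so the example sticks to a namespace that is in use. The prefix p and the URI urn:example:products are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: the part element repeats a declaration that is already
# in scope from its parent, so the repeat changes nothing.
source = (
    '<p:product xmlns:p="urn:example:products">'
    '<p:part xmlns:p="urn:example:products"/>'
    '</p:product>'
)

canonical = canonicalize(source)
print(canonical)
# <p:product xmlns:p="urn:example:products"><p:part></p:part></p:product>
```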

13. Ordering of Namespace Declarations and Attributes

Canonical XML requires that namespace declarations and attributes appear in ascending lexicographic order. Inside a start tag, all namespace declarations come first, followed by the attribute-value pairs. Listing 13 shows what Listing 12 looks like after the ordering rule is applied.

Listing 13 is the final canonical form of all listings from 3 to 12.
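The ordering rule from step 13 can be sketched with Python's xml.etree.ElementTree.canonicalize (Python 3.8 or later). As an assumption to keep in mind, that function implements Canonical XML 2.0; for this input it orders declarations and attributes as described above. All prefixes, URIs, and attribute names in the example are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: declarations and attributes deliberately out of order.
source = (
    '<b:product zeta="1" xmlns:b="urn:example:b" a:code="3" '
    'alpha="2" xmlns:a="urn:example:a"/>'
)

canonical = canonicalize(source)
print(canonical)
# <b:product xmlns:a="urn:example:a" xmlns:b="urn:example:b" alpha="2" zeta="1" a:code="3"></b:product>
```

The namespace declarations come first, sorted by prefix. The attributes follow: those without a namespace sort by name, and the namespace-qualified a:code comes after them.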

For a bit of variety, we have provided another example in Listings 14 and 15 (Listing 15 is the canonical form of Listing 14). We did not produce Listing 15 by hand; we generated it with the Canonical XML implementation from IBM alphaWorks, which is part of their XML Security Suite (see Resources). However, curious readers may start with Listing 14 and follow the thirteen steps described above to arrive at Listing 15.

Note: There is no DOCTYPE declaration in Listing 14. Therefore, some of the canonicalization steps such as replacing entity references and adding default attributes are not relevant in this case.

Conclusion

In the second article in this series, we will take these ideas further and discuss more advanced topics, such as dealing with parts of XML documents, CDATA sections, comments, and processing instructions. We will also discuss tricky situations where the canonicalization process renders XML documents useless for their intended function.

Resources
