Character Repertoire Validation for XML

January 14, 2004

In this article I present Character Repertoire Validation for XML (CRVX), a small schema language for restricting the use of character repertoires in XML documents. CRVX restrictions can be based on structural components of an XML document, contexts, or a combination of both.

Restricting Character Repertoires is Smart

Why would anyone want to use CRVX? Well, in the first place, why have character repertoire restrictions at all? And, second, why not use W3C XML Schema (WXS), which also provides some mechanisms for this through the pattern facet of simple type restrictions?

Why Character Repertoire Restrictions?

In many application scenarios, XML is only a small part of the overall picture, very useful for integrating components and facilitating communications, but by no means the technology governing the data model of the whole application. Furthermore, many applications are far from supporting the full Unicode character repertoire. I have yet to receive a credit card statement without some character conversion errors. For a very long time to come, it is not realistic to expect IT infrastructures to support full Unicode. Also, in many scenarios it is explicitly required to limit the supported character repertoire, because the business process workflow should only accept characters that make sense (for example, can be rendered and understood) in the specific application scenario. So character repertoire restrictions are useful for two reasons:

Protecting legacy applications: Legacy applications often support only very limited character repertoires (for example ASCII or an ISO 8859 variant). If they are equipped with an XML interface, such as a web service, care should be taken to accept only data that the application can process.
Protecting workflows: Newer applications may support Unicode, but won't accept the whole Unicode character repertoire because doing so doesn't make sense in their business case. Consequently, these applications should be protected from unwanted characters in the same way as legacy applications.
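
To make the idea of "protecting" an application concrete, here is a minimal Python sketch of a repertoire check written as program code. It is not part of CRVX; the function names fits_repertoire and first_offender are hypothetical, and the sketch assumes that the legacy repertoire can be expressed as one of Python's standard codecs.

```python
def fits_repertoire(text: str, encoding: str) -> bool:
    """Return True if every character of text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def first_offender(text: str, encoding: str):
    """Return (index, character) of the first unencodable character, or None."""
    for index, ch in enumerate(text):
        if not fits_repertoire(ch, encoding):
            return index, ch
    return None
```

A gateway in front of an ISO-8859-1 application could call fits_repertoire(value, "iso-8859-1") on incoming character data and reject the message when the check fails. This is exactly the kind of hand-written check that a declarative schema is meant to replace.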

As a last point it could be asked why there should be a schema-based way to use character repertoire restrictions. Reasons for preferring schema-based approaches over writing program code include lower authoring effort, easier maintainability, no portability issues, and the general rule that declarative definitions are better than procedural code.

Why not use W3C XML Schema?

So why not use WXS for character repertoire restrictions? It has always been a good design pattern for WXS schemas to use a complete layer of application-specific simple types (i.e., WXS built-in types should never be instantiated), and this layer could be used to restrict all simple types with pattern facets, thus restricting all simple types to the Unicode character repertoires required by the application. There are several arguments against this:

Mixed Content: Depending on the application, mixed content may not even be used, or it may represent a substantial amount of the XML's character data. Unfortunately, mixed content in W3C XML Schema does not have any type, and consequently it cannot be restricted.
Simple Types Only: Since facets can only be used for simple types, they cannot restrict anything that is not a simple type, such as comments, processing instructions, element and attribute names, and mixed content.
WXS Complexity: While WXS may be the schema language of choice for some application scenarios, it still is considered to be too complex by many XML users. So if the focus of an application designer is to define character repertoire restrictions, WXS is a very complex and cumbersome way to define them. Furthermore, there still is not a single complete WXS implementation. Using WXS exposes application designers to the numerous bugs found in every WXS implementation.
Modularization: WXS was designed to define grammars for XML documents, supported by a type system that offers different ways of type derivation. Some other mechanisms have been added to WXS as an afterthought, such as identity constraints, which do not fit very well into the overall picture (because they completely ignore the types). In the same way, character repertoire restrictions could be added (though only partially, as shown already) to a WXS schema, but it may be better from the modularity point of view to separate them from the grammar defined by the schema.

So the conclusion is that it may be possible to specify character repertoire restrictions using WXS, but there are limitations, and the approach might not be the best idea in the first place.

The CRVX Schema Language

CRVX is a specialized schema language with the single purpose of restricting the character repertoire of XML documents. For a general introduction to why a modular approach to validation is a good idea, the Document Schema Definition Languages (DSDL) home page describes a complete framework for modular XML validation. Note that DSDL does list "character repertoire validation" as one of its goals, but so far there has been no candidate for this task.

A CRVX schema can restrict character repertoires in two ways, which may also be combined. Restrictions can be based on structural criteria ("all element names may only use the ASCII character repertoire"), on contexts ("all content of description elements may only use ISO-8859-1 characters"), or on a combination of both ("all text content inside description elements is limited to the ASCII character repertoire"). Since the structural view of an XML document is determined by the interpretation of the structures, CRVX supports two structural views: pureXML and namespaceXML. Depending on the selected structural view, different structural components may be specified (for example, restrictions on namespace names are only possible for namespaceXML). Since CRVX uses XSLT's concept of patterns, contexts can only be used for namespaceXML.

Restrictions in CRVX can limit the character repertoire and the length of selected structures ("all element names must be ASCII and at most 8 characters long"). Restrictions reuse two mechanisms from WXS regular expressions: character class expressions (for example [b-y]) and category escapes (for example \p{Ll} for all lowercase letters). These constructs are essentially taken from the Unicode Regular Expression Guidelines. The restrictions are very expressive because the category escapes reference values from the Unicode Character Database (UCD), which classifies Unicode characters into a rich set of blocks and categories. The following example shows a small CRVX schema:

<crvx structures="namespaceXML" version="1.0" xmlns="http://dret.net/xmlns/crvx10">
  <context path="figure/caption">
    <restrict charrep="\p{IsBasicLatin} \p{IsLatin-1Supplement}"/>
    <context path="link">
      <restrict structure="elementContent" maxlength="10"/>
    </context>
  </context>
</crvx>

This schema selects the namespaceXML structural view of XML documents. Consequently, only namespace-compliant documents can be successfully validated with this schema. It defines a context that selects all caption elements appearing inside figure elements. For these contexts, the restriction limits all content to the ISO-8859-1 character repertoire. In addition, all link elements appearing in this context must satisfy the condition that the element content contains a maximum of ten characters. Restrictions are logically AND'd, so that link elements inside the figure/caption context are effectively restricted to contain a maximum of ten ISO-8859-1 characters.
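
For readers who prefer to see the semantics spelled out procedurally, the following Python sketch approximates what this schema asks for. It is only an illustration under several assumptions: it checks text content only (the schema's first restriction covers all structures in the context), it treats "link elements appearing in this context" as any link descendant of the caption, and it takes element content to mean the concatenated character data of the element. The names in_blocks and check_captions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Unicode blocks named in the schema, as inclusive code point ranges.
BASIC_LATIN = (0x0000, 0x007F)
LATIN_1_SUPPLEMENT = (0x0080, 0x00FF)


def in_blocks(text, blocks):
    """True if every character of text falls into one of the given blocks."""
    return all(any(lo <= ord(ch) <= hi for lo, hi in blocks) for ch in text)


def check_captions(xml_source):
    """Return a list of violation messages for figure/caption contexts."""
    root = ET.fromstring(xml_source)
    problems = []
    for figure in root.iter("figure"):
        # Only caption elements whose parent is a figure form the context.
        for caption in figure.findall("caption"):
            content = "".join(caption.itertext())
            if not in_blocks(content, (BASIC_LATIN, LATIN_1_SUPPLEMENT)):
                problems.append("caption: character outside ISO-8859-1")
            # Assumption: the nested context covers all link descendants.
            for link in caption.iter("link"):
                if len("".join(link.itertext())) > 10:
                    problems.append("link: content longer than 10 characters")
    return problems
```

Because the caption check already includes the text of nested link elements, a link inside a caption is subject to both conditions, which mirrors the AND semantics described above.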

A second example demonstrates how CRVX deals with namespaces and enables users to declare and use namespaces in CRVX schemas:

<crvx structures="namespaceXML" version="1.0" xmlns="http://dret.net/xmlns/crvx10">
  <namespace prefix="html" name="http://www.w3.org/1999/xhtml"/>
  <context path="html:html/html:head/html:title">
    <restrict charrep="\p{IsBasicLatin}"/>
  </context>
</crvx>

This is a very small example showing how a CRVX schema could be used to restrict the content of XHTML Web page titles, if for some reason the authors do not trust the ability of browsers to display non-ASCII characters in the title bar (and all the other places where titles show up). Namespaces must be declared using a dedicated element, and declared namespace prefixes may then be used in context elements.
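
The role of the namespace declaration can be illustrated with a short Python sketch. It is not a CRVX implementation; it only shows that a prefix such as html is a local name for the namespace URI and that the prefix used in the validated document does not matter. The function name non_ascii_titles is hypothetical, and the sketch checks text content only.

```python
import xml.etree.ElementTree as ET

# Prefix binding corresponding to the schema's namespace element.
NAMESPACES = {"html": "http://www.w3.org/1999/xhtml"}


def non_ascii_titles(xml_source):
    """Return the text of every html/head/title whose content is not pure ASCII."""
    root = ET.fromstring(xml_source)
    html_tag = "{%s}html" % NAMESPACES["html"]
    offenders = []
    for html in root.iter(html_tag):
        # The path is resolved against the namespace URI, not the prefix.
        for title in html.findall("html:head/html:title", NAMESPACES):
            text = "".join(title.itertext())
            if not text.isascii():
                offenders.append(text)
    return offenders
```

A document that binds the XHTML namespace to a different prefix, or uses it as the default namespace, is matched in the same way, while a title element in no namespace is ignored.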

These examples have shown almost everything there is to CRVX. It has been designed as a simple schema language, and it is very easy to learn and apply. Of course, for complex document classes, the CRVX schema can get lengthy since complex document classes tend to have many different contexts that need special restrictions. If one restriction should be applied to multiple contexts, it can carry a within attribute, which is used to refer to a named context. This way, CRVX supports reuse and makes it easier to specify maintainable schemas.

Validation

CRVX can be implemented using XSLT, even though there are some disadvantages to this approach. The advantage is that XSLT-based implementations are very easy to deploy because XSLT processors are becoming a ubiquitous piece of XML software. The disadvantages of the XSLT-based approach include:

XSLT 2.0: Since CRVX needs parts of WXS regular expressions, XSLT 2.0 is required. XSLT 2.0 is still in Working Draft status, and it will take some time to finish the specification and deploy implementations. Early implementations are available, but they are experimental and may change in the future.
Namespace-compliant XML Only: Since XSLT's data model is based on the Infoset, namespace compliance is required. However, since namespace-compliant XML is the rule, this is not a big problem.
No CDATA Sections: Another consequence of XSLT's foundation on the Infoset is the fact that CDATA sections are not part of the data model, so the structural selection of CDATA sections within documents is impossible.
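
The CDATA limitation is easy to observe in any tree-based data model. The following Python sketch uses the standard library's ElementTree, which is not XSLT's data model but behaves analogously in this respect; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

with_cdata = "<note><![CDATA[5 < 7]]></note>"
with_escape = "<note>5 &lt; 7</note>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_escape = ET.fromstring(with_escape).text

# Both documents yield the same character data; the CDATA markup is gone,
# so nothing is left in the tree that a restriction could select.
same_text = text_from_cdata == text_from_escape
```

Once the parser has built the tree, the two documents are indistinguishable, which is why a validator working on such a tree cannot treat CDATA sections as a separate structure.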

For a prototypical implementation, we decided that these drawbacks were acceptable and that the effort required to implement a native solution (based on some XML API) would be too high to be justified. However, for a more efficient and complete implementation, a native solution would be preferable.

Further Work

During the work on CRVX, it became clear that additional features could be useful in some scenarios. While CRVX 1.0 might still be below the 80/20 cut, some of these features may be too exotic to be included in a future version of CRVX:

Character References: In XML, characters may appear literally or as character references. It may be necessary to control whether characters appear in one form or the other; additional restrictions could let users specify which of these two forms is allowed.
XPath 2.0: Along with XSLT 2.0, XPath 2.0 is being standardized. Upgrading CRVX to XSLT 2.0 patterns would make the context patterns of CRVX more powerful, allowing type-based patterns and other new features of XPath 2.0.
Character Normalization: XML 1.0 as well as the upcoming XML 1.1 do not require character normalization. If this is an application requirement, CRVX (or some other mechanism dealing with characters and character encodings) probably would be the right place to put it.
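
Two of these points can be sketched in a few lines of Python. The sketch is not part of CRVX: it assumes Normalization Form C as the target form, uses the standard library's unicodedata module, and the helper name is_nfc is hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

# A literal character and a character reference parse to the same text,
# so a tree-based validator cannot tell the two forms apart.
literal = ET.fromstring("<w>caf\u00e9</w>").text
reference = ET.fromstring("<w>caf&#233;</w>").text


def is_nfc(text):
    """True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


composed = "caf\u00e9"      # e with acute as one code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent
```

The first part suggests why controlling character references would need access to the document before parsing. The second part shows that a normalization check is a simple per-string test that could sit next to repertoire restrictions.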

Currently there are no plans to release a second version of CRVX, and seeing the lack of development in the DSDL activity, the theoretically appealing concept of modular XML validation seems to lack support from actual XML users. However, a more modular approach to XML processing in general would probably benefit many XML software projects, so we hope that at least the modular view on XML processing exemplified by CRVX is useful to give XML users some new ideas.

Conclusions

CRVX is a rather small schema language designed with a very specific goal in mind. At this point in time, it is the only schema language of its kind, but Diederik A. Gerth van Wijk, as lead of the DSDL character repertoire validation activity, is working on something that goes beyond CRVX's capabilities (so far there are no publications, though). So whether future character repertoire validation for XML will use CRVX's simple concepts is uncertain, but it certainly would be a good idea for XML users concerned with character repertoire validation to encapsulate their requirements in some declarative way and then process them with some component that interprets the declarations.

Resources

[1] CRVX Project Page
[2] DSDL Home Page