Character Repertoire Validation for XML
In this article I present Character Repertoire Validation for XML (CRVX), a small schema language that restricts the character repertoires used in XML documents. CRVX restrictions can be based on structural components of an XML document, contexts, or a combination of both.
Why would anyone want to use CRVX? Well, in the first place, why
have character repertoire restrictions at all? And, second, why not
use W3C XML Schema (WXS), which also provides some mechanisms for
this through the pattern facet of
simple type restrictions?
In many application scenarios, XML is only a small part of the overall picture: very useful for integrating components and facilitating communications, but by no means the technology governing the data model of the whole application. Furthermore, many applications are far from supporting the full Unicode character repertoire. I have yet to receive a credit card statement without some character conversion errors. It is not realistic to expect IT infrastructures to support full Unicode for a very long time to come. Also, many scenarios explicitly require limiting the supported character repertoire, because the business process workflow should only accept characters that make sense (for example, can be rendered and understood) in the specific application scenario. So character repertoire restrictions are useful for two reasons: existing infrastructures often cannot handle the full Unicode repertoire, and applications may deliberately accept only the characters that make sense in their scenario.
Finally, one could ask why there should be a schema-based way to express character repertoire restrictions. Reasons for preferring schema-based approaches over writing program code include lower authoring effort, easier maintainability, no portability issues, and the general rule that declarative definitions are better than procedural code.
So why not use WXS for character repertoire restrictions? It has
always been a good design pattern for WXS schemas to use a complete
layer of application-specific simple types (i.e., WXS built-in types
should never be instantiated), and this layer could be used to
restrict all simple types with pattern facets,
thus restricting all simple types to the Unicode character
repertoires required by the application. There are several arguments
against this:
So the conclusion is that it may be possible to specify character repertoire restrictions using WXS, but there are limitations, and the approach might not be the best idea in the first place.
CRVX is a specialized schema language with the single purpose of restricting the character repertoire of XML documents. For a general introduction to why a modular approach to validation is a good idea, see the Document Schema Definition Languages (DSDL) home page, which describes a complete framework for modular XML validation. Note that DSDL does list "character repertoire validation" as one of its goals, but so far there has been no candidate for this task.
A CRVX schema has two ways of restricting character
repertoires, which can also be combined. Restrictions can be based on structural criteria
("all element names may only use the ASCII character repertoire"),
on contexts ("all content of description elements may only
use ISO-8859-1 characters"), or on a combination of both ("all text
content inside description elements is limited to the
ASCII character repertoire"). Since the structural view of an XML
document depends on how its structures are interpreted, CRVX
supports two structural views: pureXML
and namespaceXML. Depending on the selected structural
view, different structural components may be specified (for example,
restrictions on namespace names are only possible
for namespaceXML). Since CRVX uses XSLT's concept of patterns, contexts can only
be used for namespaceXML.
Restrictions in CRVX can limit both the character repertoire and
the length of selected structures ("all element names must be ASCII
and at most 8 characters long"). Restrictions reuse two mechanisms from
WXS: character class expressions (for
example [b-y]) and category escapes (for
example \p{Ll} for all lowercase letters). These
constructs have been taken from the Unicode Regular Expression Guidelines. The
restrictions are very expressive because the
category escapes reference values from the Unicode Character Database (UCD), which
defines a very rich classification of Unicode characters into
blocks and categories. The following example shows a small CRVX
schema:
<crvx structures="namespaceXML" version="1.0" xmlns="http://dret.net/xmlns/crvx10">
  <context path="figure/caption">
    <restrict charrep="\p{IsBasicLatin} \p{IsLatin-1Supplement}"/>
    <context path="link">
      <restrict structure="elementContent" maxlength="10"/>
    </context>
  </context>
</crvx>
This schema selects the namespaceXML structural view
of XML documents. Consequently, only namespace-compliant documents
can be successfully validated with this schema. It defines a
context, selecting all caption elements appearing
inside figure elements. For these contexts, the
restriction limits all content to the ISO-8859-1
character repertoire. In addition, all link elements
appearing in this context must satisfy the condition that the
element content contains a maximum of ten characters. Restrictions
are logically AND'd, so that link elements inside
the figure/caption context are effectively restricted
to contain a maximum of ten ISO-8859-1 characters.
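To make the semantics of this schema concrete, the following Python sketch emulates its two restrictions by hand. It is not a CRVX implementation. The function names and the report format are hypothetical, only text content is checked (not element or attribute names), and ElementTree's limited path syntax stands in for XSLT patterns, so the figure element must not be the document root here.

```python
import xml.etree.ElementTree as ET

# Code point ranges of the two Unicode blocks named in the schema;
# together they cover the ISO-8859-1 repertoire.
BASIC_LATIN = range(0x0000, 0x0080)
LATIN_1_SUPPLEMENT = range(0x0080, 0x0100)


def in_repertoire(text, blocks):
    """Return True if every character of text lies in one of the blocks."""
    return all(any(ord(ch) in block for block in blocks) for ch in text)


def check_captions(xml_source, max_link_length=10):
    """Return a list of violation messages (hypothetical report format)."""
    violations = []
    root = ET.fromstring(xml_source)
    # ".//figure/caption" approximates the context path "figure/caption"
    # (assumption: figure is a descendant of the root, not the root itself).
    for caption in root.findall(".//figure/caption"):
        content = "".join(caption.itertext())
        if not in_repertoire(content, (BASIC_LATIN, LATIN_1_SUPPLEMENT)):
            violations.append("caption: character outside ISO-8859-1")
        # Nested context: link elements anywhere inside this caption
        # (assumption about how nested contexts are scoped).
        for link in caption.iter("link"):
            link_text = "".join(link.itertext())
            if len(link_text) > max_link_length:
                violations.append(
                    f"link: content longer than {max_link_length} characters"
                )
    return violations
```

Because the link text is part of the caption's text content, a link must pass both checks, which mirrors the AND semantics described above.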
A second example demonstrates how CRVX deals with namespaces and enables users to declare and use namespaces in CRVX schemas:
<crvx structures="namespaceXML" version="1.0" xmlns="http://dret.net/xmlns/crvx10">
  <namespace prefix="html" name="http://www.w3.org/1999/xhtml"/>
  <context path="html:html/html:head/html:title">
    <restrict charrep="\p{IsBasicLatin}"/>
  </context>
</crvx>
This is a very small example showing how a CRVX schema could be
used to restrict the content of XHTML Web page titles, if for some
reason the authors do not trust the ability of browsers to display
non-ASCII characters in the title bar (and all the other places
where titles show up). Namespaces must be declared using a dedicated
element, and declared namespace prefixes may then be used
in context elements.
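The same check can be approximated by hand to show what the namespace declaration buys. The Python sketch below is an illustration under assumptions, not part of CRVX: the function name is hypothetical, only text content is tested, and html:html is assumed to be the document element. The prefix mapping plays the role of the schema's namespace element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Prefix-to-namespace mapping, mirroring the schema's namespace element.
NAMESPACES = {"html": XHTML_NS}


def title_is_ascii(xml_source):
    """Return True if all html:html/html:head/html:title content is Basic Latin."""
    root = ET.fromstring(xml_source)
    # Assumption: the context only applies when the root is html:html;
    # elements in other namespaces (or none) are simply not selected.
    if root.tag != f"{{{XHTML_NS}}}html":
        return True
    for title in root.findall("html:head/html:title", NAMESPACES):
        text = "".join(title.itertext())
        if any(ord(ch) > 0x7F for ch in text):
            return False
    return True
```

A title element outside the XHTML namespace is not matched by the context, so it passes regardless of its content.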
These examples have shown almost everything there is to CRVX. It
has been designed as a simple schema language, and it is very easy
to learn and apply. Of course, for complex document classes, the
CRVX schema can get lengthy since complex document classes tend to
have many different contexts that need special restrictions. If one
restriction should be applied to multiple contexts, it can carry
a within attribute, which is used to refer to a named
context. This way, CRVX supports reuse and makes it easier to
specify maintainable schemas.
CRVX can be implemented using XSLT, even though there are some disadvantages to this approach. The advantage is that XSLT-based implementations are very easy to deploy because XSLT processors are becoming a ubiquitous piece of XML software. The disadvantages of the XSLT-based approach include:
For a prototypical implementation, we decided that these drawbacks were acceptable and that the effort required to implement a native solution (based on some XML API) would be too high to be justified. However, for a more efficient and complete implementation, a native solution would be preferable.
During the work on CRVX, it became clear that additional features could be useful in some scenarios. While CRVX 1.0 might still be below the 80/20 cut, some of these features may be too exotic to be included in a future version of CRVX:
Currently there are no plans to release a second version of CRVX, and seeing the lack of development in the DSDL activity, the theoretically appealing concept of modular XML validation seems to lack support from actual XML users. However, a more modular approach to XML processing in general would probably benefit many XML software projects, so we hope that at least the modular view on XML processing exemplified by CRVX is useful to give XML users some new ideas.
CRVX is a rather small schema language designed with a very specific goal in mind. At this point in time, it is the only schema language of its kind, but Diederik A. Gerth van Wijk as lead of the DSDL character repertoire validation activity is working on something that goes beyond CRVX's capabilities (so far there are no publications, though). So whether future character repertoire validation for XML will use CRVX's simple concepts is uncertain, but it certainly would be a good idea for XML users concerned with character repertoire validation to encapsulate their requirements in some declarative way and then process it with some component interpreting the declarations.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.