Character Repertoire Validation for XML
January 14, 2004
In this article I present Character Repertoire Validation for XML (CRVX), a small schema language that can be used to restrict the character repertoires of XML documents. CRVX restrictions can be based on structural components of an XML document, contexts, or a combination of both.
Restricting Character Repertoires is Smart
Why would anyone want to use CRVX? In the first place, why have character repertoire restrictions at all? And second, why not use W3C XML Schema (WXS), which also provides some mechanisms for this through the pattern facet of simple type restrictions?
Why Character Repertoire Restrictions?
In many application scenarios, XML is only a small part of the overall picture, very useful for integrating components and facilitating communications, but by no means the technology governing the data model of the whole application. Furthermore, many applications are far from supporting the full Unicode character repertoire. I have yet to receive a credit card statement without some character conversion errors. For a very long time to come, it is not realistic to expect IT infrastructures to support full Unicode. Also, in many scenarios it is explicitly required to limit the supported character repertoire, because the business process workflow should only accept characters that make sense (for example, can be rendered and understood) in the specific application scenario. So character repertoire restrictions are useful for two reasons:
- Protecting legacy applications: Legacy applications often support only very limited character repertoires (for example, ASCII or an ISO 8859 variant). If they are equipped with an XML interface, such as a web service, care should be taken to accept only data that the application can actually process.
- Protecting workflows: Newer applications may support Unicode, but won't accept the whole Unicode character repertoire because doing so doesn't make sense in their business case. Consequently, these applications should be protected from unwanted characters in the same way as legacy applications.
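To make the legacy-application case concrete, here is a minimal Python sketch of the kind of check that would otherwise have to be hand-written in program code. The function name `unsupported_characters` is hypothetical, and the sketch assumes the legacy system's repertoire corresponds to a codec that Python knows (here ISO 8859-1 or ASCII).

```python
def unsupported_characters(text, encoding="iso-8859-1"):
    """Return the characters in `text` that `encoding` cannot represent.

    Assumption: the target application's repertoire is exactly the set of
    characters that the named Python codec can encode.
    """
    rejected = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            rejected.append(ch)
    return rejected


print(unsupported_characters("Zürich"))         # [] (ü is part of ISO 8859-1)
print(unsupported_characters("Łódź", "ascii"))  # ['Ł', 'ó', 'ź']
```

A check like this has to be repeated, and kept in sync, at every place where data enters the application.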
As a last point it could be asked why there should be a schema-based way to use character repertoire restrictions. Reasons for preferring schema-based approaches over writing program code include lower authoring effort, easier maintainability, no portability issues, and the general rule that declarative definitions are better than procedural code.
Why not use W3C XML Schema?
So why not use WXS for character repertoire restrictions? It has always been a good design pattern for WXS schemas to use a complete layer of application-specific simple types (i.e., WXS built-in types should never be instantiated), and this layer could be used to restrict all simple types with pattern facets, thus restricting all simple types to the Unicode character repertoires required by the application. There are several arguments against this:
- Mixed Content: Depending on the application, mixed content may not even be used, or it may represent a substantial amount of the XML's character data. Unfortunately, mixed content in W3C XML Schema does not have any type, and consequently it cannot be restricted.
- Simple Types Only: Since facets can only be used for simple types, they cannot restrict anything that is not a simple type, such as comments, processing instructions, element and attribute names, and mixed content.
- WXS Complexity: While WXS may be the schema language of choice for some application scenarios, it still is considered to be too complex by many XML users. So if the focus of an application designer is to define character repertoire restrictions, WXS is a very complex and cumbersome way to define them. Furthermore, there still is not a single complete WXS implementation. Using WXS exposes application designers to the numerous bugs found in every WXS implementation.
- Modularization: WXS was designed to define grammars for XML documents, supported by a type system that offers different ways of type derivation. Some other mechanisms, such as identity constraints, have been added to WXS as an afterthought and do not fit very well into the overall picture (because they completely ignore the types). Character repertoire restrictions could be added to a WXS schema in the same way (though only partially, as shown already), but from the modularity point of view it may be better to separate them from the grammar defined by the schema.
So the conclusion is that it may be possible to specify character repertoire restrictions using WXS, but there are limitations, and the approach might not be the best idea in the first place.
The CRVX Schema Language
CRVX is a specialized schema language with the single purpose of restricting the character repertoire of XML documents. For a general introduction to why a modular approach to validation is a good idea, the Document Schema Definition Languages (DSDL) home page describes a complete framework for modular XML validation. Note that DSDL does list "character repertoire validation" as one of its goals, but so far there has been no candidate for this task.
A CRVX schema has two ways of restricting character repertoires, and they can be combined. Restrictions can be based on structural criteria ("all element names may only use the ASCII character repertoire"), on contexts ("all content of description elements may only use ISO-8859-1 characters"), or on a combination of both ("all text content inside description elements is limited to the ASCII character repertoire"). Since the structural view of an XML document is determined by the interpretation of the structures, CRVX supports two structural views: pureXML and namespaceXML. Depending on the selected structural view, different structural components may be specified (for example, restrictions on namespace names are only possible for namespaceXML). Since CRVX uses XSLT's concept of patterns, contexts can only be used for namespaceXML.
Restrictions in CRVX can limit both the character repertoire and the length of selected structures ("all element names must be ASCII and at most 8 characters long"). Restrictions reuse two mechanisms from WXS: character class expressions (for example, [b-y]) and category escapes (for example, \p{Ll} for all lowercase letters). These constructs have essentially been taken from the Unicode Regular Expression Guidelines. The restrictions are very expressive because the category escapes reference values from the Unicode Character Database (UCD), which defines a very rich classification of Unicode characters into blocks and categories. The following example shows a small CRVX schema:
<crvx structures="namespaceXML" version="1.0" xmlns="http://dret.net/xmlns/crvx10">
  <context path="figure/caption">
    <restrict charrep="\p{IsBasicLatin} \p{IsLatin-1Supplement}"/>
    <context path="link">
      <restrict structure="elementContent" maxlength="10"/>
    </context>
  </context>
</crvx>
This schema selects the namespaceXML structural view of XML documents. Consequently, only namespace-compliant documents can be successfully validated with this schema. It defines a context that selects all caption elements appearing inside a figure element. For this context, the restriction limits all content to the ISO-8859-1 character repertoire. In addition, all link elements appearing in this context must satisfy the condition that the element content contains a maximum of ten characters. Restrictions are logically AND'd, so link elements inside the figure/caption context are effectively restricted to contain a maximum of ten ISO-8859-1 characters.
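For readers who find it easier to think procedurally, the following Python sketch approximates what the first example schema asks a validator to do. It is not the CRVX implementation: the function names are hypothetical, "all content" is assumed to mean the text content of the caption, and the nested link context is assumed to match link elements at any depth inside the caption.

```python
import xml.etree.ElementTree as ET

MAX_LINK_LENGTH = 10  # mirrors maxlength="10" in the schema


def in_latin1(text):
    """True if every character is in Basic Latin or Latin-1 Supplement."""
    return all(ord(ch) <= 0xFF for ch in text)


def check_captions(xml_text):
    """Return a list of violations found in figure/caption contexts."""
    root = ET.fromstring(xml_text)
    problems = []
    for figure in root.iter("figure"):
        # Only caption elements whose parent is a figure form the context.
        for caption in figure.findall("caption"):
            if not in_latin1("".join(caption.itertext())):
                problems.append("caption: character outside ISO-8859-1")
            # Assumption: link elements anywhere below the caption count.
            for link in caption.iter("link"):
                if len("".join(link.itertext())) > MAX_LINK_LENGTH:
                    problems.append("link: more than 10 characters")
    return problems
```

Because the repertoire check runs over the whole caption, including the text of any link inside it, the two restrictions combine for link elements just as the AND semantics described above require.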
A second example demonstrates how CRVX deals with namespaces and enables users to declare and use namespaces in CRVX schemas:
<crvx structures="namespaceXML" version="1.0" xmlns="http://dret.net/xmlns/crvx10">
  <namespace prefix="html" name="http://www.w3.org/1999/xhtml"/>
  <context path="html:html/html:head/html:title">
    <restrict charrep="\p{IsBasicLatin}"/>
  </context>
</crvx>
This is a very small example showing how a CRVX schema could be used to restrict the content of XHTML Web page titles, if for some reason the authors do not trust the ability of browsers to display non-ASCII characters in the title bar (and all the other places where titles show up). Namespaces must be declared using a dedicated element, and declared namespace prefixes may then be used in context elements.
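The role of the namespace declaration can be illustrated with a comparable Python sketch, again as an approximation rather than the actual CRVX processor. The prefix-to-name mapping plays the same part as the namespace element in the schema; the function name `non_ascii_in_titles` is hypothetical, and only text content of the title is examined.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
# Prefix mapping, analogous to the <namespace> declaration in the schema.
NAMESPACES = {"html": XHTML}


def non_ascii_in_titles(xml_text):
    """Return the non-ASCII characters found in html:html/html:head/html:title."""
    root = ET.fromstring(xml_text)
    offending = []
    # The prefix is resolved to the namespace name; the prefix used in the
    # document itself is irrelevant.
    for html in root.iter(f"{{{XHTML}}}html"):
        for title in html.findall("html:head/html:title", NAMESPACES):
            text = "".join(title.itertext())
            offending.extend(ch for ch in text if ord(ch) > 0x7F)
    return offending
```

A title element that is not in the XHTML namespace is ignored by this sketch, which is the behavior one would expect from a namespace-aware context.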
These examples have shown almost everything there is to CRVX. It has been designed as a simple schema language, and it is very easy to learn and apply. Of course, for complex document classes, the CRVX schema can get lengthy, since complex document classes tend to have many different contexts that need special restrictions. If one restriction should be applied to multiple contexts, it can carry a within attribute, which is used to refer to a named context. This way, CRVX supports reuse and makes it easier to specify maintainable schemas.
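The purely structural kind of restriction quoted earlier in this section ("all element names must be ASCII and at most 8 characters long") and the category escape for lowercase letters can be approximated in Python as follows. This is a sketch under assumptions: the function names are hypothetical, element names are taken to be local names without any namespace part, and Python's `unicodedata` module stands in for the Unicode Character Database lookups.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_lowercase_letter(ch):
    """Rough equivalent of the category escape for lowercase letters (Ll)."""
    return unicodedata.category(ch) == "Ll"


def bad_element_names(xml_text, max_length=8):
    """Return element names that are not ASCII or exceed `max_length`."""
    root = ET.fromstring(xml_text)
    bad = []
    for element in root.iter():
        # Drop a leading "{namespace}" part, if any, to get the local name.
        name = element.tag.rsplit("}", 1)[-1]
        if not name.isascii() or len(name) > max_length:
            bad.append(name)
    return bad
```

For a document such as `<doc><title/><bibliography/></doc>`, only `bibliography` would be reported, because it is longer than eight characters.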
Validation
CRVX can be implemented using XSLT, even though there are some disadvantages to this approach. The advantage is that XSLT-based implementations are very easy to deploy because XSLT processors are becoming a ubiquitous piece of XML software. The disadvantages of the XSLT-based approach include:
- XSLT 2.0: Since CRVX needs parts of WXS regular expressions, XSLT 2.0 is required. XSLT 2.0 is still in Working Draft status, and it will take some time to finish the specification and deploy implementations. Early implementations are available, but they are experimental and may change in the future.
- Namespace-compliant XML Only: Since XSLT's data model is based on the Infoset, namespace compliance is required. However, since namespace-compliant XML is the rule, this is not a big problem.
- No CDATA Sections: Another consequence of XSLT's foundation on the Infoset is the fact that CDATA sections are not part of the data model, so the structural selection of CDATA sections within documents is impossible.
For a prototypical implementation, we decided that these drawbacks were acceptable and that the effort required to implement a native solution (based on some XML API) would be too high to be justified. However, for a more efficient and complete implementation, a native solution would be preferable.
Further Work
During the work on CRVX, it became clear that additional features could be useful in some scenarios. While CRVX 1.0 might still be below the 80/20 cut, some of these features may be too exotic to be included in a future version of CRVX:
- Character References: In XML, characters may appear literally or as character references. It may be necessary to control which form is used; additional restrictions could let users specify in which of these two forms characters may appear.
- XPath 2.0: Along with XSLT 2.0, XPath 2.0 is being standardized. Upgrading CRVX to XSLT 2.0 patterns would make the context patterns of CRVX more powerful, allowing type-based patterns and other new features of XPath 2.0.
- Character Normalization: XML 1.0 as well as the upcoming XML 1.1 do not require character normalization. If this is an application requirement, CRVX (or some other mechanism dealing with characters and character encodings) probably would be the right place to put it.
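As an illustration of what a normalization requirement could look like in practice, the following Python sketch tests whether a string is already normalized. The choice of Normalization Form C (NFC) is an assumption made for the example; the text above does not name a particular form, and `is_nfc` is a hypothetical helper.

```python
import unicodedata


def is_nfc(text):
    """True if `text` is unchanged by NFC normalization (assumed form)."""
    return unicodedata.normalize("NFC", text) == text


print(is_nfc("\u00e9"))   # True: precomposed e with acute accent
print(is_nfc("e\u0301"))  # False: e followed by a combining acute accent
```

Both strings render identically, which is exactly why an application that compares or indexes text may want such a requirement enforced at the point where XML enters the system.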
Currently there are no plans to release a second version of CRVX, and seeing the lack of development in the DSDL activity, the theoretically appealing concept of modular XML validation seems to lack support from actual XML users. However, a more modular approach to XML processing in general would probably benefit many XML software projects, so we hope that at least the modular view on XML processing exemplified by CRVX is useful to give XML users some new ideas.
Conclusions
CRVX is a rather small schema language designed with a very specific goal in mind. At this point in time, it is the only schema language of its kind, but Diederik A. Gerth van Wijk as lead of the DSDL character repertoire validation activity is working on something that goes beyond CRVX's capabilities (so far there are no publications, though). So whether future character repertoire validation for XML will use CRVX's simple concepts is uncertain, but it certainly would be a good idea for XML users concerned with character repertoire validation to encapsulate their requirements in some declarative way and then process it with some component interpreting the declarations.