A Compact Syntax for W3C XML Schema

August 27, 2003

Erik Wilde

W3C XML Schema (WXS) is a very powerful but also rather complex schema language. One of the problems when working with WXS is that it uses an XML syntax, which makes schemas verbose and hard to read. In this article I describe a compact text-based syntax for WXS, called XML Schema Compact Syntax (XSCS), which reuses well-known syntactic constructs from DTDs. I also present a Java implementation for converting the compact syntax to the XML syntax and vice versa.

Introduction

The W3C XML Schema specification is based on the model of schema components, which are abstract representations of various WXS constructs (such as simple types, complex types, attributes, elements, and various other things). W3C XML Schema also defines an XML representation of these components, but the separation of the specification into the abstract components and the XML syntax makes it obvious that WXS's XML syntax can be replaced.

WXS XML syntax is meant to be consumed by machines; it can be parsed and transformed using standard XML technologies and thus fits well into the XML landscape. However, XML is verbose. And the WXS XML syntax is often criticized as being too complex. Indeed it is a complex language, but the syntactic complexity could be alleviated by introducing a new syntax which is more appropriate for human users. This approach has been inspired by RELAX NG Compact Syntax, which defines an alternative syntax for RELAX NG's XML syntax. RELAX NG's compact syntax has become quite popular and makes it much easier for beginners to start using the language and for experts to be able to deal with complex schemas. XSCS's goal is to accomplish the same for WXS.

When working with WXS, the syntax in many cases is a problem, especially when schemas are large and hard to read. As a result, WXS development tools in most cases invent their own representation, often a graphical one, as shown in the Figure below.

Figure: Two Examples for XML Schema GUIs (two screenshots)

These figures show that graphical representations can differ: both show the same schema fragment, rendered by two different commercially available tools. The fragment is taken from the Schema for Schemas (the complete Schema for Schemas in compact syntax is available at XMLSchema.xsd in XSCS syntax):

<xs:element name="complexContent" id="complexContent">
  <xs:complexType>
    <xs:complexContent>
      <xs:extension base="xs:annotated">
        <xs:choice>
          <xs:element name="restriction" type="xs:complexRestrictionType"/>
          <xs:element name="extension" type="xs:extensionType"/>
        </xs:choice>
        <xs:attribute name="mixed" type="xs:boolean"/>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:element>

In XSCS syntax (as taken from XMLSchema.xsd in XSCS syntax), this schema fragment is represented as follows (the mixed attribute name is preceded by a backslash because it is a keyword and thus must be escaped when used literally):

element complexContent extends xs:annotated {
  ( restriction { xs:complexRestrictionType } | extension { xs:extensionType } )
    attribute \mixed { xs:boolean } }

This is only a small introductory example to demonstrate the difference between the verbose XML syntax and the more readable compact XSCS syntax. In the following sections we will look at a variety of XSCS constructs in a more systematic way. XSCS is intended as a new "interface" to WXS and is, thus, roughly comparable to graphical representations. However, being a character-based interface, it is not as comfortable and interactive as a graphical interface, but on the other hand can be easily exchanged, can be easily supported by existing schema tools on any platform, and provides a minimal common ground for schema representations geared toward human users.

DTD-like Constructs

Since WXS extends many concepts from DTDs, it makes sense to reuse DTD syntax where possible, enabling users to make the transition from DTDs to WXS more quickly. Assuming a very simple content model definition, we can illustrate this using the following example:

<element name="page">
 <complexType>
  <sequence>
   <element name="head" type="string"/>
   <element name="section" type="string" maxOccurs="4"/>
   <element name="foot" type="string"/>
  </sequence>
 </complexType>
</element>

In this case, the page element type is declared as a sequence of three other elements. The fact that the page element type is a sequence can also be expressed in DTDs. However, two additional aspects cannot be expressed in DTDs: one is the maxOccurs="4" specifier, the other is the fact that the elements are locally declared. XSCS has two semantically equivalent notations to reflect local declarations:

element page { ( head { string }, section { string } [,4], foot { string } ) }

In this case, the element declarations appear directly inside the content model, mirroring closely the structure of the XML syntax. But this clutters the content model, which may become hard to read if it is more complex and contains many local declarations. Consequently, there is an alternative notation:

element page { ( head, section [,4], foot )
 element head { string }
 element section { string }
 element foot { string } }

In this second case, the content model looks a lot like a DTD content model, and the local element declarations are outside the content model (but still inside the declaration of the page element, since they are local). In this notation it's easy to recognize the reuse of DTD syntax in the content model, as well as an extension, the [,4] qualifier, which is used to represent the maxOccurs="4" attribute of the XML syntax.
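The mapping from the occurrence qualifier to the XML attributes can be sketched in Java as follows. This is an illustration only, not code from the XSCS tools: the class and method names are hypothetical, and reading the first slot of the qualifier as minOccurs is an assumption, since the example above only shows the [,4] form. The XSCS specification is the authority on the exact grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the XSCS distribution. */
final class OccurrenceQualifier {
    // Assumed shape "[min,max]" in which either bound may be left out, as in "[,4]".
    private static final Pattern QUALIFIER =
            Pattern.compile("\\[\\s*(\\d*)\\s*,\\s*(\\d*)\\s*\\]");

    /** Returns the attribute text for an xs:element, e.g. maxOccurs="4". */
    static String toAttributes(String qualifier) {
        Matcher m = QUALIFIER.matcher(qualifier.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not an occurrence qualifier: " + qualifier);
        }
        StringBuilder attrs = new StringBuilder();
        if (!m.group(1).isEmpty()) {
            attrs.append("minOccurs=\"").append(m.group(1)).append('"');
        }
        if (!m.group(2).isEmpty()) {
            if (attrs.length() > 0) {
                attrs.append(' ');
            }
            attrs.append("maxOccurs=\"").append(m.group(2)).append('"');
        }
        // An omitted bound produces no attribute, leaving the WXS default of 1 in place.
        return attrs.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAttributes("[,4]"));
    }
}
```

An omitted bound is left out of the result instead of being written as an explicit default, which keeps the generated XML close to the hand-written schema shown earlier.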

XSCS reuses the complete DTD content model syntax (with the exception of the #PCDATA and ANY keywords and parameter entities) and introduces new syntax constructs for the minOccurs and maxOccurs attributes, as well as for the all model group, which is represented by the & symbol. (SGML experts will recognize that this was a legal model group in SGML and disappeared in XML. However, it is more restricted in WXS than it was in SGML because of the all model group's restrictions.)

Because of the increased complexity of attributes in WXS, and the orthogonality of the type concept, where simple types may be used for elements as well as attributes, XSCS does not reuse the DTD's attribute list syntax. Instead, attribute declarations appear inside element type declarations (or attribute group definitions), thus making them the syntactical part of the element type declaration that they failed to be in the DTD syntax.

New Constructs

Apart from the extensions of DTD syntax, some constructs are entirely new in WXS and, consequently, cannot borrow from DTD syntax. Type definitions are the first thing that comes to mind, because the whole type system (including the various derivation mechanisms) is new. As an example of a type definition, facets for simple type restrictions are simply listed one after the other in WXS XML syntax:

<xs:simpleType name="short">
 <xs:restriction base="xs:int">
  <xs:minInclusive value="-32768"/>
  <xs:maxInclusive value="32767"/>
 </xs:restriction>
</xs:simpleType>

This notation is not very intuitive: it fails to reflect the logical connection between the minInclusive and maxInclusive facets. To make facets easier to read and write, XSCS introduces a facet syntax that clearly mirrors the logical connection of the two facets and reuses the interval notation well known from mathematics:

simpleType short { xs:int { [-32768,32767] } }
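To make the correspondence concrete, the following Java sketch expands a closed integer interval into the two facet elements of the XML syntax. It is a minimal illustration under stated assumptions: the class name is hypothetical, only the closed [min,max] form shown above is handled, and the facets are produced as plain strings instead of the DOM nodes a real converter would build.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the XSCS distribution. */
final class IntervalFacet {
    // Matches only a closed integer interval such as "[-32768,32767]".
    private static final Pattern CLOSED_INTERVAL =
            Pattern.compile("\\[\\s*(-?\\d+)\\s*,\\s*(-?\\d+)\\s*\\]");

    /** Expands the interval into its minInclusive and maxInclusive facet elements. */
    static List<String> toFacets(String interval) {
        Matcher m = CLOSED_INTERVAL.matcher(interval.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a closed integer interval: " + interval);
        }
        return Arrays.asList(
                "<xs:minInclusive value=\"" + m.group(1) + "\"/>",
                "<xs:maxInclusive value=\"" + m.group(2) + "\"/>");
    }

    public static void main(String[] args) {
        for (String facet : toFacets("[-32768,32767]")) {
            System.out.println(facet);
        }
    }
}
```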

XSCS introduces special syntax constructs for all facets of WXS, mapping some of them to keyword-value pairs, and others to special symbols, such as the interval facets and the pattern facet, which uses a Perl-like notation with the string surrounded by slash (/) characters.

The concept of model and attribute groups is a part of WXS which mirrors the experience with DTDs, where these concepts are simulated using parameter entities but are not available as first-class constructs. For example, citing another part of the Schema for Schemas, a model group can be used as follows:

<xs:complexType name="extensionType">
 <xs:complexContent>
  <xs:extension base="xs:annotated">
   <xs:sequence>
    <xs:group ref="xs:typeDefParticle" minOccurs="0"/>
    <xs:group ref="xs:attrDecls"/>
   </xs:sequence>
   <xs:attribute name="base" type="xs:QName" use="required"/>
  </xs:extension>
 </xs:complexContent>
</xs:complexType>

There is no such concept as a model group in DTDs, so XSCS introduces a special character (@), which is used to identify a part of a content model to be a model group reference (attribute groups are used and referenced in the same way):

complexType extensionType extends xs:annotated {
 ( @xs:typeDefParticle?, @xs:attrDecls );
 required attribute base { xs:QName } }

In this case, the syntax clearly reflects the fact that both names in the content model are model group references rather than global element name references. The complete Schema for Schemas is available in XSCS (XMLSchema.xsd in XSCS syntax) and can serve as an extensive example that uses most of WXS's mechanisms (for example, it uses identity constraints and named groups, but does not use substitution groups). For a complete overview of XSCS, either get the specification from the XSCS Project Page or start playing around with XSCS using your own schemas and the software described in the following section.

Software

We've implemented Java software that converts WXS schemas between XML syntax and XSCS in both directions. The software is available from the XSCS Project Page. The XSCS parser consists of two components: the generated parser class and a class that generates a DOM representation in WXS XML syntax. When converting from XSCS to XML, a DOM tree of the schema is first generated and then written to a file using a standard DOM serializer module. From XML to XSCS, the process starts by parsing the XML file using a standard DOM parser and then handing over the generated DOM tree to the XSCS serializer component. All coding and tests have been conducted using the Xerces parser library; other DOM implementations could be used too.
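The DOM-based pipeline described above can be sketched with the standard JAXP APIs. This is not the XSCS implementation: the class and its methods are hypothetical, the DOM tree for the short type is built by hand where the real tool would build it from parser actions, and the JAXP identity transformer stands in for whichever DOM serializer the tool actually uses.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical illustration of the two conversion directions. */
final class SchemaDomSketch {
    static final String XS = XMLConstants.W3C_XML_SCHEMA_NS_URI;

    private static DocumentBuilder newBuilder() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** XSCS-to-XML direction: build the DOM tree for the "short" simple type. */
    static Document buildShortType() {
        Document doc = newBuilder().newDocument();
        Element schema = doc.createElementNS(XS, "xs:schema");
        schema.setAttributeNS(XMLConstants.XMLNS_ATTRIBUTE_NS_URI, "xmlns:xs", XS);
        doc.appendChild(schema);
        Element simpleType = child(schema, "xs:simpleType");
        simpleType.setAttribute("name", "short");
        Element restriction = child(simpleType, "xs:restriction");
        restriction.setAttribute("base", "xs:int");
        child(restriction, "xs:minInclusive").setAttribute("value", "-32768");
        child(restriction, "xs:maxInclusive").setAttribute("value", "32767");
        return doc;
    }

    private static Element child(Element parent, String qName) {
        Element element = parent.getOwnerDocument().createElementNS(XS, qName);
        parent.appendChild(element);
        return element;
    }

    /** Writes the DOM tree out as XML text using the identity transformer. */
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** XML-to-XSCS direction, first step: parse the XML syntax into a DOM tree. */
    static Document parse(String xml) {
        try {
            return newBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = serialize(buildShortType());
        System.out.println(xml);
        System.out.println(parse(xml).getDocumentElement().getLocalName());
    }
}
```

Because both directions meet at the DOM tree, the sketch depends only on the JAXP interfaces, which is in line with the remark that DOM implementations other than Xerces could be used.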

Adding XSCS support to existing WXS tools is easy because the syntax does not change any of the semantics of WXS. It simply requires an additional parser and serialization module. However, XML/XSCS conversion can also be done separately using our Java tools or an equivalent implementation. Currently, there is no WXS processor supporting XSCS syntax directly, but first and foremost XSCS is intended to be an interface for human users, who can use the existing Java tool to transform from and to XSCS.

Limitations

The current version of XSCS and our implementation have some limitations:

  • Software limitations: We do not perform XML Schema checking on the XSCS syntax. Thus, error messages tend to be hard to trace back to the XSCS syntax. Better integration of XSCS into an XML Schema processor could eliminate this problem.

  • XSCS limitations: XSCS doesn't preserve the namespace declaration structure of a WXS document. Namespace declarations in XSCS always appear at the top level; as a consequence, an XML-XSCS-XML roundtrip normalizes all namespace declarations to appear in the schema's document element. Annotations are the second problem area; they're allowed in many places in the XML syntax, but in far fewer places in XSCS.
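The effect of the namespace normalization can be illustrated with a small DOM sketch that moves prefixed namespace declarations from descendant elements to the document element. It is only an approximation of what a roundtrip does and is not taken from the XSCS tools: the class is hypothetical, default namespace declarations are deliberately left untouched, and the sketch simply fails when one prefix is bound to two different URIs, a case a real converter would have to resolve.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical illustration, not part of the XSCS distribution. */
final class NamespaceHoister {
    static final String XMLNS = XMLConstants.XMLNS_ATTRIBUTE_NS_URI;

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Moves every prefixed xmlns declaration below the root onto the root. */
    static void hoist(Document doc) {
        Element root = doc.getDocumentElement();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hoistFrom(root, (Element) n);
            }
        }
    }

    private static void hoistFrom(Element root, Element current) {
        NamedNodeMap attrs = current.getAttributes();
        // Iterate backwards because removing an attribute shifts the later indexes.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            Attr attr = (Attr) attrs.item(i);
            boolean prefixedDeclaration =
                    XMLNS.equals(attr.getNamespaceURI()) && "xmlns".equals(attr.getPrefix());
            if (!prefixedDeclaration) {
                continue; // ordinary attribute or default namespace declaration
            }
            String prefix = attr.getLocalName();
            if (root.hasAttributeNS(XMLNS, prefix)
                    && !root.getAttributeNS(XMLNS, prefix).equals(attr.getValue())) {
                throw new IllegalStateException("prefix bound to two URIs: " + prefix);
            }
            root.setAttributeNS(XMLNS, attr.getName(), attr.getValue());
            current.removeAttributeNode(attr);
        }
        for (Node n = current.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hoistFrom(root, (Element) n);
            }
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a xmlns:p='urn:p'><b xmlns:q='urn:q'><q:c/></b></a>");
        hoist(doc);
        System.out.println(doc.getDocumentElement().getAttributeNS(XMLNS, "q"));
    }
}
```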

Conclusions

In this article, I've presented an alternative syntax for WXS. I believe that at least some of the criticism of WXS is based on the verbosity of its XML syntax. XSCS is an attempt to make WXS easier for humans to read and write. If the syntax proves to be useful for a sufficiently large number of people, it or a successor may even be fed into the W3C's specification process.