Designing a New Schema with XML Design Patterns

June 4, 2003

Introduction

In "Architectural Design Patterns for XML Documents" I proposed a catalog of XML schema-design patterns. In keeping with the idea behind design patterns, they were culled from schemas in common use today, but it's still good to take them for a spin and apply them to a real project. In this article I'll do just that. The project in question is the design of an XML-based type library format. If you've had exposure to Microsoft COM or Mozilla's XPCOM, you're probably familiar with their binary TLB (MS) and XPT (Mozilla) formats, which define the available operations and interfaces for a package of portable components. An interpreted language such as JavaScript can use these definitions as cheat sheets to find out on the fly what operations and parameters are available to call.

Each component model allows you to write components in many languages, but these binary formats are friendly to just one language: C. In this article we're going to start a project to design an XML format that could be generated from IDL, TLB, XPT, or any other representation of a portable component, then read by any language that supports XML and further manipulated to generate GUI tools, documentation, and more.

We could imagine an instance document like this, which defines an interface GreetingFactory with a single operation, string sayHello(in string personName):

<tlx:typelib
  xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html">
  <tlx:interface name="GreetingFactory"
    id="8bb35ed9-e332-462d-9155-4a002ab5c958">
    <tlx:operation name="sayHello" type="string">
      <tlx:in name="personName" type="string"/>
    </tlx:operation>
  </tlx:interface>
</tlx:typelib>
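
To make the "read by any language that supports XML" goal concrete, here is a minimal sketch of a consumer, assuming Python and its standard xml.etree.ElementTree module (neither is part of the project itself). The names SAMPLE and list_signatures are hypothetical; the element and attribute names come from the instance document above.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the example instance document.
NS = {"tlx": "http://schema.amberarcher.com/polaris/tlx.html"}

SAMPLE = """\
<tlx:typelib
  xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html">
  <tlx:interface name="GreetingFactory"
    id="8bb35ed9-e332-462d-9155-4a002ab5c958">
    <tlx:operation name="sayHello" type="string">
      <tlx:in name="personName" type="string"/>
    </tlx:operation>
  </tlx:interface>
</tlx:typelib>
"""


def list_signatures(xml_text):
    """Return one IDL-like signature string per operation in the document."""
    root = ET.fromstring(xml_text)
    signatures = []
    for iface in root.findall("tlx:interface", NS):
        for op in iface.findall("tlx:operation", NS):
            # Each tlx:in child is an input parameter of the operation.
            params = ", ".join(
                f"in {p.get('type')} {p.get('name')}"
                for p in op.findall("tlx:in", NS)
            )
            signatures.append(
                f"{iface.get('name')}: {op.get('type')} {op.get('name')}({params})"
            )
    return signatures


print(list_signatures(SAMPLE))
# prints ['GreetingFactory: string sayHello(in string personName)']
```

A scripting host could use the same traversal as its "cheat sheet" when deciding what it is allowed to call.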

Dynamic or Static Document Structure

The Dynamic Document pattern suggests not writing a schema at all: the document simply follows the program's data structures, and the parsing is done by .NET marshalling or the JavaBeans long-term persistence from JDK 1.4. Adopting it would let us end the article right here. It should be clear immediately that this option is right out: the point of the project is to create a stable, portable format, and we don't want it influenced by changes in the language we use to parse or generate it. Also, unlike Ant, say, this is a fairly self-contained project, so we don't need to let arbitrary people write extensions without incorporating them into the schema.

Composition/Self-Documenting Files

There are two types of composition: reuse of your own schema elements and reuse of other people's schema elements. We'll endeavor to do both for this project. To start, we need some requirements:

1. We want to represent interfaces and operations on those interfaces.
2. We want to allow for vendor/format-specific extensions.
3. We want to provide documentation for those same interfaces and operations.
4. Since this will be generated code, we want to provide some information about who generated it and when, as well as the original source.

Requirement (1) is core: it wouldn't be a type library without defining the interfaces and operations, and we can't borrow this from anywhere else. Requirement (2) is also central, but one thing we have to consider for composition is who controls the schema and how fast it changes. The idea of an interface and a method call will not change over the lifetime of this schema, but vendors come and go. We should put Mozilla- and Microsoft-specific information in their own schemas and compose them into the larger schema.

Requirement (3) screams for some petty schema larceny. Documentation nowadays should be hypertext and convertible into other formats. Both DocBook/XML and XHTML's more basic elements fit the bill. Since HTML is familiar to most programmers, we'll choose to compose XHTML.

Finally, requirement (4) could be filled with RDF and the Dublin Core. There's no point in reinventing file metadata, and there are plenty of tools out there for reading RDF. So with our tools chosen, let's go back to our first example, and mark up the document using all our borrowed schemas:

<tlx:typelib
    xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html">

  <!-- for generated code, we mostly want to know what
       version of the program created it, when it was last
       (re)generated, and what was the original source;
       Dublin Core offers all of these for our use -->
  <tlx:metadata>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description>
        <dc:creator>xpidl2tlx 1.0</dc:creator>
        <dc:date>2003-05-02</dc:date>
        <dc:source>file:c:/oss/mozilla/modules/libreg/xpcom/nsIRegistry.idl</dc:source>
      </rdf:Description>
    </rdf:RDF>
  </tlx:metadata>

  <tlx:interface name="GreetingFactory"
      id="8bb35ed9-e332-462d-9155-4a002ab5c958">

    <!-- documentation is nested inside the object being
         documented; the text chosen here is from the
         javadoc-type description in the original IDL -->
    <tlx:documentation>
      <xhtml:p
          xmlns:xhtml="http://www.w3.org/1999/xhtml">
        Interface for creating custom greetings.
      </xhtml:p>
    </tlx:documentation>

    <tlx:operation name="sayHello" type="string">
      <tlx:documentation>
        <xhtml:p
            xmlns:xhtml="http://www.w3.org/1999/xhtml">
          Returns a greeting that incorporates the
          person's name.
        </xhtml:p>
      </tlx:documentation>

      <tlx:in name="personName" type="string">

        <!-- documentation is nested inside the
             parameter being documented -->
        <tlx:documentation>
          <xhtml:p
              xmlns:xhtml="http://www.w3.org/1999/xhtml">
            name of the person to greet
          </xhtml:p>
        </tlx:documentation>

      </tlx:in>
    </tlx:operation>
  </tlx:interface>
</tlx:typelib>

What have we gained? It seems like a lot of added complexity and bulk. As in the original pattern catalog, the reason we use XHTML or RDF when we have a need for their services is to get ourselves out of the business of defining tags in that domain. We care about typelibs, not hypertext. If someone wants to put a hyperlink in the documentation, we don't need to define a tag: XHTML has it already. Also, by using a well-known schema, it becomes easier to use shared tools: an RDF-aware document-management system would immediately recognize the creation date and creator, while it might ignore that information if it were buried in our own "create-date" and "creator" tags.
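
The shared-tools argument can be sketched in a few lines. The function below is a hypothetical stand-in for an RDF-aware tool: it knows the RDF and Dublin Core namespaces but nothing about TLX, and it still finds the creator and date. It assumes Python's standard xml.etree.ElementTree; SAMPLE and dublin_core are illustrative names, and a real RDF tool would build a proper graph rather than walk elements like this.

```python
import xml.etree.ElementTree as ET

# Only the borrowed vocabularies are named here; TLX is deliberately absent.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

SAMPLE = """\
<tlx:typelib
    xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html">
  <tlx:metadata>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description>
        <dc:creator>xpidl2tlx 1.0</dc:creator>
        <dc:date>2003-05-02</dc:date>
      </rdf:Description>
    </rdf:RDF>
  </tlx:metadata>
</tlx:typelib>
"""


def dublin_core(xml_text):
    """Collect dc:* properties from every rdf:Description, wherever it sits."""
    root = ET.fromstring(xml_text)
    prefix = "{" + NS["dc"] + "}"
    found = {}
    for description in root.iterfind(".//rdf:Description", NS):
        for prop in description:
            if prop.tag.startswith(prefix):
                # Strip the namespace so keys read "creator", "date", ...
                found[prop.tag[len(prefix):]] = (prop.text or "").strip()
    return found


print(dublin_core(SAMPLE))
# prints {'creator': 'xpidl2tlx 1.0', 'date': '2003-05-02'}
```

Had the same facts been stored in home-grown "create-date" and "creator" tags, this generic function would have returned an empty dictionary.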

We've also satisfied another common schema pattern along the way: we've made our format self-documenting. While not child's play, it's still much easier to write a stylesheet to convert the above document to HTML than it is to, say, parse XPIDL and javadoc-style tags to produce it. A GUI browser tool can also provide human descriptions for cryptically-named functions and parameters.
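
As a rough illustration of how little code the conversion needs, here is a sketch in Python rather than a stylesheet. It assumes the standard xml.etree.ElementTree and html modules; SAMPLE, doc_text, and to_html are hypothetical names, and for brevity it flattens the embedded XHTML to plain text where a real stylesheet would copy the mark-up through.

```python
import xml.etree.ElementTree as ET
from html import escape

NS = {"tlx": "http://schema.amberarcher.com/polaris/tlx.html"}

SAMPLE = """\
<tlx:typelib
    xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <tlx:interface name="GreetingFactory">
    <tlx:documentation>
      <xhtml:p>Interface for creating custom greetings.</xhtml:p>
    </tlx:documentation>
    <tlx:operation name="sayHello" type="string">
      <tlx:documentation>
        <xhtml:p>
          Returns a greeting that incorporates the
          person's name.
        </xhtml:p>
      </tlx:documentation>
    </tlx:operation>
  </tlx:interface>
</tlx:typelib>
"""


def doc_text(elem):
    """Return the whitespace-normalized text of elem's tlx:documentation."""
    doc = elem.find("tlx:documentation", NS)
    if doc is None:
        return ""
    return " ".join("".join(doc.itertext()).split())


def to_html(xml_text):
    """Render each interface as a heading, a description, and an operation list."""
    root = ET.fromstring(xml_text)
    lines = []
    for iface in root.findall("tlx:interface", NS):
        lines.append(f"<h1>{escape(iface.get('name', ''), quote=False)}</h1>")
        lines.append(f"<p>{escape(doc_text(iface), quote=False)}</p>")
        lines.append("<ul>")
        for op in iface.findall("tlx:operation", NS):
            name = escape(op.get("name", ""), quote=False)
            text = escape(doc_text(op), quote=False)
            lines.append(f"<li><code>{name}</code>: {text}</li>")
        lines.append("</ul>")
    return "\n".join(lines)


print(to_html(SAMPLE))
```

Because the description travels with the element it describes, the generator never has to parse comments or guess which text belongs to which operation.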

Multipart Files

Our last architectural pattern is Multipart Files. This pattern suggests we should offer our document creators a way to compose a single, coherent document out of many. Type libraries define interfaces, and interfaces can inherit from other interfaces. We may thus want to define our common base interfaces in one file, much like we'd define and #include a header in C or C++. We have two choices for implementing this feature: borrow or invent. We're on a roll borrowing, so here's what it would look like using XInclude, the W3C Candidate Recommendation for this purpose:

<tlx:typelib
    xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="base.tlx"/>
  <!-- ... -->
</tlx:typelib>

Next Step: Building the Schema

We've gone from a blank sheet of paper and a couple of requirements to sketches of the file format we want to support, plus a set of patterns to guide us: Multipart Files, Self-Documenting Formats, and Composition. There are a number of capable XML Schema tutorials out there, so the focus here will be on the XML Schema details that support incorporating other schemas into our design.

Implementing Composition

We're composing three different schemas for this design, so we need to import them so we can refer to their elements.

<schema targetNamespace="http://schema.amberarcher.com/polaris/tlx.html"
  xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns="http://www.w3.org/2001/XMLSchema"
  elementFormDefault="qualified">

  <import
    namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    schemaLocation="http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-rdf.xsd"/>

  <import
    namespace="http://www.w3.org/1999/xhtml"
    schemaLocation="http://schema.amberarcher.com/polaris/w3c/xhtml1.1.xsd"/>

  <import
    namespace="http://www.w3.org/2001/XInclude"
    schemaLocation="http://schema.amberarcher.com/polaris/w3c/xinclude.xsd"/>

For the W3C schemas, the first snag I hit was that there's not yet an authoritative URI for either the XHTML 1.1 or XInclude XML schemas. I had to download the zip files that come with the specifications and put them up on my server. Since these are snapshots of the versions at this time, I do not recommend that you reuse my URLs (my ISP will thank you as well); they may not stay up-to-date.

With the full contents of these three namespaces at our disposal, we can reference them. Let's define our top-level element, typelib, and say it's a sequence of two optional elements and an unbounded set of one or more interfaces. That looks like this:

<element name="typelib">
  <complexType>
    <sequence>
      <element ref="tlx:metadata" minOccurs="0"/>
      <element ref="xi:include" minOccurs="0"/>
      <element ref="tlx:interface" maxOccurs="unbounded"/>
    </sequence>
  </complexType>
</element>

<element name="metadata">
  <complexType>
    <sequence>
      <element ref="rdf:RDF"/>
    </sequence>
  </complexType>
</element>

We use <element ref="xxx:yyy"> whenever we want to link to another schema element. We have to define a namespace prefix (xi or rdf in this case) to scope the name, and then we can pick any element in that schema. Note that we don't necessarily have to choose the top-level element, as we'll see in the next section.
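
To spell out what the typelib sequence accepts, the content model can be read as "metadata?, include?, interface+". The sketch below hand-checks only that child ordering; it is an illustration of the minOccurs/maxOccurs defaults, not a substitute for a schema validator such as XSV. It assumes Python's standard library, and CODES, CONTENT_MODEL, and typelib_children_ok are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

TLX = "http://schema.amberarcher.com/polaris/tlx.html"
XI = "http://www.w3.org/2001/XInclude"

# One letter per child element the <sequence> declares.
CODES = {
    f"{{{TLX}}}metadata": "m",
    f"{{{XI}}}include": "x",
    f"{{{TLX}}}interface": "i",
}

# minOccurs="0" -> optional; default 1..1 -> exactly one;
# maxOccurs="unbounded" with the default minOccurs of 1 -> one or more.
CONTENT_MODEL = re.compile(r"m?x?i+")


def typelib_children_ok(xml_text):
    """Check that the root is tlx:typelib and its children follow the sequence."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{TLX}}}typelib":
        return False
    # Undeclared children map to "#", which the pattern can never match.
    shape = "".join(CODES.get(child.tag, "#") for child in root)
    return CONTENT_MODEL.fullmatch(shape) is not None
```

Under this reading, a typelib with metadata followed by two interfaces passes, while one with no interface at all, or with metadata placed after an interface, does not.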

Implementing multipart documents on the schema side took just one line: the reference to <xi:include>. Supporting them on the document-processing side may be easy if your XML parser has native XInclude support; otherwise you may have to add it yourself using one of the known implementations. There are versions for Java, .NET, and C/C++.
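
For a sense of what the processing side involves, here is a minimal sketch using Python's standard xml.etree.ElementInclude module, which is not one of the implementations referred to above. Everything about the files is an assumption made for illustration: the in-memory FILES table stands in for the file system, and base.tlx is assumed to have a single tlx:interface as its root element, because ElementInclude replaces the xi:include element with the included document's root and does not support XPointer.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

NS = {"tlx": "http://schema.amberarcher.com/polaris/tlx.html"}

# Hypothetical contents of base.tlx: one shared base interface.
BASE_TLX = """\
<tlx:interface
    xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html"
    name="Base">
  <tlx:operation name="describe" type="string"/>
</tlx:interface>
"""

MAIN_TLX = """\
<tlx:typelib
    xmlns:tlx="http://schema.amberarcher.com/polaris/tlx.html"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="base.tlx"/>
  <tlx:interface name="GreetingFactory">
    <tlx:operation name="sayHello" type="string"/>
  </tlx:interface>
</tlx:typelib>
"""

# Stand-in for the file system, so the sketch runs without touching disk.
FILES = {"base.tlx": BASE_TLX}


def in_memory_loader(href, parse, encoding=None):
    """Resolve an href against FILES; return an Element for parse='xml'."""
    text = FILES[href]
    return ET.fromstring(text) if parse == "xml" else text


def load_typelib(xml_text):
    """Parse a typelib and expand its xi:include elements in place."""
    root = ET.fromstring(xml_text)
    ElementInclude.include(root, loader=in_memory_loader)
    return root


root = load_typelib(MAIN_TLX)
print([iface.get("name") for iface in root.findall("tlx:interface", NS)])
# prints ['Base', 'GreetingFactory']
```

After expansion the application sees one coherent tree, so the rest of its code does not need to know the document was assembled from several files.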

The XHTML Snag

What I intended to do with XHTML was to reuse its <div> and <p> as allowed container elements for documentation so that our embedded documentation could have hyperlinks and other mark-up. This can be done with W3C XML Schema like this:

<element name="documentation">
  <complexType>
    <choice>
      <element ref="xhtml:p"/>
      <element ref="xhtml:div"/>
    </choice>
  </complexType>
</element>

Easy, right? I completed the schema, validated the W3C XML Schema code itself without trouble, and then tried to validate an instance document using XSV. I got an explosion of validation errors, all from what seemed to be the XHTML 1.1 schema itself; mailing-list and newsgroup messages on the topic seemed to indicate that this problem has been around for a while. I tried to find variants of the schema published elsewhere, but the only one I could dig up was based on a draft version of W3C XML Schema.

Although the XHTML working group has moved on to XHTML 2.0, my understanding is that the finished work for 1.1 is in the DTD, while Modularization of XHTML in XML Schema is still just a W3C working draft as of 9 December 2002. I've written to the lead author of the specification and at the time of writing have yet to hear back. I think the lesson for anyone applying Composition is to be very careful about choosing stable specifications. Don't make assumptions about the availability of a compliant W3C XML Schema for any incomplete specification until you've tested it yourself. The flip side of being able to rely upon someone else to define the elements for your problem domain is that you have to depend upon them.

For now the test schema for TLX has the XHTML import commented out and the documentation tag is declared more simply as:

<element name="documentation" type="string"/>

With these changes to the test schema, and changes to the example document, XSV blessed the document as valid.

Using Inheritance to Share Documentation

Even though I'm not yet able to support validation and embedded XHTML in the documentation at the same time, there's one final nice touch to put in the schema. A number of elements can have embedded documentation: interfaces, operations, and parameters. We can use inheritance in XML Schema to prevent duplication in our new schema. First we define a complexType named documentableType. We declare it abstract because we don't want any elements that directly use this type, just derived types:

<complexType name="documentableType" abstract="true">
  <sequence minOccurs="0">
    <element ref="tlx:documentation"/>
  </sequence>
</complexType>

To finish we just have to derive the complexTypes for each of our elements from documentableType using W3C XML Schema's <extension> tag.

<element name="interface">
  <complexType>
    <complexContent>
      <extension base="tlx:documentableType">
        <sequence>
          <element ref="tlx:operation" maxOccurs="unbounded"/>
        </sequence>
        <attribute name="name" type="string"/>
        <attribute name="id" type="string" use="optional"/>
      </extension>
    </complexContent>
  </complexType>
</element>

Conclusions

When you set out to design your own XML Schema, you do not need to start from scratch. You can either follow the patterns exemplified by the growing body of working schemas on the Internet, from the W3C to OASIS, or directly reuse their elements through Composition. It's worth the effort, because you get the value of a language specific to your domain without the trouble of writing your own parser, and the end result can be used across multiple languages with ease.