Architectural Design Patterns for XML Documents

March 26, 2003

Kyle Downey

Introduction

Dynamic Document

Composition

Self-Documenting Files

Multipart Files

References & Acknowledgements

No one wants to reinvent the wheel. One way programmers try to reuse good ideas about object design is to look to catalogs of design patterns like, most famously, the Gang of Four's Design Patterns: Elements of Reusable Object-Oriented Software (Gamma et al.). XML has been used enough now that some high-level patterns are starting to emerge. Some patterns revolve around the low-level details of good schema design, like those put together by Dare Obasanjo in "W3C XML Schema Design Patterns"; but when you have a blank sheet of paper in front of you and you're ready to start designing your new XML format, you want patterns to guide you at a higher level. This article attempts to document a few whole-document design patterns that have proven themselves in the field.

Dynamic Document

Abstract

In this pattern the XML is not typed by a DTD or schema; instead, its structure follows the accessors of the underlying program objects. It allows for unlimited extension by multiple, uncoordinated parties at the cost of type-checking, and it is simple to implement, with supporting libraries abounding (e.g. Apache Commons for Java; .NET's XML marshalling for C#).

Problem

You need to develop a format quickly, or many different people are contributing on an ad-hoc basis at different times, and it's not possible to have a fixed document design.

Context

This pattern is more common for private formats or technical ones, such as configuration for a server or a marshaling format. It also is a good match for Extreme Programming projects because you can get it working quickly, refactoring later to use another mechanism if needed.

Forces

You need a "quick and dirty" solution.
You can't know beforehand what extensions will be required, but you know they will be many and created by people other than the original document format creator.

Solution

Don't design a format at all, and drop validation. Have a technical solution -- that is, a marshaller -- drive the XML generation. As data structures in your program change, the generated XML changes. In both .NET and Java the marshaller uses reflection and extra metadata (.NET CLR attributes or JavaBean BeanInfo classes) to find the read/write properties of a class. It moves recursively through the object graph, generating a tree of XML elements named after the accessors. For example, these two classes:

    public class Person {
       public String getName() { ... }
       public void setName(String name) { ... }
       public Address getAddress() { ... }
       public void setAddress(Address address) { ... }
    }

    public class Address {
       public String getCity() { ... }
       public void setCity(String city) { ... }
       public String getState() { ... }
       public void setState(String state) { ... }
    }

might be marshalled as

    <person>
       <name>Kyle Downey</name>
       <address>
          <city>Forest Hills</city>
          <state>Queens</state>
       </address>
    </person>
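As a rough illustration of how such a marshaller can work, the sketch below walks a bean's readable properties with the standard java.beans.Introspector and emits one element per property, recursing into nested beans. The class name DynamicMarshaller and the toXml helper are hypothetical names invented for this example, not the API of any library mentioned above; the sketch does not handle collections, attributes, or cyclic object graphs, and the element order depends on the order in which the introspector reports properties, so it may differ from the listing above.

```java
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.Method;

public class DynamicMarshaller {

    // Sample beans mirroring the Person and Address classes above (bodies filled in for the demo).
    public static class Address {
        private String city;
        private String state;
        public String getCity() { return city; }
        public void setCity(String city) { this.city = city; }
        public String getState() { return state; }
        public void setState(String state) { this.state = state; }
    }

    public static class Person {
        private String name;
        private Address address;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Address getAddress() { return address; }
        public void setAddress(Address address) { this.address = address; }
    }

    /** Emits one element per readable bean property, recursing into nested beans. */
    public static String toXml(String elementName, Object value) {
        StringBuilder out = new StringBuilder("<" + elementName + ">");
        if (value instanceof CharSequence || value instanceof Number || value instanceof Boolean) {
            out.append(escape(value.toString()));            // leaf value becomes text content
        } else {
            try {
                // Stop at Object.class so the inherited "class" property is skipped.
                PropertyDescriptor[] props =
                    Introspector.getBeanInfo(value.getClass(), Object.class).getPropertyDescriptors();
                for (PropertyDescriptor prop : props) {
                    Method getter = prop.getReadMethod();
                    if (getter == null) continue;             // write-only property
                    Object child = getter.invoke(value);
                    if (child != null) out.append(toXml(prop.getName(), child));
                }
            } catch (Exception e) {
                throw new IllegalStateException("Cannot marshal " + elementName, e);
            }
        }
        return out.append("</").append(elementName).append(">").toString();
    }

    // Minimal escaping for text content only.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Address address = new Address();
        address.setCity("Forest Hills");
        address.setState("Queens");
        Person person = new Person();
        person.setName("Kyle Downey");
        person.setAddress(address);
        System.out.println(toXml("person", person));
    }
}
```

Adding a new property to Person changes the generated XML without any other code change, which is both the appeal and the risk of the pattern.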

Discussion

Before sitting down to do a potentially complex document design, you should always ask yourself if a dynamic, data-driven format might be sufficient. Most XML-aware development platforms provide at least one library that will take an object and convert it into XML. You've done the object design, and in a couple lines of code, you've done your document design as well. If you're on a tight deadline, this is a potentially big time-saver for the development team.

But not so fast. Dynamic document most likely isn't an option for you if

you're designing a long-lived business-critical exchange format and thus you don't want the format to change whenever you change your object design; or
you don't trust the producers of the data to get it right, and the cost of a mistake is high. For example, if the document notifies you about inventory changes at a partner's warehouse, the lack of validation is risky.

Related Patterns

None. This is the "zero design pattern design." Once you start to involve other patterns, you're enforcing a human design rather than having a dynamic document.

Known Uses

Ant build.xml
Apache Tomcat server.xml
JDK 1.4 JavaBean XML persistence
.NET XML Marshalling
SOAP default encoding

Composition

Abstract

Wherever possible, define the format using existing standards, referencing their elements by namespace rather than rolling your own. For example, add metadata to your document using RDF and the Dublin Core extensions rather than inventing your own <author> and <description> tags. This allows for independent evolution of markup by the parties who know the business domain best.

Problem

You have an existing or planned document format that provides common types of data using its own, proprietary elements and types, and you're forced to maintain and understand that subset of data yourself, even though you're not a domain specialist.

Context

With all the standardization work out there, just about any business-oriented document problem presents an opportunity for defining some elements with Composition.

Forces

There is an opportunity to reuse a <simpleType>, <complexType> or <element> from another XML schema.
You can accept or even want to have the composed data type definition evolve independently of your own efforts.
Patents or other legal encumbrances do not prevent you from reusing that schema.

Solution

XML namespaces make it very easy to import entire elements from one spec to another. Let's say you're designing a format for capturing use cases. You want to include attribution information: who wrote it, when, etc. You might want to consider using the Dublin Core RDF elements instead of defining your own <author> and other meta-information tags:

<uc:use-case
  xmlns:uc="http://example.com/my/usecase.xsd"
  id="3">
  <uc:metadata>
    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dc="http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-xsd.xsd">
      <rdf:Description>
        <dc:title>Irritate Customer</dc:title>
        <dc:creator>Kyle Downey</dc:creator>
        <dc:date>2002-03-08</dc:date>
        <dc:format>text/xml</dc:format>
        <dc:language>en</dc:language>
        <dc:contributor>Amber Archer Consulting Co., Inc.</dc:contributor>
        <dc:identifier>UC#3</dc:identifier>
      </rdf:Description>
    </rdf:RDF>
  </uc:metadata>
...
</uc:use-case>

In your use case schema you would have (in part)

<schema
  xmlns="http://www.w3.org/2001/XMLSchema"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <import
      namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      schemaLocation="http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-rdf.xsd"
    />

    <element name="metadata">
      <complexType>
        <sequence>
          <element ref="rdf:RDF"/>
        </sequence>
      </complexType>
    </element>
</schema>
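On the consuming side, Composition means the borrowed elements are found by namespace rather than by prefix or position. The sketch below, which assumes the namespace URIs from the use-case example above and a trimmed-down copy of that document, uses the JDK's namespace-aware DOM parser to pull out a Dublin Core value. ComposedMetadataReader and firstDcValue are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ComposedMetadataReader {

    // Namespace URIs copied from the use-case example; the dc URI is the one the article uses.
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS =
        "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-xsd.xsd";

    // A shortened version of the use-case document, kept inline so the sketch is self-contained.
    static final String SAMPLE =
          "<uc:use-case xmlns:uc=\"http://example.com/my/usecase.xsd\" id=\"3\">"
        + "<uc:metadata>"
        + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\" xmlns:dc=\"" + DC_NS + "\">"
        + "<rdf:Description>"
        + "<dc:title>Irritate Customer</dc:title>"
        + "<dc:creator>Kyle Downey</dc:creator>"
        + "</rdf:Description>"
        + "</rdf:RDF>"
        + "</uc:metadata>"
        + "</uc:use-case>";

    /** Returns the text of the first Dublin Core element with the given local name, or null. */
    public static String firstDcValue(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Match on namespace URI + local name; the "dc" prefix itself is irrelevant.
            NodeList hits = doc.getElementsByTagNameNS(DC_NS, localName);
            return hits.getLength() == 0 ? null : hits.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read metadata", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstDcValue(SAMPLE, "creator"));
        System.out.println(firstDcValue(SAMPLE, "title"));
    }
}
```

Because the lookup is keyed on the namespace URI, a change of that URI in a newer version of the reused schema would require updating DC_NS here as well.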

Discussion

One of the strong arguments for Composition -- aside from the well-documented programmer's virtue of laziness -- is that you can lean on the more specialized knowledge of others. The people who put together Dublin Core put a lot of thought into how to best represent document metadata. They have been doing it since 1994. Most likely, you've been thinking about how to put meta-information into your document since two paragraphs ago. There's no match. So your choice is either to get taken down by an angry librarian who's breaking noses and taking names or reuse the work. This design pattern recommends the latter.

As RDF and Dublin Core evolve, all you have to do is change the namespace and the import statement to point to a newer version of the schema, letting you take advantage of all the latest and greatest ways of representing metadata, widgets, documents, customers, fixed income instruments, or whatever it is you're reusing with very little effort. This capacity for concurrent evolution is, however, also the biggest gotcha in Composition. Unless the promoters of your standard have done the right thing and put version information in the namespace and schema URI, there's a risk users in the field will suddenly start getting backward-incompatible version 2.0 of the schema and get very angry. So keep an eye on versioning, and if necessary copy the schema to your own namespace and reuse from there.

Even where you can't reuse a public XML schema, you can still look for common, reusable data clumps in your document formats. Let's put it this way: if you have five business processes involving customers and addresses, do you really need to define customer and address five times? Or even want to? Reuse through Composition can and should start inside your enterprise.

Related Patterns

None from this catalog.

Known Uses

WSDL very nicely reuses XML schema by embedding a whole <schema> element in the WSDL document rather than defining its own mechanism for acceptable web service message types.

Self-Documenting Files

Abstract

Include as part of the document format elements that annotate the content.

Problem

Your human-readable format is so cryptic that it makes grown hackers cry. Take this fragment of Perl code rendered as XML, which supposedly prints the entire Linux kernel when run:


   <perlml>
@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
   </perlml>

Note how it's much improved with just a little annotation:

<perlml>
   <annotation>
     You're not expected to understand this.
   </annotation>
   <code>
@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
   </code>
</perlml>

Context

Documents that are meant to be viewed by people or at least post-processed to generate documentation for people. Internal data structure formats like on-the-wire marshaling generally don't need annotation.

Forces

You're generating complex XML content that needs to be understood by people, or converted into some format for their viewing.
The information in the document itself is not enough to be comprehensible.

Solution

Add an element or elements to your XML schema to include documentation. Generally you'll want to somehow tie the documentation to each significant element, so you could consider a base type -- for example, documentableType -- like this:

    <complexType name="documentableType">
      <sequence>
        <element name="annotation" type="string"/>
      </sequence>
    </complexType>

Discussion

XML comments are great, but if you find that they're becoming mandatory for users to decode your XML documents, maybe it's time to allow those annotations to be part of the XML itself. Probably the biggest win you get out of this (aside from standardizing where the comments go and how they're formatted using all the powerful features of XML Schema) is an ability to apply the rest of the XML toolkit to your documents. You could, for instance, write a "widgetdoc" XSLT stylesheet that takes your widget.xml files and converts them into an HTML document describing the widget, including all your extra annotations that might not mean much to your automatic widget-stamping machine that was reading the XML before, but will mean a lot to anyone debugging the machine's software.
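The "widgetdoc" idea can be tried out without writing a stylesheet at all. The sketch below uses a plain DOM walk in Java as a stand-in for the XSLT approach described above: it collects every <annotation> element and pairs it with the name of the element it documents. The widget document, the AnnotationLister class, and the listAnnotations helper are all hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnnotationLister {

    // Hypothetical widget.xml content in which each significant element carries an annotation.
    static final String SAMPLE_WIDGET =
          "<widget>"
        + "<annotation>Stamped from 2mm steel.</annotation>"
        + "<part id=\"7\">"
        + "<annotation>Hinge pin.</annotation>"
        + "</part>"
        + "</widget>";

    /** Returns one "element: annotation text" line per annotation, in document order. */
    public static List<String> listAnnotations(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList notes = doc.getElementsByTagName("annotation");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < notes.getLength(); i++) {
                Node note = notes.item(i);
                // The parent is the element that the annotation documents.
                String owner = note.getParentNode().getNodeName();
                lines.add(owner + ": " + note.getTextContent().trim());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read annotations", e);
        }
    }

    public static void main(String[] args) {
        for (String line : listAnnotations(SAMPLE_WIDGET)) {
            System.out.println(line);
        }
    }
}
```

A machine reading the same file can simply ignore the annotation elements, while a documentation tool like this one reads nothing else.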

Related Patterns

There's a nice combination of Composition and Self-Documenting Files. There are two well-known formats for documentation in XML: DocBook and XHTML. DocBook is specialized for technical documentation, and there are powerful stylesheets out there for converting it to HTML and PDF. XHTML is, obviously, very good for online presentation. So if you want to be able to generate professional-quality documentation with links and images from your own XML format, you should definitely consider embedding XHTML or DocBook XML.

Known Uses

XML Schema has annotations, and you can convert them to HTML using xs3p, a very snazzy schemadoc tool
WSDL

Multipart Files

Abstract

Define an explicit mechanism for splitting content into multiple files: a primary document and satellite ones that represent faster changing components or sections of content shared with other primary documents.

Problem

Your documents have become large and unwieldy, and you want to share pieces of them.

Context

This pattern can apply to just about any format, but it seems to be more common in the technical arena.

Forces

As documents grow in size and complexity, and as there are more documents that can overlap, this pattern becomes more appealing.
Pushing against its use, security and the handling of absolute versus relative URIs become issues for anyone processing the format. If that is too complicated for your taste, or if there are concerns about a cracker manipulating this facility to pull in content he or she should not have access to, you might want to disallow inclusions.

Solution

Add to your schema an <import> or <include> element that takes an href attribute which can be any valid relative or absolute URI. Compliant processors for your format will load and incorporate valid subdocuments in your format from the URI.
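To make the processing model concrete, here is a minimal sketch of a processor that expands inclusions. It assumes a hypothetical <include href="..."/> element, resolves each href against the URI of the including document with java.net.URI, and replaces the element with the root of the loaded subdocument. To keep it self-contained, documents come from an in-memory map instead of the network; that map also acts as a crude allow-list, since a URI that is not in it cannot be pulled in. IncludeExpander and its methods are names invented for this example.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IncludeExpander {

    // Hypothetical stand-in for fetching URIs: absolute URI string -> document text.
    private final Map<String, String> store;

    public IncludeExpander(Map<String, String> store) {
        this.store = store;
    }

    /** Loads the primary document and expands every include element, recursively. */
    public Document load(String absoluteUri) {
        return load(URI.create(absoluteUri), new HashSet<>());
    }

    private Document load(URI uri, Set<URI> inProgress) {
        if (!inProgress.add(uri)) {
            throw new IllegalStateException("Circular include: " + uri);
        }
        String xml = store.get(uri.toString());
        if (xml == null) {
            throw new IllegalStateException("Not found or not allowed: " + uri);
        }
        Document doc = parse(xml);

        // Snapshot first: the NodeList is live and shrinks as includes are replaced.
        NodeList live = doc.getElementsByTagName("include");
        List<Element> includes = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            includes.add((Element) live.item(i));
        }
        for (Element include : includes) {
            // Relative hrefs are resolved against the including document's own URI.
            URI target = uri.resolve(include.getAttribute("href"));
            Document part = load(target, inProgress);
            Node imported = doc.importNode(part.getDocumentElement(), true);
            include.getParentNode().replaceChild(imported, include);
        }
        inProgress.remove(uri);   // the same part may be included again elsewhere
        return doc;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed document", e);
        }
    }
}
```

A caller would populate the map with the primary document and its satellites, keyed by absolute URI, and call load on the primary one. A production processor would also need a policy for which schemes and hosts are acceptable, which is exactly the security force noted above.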

SOAP 1.1 with Attachments takes an interesting alternative approach to this problem, using Composition along the way. It co-opts the pre-existing MIME standard and allows a SOAP message to be carried in a MIME multipart/related package, with the SOAP XML message as the initial part and the others linked to it. This allows SOAP to behave something like the FTP protocol with separate "control" and "data" streams. You can send metadata about binary content and directives for what the recipient should do with it as part of the XML message and just attach the content directly to the message.

Discussion

From #include to the humble href in HTML, systems abound with ways to pull together content from multiple locations. This makes documents more maintainable and encourages basic reuse of common components, whether they're shared stylesheet rules or whole XML schemas. While it may seem hard to find instances where you wouldn't want to allow sharing of document parts and file composition, as noted above under Forces, there are potential complexity and security issues with allowing inclusions.

Related Patterns

You might want to make your Self-Documenting Files refer to external documents rather than embedding them, and you can use Composition by reusing the W3C standards for file inclusion: XInclude and XML Base. But if you need to have different meanings for including other files (as XSLT does with its <import> and <include> elements) you might still have to roll your own.

Known Uses

XSLT
XML Schemas
WSDL
SOAP with Attachments

References and Acknowledgments

XML Schemas
XSL/XSLT
SOAP 1.2
SOAP 1.2 Attachments
WSDL 1.2
XHTML
XML Pointer, XML Base and XLink
Dublin Core Group
Expressing Simple Dublin Core in RDF/XML
Programming Perl, 2nd Edition (for source of the "three great virtues of a programmer")
thanks to Raymond Blum for pointing out that Dynamic Document and XP go together well