Architectural Design Patterns for XML Documents
No one wants to reinvent the wheel. One way programmers try to reuse good ideas about object design is to look to catalogs of design patterns like, most famously, the Gang of Four's Design Patterns: Elements of Reusable Object-Oriented Software (Gamma et al.). XML has been used enough now that some high-level patterns are starting to emerge. Some patterns revolve around the low-level details of good schema design, like those put together by Dare Obasanjo in "W3C XML Schema Design Patterns"; but when you have a blank sheet of paper in front of you and you're ready to start designing your new XML format, you want patterns to guide you at a higher level. This article attempts to document a few whole-document design patterns that have proven themselves in the field.
In the Dynamic Document pattern, the XML is not typed by a DTD or schema; instead, its structure follows the accessors of the underlying program objects. The pattern allows for unlimited extension by multiple, uncoordinated parties at the cost of type-checking, and it is simple to implement, with supporting libraries abounding (e.g., Apache Commons for Java; .NET's XML marshalling for C#).
You need to develop a format quickly, or many different people are contributing on an ad-hoc basis at different times, and it's not possible to have a fixed document design.
This pattern is more common for private formats or technical ones, such as configuration for a server or a marshaling format. It also is a good match for Extreme Programming projects because you can get it working quickly, refactoring later to use another mechanism if needed.
Don't design a format at all, and drop validation. Have a technical solution -- that is, a marshaller -- drive the XML generation. As data structures in your program change, the generated XML changes. In both .NET and Java the marshaller uses reflection and extra metadata (.NET CLR attributes or JavaBean BeanInfo classes) to find the read/write properties of a class. It moves recursively through the object graph, generating a tree of XML elements named after the accessors. For example, these two classes:
public class Person {
    public String getName() { ... }
    public void setName(String name) { ... }
    public Address getAddress() { ... }
    public void setAddress(Address address) { ... }
}

public class Address {
    public String getCity() { ... }
    public void setCity(String city) { ... }
    public String getState() { ... }
    public void setState(String state) { ... }
}
might be marshalled as
<person>
  <name>Kyle Downey</name>
  <address>
    <city>Forest Hills</city>
    <state>Queens</state>
  </address>
</person>
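To make the mechanism concrete, here is a minimal sketch of such a marshaller using plain Java reflection. It is not the API of any real library: `DynamicMarshaller`, `readWriteProperties`, and `marshal` are hypothetical names, the bean bodies are filled in only so the example runs, and a production marshaller would also consult BeanInfo metadata and handle collections, attributes, and cyclic object graphs. Properties are emitted in alphabetical order here purely to keep the output stable, so `address` comes before `name`.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Toy marshaller for illustration only; every name in this class is hypothetical.
final class DynamicMarshaller {

    public static class Address {
        private String city, state;
        public String getCity() { return city; }
        public void setCity(String city) { this.city = city; }
        public String getState() { return state; }
        public void setState(String state) { this.state = state; }
    }

    public static class Person {
        private String name;
        private Address address;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Address getAddress() { return address; }
        public void setAddress(Address address) { this.address = address; }
    }

    /** Maps property name to getter for every getX() that has a matching setX(). */
    static Map<String, Method> readWriteProperties(Class<?> type) {
        Map<String, Method> getters = new TreeMap<>(); // sorted, so output order is stable
        for (Method m : type.getMethods()) {
            String n = m.getName();
            if (!n.startsWith("get") || n.length() == 3 || m.getParameterCount() != 0) {
                continue;
            }
            try {
                type.getMethod("set" + n.substring(3), m.getReturnType());
            } catch (NoSuchMethodException e) {
                continue; // read-only property (this also skips getClass())
            }
            getters.put(Character.toLowerCase(n.charAt(3)) + n.substring(4), m);
        }
        return getters;
    }

    /** Walks the object graph recursively, emitting one element per read/write property. */
    static String marshal(String element, Object value) {
        StringBuilder xml = new StringBuilder("<" + element + ">");
        if (value instanceof CharSequence || value instanceof Number || value instanceof Boolean) {
            xml.append(escape(value.toString()));
        } else {
            for (Map.Entry<String, Method> p : readWriteProperties(value.getClass()).entrySet()) {
                try {
                    Object child = p.getValue().invoke(value);
                    if (child != null) { // null properties are simply omitted
                        xml.append(marshal(p.getKey(), child));
                    }
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("cannot read " + p.getKey(), e);
                }
            }
        }
        return xml.append("</").append(element).append(">").toString();
    }

    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Calling `DynamicMarshaller.marshal("person", person)` on a populated `Person` yields the same tree as the listing above, minus the whitespace. Adding a property to either class changes the generated XML with no further work, which is the whole point of the pattern.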
Before sitting down to do a potentially complex document design, you should always ask yourself if a dynamic, data-driven format might be sufficient. Most XML-aware development platforms provide at least one library that will take an object and convert it into XML. You've done the object design, and in a couple lines of code, you've done your document design as well. If you're on a tight deadline, this is a potentially big time-saver for the development team.
But not so fast. Dynamic document most likely isn't an option for you if
None. This is the "zero design pattern design." Once you start to involve other patterns, you're enforcing a human design rather than having a dynamic document.
Wherever possible, define the format using existing standards, referencing their elements by namespace rather than rolling your own. For example, add metadata to your documents using RDF and the Dublin Core elements rather than inventing your own <author> and <description> tags. This allows for independent evolution of markup by the parties who know the business domain best.
You have an existing or planned document format that provides common types of data using its own, proprietary elements and types, and you're forced to maintain and understand that subset of data yourself, even though you're not a domain specialist.
With all the standardization work out there, just about any business-oriented document problem presents an opportunity for defining some elements with Composition.
XML namespaces make it very easy to import entire elements from
one spec to another. Let's say you're designing a format for
capturing use cases. You want to include attribution
information: who wrote it, when, etc. You might want to
consider using the Dublin
Core RDF elements instead of defining your own
<author> and other meta-information tags:
<uc:use-case
    xmlns:uc="http://example.com/my/usecase.xsd"
    id="3">
  <uc:metadata>
    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description>
        <dc:title>Irritate Customer</dc:title>
        <dc:creator>Kyle Downey</dc:creator>
        <dc:date>2002-03-08</dc:date>
        <dc:format>text/xml</dc:format>
        <dc:language>en</dc:language>
        <dc:contributor>Amber Archer Consulting Co., Inc.</dc:contributor>
        <dc:identifier>UC#3</dc:identifier>
      </rdf:Description>
    </rdf:RDF>
  </uc:metadata>
  ...
</uc:use-case>
In your use case schema you would have (in part):
<schema
    xmlns="http://www.w3.org/2001/XMLSchema"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <import
      namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      schemaLocation="http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-rdf.xsd"/>
  <element name="metadata">
    <complexType>
      <sequence>
        <element ref="rdf:RDF"/>
      </sequence>
    </complexType>
  </element>
</schema>
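On the consuming side, Composition means code looks up the borrowed elements by namespace URI rather than by prefix or by a home-grown tag name. The sketch below shows one way to do that with the JDK's DOM parser. It assumes the Dublin Core element namespace `http://purl.org/dc/elements/1.1/`; `ComposedMetadataReader` and `dublinCore` are hypothetical names invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads Dublin Core fields out of a composed document.
final class ComposedMetadataReader {

    // Assumed namespace for the Dublin Core element set.
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of the first Dublin Core element with this local name, or null. */
    static String dublinCore(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookup
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = doc.getElementsByTagNameNS(DC_NS, localName);
            return hits.getLength() == 0 ? null : hits.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("cannot read metadata", e);
        }
    }
}
```

Because the lookup keys on the namespace, an element called `creator` in your own namespace is never mistaken for the Dublin Core one, and a document author is free to pick any prefix.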
One of the strong arguments for Composition --
aside from the well-documented programmer's virtue of laziness --
is that you can lean on the more specialized knowledge of
others. The people who put together Dublin Core put a lot of
thought into how to best represent document metadata. They have
been doing it since 1994. Most likely, you've been thinking about
how to put meta-information into your document since two
paragraphs ago. There's no match. So your choice is either to get
taken down by an angry librarian who's breaking noses and taking
names or reuse the work. This design pattern recommends the
latter.
As RDF and Dublin Core evolve, all you have to do is change
the namespace and the import statement to point to a newer
version of the schema, letting you take advantage of all the
latest and greatest ways of representing metadata, widgets,
documents, customers, fixed income instruments, or whatever it is
you're reusing with very little effort. This capacity for
concurrent evolution is, however, also the biggest gotcha in
Composition. Unless the promoters of your standard
have done the right thing and put version information in the
namespace and schema URI, there's a risk users in the field will
suddenly start getting backward-incompatible version 2.0 of the
schema and get very angry. So keep an eye on versioning, and if
necessary copy the schema to your own namespace and reuse from
there.
Even where you can't reuse a public XML schema, you can still look for common, reusable data clumps in your document formats. Let's put it this way: if you have five business processes involving customers and addresses, do you really need to define customer and address five times? Or even want to? Reuse through Composition can and should start inside your enterprise.
None from this catalog.
Include, as part of the document format, elements that annotate the content.
Your human-readable format is so cryptic that it makes grown hackers cry. Consider this fragment of Perl code, rendered as XML, that supposedly prints the entire Linux kernel when run:
<perlml>
@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
</perlml>
Note how it's much improved with just a little annotation:
<perlml>
<annotation>
You're not expected to understand this.
</annotation>
<code>
@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print
</code>
</perlml>
This pattern applies to documents that are meant to be viewed by people, or at least post-processed to generate documentation for people. Internal data structure formats like on-the-wire marshaling generally don't need annotation.
Add an element or elements to your XML schema to include
documentation. Generally you'll want to somehow tie the
documentation to each significant element, so you could consider a
base type -- for example, documentableType -- like
this:
<complexType name="documentableType">
  <sequence>
    <element name="annotation" type="string"/>
  </sequence>
</complexType>
XML comments are great, but if you find that they're becoming mandatory for users to decode your XML documents, maybe it's time to allow those annotations to be part of the XML itself. Probably the biggest win you get out of this (aside from standardizing where the comments go and how they're formatted using all the powerful features of XML Schema) is an ability to apply the rest of the XML toolkit to your documents. You could, for instance, write a "widgetdoc" XSLT stylesheet that takes your widget.xml files and converts them into an HTML document describing the widget, including all your extra annotations that might not mean much to your automatic widget-stamping machine that was reading the XML before, but will mean a lot to anyone debugging the machine's software.
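A "widgetdoc" stylesheet of this kind can be very small. The sketch below embeds one in a Java class and runs it with the JDK's built-in XSLT processor. The widget vocabulary, the `WidgetDoc` class, and its `toHtml` method are all hypothetical; the only thing taken from the pattern is the convention that a documented element carries an `annotation` child.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical "widgetdoc" generator: lists every annotated element as an HTML bullet.
final class WidgetDoc {

    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<ul><xsl:for-each select='//*[annotation]'>"   // every element with an annotation child
        + "<li><b><xsl:value-of select='name()'/></b>: "
        + "<xsl:value-of select='normalize-space(annotation)'/></li>"
        + "</xsl:for-each></ul>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to a widget document and returns the generated markup. */
    static String toHtml(String widgetXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(widgetXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("widgetdoc failed", e);
        }
    }
}
```

The widget-stamping machine can keep ignoring the `annotation` elements, while the same file now feeds a documentation build.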
There's a nice combination of Composition and
Self-Documenting Files. There are two well-known
formats for documentation in XML: DocBook and XHTML. DocBook is
specialized for technical documentation, and there are powerful
stylesheets out there for converting it to HTML and PDF. XHTML
is, obviously, very good for online presentation. So if you want
to be able to generate professional-quality documentation with
links and images from your own XML format, you should definitely
consider embedding XHTML or DocBook XML.
Define an explicit mechanism for splitting content into multiple files: a primary document and satellite ones that represent faster changing components or sections of content shared with other primary documents.
Your documents have become large and unwieldy, and you want to share pieces of them.
This pattern can apply to just about any format, but it seems to be more common in the technical arena.
Add to your schema an <import> or <include> element
that takes an href attribute which can be any valid
relative or absolute URI. Compliant processors for your format
will load and incorporate valid subdocuments in your format from
the URI.
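A processor for such a format might resolve inclusions along the following lines. This is only a sketch under stated assumptions: the element is called `include`, an in-memory map stands in for fetching each `href` over the network or from disk, and `IncludeResolver` with its depth limit is a hypothetical design, not part of any standard.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical processor that splices satellite documents into a primary one.
final class IncludeResolver {

    static final int MAX_DEPTH = 16; // crude guard against circular includes

    /** Loads the document stored under href and expands its include elements. */
    static Document resolve(String href, Map<String, String> store) {
        return resolve(href, store, 0);
    }

    private static Document resolve(String href, Map<String, String> store, int depth) {
        if (depth > MAX_DEPTH) {
            throw new IllegalStateException("includes nested too deeply at " + href);
        }
        String xml = store.get(href); // stand-in for dereferencing the URI
        if (xml == null) {
            throw new IllegalStateException("no such document: " + href);
        }
        Document doc = parse(xml);
        // The NodeList is live, so copy it before modifying the tree.
        NodeList found = doc.getElementsByTagName("include");
        List<Element> includes = new ArrayList<>();
        for (int i = 0; i < found.getLength(); i++) {
            includes.add((Element) found.item(i));
        }
        for (Element include : includes) {
            Document sub = resolve(include.getAttribute("href"), store, depth + 1);
            Node copy = doc.importNode(sub.getDocumentElement(), true);
            include.getParentNode().replaceChild(copy, include);
        }
        return doc;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse subdocument", e);
        }
    }
}
```

A real implementation would also resolve relative URIs against the including document and decide which locations it is willing to load from, since an unrestricted `href` is an obvious security concern.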
SOAP Messages with Attachments, a companion to SOAP 1.1, takes an interesting alternative approach to this problem, using Composition along the way. It co-opts the pre-existing MIME standard and allows a SOAP message to travel in a MIME multipart/related package, with the SOAP XML message as the root part and the other parts linked to it. This allows SOAP to behave something like the FTP protocol with separate "control" and "data" streams. You can send metadata about binary content and directives for what the recipient should do with it as part of the XML message and just attach the content directly to the message.
From #include to the humble href in
HTML, systems abound with ways to pull together content from
multiple locations. This makes documents more maintainable and
encourages basic reuse of common components, whether they're
shared stylesheet rules or whole XML schemas. While it may seem
hard to find instances where you wouldn't want to allow
sharing of document parts and file composition, as noted above
in forces there are potential complexity and security issues
with allowing inclusions.
You might want to make your Self-Documenting Files
refer to external documents rather than embedding them, and you
can use Composition by reusing the W3C standards for file
inclusion: XInclude and XML Base. But if you need to have
different meanings for including other files (as XSLT does with
its <xsl:import> and <xsl:include> elements) you might still
have to roll your own.
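If plain inclusion is all you need, the standard route costs almost no code in Java, because the JDK's DOM parser can process XInclude itself. The sketch below assumes the XInclude namespace `http://www.w3.org/2001/XInclude`; `XIncludeDemo` and its `demo` method, which writes two small files to a temporary directory, are hypothetical scaffolding for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo of standards-based inclusion with the JDK parser.
final class XIncludeDemo {

    /** Parses the primary document and lets the parser expand xi:include elements. */
    static Document parse(Path primary) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XInclude is identified by its namespace
            factory.setXIncludeAware(true);
            return factory.newDocumentBuilder().parse(primary.toFile());
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse " + primary, e);
        }
    }

    /** Writes a primary file and one satellite file, then parses the primary. */
    static Document demo() {
        try {
            Path dir = Files.createTempDirectory("xinclude-demo");
            Files.write(dir.resolve("note.xml"),
                    "<note>Shared</note>".getBytes(StandardCharsets.UTF_8));
            Path primary = dir.resolve("manual.xml");
            Files.write(primary,
                    ("<manual xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                    + "<title>Guide</title><xi:include href=\"note.xml\"/></manual>")
                    .getBytes(StandardCharsets.UTF_8));
            return parse(primary);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The relative `href` is resolved against the location of the including file, so the primary and satellite documents can move together without edits.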
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.