Managing Enumerations in W3C XML Schemas
When working with data-oriented XML, there is often a requirement to handle "controlled vocabularies", otherwise known as enumerated values. Consider the following example of a bank account summary:
<accountSummary>
<timestamp>2003-01-01T12:25:00</timestamp>
<currency>USD</currency>
<balance>2703.35</balance>
<interest rounding="down">27.55</interest>
</accountSummary>
There are two controlled vocabularies in this document. One is the
currency, which is an
ISO-4217 3-letter currency code ("USD" is US Dollar). The
other is the rounding direction for the interest, which can be
"up", "down", or "nearest". The
bank in this example prefers to round the interest down.
The problem in designing this schema is that the ISO 3-letter currency codes are externally controlled. They can change at any time. If you embed them in your schema, you need to reissue the schema every time ISO makes a change, which can be expensive. This is especially true in enterprise situations where any schema change, no matter how small, can require full retesting of any applications that use the schema. This needs to be avoided whenever possible.
In this article, we will discuss how controlled vocabularies can be managed when using W3C XML Schemas, since this is the dominant XML schema format for data-oriented XML. Note that the "vocabularies" we refer to are enumerated lists of element and attribute values. This differs from other contexts where "vocabularies" are sets of XML element names.
Before worrying about which controlled vocabularies are out of our control, the first thing to do is create a schema, using W3C XML Schema, for the account summaries. For the purposes of this article, we will use just a subset of the ISO 3-letter currency codes. A suitable schema is
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
version = "1.0"
elementFormDefault = "qualified">
<xsd:element name = "accountSummary">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref = "timestamp"/>
<xsd:element ref = "currency"/>
<xsd:element ref = "balance"/>
<xsd:element ref = "interest"/>
</xsd:sequence>
<xsd:attribute name = "version" use = "required">
<xsd:simpleType>
<xsd:restriction base = "xsd:string">
<xsd:pattern value = "[1-9]+[0-9]*\.[0-9]+"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
</xsd:complexType>
</xsd:element>
<xsd:element name = "timestamp" type = "xsd:dateTime"/>
<xsd:element name = "currency" type = "iso3currency"/>
<xsd:element name = "balance" type = "xsd:decimal"/>
<xsd:element name = "interest">
<xsd:complexType>
<xsd:simpleContent>
<xsd:extension base = "xsd:decimal">
<xsd:attribute name = "rounding" use = "required"
type = "roundingDirection"/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
<xsd:simpleType name = "iso3currency">
<xsd:annotation>
<xsd:documentation>ISO-4217 3-letter currency codes,
as defined at
http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90.xalter
or available from
http://www.xe.com/iso4217.htm
Only a subset are defined here.</xsd:documentation>
</xsd:annotation>
<xsd:restriction base = "xsd:string">
<xsd:enumeration value = "AUD"/><!-- Australian Dollar -->
<xsd:enumeration value = "BRL"/><!-- Brazilian Real -->
<xsd:enumeration value = "CAD"/><!-- Canadian Dollar -->
<xsd:enumeration value = "CNY"/><!-- Chinese Yuan -->
<xsd:enumeration value = "EUR"/><!-- Euro -->
<xsd:enumeration value = "GBP"/><!-- British Pound -->
<xsd:enumeration value = "INR"/><!-- Indian Rupee -->
<xsd:enumeration value = "JPY"/><!-- Japanese Yen -->
<xsd:enumeration value = "RUR"/><!-- Russian Rouble -->
<xsd:enumeration value = "USD"/><!-- US Dollar -->
<xsd:length value = "3"/>
</xsd:restriction>
</xsd:simpleType>
<xsd:simpleType name = "roundingDirection">
<xsd:annotation>
<xsd:documentation>Whether the interest is
rounded up, down or to the
nearest round value.</xsd:documentation>
</xsd:annotation>
<xsd:restriction base = "xsd:string">
<xsd:enumeration value = "up"/>
<xsd:enumeration value = "down"/>
<xsd:enumeration value = "nearest"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
Notice the two controlled vocabularies (enumerations), the simple types
iso3currency and roundingDirection. For
iso3currency, the length of the enumeration strings is
explicitly set to 3, to help catch careless editing errors in the future when
the list of currencies needs to be updated.
Note also that the schema's optional version attribute has
been set to "1.0". When working with data-oriented XML messages, it is
usually necessary to support multiple versions of the message schema
concurrently, as the systems that use the message schema will probably not
be able to upgrade to the latest version simultaneously. So, it is vital
to identify the schema version that an XML message was validated
against. In keeping with this, we will name our schema
accountSummary-1.0.xsd, so that future versions won't
overwrite the current version.
Further, a version attribute has been added to the
accountSummary element, so that message instances clearly
identify their schema version. It is assumed that the version numbers have
the form M.N where M is the major version number
and N is the minor version number. With this change, plus the
schema, the account summary now becomes
<accountSummary
version = "1.0"
xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation = "accountSummary-1.0.xsd">
<timestamp>2003-01-01T12:25:00</timestamp>
<currency>USD</currency>
<balance>2703.35</balance>
<interest rounding = "down">27.55</interest>
</accountSummary>
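As an illustration of how an application might use the version attribute, the following Python sketch reads it from an instance and derives the name of the matching schema file. It assumes the accountSummary-M.N.xsd naming scheme described above; the function name schema_file_for is hypothetical and not part of any library.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern the schema applies to the version attribute.
VERSION_PATTERN = re.compile(r"[1-9]+[0-9]*\.[0-9]+")

EXAMPLE = """<accountSummary
    version = "1.0"
    xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation = "accountSummary-1.0.xsd">
  <timestamp>2003-01-01T12:25:00</timestamp>
  <currency>USD</currency>
  <balance>2703.35</balance>
  <interest rounding = "down">27.55</interest>
</accountSummary>"""


def schema_file_for(instance_xml: str) -> str:
    """Return the schema file name that matches the instance's version.

    Hypothetical helper: assumes schemas are named accountSummary-M.N.xsd.
    """
    root = ET.fromstring(instance_xml)
    if root.tag != "accountSummary":
        raise ValueError(f"unexpected root element: {root.tag}")
    version = root.get("version")
    if version is None or not VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"missing or malformed version: {version!r}")
    return f"accountSummary-{version}.xsd"


print(schema_file_for(EXAMPLE))  # accountSummary-1.0.xsd
```

Selecting the schema from the instance's own version attribute, rather than from the schema location hint, lets the receiving application decide where its schema files live.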
When dealing with controlled vocabularies (enumerations) in schemas, it is a good idea to rate the volatility of each vocabulary. A volatile vocabulary is one which is expected to change independently of the normal release cycle of schema versions. A stable vocabulary is one which is expected to change (if at all) only as new schema versions are released. Volatile vocabularies are a problem if embedded in a schema because they impose extra releases on all dependent applications.
In our example of an account summary, the currency codes are a volatile
vocabulary: they are externally controlled by ISO, and currencies can be
added or removed by ISO at any time. On the other hand, the set of the
rounding directions {"up", "down", "nearest"} is unlikely to
change, so it is a stable vocabulary. From the point of view of somebody
maintaining an application which deals with account summaries, adding a
new rounding direction would mean writing, testing, and deploying a new
version of the application. Political pressure would dictate that rounding
values would only ever change as part of the planned release cycle of the
schema. So it makes sense to leave the roundingDirection
simple type embedded in the schema.
However, it is unlikely that an application would need to be recoded just to handle a change in the set of currency codes; if it did, that would be a sign of an inflexible design. As the currency codes are externally controlled, they need to be isolated: we do that by creating a separate vocabulary schema for them. A vocabulary schema is one which contains a single simple type definition with enumerated values and nothing else. The vocabulary schema for the currencies is
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
version = "1.0"
elementFormDefault = "qualified">
<xsd:simpleType name = "iso3currency">
<xsd:annotation>
<xsd:documentation>ISO-4217 3-letter currency codes,
as defined at
http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90.xalter
or available from
http://www.xe.com/iso4217.htm
Only a subset are defined here.</xsd:documentation>
</xsd:annotation>
<xsd:restriction base = "xsd:string">
<xsd:enumeration value = "AUD"/><!-- Australian Dollar -->
<xsd:enumeration value = "BRL"/><!-- Brazilian Real -->
<xsd:enumeration value = "CAD"/><!-- Canadian Dollar -->
<xsd:enumeration value = "CNY"/><!-- Chinese Yuan -->
<xsd:enumeration value = "EUR"/><!-- Euro -->
<xsd:enumeration value = "GBP"/><!-- British Pound -->
<xsd:enumeration value = "INR"/><!-- Indian Rupee -->
<xsd:enumeration value = "JPY"/><!-- Japanese Yen -->
<xsd:enumeration value = "RUR"/><!-- Russian Rouble -->
<xsd:enumeration value = "USD"/><!-- US Dollar -->
<xsd:length value = "3"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
and is named iso3currency-1.0.xsd. As you see, the
currency vocabulary now has its own version numbers and, thus, its own
release cycle. The vocabulary schema can now be included in the new
version (1.1) of the main message schema:
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
version = "1.1"
elementFormDefault = "qualified">
<xsd:include schemaLocation = "iso3currency-1.0.xsd"/>
<xsd:element name = "accountSummary">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref = "timestamp"/>
<xsd:element ref = "currency"/>
<xsd:element ref = "balance"/>
<xsd:element ref = "interest"/>
</xsd:sequence>
<xsd:attribute name = "version" use = "required">
<xsd:simpleType>
<xsd:restriction base = "xsd:string">
<xsd:pattern value = "[1-9]+[0-9]*\.[0-9]+"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
</xsd:complexType>
</xsd:element>
<xsd:element name = "timestamp" type = "xsd:dateTime"/>
<xsd:element name = "currency" type = "iso3currency"/>
<xsd:element name = "balance" type = "xsd:decimal"/>
<xsd:element name = "interest">
<xsd:complexType>
<xsd:simpleContent>
<xsd:extension base = "xsd:decimal">
<xsd:attribute name = "rounding" use = "required" type = "roundingDirection"/>
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
<xsd:simpleType name = "roundingDirection">
<xsd:annotation>
<xsd:documentation>Whether the interest is
rounded up, down or to the
nearest round value.</xsd:documentation>
</xsd:annotation>
<xsd:restriction base = "xsd:string">
<xsd:enumeration value = "up"/>
<xsd:enumeration value = "down"/>
<xsd:enumeration value = "nearest"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
and this is accountSummary-1.1.xsd according to our naming
scheme. Note that the currency codes no longer appear in the main
schema.
The problem with accountSummary-1.1.xsd is that it
directly includes iso3currency-1.0.xsd. When a new version of
the ISO currency vocabulary schema is released, you still
have to release a new version of the account summary schema. What is
needed is a mechanism to decouple the vocabulary schema versions from the
main schema versions. The simple solution is to use an unversioned
"pass-through" vocabulary schema:
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
elementFormDefault = "qualified">
<xsd:include schemaLocation = "iso3currency-1.0.xsd"/>
</xsd:schema>
This unversioned vocabulary schema has no version
attribute and is named iso3currency.xsd. To complete the
decoupling, a new version of the main schema,
accountSummary-1.2.xsd, is released. The only change from
version 1.1 is that the <xsd:include> changes from
<xsd:include schemaLocation = "iso3currency-1.0.xsd"/>
to
<xsd:include schemaLocation = "iso3currency.xsd"/>
so that the unversioned currency vocabulary schema is included. The
decoupling is now complete. If ISO changes the list of currency codes, a
new currency schema is released and iso3currency.xsd is
updated so that it includes the new currency schema. The main schema does
not need to be changed, since it includes iso3currency.xsd
and is agnostic to the version of the currency vocabulary schema.
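The update step for the pass-through schema is mechanical, so it can be scripted. The Python sketch below is one possible way to do it, assuming the iso3currency-M.N.xsd naming scheme used in this article; the function names pass_through_schema and included_location are hypothetical.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Text of the unversioned pass-through schema, with the version left open.
PASS_THROUGH_TEMPLATE = """<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
            elementFormDefault = "qualified">
  <xsd:include schemaLocation = "iso3currency-{version}.xsd"/>
</xsd:schema>
"""


def pass_through_schema(version: str) -> str:
    """Return the text of iso3currency.xsd pointing at the given version."""
    return PASS_THROUGH_TEMPLATE.format(version=version)


def included_location(schema_text: str) -> str:
    """Return the schemaLocation of the single include in a pass-through schema."""
    root = ET.fromstring(schema_text)
    includes = root.findall(XSD + "include")
    if len(includes) != 1:
        raise ValueError("expected exactly one xsd:include")
    return includes[0].get("schemaLocation")


new_text = pass_through_schema("1.1")
print(included_location(new_text))  # iso3currency-1.1.xsd
```

Writing new_text over iso3currency.xsd is then the only change needed when a new vocabulary version is released; the main schema file is not touched.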
Decoupling vocabulary schemas like this is not without issues. First, as new versions of the currency vocabulary schema are released, existing instance files will become invalid if they contain currency codes which ISO has deleted. In some situations that would be unacceptable, but it makes sense here. If an instance file refers to a currency code that no longer exists, then it has become semantically invalid; it is not unreasonable for it to become syntactically invalid too. The invalid syntax can then be used to detect such instances and route them for special processing, so that the code in the main application can focus on what to do with valid currency codes. Being able to remove error handling from the main application means the main application code remains smaller and easier to maintain.
Second, with the currency codes able to change at any time, there needs
to be synchronization between the currency codes in the currency
vocabulary schema and the currency codes known to the applications. There
are two solutions to this. The first is that applications can use the
vocabulary schema as the source of the currency codes. Treating the
vocabulary schema as an XML file, a quick SAX parse is all you need to
pull out the <xsd:enumeration> elements containing the
allowed values. The second solution is to keep the currency codes in a
central relational database. Applications can access this table directly,
while the vocabulary schema can be dynamically generated from the same
table. Either method keeps the set of allowed values synchronized across
applications.
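Both approaches can be sketched in a few lines of Python. The example below generates a vocabulary schema from a database table and then reads the codes back with a SAX parse. It is only a sketch: the table name currency, its code column, and the function names are assumptions made for illustration, and an in-memory SQLite database stands in for the central relational database.

```python
import io
import sqlite3
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.sax.saxutils import quoteattr

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def generate_vocabulary_schema(conn, version):
    """Build the vocabulary schema text from the (hypothetical) currency table."""
    codes = [row[0] for row in conn.execute("SELECT code FROM currency ORDER BY code")]
    lines = [
        f'<xsd:schema xmlns:xsd="{XSD_NS}"',
        f'            version={quoteattr(version)}',
        '            elementFormDefault="qualified">',
        '  <xsd:simpleType name="iso3currency">',
        '    <xsd:restriction base="xsd:string">',
    ]
    lines += [f'      <xsd:enumeration value={quoteattr(code)}/>' for code in codes]
    lines += [
        '      <xsd:length value="3"/>',
        '    </xsd:restriction>',
        '  </xsd:simpleType>',
        '</xsd:schema>',
    ]
    return "\n".join(lines)


class EnumerationCollector(ContentHandler):
    """Collects the value attribute of every xsd:enumeration element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, name is a (uri, localname) pair.
        if name == (XSD_NS, "enumeration"):
            self.values.append(attrs.get((None, "value")))


def read_codes(schema_text):
    """SAX-parse a vocabulary schema and return its enumerated values."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = EnumerationCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(schema_text.encode("utf-8")))
    return handler.values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE currency (code TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO currency (code) VALUES (?)",
                 [("USD",), ("EUR",), ("GBP",)])
schema_text = generate_vocabulary_schema(conn, "1.1")
print(read_codes(schema_text))  # ['EUR', 'GBP', 'USD']
```

Because the schema is generated from the same table the applications query, the round trip returns exactly the codes stored in the table.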
Third, using such vocabulary schemas is only workable if applications can rely on them changing in one of two ways only: either an enumerated value is added or one is deleted.
Vocabulary schemas must never change structurally. If a new simple type, complex type, or element definition were added to a vocabulary schema, it could change the results of validating an instance with the main schema and cause a major application failure. So vocabulary schemas need to be "validated" to ensure that they contain just a single simple type definition with enumerated values. This is exactly the situation Will Provost described in "Working with a Metaschema".
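Since the only permitted changes are added and deleted values, the difference between two versions of a vocabulary schema reduces to two lists. The following Python sketch computes them. The two schema texts and the codes they contain are made up for illustration and do not describe a real ISO revision; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

OLD = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="1.0">
  <xsd:simpleType name="iso3currency">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="EUR"/>
      <xsd:enumeration value="GBP"/>
      <xsd:enumeration value="USD"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>"""

# Illustrative only: GBP dropped and CHF added to show both kinds of change.
NEW = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="1.1">
  <xsd:simpleType name="iso3currency">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="CHF"/>
      <xsd:enumeration value="EUR"/>
      <xsd:enumeration value="USD"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>"""


def enumerated_values(schema_text):
    """Return the set of enumerated values in a vocabulary schema."""
    root = ET.fromstring(schema_text)
    return {e.get("value") for e in root.iter(XSD + "enumeration")}


def vocabulary_changes(old_text, new_text):
    """Return (added, removed) values between two vocabulary schema versions."""
    old_values = enumerated_values(old_text)
    new_values = enumerated_values(new_text)
    return sorted(new_values - old_values), sorted(old_values - new_values)


added, removed = vocabulary_changes(OLD, NEW)
print(added, removed)  # ['CHF'] ['GBP']
```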
An obvious solution would be to write a schema for vocabulary schemas as the metaschema. In practice I don't do this. The existing "Schema for Schemas" is known not to be 100% correct in describing the W3C XML Schema syntax, and so schema editing tools use it as an indicative, rather than normative guide. This means that schema editors tend to ignore any attempt to impose a metaschema on a schema. For this reason, and because the vocabulary schema format is quite simple, I use the following Schematron schema:
<sch:schema
xmlns = "http://www.w3.org/2001/XMLSchema"
xmlns:sch = "http://www.ascc.net/xml/schematron"
xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation =
"http://www.ascc.net/xml/schematron schematron-1.5.xsd">
<sch:title>Controlled vocabulary validation</sch:title>
<!-- The input is assumed to be a valid W3C XML Schema. -->
<!-- This just checks that it is also a valid -->
<!-- vocabulary Schema. -->
<sch:pattern name = "controlled-vocabulary-schema">
<sch:rule context = "schema">
<sch:assert test = "count(*) = count(simpleType[@name])"
>The schema must contain only a
single simple type definition.</sch:assert>
<sch:assert test = "count(simpleType[@name]) = 1"
>The schema must contain exactly one simpleType
definition.</sch:assert>
</sch:rule>
<sch:rule context = "simpleType">
<sch:assert test = "@name"
>The simpleType must have a name.</sch:assert>
<sch:assert test = "count(restriction) = 1"
>The simpleType must contain a
single restriction.</sch:assert>
<sch:assert test = "count(*) = count(annotation)+count(restriction)"
>The simpleType may have an annotation as well as its
restriction, but no other structure.</sch:assert>
</sch:rule>
<sch:rule context = "restriction">
<sch:assert test = "enumeration"
>A restriction must contain enumerated values.</sch:assert>
</sch:rule>
<sch:rule context = "enumeration">
<sch:key name = "enumerationsByValue" path = "@value"/>
<sch:assert test = "count(key('enumerationsByValue', @value)) = 1"
>An enumerated value must be unique.</sch:assert>
</sch:rule>
</sch:pattern>
</sch:schema>
Under Windows, you can validate a vocabulary schema against this Schematron schema using the free validator from Topologi. For other platforms, see the list of tools in the Schematron Resource Directory. Chimezie Ogbuji introduced Schematron in "Validating XML with Schematron".
Schematron assertions are expressed using XPath expressions which must
evaluate to true. If they evaluate to false, a
Schematron validation error is generated. In our Schematron schema, note
the following:
Look at the rule for the schema context. It contains
the assertions that are applied to the <xsd:schema>
element in the vocabulary schema. The first assertion checks that the
only thing in the schema is <xsd:simpleType>
definitions. The second assertion checks that there is only
one <xsd:simpleType> definition.
The rule for the simpleType context asserts that the
<xsd:simpleType> must have a name
attribute, that the <xsd:simpleType> may contain
an <xsd:annotation> and must contain an
<xsd:restriction>, but cannot contain any other
elements.
The rule for the restriction context asserts that the
<xsd:restriction> must contain one or more enumerated
values.
The rule for the enumeration context asserts that the
enumeration values must be unique. This is checked using a Schematron
key (equivalent to an XSLT key). The expression
key('enumerationsByValue', @value) returns a list of the
<xsd:enumeration> elements with the same value as
the element being validated. If the values are unique, there will
always be just one <xsd:enumeration> element in the
list, the one being validated.
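Where no Schematron processor is available, the same four rules can be approximated in ordinary code. The Python sketch below is a stand-in for the Schematron schema, not a replacement for it: the function name check_vocabulary_schema and the wording of the messages are assumptions, and the input is assumed to be a well-formed W3C XML Schema document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

XSD = "{http://www.w3.org/2001/XMLSchema}"


def check_vocabulary_schema(schema_text):
    """Return a list of error messages; an empty list means all checks pass."""
    errors = []
    root = ET.fromstring(schema_text)
    if root.tag != XSD + "schema":
        return ["The document element must be xsd:schema."]

    # Rule for the schema context: only named simpleTypes, and exactly one.
    children = list(root)
    named_types = [c for c in children
                   if c.tag == XSD + "simpleType" and c.get("name") is not None]
    if len(children) != len(named_types):
        errors.append("The schema must contain only simpleType definitions.")
    if len(named_types) != 1:
        errors.append("The schema must contain exactly one simpleType definition.")

    # Rule for the simpleType context.
    for simple_type in root.iter(XSD + "simpleType"):
        parts = list(simple_type)
        restrictions = [p for p in parts if p.tag == XSD + "restriction"]
        annotations = [p for p in parts if p.tag == XSD + "annotation"]
        if simple_type.get("name") is None:
            errors.append("The simpleType must have a name.")
        if len(restrictions) != 1:
            errors.append("The simpleType must contain a single restriction.")
        if len(parts) != len(annotations) + len(restrictions):
            errors.append("The simpleType may contain only an annotation "
                          "and a restriction.")

    # Rule for the restriction context.
    for restriction in root.iter(XSD + "restriction"):
        if restriction.find(XSD + "enumeration") is None:
            errors.append("A restriction must contain enumerated values.")

    # Rule for the enumeration context: values must be unique.
    counts = Counter(e.get("value") for e in root.iter(XSD + "enumeration"))
    for value, count in counts.items():
        if count > 1:
            errors.append(f"Enumerated value {value!r} is not unique.")
    return errors


DUPLICATED = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="iso3currency">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="USD"/>
      <xsd:enumeration value="USD"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>"""

print(check_vocabulary_schema(DUPLICATED))
# ["Enumerated value 'USD' is not unique."]
```

Unlike Schematron, which reports each failed assertion per context node, this sketch simply accumulates messages in a list, which is enough to decide whether to accept a new vocabulary release.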
W3C XML Schemas can be made more manageable by separating volatile controlled vocabularies (enumerations) into their own vocabulary schemas. In this article, we have seen how to identify volatile controlled vocabularies, how to separate them from the main schema, how to decouple the versions, and how to validate vocabulary schemas. There is no absolute rule for when a controlled vocabulary should have its own schema. Use the guidelines here, but always use your own judgment and your knowledge of your problem domain.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.