Modeling XML Vocabularies with UML: Part I

August 22, 2001

Dave Carlson

The arrival of the W3C's XML Schema specification has evoked a variety of responses from software developers, system integrators, XML document analysts, authors, and designers of B2B vocabularies. Some like the richer structure and semantics that can be expressed with these new schemas as compared to DTDs, while others complain about excessive complexity. Many find that the resulting schemas are difficult to share with wider audiences of users and business partners.

I look past many of these differences of opinion to view XML Schema simply as implementation syntax for models of business vocabularies. Other forms of model representation and presentation are more effective than W3C XML Schema when specifying new vocabularies or sharing definitions with users. In particular, I favor the Unified Modeling Language (UML) as a widely adopted standard for system specification and design. My goal in this article and in this series is to share some thoughts about how these two standards are complementary and to work through a simple example that makes the ideas concrete.

Although this discussion is focused on the W3C XML Schema specification, the same concepts are easily transferred to other XML schema languages. Indeed, I have already applied the same techniques to creating and reverse engineering DTDs and SOX schemas, as well as RELAX, TREX, and RELAX NG. In general, I use the term "schema" when referring to the family of XML schema languages.

The Role of Models in XML Applications

It can be difficult to understand the breadth of a large multi-enterprise system. Most people need to divide and conquer the problem by working with a set of alternative models and views, each of which deliberately ignores aspects of the system that are not relevant to its purpose. Building such models is fundamental to the way we cope with the complexity of everyday life: ignoring unnecessary details lets us focus on the task at hand. Different stakeholder groups have different needs with respect to abstraction and focus.

In the context of B2B system integration, all business partners must agree on the information models that define the vocabulary for task-oriented communication. The models include both the data structure for XML documents that are exchanged, as well as the process models of the extended dialogs that are required to complete complex business transactions.

Historically, in system analysis and design, a variety of techniques, tools, and methodologies has existed for guiding and supporting these alternative models of system structure and behavior. In the absence of formal methods or tools, models are created using PowerPoint, Visio, or paper and pencil to help communicate a system's purpose and function. And when there are no written models, system architects work from mental models as a way to comprehend the whole and its parts. An XML schema is also a vocabulary model written in the syntax of that specification language.

A high-level process for developing XML vocabularies is shown in Figure 1 below. It includes three decision points that determine the final vocabulary definition, regardless of which schema language is used. Data-oriented versus text-oriented applications may have different usage requirements. For example, a data-oriented vocabulary can be optimized for serialization of objects or database query results and its constraints should be carefully aligned with the data-types and referential integrity constraints of its sources. These data-oriented documents may never be viewed by humans, other than by developers testing the application.

A text-oriented vocabulary often has human users who need to edit the XML documents, with or without the assistance of GUI editing tools. Its structure must be easily understood by people who write stylesheets that transform and present the documents' content. An XML vocabulary design that works perfectly for data interchange might cause human users unnecessary pain and distress. Don't forget the needs of your users when creating the XML schema!

Figure 1: UML activity diagram for schema development process

The process diagram in Figure 1 is a UML activity diagram, which is one of nine diagram types defined by that standard. This diagram was created using Rational Rose, one of the most widely used UML modeling tools. Most of our discussion, however, is focused on the UML class diagram that is used to specify the static information structure of a system's XML vocabulary in our application context.

What is UML?

The Unified Modeling Language (UML) defines a standard language and graphical notation for creating models of business and technical systems. Contrary to popular opinion, UML is not limited to use as a tool for programmers. The UML defines model types that span a range from functional requirements and activity workflow models to class structure design and component diagrams. These models, and a development process that uses them, improve and simplify communication among an application's many diverse stakeholders.

A UML class diagram can be constructed to represent the elements, relationships, and constraints of an XML vocabulary visually. With a little initial coaching, class diagrams allow complex vocabularies to be shared with non-technical business stakeholders. A very simple subset of a product catalog vocabulary is shown as a class diagram in Figure 2 [1].

Figure 2: A simple UML class diagram

The primary elements of a UML class diagram are as follows.

  • Class -- this example defines two classes: CatalogItem and Organization. A class represents an aggregation of structural features and defines a namespace for those feature names. Thus, both classes can contain an attribute named "name" but their class namespace scope makes the two attributes distinct.
  • Attribute -- each class may optionally define a set of attributes. Each attribute has a type; in this example string, double, and float refer to the built-in datatypes as defined by the XML Schema specification. For those of you thinking ahead to XML schema design, specifying a UML attribute does not limit the schema to an XML attribute; the mapping to schema syntax allows either an XML attribute or child element.
  • Operation -- the computeTax() operation of CatalogItem specifies part of the behavior for this class. In other words, what does the class do, in addition to defining the structure of its data? In object-oriented parlance, if you send a computeTax message to a CatalogItem object, it will return a floating-point data value. This operation does not expect any parameters, but they could be specified between the parentheses. We will not use class operations in the specification of XML vocabulary, but their definition would be critical to Web Services, especially a WSDL specification of SOAP messages.
  • Association -- an association relates two or more classes in a model. If an association has an arrow on one end, it means that the association is usually navigated in one direction and provides a hint to design and implementation of this vocabulary.
  • Role & Multiplicity -- the end of an association may specify the role of the class; the Organization plays a supplier role for a CatalogItem in this model. In addition, the "1..*" multiplicity means that there must be one or more suppliers for each catalog item.
  • Generalization -- although Figure 2 does not include class inheritance, this structure is fundamental to object-oriented models and is included in the next expanded example.
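To make these elements concrete for readers who think in code, here is a minimal Python sketch of the two classes described above. It is an illustration only: the text confirms the class names, the shared attribute name "name", the computeTax() operation, and the supplier role with "1..*" multiplicity. The list_price and tax_rate attributes, and the tax formula, are hypothetical and exist only to make the sketch runnable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Organization:
    # "name" here is scoped to Organization, so it does not clash
    # with CatalogItem.name below.
    name: str


@dataclass
class CatalogItem:
    name: str
    list_price: float        # hypothetical attribute, not taken from Figure 2
    tax_rate: float = 0.0    # hypothetical attribute, not taken from Figure 2
    # Association end: Organization plays the "supplier" role, multiplicity 1..*
    supplier: List[Organization] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the lower bound of the "1..*" multiplicity.
        if len(self.supplier) < 1:
            raise ValueError("a CatalogItem needs at least one supplier (1..*)")

    def computeTax(self) -> float:
        # The operation takes no parameters and returns a floating-point value;
        # the formula itself is an assumption.
        return self.list_price * self.tax_rate
```

The association is navigable from CatalogItem to Organization only, which mirrors an arrow on the supplier end: an item knows its suppliers, but an Organization object holds no reference back to its items.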

Conceptual Models of XML Vocabulary

Now that you understand the basics of UML class diagrams, let's apply them to a larger XML vocabulary design. We'll work with the purchase order vocabulary that is used in the XML Schema Part 0: Primer document. That example is first introduced in section 2.1 and then elaborated throughout the W3C specification. The model defined in this article adds international addresses and multi-schema support as explained in section 4.1 of the W3C specification. If you are new to XML Schema, I suggest that you review the Primer after reading this article, then compare our UML design process in these three articles with the same purchase order vocabulary in the schema specification.

The purchase order vocabulary is defined in two modules, corresponding to the core PurchaseOrder type and a separate reusable Address module specification. In UML, these modules are called packages. The first package specification is shown as a UML class diagram in Figure 3. The PurchaseOrder class has two attributes and three associations that define its structure. Several of these attributes include a multiplicity specification of [0..1], which means that those attribute values are optional, either 0 or 1 occurrences.

The Address class plays both a shipTo and billTo role in association with a PurchaseOrder. (Hint: these might become shipTo and billTo child elements in the schema.) The multiplicity of 1 means that a PurchaseOrder must have exactly one of each address role. On the Item class, notice that a quantity is of type QuantityType. This type is defined as another class in the UML model. In the same diagram, QuantityType is defined as a subclass of positiveInteger, which is annotated as coming from the XSD_Datatypes package in this UML model. Thus, a quantity is a specialized kind of positive integer.
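The multiplicities described so far can be sketched in Python as follows. This is a rough illustration, not the model itself: the class names and the shipTo and billTo roles come from the text, while the attribute names (orderDate, comment, productName) and the Address fields are assumptions borrowed from the Primer's purchase order example, and quantity is shown as a plain int rather than QuantityType.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Address:
    # Field names assumed from the Primer's purchase order example.
    name: str
    street: str
    city: str


@dataclass
class Item:
    productName: str   # assumed attribute name
    quantity: int      # QuantityType in the model; simplified to int here


@dataclass
class PurchaseOrder:
    # Multiplicity 1: each address role is required exactly once.
    shipTo: Address
    billTo: Address
    # Third association: the items that belong to this order.
    items: List[Item] = field(default_factory=list)
    # Multiplicity [0..1]: optional attributes default to None.
    orderDate: Optional[date] = None
    comment: Optional[str] = None
```

Leaving out shipTo or billTo fails at construction time, which is the object-level counterpart of a required child element, while the [0..1] attributes may simply be omitted.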

Both QuantityType and SKU are user-defined data-types, and both include an attribute that further restricts their intended usage. The pattern and maxExclusive attributes are assigned a value that is used at later stages of the design process to guide XML Schema generation. Finally, the class name of Address is shown in italics, which means that it is an abstract class that is not intended to be used directly. As we'll see next, Address is further specified in another UML class diagram.
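As a rough illustration of what these restricting attributes mean, the sketch below models the two user-defined datatypes in Python. The limit of 100 and the SKU pattern are the values used in the Primer's purchase order example and are assumed here to match the model; the PositiveInteger class is a hypothetical stand-in for the XSD_Datatypes package.

```python
import re


class PositiveInteger(int):
    """Hypothetical stand-in for the built-in positiveInteger datatype."""

    def __new__(cls, value):
        number = super().__new__(cls, value)
        if number < 1:
            raise ValueError("a positiveInteger must be 1 or greater")
        return number


class QuantityType(PositiveInteger):
    """A specialized positive integer, restricted by maxExclusive."""

    maxExclusive = 100  # assumed value, as in the Primer example

    def __new__(cls, value):
        number = super().__new__(cls, value)
        if number >= cls.maxExclusive:
            raise ValueError("quantity must be less than %d" % cls.maxExclusive)
        return number


class SKU(str):
    """A string restricted by a pattern."""

    pattern = r"\d{3}-[A-Z]{2}"  # assumed value, as in the Primer example

    def __new__(cls, value):
        # fullmatch: the whole value must match, as with a schema pattern facet.
        if re.fullmatch(cls.pattern, value) is None:
            raise ValueError("%r does not match %s" % (value, cls.pattern))
        return super().__new__(cls, value)
```

Here the restriction is checked when a value is created; in the schema generated later in the design process, the same values would instead guide facets on the corresponding simple types.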

Figure 3: Conceptual model of purchase order vocabulary

The Address package specification, shown in Figure 4, follows a similar logic. In this diagram, both USAddress and UKAddress are specialized subtypes of Address. In common object-oriented usage, this means that both of these subtypes inherit the three attributes defined in their superclass. The exportCode attribute of UKAddress is assigned an initial value of 1.
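The inheritance structure can be sketched in Python as shown here. The class names, the abstract Address, the three inherited attributes, and the initial value of exportCode come from the text. The attribute names themselves (name, street, city, state, zip_code, postcode) are assumptions based on the Primer's address example, and the instantiation check is just one simple way to imitate an abstract class.

```python
from dataclasses import dataclass


@dataclass
class Address:
    # Three attributes inherited by every subtype (names assumed from the Primer).
    name: str
    street: str
    city: str

    def __post_init__(self) -> None:
        # Address is abstract in the model: only its subtypes may be created.
        if type(self) is Address:
            raise TypeError("Address is abstract; use USAddress or UKAddress")


@dataclass
class USAddress(Address):
    state: str      # assumed attribute name
    zip_code: str   # assumed attribute name


@dataclass
class UKAddress(Address):
    postcode: str        # assumed attribute name
    exportCode: int = 1  # initial value of 1, as stated for Figure 4
```

Because both subtypes are kinds of Address, either one can fill the shipTo or billTo role of a PurchaseOrder.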

Figure 4: Modularized Address schema component

Design Models of XML Schemas

Now that we've created a conceptual model of our XML vocabulary's content and gained approval from all business and technical stakeholders, what next? As hinted in previous sections, there are numerous alternatives available when mapping this model to XML schema constructs. Are the UML attributes and association ends mapped to XML attributes or elements? How is UML's generalization of classes and datatypes mapped to schema definitions? How does this mapping differ when the target schema language is changed from W3C XML Schema to RELAX NG? What about DTDs?
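To see why the first of these questions matters, consider this small Python sketch, which serializes the same set of UML attribute values in two ways. The to_xml helper and the sample item values are hypothetical; the sketch says nothing about which mapping the final schema will use.

```python
import xml.etree.ElementTree as ET


def to_xml(tag, values, as_attributes):
    """Serialize UML attribute values as XML attributes or child elements.

    Names listed in as_attributes become XML attributes; all other
    names become child elements. (Hypothetical helper for illustration.)
    """
    element = ET.Element(tag)
    for name, value in values.items():
        if name in as_attributes:
            element.set(name, str(value))
        else:
            ET.SubElement(element, name).text = str(value)
    return element


# Sample values, invented for this example.
item = {"partNum": "926-AA", "productName": "Baby Monitor", "quantity": 1}

# Mapping A: every UML attribute becomes a child element.
element_style = ET.tostring(to_xml("item", item, set()), encoding="unicode")

# Mapping B: partNum becomes an XML attribute, the rest stay elements.
mixed_style = ET.tostring(to_xml("item", item, {"partNum"}), encoding="unicode")

print(element_style)
print(mixed_style)
```

Both documents carry the same information from the same conceptual model; only the design-level mapping differs, which is why that decision can be deferred until after the conceptual model is agreed.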

If you refer back to the schema development process illustrated in Figure 1, the next design task depends on whether this vocabulary is data-oriented or text-oriented. Because the purchase order vocabulary is data-oriented, most of the remaining design decisions relate to deployment issues: developer conventions for using XML attributes or child elements, datatype alignment with other sources and destinations of data to be exchanged using this vocabulary, and anticipated future requirements for extending this vocabulary or combining it with other XML namespaces.

If this were a text-oriented application, then content managers and authors would have further input on design choices. For example, most human authors prefer XML document structures that avoid excessive use of container elements to group related content elements, whereas this is common practice in data-oriented applications. Also, the order of elements in a document is often more important to human authors and readers than it is to data parsing.

The focus of the present article has been capturing the conceptual model of a vocabulary, which is the logical first step in the development process. The next article presents a list of design choices and alternative approaches for mapping UML to W3C XML Schema. The UML model presented in this first article will be refined to reflect the design choices made by the authors of the W3C's XML Schema Primer, where this example originated. For our purposes, these authors are the stakeholders of system requirements.

The third article will introduce a UML profile for XML schemas that allows all detailed design choices to be added to the model definition and then used to generate a complete schema automatically. The result is a UML model that is used to generate a W3C XML Schema, which can successfully validate XML document instances copied from the Schema Primer specification. Along the way, I'll introduce a web tool used to generate schemas from UML and reverse engineer schemas into UML.

Tips for Success

In order to help you when applying these ideas to your own e-business projects, I offer the following tips for success:

  1. Your e-business vocabulary defines an agreement or contract with all related business parties. Plan its specification accordingly. Get input on requirements of all key stakeholders using the visual models of UML to improve communication.
  2. Define all known terms, associations, and constraints and document their purpose, source, and usage. Do not restrict your specifications to the limited expressiveness of DTDs or even to the expanded W3C XML Schema language. Using UML, you can capture a complete specification and then transform it to one or more XML schema languages. Documentation notes added to the UML model can be automatically transformed to annotations in the XML schema.
  3. Create a common UML model that drives both the XML schema definition and other non-XML system components. Many systems use XML in a subset of their components, but the analysis must be done holistically.

[1] David Carlson. Modeling XML Applications with UML: Practical E-Business Applications. Boston: Addison-Wesley, 2001. This book follows a full system development life-cycle based on a product catalog application design.

[2] Martin Fowler, Kendall Scott. UML Distilled, Second Edition. Boston: Addison-Wesley, 2000.

[3] Object Management Group (OMG) UML resources,