Diagramming the XML Family
October 8, 2003
In this article we'll introduce some of the XML family members and discuss how they relate to one another. We'll then use these technologies to create a diagram of their relationships in order to demonstrate how they work together in practice. Of the hundreds of XML technologies in use, we'll limit the scope of this article to the technologies used in the creation of the diagram.
XML (eXtensible Markup Language) consists of a small set of rules which define a structured, text-based syntax for representing data. It isn't a language as such; rather it is a meta-language, a common syntax that can be shared across diverse standards and data models. But if XML doesn't really do anything, why have so many languages adopted the XML syntax? At a basic level, XML is:
- Simple -- easy to learn and use, especially for users familiar with HTML.
- Flexible -- can be used in many situations, from graphics, to communication protocols, to raw data.
- Open -- no licensing or pricing restrictions, and no vendor or platform lock-in.
W3C XML Schema
W3C XML Schema allows us to create vocabularies with XML by adding further restrictions to the core XML rules. These restrictions mainly consist of valid names for elements and attributes; which elements can be used inside another; valid repetition of elements; and the type of data for elements and attributes.
Schemas allow us to publish and share these rules for new vocabularies and check the validity of any files which claim adherence to a particular vocabulary. Unlike Document Type Definitions, XML Schema uses the XML syntax, allowing us to parse and query XML Schema files using standard XML tools. (See the XML.com article "Using W3C XML Schema" for more detail.)
Given the ability to create different vocabularies in XML, together with a feature that allows the combination of vocabulary elements into a single file, we are presented with the name collision problem.
For example, an XML schema for books could define a <table> element that defines a table of contents. A second schema for furniture could also define an element <table>. If data from both were to occupy the same XML file, it would be impossible to differentiate which <table> was which.
By defining a unique identifier, a namespace, for each XML vocabulary, we can group the elements under each identifier, so that XML software can identify each vocabulary element being used. (See the XML.com article "XML Namespaces by Example" for more information.)
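To make the <table> collision concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The two namespace URIs and the <catalogue> wrapper element are invented for illustration; they are not taken from any real vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, invented for this sketch.
BOOKS_NS = "http://example.com/ns/books"
FURNITURE_NS = "http://example.com/ns/furniture"

DOCUMENT = f"""
<catalogue xmlns:bk="{BOOKS_NS}" xmlns:fn="{FURNITURE_NS}">
  <bk:table>Chapter 1, Chapter 2</bk:table>
  <fn:table>Oak dining table</fn:table>
</catalogue>
"""

root = ET.fromstring(DOCUMENT)

# ElementTree expands each prefixed name to "{namespace-uri}local-name",
# so the two <table> elements end up with different qualified names.
contents = root.find(f"{{{BOOKS_NS}}}table")
furniture = root.find(f"{{{FURNITURE_NS}}}table")

print(contents.tag)
print(furniture.tag)
```

The prefixes bk and fn are only local shorthand; software compares the namespace URIs themselves, which is why the choice of identifier matters.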
Namespaces create another problem. It's fine to suggest that each XML vocabulary should be assigned a unique identifier, but how can you ensure that the unique identifier you choose hasn't already been used? Theoretically, there are no assurances that a namespace identifier hasn't been used by another vocabulary. However, by using a URI (Uniform Resource Identifier) you can greatly reduce the chance of a namespace collision.
A URI can be one of two types: a URN (Uniform Resource Name) or a URL (Uniform Resource Locator). The distinction between the two is a little vague, and they overlap in some respects. URLs identify a resource by its location, that is, by an address for accessing the resource. URNs identify a resource by a name that doesn't necessarily provide access to the resource but which must be unique and must always refer to the same resource, even if it moves or becomes obsolete. In this way, a URN could also be a URL, if the URL address was guaranteed to persist and always point to the same resource.
In practice, URLs are the most commonly used kind of URI, particularly for namespace identifiers. Once an organization has purchased a unique domain name, it can create namespace identifiers based on this name. By using URLs which it theoretically owns, the organization can control and manage namespace identifiers under this domain, ensuring no namespace conflicts.
The Resource Description Framework (RDF) is a model for representing resource metadata, that is, information about things. These "things" can be web pages, people, books, or anything else. The information could be file size, height, color, or any other property that something might have. RDF therefore consists of a number of statements about something:
- Notes from a Small Island has an ISBN of 0552996009
- Notes from a Small Island has an author of Bill Bryson
- Bill Bryson has a birth place of Iowa
Note that each statement (or triple) is constructed from three parts: the resource (Notes from a Small Island), the property name (ISBN), and the property value (0552996009). These statements could be represented in any XML vocabulary:
<book name="Notes from a Small Island">
  <ISBN>0552996009</ISBN>
  <author birthplace="Iowa">Bill Bryson</author>
</book>

<document identifier="0552996009">
  <title>Notes from a Small Island</title>
  <creator name="Bill Bryson">
    <born location="Iowa" />
  </creator>
</document>
These two examples, by demonstrating the versatility of XML, show part of the problem that RDF solves. If we rely on just XML, these statements can be represented in an unlimited number of vocabularies, each with its own schema and rules. An application that had to search a collection of 100 different XML files for books written by Bill Bryson would need to know the exact element or attribute to search for within each vocabulary, that is, the application would need prior knowledge of each vocabulary.
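The cost of that prior knowledge can be sketched in Python. The snippet below, which is only an illustration and uses the standard library's xml.etree.ElementTree, needs one hand-written extractor per vocabulary just to answer "who is the author?". The function names and the EXTRACTORS table are hypothetical.

```python
import xml.etree.ElementTree as ET

BOOK_VOCAB = """
<book name="Notes from a Small Island">
  <ISBN>0552996009</ISBN>
  <author birthplace="Iowa">Bill Bryson</author>
</book>
"""

DOCUMENT_VOCAB = """
<document identifier="0552996009">
  <title>Notes from a Small Island</title>
  <creator name="Bill Bryson">
    <born location="Iowa" />
  </creator>
</document>
"""

def author_from_book_vocab(root):
    # In the first vocabulary the author is the text of <author>.
    return root.findtext("author")

def author_from_document_vocab(root):
    # In the second vocabulary the author is an attribute of <creator>.
    return root.find("creator").get("name")

# One extractor per root element name: this table is the "prior knowledge".
EXTRACTORS = {
    "book": author_from_book_vocab,
    "document": author_from_document_vocab,
}

def find_author(xml_text):
    root = ET.fromstring(xml_text)
    extractor = EXTRACTORS.get(root.tag)
    if extractor is None:
        raise ValueError(f"unknown vocabulary: <{root.tag}>")
    return extractor(root)

authors = [find_author(doc) for doc in (BOOK_VOCAB, DOCUMENT_VOCAB)]
print(authors)
```

Every new vocabulary means another entry in the table, and a file in an unknown vocabulary cannot be searched at all.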
By introducing an overarching model, RDF provides a superior solution. The RDF model is enforced within the XML syntax by basically restricting the XML rules and by introducing a set of core elements and attributes. The real power of RDF comes from its use of namespaces and URIs. RDF vocabularies can be defined with RDF Schema. Within an RDF instance document, however, these vocabularies can be more easily mixed than in standard XML. Once a vocabulary has been created that defines an author property, as long as the vocabulary is assigned a unique namespace, the property can be used in any RDF file. RDF software, with knowledge of just a single vocabulary, can search all statements in all files for particular authors.
URIs provide the icing on the RDF cake. When an RDF vocabulary is defined, each element within it can be referenced by a URI, uniquely identifying it. Each element can also define any part of a triple, i.e. RDF can be used to create lists of resources (things you want to describe), properties, and property values. A triple can consist of nothing but URI references.
Unlike standard XML, RDF files commonly contain data that can be decomposed into a set of URIs. For example, the previous XML example could be represented in RDF as
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="urn:isbn:0552996009">
    <dc:title>Notes from a Small Island</dc:title>
    <dc:creator rdf:resource="http://authors.com/b/bbryson.rdf" />
  </rdf:Description>
</rdf:RDF>
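As a rough illustration of how this document breaks down into triples, the Python sketch below walks it with the standard library's xml.etree.ElementTree. It is not a real RDF parser: the hypothetical simple_triples function handles only the rdf:Description pattern used here and ignores the many other constructs that RDF/XML permits.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.0/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="urn:isbn:0552996009">
    <dc:title>Notes from a Small Island</dc:title>
    <dc:creator rdf:resource="http://authors.com/b/bbryson.rdf" />
  </rdf:Description>
</rdf:RDF>
"""

def simple_triples(rdf_xml):
    """Yield (resource, property, value) for simple rdf:Description blocks only."""
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree tags look like "{namespace}local-name"; joining the
            # two parts gives the URI of the property.
            namespace, local_name = prop.tag[1:].split("}")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # A property value is either a URI reference or literal text.
            value = resource if resource is not None else (prop.text or "").strip()
            yield subject, namespace + local_name, value

triples = list(simple_triples(RDF_XML))
for triple in triples:
    print(triple)
```

For anything beyond a toy document, a dedicated RDF toolkit is the safer choice.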
If we examine one of the triples that an RDF parser would give us, we find:
- Item we are describing: urn:isbn:0552996009
- Property of this item: http://purl.org/dc/elements/1.0/creator
- Value of this property: http://authors.com/b/bbryson.rdf
This has two immediate benefits:
- All aspects of the data can be unambiguously identified. For example, in a standard XML document, the author could be stated as "Bill Bryson", "Mr Bill Bryson" or "Bryson, Bill." Software searching for this author would need to be aware of the potential differences in representation and could also run into trouble if a second author existed with the same name. In RDF, assuming that all files use the same URI to reference the author, the author can always be uniquely identified.
- If these triple values are also RDF resources, the software can automatically find related information. For example, some other RDF document could contain information on the author's birthplace, which in turn may be represented by another URI. An RDF document concerning US cities could then be consulted, which might contain information such as longitude and latitude, average temperature, etc. So, by making a single statement in RDF, further and related information can potentially be automatically acquired.
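The second benefit amounts to following links from one triple to the next. The Python sketch below shows the idea on a small in-memory list of triples. Only the first triple comes from the RDF example above; the example.org property and place URIs are invented for illustration, and related_resources is a hypothetical helper, not part of any RDF library.

```python
# Triples as they might be gathered from several RDF documents.
# The example.org URIs below are assumptions made for this sketch.
TRIPLES = [
    ("urn:isbn:0552996009",
     "http://purl.org/dc/elements/1.0/creator",
     "http://authors.com/b/bbryson.rdf"),
    ("http://authors.com/b/bbryson.rdf",
     "http://example.org/terms/birthPlace",
     "http://example.org/places/iowa"),
    ("http://example.org/places/iowa",
     "http://example.org/terms/country",
     "http://example.org/places/usa"),
]

def related_resources(start, triples):
    """Return every resource reachable from `start`, in breadth-first order."""
    values_by_subject = {}
    for subject, _prop, value in triples:
        values_by_subject.setdefault(subject, []).append(value)

    found = []
    queue = [start]
    while queue:
        current = queue.pop(0)
        for value in values_by_subject.get(current, []):
            # Skip anything already visited so that cycles terminate.
            if value != start and value not in found:
                found.append(value)
                queue.append(value)
    return found

reachable = related_resources("urn:isbn:0552996009", TRIPLES)
print(reachable)
```

Starting from nothing but the book's URI, the traversal reaches the author and then the places connected to the author, which is the kind of automatic discovery described above.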
The real benefit of these explicit, ongoing relationships of information (the Semantic Web) is enormous. The near future will hopefully bring us software agents, which can query and use a whole range of RDF information for a given question (e.g. "Show me hardback books created by anyone who is related to a politician" or "Which servers can run Solaris and are available in the UK?").
If nothing else, if and when search engines such as Google start to make use of RDF and its power of relations, they will become extremely good at the Kevin Bacon Game. (See the XML.com article "RDF: Ready for Prime Time" for more information.)
XSLT (Extensible Stylesheet Language Transformations), together with XPath and XSL-FO (Extensible Stylesheet Language -- Formatting Objects), forms a set of technologies named XSL, which transform and format XML data.
XSLT in particular is powerful and commonly used. XSLT is an XML vocabulary that can completely transform XML data into another XML format or any other text format. (See the XML.com article "XML.com Style Resource Centre" for more information.)
XPath defines a non-XML syntax for referring to specific parts of an XML file. Using XPath you could find "the ISBN attribute inside the third book element" or "every book element in the file". When used within XSLT, XPath provides the means for extracting specific data from the input XML.
XPath also contains basic functionality for matching data (testing if an element has a specific value), as well as string and number manipulation.
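The queries quoted above can be tried in Python, whose standard xml.etree.ElementTree module implements a limited subset of XPath. This is a sketch under some assumptions: the <library> document is invented, two of its ISBN values are placeholders, and because the subset cannot select attributes directly, the attribute is read from the matched element instead.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the "placeholder" ISBN values are not real.
LIBRARY = """
<library>
  <book ISBN="placeholder-1"><title>First Title</title><author>Author A</author></book>
  <book ISBN="placeholder-2"><title>Second Title</title><author>Author B</author></book>
  <book ISBN="0552996009"><title>Notes from a Small Island</title><author>Bill Bryson</author></book>
</library>
"""

root = ET.fromstring(LIBRARY)

# "Every book element in the file"
all_books = root.findall(".//book")

# "The ISBN attribute inside the third book element"
# (full XPath would be book[3]/@ISBN; here the attribute is read in Python)
third_isbn = root.find("book[3]").get("ISBN")

# Matching data: books whose <author> child has a specific value
by_bryson = root.findall(".//book[author='Bill Bryson']")

print(len(all_books), third_isbn, by_bryson[0].findtext("title"))
```

A full XPath implementation, such as the one inside an XSLT processor, also offers the string and number functions that this subset leaves out.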
XSL-FO is an XML vocabulary for the layout and styling of content within a paginated document. XSL-FO data is usually created from an XSLT process; i.e. XML data will undergo an XSLT transformation, resulting in an XSL-FO file. The XSL-FO file can then be processed with a formatting application, to produce a document file such as a PDF.
SVG (Scalable Vector Graphics) is an XML vocabulary for creating two-dimensional vector graphics. SVG images can also be interactive, animated, and can include text and bitmapped images.
SVG images boast an array of benefits over similar formats, including all the benefits associated with XML (platform independence, flexibility, open nature) and accessibility features. (See the XML.com article "SVG: A Sure Bet" for more information.)
Creating the Diagram
Now let's make use of these technologies to create a diagram that illustrates their relationships. We'll first need to create some raw data that describes each technology in a computer-readable format. We know that XML is a good format for computer-readable data; since we are describing something, RDF can be used for the model.
RDF likes us to specify unique URIs, so we'll use the specification document URL for each technology we are describing. In terms of the information that we need to record, basic properties such as title, subject and relationships are required. Luckily, RDF vocabularies already exist for properties such as these, so we won't need to define our own with RDFS. We'll use the Dublin Core RDF vocabularies:
- Basic Dublin Core element set (RDF) for title, description, subject and date properties.
- Qualified Dublin Core element set (RDF) for specific relationship properties, such as "conforms to" and "has part".
By using conventional, standard RDF vocabularies for our data, other applications that understand Dublin Core RDF can reuse our data at a later date.
Take a look at our RDF data for the technologies. For each technology we've specified title, description, creation date, relationships, and occasionally a subject. The subject property allows us to semantically group some technologies under common concepts, even if there are no specific relationships between them. We'll make use of the subject data in the diagram.
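As a rough sketch of the kind of description involved, the Python below builds two Dublin Core statements with the standard library's xml.etree.ElementTree. It is an illustration rather than the data file used for the diagram: the describe helper is hypothetical, and the namespace URIs and specification URLs are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"   # assumed basic Dublin Core namespace
DCTERMS = "http://purl.org/dc/terms/"     # assumed qualified Dublin Core namespace

# Ask ElementTree to use readable prefixes when it serializes.
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)

def describe(about, title, conforms_to=()):
    """Build one rdf:Description with a title and optional 'conforms to' links."""
    description = ET.Element(f"{{{RDF}}}Description", {f"{{{RDF}}}about": about})
    ET.SubElement(description, f"{{{DC}}}title").text = title
    for target in conforms_to:
        ET.SubElement(description, f"{{{DCTERMS}}}conformsTo",
                      {f"{{{RDF}}}resource": target})
    return description

# Specification URLs serve as the unique URI for each technology.
XML_SPEC = "http://www.w3.org/TR/REC-xml"
XSLT_SPEC = "http://www.w3.org/TR/xslt"

root = ET.Element(f"{{{RDF}}}RDF")
root.append(describe(XML_SPEC, "XML"))
root.append(describe(XSLT_SPEC, "XSLT", conforms_to=[XML_SPEC]))

rdf_xml = ET.tostring(root, encoding="unicode")
print(rdf_xml)
```

Because the relationship value is itself the URI of another description, software can follow it from XSLT to XML without any extra bookkeeping.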
As you may have guessed, we're going to use SVG for the diagram -- thus, the next step is to convert our RDF data into a visual SVG representation.
We'll use XSLT and XPath to transform our RDF data into SVG objects (squares, lines, and text). The specifics of the XSLT aren't important (you can look at the XSLT that converts the RDF into SVG if you're curious). What is important is to recognize how and why we've used it. The XSLT contains a series of logical steps that convert our input RDF data into a completely different visualization of the same data. Note that this XSLT has been designed specifically for our input data. Given time and proper planning, you could develop XSLT templates that transform any set of RDF statements into similar visualizations. It is, however, probably better to use an RDF toolkit: writing XSLT that parses all of the permitted RDF constructs you might find in the wild can be extremely complicated.
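The shape of that transformation can be sketched outside XSLT as well. The Python below is a stand-in for the stylesheet, not a translation of it: it takes a few titles and "conforms to" relationships (hypothetical input, hard-coded here instead of being read from RDF) and emits the same three kinds of SVG object. The zigzag layout is deliberately naive.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize SVG as the default namespace

# Hypothetical input data for the sketch.
TITLES = ["XML", "XSLT", "SVG"]
CONFORMS_TO = [("XSLT", "XML"), ("SVG", "XML")]

BOX_W, BOX_H, GAP = 100, 40, 40

def svg(tag):
    return f"{{{SVG_NS}}}{tag}"

def box_corner(index):
    """Top-left corner of box `index`: left to right, alternating rows."""
    x = GAP + index * (BOX_W + GAP)
    y = GAP if index % 2 == 0 else GAP + BOX_H + GAP
    return x, y

def build_diagram(titles, relationships):
    corners = {title: box_corner(i) for i, title in enumerate(titles)}
    root = ET.Element(svg("svg"), {
        "width": str(GAP + len(titles) * (BOX_W + GAP)),
        "height": str(3 * GAP + 2 * BOX_H),
    })
    # Lines first, so that the boxes are painted over the line ends.
    for source, target in relationships:
        (x1, y1), (x2, y2) = corners[source], corners[target]
        ET.SubElement(root, svg("line"), {
            "x1": str(x1 + BOX_W // 2), "y1": str(y1 + BOX_H // 2),
            "x2": str(x2 + BOX_W // 2), "y2": str(y2 + BOX_H // 2),
            "stroke": "black",
        })
    # One square and one text label per technology.
    for title, (x, y) in corners.items():
        ET.SubElement(root, svg("rect"), {
            "x": str(x), "y": str(y),
            "width": str(BOX_W), "height": str(BOX_H),
            "fill": "white", "stroke": "black",
        })
        label = ET.SubElement(root, svg("text"), {
            "x": str(x + BOX_W // 2), "y": str(y + BOX_H // 2),
            "text-anchor": "middle", "dominant-baseline": "middle",
        })
        label.text = title
    return root

diagram = build_diagram(TITLES, CONFORMS_TO)
svg_text = ET.tostring(diagram, encoding="unicode")
print(svg_text)
```

As with the XSLT, the code is tied to one simple input shape; the data and its visualization remain separate, so the same statements could be drawn in an entirely different way.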
Figure: The SVG diagram, output from the XSLT process.
SVG currently has limited support in web browsers and image viewers; as a final step we'll embed the diagram into a PDF document to make it available to a larger audience. An additional XSLT file is used to create the XSL-FO for our document, defining and structuring the page and its contents. Within the XSL-FO, the original XSLT for the SVG diagram is called, embedding the SVG code within the XSL-FO page.
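The structure of such an XSL-FO page can be sketched in Python with the standard library's xml.etree.ElementTree. This is an assumption-laden outline, not the article's stylesheet: the wrap_svg_in_fo helper, the master name "diagram-page", the page size, and the placeholder SVG are all invented, while the fo: element names follow the XSL-FO vocabulary, where fo:instream-foreign-object is the container for embedded non-FO content such as SVG.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("fo", FO_NS)
ET.register_namespace("svg", SVG_NS)

def fo(tag):
    return f"{{{FO_NS}}}{tag}"

def wrap_svg_in_fo(svg_element, heading):
    """Build a one-page XSL-FO document containing a heading and the SVG."""
    root = ET.Element(fo("root"))

    # Page definition (hypothetical name and size).
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "diagram-page",
        "page-width": "297mm",
        "page-height": "210mm",
    })
    ET.SubElement(page, fo("region-body"))

    # Page contents.
    sequence = ET.SubElement(root, fo("page-sequence"),
                             {"master-reference": "diagram-page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    ET.SubElement(flow, fo("block"), {"font-size": "18pt"}).text = heading

    # The SVG markup is embedded directly inside the FO tree.
    figure = ET.SubElement(flow, fo("block"))
    holder = ET.SubElement(figure, fo("instream-foreign-object"))
    holder.append(svg_element)
    return root

# Placeholder SVG standing in for the generated diagram.
placeholder = ET.Element(f"{{{SVG_NS}}}svg", {"width": "100", "height": "40"})
ET.SubElement(placeholder, f"{{{SVG_NS}}}rect",
              {"width": "100", "height": "40", "fill": "none", "stroke": "black"})

fo_document = wrap_svg_in_fo(placeholder, "The XML Family")
fo_text = ET.tostring(fo_document, encoding="unicode")
print(fo_text)
```

The resulting text is what an XSL-FO processor would take as input; the two namespaces keep the page layout elements and the drawing elements distinct within a single file.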
The output (XSL-FO plus embedded SVG) is then processed with an XSL-FO processor. We'll use Apache FOP. FOP converts our plain text XSL-FO into a PDF file, rendering the SVG as a diagram on the page (using Apache Batik). We finally have our PDF diagram of the XML technologies.
To recap, we used terms from the Dublin Core RDF Schema (which makes use of XML, Namespaces and URIs) to create an RDF description for each technology. These were converted to SVG using XSLT and XPath. We could have validated our XSLT file with XML Schema (using the XSLT schema). The SVG diagram was finally embedded into an Adobe PDF document with XSL-FO, resulting in a printable file that contains a diagram of the technology relationships.
When XML and RDF data become ubiquitous on the Web, the potential for querying and displaying the information will be enormous. The tools and underlying technologies are already in place. All that's needed is a greater understanding of the potential that they offer. The growth of these technologies is limited largely by our reluctance to commit.