XML.com: XML From the Inside Out

XML Linking Technologies

October 04, 2000

Defining relationships between nodes of a tree is always an involved topic. During the 1990s we saw the success of relational databases, tabular data being an extreme solution to defining these relationships. In bringing hierarchical structures back to center stage, XML has revived the linking problem and presents multiple ways to solve it.

In this article, we explore some of the ways to express links, focusing on links between nodes in a single document. Using the example of book cataloging, for each technology we will show how the example can be represented, explain how this representation can be expanded using XSLT into the simplest model, see how the document could be validated using XML Schemas, and weigh the benefits of each approach against its complexity.

We take XSLT transformation as the typical XML processing scenario, using it to gauge the complexity of each linking scenario. Please note that the XML Schemas and XPointer technologies used below are still works in progress and lack production-quality implementations.

Containment

The first and most natural way to express a relationship between nodes is to use the containment of XML tree nodes and to model the structure of the XML document after the structure you wish to represent.

For example, to describe my library, I can define a containment-based structure.

<library>
    <book>
        <isbn>0836217462</isbn>
        <title>Being a Dog Is a Full-Time Job</title>
        <author>
            <name>Charles M. Schulz</name>
            <nickName>SPARKY</nickName>
            <born>November 26, 1922</born>
            <dead>February 12, 2000</dead>
        </author>
        <character>
            <name>Peppermint Patty</name>
            <since>Aug. 22, 1966</since>
            <qualification>bold, brash and tomboyish</qualification>
        </character>
        <character>
            <name>Snoopy</name>
            <since>October 4, 1950</since>
            <qualification>extroverted beagle</qualification>
        </character>
.../...
    </book>
</library>

(library1.xml)

This structure is easy to query if you want to retrieve all the information related to a book, since you'll find it in the child nodes of the book node.

Unfortunately, it isn't scalable and becomes messily redundant as soon as we add a second book into our library.

<library>
    <book>
        <isbn>0836217462</isbn>
        <title>Being a Dog Is a Full-Time Job</title>
        <author>
            <name>Charles M. Schulz</name>
            <nickName>SPARKY</nickName>
            <born>November 26, 1922</born>
            <dead>February 12, 2000</dead>
        </author>
        <character>
            <name>Peppermint Patty</name>
            <since>Aug. 22, 1966</since>
            <qualification>bold, brash and tomboyish</qualification>
        </character>
        <character>
            <name>Snoopy</name>
            <since>October 4, 1950</since>
            <qualification>extroverted beagle</qualification>
        </character>
.../...
    </book>
    <book>
        <isbn>0805033106</isbn>
        <title>Peanuts Every Sunday </title>
        <author>
            <name>Charles M. Schulz</name>
            <nickName>SPARKY</nickName>
            <born>November 26, 1922</born>
            <dead>February 12, 2000</dead>
        </author>
        <character>
            <name>Sally Brown</name>
            <since>Aug, 22, 1960</since>
            <qualification>always looks for the easy way out</qualification>
        </character>
        <character>
            <name>Snoopy</name>
            <since>October 4, 1950</since>
            <qualification>extroverted beagle</qualification>
        </character>
.../...
    </book>
</library>

(library2.xml)

The information about the author and some of the characters needs to be duplicated to fit this simple model, which would thus rapidly grow into an unwieldy, unmaintainable document. The duplication also complicates some queries, like locating the books written by a particular author or the books in which a particular character appears.

Handling Links in Applications

What we've seen is that characters and authors need to be separate objects, stored under different node trees, and linked to and from the book. XML is flexible enough that we don't need any external standard to define these links; we can just flatten our structure.

<library>
    <book>
        <isbn>0836217462</isbn>
        <title>Being a Dog Is a Full-Time Job</title>
        <author>Charles M. Schulz</author>
        <character>Peppermint Patty</character>
        <character>Snoopy</character>
        <character>Schroeder</character>
        <character>Lucy</character>
    </book>
    <book>
        <isbn>0805033106</isbn>
        <title>Peanuts Every Sunday </title>
        <author>Charles M. Schulz</author>
        <character>Sally Brown</character>
        <character>Snoopy</character>
        <character>Linus</character>
        <character>Lucy</character>
    </book>   
    <author>
        <name>Charles M. Schulz</name>
        <nickName>SPARKY</nickName>
        <born>November 26, 1922</born>
        <dead>February 12, 2000</dead>
    </author>

    <character>
        <name>Snoopy</name>
        <since>October 4, 1950</since>
        <qualification>extroverted beagle</qualification>
    </character>
    <character>
        <name>Sally Brown</name>
        <since>Aug, 22, 1960</since>
        <qualification>always looks for the easy way out</qualification>
    </character>
    <character>
        <name>Linus</name>
        <since>Sept. 19, 1952</since>
        <qualification>the intellectual of the gang</qualification>
    </character>
.../...
</library>

(library3.xml)

We have replaced the author and character elements in the book element with references to the top-level author and character elements, using application-specific identifiers (in this case, the contents of the name elements). While we gain scalability, the downside of this approach is that the applications processing this format need to know how to handle these links.

The most common way to implement this kind of structure is to embed the knowledge of the new XML layout within the application design. We can also try to avoid hard-coding the link processing into an application, either by transforming a document before handing it to the application or by providing hints sufficient to manage the links.

We can help our applications by transforming the new format to the original, expanded one. The XSLT template that replaces a reference to an author or a character with its full node structure takes exactly two instructions:

<xsl:template match="author">
    <xsl:variable name="name" select="normalize-space()"/>
    <xsl:copy-of select="/library/author[normalize-space(name)=$name]"/>
</xsl:template>

(expand3.xsl)

XSLT can be used to present the application with a logical structure based on containment, while the actual serialized structure of the source document may be different.
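For applications that do not go through XSLT, the same expansion can be approximated in a few lines of general-purpose code. The following is only a sketch in Python using the standard xml.etree.ElementTree module; the sample document is a trimmed-down stand-in for library3.xml, and the helper names normalize and expand_by_name are ours, not part of any library. Unlike the template above, it handles characters as well as authors and leaves unresolved references untouched.

```python
import copy
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for library3.xml (assumption: same layout as above).
LIBRARY3 = """
<library>
    <book>
        <isbn>0836217462</isbn>
        <title>Being a Dog Is a Full-Time Job</title>
        <author>Charles M. Schulz</author>
        <character>Snoopy</character>
    </book>
    <author>
        <name>Charles M. Schulz</name>
        <nickName>SPARKY</nickName>
    </author>
    <character>
        <name>Snoopy</name>
        <since>October 4, 1950</since>
    </character>
</library>
"""


def normalize(text):
    """Rough equivalent of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join((text or "").split())


def expand_by_name(library, tags=("author", "character")):
    """Replace name-only references inside each book with a copy of the full record."""
    # Index the top-level records by (element name, normalized name value).
    records = {}
    for tag in tags:
        for record in library.findall(tag):
            records[(tag, normalize(record.findtext("name")))] = record

    for book in library.findall("book"):
        for index, child in enumerate(list(book)):
            target = records.get((child.tag, normalize(child.text)))
            if target is not None:
                replacement = copy.deepcopy(target)
                replacement.tail = child.tail  # keep the original indentation
                book[index] = replacement
    return library


library = expand_by_name(ET.fromstring(LIBRARY3))
print(library.findtext("book/author/nickName"))
```

The point of the sketch is the cost rather than the code itself: every application that reads the flattened format has to carry some version of this lookup, which is exactly the knowledge we would like to keep out of applications.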

XML Schemas

We can also try providing the application with enough information about the data structure for it to handle the links by itself. XML Schemas look like the perfect way of doing this, using the key and keyref declarations.

   <xsd:key name="authorKey">
        <xsd:selector xpath="author"/>
        <xsd:field xpath="name"/>
    </xsd:key>
    <xsd:keyref refer="authorKey" name="book2author">
        <xsd:selector xpath="book/author"/>
        <xsd:field xpath="."/>
    </xsd:keyref>

(library3.xsd)

Placed within the declaration of the library element, these constraints state that the author's name is a key (a unique identifier) and that this key is referenced by a book's author element.

At the time of writing, the W3C online validator is the only tool I have found which accepts this XML Schema. Not all is lost, though, as the primary benefit of taking the pain to write a schema should be that you can now validate the document.

That said, the more ambitious goal of XML Schemas is to provide a means for defining the structure, content and semantics of XML documents. I wouldn't be surprised to see more projects and products producing all kinds of interesting things using XML Schemas -- things like classes, bindings, transformations, or RDBMS mappings.

We can anticipate, then, that Schema-aware tools will be able to supply applications with the necessary information to facilitate the processing of such links.
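Until such tools exist, it may help to see what the two constraints amount to when checked by hand. The sketch below, in Python with the standard xml.etree.ElementTree module, mimics authorKey and book2author on a small document laid out like library3.xml. It is not a schema validator: the function name check_author_keys is ours, the second book is a made-up entry included only to trigger an error, and values are compared as raw strings, which is an assumption about how the name field is typed.

```python
import xml.etree.ElementTree as ET

# Small document in the library3.xml layout. The second book is hypothetical
# and exists only to show what a dangling reference looks like.
LIBRARY3 = """
<library>
    <book>
        <title>Being a Dog Is a Full-Time Job</title>
        <author>Charles M. Schulz</author>
    </book>
    <book>
        <title>A Book With a Dangling Reference</title>
        <author>Unknown Author</author>
    </book>
    <author>
        <name>Charles M. Schulz</name>
    </author>
</library>
"""


def check_author_keys(library):
    """Return a list of problems, mimicking the authorKey and book2author constraints."""
    problems = []

    # Like xsd:key: every library/author must have a name, and names must be unique.
    keys = set()
    for author in library.findall("author"):
        name = author.findtext("name")
        if name is None:
            problems.append("author without a name")
        elif name in keys:
            problems.append(f"duplicate author key: {name!r}")
        else:
            keys.add(name)

    # Like xsd:keyref: every library/book/author must match an existing key.
    for ref in library.findall("book/author"):
        if (ref.text or "") not in keys:
            problems.append(f"unresolved author reference: {ref.text!r}")
    return problems


print(check_author_keys(ET.fromstring(LIBRARY3)))
```

The selector and field of each declaration map directly onto the two findall paths, which is why a schema-aware tool could, in principle, derive this kind of check (or the lookup itself) from the schema alone.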

ID and IDREF

In the third revision of our book catalog document, we used natural identifiers (the names of the authors or characters), and the matching was done at the application level by comparing the values.

XML provides another way to identify nodes in a document through the ID and IDREF datatypes implemented in DTDs (also implemented by XML Schemas). Using these identifiers introduces constraints, though -- the values need to be carried by attributes, they must be valid XML names, and they must be declared in the DTD.

<!DOCTYPE library [
<!ATTLIST author id ID #IMPLIED>
<!ATTLIST book id ID #IMPLIED>
<!ATTLIST character id ID #IMPLIED>
]>

(library4.xml)

The identifiers need to be XML names and so cannot begin with a digit. Furthermore, they are global to a document and must identify a node independently of its context. To avoid conflicts between identifiers used for different elements, we can prefix each identifier with the name of the element it identifies.

   <book id="book_0836217462">
.../...
    <author id="author_Charles-M.-Schulz">
.../...
    <character id="character_Snoopy">

(library4.xml)
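The naming convention used here (element name, underscore, then the natural key with spaces turned into hyphens) is easy to automate. The Python sketch below is one way to do it; the function names make_id and is_ascii_name are ours, and the check is deliberately an ASCII-only approximation of what an ID value may contain, not the full rule from the XML specification, which allows many more characters.

```python
import re

# ASCII-only approximation of a legal ID value (assumption: the real rule in
# the XML specification is broader and also covers non-ASCII letters).
_ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ascii_name(value):
    """True if value starts with a letter or underscore and contains no spaces."""
    return _ASCII_NAME.fullmatch(value) is not None


def make_id(element_name, natural_key):
    """Build an identifier such as author_Charles-M.-Schulz from a natural key."""
    candidate = f"{element_name}_{'-'.join(natural_key.split())}"
    if not is_ascii_name(candidate):
        raise ValueError(f"cannot build a safe identifier from {natural_key!r}")
    return candidate


print(make_id("author", "Charles M. Schulz"))
print(make_id("book", "0836217462"))
print(is_ascii_name("0836217462"))
```

The prefix does double duty: it keeps identifiers for different element types from colliding, and it lets an ISBN, which starts with a digit, be used as part of a legal identifier.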

The references to these elements can be done with attributes.

   <book id="book_0805033106">
        <isbn>0805033106</isbn>
        <title>Peanuts Every Sunday </title>
        <author href="author_Charles-M.-Schulz"/>
        <character href="character_Sally-Brown"/>
        <character href="character_Snoopy"/>
        <character href="character_Linus"/>
        <character href="character_Lucy"/>
    </book>

(library4.xml)

The transformation needed to expand such references is even simpler than in the previous example, and it can be generalized under the assumption that we want to expand all the elements having an href attribute.

<xsl:template match="*[@href]">
    <xsl:copy-of select="id(@href)"/>
</xsl:template>

(expand4.xsl)
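The id() function only works because the XSLT processor has read the ID declarations from the DTD. Tools that do not read the DTD have to rebuild that index themselves. As an illustration, here is a Python sketch using the standard xml.etree.ElementTree module, which does not expose ID declarations; it simply assumes that every attribute named id is an identifier. The sample document is a shortened stand-in for library4.xml and the function name expand_hrefs is ours.

```python
import copy
import xml.etree.ElementTree as ET

# Shortened stand-in for library4.xml, without its DTD.
LIBRARY4 = """
<library>
    <book id="book_0805033106">
        <isbn>0805033106</isbn>
        <author href="author_Charles-M.-Schulz"/>
        <character href="character_Snoopy"/>
    </book>
    <author id="author_Charles-M.-Schulz">
        <name>Charles M. Schulz</name>
    </author>
    <character id="character_Snoopy">
        <name>Snoopy</name>
    </character>
</library>
"""


def expand_hrefs(root):
    """Replace every element carrying an href with a copy of the element it points to."""
    # Stand-in for id(): assume any attribute named "id" is an identifier.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

    # Snapshot the tree first, because children are replaced while walking it.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            target = by_id.get(child.get("href"))
            if target is not None:  # unresolved references are left untouched
                replacement = copy.deepcopy(target)
                replacement.tail = child.tail  # keep the original indentation
                parent[index] = replacement
    return root


root = expand_hrefs(ET.fromstring(LIBRARY4))
print(root.findtext("book/author/name"))
```

Because identifiers are global to the document, a single dictionary is enough, and the code never needs to know whether it is expanding an author or a character. That is the same generality the match="*[@href]" pattern gives the XSLT version.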

This scheme reflects the SGML legacy of XML, and of the implementations discussed in this article it is the only one where complete document validation -- including checking the uniqueness of the identifiers and the validity of the references -- can be done using a DTD.

This is possible in a similar fashion within an XML Schema, where the declaration can be written as

  <xsd:attribute name="id" type="xsd:ID"/>

(library4.xsd)

and the references as:

   <xsd:attribute name="href" type="xsd:IDREF"/>

(library4.xsd)

Logical or Physical Links?

The difference between these first two linking schemes is that in the first case we defined links that were implemented logically, that is, by comparing the values of abstract properties (the name of an author or a character, the ISBN of a book), while in the second case we use identifiers that are physically located in our documents.

We will encounter this distinction repeatedly on our tour across linking technologies. The distinction between a physical link pointing to a node location and a logical link relying on the matching of values or combination of values is fundamental.

RDF

Let's see how simple an RDF variant of our book catalog document can be. First we have to replace the root element with the mandatory rdf:RDF root element and then define a namespace for our vocabulary.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://my.library/rdf/syntax#">

(library5.rdf)

Next we use an rdf:about attribute to identify the element we are describing, using URIs as unique identifiers.

   <book rdf:about="http://my.library/book/0836217462">
or
   <character rdf:about="http://my.library/character/Snoopy">

(library5.rdf)

The third and final step is to use rdf:resource to identify the elements to which we want to link.

   <book rdf:about="http://my.library/book/0836217462">
        <isbn>0836217462</isbn>
        <title>Being a Dog Is a Full-Time Job</title>
        <author rdf:resource="http://my.library/author/Charles-M.-Schulz"/>
        <character rdf:resource="http://my.library/character/Peppermint-Patty"/>
        <character rdf:resource="http://my.library/character/Charlie-Brown"/>
        <character rdf:resource="http://my.library/character/Snoopy"/>
        <character rdf:resource="http://my.library/character/Schroeder"/>
        <character rdf:resource="http://my.library/character/Lucy"/>
    </book>

(library5.rdf)

That's all that's required to convert to a minimal but conforming RDF document with a syntax that is similar to the original.

We have added a second level of abstraction on top of our document, but we can still use it as an ordinary XML document. It can still be expanded using two instructions in an XSLT template.

<xsl:template match="lib:author">
    <xsl:variable name="resource" select="@rdf:resource"/>
    <xsl:copy-of select="/rdf:RDF/lib:author[@rdf:about=$resource]"/>
</xsl:template>

(expand5.xsl)

And you can still define an XML Schema on top of it.

The schema, which we have split into two files (library5-rdf.xsd for the rdf namespace and library5.xsd for our vocabulary), can still contain the key and keyref constraints.

   <xsd:key name="characterKey">
        <xsd:selector xpath="character"/>
        <xsd:field xpath="@rdf:about"/>
    </xsd:key>

    <xsd:keyref refer="characterKey" name="book2character">
        <xsd:selector xpath="book/character"/>
        <xsd:field xpath="@rdf:resource"/>
    </xsd:keyref>

(library5-rdf.xsd)

While we still have an ordinary and (relatively) simple XML file, which works with our preferred XML tools, we have also evolved this document into the basis of a semantic layer. Since this document is now a valid RDF document, it can be read by RDF tools, which will see it as a set of statements (called triples) that can be inserted into databases and used for logic programming applications.

The triples, which can be visualized using the W3C online service, are basic assertions that look like

triple('http://my.library/rdf/syntax#character',
       'http://my.library/book/0836217462',
       'http://my.library/character/Schroeder').

(lib5-triples.html)

This triple indicates that the book http://my.library/book/0836217462 is linked to http://my.library/character/Schroeder through a relation of type http://my.library/rdf/syntax#character.

In addition to the availability of the links, all the information in the document is available as triples for RDF applications.

triple('http://my.library/rdf/syntax#qualification',
       'http://my.library/character/Lucy',
       'bossy, crabby and selfish').

(lib5-triples.html)
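To make the mapping from elements to triples concrete, here is a deliberately naive Python sketch that derives triples of this shape from a document laid out like library5.rdf. It is not an RDF parser: it only understands top-level elements carrying rdf:about whose children are either rdf:resource references or plain text, and it ignores everything else RDF tools would handle, including any statements they derive from the element names themselves. The function name to_triples and the shortened sample document are ours.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ABOUT = f"{{{RDF_NS}}}about"        # ElementTree spells qualified names as {uri}local
RESOURCE = f"{{{RDF_NS}}}resource"

# Shortened stand-in for library5.rdf.
LIBRARY5 = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://my.library/rdf/syntax#">
    <book rdf:about="http://my.library/book/0836217462">
        <title>Being a Dog Is a Full-Time Job</title>
        <character rdf:resource="http://my.library/character/Schroeder"/>
    </book>
    <character rdf:about="http://my.library/character/Lucy">
        <qualification>bossy, crabby and selfish</qualification>
    </character>
</rdf:RDF>
"""


def to_triples(root):
    """Return (predicate, subject, object) tuples, in the order used in the listings above."""
    triples = []
    for description in root:
        subject = description.get(ABOUT)
        if subject is None:
            continue  # this sketch only handles elements identified by rdf:about
        for prop in description:
            # "{namespace}local" becomes the predicate URI "namespacelocal".
            predicate = prop.tag[1:].replace("}", "", 1)
            resource = prop.get(RESOURCE)
            value = resource if resource is not None else (prop.text or "").strip()
            triples.append((predicate, subject, value))
    return triples


for triple in to_triples(ET.fromstring(LIBRARY5)):
    print(triple)
```

Note that the link to Schroeder and the literal qualification of Lucy come out in exactly the same form. From the RDF point of view, a link is just one more statement about a resource.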

One should note that the two layers (XML and RDF) are not equivalent. RDF is an abstract layer on top of XML. It doesn't care about the syntax that has been used to describe the triples, nor does it care that these triples can't be serialized back in the same XML format (since the structure of the input document has been lost).

Building RDF on top of XML results in the same loss of low-level detail as in building XML on top of raw text (which I have described in "Processing Inclusions with XSLT"). It could be interesting to define an RDF representation of an XML document (i.e., preserving its exact structure), just as XML representations of text documents can be defined. This is also one of the reasons why, though most existing RDF tools can read RDF documents to extract and work on triples, they do not provide a programmatic way to write triples back as RDF.

RDF Tradeoffs

The tradeoff between added complexity (which can be minimal, as we've seen) and the potential applications (available or in development) makes using RDF to define the links between elements in our applications an interesting option.

One of the criticisms of RDF is that it's a disruptive technology: in order to be RDF compliant, we had to change the root element of our document. If we'd wanted to carry more information, such as the order of the links between a book and the characters, we would have had to wrap the links in RDF sequences, which would have further modified the document's structure. Although this disruption can be acceptable when defining a new vocabulary, it may be a limitation when using or extending existing standards.
