XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

Diagramming the XML Family

October 08, 2003

In this article we'll introduce some of the XML family members and discuss how they relate to one another. We'll then use these technologies to create a diagram of their relationships in order to demonstrate how they work together in practice. Of the hundreds of XML technologies in use, we'll limit the scope of this article to the technologies used in the creation of the diagram.

XML

XML (eXtensible Markup Language) consists of a small set of rules which define a structured, text-based syntax for representing data. It isn't a language as such; rather it is a meta-language, a common syntax that can be shared across diverse standards and data models. But if XML doesn't really do anything, why have so many languages adopted the XML syntax? At a basic level, XML is

  • Simple -- easy to learn and use, especially for users familiar with HTML.
  • Flexible -- can be used in many situations, from graphics, to communication protocols, to raw data.
  • Open -- no licensing or pricing restrictions, vendor, or platform lock-ins.

W3C XML Schema

W3C XML Schema allows us to create vocabularies with XML by adding further restrictions to the core XML rules. These restrictions mainly consist of valid names for elements and attributes; which elements can be used inside another; valid repetition of elements; and the type of data for elements and attributes.

Schemas allow us to publish and share these rules for new vocabularies and check the validity of any files which claim adherence to a particular vocabulary. Unlike Document Type Definitions, XML Schema uses the XML syntax, allowing us to parse and query XML Schema files using standard XML tools. (See the XML.com article "Using W3C XML Schema" for more detail.)

Namespaces

Given the ability to create different vocabularies in XML, together with a feature that allows the combination of vocabulary elements into a single file, we are presented with the name collision problem.

For example, an XML schema for books could define a <table> element that defines a table of contents. A second schema for furniture could also define an element named <table>. If data from both were to occupy the same XML file, it would be impossible to differentiate which <table> was which.

By defining a unique identifier, a namespace, for each XML vocabulary, we can group the elements under each identifier, so that XML software can identify each vocabulary element being used. (See the XML.com article "XML Namespaces by Example" for more information.)

URIs

Namespaces create another problem. It's fine to suggest that each XML vocabulary should be assigned a unique identifier, but how can you ensure that the unique identifier you choose hasn't already been used? Theoretically, there are no assurances that a namespace identifier hasn't been used by another vocabulary. However, by using a URI (Uniform Resource Identifier) you can greatly reduce the chance of a namespace collision.

A URI can be one of two types: a URN (Uniform Resource Name) or a URL (Uniform Resource Locator). The distinction between the two is a little vague and overlaps in some respects. URLs identify a resource by its location or by an address for accessing the resource. URNs identify a resource by an address that doesn't necessarily access the resource but which must be unique and must always refer to the same resource, even if it moves or becomes obsolete. In this way, a URN could also be a URL, if the URL address was guaranteed to persist and always point to the same resource.

In practice, URLs are the most commonly used kind of URI, particularly for namespace identifiers. Once an organization has purchased a unique domain name, it can create namespace identifiers based on this name. By using URLs which it theoretically owns, the organization can control and manage namespace identifiers under this domain, ensuring no namespace conflicts.

RDF

The Resource Description Framework (RDF) is a model for representing resource metadata, that is, information about things. These "things" can be web pages, people, books, or anything else. The information could be file size, height, color, or any other property that something might have. RDF therefore consists of a number of statements about something:

  • Notes from a Small Island has an ISBN of 0552996009
  • Notes from a Small Island has an author of Bill Bryson
  • Bill Bryson has a birth place of Iowa

Note that each statement (or triple) is constructed from three parts: the resource (Notes from a Small Island), the property name (ISBN), and the property value (0552996009). These statements could be represented in any XML vocabulary:

<book name="Notes from a small island">
    <ISBN>0552996009</ISBN>
    <author birthplace="Iowa">Bill Bryson</author>
</book>

or

<document identifier="0552996009">
    <title>Notes from a small island</title>
    <creator name="Bill Bryson">
        <born location="Iowa" />
    </creator>
</document>

These two examples, by demonstrating the versatility of XML, show part of the problem that RDF solves. If we rely on just XML, these statements can be represented in an unlimited number of vocabularies, each with its own schema and rules. An application that had to search a collection of 100 different XML files for books written by Bill Bryson would need to know the exact element or attribute to search for within each vocabulary, that is, the application would need prior knowledge of each vocabulary.

By introducing an overarching model, RDF provides a superior solution. The RDF model is enforced within the XML syntax by basically restricting the XML rules and by introducing a set of core elements and attributes. The real power of RDF comes from its use of namespaces and URIs. RDF vocabularies can be defined with RDF Schema. Within an RDF instance document, however, these vocabularies can be more easily mixed than in standard XML. Once a vocabulary has been created that defines an author property, as long as the vocabulary is assigned a unique namespace, the property can be used in any RDF file. RDF software, with knowledge of just a single RDF vocabulary, can search all statements in all files for particular authors.

URIs provide the icing on the RDF cake. When an RDF vocabulary is defined, each element within it can be referenced by a URI, uniquely identifying it. Each element can also define any part of a triple, i.e. RDF can be used to create lists of resources (things you want to describe), properties, and property values. A triple can consist of nothing but URI references.

Unlike standard XML, RDF files commonly contain data that can be decomposed into a set of URIs. For example, the previous XML example could be represented in RDF as

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.0/"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

    <rdf:Description rdf:about="urn:isbn:0552996009">
        <dc:title>Notes from a Small Island</dc:title>
        <dc:creator rdf:resource="http://authors.com/b/bbryson.rdf" />
    </rdf:Description>

</rdf:RDF>

If we examine one of the triples that an RDF parser would give us, we find:

  • Item we are describing: urn:isbn:0552996009
  • Property of this item: http://purl.org/dc/elements/1.0/creator
  • Value of this property: http://authors.com/b/bbryson.rdf

This has two immediate benefits:

  • All aspects of the data can be unambiguously identified. For example, in a standard XML document, the author could be stated as "Bill Bryson", "Mr Bill Bryson" or "Bryson, Bill." Software searching for this author would need to be aware of the potential differences in representation and could also run into trouble if a second author existed with the same name. In RDF, assuming that all files use the same URI to reference the author, the author can always be uniquely identified.
  • If these triple values are also RDF resources, the software can automatically find related information. For example, some other RDF document could contain information on the author's birthplace, which in turn may be represented by another URI. An RDF document concerning US cities could then be consulted, which might contain information such as longitude and latitude, average temperature, etc. So, by making a single statement in RDF, further and related information can potentially be automatically acquired.

The real benefit of these explicit, ongoing relationships of information (the Semantic Web) is enormous. The near future will hopefully bring us software agents, which can query and use a whole range of RDF information for a given question (e.g. "Show me hardback books created by anyone who is related to a politician" or "Which servers can run Solaris and are available in the UK?").

If nothing else, if and when engines such as Google starts to make use of RDF and its power of relations, it will become extremely good at The Kevin Bacon Game. (See the XML.com article "RDF: Ready for Prime Time" for more information.)

Pages: 1, 2

Next Pagearrow







close