Going Native: Making the Case for XML Databases

March 30, 2005

Ronald Bourret

When native XML databases appeared on the heels of the XML 1.0 recommendation, most people weren't sure what to make of them. Were they a replacement for relational databases or a return to hierarchical databases? No, said the vendors. They were designed to manage large numbers of XML documents. But since many XML documents could be mapped to relational tables, the only obvious targets were document-centric XML, which lacked structure, and catalogs, which had deep hierarchies. Beyond that, it wasn't clear what they could be used for.

Got XML?

So what are the use cases for native XML databases? As John Merrells, one of the developers of Sleepycat Software's Berkeley DB XML, waggishly put it, there is only one use case, and that is simply, "Got XML?" In other words, if you have more than a handful of XML documents that you need to store, you should store them in a native XML database. The reasons are the same as for storing data in any other database; ease of management, enhanced query performance, concurrent access, transactional safety, security, and so on.

Other vendors are more specific: native XML databases are used in a wide number of fields, from genetics to health care to insurance, and technologies, from data integration to messaging to Web sites. This article describes the most common use cases, and is based on discussions with roughly half of the native XML database vendors, as well as a handful of customers. It does not attempt to determine the size of the current native XML database market — a task left to the market analysts — nor does it attempt to describe the capabilities of individual native XML databases, which vary significantly.

While most real-world uses of native XML databases do not fit cleanly into any single category, it is possible to characterize them in terms of a limited number of use cases. The most popular of these are storing and querying document-centric XML, integrating data, and storing and querying semi-structured data. Native XML databases are used in these cases because the data involved does not easily fit the relational data model, while it does fit the XML data model.

Other common use cases for native XML databases include managing long-running transactions, handling rapidly evolving schemas, working with very large documents, querying hierarchical data, and running Web sites.

This article discusses use cases in three parts. Part one discusses document-centric XML, part two discusses data integration and semi-structured data, and part three discusses the remaining cases.

A Very Quick Review of Native XML Databases

There are two different ways to "store" XML documents in a database. The first is to map the document's schema to a database schema and transfer data according to that mapping. The second is to use a fixed set of structures that can store any XML document. To understand the difference, consider how a sales order document might be stored in a relational database. The first method uses a set of tables designed specifically for storing sales orders: Orders, Items, Customers, Parts, and so on. The second method uses a set of tables designed to hold arbitrary XML documents: Elements, Attributes, Text, and so on.

Databases that support the first method are called XML-enabled databases. (Such functionality is also provided by third-party middleware, which is said to XML-enable a database.) Databases that support the second method are called native XML databases. (While our example shows how to build a native XML database on top of a relational database, virtually all native XML databases are built from scratch.) A more theoretically correct way to say this is that XML-enabled databases have their own data model — relational, hierarchical, object-oriented — and map instances of the XML data model to instances of their data model. Native XML databases use the XML data model directly.

XML-enabled databases are useful when publishing existing data as XML or importing data from an XML document into an existing database. However, they are not a good way to store complete XML documents. The reason is that they store data and hierarchy but discard everything else: document identity, sibling order, comments, processing instructions, and so on. In addition, because they require design-time mapping of schemas, they cannot store documents whose schema is not known at design time. Native XML databases, on the other hand, store complete documents and can store any document, regardless of schema.

For a complete discussion of native XML databases, see Introduction to Native XML Databases or XML and Databases. We will only define two more terms. In a native XML database, a document is the fundamental unit of storage, equivalent to a row in a relational database. While this might be a document in the traditional sense, such as a chapter in a book, it might simply be a set of related data, such as the sequence of a gene, a list of known mutations to the gene, and a list of papers about that gene. The latter sense is important for things like semi-structured data. A collection is a set of related documents and plays a role similar to that of a table in a relational database or a directory in a file system.

And now, on to the use cases.

Storing and Querying Document-Centric XML

As might be guessed, the most common use case for native XML databases is storing and querying document-centric XML. Of the vendors surveyed, all but one listed it as a use case, and many noted it was their most common use case.

Documents in the real world

Let's start with some examples of how native XML databases are used with document-centric XML in the real world:

  • Elsevier Science, a publisher of scientific, technical, and medical information, uses Mark Logic's Content Interaction Server to manage more than two terabytes of data: five million full-text journal articles, 60 million citations and abstracts, thousands of complete books, and five thousand informational pamphlets. The system is used to search and transform documents. Raining Data's TigerLogic XML Data Management Server is used in similar fashion by large scientific publishing companies.

  • A variety of content management applications are built on top of IXIASOFT's TEXTML Server. For example, third-party editorial systems such as AILINK, Eurocortex, KnowledgeView, Protec, Rosebud, Modulo, and PCI use it to manage production, archive content, or both, and are "installed at most major publishing companies in the US and in Europe." General Dynamics uses TEXTML Server in AFCV, a browser used to view technical manuals for various aircraft, including AWACS surveillance airplanes and the F-16.

  • The Las Vegas Sun uses Snapbridge's Cross Media Server (a content management system built on its FDX XML Server native XML database) to manage content for its day-to-day operations as well as its Web site. The system holds more than 750,000 items (XML documents, images, and PDF documents), comprising tens of gigabytes of data.

  • An unnamed bank uses Xyleme Zone Server as the basis for an equity research portal, which is composed of multiple sub-portals. It is used by roughly 10,000 employees inside the bank and more than 30,000 customers outside the bank, who perform both contextual and full-text searches. Analysts add thousands of documents daily. As these are loaded, a separate engine queries them and notifies users of any changes of interest. Latency between loading and user notification is "a couple of minutes".

  • The Tasmanian Government uses TeraText DBS to power a Web site that allows users to track Tasmanian legislation. A single piece of legislation is stored as a series of time-stamped fragments. This allows users to track changes to the legislation over time. In addition, links in the documents refer to fragments also stored in the database, allowing them to be implemented as queries on the database.

  • Autodesk stores software manuals in Idiom's GEP, a content management system built on top of X-Hive/DB. Content is stored as reusable components, from which manuals are built. Manuals are published in a variety of formats (HTML, PDF, CHM, print) and over 30 languages. X-Hive/DB's versioning mechanism is used to track changes, which reduces translation time. The system also relies heavily on XQuery and XLink.

  • Amirsys uses Ipedo XML Store to manage descriptions of radiology diagnostic cases, image data, and data used to drive a document editor. The editor is designed for writing books about radiology and contains features such as queries on existing descriptions and the ability to insert images. Books are published using Apache FOP. In addition, managers can query across documents, such as to track the progress of authors or check that elements are used in the same way by all authors.

  • Le Monde uses Xyleme Zone Server to manage an archive holding more than 800,000 documents and using 6 gigabytes of storage. The archive is used by employees, partners, and customers.

  • The US Navy uses the GEMt content management system, which is built on Tamino, to store roughly 100,000 "volumes of on-board binders" for systems used on aircraft carriers.

  • The XML Transactional DOM from Ontonet is used to manage data about an archive of medieval manuscripts: a thousand-page book cataloging the archive, scholarly markup about the archive, and METS metadata about each manuscript in the archive.

Native XML databases are also used to store a variety of other types of documents, such as contracts, case law, drug information sheets, insurance claims, e-forms, product support procedures, classified ads, and intelligence documents.

Inside the applications

Applications use document-centric documents in a variety of ways, but most use falls into four broad categories: managing documents, finding documents, retrieving information, and reusing content.

Managing documents

Many applications need to store and retrieve documents. For example, a content management system might add a new document to its data store, or a Web server might retrieve a document for display. At the level of a native XML database, managing documents is quite simple. Applications either submit documents to be stored or request documents to be retrieved; the latter functionality uses a document ID, which is usually assigned by the user. (More complex functionality, such as the versioning, check-in/check-out, and workflow facilities found in content management systems, is usually built on top of the database.)

Finding documents

A wide class of applications needs to find whole documents. For example, a Web portal might allow users to search for all documents about a particular company and a content management system might allow users to find all documents relating to a certain part.

The least complex way to search for documents is with full-text searches. In native XML databases, these are XML-aware. That is, they distinguish between content (which is searched) and markup (which is not).

More complex are structured queries, which can query markup, text, or both. (XPath and XQuery are examples of structured query languages; native XML databases support a number of proprietary languages as well.) For example, consider the following queries:

  • Find all books that Maria Lopez wrote:

    for $b in collection("books")
    where $b//Author="Maria Lopez"
    return $b
  • Find all articles written after June 1, 2004 with the words "presidential election" in the title:

    for $a in collection("articles")
    where $a//Date > 2004-06-01 and
          fn:contains($a//Title, "presidential election")  
    return $a
  • Find all procedures with more than seven steps:

    for $p in collection("procedures")
    let $s := $p//Step
    where fn:count($s) > 7
    return $p

Although these queries are relatively simple, none can be satisfied by a full-text search: the first two queries restrict the search to certain sections of the document and the third doesn't even query the text. It is also interesting to note that these queries do not require all documents to use the same schema; they only require documents to contain certain common elements that have roughly the same meaning.

Retrieving information

Although documents contain useful data, they haven't traditionally been used as a source of data. XML and XML query languages make that possible. For example, consider the following queries:

  • From a procedure for synthesizing a compound, list the required chemicals:

    for $p in collection("procedures")
       <Chemicals procedure="{$p/Title}">
  • Find maintenance procedures for a specific spare part of a specific airplane with a specific effectivity:

    let $today = fn:current-date()
    for $proc in collection("maintenance_docs")//Procedure
    let $p = $proc//Part
    where $p = "AX723" and $p//AppliesTo = "Model 1023i"
          and fn:date-greater-than($today, $p//EffectivityStart)
          and fn:date-less-than($today, $p//EffectivityEnd)
    return $proc
  • Create a table of contents or index from a book document.

These queries are fundamentally different from those that return whole documents to be read or modified. Instead, they answer questions, create reports, or construct entirely new documents.

Reusing content

Reuse represents an important way for companies to extend the value of their investment in content. For example:

  • Knowledge-based companies, such as newspapers and scientific publishers, commonly re-package and re-sell content. In addition, their own writers can reuse content, such as for background in a newspaper article or the basis of a quickly evolving online story.

  • Companies that manufacture complex systems, such as airplanes and ships, must create and maintain large amounts of documentation. Since such systems are often configured for a specific client, each must have its own documentation. By modularizing documentation, custom manuals can be built from a library of topics. This reduces the chance of inconsistent documentation as well as surrounding costs, such as for editing and translation.

Other examples include building contracts from libraries of boilerplate text, publishing blogs and news releases as RSS feeds, and making documents available through an internal Web portal for arbitrary reuse.

Why You Need a Native XML Database

Assuming they were built at all, many of the applications described earlier were originally built from full-text search engines, relational databases, and flat files. For example, they stored metadata in the relational database and documents in the file system, or they stored documents as CLOBs and copied them to the file system for use with a full-text search engine.

These systems suffered from two main problems. The first was scalability. According to one vendor, such systems "usually degrade very quickly past a few thousand documents, while the applications typically involve millions of documents." The second problem was the lack of structured queries, since the full-text search engines were not XML-aware, and queries over metadata were limited to a few fields. Other problems included synchronization between the database and non-database components, the need to write custom code to process results, lack of node-based updates (a problem for large documents), brittleness in the face of evolving schemas, and the usual laundry list of reasons for using a database in the first place: concurrency, security, transactional safety, and so on.

(This is not to say that such systems don't work, especially under controlled conditions. For example, the American Geophysical Union uses one to manage roughly 95,000 scientific papers. Full papers are stored in a search engine repository, which provides full-text searches to Web site users. Metadata is extracted and stored in a relational database, where it is used to generate abstracts, bibliographic entries, and live citations. In the system's favor are a low growth rate (about 25 papers per day), very few updates, XML-awareness in the search engine, and a slowly evolving schema. The latter is particularly important, as schema evolution is the biggest problem in the system, requiring changes to the database schema, the XML-to-database mapping, and the extraction code.)

On the other hand, native XML databases have a number of features that are useful for working with document-centric XML. The most important are the XML data model, which is flexible enough to model documents, XML-aware full-text searches, and structured query languages like XQuery. These allow documents to be stored and queried in a single location, rather than multiple locations held together by glue code. Other useful features include node-level updates (which reduce the cost of updating large documents), links, versioning, and more flexibility in handling schema evolution than is found in relational databases. And while not all native XML databases can scale into the gigabyte or terabyte range, some clearly can.

A Peek into the Future

We will finish with a look at some recent developments in the field of content management:

  • Some content management systems are being built on native XML databases rather than the file system or relational databases. For example, GEMt is based on Tamino; Docato, WorldServer GEP, and UltraXML are based on X-Hive/DB; Ingeniux CMS and Ektron CMS300 use TEXTML Server; TeraText DMS is based on TeraText DBS; and Syncato is based on Berkeley DB XML. While it is too early to say whether major systems like Documentum — which currently uses a relational database — will switch to native XML storage, it seems that the only barrier, at least for XML, is migration.

  • JSR 170, which specifies a standard Java API for content repositories, has replaced SQL with XPath as its required query language. SQL support is now optional.

  • Native XML databases are being used to extend systems that treat XML documents as opaque objects. IXIASOFT offers a plug-in for Microsoft's Content Management Server, which stores XML documents as CLOBs and allows queries only on metadata stored separately. The plug-in uses TEXTML Server to index documents and allows users to perform XML-aware full-text queries. These return pointers to the original documents. Similarly, QuiLogic offers a filter for the Windows Indexing Service. This uses SQL/XML-IMDB to query XML documents and return the contents of particular elements or attributes to the Indexing Service for indexing.

In our next installment, we will look at how native XML databases are used to integrate data and work with semi-structured data.