Sign In/My Account | View Cart  
advertisement


Listen Print Discuss

Going Native: Making the Case for XML Databases

by Ronald Bourret
March 30, 2005

When native XML databases appeared on the heels of the XML 1.0 recommendation, most people weren't sure what to make of them. Were they a replacement for relational databases or a return to hierarchical databases? No, said the vendors. They were designed to manage large numbers of XML documents. But since many XML documents could be mapped to relational tables, the only obvious targets were document-centric XML, which lacked structure, and catalogs, which had deep hierarchies. Beyond that, it wasn't clear what they could be used for.

Got XML?

So what are the use cases for native XML databases? As John Merrells, one of the developers of Sleepycat Software's Berkeley DB XML, waggishly put it, there is only one use case, and that is simply, "Got XML?" In other words, if you have more than a handful of XML documents that you need to store, you should store them in a native XML database. The reasons are the same as for storing data in any other database; ease of management, enhanced query performance, concurrent access, transactional safety, security, and so on.

Other vendors are more specific: native XML databases are used in a wide number of fields, from genetics to health care to insurance, and technologies, from data integration to messaging to Web sites. This article describes the most common use cases, and is based on discussions with roughly half of the native XML database vendors, as well as a handful of customers. It does not attempt to determine the size of the current native XML database market — a task left to the market analysts — nor does it attempt to describe the capabilities of individual native XML databases, which vary significantly.

While most real-world uses of native XML databases do not fit cleanly into any single category, it is possible to characterize them in terms of a limited number of use cases. The most popular of these are storing and querying document-centric XML, integrating data, and storing and querying semi-structured data. Native XML databases are used in these cases because the data involved does not easily fit the relational data model, while it does fit the XML data model.

Other common use cases for native XML databases include managing long-running transactions, handling rapidly evolving schemas, working with very large documents, querying hierarchical data, and running Web sites.

This article discusses use cases in three parts. Part one discusses document-centric XML, part two discusses data integration and semi-structured data, and part three discusses the remaining cases.

A Very Quick Review of Native XML Databases

There are two different ways to "store" XML documents in a database. The first is to map the document's schema to a database schema and transfer data according to that mapping. The second is to use a fixed set of structures that can store any XML document. To understand the difference, consider how a sales order document might be stored in a relational database. The first method uses a set of tables designed specifically for storing sales orders: Orders, Items, Customers, Parts, and so on. The second method uses a set of tables designed to hold arbitrary XML documents: Elements, Attributes, Text, and so on.

Databases that support the first method are called XML-enabled databases. (Such functionality is also provided by third-party middleware, which is said to XML-enable a database.) Databases that support the second method are called native XML databases. (While our example shows how to build a native XML database on top of a relational database, virtually all native XML databases are built from scratch.) A more theoretically correct way to say this is that XML-enabled databases have their own data model — relational, hierarchical, object-oriented — and map instances of the XML data model to instances of their data model. Native XML databases use the XML data model directly.

XML-enabled databases are useful when publishing existing data as XML or importing data from an XML document into an existing database. However, they are not a good way to store complete XML documents. The reason is that they store data and hierarchy but discard everything else: document identity, sibling order, comments, processing instructions, and so on. In addition, because they require design-time mapping of schemas, they cannot store documents whose schema is not known at design time. Native XML databases, on the other hand, store complete documents and can store any document, regardless of schema.

For a complete discussion of native XML databases, see Introduction to Native XML Databases or XML and Databases. We will only define two more terms. In a native XML database, a document is the fundamental unit of storage, equivalent to a row in a relational database. While this might be a document in the traditional sense, such as a chapter in a book, it might simply be a set of related data, such as the sequence of a gene, a list of known mutations to the gene, and a list of papers about that gene. The latter sense is important for things like semi-structured data. A collection is a set of related documents and plays a role similar to that of a table in a relational database or a directory in a file system.

And now, on to the use cases.

Storing and Querying Document-Centric XML

As might be guessed, the most common use case for native XML databases is storing and querying document-centric XML. Of the vendors surveyed, all but one listed it as a use case, and many noted it was their most common use case.

Documents in the real world

Let's start with some examples of how native XML databases are used with document-centric XML in the real world:

  • Elsevier Science, a publisher of scientific, technical, and medical information, uses Mark Logic's Content Interaction Server to manage more than two terabytes of data: five million full-text journal articles, 60 million citations and abstracts, thousands of complete books, and five thousand informational pamphlets. The system is used to search and transform documents. Raining Data's TigerLogic XML Data Management Server is used in similar fashion by large scientific publishing companies.

  • A variety of content management applications are built on top of IXIASOFT's TEXTML Server. For example, third-party editorial systems such as AILINK, Eurocortex, KnowledgeView, Protec, Rosebud, Modulo, and PCI use it to manage production, archive content, or both, and are "installed at most major publishing companies in the US and in Europe." General Dynamics uses TEXTML Server in AFCV, a browser used to view technical manuals for various aircraft, including AWACS surveillance airplanes and the F-16.

  • The Las Vegas Sun uses Snapbridge's Cross Media Server (a content management system built on its FDX XML Server native XML database) to manage content for its day-to-day operations as well as its Web site. The system holds more than 750,000 items (XML documents, images, and PDF documents), comprising tens of gigabytes of data.

  • An unnamed bank uses Xyleme Zone Server as the basis for an equity research portal, which is composed of multiple sub-portals. It is used by roughly 10,000 employees inside the bank and more than 30,000 customers outside the bank, who perform both contextual and full-text searches. Analysts add thousands of documents daily. As these are loaded, a separate engine queries them and notifies users of any changes of interest. Latency between loading and user notification is "a couple of minutes".

  • The Tasmanian Government uses TeraText DBS to power a Web site that allows users to track Tasmanian legislation. A single piece of legislation is stored as a series of time-stamped fragments. This allows users to track changes to the legislation over time. In addition, links in the documents refer to fragments also stored in the database, allowing them to be implemented as queries on the database.

  • Autodesk stores software manuals in Idiom's GEP, a content management system built on top of X-Hive/DB. Content is stored as reusable components, from which manuals are built. Manuals are published in a variety of formats (HTML, PDF, CHM, print) and over 30 languages. X-Hive/DB's versioning mechanism is used to track changes, which reduces translation time. The system also relies heavily on XQuery and XLink.

  • Amirsys uses Ipedo XML Store to manage descriptions of radiology diagnostic cases, image data, and data used to drive a document editor. The editor is designed for writing books about radiology and contains features such as queries on existing descriptions and the ability to insert images. Books are published using Apache FOP. In addition, managers can query across documents, such as to track the progress of authors or check that elements are used in the same way by all authors.

  • Le Monde uses Xyleme Zone Server to manage an archive holding more than 800,000 documents and using 6 gigabytes of storage. The archive is used by employees, partners, and customers.

  • The US Navy uses the GEMt content management system, which is built on Tamino, to store roughly 100,000 "volumes of on-board binders" for systems used on aircraft carriers.

  • The XML Transactional DOM from Ontonet is used to manage data about an archive of medieval manuscripts: a thousand-page book cataloging the archive, scholarly markup about the archive, and METS metadata about each manuscript in the archive.

Native XML databases are also used to store a variety of other types of documents, such as contracts, case law, drug information sheets, insurance claims, e-forms, product support procedures, classified ads, and intelligence documents.

Inside the applications

Applications use document-centric documents in a variety of ways, but most use falls into four broad categories: managing documents, finding documents, retrieving information, and reusing content.

Managing documents

Many applications need to store and retrieve documents. For example, a content management system might add a new document to its data store, or a Web server might retrieve a document for display. At the level of a native XML database, managing documents is quite simple. Applications either submit documents to be stored or request documents to be retrieved; the latter functionality uses a document ID, which is usually assigned by the user. (More complex functionality, such as the versioning, check-in/check-out, and workflow facilities found in content management systems, is usually built on top of the database.)

Finding documents

A wide class of applications needs to find whole documents. For example, a Web portal might allow users to search for all documents about a particular company and a content management system might allow users to find all documents relating to a certain part.

The least complex way to search for documents is with full-text searches. In native XML databases, these are XML-aware. That is, they distinguish between content (which is searched) and markup (which is not).

Related Reading

XPath and XPointer
Locating Content in XML Documents
By John E. Simpson

Table of Contents
Index
Sample Chapter

Read Online--Safari Search this book on Safari:
 

Code Fragments only

More complex are structured queries, which can query markup, text, or both. (XPath and XQuery are examples of structured query languages; native XML databases support a number of proprietary languages as well.) For example, consider the following queries:

  • Find all books that Maria Lopez wrote:

    for $b in collection("books")
    where $b//Author="Maria Lopez"
    return $b
  • Find all articles written after June 1, 2004 with the words "presidential election" in the title:

    for $a in collection("articles")
    where $a//Date > 2004-06-01 and
          fn:contains($a//Title, "presidential election")  
    return $a
  • Find all procedures with more than seven steps:

    for $p in collection("procedures")
    let $s := $p//Step
    where fn:count($s) > 7
    return $p

Although these queries are relatively simple, none can be satisfied by a full-text search: the first two queries restrict the search to certain sections of the document and the third doesn't even query the text. It is also interesting to note that these queries do not require all documents to use the same schema; they only require documents to contain certain common elements that have roughly the same meaning.

Retrieving information

Although documents contain useful data, they haven't traditionally been used as a source of data. XML and XML query languages make that possible. For example, consider the following queries:

  • From a procedure for synthesizing a compound, list the required chemicals:

    for $p in collection("procedures")
    return
       <Chemicals procedure="{$p/Title}">
          {$p//Chemical}
       </Chemicals>
  • Find maintenance procedures for a specific spare part of a specific airplane with a specific effectivity:

    let $today = fn:current-date()
    for $proc in collection("maintenance_docs")//Procedure
    let $p = $proc//Part
    where $p = "AX723" and $p//AppliesTo = "Model 1023i"
          and fn:date-greater-than($today, $p//EffectivityStart)
          and fn:date-less-than($today, $p//EffectivityEnd)
    return $proc
  • Create a table of contents or index from a book document.

These queries are fundamentally different from those that return whole documents to be read or modified. Instead, they answer questions, create reports, or construct entirely new documents.

Reusing content

Reuse represents an important way for companies to extend the value of their investment in content. For example:

  • Knowledge-based companies, such as newspapers and scientific publishers, commonly re-package and re-sell content. In addition, their own writers can reuse content, such as for background in a newspaper article or the basis of a quickly evolving online story.

  • Companies that manufacture complex systems, such as airplanes and ships, must create and maintain large amounts of documentation. Since such systems are often configured for a specific client, each must have its own documentation. By modularizing documentation, custom manuals can be built from a library of topics. This reduces the chance of inconsistent documentation as well as surrounding costs, such as for editing and translation.

Other examples include building contracts from libraries of boilerplate text, publishing blogs and news releases as RSS feeds, and making documents available through an internal Web portal for arbitrary reuse.

Why You Need a Native XML Database

Assuming they were built at all, many of the applications described earlier were originally built from full-text search engines, relational databases, and flat files. For example, they stored metadata in the relational database and documents in the file system, or they stored documents as CLOBs and copied them to the file system for use with a full-text search engine.

These systems suffered from two main problems. The first was scalability. According to one vendor, such systems "usually degrade very quickly past a few thousand documents, while the applications typically involve millions of documents." The second problem was the lack of structured queries, since the full-text search engines were not XML-aware, and queries over metadata were limited to a few fields. Other problems included synchronization between the database and non-database components, the need to write custom code to process results, lack of node-based updates (a problem for large documents), brittleness in the face of evolving schemas, and the usual laundry list of reasons for using a database in the first place: concurrency, security, transactional safety, and so on.

(This is not to say that such systems don't work, especially under controlled conditions. For example, the American Geophysical Union uses one to manage roughly 95,000 scientific papers. Full papers are stored in a search engine repository, which provides full-text searches to Web site users. Metadata is extracted and stored in a relational database, where it is used to generate abstracts, bibliographic entries, and live citations. In the system's favor are a low growth rate (about 25 papers per day), very few updates, XML-awareness in the search engine, and a slowly evolving schema. The latter is particularly important, as schema evolution is the biggest problem in the system, requiring changes to the database schema, the XML-to-database mapping, and the extraction code.)

On the other hand, native XML databases have a number of features that are useful for working with document-centric XML. The most important are the XML data model, which is flexible enough to model documents, XML-aware full-text searches, and structured query languages like XQuery. These allow documents to be stored and queried in a single location, rather than multiple locations held together by glue code. Other useful features include node-level updates (which reduce the cost of updating large documents), links, versioning, and more flexibility in handling schema evolution than is found in relational databases. And while not all native XML databases can scale into the gigabyte or terabyte range, some clearly can.

A Peek into the Future

We will finish with a look at some recent developments in the field of content management:

  • Some content management systems are being built on native XML databases rather than the file system or relational databases. For example, GEMt is based on Tamino; Docato, WorldServer GEP, and UltraXML are based on X-Hive/DB; Ingeniux CMS and Ektron CMS300 use TEXTML Server; TeraText DMS is based on TeraText DBS; and Syncato is based on Berkeley DB XML. While it is too early to say whether major systems like Documentum — which currently uses a relational database — will switch to native XML storage, it seems that the only barrier, at least for XML, is migration.

  • JSR 170, which specifies a standard Java API for content repositories, has replaced SQL with XPath as its required query language. SQL support is now optional.

  • Native XML databases are being used to extend systems that treat XML documents as opaque objects. IXIASOFT offers a plug-in for Microsoft's Content Management Server, which stores XML documents as CLOBs and allows queries only on metadata stored separately. The plug-in uses TEXTML Server to index documents and allows users to perform XML-aware full-text queries. These return pointers to the original documents. Similarly, QuiLogic offers a filter for the Windows Indexing Service. This uses SQL/XML-IMDB to query XML documents and return the contents of particular elements or attributes to the Indexing Service for indexing.

In our next installment, we will look at how native XML databases are used to integrate data and work with semi-structured data.


Comment on this articleShare your experience in our forums.
(* You must be a
member of XML.com to use this feature.)
Comment on this Article


Titles Only Titles Only Newest First
  • Battery Jumpstart Assistance Los Angeles 1-877-364-5264
    2009-07-02 14:55:07 carpetcare [Reply]

    24 Hour Emergency Towing Service
    24 Hour Winching Service
    24 Hour Battery and Jumpstart Assistance
    24 Hour Flat Tire Service
    24 Hour Lost Key and Lockout Service
    Up to $100 for a Locksmith
    24 Hour Out-of-Gas Fuel Delivery
    Emergency Transportation Service
    24 Hour Emergency Personal Assistance
    24 Hour Map and Routing Service
    Roadside Assistance Small Maintenance Repair Fixing Auto Services.
    24 Hour Emergency Towing Service: As a member, your vehicle can be towed up to 15 miles to the repair facility of your choice. We will pay up to $100.00 for winching services if necessary. Note: Only one tow per disablement.


    24 Hour Winching Service: If your vehicle is stuck and you only need to have your vehicle extracted by use of a winch, then we cover that too.


    24 Hour Battery and Jumpstart Assistance: If you are unable to get your car started, a maintenance person will be sent to provide a battery jump and help you get your car running again. Whether on the road or parked at home we will send someone out. If your battery cannot be jumped then your vehicle will be towed to a repair facility of your choice.


    24 Hour Flat Tire Service: We will send a maintenance person to you to change your vehicle's flat tire to a drivable spare, enabling you to drive to safety. It is your responsibility to have a safe and working spare with your vehicle at all times.


    24 Hour Lost Key and Lockout Service: We will send a pre-authorized contractor to unlock your vehicle using manufacturer approved tools. If there is not a network contractor available in your area, as a member, you are entitled to $100.00 toward the cost of a locksmith.
    Licensed, Insured & Bonded Professional Locksmiths



    24 Hour Out-of-Gas Fuel Delivery: If you should run out of gasoline, a service technician will deliver fuel to get you on your way. Your membership includes 3 gallons of fuel delivered to you free of charge.


    Emergency Transportation Service: If you find yourself on the side of the road with your car disabled give us a call and we will arrange transportation for you for up to 15 miles. This service can only be used instead of another service. It is for those times where you just have to get where you are going whether your vehicle is disabled or not.


    24 Hour Emergency Personal Assistance: During any weather related or roadside emergency, our customer service staff can relay messages to loved ones and receive messages for you, helping you to put your mind at ease.
    http://www.roadsideassistancelosangeles.com
    1-877-364-5264

  • XML data model vs. relational model
    2005-04-01 02:41:31 gregoryjorgensen [Reply]

    From the article:


    "While most real-world uses of native XML databases do not fit cleanly into any single category, it is possible to characterize them in terms of a limited number of use cases. The most popular of these are storing and querying document-centric XML, integrating data, and storing and querying semi-structured data. Native XML databases are used in these cases because the data involved does not easily fit the relational data model, while it does fit the XML data model."


    What is the XML data model? Hierachy expressed with markup? Does the XML model have any of the mathematical foundations of the relational model? On what basis can the XML and relational models be compared?


    I've never seen an example of structured data that doesn't fit the relational model but does fit the XML model. Can you give an example and explain why it doesn't fit the relational model?


    Greg Jorgensen
    PDXperts LLC
    Portland, Oregon USA


    • XML data model vs. relational model
      2005-04-07 16:53:04 Mike Trotman [Reply]

      I think the key difference between the relational and XML database models - which the article does point out - is that the native XML model stores / makes available information about the structure of the data - as well as just the data - through the same interface.


      This is what I have found to be the most important distinction and consideration when deciding to use XML for data representation - your data structure is 'naturally' also data of the same kind.


      An analogy is the difference between a functional programming language like LISP that can operate on itself as part of the design vs. a procedural one like C that either cannot - or requires tortuous code / hacking to achieve the same effect.


      This often means that queries, processing etc. can be easily made much more generic / re-useable as they can have a large independence from any specific structure / syntax. And this also makes it much easier to mix information from many different structures.


      So - it's less a question of whether the data in the structure can be mapped between the two cases, more a question of what you can do with the data structures.

    • XML data model vs. relational model
      2005-04-01 22:55:46 rpbourret [Reply]

      There are actually a number of XML data models. For example, DOM, SAX, XPath 1.0, XQuery 1.0 / XPath 2.0, and the Infoset all explicitly or implicity define data models. What these all have in common is that they are ordered trees of nodes (elements, attributes, text, etc.) and scalar values. (The XQuery data model is actually an ordered forest.)


      Although it's not quite technically correct, it's easiest to think of the XML data model as a DOM tree. No markup in sight -- just an ordered hierarchy of nodes, with values at the leaves.


      There is no single XML data model used by all native XML databases. This is because most native XML databases predate XQuery and some even predate XPath. Consequently, many invented their own query languages and corresponding data models. Currently, the most common query language in native XML databases is XPath (with extensions for multi-document queries). In the future it will be XQuery. For information about the XQuery / XPath 2.0 data model, see http://www.w3.org/TR/xpath-datamodel/


      As Michael noted, there isn't anything you can model with the XML data model but not the relational data model. In fact, there is a generic mapping from DTDs to relational schemas (see http://www.xml.com/lpt/a/2001/05/09/dtdtodbs.html). The only problem will occur with schemaless XML, and you can always map that to a generic set of tables (Elements, Attributes, etc.)


      So to quote Michael again, the problem is practical, not theoretical. For example, if you map the XHTML schema to a relational schema, the p element alone is mapped to 36 different tables. As you can imagine, reconstructing an XHTML document requires an impractical number of joins.

      • XML data model vs. relational model
        2005-04-22 21:22:24 gtnicol [Reply]

        >So to quote Michael again, the problem is
        >practical, not theoretical. For example, if you map
        >the XHTML schema to a relational schema, the p >element alone is mapped to 36 different tables. As
        >you can imagine, reconstructing an XHTML document >requires an impractical number of joins.


        This is not strictly true... there are a number of ways of mapping XML onto a relational database that don't require joins... especially given that most real-world documents are not as deeply nested as people might think (typically 3-10 levels).


        Speaking as someone that has used XML/SGML databases under CMS systems for years (we have 2 XML databases), one of the main reasons I've seen for it is that real-world documents tend to be somewhat variable, even in very tightly constrained environments where DTD/Schema validation is required (such as the military). XML databases, for the most part tend to handle such variability well (especially those that are schema-independent in their storage). Typical RDBMS-based systems tend to be much more rigorous, and hence, less flexible (though that is not always the case).


        The other area where, for documents, XML helps, is with the combination of fulltext and structural queries. Given those capabilities, you can build fairly sophisticated hypertext applications, and also enable fine-grained reuse of content in ways that are difficult in most RDBMS-based systems. The tradeoff here is much like the tradeoff between tightly and loosely coupled distributed systems.


        In fact, looking at the list of deployments listed, I would say that for all of these, reuse and flexibility would be two of the reasons people used XML. I bet other factors, such as management paradigm, and ease of integration played a significant role as well.


    • XML data model vs. relational model
      2005-04-01 06:33:34 mchampion [Reply]

      As I understand it, E. F. Codd demonstrated conclusively that any data schema can be normalized into the relational model. Thus I believe it is logically impossible to show an "example of structured data that doesn't fit the relational model but does fit the XML model".


      The use case issue comes down to pragmatics, not theoretics: XML and XDBs tends to be a more *practical* solution for data that irregular, deeply hierarchical, and recursive ... like an awful lot of real world documents, or data about real-world hierarchies (organizations, component-subcomponent assemblies, etc.). One can certainly build a pure relational model of these things, and might possibly be able to build a relatively portable SQL implementation, but one hits all sorts of practical limitations - multiway joins are usually required (which gets pretty slow in practice with more than a few tables), queries get complex, and only a few geniuses can understand how it all fits togther.


      XML and XML-capable DBs allow ordinary mortals who've "got XML" to work with this sort of data in a standardized, portable, and reasonably efficient way. The theoretical links back to the relational model are being forged, albeit slowly. A couple data points for that: Don Chamberlin, one of the inventors of SQL, is editor of the XQuery spec; and some of the set theory that Codd cited in his original work has been shown to handle XML (see xsp.xegesis.org for a collection of papers).