Going Native, Part 3
May 25, 2005
In parts one and two of this article, we discussed how native XML databases are used to store and query document-centric XML; integrate data; and work with semi-structured data. In this, the final part, we discuss their use in schema evolution, long-running transactions, handling large documents, and a number of other cases, as well as how relational databases are evolving to handle XML.
As someone accustomed to the relatively rigid schemas of the relational world, I react to stories of rapid schema evolution with a mixture of horror and a sense that perhaps the people involved aren't as, well, responsible as they should be. In spite of this, almost every vendor and customer I spoke to listed schema evolution as one reason to use a native XML database. Worse yet, most had good reasons for doing so.
Schema Evolution in the Real World
Schema evolution is a normal thing. In the relational world it moves slowly, for both technical and political reasons. On the technical side, relational databases do not handle schema changes easily: existing data must be updated to match the new schema and altering tables may require unloading and reloading data. On the political side, database administrators (DBAs) tend to approach change cautiously because they don't want to break existing applications or destabilize database tuning.
(A number of vendors also noted that native XML databases allow developers to do an end run around DBAs, resulting in faster development times. One reason for this might be that native XML databases are often used to cache data on the middle tier, meaning DBAs are not aware of them and have not yet brought them under their control. Another reason might be that native XML databases do not have as many tuning options as relational databases, meaning that DBAs have less reason to exert control.)
In the XML world, change moves faster. This is sometimes due to the newness of XML. For example, FIXML has had four versions in six years and FpML has had four versions in just three years. XML has also exposed users to more sources of change. For example, the schemas used to move data across organizational boundaries are often controlled by other departments or trading partners. And XML is being used in rapidly evolving fields, such as finance and biology, as well as fields with long life spans, such as mortgage and insurance contracts, both of which force users to handle many versions of a schema.
Inside the Applications
Handling schema evolution is rarely easy. The easiest solution is to update data to conform to the new schema and update applications accordingly. Unfortunately, this is not always possible. For example, updating existing data might be too expensive or might be prohibited (such as with contracts), new fields might not have reasonable defaults, or multiple applications, which cannot all be updated, might use the data.
When data cannot be updated, applications must handle both backwards and forwards compatibility. Since documents conforming to multiple versions of a schema are commonly stored together in native XML databases, applications must determine which version of a schema is being used, such as by checking a version attribute or checking whether a particular field exists.
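Both detection techniques can be sketched in a few lines. The following is a minimal illustration, not code from any product discussed here: the `contract` document, its `version` attribute, the `currency` field, and the version numbers are all hypothetical.

```python
import xml.etree.ElementTree as ET

def detect_version(xml_text: str) -> str:
    """Return the schema version a document appears to conform to."""
    root = ET.fromstring(xml_text)
    # Preferred: an explicit version attribute on the root element.
    version = root.get("version")
    if version is not None:
        return version
    # Fallback: infer the version from a field that (in this hypothetical
    # schema history) was only introduced in version 2.0.
    if root.find("currency") is not None:
        return "2.0"
    return "1.0"

old_doc = "<contract><customer>Acme</customer><amount>100</amount></contract>"
new_doc = ("<contract><customer>Acme</customer><amount>100</amount>"
           "<currency>EUR</currency></contract>")
tagged_doc = '<contract version="3.0"><customer>Acme</customer></contract>'
```

Checking for a field is fragile if the field is optional in the newer schema, which is one reason an explicit version attribute is usually the safer choice.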
Handling backwards compatibility usually just means a lot of hard work, such as providing default values for fields added in a new schema or processing each version of a field differently. However, some problems have no definitive solution, such as how to compute the average of a field not found in all documents.
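The averaging problem is easy to see with a small sketch. The `order` documents and the `discount` field below are invented for illustration; the point is only that the two plausible policies give different answers and neither is obviously right.

```python
import xml.etree.ElementTree as ET

docs = [
    "<order><total>100</total><discount>10</discount></order>",
    "<order><total>200</total></order>",  # older schema: no discount field
    "<order><total>300</total><discount>20</discount></order>",
]

def discounts(default=None):
    """Collect discount values; use `default` for documents lacking the field,
    or skip those documents entirely when no default is given."""
    values = []
    for text in docs:
        node = ET.fromstring(text).find("discount")
        if node is not None:
            values.append(float(node.text))
        elif default is not None:
            values.append(default)
    return values

def average(values):
    return sum(values) / len(values)

skip_missing = average(discounts())            # only documents that have the field
assume_zero = average(discounts(default=0.0))  # treat a missing field as zero
```

Here `skip_missing` is 15.0 and `assume_zero` is 10.0. Which one is correct depends on what the absence of the field means, and that is a business decision rather than a technical one.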
Handling forwards compatibility means protecting applications against an unknown future. A liberal strategy is to ignore all unrecognized fields. Unfortunately, this is risky, as new fields may change the semantics of existing fields. A more conservative strategy is to only process documents with a known version number. This allows applications to continue working until it can be determined whether a schema change breaks existing code.
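The conservative strategy amounts to a gate in front of the processing code. A minimal sketch, assuming documents carry a `version` attribute on the root element (the set of known versions is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical: the versions this application has been tested against.
KNOWN_VERSIONS = {"1.0", "1.1"}

def partition_by_version(documents):
    """Split documents into those safe to process now and those held back
    until someone has checked what the newer schema changes."""
    processable, held = [], []
    for text in documents:
        version = ET.fromstring(text).get("version")
        if version in KNOWN_VERSIONS:
            processable.append(text)
        else:
            held.append(text)  # unknown or missing version: keep, do not process
    return processable, held

incoming = [
    '<claim version="1.0"><amount>50</amount></claim>',
    '<claim version="2.0"><amount unit="cents">5000</amount></claim>',
]
ready, waiting = partition_by_version(incoming)
```

Note that held documents are kept rather than rejected, so nothing is lost while the new schema is reviewed.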
A strategy that avoids many forwards and backwards compatibility problems is to query only those fields that are unlikely to change. This works particularly well when humans are involved. For example, a customer service representative might search for contracts involving a particular customer or a researcher might search for documents describing a particular organism. In both cases, searches are done on stable fields (customer or species name) and the reader can resolve any differences in schemas.
Why You Need a Native XML Database
The main advantage of native XML databases with respect to schema evolution is the ability to store documents conforming to several different versions of a schema. This has several advantages over relational databases, which require data to conform to a single schema:
Schemas can be changed without migrating existing data, which relational databases require. For large data sets or rapidly evolving schemas, migration can be prohibitively expensive. Furthermore, it is not always possible to migrate data. For example, changing a contract invalidates it.
The database can handle schema changes for which there is no data migration path, such as when a new field is required and has no reasonable default. In a native XML database, new documents can be stored in the same collection as old documents. In a relational database, new data must be stored in a different table from the old data, since the old data cannot be migrated. As a result, queries over unchanged fields continue to work in the native XML database but fail in the relational database because they do not include the new table.
Data can be stored, even if it conforms to an unknown version of a schema. This means that no data is lost, even if it cannot be used immediately. Depending on the forward compatibility strategy, it might even be possible to process the new data.
A secondary advantage of native XML databases is support for XQuery. The conditional expressions and user-defined functions in this language are very useful in querying documents conforming to multiple versions of a schema.
This is not to say that native XML databases solve all schema evolution problems. Far from it; schema evolution remains a painful problem that requires both foresight and hard work. However, the consensus among vendors and customers is that the flexibility of native XML databases makes solutions possible where they weren't before.
A Peek into the Future
The only happy news about schema evolution is that more people are becoming aware of the problem, as evidenced by the number of emails and conference presentations on the subject. Personally, I hold little hope for any silver bullets, as the problem pre-dates XML and has not been solved yet.
Long-running transactions are real-world transactions such as processing insurance claims, approving mortgages, or fulfilling orders. They generally require a mixture of human and machine processing and take anywhere from hours to weeks. They differ from traditional transactions in that they do not lock resources for the duration of the transaction, and they use compensating transactions, such as refunds, instead of rollbacks.
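The difference between a rollback and a compensating transaction can be sketched as follows. This is an illustrative toy, not a workflow engine: the step names are invented, and a real system would persist its progress between steps, since hours or weeks may pass between them.

```python
def run_long_transaction(steps):
    """Run (action, compensation) pairs in order. If a step fails, run the
    compensations of the steps already completed, in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()  # e.g. issue a refund; the original charge still happened
            return False
        completed.append(compensate)
    return True

log = []

def ship_order():
    raise RuntimeError("carrier unavailable")

steps = [
    (lambda: log.append("charge card"), lambda: log.append("refund card")),
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (ship_order, lambda: log.append("cancel shipment")),
]

succeeded = run_long_transaction(steps)
```

After the failed shipment, `log` records the charge and the reservation followed by their compensations. Unlike a rollback, the earlier steps remain visible as having happened.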
How data flows through a long-running transaction depends on the application. It might be stored in a database and modified by a succession of applications, or it might be passed from application to application in one or more XML documents, as in a Service Oriented Architecture (SOA).
Native XML databases can be used in long-running transactions in a number of capacities:
Data stores. Much of the data in long-running transactions is document-centric (contracts, appraisals, accident descriptions) or integrated from a variety of sources (credit agencies, appraisers, backend databases). As we have seen, native XML databases are useful both for storing document-centric XML and integrating data. Whether the native XML database is the database of record depends on the application. In many cases, they serve as mid-tier data caches and data is off-loaded to backend databases.
Message queues. Unlike traditional message queues, native XML databases can perform content-based routing and transform messages into different formats. While native XML databases are slower than traditional message queues due to parsing and reassembling messages, as well as querying and transforming them, vendors did not report any performance problems. However, this may be because native XML databases have not yet been used in sufficiently demanding environments.
Metadata archives. In addition to storing application data, native XML databases are also used to store information used by applications. For example, Raining Data's TigerLogic XML Data Management Server is used in metadata-driven SOAs to store metadata about web services, access policies, and aggregated views of UDDI and home-grown service registries.
Data warehouses. When native XML databases are used as data stores or message queues, they can also serve as data warehouses that can be mined for information about data or messages.
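Content-based routing, mentioned above for message queues, means choosing a destination by looking inside the message rather than at an envelope or queue name. A minimal sketch, in which the message types, the threshold, and the queue names are all assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

def route(message: str) -> str:
    """Pick a destination queue from the content of an XML message."""
    root = ET.fromstring(message)
    if root.tag == "claim":
        # Large claims go to a human; small ones are handled automatically.
        amount = float(root.findtext("amount", default="0"))
        return "manual-review" if amount > 10000 else "auto-approve"
    if root.tag == "order":
        return "fulfillment"
    return "dead-letter"  # unrecognized message type
```

In a native XML database, the same decision would typically be expressed as a query over stored messages rather than as application code.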
It is interesting to note that several Enterprise Service Buses (ESBs), which are used to implement SOAs, include native XML databases: Sonic ESB includes Sonic XML Server, Software AG's Enterprise Service Integrator includes Tamino, and OpenLink's Virtuoso includes a BPEL engine. These systems use native XML databases for all of the reasons described above.
Handling Large Documents
Large documents are difficult to query due to the time it takes to parse them. Native XML databases solve this problem by parsing and indexing documents when they are inserted. This allows documents to be queried without further parsing and may even allow queries to be resolved only by searching indexes.
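The idea of paying the parsing cost once, at insert time, can be illustrated with a toy in-memory store. This is only a sketch of the principle, not how any particular product works: `TinyXmlStore` and its simple value index are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

class TinyXmlStore:
    """Toy store that parses and indexes each document once, on insert."""

    def __init__(self):
        self.documents = {}            # doc id -> parsed tree
        self.index = defaultdict(set)  # (element name, text value) -> doc ids

    def insert(self, doc_id, xml_text):
        root = ET.fromstring(xml_text)  # the only time the text is parsed
        self.documents[doc_id] = root
        for element in root.iter():
            text = (element.text or "").strip()
            if text:
                self.index[(element.tag, text)].add(doc_id)

    def find_ids(self, tag, value):
        """Answer an equality query from the index alone, touching no document."""
        return sorted(self.index.get((tag, value), set()))

store = TinyXmlStore()
store.insert("a", "<manual><product>Router</product><lang>en</lang></manual>")
store.insert("b", "<manual><product>Switch</product><lang>en</lang></manual>")
```

A call such as `store.find_ids("lang", "en")` is resolved entirely from the index, which is the case described above where a query never has to read the documents themselves.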
Large documents are also difficult to process with XSLT and DOM, as these require the entire document to be in memory. Since sufficiently large documents exceed available memory, some native XML databases solve this problem by implementing XSLT and DOM directly on top of the database. These implementations populate in-memory nodes as necessary and swap nodes back to disk as needed, allowing DOM and XSLT to be used with documents of almost arbitrary size. In addition, changes made to DOM trees are reflected back to the database, either immediately or in response to a special call.
The main use of such DOM implementations is in browsers and editors for document-centric documents, such as catalogs and technical manuals. While most of these are custom applications built on top of native XML databases, Infonyte has built a customizable browser (the Infonyte Reader) on top of its native XML database (Infonyte DB), which features both query and XSLT engines.
Hierarchical data is a use case that overlaps all other use cases, since virtually all XML is hierarchical. Hierarchical data is either heterogeneous, like sales orders, in which parents and children have different types, or homogeneous, like catalogs or bills of material, in which parents and children have the same type. In a relational database, heterogeneous hierarchies are stored in multiple tables, which must be joined during queries, and homogeneous hierarchies are stored in a single table, for which there are a variety of query strategies, including nested sets and recursive queries.
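To make the single-table case concrete, here is a small sketch of a bill of material stored as one self-referencing table and queried recursively. It uses SQLite through Python's standard `sqlite3` module purely for convenience; the `part` table and its contents are invented, and recursive query syntax varies between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO part VALUES (?, ?, ?)",
    [
        (1, "bicycle", None),
        (2, "wheel", 1),
        (3, "spoke", 2),
        (4, "frame", 1),
        (5, "rim", 2),
    ],
)

# Walk the hierarchy downward from one part, tracking depth as we go.
rows = conn.execute(
    """
    WITH RECURSIVE subtree(id, name, depth) AS (
        SELECT id, name, 0 FROM part WHERE id = ?
        UNION ALL
        SELECT p.id, p.name, s.depth + 1
        FROM part AS p JOIN subtree AS s ON p.parent_id = s.id
    )
    SELECT name, depth FROM subtree ORDER BY depth, name
    """,
    (2,),
).fetchall()
```

Starting from the wheel, `rows` contains the wheel at depth 0 and the rim and spoke at depth 1. In an XML document the same parts would simply be nested, and a path expression would return the subtree directly.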
While there is little public data available for the relative performance of native XML databases and relational databases in querying hierarchical data, it is interesting to note that Xyleme Zone Server outperforms Oracle 9i by a factor of 19.5 when using Oracle's test data and Oracle's object-relational XML storage. Similarly, another native XML database vendor asserted that even "three to four levels [in a heterogeneous hierarchy] present a problem for relational [databases], once the [number] of documents is in the hundreds of thousands." (Of interest, vendors report that most of their customers' hierarchies are five to ten levels deep, although up to thirty levels are not uncommon.)
Unfortunately, similar data is not available for relative query performance in homogeneous hierarchies. However, there is persistent confusion among SQL programmers about how to store and query hierarchical data. While this may be alleviated with the introduction of recursive queries in relational databases (available for several releases in Oracle, one release in DB2, and the next release in SQL Server), perhaps the most important thing that native XML databases bring to the table with respect to hierarchical data is a set of tools--notably query languages--that are explicitly designed for working with hierarchies.
This article has described the most common use cases for native XML databases. Some other use cases include:
Local and shared data management. QuiLogic's SQL/XML-IMDB is primarily used as a local data manager. That is, rather than using structures like lists and queues, SQL/XML-IMDB is used to store data of arbitrary complexity. This allows such structures to be handled in a declarative manner using XQuery and SQL. Furthermore, because SQL/XML-IMDB supports the use of shared memory, it can be used to share data among processes. For example, one such use allows a laboratory data collector written in C++ to talk to a front end written in Python.
Complete web site. Native XML databases can be used to build web sites: data is stored in the database as XML, queried and updated with XQuery, and transformed into XHTML with XQuery or XSLT. This is an experimental use case for Sedna, and has been implemented by a number of people. For example, the University of Virginia's Rotunda site is built on Mark Logic's Content Interaction Server.
Performance. Vendors report that customers do not choose native XML databases solely for performance reasons--that is, when a relational or object-oriented database would do--but that performance is frequently a secondary criterion. For example, applications that use document-centric XML or semi-structured data, and which originally used the file system or a relational database, were migrated to a native XML database for both feature and performance reasons.
Mid-tier data cache. While this use case overlaps other use cases, it is worth mentioning separately. Native XML databases are often used to cache data on the middle tier, such as in data integration, e-commerce, web sites, and long-running transactions. This is done both for performance and to manage data in a common format (XML).
A Final Peek into the Future
Our final peek into the future looks at relational databases. In the strongest endorsement of native XML databases to date, the major relational databases are adding native XML storage. This storage is used to implement a first-class XML data type, and data stored as this type can be queried with XPath or XQuery. It can also be mixed with relational data.
The implementation strategies used by relational databases are as varied as those found in commercial native XML databases: Oracle indexes documents and stores them as CLOBs; Sybase also indexes documents (it is not known how they store them); SQL Server stores pre-parsed documents as BLOBs, as well as in node-level storage built on relational tables (the query engine decides which to use at run time); and DB2 uses node-level storage built from the ground up.
In addition, Oracle is working on XML Data Synthesis (XDS), an XQuery-based data integration engine.
This article has examined how native XML databases are used in the real world--most commonly for managing documents, integrating data, and managing semi-structured data. What is important about these uses is that most represent cases where people have tried to use relational or other types of databases and have either failed or written less sophisticated applications than they would like. Native XML databases have succeeded because of their query languages (most notably XQuery, but also XML-aware full-text queries), the flexibility of the XML data model, and their ability to handle schema-less data.
So is a native XML database in your future? That question is best answered by quoting Arun Gaikwad. In an article about Xindice, a native XML database from Apache, he wrote: "A [native XML database] is something which you may think is unnecessary but once you start using it, you wonder how you would survive without it."
Thanks to the following organizations for contributing time and ideas to this article: American Geophysical Union, Bluestream Database Software, Cincom, data ex machina, IBM, Ipedo, ISPRAS modis, IXIASOFT, M/Gateway Developments, Mark Logic, Ontonet, OpenLink Software, QuiLogic, RainingData, Snapbridge Software, Software AG, X-Hive, Xpriori, and Xyleme. Thanks also to developers and users who chose to remain anonymous.