XML and Databases? Follow Your Nose

October 24, 2001

Following a recent XML-DEV discussion on how to choose the most appropriate database for your XML application, the XML-Deviant captures the indicators that will help bring you closer to a decision.

Decisions, decisions

One of the most difficult aspects of selecting a technology is getting an unbiased view of the pros and cons. Anyone who has found themselves mired in the marketing gloss applied to many white papers while questing for concrete technical information will be familiar with this problem.

This is where high quality user communities, like XML-DEV, can have their greatest benefit: providing a forum for objective discussion on the relative merits of particular technologies. Armed with facts, and aware of the trade-offs, the plucky developer is better able to make an informed decision. This process can be facilitated if others are willing to summarize these findings and report them to the community.

Faced with the trial of determining the appropriate database technology to use for an XML application, Brian Magick asked XML-DEV whether anyone had attempted to put together a relevant "decision tree". Magick later elaborated on his goal.

I just want to know in a general sense when to XML and when not to in regards to creating database applications or storing data in databases. Why would one want to keep data in a relational format versus converting it into XML to exploit an XML database (without getting into the specifics of any one product). For new applications what are the concerns a developer would need to consider when deciding to go the traditional route versus deciding to use this new "cool" XML technology. One of our concerns is that by adopting an XML database all developers will want to use this cool new tool when in fact in has specific purposes that are not relevant for all new databases/apps..

While no one had yet created such a beast, several XML-DEV members were happy to provide guidance on important and relevant factors. This prompted a lengthy thread with a great deal of useful information. Technology evaluations are hard to do in a vacuum, one must have some idea of the requirements and a decision generally involves a trade-off between different factors. This makes it hard to synthesize these kinds of discussion into a decision tree, which would involve rating one factor above another to build the tree structure (or end up in a complex interlinked mess). In summarizing this week's discussion, therefore, the Deviant has decided to use a different approach: smell.

The extreme programmers among you, and anyone familiar with the code refactoring work of Kent Beck and Martin Fowler, will be aware of the concept of code smells. These are code hints that indicate something is wrong or in need of a change; for example, too many method parameters. Code smells, like design patterns, are a condensation of developer experience. Therefore we might apply a similar technique, "requirement smells", to identify the indicators that will help guide you toward the right decision.

There are some caveats to add here. First, these indicators are those deemed important by the XML-DEV community, there may be other issues to consider, e.g. business or budgetary constraints that may skew things one way or another. Second, some recommendations are based on the current state of the database market; as products develop other considerations might become important and distinguishing factors may disappear. Third, specific products have not been considered. Not all database management systems are created equal so there are sure to be exceptions to every rule. There may be roses amongst the manure, and well-perfumed products that hide something, well, nasty.

If you have any comments to add, then share them through the XML.com comment facility (see below) or post to XML-DEV, where they're likely to get a great deal of peer review.

Sniffing Out a Database

Each of the following sections covers a decision factor, including useful opinions and comments extracted from the recent discussion. Some of these overlap, and some may contradict others. Some are concerned with the type of data you're storing, others with how you intend to manipulate that data; some are specific while others are more general. As for terms of art, Native XML Database (NXD) and XML Enabled Database (XED) are defined in the XML:DB FAQ.

Largely Data

The document-data distinction is quite prevalent in XML circles. Data-oriented XML describes information such as purchase transactions or phone book entries, i.e. the kinds of information usually only of interest to an application. Conversely, document-oriented XML, including XHTML docs, is usually of interest to users. Ronald Bourret describes the distinction further in his XML and Database paper. Data oriented XML is generally easier to map into a relational database, although the complexity of the DTD is obviously a factor. This makes an RDBMS an obvious candidate for data storage. An XED may offer features to help automate the transformations to and from XML. It's likely that some development effort will be required to define a relational mapping for your schema, particularly if it is non-trivial.

If you have XML "data" that is easily normalized into RDBMS tables, an RDBMS or XML-enabled RDBMS will probably do at least as good a job as a native XML DBMS. (Michael Champion)

I've seen DTDs with hundreds of elements and figuring out a useful mapping from them to the database is distinctly non-trivial, especially when a lot of the elements are wrappers that don't represent real structure in the database. (Ronald Bourret)

Largely Documents

If what you're storing is largely documents and not largely data, then an NXD may be the best option. However it will depend on the kinds of questions you need to ask about the data. Some document-oriented schemas have obvious metadata which will make a relational mapping easier. It's difficult to cleanly map mixed content to a relational schema, and document-oriented XML often has a varying structure

If you have XML "documents" with mixed content, recursive content models, a complex mix of elements and attributes, and you want to search on the XML structure *and* content, a native XML DBMS will almost certainly be superior. (Michael Champion)

...mixed content doesn't map well with an object-relational mapping. (I won't go into the details here. If you want to read more about this, see sections 3.3 and 3.4 of [Mapping DTDs to Databases] (Ronald Bourret)

It's much easier to maintain collections of XML documents using a native XML DB than to map those documents into a relational database, or even to store them as blobs. (Tom Bradford)

Varying Structure

Relational databases are particularly good for storing highly structured information, and not particularly good at managing semi-structured data. Document-oriented XML usually has a varying structure to allow for the flexibility inherent in prose. Not all semi-structured data is document-oriented however, e.g., bills of materials. While it is is possible to map semi-structured information into a relational model there can be performance overheads, particularly with querying. An NXD or XED with XML-aware text indexing facilities is a good fit for this kind of information.

Semi-structured data is data that has some structure, but is not rigidly structured. An example of semi-structured data is a health record. For example, one patient might have a list of vaccinations, another might have height and weight, another might have a list of operations they have had. Other examples of semi-structured data are legal documents and genealogical records...

Semi-structured data is difficult to store in a relational database because it means you either have many different tables (which means many joins and slow retrieval time) or a single table with many null columns. Semi-structured data is very easy to store as XML and is a good fit for a native XML database.
(Ronald Bourret)

One of the main virtues of (data-centric) XML's data model is its handling of semi-structured data. Relational DBMS's do not handle semi-structures data very well... (Dan Weinreb)

Rigorous Structure

If you're storing XML which has a schema with a fixed structure (i.e. little or no use of mixed content or recursive content models) then a relational database may be the better solution.

Structured data is very rigid. That is, all records have the same fields. An example of structured data is a telephone book -- every entry has a name, an address, and a telephone number. (Ronald Bourret)

...Relational databases are generally a better way to store highly structured data. Then again, most native XML vendors I've talked to say the same thing, although nobody is quite sure where the line is going to be drawn. (Ronald Bourret)

Obvious Metadata

It is a fairly common design pattern in XML schemas to see a head-body structure, with metadata being collected into the head of a document, and the content in the body. In this situation it's often the metadata in the head which needs to be indexed and queried. This pattern is quite easy to map into a relational database or XED, particularly as the metadata rarely makes significant use of mixed content. If your schema(s) conform to this pattern, then this storage option is a quick solution. XML enabling features can be used if you need to manipulate or query the body content. However if you need to ask particularly complex questions of your document content then a native XML database may be the better option.

If you have "document" XML and the schema is easily separated into easily normalizeable metadata and arbitrarily structured text content, and most queries will be on the metadata, an XML-enabled RDBMS that extracts some elements into tables and other elements into CLOBs may be the most appropriate. If you anticipate frequent queries on the actual document structure and content, a native XML DBMS may be more appropriate. (Michael Champion)

Complex Questions

The ways in which information will be queried is an important factor when considering a database technology. If you have data-oriented XML, an RDBMS is likely to be the better option, similarly for XML documents that have obvious metadata. However once you need to use full-text searching, or manipulate a recursive content model, then a NXD or an XED with an appropriate XML query language looks like a better option.

If you need to do queries that exploit the recursive structure of your data, a native XML DBMS will tend to be more appropriate than a "raw" RDBMS. For example, if you need to know which products contain assemblies or sub-assemblies, or sub-sub-assemblies ... that contain a particular part, XPath-based query languages will handle it better than SQL-based languages. (The "bill of materials" query is a well-known challenge for SQL, but a piece of cake in XPath). I guess to be fair, I'd have to say "if you need to do queries that involve joins across metadata in different collections, an RDBMS-based approach will tend to be more appropriate than a native XML DBMS, at least until some flavor of XQuery is widely supported." (Michael Champion)

...SQL has a far richer set of data manipulation operators than XPath (XQuery addresses this to some extent), and is (loosely) based on a general theory of data, and XML is much more ad hoc...if you don't have relatively complex document-like XML, the *searching* features of a native XML DBMS don't buy you much over storing simple XML in an RDBMS. But if you do have "interesting" XML schema, e.g. with recursive elements, simple XPath queries can address problems that seriously challenge SQL experts. (Michael Champion)

While the most popular query [language] is probably XPath (usually with extensions for multi-document queries), numerous other query languages -- all of them proprietary -- are supported as well. This is true of XML-enabled relational databases as well as native XML databases.

There is good reason for this, as XPath is not rich enough to perform many of the queries needed by users and XQuery is not yet finished. I suspect that when XQuery is done, you will see many implementations of it. (Ronald Bourret)

Shared Repository

It's often the case that you have an existing data repository (with the accompanying investment) whose data you'll be manipulating along with your XML data. In these circumstances you may need to consider the benefits of enterprise integration using an NXD. There may also be data integrity issues. It's also a common requirement to be able to share the data with applications which are not XML-aware; this is often true for data-oriented XML, and commonly arises when one is adding an XML view onto an existing database or repository. In these circumstances an RDBMS or XED may offer the best of both worlds.

For many situations, you can get by with a relational database, especially if the majority of the applications drawing from that database may not need to manipulate or view the data as XML. If that is the situation, XML Middleware, or the mapping facilities that many RDBMSes are providing directly these days should suffice. (Tom Bradford)

Data in an XML-enabled relational database is accessible to both XML and non-XML applications. The same is not true of almost all native XML databases. That is, almost all native XML databases do not (yet) support ways to return the data in non-XML format. (Ronald Bourret)

Another issue is whether non-XML applications can use the data without seeing any XML. The answer is yes for XML that is mapped to tables and no for XML stored in a BLOB. (Ronald Bourret)

If you expose data to both existing RDBMS applications and XML applications, you're probably better off leaving it in an RDBMS; XML simply has no good notion of "referential integrity" and that could bite you hard. (Michael Champion)

Existing Investment

As with any technology evaluation existing investment is an important consideration; investment in developing particular skillsets shouldn't be downplayed. For many applications an XML Database isn't required, existing technologies already provide the needed functionality. Just as many database vendors have added object features to their databases, they'll also add XML features. Adopting a XED or even NXD may be as simple as staying on your current platform, and taking advantage of new features as they arrive. This strategy may be suitable in the cases where you are adding XML support to an existing application, and will end up with a shared repository

If you have a heavy investment in an Object-Relational DBMS (Oracle, MS, IBM) and have learned to live with them, their XML add-on support may well be cost effective for you. If you are starting from scratch, the native XML DBMSs are generally cheaper, lighter, and easier to manage than a "universal server" DBMS. (Michael Champion)

...Why should people currently using RDBMSes for their classic key corporate data have any interest in converting to XML as a data model? After all, the relational model basically serves their needs pretty well. There is not currently an industry crisis among RDBMS users caused by deficiencies in the relational model. Furthermore, changing the data model in such systems from relational to anything else would entail a huge amount of work and risk. A vast superstructure has been built around these relational databases: report generators, reports written for those report generators, all kinds of tools and add-ons from the relational database vendors, the fact that lots of people have been trained in concepts of third normal form, the SQL language, and so on and so forth.

...The image of native XML DBMSes trying to supplant relational DBMSes, trying to take them over and replace them, is a strawman, and we need not waste our time thrashing it. (Dan Weinreb)

If in five years the currently dominant RDBMS vendors support automatic XML schema import/conversion tools, seamless XML instance shredding/storage capabilities, and XQuery as a query language, what will distinguish them from "native XML" DBMSes? Oracle 9i already has a "native XML SQL type". "Native XML DBMSes" will be around, although we may not CALL them that... (Michael Champion)

Multiple Schemas, or even No Schema

The effort required to develop a relational mapping for particular document types (particularly one that will satisfy all your query requirements) can be prohibitive if you have to store documents conforming to multiple schemas or even no schema at all (meaning you have to deal with semi-structured data). When schemas are constantly evolving, particularly when this is outside your control, this problem is exacerbated. In these circumstances the flexible storage options of a NXD are a good choice.

If you have a small number of XML DTDs/schema to work with and can afford the time to do detailed analyses of the storage strategies and build data loading programs, the XML-enabled RDBMSes may be satisfactory. If you have lots of different XML datatypes/schema or you have lots of well-formed XML to deal with, the native XML DBMSes make it much easier to build type-specific collections or efficiently store/retrieve well-formed XML. (Michael Champion)

Speaking generally, the native XML DBMS tends to be more flexible in handling data with no schema, or with evolving schema, than solutions based on normalizing XML into RDBMS tables. (Michael Champion)

The various native XML DBMSes, by diverse means, allow well-formed XML to be stored and efficiently queried without demanding rigid adherence to a schema. Some do this by never defining an XML schema at all or by storing well-formed XML that doesn't match a known schema in a "miscellaneous" collection, some by allowing the schema to evolve and re-indexing on the fly or whatever, some by allowing open content models at any point in the schema, or by some combination. (Michael Champion)

Using XML to describe structured data can be very advantageous when there are multiple data sources... If you do not have complete control over the "schema" or shape of the information that comes from the various servers..., using a schema-independent XML database will reduce your development time and ongoing maintenance burden: you will not have to build and map a [set of] relational schemas to represent all of you various ... sources, nor do you have to maintain that schema in the presence of changes. (Bob Lord)

Round-tripping

When mapping an XML document into a relational database (or even an XED if you're not using a BLOB to store the whole document), it is inevitable that some information about the document will be lost. Native XML databases don't suffer from this shortcoming. Therefore if you require that data must retain its original structure, into and out of storage, then an NXD is the best solution. Round-tripping is likely to be an important requirement when performing enterprise integration and one needs to maintain logs of messages exchanged between systems.

If you want the XML representation to reflect a snapshot of reality (e.g., the XML purchase order is a "contract" that is digitally signed for purposes of non-repudiation), and you don't want the XML representation to change (breaking the digital signature!) when the underlying data changes, a native XML DBMS is generally more appropriate. (Michael Champion)

Data Integrity

Referential integrity is at the core of the relational model. However, the XML notion of integrity is less well defined; linking technologies exist but it's often not clear how best these should be employed, and it's likely application code will be required to maintain integrity. If this is a strong requirement, then a relational or XML enabled database may be the better solution.

If you WANT your XML representation to change when an underlying information item changes (e.g., the customer address in a purchase order should change when the address in the customer database changes), RDBMS-based solutions of one sort or another are generally more appropriate. (Michael Champion)

...XML simply has no good notion of "referential integrity" and that could bite you hard. (Michael Champion)

Enterprise Integration

Also in XML-Deviant

The More Things Change

The most common use for, particularly data-oriented, XML is as a means of integration or data interchange between enterprise applications inside and outside the firewall. Middleware has been a growth industry over the last few years, and with the increasing use of XML for this kind of messaging, a NXD becomes a prime candidate for managing the message flow. This is particularly true when messages will not be consumed immediately (and need to be cached), must be logged for legal reasons, or require transaction support.

XML DBMSs *are* solutions for all sorts of interesting problems, especially in a world where XML is the "lingua franca" of electronic commerce. You *can* normalize XML purchase orders that come in from SOAP, Web pages, etc. into 20-30 RDBMS tables, with the help of an army of programmers and DBAs to cope with diversity and evolution... or you can do this much more simply with an XML DBMS. (Michael Champion)

Native XML databases look like a good way to integrate data from a variety of backends, and I think the winning native XML databases of the future will do this transparently and bi-directionally. (Some are already doing it today.) Relational databases may have problems in this area, since the result of integrating data from a variety of sources is likely to be semi-structured data, not structured data. (Ronald Bourret)

I agree very much that integrating data from a variety of backends is one of the promising areas for native XML DBMSes, and that ability to handle semi-structured data is a key benefit.

Some native XML DBMSes are also well-suited to serve the role of a middle-tier persistent transactional distributed cache. (Dan Weinreb)

Whether physical XML or a virtual view, data integration is one of the biggest reasons for XML, and providing database functionality for the XML view is one of the biggest reasons for an XML database. (Jonathan Robie)

...Hide the godawful mess in the back office behind an XML representation cached in an XML DBMS so that other applications don't have to know or care about the voodoo behind the scenes.

...if the godawful mess in the back office is glued to the buzzword-du-jour-MLs out on the wireless web somewhere via some tangled mess of CGI/ASP/JSP/SOAP triggered by diverse XML B2B/B2C/A2A messages, how will you ever keep track of what is really going on from a point of view that management cares about, or to give customers a coherent answer when the widget they ordered from your virtual store didn't show up, but the credit card statement did? Logging the "transient" XML data can help. (Michael Champion)

No doubt there are other factors which could be added to the above list and more opinions that could be added to help weigh one against another. I hope this is a useful starting point for further discussions.