Against the Grain
July 5, 2001
This week the Deviant summarizes some of the comments made by XML-DEV members in response to a recent critical article on the relationship between XML and databases.
It's been an interesting week on XML-DEV with one particular topic being hotly debated. Namespaces? Relative URIs? No. The debate revolved around a single question: "What is the correct plural form of 'schema'? Is it 'schemas' or 'schemata'?". This important topic involved many learned postings discussing the minutiae of the plural forms of endless Latin and Greek words. Not qualified to make a judgment on this vital question, not knowing one end of a neuter plural perfect passive participle from another, the Deviant this week takes a look at another debate entirely.
Ken North shared a pointer to a recent article by Fabian Pascal which attacked XML as a means of describing data, XML databases, and indeed pretty much anything (Java, object relational databases, even SQL) other than a pure relational model housed in a central database. Not surprisingly, this prompted some feedback from XML-DEV members.
As Jonathan Robie noted, Fabian appears to argue himself in circles:
..Fabian makes an argument that should lead to the conclusion that XML databases are an important thing to pursue - his central claim is that XML needs a database behind it!
Michael Champion was prompted to wonder why, if the relational model is so perfect, DBMS vendors are adding "post-relational" features to their products, and whether there's a sweet spot between the mathematical rigor of the relational model and the flexibility of XML.
My biggest question after reading his stuff is "If the pure relational model is so powerful, why have the RDBMS vendors, presumably driven by customer demand, supported 'post-relational' Object-Relational and XML features in their recent releases?" I personally doubt if "ignorance" is the answer.
I keep hoping that there is some middle ground where the rigorous mathematics of the relational model and the pragmatic usability of XML can meet and inform one another. In private correspondence, Mr. Pascal assured me that a truly mathematical model of XML is impossible, but I'm keeping an open mind.
Presumably these features are being added because customers are keen to use their data in different ways; for example, in closer conjunction with business objects or to store different kinds of data that don't fit cleanly into a relational system. Documents are an obvious example, and the Web is a gold mine of semi-structured data just begging to be usefully manipulated. Much of the XML database and query work is geared toward exploiting this information. And as Joshua Allen observed, while relational databases have been steadily optimized for many years, research on semi-structured data is only now becoming mainstream.
The only reason that RDBMS software dominates the market right now is because we are good at solving these problems, and RDBMS design has evolved to disallow users from asking questions that the database isn't good at answering. The fact that we ship databases that only permit things that we know how to answer efficiently does NOT imply that we will never be able to answer other questions more efficiently (in fact, RDBMS systems have evolved and gobbled up much of the research on data warehousing to include those techniques into the engines -- witness materialized views and bitmapped indexes). It is quite easy to see a trend in the industry that shows consistent continual progress at solving hard query problems. Of course some problems will always be hard (distributed cost-based query optimization is one), but I would point out that research on RDBMS optimizations has tapered off quite a bit and we have seen major increases in research geared toward semi-structured data in the past decade. So we are simply easing off on some of the traditional RDBMS constraints and beginning to allow things like recursive self-joins, ragged hierarchies, etc. and we are optimizing these things.
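The "recursive self-joins" and "ragged hierarchies" Allen mentions can be made concrete with a small sketch. It assumes a hypothetical parts table whose branches have unequal depth, and it uses SQLite's recursive common table expressions through Python's standard sqlite3 module. The table name, column names, and data are invented for illustration and do not come from the thread.

```python
import sqlite3

# Hypothetical parts hierarchy with branches of unequal depth ("ragged").
EDGES = [
    ("bike", None),
    ("frame", "bike"),
    ("wheel", "bike"),
    ("tire", "wheel"),
    ("valve", "tire"),
]


def descendants(conn, root):
    """Return (name, depth) for every node below root via a recursive self-join."""
    rows = conn.execute(
        "WITH RECURSIVE sub(name, depth) AS ("
        "  SELECT name, 0 FROM part WHERE name = ?"
        "  UNION ALL"
        "  SELECT p.name, sub.depth + 1 FROM part AS p"
        "  JOIN sub ON p.parent = sub.name"
        ") SELECT name, depth FROM sub WHERE depth > 0 ORDER BY depth, name",
        (root,),
    )
    return list(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE part (name TEXT PRIMARY KEY, parent TEXT REFERENCES part(name))"
)
conn.executemany("INSERT INTO part VALUES (?, ?)", EDGES)
print(descendants(conn, "bike"))
```

The query joins the table to itself once per level until no more children are found, so it does not need to know in advance how deep any branch goes.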
Allen also seemed certain that the mathematics of graph theory, the underpinnings of semi-structured (hence XML) data, would bring dividends.
...I think that areas of discrete mathematics that deal with graphs are currently the most vibrant area of research in the industry. The web itself is one huge graph structure, and research on ways to index the web, optimize routing, etc. all feed directly into techniques for optimizing XML processing....
Indeed one may find it hard to criticize the current XML Query efforts, which are defining the algebraic underpinning for querying XML data sources. If this formal work were not being carried out, Mr Pascal's claims might make more sense. How else will advances happen if the basic research is not carried out? In a later message Joshua Allen painted an interesting picture of "the honkin' graph" that is the Internet.
...The web is a graph. XML is the web made just a bit less sloppy, but we still have key/keyref and XLink, XPointer, RDF -- all that stuff John mentions. Take the graph that is the web and make it more machine-readable. Take all of the services and data in silos at the edges of the web and expose it as XML documents (as appropriate of course). Now you have one big huge honkin' graph. What is more fun that that?
It's hard to reconcile this image with Fabian Pascal's vision of a centralized DBMS.
Not everyone was happy with the current state of XML databases; indeed one contributor called them "snake oil". Yet this perception seems to be more the fault of hype and marketing than of any technological shortcoming. There is still a great deal of work to be done. Unfortunately it appears that demand is outstripping supply. Speaking again to this topic, Joshua Allen predicted good things from native XML databases, but recommended fully understanding your requirements before committing to any single product.
...there are good reasons to use a pure native approach for XML. The "native XML" people will be able to show you blazingly fast queries over massive data stores that would make an RDBMS croak. The "XML-adaptor" people will show you queries against *their* XML that run blazingly fast but make a "native" engine croak. The moral is that there is no "one true way" at this point, and both models will converge. I think it's a teeny bit unfair to call XML databases "snake oil"; instead think of 2001-XMLDB as 1980-SQL. As "native" databases evolve to support traditional relational-type stuff better and relational-XML adapters evolve to support things that native implementations excel at, the distinction will become irrelevant and the code bases will be pretty much the same. In the meantime, using XML databases means having a good understanding of your use cases, needs, etc. and evaluating each product individually.
Nichola Lehuen agreed that understanding your application's data and choosing the appropriate modeling technology would bring benefits, although Lehuen was ultimately less effusive about potential benefits.
...for any given data to model, you can find a hierarchical (e.g. XML) representation, a network representation (the node-labeled graph model), a relational representation, an object representation, or more exotic representations (e.g. the Caché model). But depending on your data, one of these models will rise out as the "best" one, in terms of ease of implementation and of efficiency in queries and updates.
So I believe there is a whole set of problems that will benefit from XML databases...The storage, indexation and querying of a set of document-oriented data is a good example.
But XML databases isn't or (won't) be a revolution, blasting all other storage models. We could even say that the XML database model is just a come back of the hierarchical model that was supposedly "killed" by the relational model back in the 80s. I don't think XML databases are the "next thing".
Other messages in this thread picked up on the data modeling issue. Jeff Lowery observed that at the moment constructing an efficient relational model for an XML structure is more art than science.
I think object-relational databases have some promise. Knowing how to decompose an XML hierarchy just enough to result in an efficient relation model is more of an art than a science right now: I don't think you get much benefit if all parent-child relations are rigorously broken down into primary/FK pairs, for instance. Knowing how the data will be fetched is the main design criteria of an object-relational model, with performance gains for 'fixed' fetches coming at the cost of degrading some ad-hoc queries (adding an XPath-based index for complex XML elements stored in columns might speed up finding, but not fetching).
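To illustrate the fully decomposed end of the spectrum Lowery describes, here is a minimal sketch that stores every element of a small XML document as one row with a foreign key to its parent. It uses only Python's standard xml.etree.ElementTree and sqlite3 modules. The sample document, the node table, and the function names are assumptions made up for this example, and attributes are ignored to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
DOC = """
<order id="42">
  <customer>Ada</customer>
  <items>
    <item sku="A1">Widget</item>
    <item sku="B2">Gadget</item>
  </items>
</order>
"""


def shred(root, conn):
    """Store every element as one row with a foreign key to its parent."""
    conn.execute(
        "CREATE TABLE node ("
        " id INTEGER PRIMARY KEY,"
        " parent_id INTEGER REFERENCES node(id),"
        " tag TEXT NOT NULL,"
        " text TEXT)"
    )

    def visit(elem, parent_id):
        text = (elem.text or "").strip() or None
        cur = conn.execute(
            "INSERT INTO node (parent_id, tag, text) VALUES (?, ?, ?)",
            (parent_id, elem.tag, text),
        )
        for child in elem:
            visit(child, cur.lastrowid)

    visit(root, None)


def item_texts(conn):
    """Fetch order/items/item text: one self-join per level of the hierarchy."""
    rows = conn.execute(
        "SELECT item.text FROM node AS item"
        " JOIN node AS items ON item.parent_id = items.id"
        " JOIN node AS ord ON items.parent_id = ord.id"
        " WHERE item.tag = 'item' AND items.tag = 'items' AND ord.tag = 'order'"
        " ORDER BY item.id"
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
shred(ET.fromstring(DOC), conn)
print(item_texts(conn))
```

Even this three-level path needs two joins to reassemble, which hints at why breaking every parent-child relation into primary/FK pairs may bring little benefit, and why a design might instead keep some subtrees intact.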
Extending the capabilities of database management systems to facilitate the move from art to science can only be a good thing. At least, it is difficult to see how it could be a bad thing. It also seems obvious that building this work using formal models is smart, which is exactly what the XML Query work is doing. Only this time the model and the query syntax are being developed hand in hand. Unlike relational theory and SQL, we should hopefully have a standard XML query language very shortly. One might also hope that this would limit mismatches between the two, which seems to be the case for pure relational models and those expressed by SQL. Promisingly, just this week two early implementations have been announced, which means developers can finally begin to come to grips with this new technology.