
Storing and Querying

April 5, 2000

Leigh Dodds

As the number of real-world applications that manipulate XML data increases, developers are beginning to consider issues like persistence and scalability. This week XML-DEV has been discussing persistent storage APIs, query mechanisms, and search engines.

XML and Databases

Start integrating XML into any "serious" application and before long you'll find that certain questions need answering. How do you store an XML document in a database? How do you subsequently retrieve it? How is the XML stored within the database?

These questions have been asked before, in the context of mapping object-oriented data structures onto relational databases. Then, as now, the answers depended on proprietary solutions.

Gopinath M.R., summarizing the current state of affairs, suggested that there's a need for a standard API for storing XML.

... if no standard APIs are defined for storing and retrieving [an] XML document (DOM tree) in storage engine, it will end up with everyone having their own way of storing XML documents. If somebody wants to switch over from one database vendor (or product providing XML native storage) to another, it will not be easy.

Gopinath also observed that there are big differences between providing an XML interface to an existing database, storing XML data as large chunks of text, and storing a DOM tree in its "native" hierarchical format. Most vendors are pursuing the first two mechanisms to leverage the features of existing database engines.

Ingo Macherius, in an excellent summary of the current issues, commented that while some vendors are building native XML storage systems,

...there is little experience in doing so. Thus they end up with vendor specific solutions. When experience increases how to build such systems, standardization can take place in second or third generation systems. It's probably too early now.

Macherius laid out a suggested roadmap for the development of a standard XML database architecture, stressing the importance of a well-designed query language:

The XPath/XQL/URL/XPointer community is used to /slash/separated/syntax, while SQL/OQL/DB-research people tend to use some variation of SELECT-FROM-WHERE style. A QL is the prime "API" for a DB, not method calls. Thus the query language is (in my opinion) the most important short term goal.

XML will benefit greatly from a query language, just as the relational world benefited from SQL (particularly its algebraic basis). Multi-database APIs like ODBC/JDBC were the next step forward for relational databases. The XML community could certainly benefit by following suit.

(The interested reader should look at Ronald Bourret's "XML and Databases," the definitive paper in this area.)

Query Languages

Defining query languages for markup is not a new science. Tim Bray observed that, because search engines for SGML have been available since 1991,

... there is a considerable body of experience as to what such an API might be like. For some reason, little of it is reflected in XPath. W3C held a workshop on the notion of a query language for XML back in 1998, the proceedings are at http://www.w3.org/TandS/QL/QL98/, there are tons of position papers.

Bray's own paper, "Element Sets: A Minimal Basis for an XML Query Engine," provides an elegant illustration of how simple a query language can be. Michael Rossi agreed with Bray, and believed that the industry needs a robust query language:

A real query language for XML has been under consideration for years ... and is still necessary - XPath (wonderful thing that it is) isn't going to cut it for robust query capabilities across loads of data. The expression of interest in the last query workshop should be evidence enough of industry's desire to see this happen.

The XML Query Working Group recently published a Working Draft of the requirements for an XML query language. The draft includes scenarios describing how a query language may be put to use in different environments. These include querying native XML repositories, as well as the Document Object Model. This latter use case was explored further on XML-DEV.

Querying the DOM

Currently there is no standard way to query a DOM tree using an XPath expression. However, some parsers and utilities do provide this capability. Didier Martin observed that an XPath query extension is extremely useful:

I made several experiments in Didier's labs with the Microsoft DOM extension that allows [the return of] a node-set from an XPath expression and found it tremendously useful. In the Microsoft universe it is simply stated as nodeset= selectNodes(XPath expression).

The issue was also discussed in a thread last year (see "XPath and DOM"), and the feature has been deemed useful by many developers. It would certainly simplify a lot of application code that uses the DOM API. This is another instance, like the XSLT extension functions discussed last week, where real-world experience has highlighted desirable features for standardization.
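To make the idea concrete, here is a minimal sketch of what an XPath query over a DOM tree might look like from Java. It assumes the `javax.xml.xpath` package that later shipped with the Java platform, which is not the Microsoft extension Martin describes. The class `XPathDomQuery` and its `selectNodes` and `textOf` helpers are hypothetical names chosen for illustration, as is the sample catalog document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of any DOM specification.
final class XPathDomQuery {

    // Parse an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Illustrative stand-in for a selectNodes-style extension:
    // evaluate an XPath expression against a context node and
    // return the matching nodes as a NodeList.
    static NodeList selectNodes(Node context, String xpathExpression) {
        try {
            return (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpathExpression, context, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath: " + xpathExpression, e);
        }
    }

    // Collect the text content of each node in the list.
    static List<String> textOf(NodeList nodes) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) {
        Document doc = parse(
              "<catalog>"
            + "<book year='1999'><title>SGML Search</title></book>"
            + "<book year='2000'><title>Querying XML</title></book>"
            + "</catalog>");
        // One expression replaces a hand-written walk over child nodes.
        System.out.println(textOf(selectNodes(doc, "/catalog/book[@year='2000']/title")));
    }
}
```

The single path expression does the work that would otherwise need nested loops over `getChildNodes()` with element-name and attribute checks at each level, which is the simplification developers on the list were asking for.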

Addition of a single method, or an additional interface, would support this feature, whilst laying the foundations for future query language standardisation. Martin suggested an example Node method of the form:


public org.w3c.dom.NodeList selectNodes(String queryType, String queryExpression);

This provides an extension mechanism supporting multiple query types (XPath, XQL, or some successor defined by the XML Query Working Group). Some alternative suggestions were made in the discussion last year.
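One way to picture the extension mechanism is a small dispatcher that routes each call to whichever engine is registered for the given query type. The sketch below is an assumption about how such a method could be implemented, not Martin's design: `QueryDispatcher`, `QueryEngine`, `register`, and `withXPath` are hypothetical names, and the only engine wired in uses the `javax.xml.xpath` package that later shipped with the Java platform.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical dispatcher; illustrates the queryType parameter only.
final class QueryDispatcher {

    // Hypothetical plug-in point: one implementation per query language.
    interface QueryEngine {
        NodeList select(Node context, String queryExpression);
    }

    private final Map<String, QueryEngine> engines = new HashMap<>();

    // Associate a query type name (for example "XPath") with an engine.
    void register(String queryType, QueryEngine engine) {
        engines.put(queryType, engine);
    }

    // Mirrors the suggested method shape, with the context node passed
    // explicitly because this class is not itself a DOM Node.
    NodeList selectNodes(Node context, String queryType, String queryExpression) {
        QueryEngine engine = engines.get(queryType);
        if (engine == null) {
            throw new IllegalArgumentException("unsupported query type: " + queryType);
        }
        return engine.select(context, queryExpression);
    }

    // Build a dispatcher that understands XPath and nothing else.
    static QueryDispatcher withXPath() {
        QueryDispatcher dispatcher = new QueryDispatcher();
        dispatcher.register("XPath", (context, expression) -> {
            try {
                return (NodeList) XPathFactory.newInstance()
                        .newXPath()
                        .evaluate(expression, context, XPathConstants.NODESET);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("invalid XPath: " + expression, e);
            }
        });
        return dispatcher;
    }

    // Parse an XML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        QueryDispatcher dispatcher = withXPath();
        Document doc = parse("<list><item/><item/><item/></list>");
        System.out.println(dispatcher.selectNodes(doc, "XPath", "//item").getLength());
    }
}
```

Under this arrangement, supporting a future query language means registering another engine under a new type name; callers of `selectNodes` keep the same method signature.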

Lauren Wood invited developers to present use cases for the addition of an XPath-based query mechanism for DOM Level 3.

Now is the ideal time to ask for things to be added to Level 3, since the DOM WG and IG is starting work on Level 3. So please post suggestions (preferably with use cases, so it's easy to understand exactly what's needed) to the public DOM mailing list.

The requirements for DOM Level 3 are due to be published later this month, which doesn't leave much time to submit comments. A full query language may be some way off, but some short-term wins remain available.