The Quest for an XML Query Standard

March 2, 1999

Last December, the W3C convened a workshop on query languages called QL'98 and invited companies to identify the problems and opportunities involved in creating or adapting a query language capable of handling XML. Lisa Rein has talked to QL'98 organizers and participants to find out how people agree and disagree, and where they might go from here.

If XML has the power to transform the Web itself into a giant, distributed database, how shall we "interrogate" that database?

Well, a lot depends on whether you view the structure of that database as a tree or a table.

SQL (the Structured Query Language) is a well-established and standardized language that is designed for retrieving relational data. This is data stored as a collection of tables in a relational database management system (RDBMS). XML presents a different but not necessarily opposing view. Its data model is a hierarchical tree of elements and attributes. There is increasing interest in developing an XML-aware query language to take advantage of XML's data model while enabling the kinds of applications that SQL provides for databases. An XML query language could make it easier to process large collections of XML documents and extract relevant information. It might also increase the precision of searching in documents because queries could utilize the document's structure. In short, such a query language would be an improvement over current approaches to full-text searching.

In the sections that follow, we are going to look at QL'98 in more detail and discuss the different proposals that were presented there.

What Went On at QL'98

QL'98, the W3C Workshop on Query Languages, brought together over 92 participants from 31 different commercial companies, 7 different research facilities, and 19 academic institutions to discuss what could be done to standardize a query language that understood XML documents in the way that SQL understands a relational database. Reports from numerous participants agree that the workshop activities provided a useful education for its attendees, helping them to understand the many different approaches to solving this problem.

Although only about 50 participants were expected, the response was so great that the W3C had to invoke rules to limit participation to two people for each member company and one for each invited organization. "If the workshop would have been "open call", I think we could have had a really amazing number of participants," explained QL'98 chair, Massimo Marchiori.

Who Came?

The XML "document" crowd (XML/SGML);
Database/Data Management companies;
RDF/Metadata enthusiasts;
Online library/Dublin Core community;
Researchers from universities and many large corporations (GTE Research, AT&T Labs).

"It's been an amazing workshop, for the variety of people attending, the number of incredibly high-quality contributions, and the deep lively discussions," explained Marchiori. One goal of the workshop was to solicit input on whether the W3C should start a new working group to define an XML-based query language. Another goal was to clarify the needs in this area of different W3C working groups like XML, RDF, P3P, XSL, Math, and DOM. Its organizers also wanted to understand the related needs of commercial database companies as well as those whose business relies on query technologies.

"There were many high-quality presentations, coming from very distinct positions," explained XQL co-editor Jonathan Robie (Texcel Research). "Some proposals focused on querying individual documents; others focused on querying repositories or indexed collections. Some proposals focused on documents; others on data taken from sources such as databases."

According to Robie, some of the proposals were more closely based on XML than others. Some suggested using RDF graphs or semi-structured database models. Most proposals suggested a query language, although a few others proposed that full-fledged query protocols (such as the Z39.50 information retrieval protocol) were needed in order to negotiate the details of query processing among clients and servers in a fully distributed, heterogeneous environment.

"It was an amazingly diverse group," remembers IBM's Paul Cotton. "Diverse in the way that each group approached the question the W3C had proposed, as well as diverse in the solutions that each group proposed. After a while, we realized that we were sometimes talking about different data models, and that before we could go any further, we'd have to agree on just what that XML data model was."

A total of 66 position papers were submitted to the workshop (position papers were required for attendance). Many of the papers commented directly on one or more of the other proposals. Although some position papers were rather lengthy (such as Microsoft's Adam Bosworth's), most were only one or two pages. Still others, such as Quark, Inc., submitted a non-position paper explaining just how simple their requirements were.

Others have been pursuing related work as part of RDF or the Dublin Core activities or the Z39.50 information retrieval protocol developed for the library community. Representatives from these communities came to the workshop to help educate other groups about their requirements.

So what were the issues everyone talked about?

Considering XSL Extensions, XQL and Other Proposals

Nearly all of the participating companies saw eye to eye on most of the major issues:

That a query language should take XML in and put XML out (whether as a serialized representation or a DOM node);
That a schema should not be required, but the language should have the ability to take advantage of one, when present;
That they are anxious to get the show on the road so they can move forward with shipping product.

It was also clear that there are many different ways to design a query language. There's even the concern that we don't necessarily need only one standardized means of walking a document tree or looking up items in a table. But it is becoming increasingly important to have a standard query language that works across many different systems.

XSL Extensions

XSL has two very distinct parts: a transformation language, and a formatting vocabulary. The pattern matching facility of XSL's transformation language comes closest to providing the basic functionality required of an XML-based query language. The majority of the proposals presented at QL'98 suggested different ways to extend XSL's pattern language to improve its query capabilities.

XSL allows you to specify which elements should be selected for processing by specifying the path through the document tree used to locate the element. (For example, chapter, section, title.) After the transformation has taken place on the document, a formatting vocabulary can then be applied to the different parts of a document, or not.
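As a rough illustration of selecting elements by their path through the document tree, here is a minimal Python sketch. It relies on the limited path syntax of the standard library's xml.etree.ElementTree rather than on XSL's own pattern language, and the sample document and its element names (book, chapter, section, title) are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are assumptions for illustration.
SAMPLE = """
<book>
  <title>Book Title</title>
  <chapter>
    <title>Chapter Title</title>
    <section><title>First Section</title></section>
    <section><title>Second Section</title></section>
  </chapter>
</book>
"""

def select_titles(xml_text, path):
    """Return the text of every element reached by following `path` from the root."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(path)]

# Only titles reached via chapter, then section, are selected;
# the book title and the chapter title are not.
section_titles = select_titles(SAMPLE, "chapter/section/title")
print(section_titles)
```

The same element name (title) appears at three levels of the tree, so it is the path, not the name alone, that decides what is selected.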

Whether XSL's pattern language should be extended to enhance its ability to handle queries remains a topic of controversy. The general consensus is that its "selection" and "pattern matching" facilities (which provide the ability to locate specific elements or text within elements) are not sufficient. More importantly, XSL is only built for accessing and transforming one document at a time.

"XSL style sheets provide a mechanism for querying and transforming single documents, one at a time, using DOM or SAX, but we need to be able to query more than one document at a time," explains Paul Cotton, one of IBM's representatives to the W3C XML activity. "From our point of view, we need a language that will work in an environment where you have multiple instances of XML documents. In this scenario, as you can imagine, scalability is very important to your design."

The controversy is not over whether or not XSL should be extended, but how, when, and by whom.

XQL Proposal

XML Query Language (XQL) was one of the workshop's position papers that received a lot of attention, largely due to Microsoft's evangelistic efforts. XQL extends XSL's "querying" facility: its pattern language. Jonathan Robie of Texcel Research presented the XQL proposal, which was co-authored by webMethods and Microsoft.

"XQL was designed to allow efficient implementation for large collections of documents, using indexing strategies similar to those used for full-text search engines," explains Robie. "Scalability was a concern from the beginning. Also, XQL queries extend only the patterns of XSL, not the entire XSL language, so they are much simpler to write than XSL stylesheets. XQL queries can be used in many environments where an entire XSL stylesheet would be inappropriate - e.g., to embed queries in attributes, as simple strings in programming languages, or when queries must be typed by hand."

Other companies, such as webMethods, Inc., who co-authored the XQL proposal and another position paper with Microsoft and Texcel Research, seemed to have very different needs. As webMethods' Joe Lapp explains, his company is more interested in querying single XML documents than in querying document repositories.

"The fact that we can get at XML data using queries inclines us to put XML in many more places. We don't have to write lengthy code that walks XML trees via DOM APIs since we need only construct a short and readable query," explains Lapp.

The XSL Working Group submitted a position paper pleading the case for keeping such query efforts within the existing XSL Working Group, and additionally suggesting the formation of an official "coordination group" to "take responsibility for coordinating query requirements with other working groups." The paper also warned of the possibility of fragmentation between implementations if developers are given too many choices, and inevitably, too many incompatible choices.

Others felt a bit uncomfortable with the XSL pattern language as a general-purpose query language. "I think we should start with something dramatically simpler, and I know I'm not alone in that viewpoint," explains Tim Bray, XML Recommendation co-editor.

During the XQL presentation, Microsoft's Adam Bosworth made a point of clarifying that he doesn't even consider XQL to be a full-fledged query language.

"A query language, as far as I am concerned, handles sorting, shaping, relating, and in general taking any XML in and generating any custom XML shape out. This isn't what XQL does," explains Bosworth. "XQL merely offers a model for asking for specific sets of elements. That is fine as far as it goes, but it doesn't go very far. It is like saying that SQL would be a query language if it just had "FROM", but no ORDER-BY or SELECT. I think the XQL folks are trying to generalize path expressions to be a full query language, and I think this is a mistake. Query languages need other constructs than those that describe interesting elements to process. They need to say what to do with them (e.g. order them, extract important elements from them, sum them, ...). I'm a huge fan of rich path expressions. I don't consider them a query language, just a useful part of one."

Element Sets Proposal

Tim Bray submitted a proposal entitled "Element Sets." Although the technique the paper discusses has significantly less functionality compared to several of the other proposals, Bray contends that his "element sets" solution has already been used to solve many real-world business problems, while introducing much less complexity.

Unlike many of the proposals that stressed the importance of using parent-child and sibling relationships, Bray's "Element Sets" proposal provided an alternative technique for achieving the same kinds of functionality.

For example:

Element sets can't do parent-child (find the 3rd child of element x), but they can do ancestor-descendant.

find me all the <procedures> that contain a <step> which contains the phrase "fuel-injection."

Element sets can't locate siblings, but they can work with preceding-following relationships.

find me all the <steps> in <procedures> which contain a reference to www.whitehouse.gov and are preceded by a reference to www.ken-starr.org in the same <procedure>.
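The first kind of query above, ancestor-descendant containment, can be sketched in Python as follows. This is not Bray's implementation of element sets, only a minimal illustration using the standard library's xml.etree.ElementTree; the sample manual, the id attributes, and the substeps wrapper element are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical maintenance manual; structure and ids are assumptions.
MANUAL = """
<manual>
  <procedure id="p1">
    <step>Disconnect the battery.</step>
    <substeps><step>Inspect the fuel-injection system.</step></substeps>
  </procedure>
  <procedure id="p2">
    <step>Check the tire pressure.</step>
  </procedure>
</manual>
"""

def procedures_mentioning(xml_text, phrase):
    """Return ids of procedures with a descendant step whose text contains phrase."""
    root = ET.fromstring(xml_text)
    matches = []
    for procedure in root.iter("procedure"):
        # iter() visits descendants at any depth, so the step does not
        # have to be a direct child of the procedure.
        steps = procedure.iter("step")
        if any(phrase in "".join(step.itertext()) for step in steps):
            matches.append(procedure.get("id"))
    return matches

matching_ids = procedures_mentioning(MANUAL, "fuel-injection")
print(matching_ids)
```

The matching step sits inside a wrapper element rather than directly under its procedure, which is the case a strict parent-child test would miss and an ancestor-descendant test handles without extra work.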

"This is an awfully abstract point," admits Bray, "but one that I thin k is central: ancestor-descendent and precede-following solve the same set of problems that parent-child and left sibling-right sibling solve, and are way easier to implement efficiently."

Database Vendors Weigh In

The greatest concern among database vendors appeared to be a loss of functionality as the cost of incorporating XML into existing systems. Many suggest that the big DB vendors, and even some of the XML repository companies, might actually want a complex all-powerful syntax. This approach might work best to protect their interests from alternative approaches.

"If developers pay $30,000 per seat and have to go to week-long training, why bother making the language simple?" expressed a developer who preferred to remain anonymous. This viewpoint conflicts with XML's golden rule of keeping XML technologies relatively simple and accessible. It also doesn't jibe with the approach that almost every database company took in their position paper: to keep it simple.

One recurring issue seemed to be whether or not formatting properties should be "built-in" to the querying mechanism or kept separate (in the interests of interoperability). For some, it isn't hard deciding on these issues; a query language should be just about querying. Agranat Systems, for instance, took the time to write a less-than-one-page position paper in order to state specifically that it feels an XML Query Language should not "govern the formatting of a query result".

Oracle's paper stated that any query language should have an underlying algebra and provide support for XML data types. It should also have the ability to query multiple documents with a single query. Other features on Oracle's wish list, such as the embedding of SQL statements or the ability to query other data types besides XML, might be considered by some to be "extra" functions that don't need to be "built-in" to XML's query language.

Interestingly enough, Oracle's only "non-requirement" was defining a "user-friendly search engine kind of query language, since there is unlikely to be early consensus on exactly what results should be returned." And indeed, it would seem that most of the points of disagreement do center around the form in which results are returned. (Or as Oracle put it: "what kind of expression is permitted in specifying the <result>" Oracle's position is that the <result> may be an expression of any type.)

Adam Bosworth authored Microsoft's official position paper, stating clearly that the goals of the paper were "to make more concrete the very large list of work items that the W3C needs to address, and to motivate the W3C to kick off a working group to start doing just that." Bosworth also stressed that the language shouldn't be "too hard to use, too verbose to enter, or too hard to teach."

"It only makes sense that Microsoft would be much more focused on usability because they know it's one of the key determinants of technology adoption," explains Simeon Simeonov, Allaire Corporation's Manager of Language Technology. "The syntax needs to be simple to learn because developers will be writing queries by hand, despite the fact that people have been promising all-visual SQL query builders for more than a decade."

"The only concern I have is, if XSL will focus on generating XML/HTML from XML. We need a way of generating any target format, e.g. RTF or CSV," explains Ralf Westphal, editor-in-chief of BasicPro, the German VB magazine, and an XML Consultant in Hamburg, Germany. "It should be made easy to insert target language fragments of any kind into XSL, not just XML-elements."

However, other developers disagree with this approach. "The XSL FO (formatting object) tree generation allows for the back-end creation of alternate binary formats," explains Dr. Jonathan Borden, an XML consultant and neurosurgeon at New England Medical Center in Boston, Massachusetts. "Unfortunately, XSL FO's aren't implemented by Microsoft or James Clark or IBM because the interest is only in generating XML and HTML from XML."

"What this all means is that the query processor implementations will have to find a way to deal with relational datastores, even if this requires a bridge/adapter of sorts, because most of enterprises' data is in relational DBMS's," explains Simeonov. "This may impose design constraints on the language, but we'd better decide on something soon since vendors are coming up with their own micro-languages already."

"The SQL standardization was possible because the participants had agreed to build a language based on their common understanding of the relational model," remembers IBM's Cotton. "One of the major conclusions of the QL'98 workshop was that the participants needed to agree upon a common data model for XML that could then be used to support a query facility."

SIDEBAR: Looking at Trees and Tables

Consider how an invoice might appear as a relational database and as an XML document. In XML terms, a simple invoice might consist of an invoice element with an ID attribute and a series of one or more line item elements. In XML, the connection is implicit: an invoice element contains its line item elements.

In a relational database, you'd have a table for invoices and a table for line items. A "join" is used to make the connection between a single invoice and a set of line items.
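To make the relational side concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (invoices, line_items, invoice_number) are assumptions chosen for illustration, and the values are borrowed from this sidebar's invoice example.

```python
import sqlite3

# In-memory database; table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (number TEXT PRIMARY KEY);
    CREATE TABLE line_items (
        invoice_number TEXT REFERENCES invoices(number),
        partnumber     TEXT,
        price          INTEGER,
        qty            INTEGER
    );
    INSERT INTO invoices VALUES ('990302');
    INSERT INTO line_items VALUES ('990302', '1234', 300, 2);
    INSERT INTO line_items VALUES ('990302', '9923', 400, 1);
""")

# The join makes explicit the connection that XML expresses by containment.
rows = conn.execute(
    """
    SELECT i.number, l.partnumber, l.price, l.qty
    FROM invoices AS i
    JOIN line_items AS l ON l.invoice_number = i.number
    WHERE i.number = ?
    ORDER BY l.partnumber
    """,
    ("990302",),
).fetchall()
conn.close()
print(rows)
```

Nothing in the line_items table itself says which rows belong together; the grouping exists only through the shared invoice number and the join that uses it.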

Here's an XML fragment:


<invoice number="990302">
        <lineitems>
                <item><partnumber>1234</partnumber><price>300</price><qty>2</qty></item>
                <item><partnumber>9923</partnumber><price>400</price><qty>1</qty></item>
        </lineitems>
        <sum>1000</sum>
</invoice>

This could be represented in a table as follows:

	partnumber	price	qty
item	1234	300	2
item	9923	400	1

We'd like to be able to regard the document and the record known by its ID number as equivalent.

However, mapping from tables to document trees and back is no trivial task. There is information in the document (the sum) that has not been carried over into the database table. Likewise, the fact that the "300" value in the price column stands for $300.00 (a "currency datatype") is an example of information held in the table structure of the database that is not represented in our tree-structured example.
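A small Python sketch shows where the two views drift apart. It flattens the invoice fragment into table-like rows using the standard library's xml.etree.ElementTree; the function name invoice_to_rows and the decision to treat price and qty as integers are assumptions, since the XML itself carries only text.

```python
import xml.etree.ElementTree as ET

# The invoice fragment from this sidebar.
INVOICE = """
<invoice number="990302">
  <lineitems>
    <item><partnumber>1234</partnumber><price>300</price><qty>2</qty></item>
    <item><partnumber>9923</partnumber><price>400</price><qty>1</qty></item>
  </lineitems>
  <sum>1000</sum>
</invoice>
"""

def invoice_to_rows(xml_text):
    """Flatten each item element into a row keyed by the invoice number."""
    root = ET.fromstring(xml_text)
    number = root.get("number")
    rows = []
    for item in root.iter("item"):
        rows.append({
            "invoice": number,
            "partnumber": item.findtext("partnumber"),
            # The datatype is our assumption; the document holds plain text.
            "price": int(item.findtext("price")),
            "qty": int(item.findtext("qty")),
        })
    return rows

rows = invoice_to_rows(INVOICE)

# The sum element has no column to land in, so it is recomputed here
# and compared with the value stored in the document.
computed_sum = sum(row["price"] * row["qty"] for row in rows)
stored_sum = int(ET.fromstring(INVOICE).findtext("sum"))
print(rows)
print(computed_sum == stored_sum)
```

Going from tree to rows drops the sum unless it is stored somewhere else, and going the other way would drop whatever datatype the database had declared for price.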

A unified data model will help to bridge the gaps between these different kinds of data structures.

Wrapping Up QL'98

"Many of the groups who attended the workshop had not even been aware of the work being done by other groups, so this workshop provided a wonderful setting for cross-fertilization," explains Jonathan Robie. "One of the real accomplishments of the workshop was simply to introduce people to each other!"

"There was a big melting pot out there, and the workshop just came at the right moment," explains Massimo Marchiori. "The difficult part that we are tackling now is just to state clearly and promptly (the market doesn't wait) what the future actions of W3C in this area will be."

In terms of next steps, the W3C must decide whether it will simply incorporate feedback into the existing XML working group activity, or whether a separate Query Language Working Group or XML Query Language Working Group will be formed.

One of the difficulties is the somewhat delicate placement of the work within the W3C's architecture, since so many different groups will depend heavily on its deliverables. At this point, the W3C is gathering input from its members as to how the query effort should proceed.

However, looking back, the workshop appears to have already been a success as a catalyst for bringing together the various database, document, metadata, and knowledge representation communities into a single forum. It seems like a clear expression of potentially converging interests.

"Searching the web is essential for a variety of other applications," explains the QL'98 Chair Marchiori. "These search technologies will provide the basis for a coherent model of organizing and reasoning on web data: once there, imagination is our only limit."

QL'98 Links

If you want to learn more about individual proposals, use the guide below to find key papers presented at the workshop as well as more information about the organizations that sent representatives.

Some of the Key Papers Discussed At the Event

David Maier (Oregon Graduate Institute): "Database Desiderata for an XML Query Language"

Jonathan Robie (Texcel), Joe Lapp (webMethods Inc.), David Schach (Microsoft): "XML Query Language (XQL)"

Alin Deutsch (University of Pennsylvania), Mary Fernandez (AT&T Labs), Daniela Florescu (INRIA), Alon Levy (University of Washington), Dan Suciu (AT&T Labs): "XML-QL"

W3C Math Working Group: "The Query Language Position Paper of the Math Working Group"

W3C XSL Working Group: "The Query Language Position Paper of the XSL Working Group"

List of Companies Who Attended the Query98 workshop

Academic Institutions Who Participated

Research Facilities

DSTC
Lawrence Berkeley National Laboratory
Library of Congress
MITRE Corporation, (2)
NASA
National Library of Medicine
OCLC (Online Computer Library Center, Inc.), (2)
US Geological Survey, (2)