Considering XSL Extensions, XQL and Other Proposals
March 2, 1999
Nearly all of the participating companies saw eye to eye on most of the major issues:
- That a query language should take XML in and put XML out (whether as a serialized representation or a DOM node);
- That a schema should not be required, but the language should have the ability to take advantage of one, when present;
- That they are anxious to get the show on the road so they can move forward with shipping product.
It was also clear that there are many different ways to design a query language. There's even the concern that we don't necessarily need only one standardized means of walking a document tree or looking up items in a table. But it is becoming increasingly important to have a standard query language that works across many different systems.
XSL has two very distinct parts: a transformation language, and a formatting vocabulary. The pattern matching facility of XSL's transformation language comes closest to providing the basic functionality required of an XML-based query language. The majority of the proposals presented at QL'98 suggested different ways to extend XSL's pattern language to improve its query capabilities.
XSL lets you select elements for processing by specifying the path through the document tree that locates them (for example, chapter, section, title). After the document has been transformed, a formatting vocabulary can optionally be applied to its different parts.
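As a rough illustration of path-based selection, the sketch below uses Python's standard xml.etree.ElementTree module. Its path syntax is a limited XPath subset, not the XSL pattern syntax discussed at the workshop, and the sample document and the `select_titles` helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names follow the
# chapter/section/title example in the text.
DOC = """<book>
  <chapter><title>Chapter One</title>
    <section><title>First Section</title></section>
    <section><title>Second Section</title></section>
  </chapter>
  <chapter><title>Chapter Two</title>
    <section><title>Third Section</title></section>
  </chapter>
</book>"""


def select_titles(xml_text, path):
    """Return the text of every element reached by `path` from the root."""
    root = ET.fromstring(xml_text)
    # findall() follows the path one child step at a time and returns
    # the matches in document order.
    return [element.text for element in root.findall(path)]


if __name__ == "__main__":
    print(select_titles(DOC, "chapter/section/title"))
    print(select_titles(DOC, "chapter/title"))
```

The same document yields section titles or chapter titles depending only on the path, which is the kind of selection the pattern language provides.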
Whether XSL's pattern language should be extended to enhance its ability to handle queries remains a topic of controversy. The general consensus is that its "selection" and "pattern matching" facilities (which provide the ability to locate specific elements or text within elements) are not sufficient. More importantly, XSL is only built for accessing and transforming one document at a time.
"XSL style sheets provide a mechanism for querying and transforming single documents, one at a time, using DOM or SAX, but we need to be able to query more than one document at a time," explains Paul Cotton, one of IBM's representatives to the W3C XML activity. "From our point of view, we need a language that will work in an environment where you have multiple instances of XML documents. In this scenario, as you can imagine, scalability is very important to your design."
The controversy is not over whether or not XSL should be extended, but how, when, and by whom.
XML Query Language (XQL) was one of the workshop's position papers that received a lot of attention, largely due to Microsoft's evangelistic efforts. XQL extends XSL's "querying" facility: its pattern language. Jonathan Robie of Texcel Research presented the XQL proposal, which was co-authored by webMethods and Microsoft.
"XQL was designed to allow efficient implementation for large collections of documents, using indexing strategies similar to those used for full-text search engines," explains Robie. "Scalability was a concern from the beginning. Also, XQL queries extend only the patterns of XSL, not the entire XSL language, so they are much simpler to write than XSL stylesheets. XQL queries can be used in many environments where an entire XSL stylesheet would be inappropriate - e.g., to embed queries in attributes, as simple strings in programming languages, or when queries must be typed by hand."
Other companies, such as webMethods, Inc., who co-authored the XQL proposal and another position paper with Microsoft and Texcel Research, seemed to have very different needs. As webMethods' Joe Lapp explains, his company is more interested in querying single XML documents than in querying document repositories.
"The fact that we can get at XML data using queries inclines us to put XML in many more places. We don't have to write lengthy code that walks XML trees via DOM APIs since we need only construct a short and readable query," explains Lapp.
The XSL Working Group submitted a position paper pleading the case for keeping such query efforts within the existing XSL Working Group, and additionally suggesting the formation of an official "coordination group" to "take responsibility for coordinating query requirements with other working groups." The paper also warned of the possibility of fragmentation between implementations if developers are given too many choices, and inevitably, too many incompatible choices.
Others felt a bit uncomfortable with the XSL pattern language as a general-purpose query language. "I think we should start with something dramatically simpler, and I know I'm not alone in that viewpoint," explains Tim Bray, XML Recommendation co-editor.
During the XQL presentation, Microsoft's Adam Bosworth made a point of clarifying that he doesn't even consider XQL to be a full-fledged query language.
"A query language, as far as I am concerned, handles sorting, shaping, relating, and in general taking any XML in and generating any custom XML shape out. This isn't what XQL does," explains Bosworth. "XQL merely offers a model for asking for specific sets of elements. That is fine as far as it goes, but it doesn't go very far. It is like saying that SQL would be a query language if it just had "FROM", but no ORDER-BY or SELECT. I think the XQL folks are trying to generalize path expressions to be a full query language, and I think this is a mistake. Query languages need other constructs than those that describe interesting elements to process. They need to say what to do with them (e.g. order them, extract important elements from them, sum them, ...). I'm a huge fan of rich path expressions. I don't consider them a query language, just a useful part of one."
Element Sets Proposal
Tim Bray submitted a proposal entitled "Element Sets." Although the technique the paper discusses offers significantly less functionality than several of the other proposals, Bray contends that his "element sets" solution has already been used to solve many real-world business problems, while introducing much less complexity.
Unlike many of the proposals that stressed the importance of using parent-child and sibling relationships, Bray's "Element Sets" proposal provided an alternative technique for achieving the same kinds of functionality.
- Element sets can't do parent-child (find the 3rd child of element x), but they can do ancestor-descendant.
- Element sets can't locate siblings, but they can work with preceding-following.
Find me all the <procedures> that contain a <step> which contains the phrase "fuel-injection."
Find me all the <steps> in <procedures> which contain a reference to www.whitehouse.gov and are preceded by a reference to www.ken-starr.org in the same <procedure>.
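The two example queries above can be sketched using only "contains at any depth" and "comes earlier in document order." This is not Bray's implementation, just an illustration in Python's standard xml.etree.ElementTree. The sample documents and both helper names are hypothetical, and the second helper simplifies "preceded by a reference" to "an earlier <step> in the same <procedure> mentions it."

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Some steps are nested inside a <group>, so a
# strict parent-child test on <procedure> would miss them.
MANUAL = """<manual>
  <procedure id="p1">
    <step>Check the fuel-injection pressure.</step>
    <step>Replace the filter.</step>
  </procedure>
  <procedure id="p2">
    <group><step>Tighten the bolts.</step></group>
  </procedure>
  <procedure id="p3">
    <group><step>Inspect the <part>fuel-injection</part> nozzle.</step></group>
  </procedure>
</manual>"""

LINKS = """<manual>
  <procedure>
    <step>See www.ken-starr.org first.</step>
    <step>Then see www.whitehouse.gov.</step>
  </procedure>
  <procedure>
    <step>Only www.whitehouse.gov here.</step>
  </procedure>
</manual>"""


def procedures_containing(xml_text, phrase):
    """Ancestor-descendant: ids of procedures with a step containing phrase."""
    root = ET.fromstring(xml_text)
    hits = []
    for proc in root.iter("procedure"):
        # iter() visits descendants at any depth; itertext() gathers the
        # text of a step including text inside its child elements.
        if any(phrase in "".join(step.itertext()) for step in proc.iter("step")):
            hits.append(proc.get("id"))
    return hits


def steps_preceded_by(xml_text, wanted, earlier):
    """Preceding-following: steps mentioning `wanted` after `earlier`."""
    root = ET.fromstring(xml_text)
    hits = []
    for proc in root.iter("procedure"):
        seen_earlier = False
        for step in proc.iter("step"):  # yielded in document order
            text = "".join(step.itertext())
            if wanted in text and seen_earlier:
                hits.append(text)
            if earlier in text:
                seen_earlier = True
    return hits


if __name__ == "__main__":
    print(procedures_containing(MANUAL, "fuel-injection"))
    print(steps_preceded_by(LINKS, "www.whitehouse.gov", "www.ken-starr.org"))
```

Neither helper ever asks which element is a direct child or an adjacent sibling of another; containment and document order are enough for both queries.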
"This is an awfully abstract point," admits Bray, "but one that I thin k is central: ancestor-descendent and precede-following solve the same set of problems that parent-child and left sibling-right sibling solve, and are way easier to implement efficiently."
Database Vendors Weigh In
The greatest concern among database vendors appeared to be a loss of functionality as the cost of incorporating XML into existing systems. Many suggest that the big DB vendors, and even some of the XML repository companies, might actually want a complex all-powerful syntax. This approach might work best to protect their interests from alternative approaches.
"If developers pay $30,000 per seat and have to go to week-long training, why bother making the language simple?" expressed a developer who preferred to remain anonymous. This viewpoint conflicts with XML's golden rule of keeping XML technologies relatively simple and accessible. It also doesn't jibe with the approach that almost every database company took in their position paper: to keep it simple.
One recurring issue seemed to be whether or not formatting properties should be "built-in" to the querying mechanism or kept separate (in the interests of interoperability). For some, it isn't hard deciding on these issues; a query language should be just about querying. Agranat Systems, for instance, took the time to write a less-than-one-page position paper in order to state specifically that it feels an XML Query Language should not "govern the formatting of a query result".
Oracle's paper stated that any query language should have an underlying algebra and provide support for XML data types. It should also have the ability to query multiple documents with a single query. Other features on Oracle's wish list, such as the embedding of SQL statements or the ability to query data types other than XML, might be considered "extra" functions that many feel don't need to be "built-in" to XML's query language.
Interestingly enough, Oracle's only "non-requirement" was defining a "user-friendly search engine kind of query language, since there is unlikely to be early consensus on exactly what results should be returned." And indeed, it would seem that most of the points of disagreement do center around the form in which results are returned. (Or as Oracle put it: "what kind of expression is permitted in specifying the <result>." Oracle's position is that the <result> may be an expression of any type.)
Adam Bosworth authored Microsoft's official position paper, stating clearly that the goals of the paper were "to make more concrete the very large list of work items that the W3C needs to address, and to motivate the W3C to kick off a working group to start doing just that." Bosworth also stressed that the language shouldn't be "too hard to use, too verbose to enter, or too hard to teach."
"It only makes sense that Microsoft would be much more focused on usability because they know it's one of the key determinants of technology adoption," explains Simeon Simeonov, Allaire Corporation's Manager of Language Technology. "The syntax needs to be simple to learn because developers will be writing queries by hand, despite the fact that people have been promising all-visual SQL query builders for more than a decade."
"The only concern I have is, if XSL will focus on generating XML/HTML from XML. We need a way of generating any target format, e.g. RTF or CSV," explains Ralf Westphal, editor-in-chief of BasicPro, the German VB magazine, and an XML Consultant in Hamburg, Germany. "It should be made easy to insert target language fragments of any kind into XSL, not just XML-elements."
However, other developers disagree with this approach. "The XSL FO (formatting object) tree generation allows for the back-end creation of alternate binary formats," explains Dr. Jonathan Borden, an XML consultant and neurosurgeon at New England Medical Center in Boston, Massachusetts. "Unfortunately, XSL FO's aren't implemented by Microsoft or James Clark or IBM because the interest is only in generating XML and HTML from XML."
"What this all means is that the query processor implementations will have to find a way to deal with relational datastores, even if this requires a bridge/adapter of sorts, because most of enterprises' data is in relational DBMS's," explains Simeonov. "This may impose design constraints on the language, but we'd better decide on something soon since vendors are coming up with their own micro-languages already."
"The SQL standardization was possible because the participants had agreed to build a language based on their common understanding of the relational model," remembers IBM's Cotton. "One of the major conclusions of the QL'98 workshop was that the participants needed to agree upon a common data model for XML that could then be used to support a query facility."
SIDEBAR: Looking at Trees and Tables
Consider how an invoice might appear as a relational database and as an XML document. In XML terms, a simple invoice might consist of an invoice element with an ID attribute and a series of one or more line item elements. In XML, the connection is implicit: an invoice element contains its line item elements.
In a relational database, you'd have a table for invoices and a table for line items. A "join" is used to make the connection between a single invoice and a set of line items.
Here's an XML fragment:
<invoice number="990302">
  <lineitems>
    <item><partnumber>1234</partnumber><price>300</price><qty>2</qty></item>
    <item><partnumber>9923</partnumber><price>400</price><qty>1</qty></item>
  </lineitems>
  <sum>1000</sum>
</invoice>
This could be represented in a table as follows:
We'd like to be able to regard the document and the record known by its ID number as equivalent.
However, mapping from tables to document trees and back is no trivial task. There is information in the document (the sum) that has not been integrated into the database table record. Likewise, the fact that the "300" value in the price column stands for $300.00 (a "currency" datatype) is an example of information contained in the database's table structure that is not represented in our tree-structured example.
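To make the mismatch concrete, here is a minimal sketch that flattens the invoice fragment into rows, loads them into two tables, and joins them back together. It uses Python's standard xml.etree.ElementTree and sqlite3 modules. The table names (`invoices`, `lineitems`), the column layout, and both helper functions are assumptions made for illustration; they are not taken from any of the workshop proposals.

```python
import sqlite3
import xml.etree.ElementTree as ET

INVOICE_XML = """<invoice number="990302">
  <lineitems>
    <item><partnumber>1234</partnumber><price>300</price><qty>2</qty></item>
    <item><partnumber>9923</partnumber><price>400</price><qty>1</qty></item>
  </lineitems>
  <sum>1000</sum>
</invoice>"""


def invoice_to_rows(xml_text):
    """Flatten the tree into (invoice, partnumber, price, qty) rows."""
    root = ET.fromstring(xml_text)
    invoice_id = root.get("number")
    rows = []
    for item in root.findall("lineitems/item"):
        rows.append((
            invoice_id,  # the implicit containment becomes an explicit key
            item.findtext("partnumber"),
            int(item.findtext("price")),
            int(item.findtext("qty")),
        ))
    # Note: the <sum> element has no column in this layout and is dropped.
    return invoice_id, rows


def load_and_join(xml_text):
    """Store the rows in two hypothetical tables and join them back."""
    invoice_id, rows = invoice_to_rows(xml_text)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (number TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE lineitems "
        "(invoice TEXT, partnumber TEXT, price INTEGER, qty INTEGER)"
    )
    conn.execute("INSERT INTO invoices VALUES (?)", (invoice_id,))
    conn.executemany("INSERT INTO lineitems VALUES (?, ?, ?, ?)", rows)
    joined = conn.execute(
        "SELECT i.number, l.partnumber, l.price, l.qty "
        "FROM invoices AS i JOIN lineitems AS l ON l.invoice = i.number "
        "ORDER BY l.partnumber"
    ).fetchall()
    conn.close()
    return joined


if __name__ == "__main__":
    for row in load_and_join(INVOICE_XML):
        print(row)
```

In this sketch the document's <sum> never reaches the tables, although it could be recomputed from price and quantity. In the other direction, the INTEGER price column carries type information the XML text does not, and even that column says nothing about currency.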
A unified data model will help to bridge the gaps between these different kinds of data structures.