Getting Started with XQuery

March 2, 2005

Although the W3C's XQuery language for querying XML data sources is still in Working Draft status, the recent XML 2004 conference showed that there's already plenty of interest and many implementations. While the Saxon implementation may not scale up as much as the disk-based versions that use persistent indexes and other traditional database features, you can download the free version of Saxon, install it, and use XQuery so quickly that it's a great way to start playing with the language in order to learn about what this new standard can offer you.

Running a Query

Let's start with a toy example that demonstrates how to tell Saxon which query to run against which XML, and then we'll move on to examples that show useful queries run against real XML data. For our first two queries, we'll use the following document, which is named data1.xml:

<doc>
  <p>this is a sample file</p>
  <p>this p has <emph>inline</emph> markup</p>
</doc>

When run from a command line, the following tells Saxon to run the query shown and to send the result to standard output. As in XSLT attribute value templates, curly braces in XQuery enclose an expression to be evaluated and replaced by the result of the evaluation. Unlike XSLT, XQuery lets you nest curly braces as queries get more complex. In this particular case, however, the curly braces have more to do with the Saxon implementation than with XQuery syntax: they indicate to Saxon that the enclosed string is an actual query and not the name of a query file or some other command-line option.

java net.sf.saxon.Query {doc('data1.xml')//p[emph]}

(On a Linux machine, I also had to put quotation marks around the expression with curly braces.) This query asks for the p elements in the data1.xml file that have an emph child element. Saxon's XQuery processor responds with the following:

<?xml version="1.0" encoding="UTF-8"?>
<p>this p has <emph>inline</emph> markup</p>

(In the remaining examples, I'll omit the XML declaration from the output.) A query doesn't have to be much more complex than this one before it's too long to fit on a command line, so Saxon can accept a query stored in a text file. To demonstrate, I put the query above into its own file, called query1.xqy, without the curly braces from above that told Saxon the role of that string on the command line:

(: Here is an XQuery comment. :)
doc('data1.xml')//p[emph]

(I also added a comment to show how XQuery uses parentheses and colons to delimit comments for the query processor to ignore. As a long-time hater of smileys, I can't say I like the XML Query Working Group's choice of comment delimiters much.) With those two lines stored in query1.xqy, the following command has the same result as the previous one:

java net.sf.saxon.Query query1.xqy

The query above is more concise than the equivalent XSLT stylesheet, although the XSLT version would still be very simple, and many have debated whether either language makes the other unnecessary. As with many programming language comparisons, the answer is that while both languages may be able to perform the same functions, each makes certain tasks quicker and easier for the developer than the other. Let's look at some of XQuery's strengths.

Looking for Some Sugar

To really test the usefulness of XQuery, I wanted to use real-world data, so I downloaded a collection of recipes from Squirrel's RecipeML archive that conform to the RecipeML DTD. (Because a cookbook is such an obvious candidate for multiple back-of-the-book indexes, I've often wondered why no Topic Map advocates have created a Topic Map from a collection of RecipeML recipes. The availability of XQuery implementations should make it easier.) As with much of the XML available on the internet, we can't assume that these are all clean, well-formed documents, so several recipe files required a little clean-up before I could start running queries against the collection.

Issuing a query against multiple documents at once is an example of a task that, while not impossible in XSLT, is much easier in XQuery when we use the collection function. (As with all functions mentioned in this article, you can use collection in XSLT 2.0 as well as in XQuery, because it's one of the XQuery 1.0 and XPath 2.0 Functions and Operators. Its use with XQuery generally allows more concise requests than it does with XSLT.) In Saxon, the argument for this function is a URI identifying a file that lists the collection's XML documents in this format:

<collection>
  <doc href="_Band__Sloppy_Joes.xml"/>
  <doc href="_Cheese__Fricadelle.xml"/>
  <!-- more doc elements... -->
  <doc href="Walton_Mountain_Coffee_Cake.xml"/>
  <doc href="Walty's_Dressing.xml"/>
  <doc href="Wan_Tan_(Wonton).xml"/>
</collection>

I named this document docs.xml and put it in a recipeml subdirectory with the 290 or so recipe documents that I extracted from the Squirrel Archive zip files that I downloaded. The first query against this collection lists the title of every recipe that has the string "sugar" in any item child of an ing ("ingredient") element (carriage returns added to queries for readability):

collection('recipeml/docs.xml')/recipeml/recipe/
     head/title[//ingredients/ing/item[contains(.,'sugar')]]

The output looks like this:

<title>"Band" Sloppy Joes</title>
<title>"Best" Apple Nut Pudding</title>
<!-- more title elements... -->
<title>Waltons Mountain Coffee Cake</title>
<title>Walton Mountain Coffee Cake</title>

Because XPath 2.0 allows function calls as location steps, this query is simply one big XPath expression. Part of the appeal of XQuery to people with more of a traditional database background and less of an XML geek background is that XQuery also offers a more SQL-like syntax, so that you get the same result from your XQuery processor with this query:

for $ingredient in collection('recipeml/docs.xml')//
                   ingredients/ing/item[contains(.,'sugar')]
  return $ingredient/../../../head/title

The for clause iterates across a collection of nodes, and the return clause creates the result of the iteration by identifying which node(s) in the collection to return in the expression.

These two queries each asked for a list of title elements and got the same result. The output, like the query itself (but unlike an XSLT stylesheet), is not a well-formed XML document: it's a sequence of title elements with no single root element. You can make the result well-formed easily enough; the following variation on the last query wraps the result in a sweets element and demonstrates some XQuery features that make queries more flexible.

<sweets> 
  {
    let $target := 'sugar'

    for $ingredient in collection('recipeml/docs.xml')//
                   ingredients/ing/item[contains(.,$target)]
    return $ingredient/../../../head/title 
  }
</sweets>

As I mentioned above, curly braces in XQuery enclose an expression to be evaluated and replaced by the result. In the case above, the data returned by the multi-line expression between the braces will appear between the sweets start- and end-tags in the result. Inside the braces is the same for clause as before, which tells the XQuery engine to iterate across the specified set of nodes and then return the title element of each node's recipe. The condition specifying the nodes to iterate through is a little more flexible than its equivalent in previous examples: instead of looking for item elements with the hardcoded string "sugar" as a substring, it looks for the value of the $target variable as a substring. The $target variable is set to the value "sugar" by the let clause preceding the for clause, so the query has the same effect that it has in the preceding example, but it's easier to customize to make it search for something else.
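
To see how enclosed expressions nest without needing any input document, here is a minimal, self-contained sketch. The numbers and the doc, p, and n names are made up for illustration. The outer braces enclose a for clause, a second pair of braces inside the constructed p element computes its content, and the braces in the attribute value play the same role as an XSLT attribute value template.

```xquery
(: Illustrative only: nested enclosed expressions, no input document needed. :)
<doc>
  {
    for $n in (1, 2, 3)
    return <p n="{$n}">{ $n * 2 }</p>
  }
</doc>
```

The result should be a doc element containing three p elements, with n attribute values of 1, 2, and 3 and contents of 2, 4, and 6.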

The for and let keywords give us the first two letters in FLWOR, an umbrella term (pronounced "flower") used in XQuery for expressions that use the keywords for, let, where, order by, and return. In the words of the W3C Working Draft XQuery 1.0: An XML Query Language, "a FLWOR expression ... supports iteration and binding of variables to intermediate results. This kind of expression is often useful for computing joins between two or more documents and for restructuring data." To someone approaching XQuery from the relational database world, these keywords will be more familiar than the axes, node tests, and predicates of XPath expressions, which is why the first "for $ingredient" example above will feel more natural to a typical database administrator than the example that retrieves the title elements with a single XPath expression.
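
As a rough sketch of the five clauses working together, the following query reuses the element paths from the earlier examples. It is not one of the original queries: the $recipe and $title variable names are just illustrative, and it assumes that each recipe has exactly one head/title element, because the order by clause needs a single value to sort on.

```xquery
(: Sketch: titles of recipes that use sugar, sorted by title.
   Assumes one head/title per recipe.                        :)
<sweets>
  {
    for $recipe in collection('recipeml/docs.xml')/recipeml/recipe
    let $title := $recipe/head/title
    where $recipe/ingredients/ing/item[contains(., 'sugar')]
    order by $title
    return $title
  }
</sweets>
```

Because this version iterates over recipe elements rather than item elements, each matching recipe contributes its title once, no matter how many of its ingredients mention sugar.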

Let's look at a query that uses the where keyword and builds a web page, complete with links to the documents with the target text.

Feeding Multitudes

Which recipes will feed more than 20 people? The following one-line query takes an XPath-oriented approach to listing the recipe titles that meet this condition.

collection('recipeml/docs.xml')/recipeml/recipe/head/
                                        title[../yield > 20]

A more FLWORy approach allows more flexibility. While the query above says "get the title element for each recipe whose yield is greater than 20," the following says "go through all the documents in the collection, and for any with a yield of more than 20, get the title."

for $doc in collection('recipeml/docs.xml')/recipeml
where $doc/recipe/head/yield > 20
return $doc/recipe/head/title

It may not seem like much of a difference, but once we get past that where clause, the $doc variable gives us a handle to each document meeting the where condition, letting us pull all we want out of it: the title and, if we want, even more. The following query wraps the preceding one in a simple HTML document and uses the document-uri function to add a link to each document meeting the where condition. Here $doc is bound to each document node rather than to its recipeml element, because document-uri returns a URI only for a document node.

(: Create an HTML page linking to recipes  
   that serve more than 20 people.         :)

<html><head><title>Food for a Crowd</title></head>
<body>
  <h1>Food for a Crowd</h1>
  { 
    for $doc in collection('recipeml/docs.xml')
    where $doc/recipeml/recipe/head/yield > 20
    return
      <p><a href="{document-uri($doc)}">
      {$doc/recipeml/recipe/head/title/text()}
      </a></p>
  }
</body></html>
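
As a sketch of pulling "even more" out of each document, the following variation shows the yield figure next to each link. The ul and li markup and the $head variable are my own illustrative choices, and the sketch assumes, as the queries above do, that each yield element holds a plain number.

```xquery
(: Sketch: the same selection, with each recipe's yield shown
   beside its link. Assumes yield contains a plain number.    :)
<ul>
  {
    for $doc in collection('recipeml/docs.xml')
    let $head := $doc/recipeml/recipe/head
    where $head/yield > 20
    return
      <li>
        <a href="{document-uri($doc)}">{$head/title/text()}</a>
        (serves {data($head/yield)})
      </li>
  }
</ul>
```

Binding $head with a let clause keeps the where and return clauses from repeating the same long path.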

In the future, we can look forward to more server-side XQuery support that lets sites dynamically generate HTML pages using XQuery queries. With XQuery's ability to query combinations of XML and relational databases, it could end up playing a huge role in many dynamically generated web sites.

Extreme Recipes

A let clause can call functions to compute values that you can then use in a where clause or an XPath predicate. The following query checks for the maximum yield value and then pulls out any recipes with that yield figure:

(: Which recipe(s) serves the most people?  :)

let $maxYield :=
  max(collection('recipeml/docs.xml')/recipeml/recipe/head/
                                                      yield)

return collection('recipeml/docs.xml')/recipeml/recipe[head/
                                      yield =     $maxYield]
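
The query above evaluates the collection function twice. As a hedged alternative, the sketch below binds the recipe elements to a variable once and reuses it. The $recipes variable and the largest wrapper element are hypothetical names chosen for illustration, and, like the original, the query will fail if any yield value is not a number.

```xquery
(: Sketch: bind the recipes once, then report the titles of
   those with the largest yield. Names are illustrative.    :)
let $recipes := collection('recipeml/docs.xml')/recipeml/recipe
let $maxYield := max($recipes/head/yield)
return
  <largest yield="{$maxYield}">
    { $recipes[head/yield = $maxYield]/head/title }
  </largest>
```

Wrapping the titles in a single element also makes the result a well-formed XML document.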

In part two of this article, we'll see how XQuery's ability to sort and aggregate data lets us create a list of ingredient headings from the recipe collection, with each heading followed by a list of links to recipes that contain that ingredient. We'll also see how user-defined functions in queries can expand the possibilities for how you select and use the data in your XML documents with XQuery.