Menu

Getting Started with XQuery, Part 2

March 23, 2005

Bob DuCharme

In part one of this article, we saw how the free version of the Saxon program lets you jump right in with XQuery, the increasingly popular (although not quite finished) W3C standard for querying XML data sources. We learned how to pass Saxon XQuery queries to execute on the command line or from a disk file, and we looked at queries that went through a collection of RecipeML documents to find which recipes used a certain ingredient, which served more than twenty people, and which served the most people. This week, we'll learn more about how a query can manipulate the XML that it pulls out of a collection, and how user-defined functions can provide even greater flexibility in the sorting and arrangement of that data.

What's in the Fridge

Sometimes, the decision of what to make for dinner is based on what ingredients you already have. A list of all the ingredients used in a cookbook, with each followed by a list of the recipes that use it, can be very handy when planning a meal. The query below demonstrates one approach to creating this list, with URLs of the recipes included in the result. I could have written the query to create an HTML file as output, like I did with the "Food for a Crowd" query in part one of this article, but instead this query creates a new XML file that I would most likely convert to HTML (and perhaps other formats) using XSLT.

(: A unique, sorted list of all unique 
  ingredients in  the recipe collection, 
    with URLS to link to the recipes. :) 
<ingredients> 
{
  for $ingr in
distinct-values(collection('recipeml/docs.xml')/recipeml/recipe/ingredients/ing/item)
  order by $ingr
  return
    <item name="{$ingr}"> 
    {
      for $doc in 
         collection('recipeml/docs.xml')
      where $doc/recipeml/recipe/ingredients/ing/item = $ingr
return 
      <title url="{document-uri($doc)}">
{$doc/recipeml/recipe/head/title/text()}
             </title>
    } 
    </item>
} 
</ingredients>

The distinct-values function lets the for clause iterate across a collection of the specified nodes with no duplicates, and the order by clause sorts the upcoming output alphabetically by the values of the item elements that the for clause finds. The return keyword following the order by line doesn't simply return the found nodes, but takes each one and goes further with it: a nested for clause searches for recipes with that ingredient and outputs a title element with the recipe name as content and the recipe's URI in a url attribute.

Note that the contents of each title element in the output is the text() node child of the $doc/recipeml/recipe/head/title pulled from the recipe. If I had put the whole $doc/recipeml/recipe/head/title element in there, I would have gotten its start- and end-tags as well. I didn't want those because the query includes a title start-tag for the output that includes the new url attribute.

An excerpt from the output is shown below. For an HTML version, a simple XSLT stylesheet could convert the item/@name values to h2 headers and each title element to an HTML p element containing the recipe name as a link to the recipe.

<ingredients>

  <!-- some item elements removed -->

  <item name="(12-oz) tomato paste">
    <title url="file:/C:/dat/recipeml/_Best_Ever__Pizza_Sauce.xml">
      "Best Ever" Pizza Sauce</title>
  </item>
  <item name="Baking Powder">
    <title url="file:/c:/dat/recipeml/_Blondie__Brownies.xml">
      "Blondie" Brownies</title>
    <title url="file:/c:/dat/recipeml/Walnut_Pound_Cake.xml">
      Walnut Pound Cake</title>
  </item>
  <item name="Baking Soda">
    <title url="file:/c:/dat/recipeml/_Faux__Sourdough.xml">
      "Faux" Sourdough</title>
  </item>
  <item name="Baking potatoes">
<title url="file:/c:/dat/recipeml/_Indian_Chili_.xml">
      "Indian Chili"</title>
  </item>
  <item name="Baking powder">
    <title url="file:/c:/dat/recipeml/_Best__Apple_Nut_Pudding.xml">
      "Best" Apple Nut Pudding</title>
    <title url="file:/c:/dat/recipeml/_Gold_Room__Scones.xml">
      "Gold Room" Scones</title>
    <title url="file:/c:/dat/recipeml/_Outrageous_Chocolate_Chipper.xml">
"Outrageous" Chocolate-Oatmeal Chipper 
(Cooki</title>
  </item>
  <item name="Baking soda">
    <title url="file:/c:/dat/recipeml/_First__Ginger_Cookies.xml">
"First" Ginger Molasses Cookies</title>
    <title url="file:/c:/dat/recipeml/_Foot_in_the_Cake.xml">
"Foot in the Fire" Chocolate Cake</title>
  </item>
  <item name="Tomato paste">
    <title url="file:/C:/dat/recipeml/Crawfish_Etouff'ee.xml">
"Frank's Place" Crawfish Etouff'ee
</title>
    <title url="file:/C:/dat/recipeml/Hamburger____Ground_Meat_Balti.xml">
      "Hamburger" / Ground Meat Balti
</title>
    <title url="file:/C:/dat/recipeml/Indian_Chili_.xml">
       "Indian Chili"</title>
  </item>
  <!-- some item elements removed -->
</ingredients>

The bolded text gives clues to potential improvements. The query treats "Baking Powder" and "Baking powder" as two different ingredients, but I'd rather see recipes that require baking powder appearing in the same list. Another problem is that for many ingredients in the RecipeML files, the item element was used to store too much information, like in the following:

<ing><item>
  (12-oz) tomato paste
</item></ing>

It's too much because RecipeML offers a structure to break down this information, making the data richer. This ingredient could have been broken down like this:

<ing>
  <amt>
    <qty>12</qty>
    <unit>oz</unit>
  </amt>
  <item>tomato paste</item>
</ing>

As the query output above shows, this decision meant that "(12-oz) tomato paste" and "Tomato paste", even if their case was normalized, would still be treated as two different ingredients.

Anyone familiar with data created by volunteer efforts will recognize these issues and others that can cause inconsistencies, such as the occasional use of more than one space when one would do. Fortunately, XQuery and XSLT 2.0 let you create customized processing by writing and calling user-defined functions.

Normalized Ingredients

Our final XQuery example query declares two new functions to normalize ingredient names so that it can use cleaner data for the subheads that list cookbook ingredients. Because new functions in a query must be in a declared namespace, the query begins by declaring a namespace and assigning a namespace prefix to it, just like user-defined functions in XSLT 2.0 stylesheets.

(: A unique, sorted list of all unique
   ingredients in the recipe collection,
   with URLs to link to them. Ingredient
   names get normalized by functions
   declared in the query prolog. :) 

declare namespace sn = 
        "http://www.snee.com/ns/misc/" ;
declare function sn:normIngName($ingName) 
                          as xs:string {
  (: Normalize ingredient name. :)

  (: remove parenthesized expression
     that may begin string, e.g. in
     "(10 ozs) Rotel diced tomatoes":)
  let $normedName := 
       replace($ingName,"^\(.*?\)\s*","")
  (: convert to all lower-case :)
  let $normedName := 
                  lower-case($normedName)
  (: replace multiple spaces with a 
     single one :)
  let $normedName := 
             normalize-space($normedName) 

  return $normedName
};

declare function sn:normIngList($ingList)
                           as item()* {
  (: Normalize a list of ingredient
     names. :)
  for $ingName in $ingList
    return sn:normIngName($ingName)
};


<ingredients>
{
  let $normIngNames :=
sn:normIngList(collection('recipeml/
docs.xml')//ing/item)

for $ingr in 
          distinct-values($normIngNames)
  order by $ingr
  return
    <item name="{$ingr}">
    {
      for $doc in 
        collection('recipeml/docs.xml'),
$i in $doc/recipeml/recipe/ingredients/ing/item
      where sn:normIngName($i) = $ingr
      return 
      <title url="{document-uri($doc)}">
{$doc/recipeml/recipe/head/title/text()}
        </title>
    }
    </item>
}
</ingredients>

Together, this query's namespace and function declarations comprise a prolog for the query. As you can tell from the previous queries, a prolog is not required, but it gives your query new options by customizing the query's environment before it's executed. Prologs can also include variable declarations, imports of other schema modules, and imports of W3C XSD schemas to assign data types or to validate input and output data. (A schema can even validate referenced names in a query. For example, if you misspell the name of an element as a name not defined by the schema, static analysis of the query identifies the error before the XQuery processor even looks at the source data.) Note how the prolog statements are separated by semicolons.

The sn:normIngName function uses the replace function, which accepts regular expressions as arguments, to clean up the ingredient values. (See my "Transforming XML" column Regular Expression Matching in XSLT 2 for an introduction to using this function with XSLT 2.0.) In this query, it removes leading parenthetical expressions and converts the whole string to lower-case, and then the XPath 1.0 normalize-space function cleans up whitespace in the string. The second declared function, sn:normIngList, uses the first one to normalize the ingredient list passed to it so that the query's main for clause can iterate over a normalized list.

I set and referenced the $normIngNames variable just to make the main for clause more readable. The rest of the query is pretty similar to the earlier one that created a list of recipes with ingredient subheads, with one important difference: instead of going through the recipes looking for ingredients that match the current subhead, the query goes through the ingredients themselves so that it can normalize them with another call to sn:normIngName before comparing them to the subhead ingredient. This way, it knows that a recipe with "(12-oz) tomato paste" as an ingredient does belong under the "tomato paste" subhead.

The for clause iterates over individual ingredient item elements by taking advantage of the XQuery for clause's ability to associate multiple variables with expressions; in this case, "$doc" with "collection('recipeml/docs.xml')" and "$i" with "$doc/recipeml/recipe/ingredients/ing/item". (Note the comma separating the two associations at the end of the first bolded line in the for clause.) This way, it can check each ing/item within each document.

The result of this query resembles the output of the previous query, but has fewer headings, because the normalization has resulted in the combination of the "Baking Powder" and "Baking powder" lists, the "(12-oz) tomato paste" and "Tomato paste" lists, and others whose names were cleaned up by the normalization functions.

What Next

Official W3C XQuery specs include XQuery 1.0: An XML Query Language, XQuery 1.0 and XPath 2.0 Formal Semantics, the XQuery 1.0 and XPath 2.0 Data Model, XSLT 2.0 and XQuery 1.0 Serialization, and XQuery 1.0 and XPath 2.0 Functions and Operators. Except for the last one, they're all pretty high-level, and the Functions and Operators document, while easy enough to follow, won't teach you much about how to use the different structures that you can incorporate into a query. I've learned the most from the W3C's XML Query Use Cases document; instead of being organized around XQuery features, it's organized around the specific tasks that people want to accomplish with an XML query language. Each use case begins with a plain English description, followed by the query itself and the result.

While writing my sample queries, I also found the use case queries to be useful models for formatting style as I added carriage returns and whitespace to XQuery's strange mixture of XML markup, SQL-like keywords, curly braces, and semicolons. Many of my queries could have been expressed other ways, perhaps more efficiently, but they do work as shown, and I was surprised how quickly the syntax I put together for my first queries worked.

Another great place to learn about the workings of XQuery queries is the sample queries page for eXist, the open source native XML database. eXist is also a good tool to try when you want to scale your XQuery use beyond the memory-based storage used by Saxon. At XML 2004, vendors such as MarkLogic, DataDirect, Oracle and IBM were all very happy to talk about XQuery products with varying degrees of price and commercial readiness.

Don't expect too much in the way of commercial readiness just yet; as I write this, all the W3C specifications that affect XQuery are still Working Drafts. For playing around with XQuery to see what role it might play in your tool set, the free version of Saxon will painlessly give you a quick start at a great price.