Scaling Up with XQuery, Part 1

June 14, 2006

The W3C's XQuery language is now a Candidate Recommendation, and more and more implementations are appearing. In an XML.com introduction to the language (part 1, part 2), I wrote that "while the Saxon implementation may not scale up as much as the disk-based versions that use persistent indexes and other traditional database features, you can download the free version of Saxon, install it, and use XQuery so quickly that it's a great way to start playing with the language in order to learn about what this new standard can offer you."

What if you do want to scale up? That's where the real fun begins. The value of XQuery is not in its role as an alternative syntax to XSLT 2.0 for manipulating XML; it's in the implementations, which let you quickly retrieve, sort, and manipulate specific subsets of XML from collections that can measure in the terabytes. The ability to store large, indexed collections of data that don't fit neatly into normalized relational tables will create possibilities for all kinds of new applications, both inside and outside of the publishing world.

In this two-part article, we'll see how to set up and use three of these XQuery implementations. If you try them, there are two key issues to keep in mind when moving development from one XQuery engine to another:

Because the XQuery specification says nothing about how to load XML into such a database, each popular XQuery engine has its own way to do this, and batch loading of documents is often poorly documented.
The central XQuery concept of the "collection" basically refers to a container of documents, and the implementation of the container varies from one XQuery engine to another. Part 1 of my earlier tutorial on the language described how the argument to Saxon's implementation of the collection() function is a URL pointing to an XML file that lists file locations. The XQuery engines covered in this article each map the XQuery collection concept to their own way of doing things, a way often influenced by pre-XQuery technology that plays a role in that product.

In this article, we'll see how to load a large number of XML documents into each of these XQuery engines, how to run short interactive queries against them, and how to run an XQuery query stored in a file against the stored collection of XML documents.

The Task At Hand

How many is "a large number of XML documents"? For my purposes, it's enough files that filling out a dialog box to load each one individually into a database (the first and sometimes only option properly documented by some XQuery engines) would be unreasonable. I want to create a script, perhaps by piping the output of a dir command on a Windows machine or an ls command on a Linux machine through a bit of Perl or Python, and then run that script to load those files, whether there are 15 or 1,500 of them.

For testing, I used the same source data that I used in my earlier XQuery articles: 291 files from Squirrel's RecipeML archive, which I cleaned up a bit (volunteer XML data entry isn't always as well-formed as you'd like) and put in a zip file for anyone who'd like to try it. The zip file also includes full versions of the scripts shown in this article.

After loading these data files, I wanted to try an interactive query to simulate ad-hoc use. I also ran an XQuery query stored in a file against the loaded database, including a query that has a parameter passed to it, because query files such as this will play an important role in a production system. Instead of making up new examples, I used queries from my earlier articles.

My code may not be the most efficient approach to using each product, but it works, and therefore provides a starting point. Because all three of the tested products are available in free versions, it should give users enough to start prototyping their ideas.

For XQuery engines, I chose three that offer free versions that let you get real work done with fairly arbitrary XML: MarkLogic Server release 3.0-6, eXist 1.0, and Sleepycat's Berkeley DB XML release 2.2.13. X-Hive DB's free version only works for 30 days, and DataDirect XQuery and IBM's XQuery support seem more geared toward using XQuery against data stored in relational databases. (XQuery's ability to query non-XML data along with XML is one of its strengths.)

I ran all tests on a Windows XP machine, but most of it should work on a Linux box.

MarkLogic Server

MarkLogic is carving out an expanding niche as a professional high-end XQuery engine. They make a free version of the product available as a marketing tool, and unlike the free software offered by other companies on a "try before you buy" basis, the limitations imposed on the use of the free MarkLogic server will still let you get some serious work done. If you need professional services, want to deploy an app beyond one or two desktops, want to take advantage of features such as automatic batch conversion of various non-XML formats to XML for loading into their server, or want to scale up your XML database to a really large size, they'd be happy to sell you licenses.

There are various approaches to using the MarkLogic server, but the classic one is to install it as a web server and run HTTP requests against it. After downloading and installing the MarkLogic server, pick Admin MarkLogic Server from the MarkLogic Server section of your Windows Start menu. The first screen asks for a license key and provides a link to get one for an evaluation copy of the software. For an evaluation license, you have two choices:

The Community License lets you store up to 100 megabytes of data and can be used for an unlimited time on personal projects. A given company can use only two copies.
The Trial License lets you store up to a gigabyte of data, but works for only 30 days.

After you get the license key, the server restarts, asks you a few questions (including a username and password that you'll need to access the administration screen and the data in your applications), and takes you to the Server's System Summary page. In the default server configuration, http://127.0.0.1:8001/ is the URL for the admin page. http://localhost:8000/use-cases/ takes you to a use cases page with frames that let you enter and view the results of XQuery queries. This includes links to sample queries, but you can enter any query you like in the "XQuery Source" frame, including queries against the RecipeML recipes once they're loaded.

Before creating a MarkLogic application, you must create an application server. On the tree at the left of the administration screen, pick Groups, Default, App Servers, and then select the Create HTTP tab that appears. The form asks for a port number to use for your new application server and a directory where that server's data will be stored; I assigned a port number of 8009 and created a myserverroot subdirectory of the default \Program Files\MarkLogic production installation directory for the data.

At this point, it's a good idea to put a small, simple HTML file named default.xqy in that directory and then send a browser to http://localhost:8009. You should see your HTML file in the browser. Just as index.html is the default filename to retrieve from a directory stored on an Apache web server and default.asp is used for a Microsoft IIS server, default.xqy is the default filename for the MarkLogic server. As you'll see, you can store complex XQuery queries in these files, but simple HTML works as well.

I had my recipe files in a directory named c:\dat\xquery\recipeml, so I created the following script to load them into the MarkLogic server.

(: Load recipe files into MarkLogic database. :)

<html xmlns="http://www.w3.org/1999/xhtml"><body>
{ 

  (: Instead of 3 file names, the actual loadrecipes.xqy script has 291. :)
  let $filenames := ("_Baking_the_Best_Muffins_","_Butter_","_Brown_Bag__French_Apple_Pie") 

  for $dataFilename in $filenames
  (: Build the path once; by default it also serves as the document's URI. :)
  let $path := concat("c:/dat/xquery/recipeml/", $dataFilename, ".xml")
    return (xdmp:document-load($path,
    <options xmlns="xdmp:document-load" xmlns:http="xdmp:http">
      <repair>none</repair>
      <permissions>{xdmp:default-permissions()}</permissions>
      <format>xml</format>
    </options>),
    xdmp:document-add-collections($path, "recipes"))
}
<p>OK</p>
</body></html>

After creating a subdirectory of my new myserverroot directory named recipes, I put this script in a file named loadrecipes.xqy in that directory and sent my browser to http://localhost:8009/recipes/loadrecipes.xqy to run it. As with many XQuery queries (and ASP files), this one takes the form of an HTML file with delimited sections to generate the necessary data. In this case, the only generated HTML outside of the basic skeleton is a paragraph that says "OK" when all the files are loaded, but if there are problems loading the documents into the MarkLogic Server database, the error messages will also appear in your browser.

As I mentioned, each XQuery engine has its own way to load documents, and one MarkLogic method is to use the MarkLogic-specific extension function xdmp:document-load. MarkLogic treats the xdmp namespace prefix for its extensions as predeclared, like xml, xs, xsi, and the few other prefixes that all XQuery engines are required to recognize even when a query doesn't declare them, so typical MarkLogic queries use it without a declaration.
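
One way to produce the full $filenames sequence without typing 291 names is to generate it from a directory listing. The following Python sketch is only an illustration: the function name xquery_filename_sequence is hypothetical, and it assumes that every recipe file sits directly in one directory and ends in .xml. It builds a parenthesized XQuery sequence of quoted base names that could be pasted into loadrecipes.xqy.

```python
from pathlib import Path


def xquery_filename_sequence(directory, extension=".xml"):
    """Return an XQuery sequence literal naming each matching file, minus its extension."""
    names = sorted(
        entry.stem
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix.lower() == extension
    )
    # In an XQuery string literal, "&" starts an entity reference and a
    # double quote is escaped by doubling it, so both are escaped here.
    quoted = [
        '"' + name.replace("&", "&amp;").replace('"', '""') + '"'
        for name in names
    ]
    return "(" + ",\n   ".join(quoted) + ")"


# Example use (the directory is the one used in this article; adjust as needed):
# print(xquery_filename_sequence("c:/dat/xquery/recipeml"))
```

Generating only the sequence, rather than the whole script, keeps the hand-written XQuery in one place and limits the generated text to the part that changes when files are added.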

After I loaded the recipe files into MarkLogic, I went to the http://localhost:8000/use-cases/ screen that is installed with the administration server and tried a query from my earlier article listing the titles of all recipes with sugar mentioned as an ingredient. The original query didn't work verbatim, probably because of some confusion between the root of the document and the root of the collection (the same thing happened with eXist), but the following worked just fine:

collection('recipes')/recipeml/recipe/head/title[../../ingredients/ing/item[contains(.,'sugar')]]

The following multiline query, which takes a more FLWOR-like approach to retrieve the same information instead of just being one big XPath expression, also worked from the use-cases form:

for $ingredient in collection('recipes')//
                   ingredients/ing/item[contains(.,'sugar')]
  return $ingredient/../../../head/title

The last test was to take a query that I had stored in a file and run that against the database. In a way, I had already done this when I loaded the data, because the loadrecipes.xqy file is a query file, but I wanted to run a query from my earlier articles that extracted specific data from the database. So, I tried the Food for a Crowd query:

(: Create an HTML page linking to recipes  
   that serve more than 20 people.         :)

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Food for a Crowd</title></head>
<body>

  <h1>Food for a Crowd</h1>
<div xmlns="">
  { 
  for $doc in collection('recipes')
    where $doc/recipeml/recipe/head/yield > 20 
      return <p><a href="getRecipe.xqy?recipeName={document-uri($doc)}">
        {$doc/recipeml/recipe/head/title/text()}</a>
      </p>
  }
</div>
</body></html>

Sending my browser to the URL http://localhost:8009/recipes/4acrowd.xqy properly displayed the created HTML page.

Two important tweaks were necessary to get it to work with all three XQuery engines:

I added the div wrapper with its xmlns="" attribute. Without it, MarkLogic assumed that the unprefixed RecipeML element names in the XPath expressions of the where and return clauses were in the http://www.w3.org/1999/xhtml namespace, because the XHTML namespace was the default namespace in scope. The div element undeclares that default, which keeps its contents out of any namespace without causing problems in the resulting HTML.
In the Saxon version of this query, the a/@href attribute pointed to the individual XML recipe files sitting on the hard disk. For the MarkLogic and eXist versions, this attribute holds a URL that calls another query file named getRecipe.xqy, passing it the identifier of that recipe's XML within the database. The getRecipe.xqy query file, which is included in this article's accompanying zip file, retrieves that XML and converts it to HTML before sending it to the browser. The MarkLogic and eXist versions of getRecipe.xqy are identical except for the first line of each, which calls a different extension function to get the value of the recipeName parameter passed from 4acrowd.xqy to getRecipe.xqy in the URL. This ability to pass parameters from one XQuery file to another in HTTP server XQuery implementations lets you combine individual query files into larger, more complex applications.
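
The namespace problem behind the first tweak is not specific to MarkLogic. As a rough analogy only (this is Python's ElementTree rather than XQuery, and the tiny recipe document is made up for illustration), the sketch below shows an unprefixed path matching no-namespace elements until a default namespace is applied to the path, at which point the same path quietly matches nothing.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A made-up, minimal stand-in for a RecipeML document; its elements are in no namespace.
doc = ET.fromstring(
    "<recipeml><recipe><head><title>Muffins</title></head></recipe></recipeml>"
)

path = "recipe/head/title"

# With no default namespace, the unprefixed names match the no-namespace elements.
plain_match = doc.find(path)

# With XHTML as the default namespace for the path (Python 3.8 or later), each
# unprefixed name is read as an XHTML element name, so nothing in the recipe matches.
xhtml_match = doc.find(path, namespaces={"": XHTML_NS})

print(plain_match.text if plain_match is not None else None)
print(xhtml_match)
```

The second lookup fails without raising an error, which is why this kind of mismatch tends to show up as an unexpectedly empty result page rather than as an error message.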

What if you want to issue a command from a script that runs a query and saves the result in a file, instead of running your query by sending your web browser to a particular URL? A command-line utility such as wget or curl can request the result of a query from an HTTP server, including MarkLogic's, as described in this article on retrieving XML from a TiVo. Along with the URL of the query, add the --http-user and --http-passwd parameters to your wget command line to tell the MarkLogic server that you are authorized to retrieve the data, using the administrator username and password that you created when you set up the server. (The TiVo article describes these parameters in more detail, although the version of wget that I'm currently using spells the password parameter --http-passwd, not --http-password as I wrote there. When in doubt, enter wget -h at your command line to check the spelling of the parameter names.)
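
For scripts that are already written in Python, the standard library can play the same role as wget. The sketch below is an illustration under assumptions rather than a MarkLogic-specific recipe: the function names are hypothetical, the credentials in the usage comment are placeholders, and it registers both Basic and Digest handlers because the authentication scheme depends on how the application server is configured.

```python
import urllib.request


def build_authenticated_opener(base_url, user, password):
    """Return a urllib opener that can answer Basic or Digest auth challenges."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None means "use these credentials for any realm under base_url".
    password_mgr.add_password(None, base_url, user, password)
    return urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(password_mgr),
        urllib.request.HTTPDigestAuthHandler(password_mgr),
    )


def save_query_result(query_url, user, password, out_path):
    """Request query_url and write the response body to out_path."""
    opener = build_authenticated_opener(query_url, user, password)
    with opener.open(query_url) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Example use (placeholder credentials; the URL is the one from this article):
# save_query_result("http://localhost:8009/recipes/4acrowd.xqy",
#                   "admin-user", "admin-password", "4acrowd.html")
```

Keeping the opener construction separate from the download makes it possible to reuse one opener for several query URLs on the same server.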

You don't have to retrieve data via an HTTP request; each of these products also provides APIs that may be more efficient for your application to use.

More of the Same

Next week, we'll see how to perform the same tasks on the same data with eXist and Berkeley DB XML, so that you'll be ready to weigh several options when choosing the right XQuery platform for your system.