
Scaling Up with XQuery, Part 2

June 21, 2006

Bob DuCharme

Last week, we learned that although scaling up from Saxon's in-memory XQuery implementation to a disk-based XML database requires a bit of extra effort, it's worth doing because you can create applications around much larger collections of data, and you can do it for free. We also looked at how to set up and use MarkLogic Server. This week, we'll see how to perform the same setup and usage tasks with two more XQuery engines: eXist and Sleepycat's Berkeley DB XML. You can find the sample data and scripts in this zip file.

eXist

As with MarkLogic, you usually interact with the open source eXist XQuery engine through an HTTP server that is part of the program. After you download and install it, the installation routine instructs you to start the server by selecting eXist Database Startup from the eXist group of the Windows Start menu. Once you do this, send your browser to http://localhost:8080/exist to view the administration screen.
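If you would rather start the server from a command line than from the Start menu, the eXist installation should also include equivalent startup scripts in its bin directory. On a default Windows installation that would be something along the lines of the following, although the exact script name may vary from version to version:

C:\Program Files\eXist\bin\startup.bat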

Here, you will find a lot of good tips for getting started with eXist, including information about starting the Client shell (which is also available from the eXist group of the Windows Start menu). Before loading documents into an eXist collection, you must create the collection either by clicking the client shell's "Create new collection" icon (a little document with a star) or by entering a command such as the following at the exist:/db prompt in the lower pane of the client shell:

mkcol recipes
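If you prefer to have a query create the collection rather than doing it by hand, eXist's xmldb extension module also offers a create-collection function. A minimal sketch, assuming your eXist version provides the function and that the query runs with sufficient rights, would look like this:

xmldb:create-collection("/db", "recipes")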

(Note that you can navigate into a collection and create new collections within it, which caused me some confusion at first. A /db> in the shell prompt shows that you're at the root of the main collection.) After creating the recipes collection with the mkcol command, I loaded data into it by:

  1. Putting the loadrecipes.xqy file below into a cookbook subdirectory that I created in C:\Program Files\eXist\webapp. (The webapp directory itself is created automatically by the eXist installation.)

  2. Sending a browser to the URL http://localhost:8080/exist/cookbook/loadrecipes.xqy.

(: Load recipe files into eXist database. :)

xquery version "1.0";

declare namespace xmldb="http://exist-db.org/xquery/xmldb";

<html><body>
{ 
  (: We'll load each file into the recipes collection as the administrator. :)
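  (: eXist's admin account has an empty password by default, which is why the third argument below is an empty string. :)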
  let $collection := xmldb:collection("xmldb:exist:///db/recipes", "admin", "")
  (: Instead of 3 file names, the actual loadrecipes.xqy script has 291. :)
  let $filenames := ("_Baking_the_Best_Muffins_","_Butter_",
                     "_Brown_Bag__French_Apple_Pie")
  for $dataFilename in $filenames
      let $name := $dataFilename
      let $URI := xs:anyURI(concat("file:///c:/dat/xquery/recipeml/",$name,".xml"))
      let $retCode :=  xmldb:store($collection, $name, $URI)
      return <p>loaded {$retCode}</p> 
 }
</body></html>

This "query" file has a few things in common with the loadrecipes.xqy file that I created to load data into MarkLogic:

  • Because the XQuery specification doesn't describe a way to load data, the actual loading is done by an extension function called, in the eXist case, xmldb:store.

  • Because I will use a browser to execute the query, the process of loading creates an HTML page with messages confirming that each file has successfully loaded. If there are problems, eXist will display error messages as well.
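One quick way to confirm that everything arrived is to count the documents in the collection, for example from the client shell's query dialog. Once the full set of files is loaded, a query along these lines should report 291:

count(collection('/db/recipes'))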

After you load these files, you can query them. For an interactive query, click the binoculars icon in the eXist client shell to bring up the "Query Dialog" dialog box. Clicking in the top pane, entering an XPath expression, and then clicking the Submit button puts the query results in the bottom pane. Although the pop-up tool tip for the binoculars icon says that it lets you "Query the database with XPath," full XQuery expressions seem to work as well. Both of the queries to list the titles of recipes with "sugar" as an ingredient, shown in Part 1 of this article, worked fine here.
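If you don't have Part 1 handy, those queries run roughly along these lines (approximations against my /db/recipes collection rather than the exact Part 1 listings), first as a single path expression:

collection('/db/recipes')/recipeml/recipe/head/title[../../ingredients/ing/item[contains(.,'sugar')]]

and then as a FLWOR expression:

for $title in collection('/db/recipes')/recipeml/recipe/head/title
where $title/../../ingredients/ing/item[contains(.,'sugar')]
return $title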

To run a query stored in a file against the loaded database, I copied the 4acrowd.xqy query file from Part 1 into C:\Program Files\eXist\webapp\cookbook and sent a browser to http://localhost:8080/exist/cookbook/4acrowd.xqy. This displayed the results properly, and a wget command line also retrieved the query results.
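For example, a command along these lines saves the generated HTML to a local file (the output filename is just a placeholder):

wget -O results.html http://localhost:8080/exist/cookbook/4acrowd.xqy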

To be honest, there were actually lots of intermediate steps. The first version of 4acrowd.xqy that worked in MarkLogic didn't work in eXist, and when I fixed that, it no longer worked in MarkLogic. I had to tweak it further to arrive at the version from Part 1, which works with MarkLogic, eXist, and Berkeley DB XML. (The most significant tweak was adding the div element to change the namespace scope of the XPath expressions in the FLWOR section.) Remember, the XQuery standard is not yet set in stone as a W3C Recommendation. Many XQuery advocates complain that the standardization process moves too slowly, but writing non-trivial XQuery code for one implementation and then using it with another (which is, after all, the point of using standards-based software) is not nearly as easy as moving an XSLT 1.0 stylesheet from Xalan to Saxon to libxslt unchanged. (The XQuery advocates I've met tend to be strongly committed to one implementation, which makes porting a minor issue for them.)

Berkeley DB XML

Sleepycat's open source Berkeley DB XML is not a server, but a library built on top of their Berkeley DB database. It's written in C++, and although I have no numbers to compare its speed with that of MarkLogic, casual use shows both to be much faster than the Java-based eXist. Sleepycat offers APIs for DB XML in C++, Java, Perl, Python, Ruby, and Tcl. Paul Ford's 2003 article on an early version of DB XML goes into more detail on the product and demonstrates the use of the Perl API; I chose to use Python. To install the Python support, I ran the two EXE files I found in the Python subdirectory of the Berkeley DB XML installation directory.

Before you write any programs that use the API, you can use the dbxml utility that comes with Berkeley DB XML to perform some basic database operations. Sleepycat's Introduction to Berkeley DB XML (PDF) covers the use of the dbxml command prompt to load and query data, the creation of indexes, and a bit of introductory XQuery.

One dbxml command is createContainer, which creates Sleepycat DB XML's equivalent of a document collection, known as a container. Sleepycat suggests giving containers a .dbxml extension so that you can recognize them on your hard disk, but I just called mine "recipes" for consistency with my earlier examples and entered the following at the dbxml utility's prompt:

createContainer recipes

The dbxml utility includes commands for adding documents to a container. As mentioned in Part 1, I wanted to automate this process in order to add a large number of documents, so I wrote a Python script. Sleepycat DB XML comes with an examples.py file that demonstrates the use of the library from Python, and I used the appropriate lines from there as models to create the following:

# loadrecipes.py: load RecipeML recipes into Sleepycat DB XML container
# named "recipes".
from bsddb.db import *
from dbxml import *

print "loading XML data..."
mgr = XmlManager()
uc = mgr.createUpdateContext()

recipePath = "/dat/xquery/recipeml/"

#  Instead of 3 file names, the actual loadrecipes.py script has 291. 
recipeFilenames = ["_Baking_the_Best_Muffins_","_Butter_","_Brown_Bag__French_Apple_Pie"]

container = mgr.openContainer("recipes")

for filename in recipeFilenames:

  fileObject = open(recipePath + filename + ".xml")
  fileContents = r""

  for line in fileObject:
    fileContents = fileContents + line

  container.putDocument(filename, fileContents, uc)

  fileObject.close()

print "Finished loading XML data."

It was a pleasant surprise to see how quickly Sleepycat loaded the 291 files.

To enter an interactive query, I started up the dbxml utility again and entered the following three commands:

openContainer "recipes"
query collection('recipes')/recipeml/recipe/head/title[../../ingredients/ing/item[contains(.,'sugar')]]
print

The multiline version of the same query with the FLWOR expression didn't work, so I moved on to executing a stored query against the database. I didn't want to embed each query inside a bunch of Python code, as I did to load the files into the database. So, I wrote the following Python script to load a file with an XQuery query into memory and to run that query against a container named on the command line with the query file's name:

# Take command line arguments of Berkeley DB XML container name and a
# file with an XQuery query and run the query against the container.

from bsddb.db import *
from dbxml import *
import sys

if len(sys.argv) != 3:
    print "Enter\n\n   python querydbxml.py containername queryfile\n"
    print "to run the XQuery query stored in queryfile against the "
    print "containername query in Berkeley DB XML."
    sys.exit()

containerName = sys.argv[1]
queryFile = sys.argv[2]

mgr = XmlManager()
qc = mgr.createQueryContext()
container = mgr.openContainer(containerName)

fileObject = open(queryFile)
queryString = r""

for line in fileObject:
  queryString = queryString + line

fileObject.close()

results = mgr.query(queryString, qc)
results.reset()
for value in results:
    print value.asString()

The following use of this script ran the 4acrowd.xqy query that we saw earlier with no problems, although the values of the a/@href attributes don't mean much when the HTML isn't being delivered from a server-based XQuery engine.

  python querydbxml.py recipes 4acrowd.xqy

The a/@href values were actually empty after I ran this, because DB XML 2.2.13's implementation of the XQuery document-uri() function has a bug. According to Sleepycat's John Snelson, this has been fixed in the current development version and will be available in the next release.

Moving on

Each of these XQuery engines has many more features than I've covered here, such as index control, updating, and full-text searching. My goal was to get you to the point where you could start exploring those features with a reasonably large collection of your own data. Without spending any money, you can check them all out and discover the advantages of having large amounts of your XML stored in a database where you (or an application!) can use a W3C standard language to quickly retrieve exactly what you want.