Scaling Up with XQuery, Part 2
by Bob DuCharme
Berkeley DB XML
Sleepycat's open source Berkeley DB XML is not a server, but a library built on top of their Berkeley DB database. It's written in C++, and although I have no numbers to compare its speed with that of MarkLogic, casual use shows both to be much faster than the Java-based eXist. Sleepycat offers APIs for DB XML in C++, Java, Perl, Python, Ruby, and Tcl. Paul Ford's 2003 article on an early version of DB XML goes into more detail on the product and demonstrates the use of the Perl API; I chose to use Python. To install the Python support, I ran the two EXE files I found in the Python subdirectory of the Berkeley DB XML installation directory.
Before you write any programs that use the API, you can use the dbxml utility that comes with Berkeley DB XML to perform some basic database operations. Sleepycat's Introduction to Berkeley DB XML (PDF) covers the use of the dbxml command prompt to load and query data, the creation of indexes, and a bit of introductory XQuery.
One dbxml command is createContainer, which creates Sleepycat DB XML's version of a document container. While Sleepycat suggests giving containers a .dbxml extension so that you can recognize them on your hard disk, I just called mine "recipes" for compatibility with my earlier examples, and entered the following at the dbxml utility's prompt:
createContainer recipes
The dbxml utility includes commands for adding documents to a container. As mentioned in Part 1, I wanted to automate this process in order to add a large number of documents, so I wrote a Python script. Sleepycat DB XML comes with an examples.py file that demonstrates the use of the library from Python, and I used the appropriate lines from there as models to create the following:
# loadrecipes.py: load RecipeML recipes into Sleepycat DB XML container
# named "recipes".
from bsddb.db import *
from dbxml import *
print "loading XML data..."
mgr = XmlManager()
uc = mgr.createUpdateContext()
recipePath = "/dat/xquery/recipeml/"
# Instead of 3 file names, the actual loadrecipes.py script has 291.
recipeFilenames = ["_Baking_the_Best_Muffins_","_Butter_","_Brown_Bag__French_Apple_Pie"]
container = mgr.openContainer("recipes")
for filename in recipeFilenames:
    fileObject = open(recipePath + filename + ".xml")
    fileContents = r""
    for line in fileObject:
        fileContents = fileContents + line
    container.putDocument(filename, fileContents, uc)
    fileObject.close()
print "Finished loading XML data."
It was a pleasant surprise to see how quickly Sleepycat loaded the 291 files.
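The loading script above hardcodes its file names. As a minimal sketch of an alternative, assuming every recipe sits directly in one directory and ends in .xml, the list could be built from the directory itself. The function name recipe_names is hypothetical and is not part of the original script or of the DB XML API; it uses only the Python standard library.

```python
import glob
import os


def recipe_names(recipe_path):
    """Return the base names (without ".xml") of every XML file in recipe_path."""
    pattern = os.path.join(recipe_path, "*.xml")
    names = []
    # Sort so that the load order is the same on every run.
    for full_path in sorted(glob.glob(pattern)):
        base = os.path.basename(full_path)
        names.append(os.path.splitext(base)[0])
    return names
```

With this in place, the line that assigns recipeFilenames could become recipeFilenames = recipe_names(recipePath), and the rest of the script would stay the same.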
To enter an interactive query, I started up the dbxml utility again and entered the following three commands:
openContainer "recipes"
query collection('recipes')/recipeml/recipe/head/title[../../ingredients/ing/item[contains(.,'sugar')]]
print
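To make it clearer what that path expression selects, here is a rough standard-library Python sketch of the same logic run against a single in-memory document. It does not use DB XML at all. The SAMPLE document is invented for illustration, and titles_with_ingredient is a hypothetical helper name. ElementTree's limited XPath support has no contains() function, so the substring test is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Invented sample data, shaped like the element paths used in the query.
SAMPLE = """<recipeml>
  <recipe>
    <head><title>Sample Muffins</title></head>
    <ingredients>
      <ing><item>flour</item></ing>
      <ing><item>brown sugar</item></ing>
    </ingredients>
  </recipe>
</recipeml>"""


def titles_with_ingredient(xml_text, word):
    """Return titles of recipes that have an ingredient item containing word."""
    root = ET.fromstring(xml_text)  # root is the <recipeml> element
    titles = []
    for recipe in root.findall("recipe"):
        items = recipe.findall("ingredients/ing/item")
        # Mirrors the predicate item[contains(., 'sugar')]
        if any(word in "".join(item.itertext()) for item in items):
            for title in recipe.findall("head/title"):
                titles.append("".join(title.itertext()))
    return titles
```

The query itself walks down to each title and then back up with ../.. to test that recipe's ingredients; the sketch starts from each recipe instead, which selects the same titles.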
The multiline version of the same query with the FLWOR expression didn't work, so I moved on to executing a stored query against the database. I didn't want to embed each query inside a bunch of Python code, as I did to load the files into the database. So, I wrote the following Python script to load a file with an XQuery query into memory and to run that query against a container named on the command line with the query file's name:
# Take command line arguments of Berkeley DB XML container name and a
# file with an XQuery query and run the query against the container.
from bsddb.db import *
from dbxml import *
import sys
if len(sys.argv) != 3:
    print "Enter\n\n python querydbxml.py containername queryfile\n"
    print "to run the XQuery query stored in queryfile against the "
    print "containername container in Berkeley DB XML."
    sys.exit()
container = sys.argv[1]
queryFile = sys.argv[2]
mgr = XmlManager()
qc = mgr.createQueryContext()
container = mgr.openContainer(container)
fileObject = open(queryFile)
queryString = r""
for line in fileObject:
    queryString = queryString + line
fileObject.close()
results = mgr.query(queryString, qc)
results.reset()
for value in results:
    print value.asString()
The following use of this script ran the 4acrowd.xqy query that we saw earlier with no problems, although the values of the a/@href attributes don't mean much when the HTML isn't being delivered from a server-based XQuery engine.
python querydbxml.py recipes 4acrowd.xqy
The a/@href values were actually empty after I ran this, because DB XML 2.2.13's implementation of the XQuery document-uri() function has a bug. According to Sleepycat's John Snelson, this has been fixed in the current development version and the fix will be available in the next release.
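Both scripts above build their input string one line at a time. A shorter way to get the same string, sketched here under the assumption of Python 2.6 or later (for the with statement), is to read the whole file in one call. The helper name read_text is hypothetical.

```python
def read_text(path):
    """Return the entire contents of the file at path as one string."""
    # The with block closes the file even if read() raises an error.
    with open(path) as file_object:
        return file_object.read()
```

In the query script, the lines from fileObject = open(queryFile) through fileObject.close() could then be replaced with queryString = read_text(queryFile).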
Moving on
Each of these XQuery engines has many more features than I've covered here, such as index control, updating, and full-text searching. My goal was to get you to the point where you could start exploring those features with a reasonably large collection of your own data. Without spending any money, you can check them all out and discover the advantages of having large amounts of your XML stored in a database where you (or an application!) can use a W3C standard language to quickly retrieve what you want from that database.