Scaling Up with XQuery, Part 1
The W3C's XQuery language is now a Candidate Recommendation, and more and more implementations are appearing. In an XML.com introduction to the language (part 1, part 2), I wrote that "while the Saxon implementation may not scale up as much as the disk-based versions that use persistent indexes and other traditional database features, you can download the free version of Saxon, install it, and use XQuery so quickly that it's a great way to start playing with the language in order to learn about what this new standard can offer you."
What if you do want to scale up? That's where the real fun begins. The value of XQuery is not in its role as an alternative syntax to XSLT 2.0 for manipulating XML; it's in the implementations, which let you quickly retrieve, sort, and manipulate specific subsets of XML from collections that can measure in the terabytes. The ability to store large, indexed collections of data that don't fit neatly into normalized relational tables will create possibilities for all kinds of new applications, both inside and outside of the publishing world.
In this two-part article, we'll see how to set up and use three of these XQuery implementations. If you try them, there are two key issues to keep in mind when moving development from one XQuery engine to another:
Because the XQuery specification says nothing about how to load XML into such a database, popular XQuery engines each have their own way to do this, and batch loading of documents is often badly documented.
The central XQuery concept of the "collection" basically refers to a container of documents, and the implementation of the container varies from one XQuery engine to another. Part 1 of my earlier tutorial on the language described how the argument to Saxon's implementation of the collection() function is a URL pointing to an XML file that lists file locations. The XQuery engines covered in this article each map the XQuery collection concept to their own way of doing things, a way often influenced by pre-XQuery technology that plays a role in that product.
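To make the Saxon case concrete, a catalog file of that kind can be generated rather than typed by hand. The following is only a sketch: it assumes the catalog is a collection element wrapping one doc element per file, each with an href attribute, which is my understanding of the layout Saxon expects. The function name and the recipes directory are hypothetical.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def build_collection_catalog(xml_dir, pattern="*.xml"):
    """Return catalog XML listing every file in xml_dir that matches pattern."""
    lines = ["<collection>"]
    for path in sorted(Path(xml_dir).glob(pattern)):
        # Assumed layout: one doc element per document, addressed by file URI.
        uri = path.resolve().as_uri()
        lines.append(f"  <doc href={quoteattr(uri)}/>")
    lines.append("</collection>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "recipes" is a hypothetical directory holding the unzipped sample files.
    print(build_collection_catalog("recipes"), end="")
```

Saving the printed output to a file and passing that file's URL to collection() would match the Saxon approach described above. The other engines need their own loading steps.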
In this article, we'll see how to load a large amount of XML documents into each of these XQuery engines, how to run short interactive queries against them, and how to run an XQuery query stored in a file against the stored collection of XML documents.
The Task At Hand
How many is "a large amount of XML documents"? For my purposes, it's enough files that filling out a dialog box to load each one individually into a database (the first and sometimes only option properly documented by some XQuery engines) would be unreasonable. I want to create a script, perhaps by sending the output of a dir command on a Windows machine or an ls command on a Linux machine, through a bit of Perl or Python, and then I want to run that script to load those files, whether there are 15 or 1,500 of them.
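The script-generation idea can be sketched in a few lines of Python. Instead of piping dir or ls output, this version lists the directory itself and emits one load command per file. The command template is a placeholder of my own, not the syntax of any of the three products, and the function and directory names are likewise hypothetical.

```python
from pathlib import Path


def make_load_script(xml_dir, command_template, pattern="*.xml"):
    """Return a script with one loader command per matching file."""
    commands = []
    for path in sorted(Path(xml_dir).glob(pattern)):
        # Quote the path so file names containing spaces survive the shell.
        quoted = f'"{path}"'
        commands.append(command_template.format(path=quoted))
    if not commands:
        return ""
    return "\n".join(commands) + "\n"


if __name__ == "__main__":
    # "load-document" is a made-up command name; substitute whatever
    # command-line loader the chosen XQuery engine actually provides.
    print(make_load_script("recipes", "load-document {path}"), end="")
```

Redirecting the output to a batch file or shell script gives something that can be run the same way for 15 files or 1,500.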
For testing, I used the same source data that I used in my earlier XQuery articles: 291 files from Squirrel's RecipeML archive, which I cleaned up a bit (volunteer XML data entry isn't always as well-formed as you'd like) and put in a zip file for anyone who'd like to try it. The zip file also includes full versions of the scripts shown in this article.
After loading these data files, I wanted to try an interactive query to simulate ad-hoc use. I also ran XQuery queries stored in files against the loaded database, including one that has a parameter passed to it, because query files such as these will play an important role in a production system. Instead of making up new examples, I used queries from my earlier articles.
My code may not be the most efficient approach to using each product, but it works, and therefore provides a starting point. Because all three of the tested products are available in free versions, it should give users enough to start prototyping their ideas.
For XQuery engines, I chose three that offer free versions that let you get real work done with fairly arbitrary XML: MarkLogic Server release 3.0-6, eXist 1.0, and Sleepycat's Berkeley DB XML release 2.2.13. X-Hive DB's free version only works for 30 days, and DataDirect XQuery and IBM's support seem more geared toward using XQuery against data stored in relational databases. (XQuery's ability to query non-XML data along with XML is one of its strengths.)
I ran all of the tests on a Windows XP machine, but most of the steps should work on a Linux box as well.