Using Embedded XML Databases to Process Large Documents

October 22, 2003

Mark Wilcox

XML has been around for several years now and many, if not most, of you reading this article are using it in your job. You're most likely familiar with at least one of the standard APIs for XML processing -- DOM and SAX; if you're a Java programmer, you may also be familiar with JDOM.

The general consensus is that DOM is the easier of the two to program with, because it lets you get at elements and attributes without having to maintain processing state yourself, as you must with SAX.

However, when you are processing a large XML document (for example, one that is several megabytes in size), you often have to abandon DOM because of memory constraints. In that case you will probably turn to a SAX processor, which gives you greater control over memory consumption. But you pay a price for this control: SAX programming can be considerably more complex if you need to do a lot of processing based on parent-child element relationships.
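To make the trade-off concrete, here is a minimal sketch that solves the same small task both ways using only the parsers that ship with the JDK. The sample document, the class name `ParentChildDemo`, and the task itself (collect every `title` whose parent is a `book`) are made up for illustration and are not from the project described in this article.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ParentChildDemo {
    // Hypothetical sample data: we want the text of every <title> whose parent is <book>.
    static final String XML =
        "<library><book><title>A</title></book>"
      + "<magazine><title>B</title></magazine>"
      + "<book><title>C</title></book></library>";

    // DOM: the whole tree is in memory, so the parent is one method call away.
    static List<String> withDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        List<String> titles = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("title");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element title = (Element) nodes.item(i);
            if ("book".equals(title.getParentNode().getNodeName())) {
                titles.add(title.getTextContent());
            }
        }
        return titles;
    }

    // SAX: nothing is kept for us, so we track the open elements and buffer text ourselves.
    static List<String> withSax() throws Exception {
        List<String> titles = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();   // stack of currently open element names
        StringBuilder text = new StringBuilder();  // text seen since the last tag
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                open.push(qName);
                text.setLength(0);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                open.pop();  // after this, the top of the stack is the parent
                if ("title".equals(qName) && "book".equals(open.peek())) {
                    titles.add(text.toString());
                }
                text.setLength(0);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(XML)), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DOM: " + withDom());
        System.out.println("SAX: " + withSax());
    }
}
```

Both methods return the same titles, but the SAX version already needs a stack and a text buffer for a one-level parent check. That bookkeeping grows quickly once the logic depends on deeper or sibling relationships.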

I propose as an alternative the use of an embedded XML database so that you can continue to utilize DOM for processing but without eating all of your memory. Additional benefits include persistence and data manipulation capabilities of XML databases which can make future processing more efficient.

What is an XML Database?

An XML database is a database whose records are stored as XML data items, rather than as rows in a relational table or entries in a flat file. XML databases have gained popularity in the past couple of years as database vendors such as Oracle have added these capabilities to their baseline RDBMSes.

However, something like Oracle (or even MySQL) is a bit too much overhead for a large class of applications. This is where the concept of embedded databases can really help. Embedded databases are database software that provide many of the basic capabilities of their larger cousins, yet are small and simple enough to ship with your own application.

For the purposes of this article I will focus exclusively on one type of embedded database, the embedded XML database.

There are several open source XML databases, including Sleepycat's DB XML, Xindice (an ASF project) and my personal favorite, eXist. I became an eXist fan when I had a project that had to parse a several-megabyte XML file and transform it into another XML document. XSLT wasn't an option because I needed some complex logic that might have been possible with XSLT, but that I knew would be a lot easier in Java.

The file was too large for the JVM to process without either crashing the JVM or completely hosing my machine (a Dell D800 1.6 GHz P-M with 1 GB of RAM) if I tried to use DOM/JDOM. SAX programming drove me nuts trying to maintain the state machine.

I just felt there had to be an easier way. Thus, I took an hour to read the eXist docs, run some sample code, and then I was processing my doc very painlessly.

Here is an example that shows exactly how easy it is to embed eXist into your application.

First you must download and unzip the eXist distribution from http://exist.sourceforge.net/

Then in your application you import:

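import org.xmldb.api.*;          // DatabaseManager

import org.xmldb.api.base.*;     // Database, Collection, ResourceSet, ResourceIterator

import org.xmldb.api.modules.*;  // CollectionManagementService, XMLResource, XPathQueryService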

The next series of statements loads and registers the database driver, and looks very similar to JDBC setup.

String driver = "org.exist.xmldb.DatabaseImpl";

Class cl = Class.forName(driver);

Database database = (Database)cl.newInstance();

DatabaseManager.registerDatabase(database);

database.setProperty("create-database","true");

Then we open the database, create a collection, and store the XML document in it.

Collection root = DatabaseManager.getCollection("xmldb:exist:///db");

CollectionManagementService mgtService = 

 	(CollectionManagementService)root.getService("CollectionManagementService", "1.0");

Collection col = mgtService.createCollection("test");

XMLResource document = (XMLResource)col.createResource(null, "XMLResource");

String xml = readFile(file);

document.setContent(xml); // show using SAX handler

System.out.print("storing document " + document.getId() + "...");

col.storeResource(document);

System.out.println("ok.");

		 

col.setProperty("pretty","true");

col.setProperty("encoding","ISO-8859-1");
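The `readFile` call in the listing above is not defined in this article. As a rough sketch, it could be a helper that reads the whole file into a `String`. The class name, the `String` path parameter, and the UTF-8 encoding below are all assumptions; use whatever encoding your file actually declares.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadFileSketch {
    // Hypothetical stand-in for the article's readFile(file) helper.
    // Assumes `file` is a path and that the content is UTF-8 encoded.
    static String readFile(String file) throws IOException {
        return new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny throwaway document, then read it back through the helper.
        Path tmp = Files.createTempFile("sample", ".xml");
        Files.write(tmp, "<GROUP>demo</GROUP>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readFile(tmp.toString()));
        Files.delete(tmp);
    }
}
```

Note that a helper like this still holds the raw text of the document in memory once while it is being stored.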

Finally, we frame our query and process the results.

XPathQueryService service = (XPathQueryService)

	col.getService("XPathQueryService", "1.0");



String query  = "document(*)//GROUP[contains(.,'.200330')][.//SHORT[contains(.,'CUR') ]]";



ResourceSet result = service.query(query);

ResourceIterator i = result.getIterator();

while (i.hasMoreResources()) {

	XMLResource r = (XMLResource) i.nextResource();

	System.out.println(r.getId());

	SAXHandler handler = new SAXHandler(); // from JDOM

	r.getContentAsSAX(handler);

	Document doc = handler.getDocument(); // a JDOM Document

	...

}

In this instance I destroy the database when I am done. However, if I knew I might need to sort through the data later, even if only to transform the XML docs via XSLT, I could set the database up to be persistent, which would avoid spending the initial time on parsing and loading the data all over again.
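If you want to see what the two predicates in the query above select before loading anything into eXist, you can try them against a small document with the XPath engine that ships with the JDK. In this sketch only the element names `GROUP` and `SHORT` come from the query; the `ROOT` and `ID` elements, the values, and the class name are invented for illustration. The `document(*)` prefix is left out because the JDK engine evaluates the path against a single in-memory document rather than a collection.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class QueryPredicateDemo {
    // Hypothetical data: only the element names GROUP and SHORT come from the article.
    static final String XML =
        "<ROOT>"
      + "<GROUP><ID>MATH.200330</ID><SHORT>CUR</SHORT></GROUP>"   // passes both predicates
      + "<GROUP><ID>MATH.200330</ID><SHORT>OLD</SHORT></GROUP>"   // fails the SHORT predicate
      + "<GROUP><ID>MATH.200410</ID><SHORT>CUR</SHORT></GROUP>"   // fails the '.200330' predicate
      + "</ROOT>";

    // The article's query without the eXist-specific document(*) prefix.
    static final String QUERY =
        "//GROUP[contains(.,'.200330')][.//SHORT[contains(.,'CUR')]]";

    // Evaluates QUERY against XML and returns the number of matching GROUP elements.
    static int countMatches() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate(
            QUERY, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("matching GROUP elements: " + countMatches());
    }
}
```

The first predicate keeps a `GROUP` whose text contains `.200330` anywhere inside it, and the second keeps only those that also have a descendant `SHORT` containing `CUR`, so just the first `GROUP` in the sample qualifies.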

Conclusion

If you have to parse large XML documents, or simply need to sort or search through a large collection of XML documents, you should consider an embedded XML database to improve your application's performance and robustness and to shorten your development time.