
XML has been around for several years now, and many, if not most, of you reading this article use it in your job. You are most likely familiar with one of the standard APIs for XML processing, DOM or SAX; if you are a Java programmer, you may also be familiar with JDOM.

The general consensus is that DOM is easier to program with, because it lets you get at elements and attributes without having to maintain processing state yourself, as you must with SAX.

However, when you are processing a large XML document (for example, one that is several megabytes in size), you often have to drop out of DOM because of memory constraints. In that case you probably fall back on a SAX processor, which gives you greater control over memory consumption. But you pay a price for this control: SAX programming can be considerably more complex if you need to do a lot of processing based on parent-child element relationships.
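
To make the bookkeeping concrete, here is a minimal sketch of the state a SAX handler has to carry just to answer "is this SHORT element a direct child of a GROUP?". It uses only the JDK's built-in SAX parser. The class and method names are made up for illustration, and the element names GROUP and SHORT are borrowed from the query later in this article; the sample input is invented.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxStateSketch {

    /** Collects the text of every SHORT element whose direct parent is a GROUP. */
    static class ShortInGroupHandler extends DefaultHandler {
        // Names of the currently open elements, innermost first.
        private final Deque<String> open = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();
        // Stack depth at which the captured element started, or -1 when idle.
        private int captureDepth = -1;
        final List<String> found = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Before the push, the top of the stack is the parent element.
            if (captureDepth < 0 && "SHORT".equals(qName) && "GROUP".equals(open.peek())) {
                captureDepth = open.size();
                text.setLength(0);
            }
            open.push(qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (captureDepth >= 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            open.pop();
            // Back at the starting depth means the captured element just closed.
            if (captureDepth == open.size()) {
                found.add(text.toString().trim());
                captureDepth = -1;
            }
        }
    }

    /** Parses the given XML text and returns the captured SHORT values. */
    public static List<String> shortsInGroups(String xml) {
        try {
            ShortInGroupHandler handler = new ShortInGroupHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.found;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<DATA><GROUP><SHORT>CUR</SHORT></GROUP><SHORT>skip</SHORT></DATA>";
        System.out.println(shortsInGroups(xml));
    }
}
```

With DOM the same question is a single parent lookup on the node. With SAX the element stack, the depth marker, and the text buffer all have to be maintained by hand, and every additional parent-child rule adds more of this state.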

I propose an alternative: use an embedded XML database, so that you can keep using DOM for processing without eating all of your memory. Additional benefits include the persistence and data manipulation capabilities of XML databases, which can make future processing more efficient.

What is an XML Database?

An XML database is a database that stores its records as XML data items, as opposed to rows in a relational table or lines in a flat file. XML databases have gained popularity in the past couple of years as database vendors such as Oracle have added XML capabilities to their baseline RDBMSes.

However, something like Oracle (or even MySQL) is a bit too much overhead for a large class of applications. This is where the concept of embedded databases can really help. Embedded databases are database software that provide many of the basic capabilities of their larger cousins, yet are small and simple enough to ship with your own application.

For the purposes of this article I will focus exclusively on one type of embedded database, the embedded XML database.

There are several open source XML databases, including Sleepycat's DB XML, Xindice (an ASF project) and my personal favorite, eXist. I became an eXist fan when I had a project that had to parse a several-megabyte XML file and transform it into another XML document. XSLT wasn't a practical option: the complex logic I needed might have been possible in XSLT, but I knew it would be a lot easier in Java.

When I tried DOM/JDOM, the file was too large to process without either crashing the JVM or completely hosing my machine (a Dell D800 1.6 GHz P-M with 1 GB of RAM). SAX programming drove me nuts as I tried to maintain the state machine.

I just felt there had to be an easier way. Thus, I took an hour to read the eXist docs, run some sample code, and then I was processing my doc very painlessly.

Here is an example that shows how easy it is to embed eXist in your application.

First you must download and unzip the eXist distribution from http://exist.sourceforge.net/

Then in your application you import:

import org.xmldb.api.*;
import org.xmldb.api.base.*;
import org.xmldb.api.modules.*;

The next series of statements loads and registers the database driver, and looks very similar to JDBC setup.

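// Load the eXist driver class by name and instantiate it.
String driver = "org.exist.xmldb.DatabaseImpl";
Class cl = Class.forName(driver);
Database database = (Database)cl.newInstance();
// Register the driver with the XML:DB DatabaseManager.
DatabaseManager.registerDatabase(database);
// Ask the driver to create the embedded database.
database.setProperty("create-database","true");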

Then we create a collection in the database and store the XML document in it.

// Get the root collection of the embedded database.
Collection root = DatabaseManager.getCollection("xmldb:exist:///db");
CollectionManagementService mgtService =
	(CollectionManagementService)root.getService("CollectionManagementService", "1.0");
// Create a child collection named "test" to hold the document.
Collection col = mgtService.createCollection("test");
// Passing null as the id lets the database generate one.
XMLResource document = (XMLResource)col.createResource(null, "XMLResource");
// readFile is a helper (not shown) that returns the file's contents as a String.
String xml = readFile(file);
document.setContent(xml); // show using SAX handler
System.out.print("storing document " + document.getId() + "...");
col.storeResource(document);
System.out.println("ok.");

// Output settings for the collection.
col.setProperty("pretty","true");
col.setProperty("encoding","ISO-8859-1");
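
The readFile helper called above is not shown in the article. A minimal sketch of what such a helper might look like follows, using only the JDK. The class name, the two-argument signature, and the choice of charset are assumptions made for illustration, not the author's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadFileSketch {

    /** Hypothetical stand-in for the article's readFile helper. */
    public static String readFile(Path file, Charset charset) {
        try {
            // Reads every byte up front, so the full text sits in memory at once.
            return new String(Files.readAllBytes(file), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file, read it back, and clean up.
        Path tmp = Files.createTempFile("sample", ".xml");
        Files.write(tmp, "<GROUP><SHORT>CUR</SHORT></GROUP>"
                .getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readFile(tmp, StandardCharsets.ISO_8859_1));
        Files.delete(tmp);
    }
}
```

Note that a helper like this holds the whole document in memory as a String while it is being stored. That is still far lighter than a full DOM tree of the same document, but it is worth keeping in mind for very large files.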

Finally, we frame our query and process the results.

XPathQueryService service = (XPathQueryService)
	col.getService("XPathQueryService", "1.0");

String query = "document(*)//GROUP[contains(.,'.200330')][.//SHORT[contains(.,'CUR')]]";

ResourceSet result = service.query(query);
ResourceIterator i = result.getIterator();
while (i.hasMoreResources()) {
	XMLResource r = (XMLResource) i.nextResource();
	System.out.println(r.getId());
	SAXHandler handler = new SAXHandler(); // from JDOM
	r.getContentAsSAX(handler);
	Document doc = handler.getDocument(); // JDOM Document
	...
}
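
To see what the two predicates in that query select, the same path expression (without the eXist-specific document(*) prefix) can be tried against a small in-memory document using the JDK's own XPath engine. This is only a sketch for experimenting with the expression: the class name, the sample data, and the ID element are invented, and it does not touch eXist at all.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathPredicateSketch {

    // The article's path expression, minus the eXist-specific document(*) prefix.
    static final String QUERY =
            "//GROUP[contains(.,'.200330')][.//SHORT[contains(.,'CUR')]]";

    /** Returns how many GROUP elements in the given XML satisfy both predicates. */
    public static int countMatches(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    QUERY, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample data: only the first GROUP passes both predicates.
        String xml =
                "<DATA>"
              + "<GROUP><ID>A.200330</ID><SHORT>CUR-USD</SHORT></GROUP>"
              + "<GROUP><ID>B.200330</ID><SHORT>VOL</SHORT></GROUP>"
              + "<GROUP><ID>C.200331</ID><SHORT>CUR-EUR</SHORT></GROUP>"
              + "</DATA>";
        System.out.println(countMatches(xml));
    }
}
```

The first predicate keeps GROUP elements whose text contains ".200330", and the second keeps only those that also have a SHORT descendant containing "CUR". Because the database applies this filter before anything is handed back, only the matching GROUP fragments need to be turned into JDOM documents, not the whole file.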

In this instance I destroy the database; however, if you know you might need to search through the data again later, even if only to transform the XML documents via XSLT, you can make the database persistent, which avoids repeating the initial cost of parsing and loading the data into the database.

Conclusion

If you have to parse large XML documents, or simply need to sort or search through a large collection of XML documents, you should consider an embedded XML database to improve your application's performance and robustness and to shorten your development time.


Comments
  • A better example
    2007-03-12 04:51:31 stuarty

    A better example of how to do this (and a maintained one too) can be found in the exist-db docs at:


    http://exist.sourceforge.net/deployment.html#N1042E

  • Need help
    2005-07-12 16:01:49 jennyla

    Do you think it is possible to use eXist with a database larger than 3 GB? I followed your tutorial and then used a SAX transform to get the data out, but it seems very slow. Can you provide a more detailed example with JDOM? Thank you.

  • problem about driver
    2005-01-27 17:54:15 PengfeiLi

    Nice article, and helpful to me; thank you. But I ran into a problem when I tried to run a simple program.
    I used the following statements to register a driver:
    String driver="org.exist.xmldb.DatabaseImpl";
    Class cl = Class.forName(driver);
    Database database = (Database)cl.newInstance();
    DatabaseManager.registerDatabase(database);
    But it reports an error:
    java.lang.NoClassDefFoundError
    at org.exist.xmldb.DatabaseImpl.
    Why is that? What mistake did I make? Could you help me? Thank you very much!

    • problem about driver
      2005-02-18 02:13:53 antonio25

      You have to add the following libraries:


      exist.jar
      xindice-1.1b4.jar
      xmldb.jar
      xmlrpc-1.2-b1.jar
      commons-pool-1.2.jar


      from apache.


      Regards Tony

  • I do not get the point
    2003-12-20 09:57:11 ronald ploeger

    Nice example of using a native XML database. But at the start you were talking about DOM/JDOM using too much memory for the processing you wanted to do in Java. Then you store the XML in the database and retrieve it later as a JDOM document. I assume that is when you did your processing. How can that save you memory? Or did you, in your specific case, retrieve only part of the document (by limiting it with XPath) as opposed to the whole document?

  • MEMORY USAGE
    2003-11-06 09:11:21 Julian Turner

    My programming skills are limited to JScript and DHTML.


    I have been using some XML files of 3 MB and more in a project of mine, and they have caused memory problems.


    This article is of interest, and I wonder if there are any such databases which can be used as an ActiveX control in JScript?


    My alternative solution was to write my own slimline JScript parser, which only uses memory as needed. Essentially I still need to load the 3 MB text file into memory (which is a big shortcoming), but instead of parsing the file, my firstChild etc. methods search the text string and return a node that simply stores the start and end points of the node in the XML text string, and methods such as insertBefore splice the main XML text string. Since my project does not demand a large number of nodes in existence at any one time, or major node processing, memory usage is confined to the string and to those light nodes that I do call up.





    • MEMORY USAGE
      2003-11-10 15:31:12 Mark Wilcox

      Knowing Javascript and DHTML is quite a task, if perhaps not always the best solution to all problems.


      Unfortunately, XML databases, and specifically embedded XML databases, are so new that I don't know of any out there that can be embedded this way.


      I think you're probably better off with something on the server that you can reference via a URL or XPath query embedded in a post response.

  • The org package
    2003-10-30 22:51:34 Lesego Raselemane

    Hi
    I don't know if this is the right place for me to ask my question. My question is which .jar file should I use to be able to use org.xml.sax.driver?
    Regards


    • The org package
      2003-11-10 15:22:33 Mark Wilcox

      Sorry for the late reply. This package should be in the Apache Xerces package (and probably in the Sun XML packages as well).