Using Embedded XML Databases to Process Large Documents
by Mark Wilcox
October 22, 2003
XML has been around for several years now and many, if not most, of you reading this article are using it in your job. You're most likely familiar with the standard APIs for XML processing, DOM and SAX; if you're a Java programmer, you may also be familiar with JDOM.
The general consensus is that DOM is the easier of the two to program with, because it lets you get at elements and attributes without having to maintain processing state as you must with SAX.
However, when you are processing a large XML document (for example, one that is several megabytes in size), you often have to drop out of DOM due to memory constraints, in which case you probably use a SAX processor, which allows you greater control over memory consumption. But you pay a price for this control: SAX programming can be quite a bit more complex if you need to do a lot of processing based on parent-child element relationships.
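To make that price concrete, here is a minimal sketch of the kind of bookkeeping a SAX handler needs just to answer "is this element a direct child of that one?". It uses only the JDK's built-in SAX parser. The class name, method name, and the GROUP/SHORT element names are illustrative assumptions, not code from the project described in this article.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: collect the text of every <SHORT> element whose
// direct parent is a <GROUP> element.
class SaxStateSketch {

    static List<String> shortTextUnderGroup(String xml) {
        final List<String> found = new ArrayList<>();
        // SAX delivers a flat stream of events, so the handler has to
        // remember its own position in the tree.
        final Deque<String> path = new ArrayDeque<>();
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                path.push(qName);
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                path.pop();
                // After the pop, peek() is the parent of the element that just ended.
                if ("SHORT".equals(qName) && "GROUP".equals(path.peek())) {
                    found.add(text.toString().trim());
                }
                text.setLength(0);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return found;
    }
}
```

With DOM the same check is a single call to getParentNode(); with SAX, every additional relationship you care about tends to add more state of this kind.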
I propose as an alternative the use of an embedded XML database, so that you can continue to use DOM for processing without eating all of your memory. Additional benefits include the persistence and data manipulation capabilities of XML databases, which can make future processing more efficient.
What is an XML Database?
An XML database is a database that stores its records as XML data items, as opposed to rows in a relational table or lines in a flat file. XML databases have gained in popularity in the past couple of years as database vendors such as Oracle have added these capabilities to their baseline RDBMSes.
However, something like Oracle (or even MySQL) is a bit too much overhead for a large class of applications. This is where the concept of embedded databases can really help. An embedded database provides many of the basic capabilities of its larger cousins, yet is small and simple enough to ship with your own application.
For the purposes of this article I will focus exclusively on one type of embedded database, the embedded XML database.
There are several open source XML databases, including Sleepycat's DB XML, Xindice (an ASF project) and my personal favorite, eXist. I became an eXist fan when I had a project that had to parse a several megabyte XML file and transform it into another XML document. XSLT wasn't an option because I needed to do some complex logic that might have been possible with XSLT, but I knew it would be a lot easier with Java.
The file was too large for the JVM to process without either crashing the JVM or completely hosing my machine (a Dell D800 1.6 GHz P-M with 1 GB of RAM) if I tried to use DOM/JDOM. SAX programming drove me nuts trying to maintain the state machine.
I just felt there had to be an easier way. Thus, I took an hour to read the eXist docs, run some sample code, and then I was processing my doc very painlessly.
Here is an example that shows exactly how easy it is to embed eXist into your application.
First you must download and unzip the eXist distribution from http://exist.sourceforge.net/
Then in your application you import:
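import org.xmldb.api.*;         // DatabaseManager
import org.xmldb.api.base.*;    // Database, Collection, ResourceSet, ResourceIterator
import org.xmldb.api.modules.*; // XMLResource, XPathQueryService, CollectionManagementService
The next series of statements loads and registers the database driver, and looks very similar to JDBC setup.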
String driver = "org.exist.xmldb.DatabaseImpl";
Class cl = Class.forName(driver);
Database database = (Database)cl.newInstance();
DatabaseManager.registerDatabase(database);
database.setProperty("create-database","true");
Then we get the root collection, create a collection named test in it, and store the XML document there.
Collection root = DatabaseManager.getCollection("xmldb:exist:///db");
CollectionManagementService mgtService =
(CollectionManagementService)root.getService("CollectionManagementService", "1.0");
Collection col = mgtService.createCollection("test");
XMLResource document = (XMLResource)col.createResource(null, "XMLResource");
String xml = readFile(file);
document.setContent(xml); // show using SAX handler
System.out.print("storing document " + document.getId() + "...");
col.storeResource(document);
System.out.println("ok.");
col.setProperty("pretty","true");
col.setProperty("encoding","ISO-8859-1");
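The readFile helper called in the listing above is not shown in the article. A minimal sketch of what it might look like follows, assuming the file name is passed as a String; the class name and the explicit charset parameter are assumptions added here for illustration. Note that this still reads the entire document into memory once as a String before it is handed to the database.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: one possible shape for the article's readFile(file).
class ReadFileSketch {

    // Reads the whole file and decodes it with the given charset.
    static String readFile(String file, Charset charset) {
        try {
            return new String(Files.readAllBytes(Paths.get(file)), charset);
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need a try/catch.
            throw new UncheckedIOException(e);
        }
    }
}
```

A call might look like readFile("input.xml", StandardCharsets.ISO_8859_1), where the file name is a placeholder and the charset should match the actual encoding of the file.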
Finally, we frame our query and process the results.
XPathQueryService service = (XPathQueryService)
    col.getService("XPathQueryService", "1.0");
String query = "document(*)//GROUP[contains(.,'.200330')][.//SHORT[contains(.,'CUR') ]]";
ResourceSet result = service.query(query);
ResourceIterator i = result.getIterator();
while(i.hasMoreResources()) {
XMLResource r = (XMLResource) i.nextResource();
System.out.println(r.getId());
SAXHandler handler = new SAXHandler(); // from jdom
r.getContentAsSAX(handler);
Document doc = handler.getDocument(); // JDOM Document
...
}
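To see what the two predicates in that query select, the sketch below runs the same path expression with the JDK's built-in XPath engine against a small in-memory document. This is not the eXist API: the document(*) prefix from the query above is dropped because it is not a standard XPath 1.0 function, and the class name, the id attribute, and the sample data in the usage note are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example: evaluate the article's predicates with javax.xml.xpath.
class XPathPredicateSketch {

    // First predicate: the GROUP's text contains ".200330".
    // Second predicate: some descendant SHORT element contains "CUR".
    static final String QUERY =
            "//GROUP[contains(.,'.200330')][.//SHORT[contains(.,'CUR')]]";

    // Returns the id attribute of every GROUP element that matches QUERY.
    static List<String> matchingGroupIds(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(QUERY, new InputSource(new StringReader(xml)),
                            XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad query or document", e);
        }
    }
}
```

Given three GROUP elements where only the first both mentions ".200330" and has a SHORT child containing "CUR", only that first element is returned. In the embedded database, each such match comes back as its own small resource, so only the matching fragments need to be turned into JDOM documents.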
In this instance I destroy the database when I am done; however, if I knew I might need to sort through the data later, even if only to transform the XML docs via XSLT, I could set the database up to be persistent, to minimize the time spent parsing and loading data into the database again.
Conclusion
If you have to parse large XML documents or simply need to sort or search through a large collection of XML documents, you should consider implementing an embedded XML database to improve your application's performance and robustness and to shorten your development time.
What's your favorite way to handle large XML documents? Share your hints and tips in the forum.
- A better example
2007-03-12 04:51:31 stuarty
A better example of how to do this (and a maintained one too) can be found in the exist-db docs at:
http://exist.sourceforge.net/deployment.html#N1042E
- Removing the nodes returned by a query
2006-04-17 07:19:20 wizard_romania
I am developing a Java application that manages an XML database stored on an eXist servlet (which runs on a Tomcat server). I am new to eXist...
I want to remove some XML elements from one of the stored XML files. A ResourceSet object containing these nodes is returned by running a query. How can I delete these nodes from my document? I could browse the DOM tree corresponding to the XML doc and delete the nodes that equal a node in the ResourceSet, but this would slow the application...
I've tried the following:
Element root=... //the root element of my xml doc
String query=... //the XPath query string
XPathQueryService service = (XPathQueryService) collection.getService("XPathQueryService","1.0");
service.setProperty("indent","yes");
ResourceSet results = service.query(query);
ResourceIterator i = results.getIterator();
while (i.hasMoreResources()) {
XMLResource r = (XMLResource) i.nextResource();
root.removeChild(r.getContentAsDOM());
}
but I got the following exception: "NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist."
Some method probably returns a shallow copy of each node... What should I do?
- Need help
2005-07-12 16:01:49 jennyla
Do you think it is possible to use eXist with a > 3GB db? I followed your tutorial, then used a SAX transform to get the data out, but it seems so slow. Can you provide a more detailed example with JDOM? Thank you.
- problem about driver
2005-01-27 17:54:15 PengfeiLi
Nice article, and helpful to me, I think, and thank you. But I met some problems when I tried to run a simple program.
I used the following statements to register a driver:
String driver="org.exist.xmldb.DatabaseImpl";
Class cl = Class.forName(driver);
Database database = (Database)cl.newInstance();
DatabaseManager.registerDatabase(database);
But it reports errors:
java.lang.NoClassDefFoundError
at org.exist.xmldb.DatabaseImpl.
Why so? What mistake did I make? Could you do me a favour? Thank you very much!
- problem about driver
2005-02-18 02:13:53 antonio25
You have to add the following libraries:
exist.jar
xindice-1.1b4.jar
xmldb.jar
xmlrpc-1.2-b1.jar
commons-pool-1.2.jar
from apache.
Regards Tony
- problem about driver
2005-05-06 02:24:57 PengfeiLi
Thank you very much! I have solved this problem.
- I do not get the point
2003-12-20 09:57:11 ronald ploeger
Nice example for using a native XML database. But at the start you were talking about DOM/JDOM using too much memory for the processing you wanted to do in Java. Then you store the XML in the DB and retrieve it later as a JDOM document. I assume you then did your processing. How can that save you memory? Or did you, in your specific case, only retrieve part of the document (by limiting it with XPath) as opposed to the whole document?
- MEMORY USAGE
2003-11-06 09:11:21 Julian Turner
My programming skills are limited to JScript and DHTML.
I have been using some 3MB plus XML files in a project of mine, which have had memory problems.
This article is of interest, and I wonder if there are any such databases which can be used as an ActiveX control in JScript?
My alternative solution was to write my own JScript slimline parser, which only uses memory as needed. Essentially I still need to load the 3MB text file into memory (which is a big shortcoming), but instead of parsing the file, my firstChild etc methods, search the text string, and return a node which simply stores the start and end points of the node in the XML text string, and methods such as insertBefore splice the main XML text string. Since my project does not demand that I have a large number of nodes in existence at any one time, or major node processing, memory usage is confined to the string, and those light nodes which I do call up.
- MEMORY USAGE
2003-11-10 15:31:12 Mark Wilcox
Knowing Javascript and DHTML is quite a task, if perhaps not always the best solution to all problems.
Unfortunately, XML databases and specifically embedded XML are so new, that I don't know of any out there that can be embedded this way.
I think you're probably better off with something on the server that you can reference via a URL or XPath query embedded in a post response.
- The org package
2003-10-30 22:51:34 Lesego Raselemane
Hi
I don't know if this is the right place for me to ask my question. My question is which .jar file should I use to be able to use org.xml.sax.driver?
Regards
- The org package
2003-11-10 15:22:33 Mark Wilcox
Sorry for the late reply. This package should be in the Apache Xerces package (and probably in the Sun XML packages as well).
