Berkeley DB XML: An Embedded XML Database

May 7, 2003

Berkeley DB XML is an open source, embedded XML database created by Sleepycat Software. It's built on top of Berkeley DB, a "key-value" database which provides record storage and transaction management. Unlike relational databases, which store data in relational tables, Berkeley DB XML is designed to store arbitrary trees of XML data. These can then be matched and retrieved, either as complete documents or as fragments, via the XML query language XPath.

Berkeley DB XML is written in C++. APIs exist for C/C++, Java, Perl, Python, and Tcl, and more language interfaces are currently under development.

What is an XML Database Good For?

An XML database has several advantages over key-value, relational, and object-oriented databases:

XML data is dropped straight into the database; it does not need to be manipulated or extracted from a document in order to be stored.
When inserted into the database, most (in Berkeley DB XML, all) aspects of an XML document, including white space, are maintained exactly.
Queries return XML documents or fragments, which means that the hierarchical structure of XML information is maintained.

For the XML community, XML databases solve two specific problems:

It is prohibitively costly (in terms of memory and processor requirements) to build in-memory trees of very large documents and then query those trees. Anyone using XSLT or any DOM-aware application to process very large documents will run into this problem; a 100 megabyte file may require as much as 1.2 gigabytes of available memory. Berkeley DB XML can hold gigabytes of XML data and make all of it addressable via XPath queries.
Accessing one part of a document, à la XInclude, usually requires parsing an entire XML document before the requested fragment can be returned. For small XML documents, this is not a problem, but for large documents transmitted over a network, it's a waste of bandwidth and processor time. An XML database like Berkeley DB XML can find arbitrary sections of a stored XML document without any parsing. Wrapped in a web service, Berkeley DB XML could easily be programmed to provide "remote XInclude" functions via HTTP, which means that a language like XSLT, using its document() call, could easily fetch chunks of XML from a network at minimal cost in bandwidth and processing time.

In general, XML databases let programmers who already have XML data build data stores from it with a minimum of programming time, and they eliminate the need to convert XML into other data structures.

With several free and proprietary XML databases available, what is the specific advantage of Berkeley DB XML? According to John Merrells, a developer for Berkeley DB XML, "Application developers currently have the choice of storing their XML in the filesystem or in a remote database system. DB XML offers benefits over both in terms of reliability and performance. And by offering our product as a library that is linked into the application, we provide a lot of configurability."

Installing Berkeley DB XML

At this writing, DB XML is still in beta (version 1.0.11) and can only be downloaded by request; an official public release is expected in summer 2003. Binaries are available for Windows users, but Unix users will need to follow the detailed configure-build instructions included with the source. Installation requires that several external libraries be installed first, namely Apache Xerces-C++ and the Berkeley DB library.

High-level documentation is available with the package; the C++ and Java API documentation is currently fairly complete. Example code is available in the Unix download for each supported programming language; exploring that example code while reviewing the API documentation is the best way to become acquainted with DB XML.

Programming with Berkeley DB XML

Everything of interest in Berkeley DB XML happens inside a container. Whereas a relational database contains tables, a Berkeley DB XML container contains XML documents. In the Perl example below, we create a container with the prefix "/etc/xml/db" and populate it with a minimal XML document.

#!/usr/local/bin/perl

use Sleepycat::DbXml 'simple';

$xml_content="<section><title>Testing</title></section>";
$dbname="/etc/xml/db";

$container = new XmlContainer($dbname);
$container->open(Db::DB_CREATE);
$document = new XmlDocument;
$document->setContent($xml_content);
$container->putDocument($document);
$container->close();

This initializes a container, creates a variety of files beginning with the prefix db in the /etc/xml directory, holding the document as well as a data dictionary and document statistics, then closes the container. The container, which now holds the contents of the $xml_content variable, can now be reopened and queried and have other documents added to it.

Each document added to a container is assigned a unique numeric ID, which can be retrieved via the getID function in the XmlDocument class (as in $document->getID()).

Metadata

Metadata can be associated with individual documents inside of a container, which allows you to store information about a document without having to alter the content of the document itself. Conceptually, metadata is attached to the top-level element of a document, as if it were an attribute of that element.

$author = 'me@address.com';
$val = new XmlValue($author);
$document->setMetaData("http://mynamespace.org",
                       "ns", "author", $val);

Note that, throughout Berkeley DB XML, recasting values as an XmlValue is usually required; you can't simply drop in a scalar value and hope it works. Assuming the top-level element of your document is <section>, after setting metadata, you can now think of your document as

<section xmlns:ns="http://mynamespace.org"
         ns:author="me@address.com">...

Which means the document can now be retrieved via an XPath query like

/*[@ns:author='me@address.com']

Finally, the existence of metadata for a document in a container can be tested for with the getMetaData function.

$type = new XmlValue(XmlValue::STRING);
$exists = $document->getMetaData("http://mynamespace.org",
                                 "author", $type);

The getMetaData function returns a boolean (which would be stored, above, in $exists), evaluating to True if a given metadata attribute exists for a document.
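To make the "metadata as a namespaced attribute" picture concrete, here is a small standalone sketch. It is written in Python using only the standard library and does not touch the Berkeley DB XML API; it simply parses the conceptual document shown above and performs the same test that the XPath /*[@ns:author='me@address.com'] expresses. The name has_author and the sample document's <title> child are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://mynamespace.org"

# The conceptual view of a document after metadata has been set.
CONCEPTUAL_DOC = (
    '<section xmlns:ns="http://mynamespace.org" '
    'ns:author="me@address.com"><title>Testing</title></section>'
)


def has_author(xml_text, author):
    """Return True if the top-level element carries ns:author == author."""
    root = ET.fromstring(xml_text)
    # ElementTree stores namespaced attributes under "{uri}localname".
    return root.get("{%s}author" % NS_URI) == author


print(has_author(CONCEPTUAL_DOC, "me@address.com"))
```

A document with no such attribute on its top-level element simply fails the test, which is the behavior one would expect from the XPath predicate as well.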

Querying a Document

Once a container is opened, whether it has been added to or not, it can be queried with an XPath statement.

$result = $container->queryWithXPath("/sections/section/title");

In this example, adapted from the examples included with the Perl API documentation, all documents inside a container that match the XPath statement are returned completely along with their unique IDs.

$result = $container->queryWithXPath("/sections/section/title");
$value = new XmlValue ;
while($result->next($value)) {
  my $document = $value->asDocument();
  print $document->getID() . " = " . $value->asString() . "\n";
}

What is returned -- whether the whole document or the element data which matches the XPath query -- is based on the "context" in which the query is executed. In the above example, in the default context, if the document inside the container matched the XPath statement, the entire document would be returned, not just the title, which experienced XPath users might expect. To produce only the fragments which match the XPath expression, the context must be changed.

Indexing

Like relational databases, XML databases allow developers to indicate which data should be indexed for faster retrieval. Berkeley DB XML offers a single mechanism for indexing, which indexes XML data according to four characteristics:

Path Type
Node Type
Key Type
Syntax Type

The type of index is always of the form Path-Node-Key-Syntax. Thus a node-element-equality-string index on the <title> element would optimize XPath queries that look for elements whose title matches a given string (for example, //section[title='My Title']). Or, given content like

<section id="myid">
  <title>My Story</title>
</section>

then a node-attribute-equality-string index would serve a similar query, but one optimized for matching an element attribute. So, if the index were declared on the @id attribute, the performance of the XPath query //section[@id='myid'] would be greatly enhanced.
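Why an equality index helps can be seen in a toy model. The sketch below is plain Python using only the standard library; it is not how Berkeley DB XML implements indexing, only an illustration of the general idea: record, ahead of time, which documents contain which values, so that a query becomes a lookup rather than a scan of every document. The function name build_equality_index and the sample document IDs are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_equality_index(documents, tag):
    """Map each text value of <tag> to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # iter() walks the root and all descendants, so nesting depth is irrelevant.
        for element in root.iter(tag):
            index[element.text or ""].add(doc_id)
    return {value: sorted(ids) for value, ids in index.items()}


documents = {
    1: "<section id='myid'><title>My Story</title></section>",
    2: "<section id='other'><title>My Title</title></section>",
    3: "<sections><section><title>My Title</title></section></sections>",
}

title_index = build_equality_index(documents, "title")
# A single dictionary lookup replaces a scan of every stored document.
print(title_index.get("My Title", []))
```

The cost is paid when documents are added, since each one must be examined to update the index.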

Indexes are declared after a container is initialized and before any documents are added. An index declaration might take the form:

$container->declareIndex("", "title",
                         "node-element-equality-string");

This creates a container that is indexed on the XML element <title>. The different parts of the index type are explained below. The first argument, left blank (""), is reserved for the XML namespace of the node to be indexed; if it is blank, the default namespace is used.

Multiple indices can be declared per container. This approach allows for finer-grained control over indexing than other XML indexing schemes, such as XSLT key() functions.

The different kinds of indices are fully described in Berkeley DB XML's documentation, but briefly summarized, they are:

Path Type: If an XPath refers to deeply nested content -- for example, /section/author/address/street/apartment_number -- then edge indexing is better (like edge-element-equality-string). Otherwise, use node indexing (like node-element-equality-string).
Node Type: If the thing we're referring to is an element, use element indexing; if it's an attribute, use attribute.
Key Type: If testing the value of the element against a provided value (/section[title='Great Expectations']), use an equality index (like node-element-equality-string). If testing for the existence of an element, use a presence index (like node-element-presence), and if testing using the XPath contains() function (//section[contains(title, 'Expectations')]), use substring (like node-element-substring-string).
Syntax Type: Used to identify your data as string (like node-element-equality-string) or number (like node-element-equality-number); not necessary for presence tests.
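Because the index type is just four short words joined by hyphens, it is easy to mistype. The following standalone Python sketch (standard library only, independent of the Berkeley DB XML API) assembles the string from its parts and rejects anything not named in the summary above. The helper index_spec is hypothetical, and the rule that a presence index is written without a syntax part is an assumption drawn from the node-element-presence example; the library's own documentation is the authority on the exact spellings it accepts.

```python
PATH_TYPES = ("node", "edge")
NODE_TYPES = ("element", "attribute")
KEY_TYPES = ("equality", "presence", "substring")
SYNTAX_TYPES = ("string", "number")


def index_spec(path, node, key, syntax=None):
    """Build a Path-Node-Key-Syntax string, rejecting unknown parts."""
    for value, allowed, label in (
        (path, PATH_TYPES, "path type"),
        (node, NODE_TYPES, "node type"),
        (key, KEY_TYPES, "key type"),
    ):
        if value not in allowed:
            raise ValueError("unknown %s: %r" % (label, value))
    if key == "presence":
        # Assumption based on the examples above: presence indexes
        # are written without a syntax part (node-element-presence).
        if syntax is not None:
            raise ValueError("presence indexes take no syntax type")
        return "-".join((path, node, key))
    if syntax not in SYNTAX_TYPES:
        raise ValueError("unknown syntax type: %r" % (syntax,))
    return "-".join((path, node, key, syntax))


print(index_spec("node", "element", "equality", "string"))
print(index_spec("edge", "attribute", "presence"))
```

The resulting string is what would be passed as the last argument of an index declaration.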

Context

In Berkeley DB XML, all queries are executed within a particular context. The default context, for instance, returns whole documents if an XPath query is matched. If you wanted fragments to be returned, you would create a new XmlQueryContext object with the return values set to "ResultValues":

my $context = new XmlQueryContext(XmlQueryContext::ResultValues);

and issue your query with the context included:

my $context = new XmlQueryContext(XmlQueryContext::ResultValues);

my $results = $container->queryWithXPath('/sections/section/title',
                                         $context) ;
my $value = new XmlValue ;
$results->next($value);
print "First matching result: " . $value->asString() . "\n" ;

Which, if the initial XML document is

<sections>
  <section id="a">
    <title>A Section</title>
  </section>
</sections>

will print

First matching result: <title>A Section</title>.

So, while the version of that statement without a context would return the full text of any XML document inside a container that matched the XPath statement, the second version will return only the XML fragments that matched the XPath: namely, the <title> elements that are children of the <section> elements inside the top-level <sections> element.
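The difference between the two kinds of result can be modeled outside the database. The sketch below is standalone Python using only the standard library, not the Berkeley DB XML API; the function query and the constants RESULT_DOCUMENTS and RESULT_VALUES are invented stand-ins for the default context and the ResultValues context. Because the standard library's XPath subset does not accept absolute paths on an element, the root element name and the rest of the path are passed separately, which is a limitation of this model only.

```python
import xml.etree.ElementTree as ET

RESULT_DOCUMENTS = "documents"  # stand-in for the default context
RESULT_VALUES = "values"        # stand-in for a ResultValues context


def query(documents, root_tag, relative_path, result_type=RESULT_DOCUMENTS):
    """Return whole matching documents, or only the matching fragments."""
    results = []
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        if root.tag != root_tag:
            continue
        matches = root.findall(relative_path)
        if not matches:
            continue
        if result_type == RESULT_DOCUMENTS:
            results.append(xml_text)
        else:
            # tostring() also emits the whitespace that follows an element,
            # so strip it to keep just the fragment.
            results.extend(
                ET.tostring(match, encoding="unicode").strip()
                for match in matches
            )
    return results


doc = """<sections>
  <section id="a">
    <title>A Section</title>
  </section>
</sections>"""

print(query([doc], "sections", "section/title"))
print(query([doc], "sections", "section/title", RESULT_VALUES))
```

With the example document, the first call yields the entire document text and the second yields only the <title> fragment.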

Conclusions

Berkeley DB XML, even in beta, is a promising solution for XML storage and retrieval. According to Merrells, it is being evaluated by "several serious commercial enterprises." Based on Berkeley DB, it has a well-proven foundation for data storage, and Sleepycat's prior releases have proven the company to be a reliable provider of well-documented open source tools for data storage. Sleepycat allows for commercial licensing of its open source tools, which may make this solution attractive for corporations that are skittish about open source.

It is also worth noting that Berkeley DB XML users essentially get Berkeley DB "for free" with the product. In other words, it's easy to mix and match regular DB data sources with XML data sources. This combination may provide a strong alternative to relational and object-oriented databases.

The one major feature that is missing from the current version is an update facility. Currently, documents are atomic: they cannot be altered, only inserted or deleted in totality. According to Merrells, Sleepycat has tentative plans to support the XUpdate language. However, for many applications, the ability to query large databases of XML data is enough to make the tool useful, even if that data cannot be updated inside the database.

Another promising sign is the community arising around Berkeley DB XML. Over the last several months, when issues arose during testing, the community of developers was quick to help, and Sleepycat has consistently expressed a desire to hear about bugs and shown a willingness to respond to feature requests. Features currently under discussion include the aforementioned XUpdate support, as well as support for schema validation and performance improvements. In the short term, indexing is being rewritten so that indexes can be created on the fly, while a container is in use.

Since any data storage technology requires a significant investment in time and effort, this strong level of community and corporate support is encouraging. Berkeley DB XML, currently in its infancy, seems likely to be around for a long time, and by offering a standard embedded interface it may prove very useful for programmers in need of robust data storage who want to avoid the overhead of a relational database. The tool has some growing to do, but even in its current form many programmers will find that it offers a logical, powerful interface.

Postscript: Sample Code

A very simple piece of sample Perl code which creates and queries a database can be downloaded from XML.com.