Berkeley DB XML: An Embedded XML Database
Berkeley DB XML is an open source, embedded XML database created by Sleepycat Software. It's built on top of Berkeley DB, a "key-value" database which provides record storage and transaction management. Unlike relational databases, which store data in relational tables, Berkeley DB XML is designed to store arbitrary trees of XML data. These can then be matched and retrieved, either as complete documents or as fragments, via the XML query language XPath.
Berkeley DB XML is written in C++. APIs exist for C/C++, Java, Perl, Python, and Tcl, and more language interfaces are under development.
An XML database has several advantages over key-value, relational, and object-oriented databases:
For the XML community, XML databases solve two specific problems:
In general, XML databases let programmers who already have XML data build data stores from it quickly, with minimal programming effort and without converting the XML into other data structures.
With several free and proprietary XML databases available, what is the specific advantage of Berkeley DB XML? According to John Merrells, a developer for Berkeley DB XML, "Application developers currently have the choice of storing their XML in the filesystem or in a remote database system. DB XML offers benefits over both in terms of reliability and performance. And by offering our product as a library that is linked into the application, we provide a lot of configurability."
At this writing, DB XML is still in beta (version 1.0.11) and can be downloaded only by request; an official release is expected in summer 2003. Binaries are available for Windows users, but Unix users will need to follow the detailed configure-and-build instructions included with the source. Several external libraries must be installed first, namely Apache Xerces-C++ and the Berkeley DB library.
High-level documentation is available with the package, and the C++ and Java API documentation is fairly complete. Example code is available in the Unix download for each supported programming language; exploring that example code while reviewing the API documentation is the best way to become acquainted with DB XML.
Everything of interest in Berkeley DB XML happens inside a container. Whereas a relational database contains tables, a Berkeley DB XML container contains XML documents. In the Perl example below, we create a container with the prefix "/etc/xml/db" and populate it with a minimal XML document.
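#!/usr/local/bin/perl
use Sleepycat::DbXml 'simple';
# A minimal XML document to store
$xml_content="<section><title>Testing</title></section>";
# Path prefix for the files that make up the container
$dbname="/etc/xml/db";
# Open the container, creating it if it does not exist yet
$container = new XmlContainer($dbname);
$container->open(Db::DB_CREATE);
# Wrap the XML in a document object and add it to the container
$document = new XmlDocument;
$document->setContent($xml_content);
$container->putDocument($document);
$container->close();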
This initializes a container, creates a variety of files beginning with
the prefix db in the /etc/xml directory, holding the document as well as a
data dictionary and document statistics, then closes the container. The
container, which now holds the contents of the $xml_content
variable, can now be reopened and queried and have other documents added
to it.
Each document added to a container is assigned a unique numeric ID,
which can be retrieved via the getID function in the XmlDocument
class (as in $document->getID()).
Metadata can be associated with individual documents inside of a container, which allows you to store information about a document without having to alter the content of the document itself. Metadata is stored inside the top-level element of a document.
$author = 'me@address.com';
$val = new XmlValue($author);
$document->setMetaData("http://mynamespace.org",
"ns", "author", $val);
Note that, throughout Berkeley DB XML, recasting values as an
XmlValue is usually required; you can't simply drop in a scalar
value and hope it works. Assuming the top-level element of your document
is <section>, after setting metadata, you can now think of
your document as
<section xmlns:ns="http://mynamespace.org" ns:author="me@address.com">...
Which means the document can now be retrieved via an XPath query like
/*[@ns:author='me@address.com']
Finally, the existence of metadata for a document in a container can be
tested for with the getMetaData function.
$type = new XmlValue(XmlValue::STRING);
$exists = $document->getMetaData("http://mynamespace.org",
"author", $type);
The getMetaData function returns a boolean (which would be
stored, above, in $exists), evaluating to True
if a given metadata attribute exists for a document.
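The idea that metadata behaves like a namespaced attribute on the top-level element can be sketched outside of DB XML. The following is only an illustration in Python's standard library, not the DB XML API; the helper names `with_metadata` and `has_metadata` are hypothetical, and the namespace URI and author value are the ones used in the example above.

```python
import xml.etree.ElementTree as ET

NS = "http://mynamespace.org"  # namespace URI from the example above

def with_metadata(xml_text, name, value):
    """Parse a document and attach metadata as a namespaced root attribute."""
    root = ET.fromstring(xml_text)
    root.set("{%s}%s" % (NS, name), value)
    return root

def has_metadata(root, name, value):
    """Rough analogue of /*[@ns:name='value']: inspect the top-level element only."""
    return root.get("{%s}%s" % (NS, name)) == value

doc = with_metadata("<section><title>Testing</title></section>",
                    "author", "me@address.com")
print(has_metadata(doc, "author", "me@address.com"))
print(has_metadata(doc, "author", "someone@else.org"))
```

The element content is left untouched in this sketch; only the top-level element gains an attribute, which mirrors the way the article suggests thinking about metadata.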
Once a container is opened, whether it has been added to or not, it can be queried with an XPath statement.
$result = $container->queryWithXPath("/sections/section/title");
In the following example, adapted from the examples included with the Perl API documentation, every document in the container that matches the XPath statement is returned in full, along with its unique ID.
$result = $container->queryWithXPath("/sections/section/title");
$value = new XmlValue ;
while($result->next($value)) {
my $document = $value->asDocument();
print $document->getID() . " = " . $value->asString() . "\n";
}
What is returned -- whether the whole document or the element data
which matches the XPath query -- is based on the "context" in which the
query is executed. In the above example, in the default context,
if the document inside the container matched the XPath statement, the
entire document would be returned, not just the title, which
experienced XPath users might expect. To produce only the fragments which
match the XPath expression, the context must be changed.
Like relational databases, XML databases allow developers to indicate which data should be indexed for faster retrieval. Berkeley DB XML offers a single mechanism for indexing, which indexes XML data according to four characteristics:
The type of index is always of the form
Path-Node-Key-Syntax. Thus a
node-element-equality-string index on the
<title> element would optimize XPath queries for all
elements with a title that matched a given string
(e.g. //section[title='My Title']). Or, given content
like
<section id="myid"> <title>My Story</title> </section>
then node-attribute-equality-string would perform a
similar query, but optimized for matching an element
attribute. So, if the index is declared on the @id
attribute, the performance of the XPath query
//section[@id='myid'] will be greatly improved.
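Because every index type is a hyphen-joined string of up to four parts, the naming scheme can be checked mechanically. The helper below is a hypothetical Python sketch, not part of DB XML; the allowed values are limited to the ones that appear in this article, and the real library may accept others.

```python
# Values that appear in this article; the actual library may accept more.
PATHS = {"node", "edge"}
NODES = {"element", "attribute"}
KEYS = {"equality", "presence", "substring"}
SYNTAXES = {"string", "number"}

def index_name(path, node, key, syntax=None):
    """Build an index type string of the form Path-Node-Key[-Syntax]."""
    if path not in PATHS or node not in NODES or key not in KEYS:
        raise ValueError("unknown index component")
    parts = [path, node, key]
    # Presence indexes in the article carry no syntax part.
    if syntax is not None:
        if syntax not in SYNTAXES:
            raise ValueError("unknown syntax")
        parts.append(syntax)
    return "-".join(parts)

print(index_name("node", "element", "equality", "string"))
print(index_name("node", "attribute", "equality", "string"))
print(index_name("node", "element", "presence"))
```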
Indexes are declared after a container is initialized and before any documents are added. An index declaration might take the form:
$container->declareIndex("", "title",
"node-element-equality-string");
This creates a container that is indexed on the XML element
<title>. The different parts of the index name are explained below. The
first argument passed to the method, left blank (""), is
reserved for the XML namespace of the node to be indexed; if it's blank,
the default namespace is used.
Multiple indices can be declared per container. This approach allows
for finer-grained control over indexing than other XML indexing schemes,
such as XSLT key() functions.
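To see why an equality index helps a query such as //section[title='My Title'], consider a toy version: a lookup table from element text to the IDs of the documents that contain it. This Python sketch only illustrates the idea and makes no claim about how Berkeley DB XML stores its indexes; the sample documents, their IDs, and the function name `build_equality_index` are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical container contents, keyed by document ID.
docs = {
    1: "<section id='a'><title>My Title</title></section>",
    2: "<section id='b'><title>Other</title></section>",
}

def build_equality_index(documents, element_name):
    """Map each distinct text value of element_name to the set of matching doc IDs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        root = ET.fromstring(text)
        for elem in root.iter(element_name):
            index[(elem.text or "").strip()].add(doc_id)
    return index

title_index = build_equality_index(docs, "title")

# With the index built, an equality test is a single lookup
# instead of a scan over every document.
print(sorted(title_index.get("My Title", ())))  # prints [1]
```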
The different kinds of indices are fully described in Berkeley DB XML's documentation, but briefly summarized, they are:
Path (node or edge): if queries follow a long, specific path -- such as /section/author/address/street/apartment_number -- then edge indexing is better (like edge-element-equality-string). Otherwise, use node indexing (like node-element-equality-string).
Node (element or attribute): if the node being indexed is an element, use element indexing; if it's an attribute, use attribute.
Key (equality, presence, or substring): if testing for equality (/section[title='Great Expectations']), use an equality index (like node-element-equality-string). If testing for the existence of an element, use a presence index (like node-element-presence), and if testing using the XPath contains() function (//section[contains(title, 'Expectations')]), use substring (like node-element-substring-string).
Syntax (string or number): string (like node-element-equality-string) or number (like node-element-equality-number); not necessary for presence tests.
In Berkeley DB XML, all queries are executed within a particular context. The default context, for instance, returns whole documents if an XPath query is matched. If you wanted fragments to be returned, you would create a new XmlQueryContext object with the return values set to "ResultValues":
my $context = new XmlQueryContext(XmlQueryContext::ResultValues);
and issue your query with the context included:
my $context = new XmlQueryContext(XmlQueryContext::ResultValues);
my $results = $container->queryWithXPath('/sections/section/title',
$context) ;
my $value = new XmlValue ;
$results->next($value);
print "First matching result: " . $value->asString() . "\n" ;
Which, if the initial XML document is
<sections>
<section id="a">
<title>A Section</title>
</section>
</sections>
will print
First matching result: <title>A Section</title>.
So, while the version of that statement without a context
would return the full text of any XML document inside a container that
matched the XPath statement, the second version will return only
the XML fragments that matched the XPath -- namely, the
<title> elements that are direct children of the <section> elements
inside the top-level <sections> element.
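The difference between the two result modes can be mimicked with a few lines of Python's standard library. This is a sketch of the behavior described above, not of the DB XML API: the `query` function and its `return_values` flag are hypothetical stand-ins for a query run with and without a ResultValues context. Note that ElementTree paths are relative to the root element, so /sections/section/title is written here as section/title.

```python
import xml.etree.ElementTree as ET

# Hypothetical container contents, keyed by document ID.
container = {
    1: '<sections><section id="a"><title>A Section</title></section></sections>',
    2: '<sections><section id="b"/></sections>',
}

def query(documents, path, return_values=False):
    """Return whole matching documents, or only the matching fragments."""
    results = []
    for doc_id, text in documents.items():
        matches = ET.fromstring(text).findall(path)
        if not matches:
            continue  # this document does not match the path at all
        if return_values:
            # Fragment mode: one result per matching element
            results.extend(ET.tostring(m, encoding="unicode") for m in matches)
        else:
            # Document mode: the whole document, once
            results.append(text)
    return results

print(query(container, "section/title"))
print(query(container, "section/title", return_values=True))
```

The first call yields the complete text of document 1; the second yields only its <title> element.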
Berkeley DB XML, even in beta, is a promising solution for XML storage and retrieval. According to Merrells, it is being evaluated by "several serious commercial enterprises." Based on Berkeley DB, it has a well-proven foundation for data storage, and Sleepycat's prior releases have proven it to be a reliable provider of well-documented open source tools for data storage. Sleepycat allows for commercial licensing of its open source tools, which may make this solution attractive for corporations that are skittish about open source.
It is also worth noting that Berkeley DB XML users essentially get Berkeley DB "for free" with the product. In other words, it's easy to mix and match regular DB data sources with XML data sources. This combination may provide a strong alternative to relational and object-oriented databases.
The one major feature that is missing from the current version is an update facility. Currently, documents are atomic: they cannot be altered, only inserted or deleted in totality. According to Merrells, Sleepycat has tentative plans to support the XUpdate language. However, for many applications, the ability to query large databases of XML data is enough to make the tool useful, even if that data cannot be updated inside the database.
Another promising sign is the community arising around Berkeley DB XML. Over the last several months, when issues arose during testing, the community of developers was quick to help, and Sleepycat has consistently expressed a desire to hear about bugs and shown a willingness to respond to feature requests. Features currently under discussion include the aforementioned XUpdate support, as well as support for schema validation and performance improvements. In the short term, indexing is being rewritten so that indexes can be created on the fly, while a container is in use.
Since any data storage technology requires a significant investment in time and effort, this strong level of community and corporate support is encouraging. Berkeley DB XML, currently in its infancy, seems likely to be around for a long time, and by offering a standard embedded interface it may serve programmers who need robust data storage but want to avoid the overhead of a relational database. The tool has some growing to do, but even in its current form many programmers will find it useful, with a logical, powerful interface.
A very simple piece of sample Perl code which creates and queries a database can be downloaded from XML.com.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.