XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

XQuery, libferris, and Virtual Filesystems

July 27, 2007

Introduction

By bringing together an XQuery engine and a virtual filesystem you can use a familiar query language to access relational databases, Berkeley db4 databases, kernel filesystems, and network files as well as XML. libferris, at its, core is a virtual filesystem allowing many different data sources to be exposed through a filesystem interface. These include the expected things like file://, http://, ftp:// as well as not so expected things like databases, XML files, and even applications like emacs, Evolution, XWindow, and Firefox.

For example, a slight change in an XPath expression to use a db4 file instead of an XML file for a customer lookup can substantially improve query performance. Using a db4 file to perform a lookup is faster because the db4 file is indexed and the right piece of information can be directly accessed in only a few disk seeks. Using an XML file for the lookup table would require reading the whole file and parsing it; then there is no index in RAM so searching a large DOM would also require time.

Some familiarity with XQuery will be of help to the reader in understanding this article. The example XQueries are kept simple in order to demonstrate the libferris-related features instead of XQuery itself.

At times, direct filesystem interactions are shown using tools like ferrisls from the libferris distribution. This is the analogy of looking at an input XML file to debug a particular XQuery or to work out an XPath. In early experimentation it can be of great help to use ferrisls to directly view a data source (filesystem) in order to see why an XQuery might not be returning the result you expect.

libferris is available for POSIX platforms. A binary package repository exists for easy installation on Fedora Core 6 and Fedora 7 will be available shortly. In contrast to the older Linux style of having to explicitly mount filesystems, most of the time, libferris handles the mounting of filesystems for you. For example, if you try to read an XML file as a directory, then libferris will automatically mount it for you and allow you to view inside the XML file.

XQuery on db4

Libferris can mount both XML files and db4 files as filesystems. Because the data model of db4 is fairly flexible, both db4 and XML can appear very similar with libferris. The main difference is the underlying data storage; for example, slight modifications to db4 are much faster than for XML (db4 can save only the changes, XML updates usually mean streaming the entire DOM to disk).

Assume we have a customer lookup table in an XML file as partially shown in the following example code.

<customers>
 <customer id="3"   givenname="Foo"   familyname="Bar"/>
 <customer id="15"  givenname="Bobby" familyname="McGee"/>
 <customer id="131" givenname="Ziggy" familyname="Stardust" />
</customers>

The following commands will create a db4 file containing the same information. Another method more suitable for keeping a db4 file in sync with a master XML file is shown at the end of the article.

Because we are dealing with mixing XML and db4 and using a virtual filesystem, there is some different terminology used to refer to very similar things. An XML attribute is referred to using the filesystem terminology of Extended Attribute (EA) in libferris. Thus the --show-ea command-line option to ferrisls is telling it what (extended) attributes we are interested in seeing. The --preserve-list option to ferriscp tells it explicitly which (extended) attributes we want to be copied from the source to the destination.

As can be seen from the two ferrisls commands the same information is available in both the XML and db4 files, the only difference is the extension of the customers file.

$ fcreate --create-type=db4 --rdn=customers.db .
$ ferriscp -av --preserve-list=id,givenname,familyname customers.xml/customers customers.db
$ ferrisls --show-ea=givenname,id customers.xml/customers
Foo     3
Bobby   15
Ziggy   131
$ ferrisls --show-ea=givenname,id customers.db/customers
Foo     3
Bobby   15
Ziggy   131

A simple XQuery on the customers.xml file is shown below. The document URL and the customer ID are declared as variables and the standard doc() function is used to open the customers.xml file to access a customer's information by their unique ID.

declare variable $docurl     := "file:///.../customers.xml";
declare variable $customerid := "131";
<resultdata>
 {
  for $c in doc( $docurl )/customers/*[@id=$customerid]
  return
    <person cid="{ $c/@id }" surname="{ $c/@familyname }" />
 }
</resultdata>

The above query can be executed with either the xqilla command from the XQilla distribution or the ferris-xqilla command from the libferris distribution. The difference is that the latter command makes some extension functions available to the XQuery.

$ ferris-xqilla customers.xq
<resultdata>
  <person cid="131" surname="Stardust"/>
</resultdata>

Very little has to change to use customers.db instead of customers.xml. The docurl will have to have the file name changed to customers.db and instead of using the standard doc() function we use the ferris-doc() extension function. These two functions behave the same way, the latter makes use of libferris to provide the data and query resolution. The query that operates directly on customers.xml could also use the ferris-doc() extension function, though there is no advantage.

declare variable $docurl     := "file:///.../customers.db";
declare variable $customerid := "131";
<resultdata>
 {
  for $c in ferris-doc( $docurl )/customers/*[@id=$customerid]
  return
    <person cid="{ $c/@id }" surname="{ $c/@familyname }" />
 }
</resultdata>

As the XQuery uses the ferris-doc() extension function we must use the ferris-xqilla command to evaluate it. The results will be identical.

$ ferris-xqilla customers-db4.xq
<resultdata>
  <person cid="131" surname="Stardust"/>
</resultdata>

XQuery on PostgreSQL

libferris can mount PostgreSQL as a filesystem. The result is available under both the postgresql:// and the shorthand pg:// URL scheme. At the top level is the name of servers, then databases, and then tables. For example, a table foo in database bar on the local machine can be accessed as pg://localhost/bar/foo.

A relational table is shown by libferris by making each tuple a file and using (extended) attributes to show each column in the table. So, for example, a table with two columns ( name varchar, id int ) with two tuples will make a directory with two files, each of which will have the name and ID (extended) attribute available.

Database functions are also available alongside the tables and can be called to create a virtual directory with the results. For example, if we have a database called xmldotcom2007 with a table customers and a function customerlookup( int, int ), which returns a range of customers by their customer ID, then both of these will be available directly under the database itself—pg://localhost/xmldotcom2007. The logic of the postgresql function is kept trivial for demonstration purposes.

Such a database might be created and populated as the following code shows.

bash$ psql
create database xmldotcom2007;
\c xmldotcom2007
create table customers ( id serial primary key, givenname varchar(100), familyname varchar(100) );
insert into customers values ( 3, 'Foo', 'Bar' );
insert into customers values ( 15, 'Bobby', 'McGee' );
insert into customers values ( 131, 'Ziggy', 'Stardust' );

CREATE TYPE customerlookup_result 
    AS (f1 int, fname varchar(100), lname varchar(100));
CREATE FUNCTION customerlookup( int, int ) 
   returns setof customerlookup_result 
AS
$BODY$
DECLARE
  rec customerlookup_result;
BEGIN
  for rec in 
     select id,givenname,familyname from customers 
     where id >= $1 and id <= $2
  LOOP
  return next rec;
END LOOP;
return;
END;
$BODY$
LANGUAGE 'plpgsql' ;
\q
bash$

Pages: 1, 2

Next Pagearrow