XML.com: XML From the Inside Out

A First Look at the Kowari Triplestore

June 23, 2004

Kowari is an open-sourced (Mozilla Public License) triplestore optimized for RDF storage, created by Tucana Technologies, and written entirely in Java 1.4.2. It began its life as the storage component of the Tucana Knowledge Server (TKS), Tucana's proprietary knowledge management suite, and remains under active development by Tucana.

Installation

Kowari is named for a small, mouselike Australian mammal, but given that the full version of the software is a 40+ meg download, and includes a host of open-sourced Java components (including Apache's SOAP implementation, the Jetty web server, and the Lucene search engine), a better name might be "platypus". In fairness, a "Lite" version of the software is also available, at about 14 megs, which includes two *.jar files, one to run the server, and the other to run a console.

This simplicity of installation and operation is quite welcome. Most of the available open-sourced triplestores currently require compilation, depend on a relational database like PostgreSQL for persistence, or rely on a host programming language like Perl or Python. In contrast, Kowari's installation is a snap if your machine has Java 1.4 installed: download, unpack, and run. On launch, Kowari starts a web server on port 8080 (the port number can be configured), which hosts a number of useful resources.

A key component in Kowari's bag is a simple console app that allows for direct interaction with the server using Tucana's own SQL-like query language, iTQL. While most applications will end up calling the database via an external program, this easy install allows you to quickly get a feel for the product, and provides an easy way to perform common DBA-like tasks.

A Demo iTQL Session

Below, we'll use that console interface to create a database, populate it with an RDF file that describes United States senators, and query that data. A sample chunk of our RDF:

  <USSenator rdf:about="http://xml.com/example/LiebermanJoseph">
    <Name>Lieberman, Joseph</Name>
    <Party>Democrat</Party>
    <State>CT</State>
    <URI>http://lieberman.senate.gov</URI>
  </USSenator>

First, we create a database on localhost (127.0.0.1) named "Senators". Kowari uses Java RMI URIs to identify databases.


iTQL> create <rmi://127.0.0.1/server1#Senators>;
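
As a rough illustration of how such a URI breaks down, here is a minimal Python sketch. It assumes, based only on the example above, that the host identifies the machine, the path names the server instance, and the fragment names the database; the function name split_model_uri is made up for this sketch and is not part of Kowari.

```python
from urllib.parse import urlsplit

def split_model_uri(uri):
    """Split a Kowari-style database URI into (host, server name, database name)."""
    parts = urlsplit(uri)
    if parts.scheme != "rmi":
        raise ValueError("expected an rmi:// URI, got %r" % uri)
    # Assumption: the path names the server instance, the fragment names the database.
    return parts.hostname, parts.path.lstrip("/"), parts.fragment

host, server, database = split_model_uri("rmi://127.0.0.1/server1#Senators")
print(host, server, database)
```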

Our next command will load senators.rdf into that just-created database.


iTQL> load <file:///C:/rdf/senators.rdf>
      into <rmi://127.0.0.1/server1#Senators>;

Kowari allows for aliases to be declared and used in a way akin to namespaces.


iTQL> alias <http://xml.com/example/> as ex;
iTQL> alias <http://tucana.org/tucana#> as kowari;

That first alias allows us to abbreviate the namespace of our senatorial RDF in all further queries. The second alias abbreviates the Tucana namespace that holds Kowari's built-in "is" equivalency operator, which we'll use below.
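
Conceptually, an alias is just prefix substitution. The Python sketch below shows the idea using the two aliases declared above; the expand function is a hypothetical helper for illustration, not how Kowari implements aliasing.

```python
# The two aliases declared in the iTQL session above.
ALIASES = {
    "ex": "http://xml.com/example/",
    "kowari": "http://tucana.org/tucana#",
}

def expand(term, aliases=ALIASES):
    """Expand a '<prefix:local>' term to its full URI; leave other terms alone."""
    if not (term.startswith("<") and term.endswith(">")):
        return term  # variables ($subj) and literals ('CT') are never aliased
    prefix, sep, local = term[1:-1].partition(":")
    if sep and prefix in aliases:
        return "<" + aliases[prefix] + local + ">"
    return term  # already a full URI, or an unknown prefix

print(expand("<ex:Party>"))
print(expand("<kowari:is>"))
```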

Now that we're initiated, propagated, and aliased, we can query the triplestore. The query below selects all senators and their party affiliations.


iTQL> select $subj $obj 
      from <rmi://127.0.0.1/server1#Senators> 
      where $subj <ex:Party> $obj;

Here's what's happening: the "where" clause in the select statement defines constraints on the triplestore. In the example above, our "where" clause asks for all triples that have a predicate equal to ex:Party (which is an alias for http://xml.com/example/Party).

The output of the query above is a list of the 100 URIs making up the Senate, and their party affiliations:


[ http://xml.com/example/AkakaDaniel, "Democrat" ]
[ http://xml.com/example/BaucusMax, "Democrat" ]
[ http://xml.com/example/BayhEvan, "Democrat" ]
[ http://xml.com/example/BidenJoseph, "Democrat" ]
...
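
To make the idea of a constraint concrete, here is a small Python sketch of matching one pattern against a list of triples. It illustrates the concept only and says nothing about how Kowari evaluates queries; the triples are taken from the sample RDF and output above, and the match function is a name invented for this sketch.

```python
EX = "http://xml.com/example/"

# A handful of triples drawn from the article's sample RDF and query output.
TRIPLES = [
    (EX + "LiebermanJoseph", EX + "Name", "Lieberman, Joseph"),
    (EX + "LiebermanJoseph", EX + "Party", "Democrat"),
    (EX + "LiebermanJoseph", EX + "State", "CT"),
    (EX + "AkakaDaniel", EX + "Party", "Democrat"),
]

def match(pattern, triples):
    """Yield variable bindings for one (subject, predicate, object) pattern.

    Terms starting with '$' are variables; any other term must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("$"):
                # A variable used twice in one pattern must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding

# Equivalent in spirit to: where $subj <ex:Party> $obj
rows = [(b["$subj"], b["$obj"]) for b in match(("$subj", EX + "Party", "$obj"), TRIPLES)]
for row in rows:
    print(row)
```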

What if we only want to list Democrats? Using Kowari's built-in equivalency operator, <kowari:is> (aliased above), we can match string literal values.


iTQL> select $subj $obj 
      from <rmi://127.0.0.1/server1#Senators> 
      where $subj <ex:Party> $obj 
      and ($obj <kowari:is> 'Democrat');

Now we'll use more than one constraint in the where clause, and return more columns in our results. The query below names the different kinds of subjects and objects we expect, in order to allow us to list the name, web address (URI), and party affiliations for the senators from Connecticut (CT).


iTQL> select $name $uri $party 
      from <rmi://127.0.0.1/server1#Senators> 
      where $senator <ex:Name> $name 
      and $senator <ex:URI> $uri 
      and $senator <ex:Party> $party 
      and $senator <ex:State> $state 
      and $state <kowari:is> 'CT'
      order by $name;

Our output:


[ "Dodd, Christopher", "http://dodd.senate.gov", "Democrat" ]
[ "Lieberman, Joseph", "http://lieberman.senate.gov", "Democrat" ]
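
The multi-constraint query works because every constraint shares the $senator variable, so the constraints are joined on it. The Python sketch below shows that idea as a naive nested-loop join. It is an illustration under stated assumptions, not Kowari's query engine: the subject URI for Dodd is assumed to follow the same naming pattern as Lieberman's, "ExampleSenator" is a made-up record added only so that something gets filtered out, and the <kowari:is> constraint is modeled by pre-binding $state.

```python
EX = "http://xml.com/example/"

def fact(subject, **props):
    """Build the triples describing one senator."""
    return [(EX + subject, EX + key, value) for key, value in props.items()]

TRIPLES = (
    fact("LiebermanJoseph", Name="Lieberman, Joseph", Party="Democrat",
         State="CT", URI="http://lieberman.senate.gov")
    # Subject URI assumed; the article shows only Dodd's name, URI, and party.
    + fact("DoddChristopher", Name="Dodd, Christopher", Party="Democrat",
           State="CT", URI="http://dodd.senate.gov")
    # Hypothetical record, not from the article.
    + fact("ExampleSenator", Name="Example, Senator", Party="Independent",
           State="XX", URI="http://example.org/senator")
)

def solve(patterns, triples, binding=None):
    """Yield every binding that satisfies all patterns (a nested-loop join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("$"):
                # A variable bound earlier must agree with this triple.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from solve(rest, triples, candidate)

patterns = [
    ("$senator", EX + "Name", "$name"),
    ("$senator", EX + "URI", "$uri"),
    ("$senator", EX + "Party", "$party"),
    ("$senator", EX + "State", "$state"),
]
# Pre-binding $state stands in for: $state <kowari:is> 'CT'
rows = sorted((b["$name"], b["$uri"], b["$party"])
              for b in solve(patterns, TRIPLES, {"$state": "CT"}))
for row in rows:
    print(row)
```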

And one final example:


iTQL> create <rmi://127.0.0.1/server1#feeds>;
iTQL> load <http://www.oreillynet.com/meerkat/?_fl=rss10&t=ALL&c=47>
      into <rmi://127.0.0.1/server1#feeds>;
iTQL> select $uri $title 
      from <rmi://127.0.0.1/server1#feeds> 
      where $uri <http://purl.org/rss/1.0/title> $title;

The code above creates a database called "feeds", populates it with the most recent site summary XML from O'Reilly/XML.com, and then, in response to a query, lists the URI and title of each article. That is the bare bones of a queryable RSS aggregator in a few lines of iTQL.
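
For comparison, here is roughly what the same lookup involves when done by hand in Python against an RSS 1.0 document. The feed content below is entirely hypothetical (it is not the Meerkat feed), and the function name titled_resources is invented for this sketch. Like the iTQL query, it returns every resource that has an RSS title, which includes the channel as well as the items.

```python
import xml.etree.ElementTree as ET

RSS = "http://purl.org/rss/1.0/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RSS 1.0 document standing in for a real feed.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/feed">
    <title>Hypothetical feed</title>
  </channel>
  <item rdf:about="http://example.org/articles/1">
    <title>First hypothetical article</title>
  </item>
  <item rdf:about="http://example.org/articles/2">
    <title>Second hypothetical article</title>
  </item>
</rdf:RDF>"""

def titled_resources(rss_xml):
    """Return (uri, title) for every resource that has an RSS 1.0 title."""
    root = ET.fromstring(rss_xml)
    rows = []
    for node in root.iter():
        uri = node.get("{%s}about" % RDF)
        title = node.findtext("{%s}title" % RSS)
        if uri and title is not None:
            rows.append((uri, title))
    return rows

for uri, title in titled_resources(SAMPLE):
    print(uri, title)
```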

As shown above, iTQL's syntax looks quite a bit like SQL and is clearly intended to make transitioning to Kowari as simple as possible for DBAs. XML hackers used to the brevity of XPath might be less accepting, however.

The iTQL console is one of several interfaces to the server. Access is also available through JSP, SOAP, a JDBC driver, an iTQL JavaBean, and Kowari's own low-level driver interface.

Other Features

Three other features worth noting are Lucene full-text integration, descriptors, and named graphs.

Lucene full-text integration. RDF is not simply triples made up of URIs; in practice, much RDF (as in the examples above) contains string literal or XML data. Kowari can use the open-sourced Lucene search engine to index this text.

To use Lucene indexing, the DBA creates a separate database using the Lucene "model". Queries can then be constrained by the results returned from a Lucene search. In practice, this allows for searches that keep track of the source of a given token within a graph. In simple English, Lucene integration allows queries like: "select all articles where the title includes the words 'hacking' and 'library'," or "show me the publication dates of all books that contain the word 'Texas'." Lucene allows for basic keyword lookups as well as complex queries, including fuzzy matching and wildcards, and its presence in the database provides Kowari users with an appealing combination of Semantic Web-style, graph-based querying with old-school text lookups.
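
The combination of text search and graph constraints can be sketched in a few lines of Python. This is a toy inverted index, not Lucene and not Kowari's integration with it; the triples, the "title" predicate, and the function names are all hypothetical. The point is only that the index remembers which subject and predicate each token came from, so a keyword search can constrain a graph query.

```python
import re
from collections import defaultdict

# Hypothetical article titles, stored as triples.
TRIPLES = [
    ("http://example.org/a1", "title", "Hacking the Library"),
    ("http://example.org/a2", "title", "Library Opening Hours"),
    ("http://example.org/a3", "title", "Hacking RDF"),
]

def build_index(triples):
    """Map each lowercase token to the (subject, predicate) pairs containing it."""
    index = defaultdict(set)
    for subject, predicate, literal in triples:
        for token in re.findall(r"\w+", literal.lower()):
            index[token].add((subject, predicate))
    return index

def search(index, predicate, words):
    """Return subjects whose literal for `predicate` contains every word."""
    hits = None
    for word in words:
        subjects = {s for s, p in index.get(word.lower(), ()) if p == predicate}
        hits = subjects if hits is None else hits & subjects
    return sorted(hits or ())

index = build_index(TRIPLES)
# "select all articles where the title includes the words 'hacking' and 'library'"
print(search(index, "title", ["hacking", "library"]))
```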

Descriptors. Descriptors bind iTQL commands to XSLT variables. Using descriptors, a developer can create an XSLT template and then populate it, dynamically, with values fetched from an iTQL query. This feature will be of particular interest to web developers who want to create custom, navigable web interfaces above large RDF stores, along with anyone who wants to convert RDF data into legacy XML formats. (Descriptors are not included in Kowari Lite.)
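
The general pattern, a template filled in with query results, looks something like the Python sketch below. It uses Python's string.Template rather than XSLT and is in no way Kowari's descriptor mechanism; the rows are the Connecticut results shown earlier, and the HTML layout is an arbitrary choice for illustration.

```python
from string import Template
from xml.sax.saxutils import escape

# One list item per result row; the placeholders play the role of template variables.
ROW = Template('<li><a href="$uri">$name</a> ($party)</li>')

def render(rows):
    """Fill the template once per (name, uri, party) row and wrap the result in a list."""
    items = [
        ROW.substitute(
            name=escape(name),
            uri=escape(uri, {'"': "&quot;"}),  # also escape quotes inside the attribute
            party=escape(party),
        )
        for name, uri, party in rows
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

rows = [
    ("Dodd, Christopher", "http://dodd.senate.gov", "Democrat"),
    ("Lieberman, Joseph", "http://lieberman.senate.gov", "Democrat"),
]
print(render(rows))
```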

Named graphs. One problem that frequently comes up in the RDF community is the "provenance" problem -- how do we know, in a large triplestore, where a given triple comes from? Many have suggested named graphs as a solution, which will turn triples into "quads". Kowari has taken this path. According to Tom Adams, "Our triplestore is really a quad store, the 4th tuple being the group/model that a triple belongs to."
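
A minimal Python sketch of the quad idea follows. It assumes nothing about Kowari's internals beyond the quote above: each statement carries a fourth element naming the model it belongs to, which makes provenance a simple lookup. The QuadStore class and the "Archive" model are hypothetical.

```python
class QuadStore:
    """Toy store in which every triple carries the model it belongs to."""

    def __init__(self):
        self.quads = set()

    def add(self, model, subject, predicate, obj):
        self.quads.add((subject, predicate, obj, model))

    def select(self, subject=None, predicate=None, obj=None, model=None):
        """Return quads matching the given terms; None acts as a wildcard."""
        pattern = (subject, predicate, obj, model)
        return sorted(
            quad for quad in self.quads
            if all(want is None or want == got for want, got in zip(pattern, quad))
        )

    def provenance(self, subject, predicate, obj):
        """Return the models that contain a given triple."""
        return sorted(quad[3] for quad in self.select(subject, predicate, obj))

EX = "http://xml.com/example/"
SENATORS = "rmi://127.0.0.1/server1#Senators"
ARCHIVE = "rmi://127.0.0.1/server1#Archive"  # hypothetical second model

store = QuadStore()
store.add(SENATORS, EX + "LiebermanJoseph", EX + "Party", "Democrat")
store.add(ARCHIVE, EX + "LiebermanJoseph", EX + "Party", "Democrat")
store.add(SENATORS, EX + "LiebermanJoseph", EX + "State", "CT")

print(store.provenance(EX + "LiebermanJoseph", EX + "Party", "Democrat"))
```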

Benchmarks

Kowari is written in Java 1.4.2 and uses that version's New I/O (NIO). This provides for a decrease in access times, as Kowari is able to bypass the need for a storage layer (such as BerkeleyDB or MySQL), and write data blocks directly to disk.
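
To give a feel for what writing blocks directly to disk can mean, here is a Python sketch that stores triples as fixed-width records in a memory-mapped file (memory-mapped files being one of the facilities Java's NIO provides). The record layout, three 64-bit node IDs per triple, is purely an assumption made for illustration; the article does not describe Kowari's on-disk format.

```python
import mmap
import os
import struct
import tempfile

# Assumed layout: subject, predicate, and object node IDs, 8 bytes each.
RECORD = struct.Struct("<QQQ")

def write_triples(path, triples):
    """Write triples as fixed-width records through a memory-mapped file."""
    if not triples:
        raise ValueError("cannot memory-map an empty file")
    with open(path, "wb") as f:
        f.truncate(RECORD.size * len(triples))  # pre-size the file with zeros
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mapped:
        for i, triple in enumerate(triples):
            RECORD.pack_into(mapped, i * RECORD.size, *triple)
        mapped.flush()

def read_triple(path, index):
    """Read record number `index` by computing its offset; no scan is needed."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        return RECORD.unpack_from(mapped, index * RECORD.size)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "triples.bin")
    write_triples(path, [(1, 2, 3), (4, 5, 6)])
    second = read_triple(path, 1)
    print(second)
```

Because every record has the same width, any triple can be located by arithmetic on its index, which is the kind of access a separate storage layer would otherwise mediate.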

Tucana has tested the 32-bit version of Kowari with 10 million statements, and the 64-bit with 50 million; TKS has been scaled up to 250 million statements and can conceivably manage a billion triples. Currently the software is used by a variety of clients, with applications in genomics research, defense integration, and automobile manufacturing, and the firm reports dramatically increased performance on graph queries over relational databases.

What's Next for Kowari?

While Kowari is capable of doing real work today, Tucana plans to continue adding features to make the triplestore more standards compliant. Inferencing support via OWL is planned, and Tucana hopes to eventually support OWL DL, with stops at RDFS and "OWL Tiny" along the way. Support for arbitrary data types is also planned.

Tucana is also developing a new approach to file addressing, which they call a "resolver". Resolvers allow any resource to be assigned a special "file://" URI, and allow for the processing of arbitrary files as "pseudo-RDF". For instance, a resolver that points to an MP3 file can automatically extract and store a description of the file based on the ID3 tags embedded in the MP3; the same could be done with JPEG files containing metadata. This approach seems particularly interesting because it provides a simple way to absorb the "ambient data" on a computer -- unstructured content like photo and MP3 directories -- into a database, where it can be searched and explored.
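
The MP3 case can be sketched in Python using the older ID3v1 tag, a fixed 128-byte block at the end of the file. This is only an illustration of the resolver idea, not Tucana's implementation: the predicate namespace, the file URI, and the function names are hypothetical, and real files often carry the more complex ID3v2 tags instead.

```python
NS = "http://example.org/id3#"  # hypothetical predicate namespace

def id3v1_triples(file_uri, data):
    """Turn the ID3v1 tag at the end of `data` into (subject, predicate, object) triples."""
    tag = data[-128:]
    if len(tag) < 128 or tag[:3] != b"TAG":
        return []  # no ID3v1 tag present

    def text(start, end):
        # Fields are fixed width and padded with NUL bytes or spaces.
        return tag[start:end].split(b"\x00", 1)[0].decode("latin-1").strip()

    fields = {
        "title": text(3, 33),
        "artist": text(33, 63),
        "album": text(63, 93),
        "year": text(93, 97),
    }
    return [(file_uri, NS + name, value) for name, value in fields.items() if value]

def make_tag(title, artist, album, year):
    """Build a 128-byte ID3v1 tag for demonstration purposes."""
    def pad(value, width):
        return value.encode("latin-1")[:width].ljust(width, b"\x00")
    comment, genre = b"\x00" * 30, b"\xff"
    return b"TAG" + pad(title, 30) + pad(artist, 30) + pad(album, 30) + pad(year, 4) + comment + genre

# Fake file contents: some audio bytes followed by a made-up tag.
data = b"\x00" * 64 + make_tag("Example Song", "Example Artist", "Example Album", "2004")
triples = id3v1_triples("file:///music/example.mp3", data)
for triple in triples:
    print(triple)
```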

Kowari Caveats and Conclusions

Kowari is a solid tool created by an enthusiastic, knowledgeable team. That said, it's not for everyone: the architecture of the application is clearly focused on the server, and developers looking for an embeddable RDF store for desktop apps will likely want to look elsewhere, unless they are willing to add several megs to their applications. Kowari's dependence on Java is another possible sticking point for those developing tools using other frameworks. Documentation is brief and unfinished, but what's there is useful for the adventurous.

Perhaps the most important caveat, however, is that Kowari lacks a security model. Tucana clearly expects security-minded customers to look into TKS, which provides full network-based authentication as part of its package. But DBAs looking to replicate the user/privileges model of MySQL or other databases may be disappointed by Kowari.

These minor issues aside, Kowari works as promised. SQL users should find it easy to migrate their skills to iTQL. Most commendably, in the open source tradition, the database has been designed to "play nice with others," allowing anyone who has invested their energies in building, for instance, a Jena solution to migrate to Kowari with minimum pain. Kowari should be a welcome addition to any Semantic Web developer's toolbox.


