TMQL: A Brief Introduction

June 1, 2005

Editor's Note: Topic Maps are--for people more familiar with RDF--like a kissing cousin: of some interest; related, but we're not entirely sure how; and sometimes vaguely disquieting. While it's clear to most Semantic Web proponents that RDF and OWL are key technologies, it is often less clear how or whether Topic Maps fits in. The W3C's Semantic Web Best Practices and Deployment Working Group has created an RDF/Topic Maps Interoperability Task Force in order to sort out the relationship between these two knowledge representation formalisms for the web. And since the W3C's RDF query language standardization effort, SPARQL, is maturing rapidly, it is a fortunate time to introduce the Topic Maps Query Language, TMQL, which Robert Barta does admirably well in this article.

Work on TMQL started more than a year ago, kicked off by a number of proposals. The editors have attempted to consolidate these approaches into an official draft specification. It is still in flux but sufficiently mature now to justify asking for public feedback. In this introduction I assume that you are at least superficially familiar with Topic Maps and how to create maps. (If not, see Lars Marius Garshol's "What Are Topic Maps?" for a refresher.)

One topic map I use in this preview contains information about music, including various albums, (female or male) musicians and various music groups (which are all artists and have persons as members). Some of these topics are connected via associations, such as is-produced-by or is-part-of.

Setting Off

If you have used SQL before, then you will not be completely puzzled by the following query:


SELECT $album
WHERE
   is-produced-by ($album: production, tom-waits: producer)

This query (technically a query expression) will return all albums where Tom Waits is known to be a producer. tom-waits is an identifier of a topic which we happen to know uniquely pinpoints a topic about that person in the map we are querying. The query processor will try to find an association of type is-produced-by and will check whether the topic tom-waits is playing the role producer there. If so, it will bind the variable $album to the topic playing the role production in that same association. It will work like this through the whole map, and collect all these variable bindings. Finally, the query processor will return a list of these variable bindings. We will see a bit later how exactly this works.
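
The matching procedure just described can be sketched in Python. This is only an illustration under stated assumptions, not how a TMQL processor is actually built: the tuple-based association layout, the sample identifiers album-a and album-b, and the function name albums_produced_by are all hypothetical.

```python
# Hypothetical in-memory map: each association is (type, {role: player}).
ASSOCIATIONS = [
    ("is-produced-by", {"production": "album-a", "producer": "tom-waits"}),
    ("is-produced-by", {"production": "album-b", "producer": "someone-else"}),
    ("is-part-of", {"part": "tom-waits", "whole": "some-group"}),
]


def albums_produced_by(associations, producer_id):
    """Collect one binding for $album per matching association."""
    bindings = []
    for assoc_type, roles in associations:
        if assoc_type != "is-produced-by":
            continue  # wrong association type
        if roles.get("producer") != producer_id:
            continue  # the given topic does not play the producer role
        if "production" in roles:
            bindings.append({"$album": roles["production"]})
    return bindings


result = albums_produced_by(ASSOCIATIONS, "tom-waits")
print(result)
```

Each dictionary in the returned list stands for one variable binding; a real processor would hand back topics rather than plain identifier strings.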

If we want to make the query more watertight, so that it returns albums only (and not something else that is produced), we have to add another constraint to the WHERE clause:


SELECT $album
WHERE
   is-produced-by ($album: production, tom-waits: producer),
   $album : album

The special binary predicate : (alternatively denoted as is-a) checks whether the thing we have bound to $album is an instance of the class album, at least according to the map we query. It is worth noting that is-a honors the (transitive) subclass-superclass relationship.
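
To illustrate what honoring the transitive subclass-superclass relationship means, here is a minimal Python sketch. It assumes, purely for brevity, that every topic has one direct type and every class one direct superclass; the tables INSTANCE_OF and SUBCLASS_OF, the class live-album, and the function is_a are hypothetical.

```python
# Hypothetical type data: direct instance-of and direct subclass-of links.
INSTANCE_OF = {"album-a": "live-album", "tom-waits": "male"}
SUBCLASS_OF = {"live-album": "album", "male": "person"}


def is_a(topic, cls):
    """True if topic is an instance of cls, following subclass links upward."""
    current = INSTANCE_OF.get(topic)
    seen = set()
    while current is not None and current not in seen:
        if current == cls:
            return True
        seen.add(current)  # guard against cyclic subclass data
        current = SUBCLASS_OF.get(current)
    return False


print(is_a("album-a", "album"))
```

In this sketch album-a counts as an album even though the map only states that it is a live-album.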

If we are not fixated on Tom Waits and instead want a list of all albums together with their producers, we can extend the wishlist in the SELECT clause:


SELECT $album, $producer
WHERE
   is-produced-by ($album: production, $producer: producer),
   $album : album

Again, the processor would walk through the whole map, find all associations of the given type, and bind the playing topics to their respective variables. One particular binding now consists of a pair (tuple of two components); all these pairs are collected in a list which is then returned.

Controlling What is Returned

In the SELECT clauses we have used so far, we asked for whole topics. Any query processor will hand over a complete topic data structure (probably according to TMDM) to the application. If an application were interested, say, only in the name of such a topic, it would have to use some API outside the scope of TMQL to navigate to that name.

To let the TMQL processor do the work, we can tweak the SELECT clause by adding a path expression:


SELECT $album / bn
WHERE
   $album : album

Now the processor will do the navigation for us, as we requested with / bn, to find all the basenames for the thing bound to $album. This may not be exactly what we want, though. First, a topic can have any number of names, so we would actually get a whole list of those for each individual topic. Second, these names would still be returned as a data structure and not automatically as the string holding the name.

To fix the first problem, we can choose to only accept names in a particular scope, say, English (en). This is achieved by appending a filter to the path expression:


SELECT $album / bn [ @ en ]
WHERE
   $album : album

While we first ask for all album names, we select only those which are in the scope en. To additionally force the processor to stringify the name and to return such strings, we add a back-tick:


SELECT $album / bn [ @ en ] `
WHERE
   $album : album
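
The three steps (navigate to the names, filter by scope, stringify) can be mimicked in Python as follows. This is a sketch only; the dictionary layout of a topic and the helper names basenames, in_scope, and stringify are assumptions made for illustration and do not come from TMQL or TMDM.

```python
# Hypothetical topic layout: each name is a dict with a value and a scope set.
ALBUM = {
    "id": "album-a",
    "names": [
        {"value": "An English Title", "scope": {"en"}},
        {"value": "Ein deutscher Titel", "scope": {"de"}},
    ],
}


def basenames(topic):
    """Mimic `/ bn`: all name objects of a topic."""
    return list(topic["names"])


def in_scope(names, theme):
    """Mimic `[ @ en ]`: keep names whose scope contains the theme."""
    return [n for n in names if theme in n["scope"]]


def stringify(names):
    """Mimic the back-tick: reduce name objects to plain strings."""
    return [n["value"] for n in names]


titles = stringify(in_scope(basenames(ALBUM), "en"))
print(titles)
```

Without the last step the application would receive name objects, which is the situation described for the query without the back-tick.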

Path expressions are also a convenient way to impose a sorting order on the list of tuples we return:


SELECT $album, $producer
WHERE
   is-produced-by ($album: production, $producer: producer)
ORDER BY
   $album / bn [ @ en ] `

That way we get albums and their producer, but the whole list becomes a sequence sorted according to the English album title.

The ordering can also include more than one criterion, as in:


SELECT $album, $producer
WHERE
   is-produced-by ($album: production, $producer: producer)
ORDER BY
   $producer / bn [ @ en ] ` desc, $album `

Here, we first sort the list of topic pairs according to the name of the producer. For demonstration only we choose descending order. More importantly though, for one specific producer name (in the English scope) we sort the sublist containing different albums according to the album's identifier. This may not be overly useful by itself, but at least it takes care that the whole returned list always appears in the same order, if we keep repeating the same query.
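
The effect of two ordering criteria can be reproduced in Python with two passes of a stable sort. The rows below are hypothetical (album identifier, English producer name) pairs standing in for the stringified values; the sketch shows the ordering idea only, not how a TMQL processor sorts.

```python
# Hypothetical result tuples: (album id, English producer name).
ROWS = [
    ("album-c", "Alice"),
    ("album-b", "Bob"),
    ("album-a", "Bob"),
]

# Python's sort is stable, so sort by the secondary key first,
# then by the primary key in descending order.
by_album = sorted(ROWS, key=lambda row: row[0])
ordered = sorted(by_album, key=lambda row: row[1], reverse=True)
print(ordered)
```

Within the two rows for Bob, the album identifiers stay in ascending order, which is what makes the result repeatable across runs.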

As you would expect, TMQL makes it possible to make the list of returned tuples unique and to select only slices out of the whole result set. It could look like this:


SELECT $album
ORDER BY
   $album / bn `
UNIQUE OFFSET 10 LIMIT 20
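
A rough Python analogue of making a result unique and then slicing it is shown below. It assumes, as an SQL reader would probably expect, that duplicates are removed first, that OFFSET gives the number of entries to skip, and that LIMIT caps how many are returned; the draft specification may define the details differently. The function name unique_slice is hypothetical.

```python
def unique_slice(rows, offset, limit):
    """Drop duplicates (keeping first occurrences), then take a slice."""
    unique = list(dict.fromkeys(rows))  # preserves order, needs hashable rows
    return unique[offset:offset + limit]


names = ["a", "b", "b", "c", "d", "d", "e"]
page = unique_slice(names, offset=1, limit=3)
print(page)
```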

Identifying Things

You may argue--correctly--that identifying topics with their (internal) map identifier (TMDM calls these source locators) is not an immensely robust idea if that identifier may change at any moment. Topic maps have a flexible way to address subjects and resources, and TMQL provides syntax for this. If, for instance, there is a subject indicator (a resource which helps to indirectly identify a subject), you can use that instead of an internal identifier:


SELECT $album
WHERE
   is-produced-by ($album: production, s'http://www.u2.com/ : producer)

We use U2's web site, assuming that the topic map data contains that URL as a subject indicator as well. Similar syntax also exists if the URI for the subject itself is known.

Association Templates

In the queries so far, we have made use of association templates. Writing is-produced-by ($album : production, $producer: producer) inside a query makes the processor try to find matching associations in the queried map. Such associations must be of type is-produced-by, and must have exactly two roles, one for production and one for producer. If an association in the map has a third role, say, location, to capture where an album has been produced, then such an association would never match the template.

To allow for such associations with additional roles to match, TMQL allows you to append an ellipsis:


SELECT $album
WHERE
   is-produced-by ($album : production, $whoever: producer, ...)

Association templates also have more implicit meaning than is obvious at first sight. If, for example, the map contained an association of type is-remastered-by which also connects an album with a producer and is-remastered-by is a subtype of is-produced-by, then such associations would also match the template.

Honoring subclassing also applies to roles and their types. If we had an association of type is-remastered-by in our queried map, but the role (type) for the album were not production but its subclass remastering, such an association would also match the association template.

If you don't care about the role type, you can omit it for some players:


SELECT $album
WHERE
   is-produced-by ($album : production, $whoever, ...)

Of course, this may be walking on thin ice in some situations (or may make processors slower as they have fewer things to grasp).
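
The matching rules of this section (exact role count, the ellipsis, and subtype-aware comparison of association and role types) can be summarized in a small Python sketch. Everything in it is an assumption made for illustration: the SUBTYPE_OF table, the (type, {role: player}) layout, and the function names is_subtype and matches. Players with omitted role types are not modelled.

```python
# Hypothetical subtype data for association types and role types.
SUBTYPE_OF = {"is-remastered-by": "is-produced-by", "remastering": "production"}


def is_subtype(candidate, wanted):
    """True if candidate equals wanted or is a (transitive) subtype of it."""
    seen = set()
    while candidate is not None and candidate not in seen:
        if candidate == wanted:
            return True
        seen.add(candidate)
        candidate = SUBTYPE_OF.get(candidate)
    return False


def matches(assoc, template_type, template_roles, open_ended=False):
    """Check an association (type, {role: player}) against a template.

    template_roles lists the role types the template names; open_ended
    plays the part of the trailing `...`.
    """
    assoc_type, roles = assoc
    if not is_subtype(assoc_type, template_type):
        return False
    remaining = list(roles)
    for wanted in template_roles:
        hit = next((r for r in remaining if is_subtype(r, wanted)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    # Without the ellipsis, no extra roles may be left over.
    return open_ended or not remaining


two_roles = ("is-remastered-by", {"remastering": "album-a", "producer": "p1"})
three_roles = ("is-produced-by",
               {"production": "album-b", "producer": "p2", "location": "studio-x"})
print(matches(two_roles, "is-produced-by", ["production", "producer"]))
print(matches(three_roles, "is-produced-by", ["production", "producer"]))
```

The association with the extra location role is rejected by the strict template and accepted once open_ended=True is passed, mirroring the effect of the ellipsis.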

Path Expressions

The textual overhead of the SQLish style which we have used so far may be inconvenient when queries are trivial. Especially for web applications where pages have to be filled with lots of content from a TM backend, a much shorter notation is more appropriate.

To return all albums from the map bound, say, to the variable %m, we can simply write


%m // album

If we need the English names only, then


%m // album / bn [ @ en ]

will do it as well.

Path expressions can become quite complex and longish, so their readability may suffer. It is no problem, though, to formulate a query which returns only the English names of Tom Waits' albums:


%m // album [ . -> production [ * is-produced-by ] / producer =
     tom-waits ] / bn [ @ en ] `

The processor will again start off with all albums, and will subject each of them to a test provided by the contents of the first [] group. That will effectively test whether Tom Waits is one of the producers. Only these albums will be post-processed in that the English name is selected from them. Finally, the string value is taken.
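
Read from left to right, the path expression behaves much like a chain of filters. The Python comprehension below imitates that reading on hypothetical data; the tables ALBUMS and PRODUCED_BY and the helper producers_of are invented for this sketch and say nothing about how a processor evaluates paths internally.

```python
# Hypothetical data: album topics and production associations.
ALBUMS = {
    "album-a": {"names": [("Title A", "en"), ("Titel A", "de")]},
    "album-b": {"names": [("Title B", "en")]},
}
PRODUCED_BY = [("album-a", "tom-waits"), ("album-b", "someone-else")]


def producers_of(album_id):
    """Mimic `. -> production [ * is-produced-by ] / producer`."""
    return {producer for album, producer in PRODUCED_BY if album == album_id}


english_titles = [
    value
    for album_id, album in ALBUMS.items()      # %m // album
    if "tom-waits" in producers_of(album_id)   # the [ ... = tom-waits ] filter
    for value, scope in album["names"]         # / bn
    if scope == "en"                           # [ @ en ]
]
print(english_titles)
```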

Chocolate, Vanilla, Caramel

The different language flavors, SELECT and path expressions, can--as we have already seen--be mixed. Not so obvious is the fact that both styles are (almost?) equivalent in terms of expressivity; every SELECT query expression can be transformed into an equivalent path expression. It is up to the developer to choose the most appropriate combination on a case-by-case basis.

Both styles allow returning sequences of tuples of things to the application. This may be exactly along your line of thinking in most cases, but it does not help tremendously if you want to comfortably embed the query results into one of these shiny XML application servers. To achieve that, developers would otherwise have to write their own template engines. To spare them the effort, TMQL allows the creation of XML content using a third flavor, which--you may have guessed it--is otherwise equivalent to the other styles. This flavor, FLWR, is inspired by XQuery and uses RETURN clauses to specify the output:


return
   <albums>{
     for $a in %m // album return
         <album>{$a / bn [@ en ]}</album>
   }
   </albums>

The return value here will create one XML 'document' as data structure (probably DOM) with a root element <albums>. Nested into that will be all albums in the map. The way this is achieved is by iterating over albums in a FOR loop. It uses a path expression %m // album to compute first all instances of albums in our map. Each such album is bound to the iteration variable $a, and with the new binding, the body of the loop is evaluated.

This body is defined by a nested RETURN clause. It contains an element <album> which wraps only text content, which we specify with the embedded TMQL path expression $a / bn [@ en ]. As in XQuery, XML content and query text are separated using {} braces. Since the processor knows that text has to be there, it will implicitly take the string value of that basename.
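
For comparison, this is roughly what an application would have to write by hand to obtain the same XML structure. The sketch uses Python's standard xml.etree.ElementTree module; the list ALBUM_NAMES is a hypothetical stand-in for the English names the path expression would deliver, and albums_document is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical English album names, as the path expression might yield them.
ALBUM_NAMES = ["Title A", "Title B"]


def albums_document(names):
    """Build <albums> with one <album> child per name."""
    root = ET.Element("albums")
    for name in names:  # corresponds to: for $a in %m // album
        ET.SubElement(root, "album").text = name  # the nested return clause
    return root


xml_text = ET.tostring(albums_document(ALBUM_NAMES), encoding="unicode")
print(xml_text)
```

The FLWR flavor lets the query author state this structure declaratively instead of building it element by element.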

Not surprisingly, query expressions following the FLWR structure can return lists. They can also return whole topic maps. What syntax this should follow, though, is left for a later discussion.

Using Exist and All Quantification

On some occasions you will have to test whether particular things exist in a map or whether certain things have a relevant property. For illustration, let us ask for all music groups in our map which have at least one female group member:


for $group in %m // group
where
    some $person in $group -> whole / member
         satisfies $person : female
return
    ($group)

While we iterate over all groups in the map, we find all members of each group using the path expression $group -> whole / member. If at least one of them satisfies the condition that it is an instance of female, then the existential SOME clause is satisfied.

Conversely, we might be interested to find all boy groups--well, at least those groups where all members are male:


for $group in %m // group
where
    every $person in $group -> whole / member
         satisfies $person : male
return
    ($group)
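
The two quantifiers correspond closely to Python's built-in any() and all(). The sketch below runs both tests over hypothetical membership data; MEMBERS and TYPE_OF are invented for this example, and each person has exactly one type for simplicity.

```python
# Hypothetical membership and type data.
MEMBERS = {
    "group-x": ["ann", "bob"],
    "group-y": ["carl", "dave"],
}
TYPE_OF = {"ann": "female", "bob": "male", "carl": "male", "dave": "male"}

# some ... satisfies: at least one member is female
with_female = [g for g, ms in MEMBERS.items()
               if any(TYPE_OF[m] == "female" for m in ms)]

# every ... satisfies: all members are male
all_male = [g for g, ms in MEMBERS.items()
            if all(TYPE_OF[m] == "male" for m in ms)]

print(with_female, all_male)
```

Note that Python's all() is true for an empty sequence, so a group without any members would pass the second test here; whether TMQL treats that case the same way is not stated in this introduction.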

More Language Features

Queries can be wrapped into functions. This is quite straightforward, as functions can be named, have formal parameters, and can be invoked everywhere that their name is visible.

More contentious is a feature which would allow the import of ontological information into a query. This could be simply a list of names, or a taxonomy, or a type system. Ontologies also might contain topics which actually represent functions written in some programming language; that way TMQL could be extended by external packages.

Most ontology definition languages, though, are more expressive in that constraints on a domain can be formulated, such as "every music group must have at least two members" or "a relationship has-created between an album and a person is given implicitly if either the person has directly produced that album or that person belongs to a group which in turn produced it."
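
To make the has-created example concrete, here is how such an implicit relationship could be derived in plain Python. This is only a sketch of the rule as worded above; the fact tables PRODUCED and MEMBER_OF and the function has_created are hypothetical, and no particular ontology language or TMQL syntax is implied.

```python
# Hypothetical facts: who produced what, and who belongs to which group.
PRODUCED = {("album-a", "ann"), ("album-b", "group-x")}
MEMBER_OF = {("bob", "group-x")}


def has_created(album, person):
    """Derived relationship: direct production, or production by the person's group."""
    if (album, person) in PRODUCED:
        return True
    return any((album, group) in PRODUCED
               for member, group in MEMBER_OF if member == person)


print(has_created("album-b", "bob"))
```

Here bob counts as a creator of album-b only because his group produced it; no such association is stored explicitly.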

The contentious part is whether TMQL should adopt only the approach taken in tolog, or whether it should take a more promiscuous position and specify only an import mechanism for ontological information. Then it may be left to implementations to provide optional notations and inferencing strength (as for OWL) when and how they see fit.

Wrapping Up

If you are interested in more details, you may want to read an extended version of this document, or you may want to walk through a recent TMQL presentation (PDF). Potential implementors may even take a peek at the current TMQL draft. We believe that the language will be small enough to be implementable by small teams.

With all that, we request feedback from users and developers, regardless of whether that revolves around usability, applicability to particular application domains, or general feasibility of implementation.