TMQL: A Brief Introduction
June 1, 2005
Editor's Note: Topic Maps are--for people more familiar with RDF--like a kissing cousin: of some interest; related, but we're not entirely sure how; and sometimes vaguely disquieting. While it's clear to most Semantic Web proponents that RDF and OWL are key technologies, it is often less clear how or whether Topic Maps fit in. The W3C's Semantic Web Best Practices and Deployment Working Group has created an RDF/Topic Maps Interoperability Task Force in order to sort out the relationship between these two knowledge representation formalisms for the web. And since the W3C's RDF query language standardization effort, SPARQL, is maturing rapidly, it is a fortunate time to introduce the Topic Maps Query Language, TMQL, which Robert Barta does admirably well in this article.
Work on TMQL started more than a year ago, kicked off by a number of proposals. The editors have attempted to consolidate these approaches into an official draft specification. It is still in flux but sufficiently mature now to justify asking for public feedback. In this introduction I assume that you are at least superficially familiar with Topic Maps and how to create maps. (If not, see Lars Marius Garshol's "What Are Topic Maps?" for a refresher.)
One topic map I use in this preview contains information about music, including various albums, (female or male) musicians and various music groups (which are all artists and have persons as members). Some of these topics are connected via associations, such as is-produced-by or is-part-of.
Setting Off
If you have used SQL before, then you will not be completely puzzled by the following query:
SELECT $album WHERE is-produced-by ($album: production, tom-waits: producer)
This query (technically a query expression) will return all albums where Tom Waits is known to be a producer.
The identifier tom-waits refers to a topic which we happen to know uniquely pinpoints a topic about that person in the map we are querying. The query processor will try to find an association of type is-produced-by and will check whether the topic tom-waits is playing the role producer there. If so, it will bind the variable $album to the topic playing the role production in that same association. It will work like this through the whole map, and collect all these variable bindings. Finally, the query processor will return a list of these variable bindings. We will see a bit later how exactly this works.
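The matching-and-binding procedure described above can be pictured with a minimal Python sketch. It is not how any TMQL processor is implemented; the list-of-dictionaries layout and the topic identifiers album-1, album-2 and someone-else are assumptions made up for illustration.

```python
# Toy model (an assumption, not the TMDM): an association is a type plus
# a mapping from role type to the topic playing that role.
ASSOCIATIONS = [
    {"type": "is-produced-by",
     "roles": {"production": "album-1", "producer": "tom-waits"}},
    {"type": "is-produced-by",
     "roles": {"production": "album-2", "producer": "someone-else"}},
    {"type": "is-part-of",
     "roles": {"part": "tom-waits", "whole": "some-group"}},
]


def albums_produced_by(associations, producer):
    """Collect one binding for $album per matching association."""
    bindings = []
    for assoc in associations:
        if assoc["type"] != "is-produced-by":
            continue  # wrong association type
        if assoc["roles"].get("producer") != producer:
            continue  # the given topic does not play the producer role
        # Bind $album to whatever plays the production role.
        bindings.append({"$album": assoc["roles"]["production"]})
    return bindings


print(albums_produced_by(ASSOCIATIONS, "tom-waits"))
# prints [{'$album': 'album-1'}]
```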
If we want to make the query more watertight so that it returns albums only (and not something else that is produced), then we have to add another constraint to the WHERE clause:
SELECT $album WHERE is-produced-by ($album: production, tom-waits: producer), $album : album
The special binary predicate : (alternatively denoted as is-a) checks whether the thing we have bound to $album is an instance of the class album, at least according to the map we query. It is worth noting that is-a honors the (transitive) subclass-superclass relationship.
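To make the transitive behavior of is-a concrete, here is a small Python sketch. The class names live-album and work, and the restriction to a single direct class and a single direct superclass per entry, are simplifying assumptions of this example and are not taken from the TMQL draft.

```python
# Hypothetical type data: the direct class of each topic and the direct
# superclass of each class (single inheritance only, for brevity).
INSTANCE_OF = {"album-1": "live-album", "tom-waits": "person"}
SUPERCLASS = {"live-album": "album", "album": "work"}


def is_a(topic, cls):
    """True if topic is an instance of cls, directly or via superclasses."""
    current = INSTANCE_OF.get(topic)
    seen = set()  # guards against accidental cycles in the toy data
    while current is not None and current not in seen:
        if current == cls:
            return True
        seen.add(current)
        current = SUPERCLASS.get(current)
    return False


print(is_a("album-1", "album"))    # True: live-album is a subclass of album
print(is_a("tom-waits", "album"))  # False
```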
If we are not fixated on Tom Waits and instead want a list of all albums together with their producers, we can extend the wishlist in the SELECT clause:
SELECT $album, $producer WHERE is-produced-by ($album: production, $producer: producer), $album : album
Again, the processor would walk through the whole map, find all associations of the given type, and bind the role-playing topics to their respective variables. Each binding now consists of a pair (a tuple of two components); all these pairs are collected in a list, which is then returned.
Controlling What is Returned
In the SELECT clauses we have used so far, we asked for whole topics. Any query processor will hand over a complete topic data structure (probably according to TMDM) to the application. If an application were interested, say, only in the name of such a topic, it would have to use some API outside the scope of TMQL to navigate to that name.
To let the TMQL processor do the work, we can tweak the SELECT clause by adding a path expression:
SELECT $album / bn WHERE $album : album
Now the processor will do the navigation for us, as we requested with / bn, to find all the basenames for the thing bound to $album. This may not be exactly what we want, though: First, a topic can have any number of names, so we actually would get a whole list of those for each individual topic. And, secondly, these names would still be returned as data structures and not automatically as the strings holding the names.
To fix the first problem, we can choose to only accept names in a particular scope, say, English (en). This is achieved by appending a filter to the path expression:
SELECT $album / bn [ @ en ] WHERE $album : album
While we first ask for all album names, we select only those which are in the scope en. To additionally force the processor to stringify the name and to return such strings, we add a back-tick:
SELECT $album / bn [ @ en ] ` WHERE $album : album
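The three steps of this path expression (navigate to the basenames, filter by scope, stringify) can be mimicked in Python roughly as follows. The dictionary layout and the sample names are invented for this sketch; real name items in the TMDM carry more structure.

```python
# Toy topic store: each base name carries a value and a scope (assumed layout).
TOPICS = {
    "album-1": {"names": [{"value": "Blue Album", "scope": "en"},
                          {"value": "Blaues Album", "scope": "de"}]},
}


def base_names(topic_id):
    """Counterpart of '/ bn': all name items of the topic."""
    return TOPICS[topic_id]["names"]


def in_scope(names, scope):
    """Counterpart of '[ @ en ]': keep only names in the given scope."""
    return [name for name in names if name["scope"] == scope]


def stringify(names):
    """Counterpart of the back-tick: reduce name items to plain strings."""
    return [name["value"] for name in names]


print(stringify(in_scope(base_names("album-1"), "en")))
# prints ['Blue Album']
```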
Path expressions are also a convenient way to impose a sorting order on the list of tuples we return:
SELECT $album, $producer WHERE is-produced-by ($album: production, $producer: producer) ORDER BY $album / bn [ @ en ] `
That way we get albums and their producer, but the whole list becomes a sequence sorted according to the English album title.
The ordering can also include more than one ordering criterion, like in
SELECT $album, $producer WHERE is-produced-by ($album: production, $producer: producer) ORDER BY $producer / bn [ @ en ] ` desc, $album `
Here, we first sort the list of topic pairs according to the name of the producer. For demonstration only we choose descending order. More importantly though, for one specific producer name (in the English scope) we sort the sublist containing different albums according to the album's identifier. This may not be overly useful by itself, but at least it takes care that the whole returned list always appears in the same order, if we keep repeating the same query.
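For readers who think in terms of a general-purpose language, the effect of these two ordering criteria can be sketched in Python. The rows and names below are made up, and each row already carries the English producer name as a string, which a real processor would compute from the path expression.

```python
# Each row: (album identifier, English name of the producer). Sample data only.
ROWS = [
    ("album-b", "Eno"),
    ("album-a", "Eno"),
    ("album-c", "Waits"),
]


def order_rows(rows):
    """Producer name descending, then album identifier ascending."""
    # Python's sort is stable, so we sort by the secondary criterion first
    # and by the primary criterion second.
    by_album = sorted(rows, key=lambda row: row[0])
    return sorted(by_album, key=lambda row: row[1], reverse=True)


for row in order_rows(ROWS):
    print(row)
```

Because the secondary criterion breaks every tie in this data, repeated runs always yield the same sequence, which is the point made above.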
As you would expect, TMQL makes it possible to make the list of returned tuples unique and to select only slices out of the whole result set. It could look like this:
SELECT $album ORDER BY $album / bn ` UNIQUE OFFSET 10 LIMIT 20
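One plausible reading of UNIQUE, OFFSET and LIMIT is sketched below in Python. The order of the steps (sort, de-duplicate, then slice) and the SQL-like meaning of LIMIT as a maximum row count after the offset are assumptions of this sketch, not statements taken from the draft.

```python
def unique_slice(values, offset, limit):
    """Sort, drop duplicates, then return at most `limit` items from `offset`."""
    ordered = sorted(values)
    unique = list(dict.fromkeys(ordered))  # removes repeats, keeps the order
    return unique[offset:offset + limit]


print(unique_slice(["b", "a", "b", "c", "d"], offset=1, limit=2))
# prints ['b', 'c']
```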
Identifying Things
You may argue--correctly--that identifying topics with their (internal) map identifiers (TMDM calls these source locators) is not an immensely robust idea if that identifier may change any second. Topic maps have a flexible way to address subjects and resources, and TMQL provides syntax for this. If, for instance, there is a subject indicator (a resource which helps to indirectly identify a subject), you can use that instead of an internal identifier:
SELECT $album WHERE is-produced-by ($album: production, s'http://www.u2.com/ : producer)
We use U2's web site, assuming that the topic map data contains that URL as a subject indicator as well. Similar syntax also exists if the URI for the subject itself is known.
Association Templates
In the queries so far, we have made use of association templates. Writing is-produced-by ($album : production, $producer: producer) inside a query makes the processor try to find matching associations in the queried map. Such associations must be of type is-produced-by, and must have exactly two roles, one for production and one for producer. If an association in the map has a third role, say, location, to capture where an album has been produced, then such an association would never match the template.
To allow for such associations with additional roles to match, TMQL allows you to append an ellipsis:
SELECT $album WHERE is-produced-by ($album : production, $whoever: producer, ...)
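The difference between a closed template and one ending in an ellipsis comes down to an equality test versus a subset test on the role types. The Python sketch below shows only that distinction; it ignores players, subtyping and repeated role types, and the function name is invented for the example.

```python
def roles_match(assoc_roles, template_roles, open_ended=False):
    """Compare the role types of an association with those of a template.

    open_ended=False models a plain template (exactly these roles);
    open_ended=True models a template ending in '...' (at least these roles).
    """
    assoc, template = set(assoc_roles), set(template_roles)
    if open_ended:
        return template <= assoc  # extra roles in the association are fine
    return template == assoc


with_location = ["production", "producer", "location"]
print(roles_match(with_location, ["production", "producer"]))                   # False
print(roles_match(with_location, ["production", "producer"], open_ended=True))  # True
```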
Association templates also have more implicit meaning than is obvious at first sight. If, for example, the map contained an association of type is-remastered-by which also connects an album with a producer and is-remastered-by is a subtype of is-produced-by, then such associations would also match the template.
Honoring subclassing also applies to roles and their types. If we had an association of type is-remastered-by in our queried map, but the role (type) for the album is not production but the subclass remastering, such an association would also match the association template.
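A rough Python sketch of subtype-aware matching follows. The SUPERTYPE table mirrors the example in the text; the data layout, the assumption that the type hierarchy is acyclic, and the simplified role check (each wanted role type must be covered by some role, with equal role counts) are choices made for this illustration only.

```python
# Hypothetical subtype declarations, mirroring the example in the text.
SUPERTYPE = {"is-remastered-by": "is-produced-by", "remastering": "production"}


def is_same_or_subtype(candidate, wanted):
    """Walk up the (assumed acyclic) supertype chain looking for `wanted`."""
    while candidate is not None:
        if candidate == wanted:
            return True
        candidate = SUPERTYPE.get(candidate)
    return False


def template_matches(assoc, wanted_type, wanted_roles):
    """Closed template: same number of roles, each wanted role type covered."""
    if not is_same_or_subtype(assoc["type"], wanted_type):
        return False
    if len(assoc["roles"]) != len(wanted_roles):
        return False
    return all(
        any(is_same_or_subtype(role, wanted) for role in assoc["roles"])
        for wanted in wanted_roles
    )


remaster = {"type": "is-remastered-by",
            "roles": {"remastering": "album-1", "producer": "producer-1"}}
print(template_matches(remaster, "is-produced-by", ["production", "producer"]))
# prints True
```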
If you don't care about the role type, you can omit it for some players:
SELECT $album WHERE is-produced-by ($album : production, $whoever, ...)
Of course, this may be walking on thin ice in some situations (or may make processors slower as they have fewer things to grasp).
Path Expressions
The textual overhead of the SQLish style which we have used so far may not be convenient if queries are trivial. Especially for web applications where pages have to be filled with lots of content from a TM backend, a much shorter notation is more appropriate.
To return all albums from the map bound, say, to the variable %m, we can simply write
%m // album
If we need the English names only, then
%m // album / bn [ @ en ]
will do it as well.
Path expressions can become quite complex and longish, so their readability may suffer. It is no problem, though, to formulate a query which returns only the English names of Tom Waits' albums:
%m // album [ . -> production [ * is-produced-by ] / producer = tom-waits ] / bn [ @ en ] `
The processor will again start off with all albums, and will subject each of them to a test provided by the contents of the first [] group. That will effectively test whether Tom Waits is one of the producers. Only these albums will be post-processed in that the English name is selected from them. Finally, the string value is taken.
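Read as a pipeline, this path expression is a generate-filter-project sequence. The Python sketch below follows the same three stages over invented sample data; the comments indicate which part of the path expression each line loosely corresponds to.

```python
# Sample data, made up for this sketch.
ALBUM_NAMES = {
    "album-1": {"en": "Blue Album", "de": "Blaues Album"},
    "album-2": {"en": "Red Album"},
}
PRODUCED_BY = [("album-1", "tom-waits"), ("album-2", "someone-else")]


def english_names_of_albums_by(producer):
    result = []
    for album, names in ALBUM_NAMES.items():                   # %m // album
        producers = {p for a, p in PRODUCED_BY if a == album}  # -> production ... / producer
        if producer not in producers:                          # = tom-waits
            continue
        if "en" in names:                                      # / bn [ @ en ]
            result.append(names["en"])                         # back-tick: plain string
    return result


print(english_names_of_albums_by("tom-waits"))
# prints ['Blue Album']
```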
Chocolate, Vanilla, Caramel
The different language flavors, SELECT and path expressions, can--as we have already seen--be mixed. Not so obvious is the fact that both styles are (almost?) equivalent in terms of expressivity; every SELECT query expression can be transformed into an equivalent path expression. It is up to the developer to choose the most appropriate combination on a case-by-case basis.
Both styles allow returning sequences of tuples of things to the application. This may be exactly along your line of thinking in most cases, but it does not help tremendously if you want to comfortably embed the query results into one of these shiny XML application servers. To spare developers from having to write their own template engines, TMQL allows the creation of XML content using a third flavor, which--you may have guessed it--is otherwise equivalent to the other styles. This flavor, FLWR, is inspired by XQuery and uses RETURN clauses to specify the output:
return <albums>{ for $a in %m // album return <album>{$a / bn [@ en ]}</album> } </albums>
The return value here will be one XML 'document' as a data structure (probably DOM) with a root element <albums>. Nested into that will be all albums in the map.
The way this is achieved is by iterating over albums in a FOR loop. It uses a path expression %m // album to compute first all instances of albums in our map. Each such album is bound to the iteration variable $a, and with the new binding, the body of the loop is evaluated.
Such a body is defined by a nested RETURN clause. It contains an element <album> which wraps only text content, which we specify with the embedded TMQL path expression $a / bn [@ en ]. As in XQuery, XML content and query text are separated using {} brackets. Since the processor knows that text has to be there, it will implicitly take the string value of that basename.
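The loop-and-wrap structure of this FLWR expression has a close analogue in ordinary XML-building code. The Python sketch below uses the standard xml.etree.ElementTree module and takes the English album names as a ready-made list, which stands in for evaluating %m // album and the embedded path expression; the function name is invented for the example.

```python
import xml.etree.ElementTree as ET


def albums_document(english_names):
    """Build <albums> with one <album> child per name (the FOR loop body)."""
    root = ET.Element("albums")
    for name in english_names:  # for $a in %m // album
        # return <album>{...}</album>, with the name as text content
        ET.SubElement(root, "album").text = name
    return root


doc = albums_document(["Blue Album", "Red Album"])
print(ET.tostring(doc, encoding="unicode"))
# prints <albums><album>Blue Album</album><album>Red Album</album></albums>
```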
Not surprisingly, query expressions following the FLWR structure can return lists. They can also return whole topic maps. What syntax this should follow, though, is left for a later discussion.
Using Exist and All Quantification
On some occasions you will have to test whether particular things exist in a map or whether certain things have a relevant property. For illustration, let us ask for all music groups in our map which have at least one female group member:
for $group in %m // group where some $person in $group -> whole / member satisfies $person : female return ($group)
While we iterate over all groups in the map, we find for each all members using the path expression $group -> whole / member. If at least one satisfies the condition that it is an instance of female, then the existential SOME clause is satisfied.
Conversely, we might be interested to find all boy groups--well, at least those groups where all members are male:
for $group in %m // group where every $person in $group -> whole / member satisfies $person : male return ($group)
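The two quantifiers correspond closely to Python's built-in any() and all(). The sketch below uses invented group and person identifiers. Note that all() is true for an empty sequence, so a group without members passes the "every member is male" test here; whether TMQL treats empty groups the same way is not stated in this article.

```python
# Sample data, made up for this sketch.
MEMBERS = {"group-1": ["p1", "p2"], "group-2": ["p3"], "group-3": []}
CLASS_OF = {"p1": "female", "p2": "male", "p3": "male"}


def groups_with_some(cls):
    """Groups where at least one member is an instance of cls (SOME)."""
    return [group for group, members in MEMBERS.items()
            if any(CLASS_OF[m] == cls for m in members)]


def groups_with_every(cls):
    """Groups where all members are instances of cls (EVERY)."""
    return [group for group, members in MEMBERS.items()
            if all(CLASS_OF[m] == cls for m in members)]


print(groups_with_some("female"))  # ['group-1']
print(groups_with_every("male"))   # ['group-2', 'group-3']
```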
More Language Features
Queries can be wrapped into functions. This is quite straightforward, as functions can be named, have formal parameters, and can be invoked everywhere that their name is visible.
More contentious is a feature which would allow the import of ontological information into a query. This could be simply a list of names, or a taxonomy, or a type system. Ontologies also might contain topics which actually represent functions written in some programming language; that way TMQL could be extended by external packages.
Most ontology definition languages, though, are more expressive in that constraints on a domain can be formulated, such as every music group must have at least two members, or a relationship has-created between an album and a person is given implicitly if either the person has directly produced that album or that person belongs to a group which in turn produced it.
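The has-created rule mentioned above can be written down as a small derived predicate. The Python sketch below is only meant to show what such an inference computes; the data, the identifiers and the function name are invented, and nothing here reflects how an ontology would actually be imported into TMQL.

```python
# Sample facts, made up for this sketch.
PRODUCED_BY = [("album-1", "p1"), ("album-2", "group-1")]  # (album, producer)
MEMBER_OF = [("p2", "group-1")]                            # (person, group)


def has_created(person, album):
    """True if the person produced the album directly or via a group."""
    producers = {p for a, p in PRODUCED_BY if a == album}
    if person in producers:
        return True  # direct production
    groups = {g for m, g in MEMBER_OF if m == person}
    return bool(groups & producers)  # some group of the person produced it


print(has_created("p1", "album-1"))  # True, direct
print(has_created("p2", "album-2"))  # True, via group-1
print(has_created("p2", "album-1"))  # False
```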
The contentious part is whether TMQL should adopt the approach taken in tolog only, or whether it should take a more promiscuous position and specify an import mechanism for ontological information only. Then it may be left to implementations to provide optional notations and inferencing strength (as for OWL) when and how they see fit.
Wrapping Up
If you are interested in more details, you may want to read an extended version of this document, or you may want to walk through a recent TMQL presentation (PDF). Potential implementors may even take a peek at the current TMQL draft. We believe that the language will be small enough to be implementable by small teams.
With all that, we request feedback from users and developers, regardless of whether that revolves around usability, applicability to particular application domains, or general feasibility of implementation.