Introducing SPARQL: Querying the Semantic Web
November 16, 2005
An Introduction to SPARQL
This tutorial, the first of a three-part series, introduces SPARQL -- a query language and data access protocol for the Semantic Web. SPARQL is defined in terms of the W3C's RDF data model and will work for any data source that can be mapped into RDF. The specification is under development by the RDF Data Access Working Group (DAWG) and has recently reached Last Call Working Draft.
At this point in its life cycle the specification is stable enough that developers can begin seriously exploring its capabilities. And the availability of several SPARQL query engines means that this exploration can be practical rather than theoretical.
But what if you're a lot more interested in Web 2.0, which is practical and real, than in the Semantic Web, about which opinions vary widely? Why might you want to go to the trouble of learning SPARQL? For dyed-in-the-wool Semantic Web fans, this question may well be a no-brainer: RDF has needed a standard query language for some time and having one will make many development tasks much easier.
However, SPARQL has a much wider potential audience. A central part of the Web 2.0 idea is the ability to extract and query information held across many different ad hoc, third-party apps, services, or repositories. That ability to move in and among various data sources is what makes the mashup possible -- take a little Google Maps, salt with some eBay, and sprinkle with a heaping hunk of Flickr, right?
SPARQL, which is both a query language and a data access protocol, has the potential to become a key component in Web 2.0 applications: as a standard backed by a flexible data model, it can provide a common query mechanism across them. XML.com managing editor Kendall Clark has published an excellent essay (Web 2.0 Meet The Semantic Web) that expands more fully on this idea. SPARQL should be of interest to developers exploring the available options for publishing open data on the Web.
The goal of these tutorials is to enable developers to quickly become productive with SPARQL. All of the key language features will be introduced with abundant examples. No previous experience with RDF query languages is required, but a basic familiarity with RDF and RDF/XML is essential. There are many good primers on RDF available for readers interested in a quick refresher course or a bottom-up introduction.
This first tutorial introduces the key concepts in SPARQL and its relationships to the other specifications under development by the DAWG. By the end of the tutorial you'll be able to write some simple SPARQL queries to extract data from RDF.
In the second tutorial we'll cover some of the more advanced query options, including working with multiple data sources. That tutorial will also demonstrate the ease with which data can be merged and queried using SPARQL.
The third and final tutorial will introduce the other SPARQL query forms: CONSTRUCT, DESCRIBE, and ASK. Far from being limited to querying data, SPARQL also offers the ability to extract information from a data repository according to rules of the client's devising. Powerful stuff.
Before jumping into the syntax, let's put SPARQL into some context, and take a brief look at the data we'll be using throughout the series.
SPARQL in Context
Work on RDF query languages has been progressing for a number of years. Several different approaches have been tried, ranging from familiar looking SQL-style syntaxes, such as RDQL and Squish, through to path-based languages like Versa.
Of these approaches, those that emulate SQL syntactically have probably been the most popular and widely implemented. This is perhaps surprising given the very different models that lurk behind relational databases and RDF -- familiarity with syntax has no doubt contributed to this success. SPARQL follows this well-trodden path, offering a simple, reasonably familiar (to SQL users) SELECT query form which will be the main focus of this first article.
SPARQL actually consists of three separate specifications. The query language specification makes up the core. But alongside it sits the query results XML format which, as you might guess, describes an XML format for serializing the results of a SPARQL SELECT (and ASK) query. This simple format is easily processable with common XML tools such as XSLT; we'll look at an example of that later.
The third specification is the data access protocol which uses WSDL 2.0 to define simple HTTP and SOAP protocols for remotely querying RDF databases. (Or, cunningly, for querying any data repository that can be mapped to the RDF model). The XML results format is used to generate responses from services that implement this protocol.
In total, then, SPARQL consists of a query language, a means of conveying a query to a query processor service, and the XML format in which query results will be returned.
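To make the "conveying a query" part a little more concrete, here is a minimal Python sketch of how a client might encode a query for an HTTP GET request. The endpoint URL is a hypothetical placeholder, and the `query` and `default-graph-uri` parameter names reflect my reading of the protocol drafts; treat them as assumptions and check the protocol document before relying on them.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint, used only for illustration.
ENDPOINT = "http://example.org/sparql"

def build_request_url(query, graph_uri=None):
    """Return a GET URL carrying the query text as a form-encoded parameter."""
    params = [("query", query)]
    if graph_uri is not None:
        # Assumed parameter name for the graph to query.
        params.append(("default-graph-uri", graph_uri))
    return ENDPOINT + "?" + urlencode(params)

url = build_request_url(
    "SELECT ?s WHERE { ?s ?p ?o. }",
    "http://www.daml.org/2003/01/periodictable/PeriodicTable.owl",
)

# Decode the URL again to confirm the query text survives the round trip.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["query"][0])
```

A service implementing the protocol would then answer such a request with a document in the XML results format.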
There are a number of issues that SPARQL does not address yet; most notably, SPARQL is read-only and cannot modify an RDF dataset. Work on this area is currently out of scope for the DAWG, as noted in Section 2 of their charter. It seems likely that this will become a later task for the Working Group once the initial specifications have reached Recommendation status. A similar strategy of "query first, update later" was also adopted by the XQuery Working Group.
SPARQL Query Tools
Happily the SPARQL specifications don't exist in isolation. There are several tools and APIs that already provide SPARQL functionality, and most of them are up to date with the latest specifications. A brief list includes:
- ARQ, a SPARQL processor for Jena
- Rasqal, the RDF query library included in Dave Beckett's comprehensive Redland framework
- RDF::Query
- twinql, a SPARQL processor for Lisp written by Richard Newman
- Pellet, an open source OWL DL reasoner in Java, that has partial SPARQL query support
- KAON2, another OWL DL reasoner that has partial SPARQL support.
My SPARQL query tool Twinkle offers a simple GUI interface to the ARQ library, and supports multiple output formats and simple facilities for loading, editing, and saving queries. Handy if you want to play with SPARQL on the desktop. But for a minimum of installation fuss you can't beat an online SPARQL query tool, which we'll use throughout the rest of the tutorials. As it happens, the service is also a self-contained example of the SPARQL protocol in action.
The Periodic Table in RDF
Tutorial writers can burn a lot of time crafting a good set of examples. A balance needs to be struck between making the data clear and making it too trivial. What you really want is for the examples to reflect the power of the technology being introduced. For this series, I'm going to dispense with the art of data design and instead pick up some data already published in the wild on the Web. That is, we're doing real RDF processing of real-world data. Not only will this help illustrate SPARQL's utility, we may even learn a few interesting facts along the way.
Bob DuCharme has done an excellent job of curating public collections of RDF on his site rdfdata.org. I've picked out this RDF representation of the periodic table for our purposes. It's data that most people will have at least a passing familiarity with, so it won't take a great deal of review in order for you to get started. Here's a handy periodic table to use as a reference if your chemistry is a little rusty.
The RDF data provides some essential facts about each element including its name, symbol, atomic weight and number, plus a good deal more. We'll focus on these simple properties for now. A slightly edited extract of the data, showing a description of sodium, is included here:
    <Element rdf:ID="Na"
             xmlns="http://www.daml.org/2003/01/periodictable/PeriodicTable#">
      <name>sodium</name>
      <symbol>Na</symbol>
      <atomicNumber>11</atomicNumber>
      <atomicWeight>22.989770</atomicWeight>
      <group rdf:resource="#group_1"/>
      <period rdf:resource="#period_3"/>
      <block rdf:resource="#s-block"/>
      <standardState rdf:resource="#solid"/>
      <color>silvery white</color>
      <classification rdf:resource="#Metallic"/>
      <casRegistryID>7440-23-5</casRegistryID>
    </Element>
Note that the namespace for this data is http://www.daml.org/2003/01/periodictable/PeriodicTable# -- that'll be important when we start formulating our SPARQL queries. The RDF includes a mixture of properties; some are simple literals such as name and atomicWeight, while others such as group and standardState have resources as values.
Introducing the Triple Pattern
RDF is built on the triple, a 3-tuple consisting of subject, predicate, and object. Likewise SPARQL is built on the triple pattern, which also consists of a subject, predicate and object. In fact an RDF triple is also a SPARQL triple pattern. A triple from our data expressed using the SPARQL triple pattern syntax looks like this:
<http://www.daml.org/2003/01/periodictable/PeriodicTable#Na> table:name "sodium".
A triple pattern is written as subject, predicate, and object and is terminated with a full stop. URIs, e.g. for identifying resources, are written inside angle brackets. Literal strings are denoted with either double or single quotes. While properties, like name, can be identified by their URI, it's more usual to use a qname-style syntax to improve readability. Later in the tutorial I'll show you how to associate a prefix with a URI using a mechanism very similar to XML namespaces.
SPARQL specifies a number of handy abbreviations for writing complex triple patterns. Both the basic syntax and abbreviations borrow heavily from Turtle, a very terse RDF serialization alternative to RDF/XML. As a text rather than XML format, Turtle can be used to express RDF very succinctly. Rather than exhaustively list all of the SPARQL syntax shortcuts here, we'll introduce them throughout the examples contained in this and later tutorials.
The triple pattern above is fine for demonstrating syntax but isn't very useful as a query. If we know all the data, there's no need to run a query. However, unlike a triple, a triple pattern can include variables. Any or all of the subject, predicate, and object values in a triple pattern may be replaced by a variable. Variables are used to indicate data items of interest that will be returned by a query. The next example shows a pattern that uses variables in place of both the subject and the object:
?element table:name ?name.
Since a variable (which in SPARQL has an alternative spelling using the $ character, like $element) matches any value, this pattern will match any RDF resource that has a name property. Each triple that matches the pattern will bind an actual value from the RDF dataset to each of the variables. For example, there is a binding of this pattern to our dataset where the element variable is bound to <http://www.daml.org/2003/01/periodictable/PeriodicTable#Cl> and the name variable is bound to "chlorine".
In SPARQL all possible bindings are considered, so if a resource has multiple instances of a given property, then multiple bindings will be found. Which is a good thing to remember if you end up with more data than expected in your query results.
At this point you may be wondering if it's legal for a triple pattern to include only variables. Well, it is:
?subject ?predicate ?object.
This pattern matches all triples in an RDF graph.
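If it helps to see the matching process spelled out, here is a toy model in Python. It is only a sketch of the idea, not how any SPARQL engine is actually implemented; the three triples are a hand-picked sample modelled on the periodic table data, and the `match` and `is_var` helpers are names I've made up for illustration.

```python
# A toy model of triple-pattern matching; not a real SPARQL engine.
NS = "http://www.daml.org/2003/01/periodictable/PeriodicTable#"

# A handful of triples modelled on the periodic table data.
TRIPLES = [
    (NS + "Na", NS + "name", "sodium"),
    (NS + "Na", NS + "symbol", "Na"),
    (NS + "Cl", NS + "name", "chlorine"),
]

def is_var(term):
    """Variables are written with a leading '?' or '$'."""
    return term[:1] in ("?", "$")

def match(pattern, triples):
    """Return one binding (a dict) per triple that matches the pattern."""
    solutions = []
    for triple in triples:
        binding = {}
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                # A variable repeated within one pattern must agree with itself.
                if binding.setdefault(p_term[1:], t_term) != t_term:
                    break
            elif p_term != t_term:
                break
        else:
            solutions.append(binding)
    return solutions

names = match(("?element", NS + "name", "?name"), TRIPLES)
everything = match(("?subject", "?predicate", "?object"), TRIPLES)
print(names)
print(len(everything))
```

The first call yields one binding per element that has a name; the all-variables pattern yields one binding for every triple in the sample.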
Triple patterns can also be combined to describe more complex patterns, known as graph patterns. These will be clearer when seen within the context of some sample queries. So let's look at the basic structure of our first SPARQL query.
Structure of a Query
This SPARQL query selects the names of all the elements in the periodic table:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name.
    }
Let's break down the query into its parts to better understand the syntax.
Starting from the top we encounter the PREFIX keyword. PREFIX is essentially the SPARQL equivalent of declaring an XML namespace: it associates a short label with a specific URI. And, just like a namespace declaration, the label applied carries no particular meaning. It's just a label. A query can include any number of PREFIX statements. The label assigned to a URI can be used anywhere in a query in place of the URI itself; for example, within a triple pattern. In the single triple pattern included in this query we can see the table prefix in use as a shorthand for http://www.daml.org/2003/01/periodictable/PeriodicTable#name, the full URI of the name property.
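The expansion itself is plain string substitution. As a rough sketch in Python (the `expand` helper and the `PREFIXES` table are hypothetical names of mine, not part of any SPARQL API):

```python
# Prefix label -> namespace URI, mirroring the PREFIX declaration in the query.
PREFIXES = {"table": "http://www.daml.org/2003/01/periodictable/PeriodicTable#"}

def expand(term, prefixes):
    """Expand a prefixed name such as 'table:name' to a full URI."""
    label, sep, local = term.partition(":")
    if sep and label in prefixes:
        return prefixes[label] + local
    return term  # not a prefixed name we know about; leave it untouched

full = expand("table:name", PREFIXES)
print(full)
```

Renaming the label (say, from table to pt) changes nothing about the URI that results, which is why the label carries no meaning of its own.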
The start of the query proper is the SELECT keyword. Like its twin in a SQL query, the SELECT clause is used to define the data items that will be returned by a query. In this query we're returning a single item, the name of the element.
As you might expect, the FROM keyword identifies the data against which the query will be run. In this instance, the query references the URI of the periodic table in RDF. A query may actually include multiple FROM keywords, as a means to assemble larger RDF graphs for querying. We'll have more to say about that (and SPARQL datasets in general) in the next tutorial. For now, think of all the lovely mashups...
Finally, we have the WHERE clause, which contains the query's graph pattern. A graph pattern is a collection of triple patterns that identifies the shape of the graph that we want to match against. In this instance you'll recognize the pattern for this query as the triple pattern we used earlier.
The WHERE keyword is actually optional and can legally be omitted to make queries slightly terser:
    BASE <http://www.daml.org/2003/01/periodictable/>
    PREFIX table: <PeriodicTable#>
    SELECT ?name
    FROM <PeriodicTable.owl>
    {
      ?element table:name ?name.
    }
URIs are often long and unwieldy, and you can never have too much syntactic sugar to help avoid typing them out repeatedly. BASE is another form of URI abbreviation, defining the base URI against which all relative URIs in the query will be resolved, including those defined with PREFIX. As you can see, the common prefixes of the two URIs in the previous example have been factored out into a BASE URI declaration.
Now that we've written a complete query, let's run it and get some results.
Our First Results
Here's a table that lists the first few results (you can view the complete results using the online query tool):
row | name |
---|---|
1 | sodium |
2 | neon |
3 | iron |
The result of a SPARQL SELECT query is a sequence of results that, conceptually, form a table or result set. Each row in the table corresponds to one query solution. And each column corresponds to a variable declared in the SELECT clause. If you've done any kind of database development, this kind of table-oriented result set should be immediately familiar.
In later sections we'll look at how that sequence can be modified, e.g. to apply a sort order or limit the number of returned results. We'll also take a quick look at the XML results format. But for now, let's make the query do something more interesting.
Graph Patterns
Taking what we've learned about the simplest kind of triple patterns and the structure of a SPARQL query, we can now explore how to do more complex and useful queries.
The next example shows a query that selects the name, symbol, and atomic number of all elements in the periodic table:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name ?symbol ?number
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name.
      ?element table:symbol ?symbol.
      ?element table:atomicNumber ?number.
    }
What's new here is that the query pattern consists of multiple triple patterns. A collection of triple patterns is a graph pattern. In this instance the graph pattern consists of three triple patterns, one to match each of the desired properties: name, symbol, and atomicNumber. Understanding how this query operates involves a bit more background on the pattern matching process.
The most important point is that within a graph pattern a variable must have the same value no matter where it is used. So in the previous example the variable element will always be bound to the same resource. In other words, this query will match any resource that has all three of the desired properties. A resource that does not have all of these properties will not be included in the results because it won't satisfy all of the triple patterns. We'll cover optional matching in a later section.
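The "same variable, same value" rule is easy to model. The following Python sketch extends each partial solution one triple pattern at a time; it is an illustration of the idea only, and the `extend` and `solve` helpers are hypothetical names. The toy data deliberately leaves out an atomic number for neon (the real dataset has one) so that we can watch an incomplete resource drop out.

```python
NS = "http://www.daml.org/2003/01/periodictable/PeriodicTable#"

# Toy data: neon deliberately lacks an atomicNumber here.
TRIPLES = [
    (NS + "Na", NS + "name", "sodium"),
    (NS + "Na", NS + "symbol", "Na"),
    (NS + "Na", NS + "atomicNumber", "11"),
    (NS + "Ne", NS + "name", "neon"),
    (NS + "Ne", NS + "symbol", "Ne"),
]

def extend(binding, pattern, triple):
    """Extend a binding so the pattern matches the triple, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # An already-bound variable must keep its existing value.
            if new.setdefault(p_term[1:], t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new

def solve(patterns, triples):
    """Match every pattern in turn, carrying bindings from one to the next."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                new = extend(binding, pattern, triple)
                if new is not None:
                    extended.append(new)
        solutions = extended
    return solutions

rows = solve(
    [
        ("?element", NS + "name", "?name"),
        ("?element", NS + "symbol", "?symbol"),
        ("?element", NS + "atomicNumber", "?number"),
    ],
    TRIPLES,
)
print(rows)
```

Only sodium survives all three patterns in this sample, because element has to be bound to the same resource in each of them.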
The other notable item here is that there is one triple pattern for each of the variables required to be present in the result set. In SPARQL one cannot usefully SELECT a variable that is not listed in the graph pattern, because it would never be bound to a value. This may seem slightly odd if you're only used to SQL; in that language it is quite common to return columns that are not mentioned in a WHERE clause. But remember a SPARQL query processor has no data dictionary that lists all columns (i.e. properties) of a resource. Variables must be bound to an RDF term via a triple pattern in order for the processor to be able to extract that term from the graph.
Graph Pattern Shortcuts
SPARQL includes a number of syntax shortcuts that simplify the writing of patterns. Let's rewrite our query more succinctly:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT *
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name;
               table:symbol ?symbol;
               table:atomicNumber ?number.
    }
We've used two shortcuts here. The first should be familiar to SQL users: *. This shortcut means "return all variables listed in the graph pattern." It saves having to itemize every variable at the cost of relying on the processor to order the columns in the result set.
The second shortcut is, formally, the use of a predicate-object list. This shortcut allows a query author to list the subject of a series of triple patterns only once. When we're using this form, each predicate-object pair except the last ends with a semicolon rather than a full stop; the final pair is terminated with a full stop as usual. This shortcut can be used when several patterns share the same subject.
SPARQL offers a similar shortcut, an object list, which simplifies patterns that share both subject and predicate and differ only in their object; the objects are listed once each, separated by commas.
OPTIONAL Patterns
RDF graphs are often semi-structured; some data may be unavailable or unknown. How do we allow for this when querying for data? Let's work through an example to illustrate the problem. Imagine that we wanted to adapt the previous query to also return the color of the element. Our first attempt may look like this:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name ?symbol ?number ?color
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name.
      ?element table:symbol ?symbol.
      ?element table:atomicNumber ?number.
      ?element table:color ?color.
    }
We've extended our SELECT statement to include the new variable, color, and have also added a match for the relevant property (table:color) to the graph pattern. So far, so good.
If you run this query, though, you'll notice that some elements are missing. Ununtrium, for example. (No, I'd never heard of it either.) If we look closely at the RDF data, we find that ununtrium, and several others of the heavier elements, do not have the relevant table:color property. So these elements are not returned in the results.
We need to alter the query to allow for the fact that we have some missing or incomplete data. We achieve this by indicating that the relevant triple pattern is optional:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name ?symbol ?number ?color
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name.
      ?element table:symbol ?symbol.
      ?element table:atomicNumber ?number.
      OPTIONAL { ?element table:color ?color. }
    }
If you run this version of the query you'll find that all of the elements are now correctly included. The OPTIONAL keyword must be followed by a sub-pattern containing the optional aspects of the query. Within the result set, if an element doesn't have a color property, then the color variable is said to be unbound for that particular solution (row).
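Readers with a database background may find it helpful to think of OPTIONAL as behaving rather like a left outer join. Here is a small Python sketch of that reading; the rows and color pairs are toy stand-ins rather than output from a real processor, and `optional_join` is a hypothetical helper name.

```python
# Toy solutions for the required part of the pattern.
required = [
    {"element": "Na", "name": "sodium"},
    {"element": "Uut", "name": "ununtrium"},
]

# Toy (subject, color) pairs; ununtrium has no color, as in the article's data.
colors = [("Na", "silvery white")]

def optional_join(solutions, pairs, subject_var, object_var):
    """Add object_var where a pair matches; otherwise keep the row as it is."""
    out = []
    for row in solutions:
        matches = [obj for subj, obj in pairs if subj == row[subject_var]]
        if matches:
            for value in matches:
                out.append({**row, object_var: value})
        else:
            # No match: the row survives and object_var stays unbound.
            out.append(dict(row))
    return out

rows = optional_join(required, colors, "element", "color")
for row in rows:
    print(row["name"], row.get("color", "(unbound)"))
```

Without the else branch this would behave like the first attempt at the query, silently dropping ununtrium.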
Matching Alternatives with UNION
Now that we've seen how to explore optional data, let's see how we can select from alternatives. If we were interested in the chemistry of the halogens and the noble gases, we might simply construct and run separate queries in order to find out their symbols and atomic numbers.
But using the SPARQL UNION keyword we can write a single query that matches the elements of both groups. That query looks like this:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?symbol ?number
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      {
        ?element table:symbol ?symbol;
                 table:atomicNumber ?number;
                 table:group table:group_17.
      }
      UNION
      {
        ?element table:symbol ?symbol;
                 table:atomicNumber ?number;
                 table:group table:group_18.
      }
    }
There are a few things to notice. First, the query pattern consists of two nested patterns joined by the UNION keyword. If an element resource matches either of these patterns, then it will be included in the query solution. For clarity the patterns use the predicate-object list shortcut.
The query also includes another demonstration of URI shortening, this time within the object of a triple pattern. The value (range) of the table:group property is a resource. Each of the groups in the table is modeled as a resource with its own URI. The full URI for group 17 is http://www.daml.org/2003/01/periodictable/PeriodicTable#group_17. As we've already declared a URI PREFIX for http://www.daml.org/2003/01/periodictable/PeriodicTable# we can truncate this to table:group_17.
Any number of UNIONs can be included in a query, providing a great deal of flexibility in assembling data from alternatives.
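In terms of results, a UNION simply pools the solutions of its branches. The following Python sketch models that with list concatenation over a toy set of group memberships; the short identifiers stand in for full URIs, and `in_group` is a hypothetical helper. Note that plain concatenation does not remove duplicates, which matches my understanding of how UNION behaves.

```python
# Toy (element, property, group) triples; short names stand in for full URIs.
TRIPLES = [
    ("F", "group", "group_17"),
    ("Cl", "group", "group_17"),
    ("Ne", "group", "group_18"),
    ("Na", "group", "group_1"),
]

def in_group(group):
    """Solutions for the single pattern: ?element table:group <group>."""
    return [{"element": s} for s, p, o in TRIPLES if p == "group" and o == group]

# UNION keeps every solution produced by either branch.
rows = in_group("group_17") + in_group("group_18")
print([row["element"] for row in rows])
```

Sodium, which belongs to neither group, appears in neither branch and so is absent from the pooled results.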
Sorting
With all of the examples we've seen so far, we've been content to let the results be returned in whatever order the query engine chooses. This is rarely desirable in practice, as we commonly need to impose some sensible and relevant ordering to the data.
SPARQL offers the ORDER BY clause to let us do precisely that. The next example demonstrates the new syntax:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name ?number
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name;
               table:atomicNumber ?number;
               table:group table:group_18.
    }
    ORDER BY ?number
This example selects the name and atomicNumber of all of the elements in group 18 of the periodic table, the noble gases. The ORDER BY clause indicates that the elements should be ordered by their atomic number property, in ascending order.
Formally, ORDER BY is a solution sequence modifier -- it manipulates the result set prior to it being returned by the query processor. As such, it is not part of the graph pattern and so is listed after the WHERE clause in the query syntax.
An ORDER BY clause can list one or more variable names, indicating the variables that should be used to order the result set. The query processor will sort by each variable in turn, in order of their declaration. By default all sorting is done in ascending order, but this can be explicitly changed using the DESC (descending) and ASC (ascending) functions. The next example sorts all of the elements in the periodic table in descending order of atomic weight:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name;
               table:atomicWeight ?weight.
    }
    ORDER BY DESC(?weight)
SPARQL also allows us to limit the total number of results in a result set using the LIMIT keyword, which indicates the maximum number of rows that should be returned. A value of zero will return no results; if the value is greater than the size of the result set, then all rows will be returned. Combining LIMIT with ORDER BY, we can modify our query so that it returns the ten heaviest elements in the periodic table:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name;
               table:atomicWeight ?weight.
    }
    ORDER BY DESC(?weight)
    LIMIT 10
When building user interfaces to navigate through a database or set of results, it's common to break the results into pages, e.g. displaying 10 search results at a time. SPARQL supports such paging by allowing a query to specify an OFFSET into the result set. This indicates that the processor should skip a fixed number of rows before constructing the result set. This usage is naturally combined with ORDER BY in order to ensure a consistent and meaningful order. By way of example, let's assume that we've already listed the ten heaviest elements in the periodic table and now want to fetch the next ten heaviest. In this query we use OFFSET to skip the data we've already seen:
    PREFIX table: <http://www.daml.org/2003/01/periodictable/PeriodicTable#>
    SELECT ?name
    FROM <http://www.daml.org/2003/01/periodictable/PeriodicTable.owl>
    WHERE {
      ?element table:name ?name;
               table:atomicWeight ?weight.
    }
    ORDER BY DESC(?weight)
    LIMIT 10
    OFFSET 10
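The three modifiers compose in a fixed way: sort first, then skip, then truncate. A minimal Python sketch of that pipeline follows. The `page` helper is a hypothetical name, and apart from sodium's weight (taken from the extract earlier) the atomic weights are illustrative values quoted from memory rather than read from the dataset.

```python
# Toy solutions; only sodium's weight comes from the article's extract.
rows = [
    {"name": "sodium", "weight": 22.989770},
    {"name": "neon", "weight": 20.1797},
    {"name": "iron", "weight": 55.845},
    {"name": "chlorine", "weight": 35.453},
]

def page(rows, key, descending=False, limit=None, offset=0):
    """Model ORDER BY, then OFFSET, then LIMIT over a list of solutions."""
    ordered = sorted(rows, key=lambda row: row[key], reverse=descending)
    end = None if limit is None else offset + limit
    return ordered[offset:end]

# ORDER BY DESC(?weight) LIMIT 2
first = page(rows, "weight", descending=True, limit=2)
# ORDER BY DESC(?weight) LIMIT 2 OFFSET 2
second = page(rows, "weight", descending=True, limit=2, offset=2)

print([row["name"] for row in first])
print([row["name"] for row in second])
```

Because the sort happens before the slice, the two pages never overlap, which is exactly why OFFSET is best paired with ORDER BY.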
SPARQL Query Results XML Format
For readability the examples we've viewed so far have been rendered as HTML tables. Most SPARQL processors will include a custom API to allow the direct manipulation of a result set, allowing a programmer to manipulate results in whatever way suits an application. But if we want to serialize a SPARQL result set in a standard way, perhaps to return data via a web service, we can use the SPARQL Query Results XML Format.
By way of an example, here's an extract of the results from the first example above. To view the complete set of results, refer to the online service:
    <sparql xmlns="http://www.w3.org/2005/sparql-results#">
      <head>
        <variable name="name"/>
      </head>
      <results ordered="false" distinct="false">
        <result>
          <binding name="name"><literal datatype="http://www.w3.org/2001/XMLSchema#string">sodium</literal></binding>
        </result>
        <result>
          <binding name="name"><literal datatype="http://www.w3.org/2001/XMLSchema#string">neon</literal></binding>
        </result>
        <result>
          <binding name="name"><literal datatype="http://www.w3.org/2001/XMLSchema#string">iron</literal></binding>
        </result>
        <!-- more results -->
      </results>
    </sparql>
As you can see, the format is fairly simple and regular:
- All of the key elements belong to a single namespace, http://www.w3.org/2005/sparql-results#
- The root element is sparql, which contains a head and a results element that together describe the result set
- The head section declares all variables that will be returned in the result set. It's equivalent to the column headings in an HTML table
- The results section lists each query result, i.e. one result element for each row in the result set
- A result element contains one binding for each variable. A binding is one of literal or uri. These elements contain the actual values returned. If a variable is not bound in a query (see the above section on OPTIONAL Patterns), then it is marked as unbound.
Given its obvious simplicity and regular structure, manipulating this format with XSLT or XQuery is fairly trivial. The SPARQL Query Results XML Format specification includes several relevant examples.
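The same goes for general-purpose XML APIs. As a rough sketch, here is how the format might be read with Python's standard ElementTree module. The embedded sample is a trimmed version of the extract above, `parse_results` is a hypothetical helper, and the code assumes every binding element has exactly one child element.

```python
import xml.etree.ElementTree as ET

# Namespace of the results format, bound to an arbitrary local prefix.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="name"/></head>
  <results ordered="false" distinct="false">
    <result><binding name="name"><literal>sodium</literal></binding></result>
    <result><binding name="name"><literal>neon</literal></binding></result>
  </results>
</sparql>"""

def parse_results(xml_text):
    """Return (variable names, list of row dicts) from a results document."""
    root = ET.fromstring(xml_text)
    variables = [v.get("name") for v in root.findall("sr:head/sr:variable", NS)]
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            value = binding[0]  # assumed: the single child holding the value
            row[binding.get("name")] = value.text
        rows.append(row)
    return variables, rows

variables, rows = parse_results(SAMPLE)
print(variables)
print(rows)
```

The head gives us the column names and each result becomes one row, mirroring the table view we used earlier.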
Summary
This brings us to the end of our first look at SPARQL.
We've seen how SPARQL allows us to match patterns in an RDF graph using triple patterns, which are like triples except they may contain variables in place of concrete values. The variables are used as "wildcards" to match RDF terms in the dataset.
We introduced the SELECT query which can be used to extract data from an RDF graph, returning it as a tabular result set. We built up more complex graph patterns from simple triple patterns and illustrated how to deal with both required and OPTIONAL data. UNION queries were also introduced as a way of dealing with selecting alternatives from our dataset. Finally, we demonstrated how to apply ordering to our results, LIMIT the amount of data returned, and jump forward through results using OFFSET.
Along the way we took a brief look at the SPARQL Query Results XML Format, and a number of the syntax shortcuts that make writing queries much simpler. These are especially useful with repetitive graph patterns and long URIs.
Armed with this information, and the growing range of SPARQL implementations, you can start to investigate the language yourself and put it to good use. As you begin working with the language you'll no doubt find Dave Beckett's query language reference a handy resource.
In our next tutorial in this series we'll look more closely at how SPARQL deals with data typing, applying constraints to our data, and the facilities for querying data from multiple sources.
Finally, I'd like to thank Katie Portwin and Priya Parvatikar for early feedback on this article.