XML.com: XML From the Inside Out

A Relational View of the Semantic Web

March 14, 2007

As people are increasingly coming to believe, Web 2.0 and the Semantic Web have a lot in common: both are concerned with allowing communities to share and reuse data. In this way, the Semantic Web and Web 2.0 can both be seen as attempts at providing data integration and presenting a web of data or information space. As Tim Berners-Lee wrote in Weaving the Web[1]:

If HTML and the Web made all the online documents look like one huge book, RDF, schema and inference languages will make all the data in the world look like one huge database.

RDF is at the core of W3C's Semantic Web architectural layers. It is the standard specifically designed to provide a way to produce and consume data on the Web. It sits on top of standards such as XML, URIs, and Unicode and is used as a basis for schemas and ontologies. It consists of a set of statements that are composed of a subject, predicate, and object that form propositions of fact [7].

How are queries performed on this "one huge database"? Until recently, manipulating or retrieving RDF data has been done through vendor-specific query languages or imperatively through APIs in languages such as Java, PHP, and Ruby. The W3C's proposed standard, SPARQL, is set to provide a declarative language to query and manipulate Semantic Web data [8].

SPARQL consists of operations that are reasonably similar to those found in existing and mature technologies such as SQL or relational algebra, including join, union, left outer join (SPARQL's OPTIONAL), and comparison operators (SPARQL's FILTER) such as equal to, less than, and greater than [8].
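To make the correspondence with relational operators concrete, here is a minimal Python sketch of a join, a left outer join (the behavior of OPTIONAL), and a comparison filter over a handful of statements. It is an illustration only, not SPARQL itself: the helper names (match, join, optional) are hypothetical, the data is invented for the example (supplier s2 deliberately has no city), and each subject is assumed to have at most one value per predicate.

```python
# Toy statement store: each statement is (subject, predicate, object).
# The data is made up for illustration; s2 deliberately has no city.
triples = {
    ("s1", "name", "Smith"),
    ("s1", "city", "London"),
    ("s1", "status", 20),
    ("s2", "name", "Jones"),
    ("s2", "status", 10),
    ("s3", "name", "Blake"),
    ("s3", "city", "Paris"),
    ("s3", "status", 30),
}


def match(predicate):
    """Return {subject: object} for every statement with this predicate.

    Assumes at most one value per subject and predicate.
    """
    return {s: o for (s, p, o) in triples if p == predicate}


def join(left, right_pred, right_var):
    """Inner join: keep only rows whose subject also has right_pred."""
    right = match(right_pred)
    return [
        dict(row, **{right_var: right[row["s"]]})
        for row in left
        if row["s"] in right
    ]


def optional(left, right_pred, right_var):
    """Left outer join: keep every left row, binding right_var only
    where a matching statement exists (no NULL is padded in)."""
    right = match(right_pred)
    out = []
    for row in left:
        extended = dict(row)
        if row["s"] in right:
            extended[right_var] = right[row["s"]]
        out.append(extended)
    return out


names = [{"s": s, "name": o} for s, o in sorted(match("name").items())]

joined = join(names, "city", "city")         # drops s2, which has no city
with_city = optional(names, "city", "city")  # keeps s2, leaves city unbound
# Comparison filter applied after a join on status.
filtered = [r for r in join(names, "status", "status") if r["status"] > 15]
```

In this sketch the row for s2 survives the optional step with no "city" key at all, rather than carrying a NULL, which is the distinction the following sections build on.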

The current suite of existing technologies, such as SQL and the relational model, was devised without the specific requirements of disparate, uncontrolled, large-scale integration. It is unclear whether these technologies are flexible enough to adapt to this new set of requirements in order to enable the idea of a global database.

Advantages of Loose Structure

Before attempting to define SPARQL and RDF in relational terms, it's useful to explore some of the reasons why you would store data in this manner.

One of the difficulties in creating this shared information space is to agree on a schema for the data. Traditional databases require an agreement on a schema, which must be made before data can be stored and queried. One of the great strengths of the RDF model is that it allows data to be stored and queried without first requiring a schema. This decoupling of schema and data also allows the schema to change independently of the data without requiring any existing data to be thrown away or padded with NULLs. It also allows a schema to be automatically generated by looking at relationships between imported instance data.

RDF also allows database design and management to be much more agile, similar to agile software development: a schema can be designed incrementally, after the data has been collected, and can evolve over time as new requirements are encountered. It allows data that is structured slightly differently to be stored together in the lowest common denominator of an RDF statement (subject, predicate, and object). It removes the need to weigh good design against performance in order to store data that might be slightly different in structure. For example, it allows suppliers without cities and names to be stored alongside suppliers with that information.
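As a rough sketch of this loose structure, the Python below stores differently shaped suppliers as plain (subject, predicate, object) statements and then derives a simple description of the data after the fact. The supplier s4, the variable names, and the idea of recording Python type names as a stand-in for a schema are all assumptions made for illustration.

```python
from collections import defaultdict

# Statements for two differently shaped suppliers. s4 is a hypothetical
# supplier with a status but no name or city.
statements = [
    ("s1", "name", "Smith"),
    ("s1", "city", "London"),
    ("s4", "status", 20),
]

# Group statements by subject; nothing is padded for missing properties.
by_subject = defaultdict(dict)
for subject, predicate, obj in statements:
    by_subject[subject][predicate] = obj

# Derive a rough "schema" from the instance data: which predicates occur
# and which value types have been observed for each.
observed_schema = defaultdict(set)
for _, predicate, obj in statements:
    observed_schema[predicate].add(type(obj).__name__)
```

Adding a supplier with a new property later only means appending more statements; the existing ones are untouched and the derived description simply grows.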

This lack of padding (not needing NULLs) removes one of the most debated topics in SQL and the relational model's use of it (see "Much Ado About Nothing" [5]). The argument has generally revolved around the possibly confusing uses of NULLs and what a NULL value actually means. This becomes especially important when one of the main tasks of the Semantic Web is to integrate data from many different sources. A NULL value can mean different things in different data sources and may have been produced as a result of different types of queries from different database implementations. This context is often lost in traditional databases too, so it becomes prohibitively costly and difficult to retain the specific meaning of NULL values from the wide variety of sources available on the Semantic Web.

Removing the use of NULLs also has a positive impact when you consider the inconsistent handling that occurs across various SQL database implementations. It can also simplify aggregate functions where a NULL value is considered when counting rows but not when performing other operations such as averaging values.
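The aggregate behavior mentioned above can be seen with a small experiment using Python's built-in sqlite3 module. This is only a sketch against one implementation (SQLite); the table layout is a cut-down version of the supplier example, and the NULL status for S3 is invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (sno TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO supplier VALUES (?, ?)",
    [("S1", 20), ("S2", 10), ("S3", None)],  # S3's NULL status is made up
)

# COUNT(*) counts every row, including the one with a NULL status;
# COUNT(status) and AVG(status) both skip the NULL.
row_count, status_count, status_avg = conn.execute(
    "SELECT COUNT(*), COUNT(status), AVG(status) FROM supplier"
).fetchone()
conn.close()
```

Here the three suppliers are counted as three rows, yet the average is taken over only the two non-NULL values, so dividing a sum by the row count would give a different answer than AVG.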

RDF Using the Relational Model

An RDF statement or proposition seems fairly abstract, but it is actually familiar to most developers in the form of database management systems (DBMS) and the most popular relational language, SQL. These databases provide a way to represent statements of fact, or propositions, and to ask questions (queries) as to whether a given proposition is true or not.

For the purposes of storing propositions and answering queries, it's possible to represent RDF in an SQL or relational database and vice versa. The advantage of storing RDF using these earlier models is that previous work, such as the formalization of query operations and query optimization, can be applied to SPARQL.

It should be made clear that the work conducted here does not concern itself with specific ways of storing RDF but merely uses previous models as examples of what can be applied to RDF. There are many different approaches to creating efficient RDF stores, including more efficient table structures, manipulating RDF data so that it can be stored more efficiently, and creating databases (not based on SQL) specifically designed to efficiently store RDF (which is narrow, regular, and requires many joins).

In order to describe a relational model of RDF, a familiar example is used throughout: the supplier and parts tables as used by C. J. Date [4]. Table 1 shows a typical set of data from the supplier table. It consists of a table heading, where each column (or attribute) has a name and a type, and a body that consists of rows with values for each of these columns. The first row of the body in Table 1 is a proposition that represents, "A supplier 'S1' has a name called 'Smith', a status of '20', and a city of 'London'".

SNO     SNAME      STATUS     CITY
sno     name       integer    char

S1      "Smith"    20         "London"
S2      "Jones"    10         "Paris"
S3      "Blake"    30         "Paris"

Table 1. Example of a Supplier Table

Figure 1 shows the mapping of this data (containing the same propositions) represented as an RDF graph. This representation takes the table headings (columns) as arrows that connect the values and their data types to an identifier ("_1", "_2", "_3"). These RDF identifiers, called blank nodes, act as placeholders that allow the other properties and values to be associated with one another, similar to a table row. The blank nodes represent the existence of a supplier but do not describe any properties of the supplier.


Figure 1. Example of a Supplier Graph
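The mapping from table rows to statements can be sketched in a few lines of Python. This is an illustrative assumption about how such a conversion might look, not the method used to produce Figure 1: the function name table_to_triples is hypothetical, column names are used directly as predicates, and data types are left out for brevity.

```python
from itertools import count

# Supplier data from Table 1 (column types omitted for brevity).
columns = ["SNO", "SNAME", "STATUS", "CITY"]
rows = [
    ("S1", "Smith", 20, "London"),
    ("S2", "Jones", 10, "Paris"),
    ("S3", "Blake", 30, "Paris"),
]


def table_to_triples(columns, rows):
    """Turn each row into one blank node plus one statement per column."""
    ids = count(1)
    triples = []
    for row in rows:
        node = f"_{next(ids)}"  # blank node label such as "_1"
        for column, value in zip(columns, row):
            # The column heading becomes the arrow (predicate) from the
            # blank node to the value.
            triples.append((node, column, value))
    return triples


supplier_triples = table_to_triples(columns, rows)
```

Each three-row, four-column table becomes twelve statements, with the blank node playing the role of the row that ties the values together.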
