An Introduction to Prolog and RDF
Introduction: SW is AI
Many Semantic Web advocates have gone out of their way to disassociate their visions and projects from the Artificial Intelligence moniker. No surprise, since the AI label has been the kiss of, if not death, at least scorn, since Lisp machines were frozen out of the marketplace during the great "AI winter" of the mid-1980s. Lisp still suffers from its association with the AI label, though it does well by being connected with the actual technologies.
However, it is a curious phenomenon that the AI label tends to get dropped once the problem AI researchers were studying becomes tractable to some degree and yields practical systems. Voice recognition and text-to-speech, expert systems, machine vision, text summarizers, and theorem provers are just a few examples of classic AI tech that has become part of the standard bag of tricks. The AI label tends to mark things which aren't yet implemented in a generally useful manner, often because hardware or general practices haven't yet caught up.
That seems to describe the Semantic Web pretty well.
An aside -- one interesting phenomenon is that a lot of AI ends up, after fleeing the CS department, in Information and Library Sciences. And, of course, librarians, even the non-techie ones, are really into cataloging, searching, sharing, correlating, using metadata, intelligent agents... to wit, all the elements of the Semantic Web. AI folks don't end up in library departments because librarians are pushovers (as my overdue fines attest), but because there's a pretty good fit between what (some) AI-ers like to do, what the library folks want, and between what the librarians want and what the Semantic Web requires.
So the Semantic Web is an AI project, and we should be proud of that fact. Not only is it more honest, but it means that we can be clearer about what constitutes prior art, relevant research and literature, similar projects, and available technology. As I've written before, narrowness of understanding is a pernicious barrier to sensible progress. Reinventing the wheel isn't nearly as bad as having to continually reconceptualize it: "not thought here" generally causes more systematic problems than "not invented here".
In these articles, I'm going to do a little down-to-earth exploration of RDF, a core Semantic Web technology, using a classic AI programming language, Prolog, plus some standard AI techniques and technologies.
Prolog was the first logic programming language, and it's still popular in industry and in the classroom. There are many implementations, most of rather good quality. Interestingly, Prolog implementations are often used as logic servers or drop-in inference engines for larger programs, so the implementations have gotten fairly good at integrating with other programs (for example, there are several Prolog-style inference engines for the JVM, and some truly fine ones built on Common Lisp).
Prolog is an excellent prototyping language. It's quite easy to pull together programs with interesting and sometimes surprising properties. There is a large, high-quality corpus of Prolog literature and code, much of which is easily adaptable to one's ad hoc needs. For example, a simple backward-chaining expert system is perhaps a page or two of sample code in just about any Prolog textbook. While not production quality, such toys are ideal for getting a concrete sense of the problems and possibilities of an idea.
Syntax and Simple Semantics
There's not room in this article to give a reasonable Prolog tutorial, but a few preliminaries will be useful for getting a grip on RDF and how Prolog can deal with it.
It's helpful to contrast Prolog programs with invocations of them. A typical Prolog program will form a knowledge base -- a database of facts and rules which is used as a basis for inferences. To initiate computations, you query the knowledge base. Here's a very simple Prolog program which forms a small knowledge base about the readers of some popular web sites.
reads(john, 'XML.com').
reads(mary, 'XML.com').
reads(mary, xmlhack).
reads(cristina, xmlhack).
Each line in this program asserts a fact. The first line claims that john reads 'XML.com'; the second that mary reads 'XML.com', and so on. reads, john, mary, cristina, and xmlhack are all Prolog atoms (a.k.a. "symbols"). The atom is the most basic and prevalent datatype in Prolog. If an atom begins with an uppercase letter, or contains certain special characters (like the full stop, which is also the statement terminator), then one encloses it in single quotes (hence, 'XML.com'; while this is standard, you may find Prolog systems with alternative syntax for atom literals).
Given the types of characters that tend to show up in URIs, they almost always need to be enclosed in single quotes to produce their eponymous atoms. RDF makes heavy use of URIs, which basically means that, worst case, when processing RDF with Prolog you'll be writing 'http://purl.org/yadda/yadda/yadda/' a lot (for some reasonable value of "yadda").
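The quoting rules can be checked directly with the standard type-testing predicate atom/1. The following is only a small sketch; the predicate name quoting_demo is made up for illustration.

```prolog
% quoting_demo/0 is a hypothetical helper, not part of the article's program.
% It succeeds only if every claim about atoms below holds.
quoting_demo :-
    atom(xmlhack),                               % bare lowercase atom
    atom('XML.com'),                             % quoted: uppercase start, contains '.'
    xmlhack == 'xmlhack',                        % quotes are optional here: same atom
    atom('http://purl.org/yadda/yadda/yadda/'),  % a URI is just a quoted atom
    \+ atom(_Unbound).                           % an unbound variable is not an atom
```

The query quoting_demo should succeed. Quoting never changes which atom you get; it only lets the reader accept characters that a bare atom cannot contain.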
Now that we have our knowledge base, we can interrogate our Prolog system. After loading the program into my Prolog ("consulting" it, in Prolog lingo), I can enter questions and receive answers at the "query" prompt.
?- reads(john, 'XML.com').
yes
"John reads 'XML.com'?" Prolog says, "Sure does."
?- reads(mary, X).
X = 'XML.com'
yes
"mary reads what?" X is a variable. Prolog searched the knowledge base and found that if X was bound to 'XML.com' we get a "true" statement (i.e., one in the knowledge base).
?- reads(Person, 'XML.com').
Person = john ;
Person = mary ;
No
"What Person reads 'XML.com'?" "john does!" (read "Who else?" for ";") "And mary!" "Anyone else?" "Nope."
(Thus we see one standard Prolog development cycle: edit the knowledge base in a text editor. Load it into the system, i.e. "consult it". Then interact with it from the read-query, evaluate, print loop.)
(Note: an unquoted token that begins with an uppercase letter is a variable, not an atom. Hence X is a variable, as is Person.)
Notice that in the second and third examples, there's more than one answer that will satisfy the query: mary reads both 'XML.com' and xmlhack, and both john and mary read 'XML.com'. In the last session, after Prolog told me that john reads 'XML.com', instead of hitting "enter", I hit the semicolon, which told Prolog to look again for other ways my query can be satisfied. I kept doing this until there were no solutions that hadn't already been given. (While these particular commands are quite common in Prolog read-query-print loops, they are not universal.)
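If you would rather have every answer at once than step through them with the semicolon, the standard findall/3 predicate collects them into a list. This is a minimal sketch against the same four facts; the helper name readers_of is an assumption made for illustration.

```prolog
reads(john, 'XML.com').
reads(mary, 'XML.com').
reads(mary, xmlhack).
reads(cristina, xmlhack).

% readers_of(+Site, -People): hypothetical helper that gathers every
% Person for whom reads(Person, Site) holds, in knowledge-base order.
readers_of(Site, People) :-
    findall(Person, reads(Person, Site), People).

% ?- readers_of('XML.com', People).
% People = [john, mary]
```

Interactive backtracking and findall/3 explore the same solutions; the latter simply does the "who else?" asking for you.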
Suppose we want to know if any one person reads both 'XML.com' and xmlhack:
?- reads(Person, 'XML.com'), reads(Person, xmlhack).
Person = mary ;
No
(The comma between the clauses is pronounced "and".)
Suppose we want to derive some targeted email marketing lists. We will probably find, in those circumstances, that this last query is quite a common one. It would be quite a drag to have to type that query out every time we wanted to send some spam. More importantly, the concept "a reader of both 'XML.com' and xmlhack" has a special status for us: it defines the term spam_target. We could add the statement spam_target(mary) to our knowledge base, but that's both redundant (as we can figure out that mary is a spam_target from what we already know) and a pain to maintain (e.g., if mary stops reading xmlhack due to having to spend all her time deleting our spam, we have to change two lines in the program which aren't obviously connected). Fortunately, we can add a rule to our knowledge base to define our new concept.
spam_target(Sucker) :- reads(Sucker, 'XML.com'), reads(Sucker, xmlhack).
"A Sucker is a spam_target if That Sucker reads 'XML.com' and That Sucker reads xmlhack."
Assuming that we don't alter our knowledge base any other way, the query spam_target(Person) will return Person = mary.
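To make the maintenance point concrete, here is a sketch of the whole program with reads/2 declared dynamic, so that a fact can be removed at the query prompt. The dynamic declaration and the use of retract/1 are additions for illustration; the facts and the rule are the ones given above.

```prolog
:- dynamic reads/2.   % assumption: lets us retract facts at run time

reads(john, 'XML.com').
reads(mary, 'XML.com').
reads(mary, xmlhack).
reads(cristina, xmlhack).

% A Sucker is a spam_target if they read both sites.
spam_target(Sucker) :-
    reads(Sucker, 'XML.com'),
    reads(Sucker, xmlhack).

% ?- spam_target(Person).
% Person = mary
%
% ?- retract(reads(mary, xmlhack)).   % mary gives up on xmlhack
% ?- spam_target(Person).
% (fails: nobody reads both sites any more)
```

Because spam_target is derived rather than stored, changing a single reads fact is enough to keep the marketing list honest.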
Moving to RDF
In the pre-rule knowledge base, each fact had three parts:
- the predicate (reads);
- the subject of the predicate, i.e., the reader (john, mary, and so forth);
- and the thing they read, i.e., the object of the predicate ('XML.com' or xmlhack).
By a striking and carefully planned coincidence, these are exactly the components of an RDF triple (hereafter, I'll use "RDF triple" and "triple" interchangeably). The RDF triple is one of several formal models offered by the core RDF spec. It consists of an ordered 3-tuple of URIs (with the exception that the object position may take a string literal), with the first URI naming a predicate, the second naming a subject, and the last item being either a URI naming an object or a string literal. While the example Prolog facts have the same slots as a triple, the symbols which fill those slots aren't URIs. Happily, it's not that hard to convert our simple knowledge base:
- The Objects: since all the objects currently in our knowledge base are web sites, it seems natural to use their base URL as their name, thus, 'http://www.xml.com/' and 'http://www.xmlhack.com/' (remember, to make URIs into standard Prolog atoms, you typically need to single quote them).
- The Predicate: the predicate atom (reads) has no intrinsic, natural URI, but we can simply use the URL of this article (which is unique and not particularly useful for anything else) prepended to the atom, which yields: 'http://www.xml.com/pub/a/2001/04/25/prologrdf/reads'
- The Subjects: again, there are no natural, intrinsic URIs, but it seems a little nasty to use that same long URL prefix that we used for the predicate. To add a little visual difference, we'll invent mailto:-based URIs for each person.
We can now convert the example knowledge base to a collection of RDF triples:
Of course, this table presentation of the triples is a bit hard to query. It would be nice if we could encode these triples in a form that Prolog understood. Fortunately, those URI atoms are just atoms, and we can use them just as we did our original (more concise) ones:
'http://www.xml.com/pub/a/2001/04/25/prologrdf/reads'(
    'mailto:firstname.lastname@example.org', 'http://www.xml.com/').
'http://www.xml.com/pub/a/2001/04/25/prologrdf/reads'(
    'mailto:email@example.com', 'http://www.xml.com/').
'http://www.xml.com/pub/a/2001/04/25/prologrdf/reads'(
    'mailto:firstname.lastname@example.org', 'http://www.xmlhack.com/').
'http://www.xml.com/pub/a/2001/04/25/prologrdf/reads'(
    'mailto:email@example.com', 'http://www.xmlhack.com/').

?- 'http://www.xml.com/pub/a/2001/04/25/prologrdf/reads'(Person, 'http://www.xml.com/').
Person = 'mailto:firstname.lastname@example.org'
Yes
This is rather ugly as it stands (adding XML style namespaces will help), but it gives us a nice, constructive demonstration of how RDF triples are, or can be seen as, Prolog facts; and, hence, how a collection of RDF triples (say, as serialized in an RSS document) can be a Prolog program.
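The conversion described above is mechanical enough to sketch in Prolog itself, using the standard atom_concat/3 to build the URI atoms. Everything named here (site_uri, predicate_uri, person_uri, as_triple) is hypothetical, and the example.org addresses are placeholders invented for this illustration.

```prolog
reads(john, 'XML.com').
reads(mary, 'XML.com').
reads(mary, xmlhack).
reads(cristina, xmlhack).

% site_uri(?ShortName, ?URI): the base URL of each web site.
site_uri('XML.com', 'http://www.xml.com/').
site_uri(xmlhack,   'http://www.xmlhack.com/').

% predicate_uri(+Name, -URI): the article's URL prepended to the atom.
predicate_uri(Name, URI) :-
    atom_concat('http://www.xml.com/pub/a/2001/04/25/prologrdf/', Name, URI).

% person_uri(+Name, -URI): an invented mailto: URI (placeholder domain).
person_uri(Name, URI) :-
    atom_concat('mailto:', Name, Prefix),
    atom_concat(Prefix, '@example.org', URI).

% as_triple(-P, -S, -O): each reads/2 fact viewed as a URI-named triple.
as_triple(P, S, O) :-
    reads(Person, Site),
    predicate_uri(reads, P),
    person_uri(Person, S),
    site_uri(Site, O).

% ?- as_triple(P, S, O).
% P = 'http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
% S = 'mailto:john@example.org',
% O = 'http://www.xml.com/'      (first of four answers)
```

Keeping the short atoms as the working vocabulary and generating the URIs on demand is one way to put off the ugliness until something outside the program actually needs RDF.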
However, since Prolog knowledge bases can have facts with many arguments, and can have rules, we might want to keep our RDF-based facts somewhat distinct from the rest of the program. One way we might do this is by explicitly saying that a triple of URIs is in the RDF subject-predicate-object relation. We could call that predicate rdf_triple, as in
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
    'mailto:email@example.com', 'http://www.xml.com/').
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
    'mailto:firstname.lastname@example.org', 'http://www.xml.com/').
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
    'mailto:email@example.com', 'http://www.xmlhack.com/').
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
    'mailto:firstname.lastname@example.org', 'http://www.xmlhack.com/').
We can recover our old, easier to type, formulation by defining a few rules:
reads(Person, Website) :- rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads', Person, Website).
The spam_target rule will work with this new knowledge base essentially as it did with the old one.
The definition of the rdf_triple predicate establishes an RDF knowledge base. Our reads rule can be thought of as an RDF application. In other words, our rules process the RDF. The kind of processing we do is a form of inference. We can use inferences to produce results similar to other forms of processing (such as transformations or SQL queries) though often with less work and more clarity.
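Putting the pieces together, a complete RDF knowledge base plus "application" might look like the following sketch. The example.org mailto: addresses are placeholders invented for this illustration, and spam_target is restated with the sites' URIs in place of the short atoms, which is an assumption about how the rule would be carried over.

```prolog
% The RDF knowledge base: predicate, subject, object (placeholder subjects).
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
           'mailto:john@example.org', 'http://www.xml.com/').
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
           'mailto:mary@example.org', 'http://www.xml.com/').
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
           'mailto:mary@example.org', 'http://www.xmlhack.com/').
rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
           'mailto:cristina@example.org', 'http://www.xmlhack.com/').

% The RDF application: rules that draw inferences from the triples.
reads(Person, Website) :-
    rdf_triple('http://www.xml.com/pub/a/2001/04/25/prologrdf/reads',
               Person, Website).

spam_target(Sucker) :-
    reads(Sucker, 'http://www.xml.com/'),
    reads(Sucker, 'http://www.xmlhack.com/').

% ?- spam_target(Person).
% Person = 'mailto:mary@example.org'
```

Note that only the site names inside spam_target had to change; the shape of the rule, and the way we query it, is exactly as before.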
The root RDF data model is deliberately very minimal and, as with XML, that minimalism is intended to make things easier for programs. One consequence of that minimalism, when coupled with other machine-friendly design tropes, is that though "human readable", RDF is not generally very human writable (although the Notation3 syntax tries to improve things). Furthermore, while RDF's data model is specified, the processing model (deliberately) isn't, so one should expect a wide variety of processors, each working in its own way, depending on a variety of constraints and desiderata.
Standard Prolog provides a rich processing model which naturally subsumes RDF data. As we saw above, deriving RDF triples from Prolog predicates, and then the reverse, can deepen our understanding of both. Furthermore, there is a lot of experience implementing a variety of alternative processing models (both forward and backward chaining systems, for example) in Prolog -- from the experimental toy, through the serious research project, to the industrially deployed, large-scale production system level. Furthermore, Prolog's roots in symbolic processing and language manipulation support a wide array of mechanisms for building expressive notations and languages for knowledge management, which serve well for hiding the less friendly aspects of RDF.
Some Useful Links
Here are a few more online Prolog tutorials:
- Adventure In Prolog
- Building Expert Systems in Prolog (read Adventure In Prolog first)
- Prolog Programming A First Course (an excellent starter)
- Quick Prolog (even better for a fast overview)
- Logic, Programming and Prolog (2ed) (the whole text in PDF)
And a few links to information about RDF and the Semantic Web:
- The W3C's Semantic Web Activity
- The RDF Model and Syntax Specification
- The RDF Interest Group and RDFIG IRC Scratchpad