Of Presidents and Ontologies

November 3, 2004

As I write this, the outcome of the elections in the United States is entirely uncertain. But, eventually, someone must win. Let's imagine it's George W. Bush. How would we define him in RDF?

First, we need to uniquely identify him. We've already created a Tag URI for him, like so: tag:hackingcongress.info,2004-10-05:Bush,George+W. Next, we make some assertions about him:

<Human rdf:about="tag:hackingcongress.info,2004-10-05:Bush,George+W">
 <FullName>Bush, George W.</FullName>
 <NickName>Dubya</NickName>
 <OfficeHolder>
  <rdf:Bag>
   <rdf:li rdf:resource="tag:hackingcongress.info,2004-10-05:/Governorship/TX/Bush,George+W"/>
   <rdf:li rdf:resource="tag:hackingcongress.info,2004-10-05:/U.S.+Presidency/Bush,George+W"/>
  </rdf:Bag>
 </OfficeHolder>
 <Arrested rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1976-10-04</Arrested>
 <MarriedTo rdf:resource="tag:hackingcongress.info,2004-10-05:/U.S.+People/Bush,Laura"/>
 <AdvisedBy>
  <rdf:Bag>
   <rdf:li rdf:resource="tag:hackingcongress.info,2004-10-05:/U.S.+Officials/Goss,Porter"/>
   <rdf:li rdf:resource="tag:hackingcongress.info,2004-10-05:/U.S.+Officials/Powell,Colin"/>
   <!-- other advisors go here -->
  </rdf:Bag>
 </AdvisedBy>
 <IntroducedLegislation>
  <rdf:Bag>
   <rdf:li rdf:resource="tag:hackingcongress.info,2004-10-05:Legislation/No+Child+Left+Behind"/>
   <rdf:li rdf:resource="tag:hackingcongress.info,2004-10-05:Legislation/Sarbanes-Oxley+Act"/>
   <!-- other legislation goes here -->
  </rdf:Bag>
 </IntroducedLegislation>
</Human>

I'll spare you more RDF, as I think it's plain where things are going. As you can see, this RDF points to a number of different resources (roles, people, legislation, and so forth), using a Tag URI for each. (If Thomas's site ever makes its XML versions of legislation publicly usable, we can use HTTP URIs to represent pieces of legislation instead. More about this in a future column.) We also dropped in the date when Bush was arrested for drunk driving, and indicated that the date is defined in terms of an XML Schema datatype. (When we create RDF for Clinton, we'll need an Impeached date.) We've introduced several new predicates to handle these new types of information, including MarriedTo, AdvisedBy, and IntroducedLegislation. We might also want a HasChildren predicate, with pointers to Tag URIs representing Bush's children. Each of these resources needs to be created as its own set of RDF statements.

Also note the rdf:Bag statements in the code above. A bag specifies an unordered collection of objects; this way you can list all of the advisors, and all of the legislation, at once. Notice that by adding just a pointer to individual pieces of legislation, we're counting on our ability to sort this data out later with some intelligent querying. That is, we know that a person has a role, and a role is bound by time; we know that a piece of legislation is introduced by a person with the role of OfficeHolder; and we know that the piece of legislation is introduced at a certain time. Given data in this format, we should be able to do a query that says, "Show all of the legislation introduced by George Bush while he was president," and expect consistent, accurate answers. For now, we've separated people from roles, so we can simply point to the roles a person has filled, and then when we define those roles, we can define their duration using more RDF. Our application, when we build it, will need to be aware of this separation. In a later column, we'll use an RDF query language -- in particular, we'll use SPARQL -- to explore the data set in exactly this way.
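To make that query concrete, here is a rough Python sketch that stands in for a triple store with a plain list of tuples. It is only an illustration of the idea, not the SPARQL we'll eventually write. The HeldFrom, HeldUntil, and IntroducedOn predicates, the two example bills, and every date are assumptions invented for the sketch; none of them are part of the data model so far.

```python
from datetime import date

BASE = "tag:hackingcongress.info,2004-10-05:"
BUSH = BASE + "Bush,George+W"
GOVERNORSHIP = BASE + "/Governorship/TX/Bush,George+W"
PRESIDENCY = BASE + "/U.S.+Presidency/Bush,George+W"
BILL_A = BASE + "Legislation/Example+Bill+A"  # hypothetical bill
BILL_B = BASE + "Legislation/Example+Bill+B"  # hypothetical bill

# (subject, predicate, object) triples. HeldFrom, HeldUntil, and IntroducedOn
# are hypothetical predicates, and all dates are placeholders, not real data.
TRIPLES = [
    (BUSH, "OfficeHolder", GOVERNORSHIP),
    (BUSH, "OfficeHolder", PRESIDENCY),
    (GOVERNORSHIP, "HeldFrom", date(1995, 1, 1)),
    (GOVERNORSHIP, "HeldUntil", date(2000, 12, 31)),
    (PRESIDENCY, "HeldFrom", date(2001, 1, 20)),  # no HeldUntil: still open
    (BUSH, "IntroducedLegislation", BILL_A),
    (BUSH, "IntroducedLegislation", BILL_B),
    (BILL_A, "IntroducedOn", date(1999, 6, 1)),
    (BILL_B, "IntroducedOn", date(2002, 3, 1)),
]


def objects(subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def introduced_while_in_role(person, role):
    """List legislation the person introduced during the role's time span."""
    if role not in objects(person, "OfficeHolder"):
        return []
    start = objects(role, "HeldFrom")[0]
    # A role with no recorded end date is treated as ongoing.
    end = (objects(role, "HeldUntil") or [date.max])[0]
    return sorted(
        bill
        for bill in objects(person, "IntroducedLegislation")
        if any(start <= day <= end for day in objects(bill, "IntroducedOn"))
    )


print(introduced_while_in_role(BUSH, PRESIDENCY))
```

The point of the sketch is that the person's record never mentions dates at all: the time span lives on the role, and the query joins the two.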

Thinking about Tag URIs

A user named "ajeru" commented on the last Hacking Congress, writing:

One problem I see with this version of your model, is that it doesn't allow you to do everything that the old model allowed you to do. In particular, you cannot easily extract a list of all current U.S. senators without parsing tag IDs.

This is true: the sample data is incomplete, and I should have made sure to say so. Under no circumstance should a Semantic Web application attempt to derive meaning from a URI string alone. Tag URIs have exactly one purpose: to let us quickly and consistently create unique identifiers that aren't intended to be dereferenced, without needing to register a URI scheme with anyone.

It's very important to build a mental wall between unique identifiers and the RDF itself. When you look at a Tag URI, it seems to have meaning -- there's an ISO date in there, and some words and slashes. But that appearance is deceptive, and if you're working with RDF, it's important to remember that such identifiers are there for your convenience, as a human person, and not for machines, which don't care about the same things you care about. Of course, this is easier to say than do; my brain (your brain might work better) wants them to have meaning.

OK, so how would we solve this problem? Well, we have RDF for every senatorial role, like so:

<Role rdf:about="tag:hackingcongress.info,2004-10-05:/U.S.+Senate/MA/2">
  <rdfs:label>Senator 2 from Massachusetts</rdfs:label>
</Role>

But this block of RDF isn't enough; it doesn't indicate that the person holding this role actually is a senator. We can fix that by first creating a role for senators.


<Role rdf:about="tag:hackingcongress.info,2004-10-05:/U.S.+Senator">
  <rdfs:label>A United States Senator</rdfs:label>
</Role>

And then, we add an rdf:type statement to the role.

<Role rdf:about="tag:hackingcongress.info,2004-10-05:/U.S.+Senate/MA/2">
  <rdfs:label>Senator 2 from Massachusetts</rdfs:label>
  <RepresentState rdf:resource="tag:hackingcongress.info,2004-10-05:/States/MA"/>
  <rdf:type rdf:resource="tag:hackingcongress.info,2004-10-05:/U.S.+Senator"/>
</Role>

Now we're able to identify which roles are of the type "senator" (that is, are of the type tag:hackingcongress.info,2004-10-05:/U.S.+Senator), and connect the roles with the humans who hold them. In this way, we'll be able to do different kinds of queries more easily, such as:

Show me all of the senators.
Show me all of the senators from Massachusetts.
Show me a specific person, and the roles he or she has filled.
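As a rough sketch of how those three queries work once the rdf:type statement is in place, here is some Python that treats the data as a list of tuples. Notice that it only ever compares whole identifiers and never looks inside a Tag URI. The New York role and the example person are hypothetical entries added so the queries have something to filter.

```python
BASE = "tag:hackingcongress.info,2004-10-05:"
SENATOR = BASE + "/U.S.+Senator"
MA = BASE + "/States/MA"
NY = BASE + "/States/NY"
MA_SEAT_2 = BASE + "/U.S.+Senate/MA/2"
NY_SEAT_1 = BASE + "/U.S.+Senate/NY/1"              # hypothetical role
SOMEONE = BASE + "/U.S.+People/Example,Person"      # hypothetical person

# (subject, predicate, object) triples mirroring the RDF above.
TRIPLES = [
    (MA_SEAT_2, "rdf:type", SENATOR),
    (MA_SEAT_2, "RepresentState", MA),
    (NY_SEAT_1, "rdf:type", SENATOR),
    (NY_SEAT_1, "RepresentState", NY),
    (SOMEONE, "OfficeHolder", MA_SEAT_2),
]


def senator_roles():
    """Show me all of the senators (as roles typed U.S.+Senator)."""
    return sorted(s for s, p, o in TRIPLES if p == "rdf:type" and o == SENATOR)


def senator_roles_from(state):
    """Show me all of the senators from one state."""
    in_state = {s for s, p, o in TRIPLES if p == "RepresentState" and o == state}
    return [role for role in senator_roles() if role in in_state]


def roles_of(person):
    """Show me a specific person's roles."""
    return sorted(o for s, p, o in TRIPLES if s == person and p == "OfficeHolder")


print(senator_roles_from(MA))
print(roles_of(SOMEONE))
```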

Next Steps

I'm finding lots of good, open-sourced data out there describing the U.S. government. One particularly promising place is GovTrack.us, which is already doing much of what we're trying to do with Hacking Congress, and even makes its data available in RDF. That's great -- the more we can slurp in, the better; the goal has never been to be the first, or even the best, but to see what benefits ordinary hackers and developers can derive from RDF and the Semantic Web.

I'm going to keep the details of transforming the House and the Supreme Court out of this column -- I'll make notes and XSL available, in time, but readers have already seen how to screen scrape and do document conversions, and there's no need to repeat ourselves.

Our next step is to get an RDF triple store running, and to get a site up, so that we can begin to explore our government data and work out how to improve it. At this point, there are two main contenders for our triple store: Kowari and Redland. Both are excellent tools, and both are open source, but they're very different in their designs. Kowari is in Java, scales very well, and has a large number of features that we probably don't need. It's very enterprise-friendly. Redland is in C, smaller, less complex, and has more native interfaces to more of the programming languages we care about.

Redland may not be as scalable as Kowari, but it's got a hacker-friendly, native-Unix feel that appeals. Since XML.com editor Kendall Clark is going to be working on the Hacking Congress Semantic Web site with me, and he's a Python fan, and because our data set is unlikely to get into the millions-of-triples area where Kowari excels, I've decided to go with Redland. The next column will start to document the process of taking our RDF data and "going live" with it.

"But wait," I hear some people saying, "this is a Semantic Web application. And the Semantic Web means you need an ontology. How can you get started without one?"

Where is the ontology?

Ontology development is an important part of the Semantic Web vision; and you might think that, in order to reap benefits from the Semantic Web, you need an ontology. But that's not entirely true, and while I took a first stab at building a congress-describing ontology in the first column of this series, I've decided to hold off on building one in the near future.

What is an ontology, with regard to the Semantic Web? I'm going to try to explain, but I'm not a logic expert, and readers should weigh in using the comment system with their own definitions. As far as most people are concerned, a Semantic Web ontology is a set of statements expressed using a Description Logic (DL).

A DL, the experts assure me, is a restricted form, or a fragment, of first-order predicate logic. There are lots of different kinds of DL, distinguished by which expressive mechanisms they allow. In general, however, the more expressive a DL is, the harder it is to write software to process it. There are some DLs for which we still don't even have good algorithms.

But DLs let you encode knowledge about a specific domain. One bit of knowledge might be "The Senate is a House of Congress;" another would be "The President is an elected official;" and a third might be, "There are at most two senators from any state in the Senate." When you create classes, subclasses, and properties using a DL, plus relations between them, as well as individuals which are instances of those classes, then you're building a knowledge base. But sometimes an RDF graph is called a knowledge base, too, since it's not a very strict term.

Importantly, many DLs are computationally decidable. This means that computers can be programmed to make inferences using those DLs within a finite amount of time, and often within a reasonable amount of time. You can't make that promise with full-blown first-order logic, which isn't decidable. Thus, ontology computing is often a tradeoff between expressivity and computational efficiency. You can say more in first-order predicate logic than you can say in DLs, but that's not much comfort if you might be waiting until after the heat death of the universe for an answer.

For the Semantic Web, the standard W3C ontology language is OWL, which stands for Web Ontology Language. (And yes, it should be WOL. It's OWL because, in Winnie the Pooh, the character Owl wrote his name "WOL." The W3C folks ... well, you can read the FAQ.) OWL is expressed in RDF.

There are three official varieties of OWL. They are:

OWL Lite
OWL DL
OWL Full

The OWL spec explains what each of these is, but in essence, OWL Lite is for people who need simple constraints on their data, and a way to express hierarchies, like those you might find in a biology textbook: "a human is a kind of primate; a primate is a kind of mammal." OWL DL adds a few other features, and makes it possible for classes to be subclasses of more than one class -- so, if you live in a society that uses flowers as its unit of currency, you can say that a flower is both a kind of plant, and a kind of currency. But it's still a DL. OWL Full is for people (or Semantic Web researchers, not all of whom are human) who want all the flexibility, and complexity, of first-order logic. Or that's the way it seems to me today. For those who want a bare minimum of ontology complexity, there's also the unofficial OWL Tiny, which is a subset of OWL Lite.
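To give a feel for the simplest inference a class hierarchy supports, here is a small Python sketch that follows "is a kind of" links upward. It is an illustration only, not how an OWL reasoner is implemented, and the SUBCLASS_OF table and superclasses function are names made up for the example.

```python
# Each class maps to its direct superclasses. A class may have more than one,
# as in the flower-as-currency example.
SUBCLASS_OF = {
    "Human": {"Primate"},
    "Primate": {"Mammal"},
    "Flower": {"Plant", "Currency"},
}


def superclasses(cls, hierarchy):
    """Return every class reachable by following subclass links upward."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in hierarchy.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(superclasses("Human", SUBCLASS_OF))
print(superclasses("Flower", SUBCLASS_OF))
```

Nothing in the table says outright that a human is a mammal; that statement is inferred by chaining the two that are there.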

So that's what an ontology is, but what it isn't is a set of rules. There's a fairly common misperception that a Semantic Web framework involves creating a heap of RDF data, writing an ontology, and sitting back while the computer sorts and indexes everything together. In a language like SQL, you can define your tables and your procedures together. But an ontology on its own is just a bunch of statements about the world.

One thing you might do, after creating an ontology, is to create a set of rules to operate upon it. You might write rules to do data validation against items in the knowledge base, using the ontology as a guide. For example, you might write a rule to ensure that, based on the cardinality restriction about senators and states, the knowledge base never says that there are more than two senators from any state at the same time. You can do an awful lot with a rule language and rule processor like N3 and cwm. You might be able to do everything you need with some rules, without using an OWL reasoner at all. But, in general, the Semantic Web is about using a variety of these techniques, rather than picking only one. The real goal is to use whatever will be the most efficient, given the kinds of requirements you have.
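As a sketch of what such a validation rule checks, here is the senator rule written as ordinary Python rather than in N3. It is not cwm's syntax or API, just the logic. The identifiers in the sample data are shortened, made-up stand-ins for Tag URIs, and the sketch ignores time, so it would wrongly flag a state whose seat changed hands; a real rule would also compare the dates on each role.

```python
from collections import defaultdict

# Deliberately faulty sample data: three different people hold Massachusetts
# senator roles. All identifiers here are hypothetical.
TRIPLES = [
    ("role:MA/1", "rdf:type", "U.S.+Senator"),
    ("role:MA/2", "rdf:type", "U.S.+Senator"),
    ("role:MA/3", "rdf:type", "U.S.+Senator"),
    ("role:NY/1", "rdf:type", "U.S.+Senator"),
    ("role:MA/1", "RepresentState", "state:MA"),
    ("role:MA/2", "RepresentState", "state:MA"),
    ("role:MA/3", "RepresentState", "state:MA"),
    ("role:NY/1", "RepresentState", "state:NY"),
    ("person:A", "OfficeHolder", "role:MA/1"),
    ("person:B", "OfficeHolder", "role:MA/2"),
    ("person:C", "OfficeHolder", "role:MA/3"),
    ("person:D", "OfficeHolder", "role:NY/1"),
]


def states_with_too_many_senators(triples, limit=2):
    """Return the states with more than `limit` distinct senator role holders."""
    senator_roles = {
        s for s, p, o in triples if p == "rdf:type" and o == "U.S.+Senator"
    }
    state_of = {
        s: o for s, p, o in triples
        if p == "RepresentState" and s in senator_roles
    }
    holders = defaultdict(set)
    for person, predicate, role in triples:
        if predicate == "OfficeHolder" and role in state_of:
            holders[state_of[role]].add(person)
    return sorted(state for state, people in holders.items() if len(people) > limit)


print(states_with_too_many_senators(TRIPLES))
```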

The main focus of Hacking Congress is to create a richly interlinked knowledge base of people, organizations, and ideas that people and software programs can explore and use -- think of it as a map, not a model. To accomplish this, I can make good use of an RDF query language, and plain RDF. But do I need to do reasoning? Probably not. At least, not right now. At this point, I don't need any kind of reasoning or inference, which means that I don't really need an OWL ontology. We want to use the simplest tool possible, until the complexity of the task and the real world force us to use something more advanced.

There are two things to note about all of this. First, I don't need an ontology to get started using RDF and applying Semantic Web technologies. In fact, because my data model keeps changing, it's probably in my best interest to hold off creating one for now: I don't want to formalize anything until I've got a firm grip on it. Second, if I need one, I can write it later, referring back to my existing data model.

Summing Up

That wraps up Hacking Congress this month; the electoral season is over in the United States, but the need for good information about the government remains. When we come back, we should have some working code and sample data, ready for others to play with.