XML.com: XML From the Inside Out

GovTrack.us, Public Data, and the Semantic Web
by Joshua Tauberer

So in the rest of this article I'll go over some of the design of the RDF version of GovTrack's data (which you can also download to play around with).

Here's some biographical data for Senator Schumer from New York:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pol: <tag:govshare.info,2005:rdf/politico/> .
@prefix usgov: <tag:govshare.info,2005:rdf/usgovt/> .
@prefix people: <tag:govshare.info,2005:data/us/congress/people/> .

people:S000148
    rdf:type pol:Politician ;
    foaf:name "Charles Schumer" ;
    foaf:gender "male" ;
    usgov:party "Democrat" .

This is RDF data in a format called Notation 3, which is a nice alternative to the XML format of RDF. Since RDF is really an abstract way to represent information, rather than a particular data format, we're always free to choose the serialization syntax that's easiest to read (or scribble or exchange) for the task and data at hand. (When in doubt about the syntax, run it through a validator to see what the underlying triples are.) This data is intended to mean that the entity identified by the URI tag:govshare.info,2005:data/us/congress/people/S000148 (put together by simply concatenating the prefix URI with the local name) is a politician, has the name given, is a Democrat, etc.
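
The concatenation rule is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not how an N3 parser actually works: the PREFIXES table copies the @prefix declarations above, and expand is a hypothetical helper name chosen for this example.

```python
# Prefix table copied from the @prefix declarations in the example above.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pol": "tag:govshare.info,2005:rdf/politico/",
    "usgov": "tag:govshare.info,2005:rdf/usgovt/",
    "people": "tag:govshare.info,2005:data/us/congress/people/",
}


def expand(qname: str) -> str:
    """Expand a prefixed name such as 'people:S000148' into a full URI.

    Hypothetical helper: it splits on the first colon only, because the
    local name or the prefix URI may itself contain colons.
    """
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


print(expand("people:S000148"))
print(expand("rdf:type"))
```

A real parser also handles relative URIs, escapes, and the empty prefix, all of which this sketch ignores.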

In addition to the data above, there is RDF data about Schumer's role in Congress, including the state he represents. This is where some real modeling choices came in. There are a number of sensible ways to relate a politician to the region he or she represents. Here's one:

people:S000148 pol:represents "New York" .

This is very to-the-point. Schumer represents New York. It's accurate enough, but not particularly precise. The literal expression "New York" isn't very informative. New York State or New York City? We could get around this problem by stating in the pol: vocabulary that pol:represents only refers to states and not cities, and that would be a fine solution if that restriction were acceptable. But we can make a small change to make it better:

@prefix states: <tag:govshare.info,2005:data/us/> .
people:S000148 pol:represents states:ny .

Now it's very precise. Except, when a computer reads in the URI tag:govshare.info,2005:data/us/ny it has no idea what that means. So we have to list somewhere else:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
states:ny rdf:type <tag:govshare.info,2005:rdf/usgovt/State> .
states:ny dc:isPartOf <tag:govshare.info,2005:data/us> .   # i.e., the United States

The computer may have no idea what tag:govshare.info,2005:rdf/usgovt/State means either, but at least it knows it's the same type of thing as the other states. Or the application writer can assign a special meaning to the URI tag:govshare.info,2005:rdf/usgovt/State.

Using a URI rather than a literal value also lets you, or others, contribute information about the entity. If I'm publishing information on Congress and someone else transforms some census data into this:

@prefix census: <tag:govshare.info,2005:rdf/census/> .
states:ny census:population "18976457" .

then one can immediately start writing queries that bridge the two data sets.
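
To make the bridging concrete, here is a minimal Python sketch that treats each data set as a set of (subject, predicate, object) tuples and merges them with a plain set union. It stands in for a real RDF store and query language; the variable names and the population_represented function are assumptions made for this example, and the triples are the ones shown above.

```python
# Namespace URIs taken from the @prefix declarations in the article.
STATES = "tag:govshare.info,2005:data/us/"
PEOPLE = "tag:govshare.info,2005:data/us/congress/people/"
POL = "tag:govshare.info,2005:rdf/politico/"
CENSUS = "tag:govshare.info,2005:rdf/census/"

# Two independently published data sets, each a set of triples.
congress_data = {
    (PEOPLE + "S000148", POL + "represents", STATES + "ny"),
}
census_data = {
    (STATES + "ny", CENSUS + "population", "18976457"),
}


def population_represented(person, graph):
    """Return the population literals of every region `person` represents."""
    regions = {o for s, p, o in graph
               if s == person and p == POL + "represents"}
    return [o for s, p, o in graph
            if s in regions and p == CENSUS + "population"]


# Merging RDF graphs is just a union of their triples; the shared URI
# for New York is what links the two sources.
merged = congress_data | census_data
print(population_represented(PEOPLE + "S000148", merged))
```

Had the congressional data used the literal "New York" instead of a URI, the second lookup would have nothing reliable to match against.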

This is a fine way of representing the information. Beyond this point, the modeling choices become a real trade-off between simplicity and informativeness. There are two shortcomings with the representation of the pol:represents relation above. The first is that it misses the generalization that anyone who is a senator from New York represents New York. Or, rather, it's not an inherent property of Schumer that he represents New York; it holds in virtue of another property of his, which is holding the office of senator. So we should revise the information like this:

@prefix senate: <tag:govshare.info,2005:data/us/congress/senate/> .
people:S000148 pol:holdsOffice senate:ny .
senate:ny pol:represents states:ny .

That's more informative, at the cost of being more complex to create and query.
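
The extra query cost is easy to see in a small Python sketch. The question "which regions does this person represent?" used to be a single lookup; with the office in the middle it becomes two hops. The prefixed names are kept as opaque strings for brevity, and objects and regions_represented are hypothetical helper names, not part of any GovTrack code.

```python
# The two triples from the revised model, with prefixed names kept
# as plain strings for readability.
graph = {
    ("people:S000148", "pol:holdsOffice", "senate:ny"),
    ("senate:ny", "pol:represents", "states:ny"),
}


def objects(graph, subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def regions_represented(graph, person):
    """Follow person -> office -> region, the two-hop path in the new model."""
    regions = set()
    for office in objects(graph, person, "pol:holdsOffice"):
        regions |= objects(graph, office, "pol:represents")
    return regions


print(regions_represented(graph, "people:S000148"))
```

The payoff is that anyone else recorded as holding senate:ny is found to represent New York without a further statement.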

The second shortcoming is a pervasive problem in any representation of the real world, and it's that the world isn't static. There are two ways to look at this. First, it's not an inherent property of Schumer that he holds the office of senator. Compare that to the assertion above that New York is a part of the United States, which we could reasonably say is a time-invariant truth. The second perspective is that this information may be correct now, but it won't be when Schumer leaves office. So when we write RDF, are we asserting time-invariant information or information that's claimed to be true only at the time of writing?

The answer is, we don't know. Some predicates are time-sensitive, some are time-invariant. Lots of information in RDF out there on the internet is time-sensitive with no indication of the time that it was written, or how long it might be correct for. This is a problem we'll have to deal with in the future.

So while it would be appropriate to leave the design as time-sensitive, GovTrack goes a step further and models the time that someone holds an office:

@prefix time: <http://pervasive.semanticweb.org/ont/2004/06/time#> .
people:S000148 pol:hasRole
    [
        rdf:type  pol:Term ;
        time:from [ time:at "2005-01-01" ] ;
        time:to   [ time:at "2010-12-31" ] ;
        pol:forOffice senate:ny
    ] .

The practical benefit of this is that GovTrack can include historical data this way without it seeming like George Washington is still in office.
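
A rough Python sketch of how an application might use those dates follows. It flattens each pol:Term node into a tuple, which is an assumption made to keep the example short; the holders function is hypothetical, and the only row of data is the Schumer term shown above.

```python
from datetime import date

# Each tuple mirrors one pol:Term node: (person, office, from, to).
# The dates are the ones given in the example above.
roles = [
    ("people:S000148", "senate:ny",
     date.fromisoformat("2005-01-01"), date.fromisoformat("2010-12-31")),
]


def holders(roles, office, on):
    """Return the people whose term for `office` covers the date `on`."""
    return [person for person, off, start, end in roles
            if off == office and start <= on <= end]


# Inside the recorded term the holder is found; after it, nobody is.
print(holders(roles, "senate:ny", date(2006, 2, 1)))
print(holders(roles, "senate:ny", date(2011, 6, 1)))
```

Historical terms can sit in the same list, since a query for a given date simply skips any role whose interval does not contain it.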

In the Semantic Web, it's easy to get caught up in the theory. Modeling issues are fun to think about (at least for me), but it's good to have a practical application too, at least from time to time. Stay tuned for a future article where I'll get into the nuts and bolts of bringing government data onto the Semantic Web.

As a final note: as an American, my knowledge of my own government is fair and my knowledge of the outside world is, alas, minimal. I encourage you to post comments below about the Semantic Web and government in other countries.
