XML.com: XML From the Inside Out

Query Census Data with RDF
by Joshua Tauberer

Getting Answers out of the RDF

With the data now in RDF, working with the data becomes a lot simpler because there are already RDF toolkits for various languages. Compare that with the number of free libraries available for accessing census data (zero?).

One toolkit is RDFLib for Python. With it, we can load the RDF file in Turtle format into memory, find the entity that denotes (represents) Philadelphia, and then get that entity's population.

To load the file into memory, create a new Graph object and then call its parse method. A Graph is an in-memory place to put the RDF statements found in the file.

from rdflib import Graph, Namespace, Literal

store = Graph()
store.parse("census-data.n3", format="n3")

The next step is to get a reference to the entity denoting Philadelphia. There are two ways to do this. If you know the URI of Philadelphia already, you're done. But, let's say all we have is the name entered by a user. Then, what we need to do is ask the Graph for the entity whose dc:title is "Philadelphia". Here's the code:

dc = Namespace("http://purl.org/dc/elements/1.1/")
philadelphia = store.value(None, dc["title"], Literal("Philadelphia"))

The value method takes three arguments, a subject, a predicate, and an object, exactly one of which must be None. It then scans through the statements in the graph looking for ones that match the arguments, i.e. ones whose predicate is dc:title and object is the literal value "Philadelphia". When it finds a matching statement, it returns the field that was None, in this case the subject. Recall that the Turtle-formatted RDF really translates into three statements:

<tag:govtrack.us,2006:...ia/Philadelphia> dc:title "Philadelphia" .
<tag:govtrack.us,2006:...ia/Philadelphia> census:population "1517550" .
<tag:govtrack.us,2006:...ia/Philadelphia> census:households "661958" .

The first statement matches the filter, so the method returns the subject, <tag:govtrack.us,2006:us/Pennsylvania/Philadelphia>.
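To make the matching rule concrete, here is a minimal sketch in plain Python of what a value-style lookup does with these three statements. It is not RDFLib's implementation: the `value` function, the tuple representation, and the abbreviated predicate strings such as "dc:title" are assumptions made for illustration.

```python
# Plain-Python stand-in for a value-style lookup, for illustration only.
# Each statement is a (subject, predicate, object) tuple of strings.
PHILLY = "tag:govtrack.us,2006:us/Pennsylvania/Philadelphia"

statements = [
    (PHILLY, "dc:title", "Philadelphia"),
    (PHILLY, "census:population", "1517550"),
    (PHILLY, "census:households", "661958"),
]

def value(statements, subject, predicate, obj):
    """Return the field left as None from the first matching statement."""
    pattern = (subject, predicate, obj)
    if pattern.count(None) != 1:
        raise ValueError("exactly one of the three arguments must be None")
    missing = pattern.index(None)
    for statement in statements:
        # A statement matches when every non-None field is equal.
        if all(want is None or want == got
               for want, got in zip(pattern, statement)):
            return statement[missing]
    return None  # no statement matched

philadelphia = value(statements, None, "dc:title", "Philadelphia")
population = value(statements, philadelphia, "census:population", None)
print(philadelphia)
print(population)
```

The first call returns the subject of the dc:title statement, and the second call feeds that subject back in to retrieve the object of the census:population statement.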

The next step is to get that entity's population. The same method is used, but this time the object part is None.

census = Namespace("tag:govshare.info,2005:rdf/census/")
population = store.value(philadelphia, census["population"], None)

Now the second statement matches, putting the literal value "1517550" into the population variable.

Let's do something a little more complicated, like finding the average population of the 50 states. A common part of an API for RDF is a method to loop through statements that match a pattern. In RDFLib, the method is aptly called triples. (A triple is another name for an RDF statement.) For this example, we need to find the entities in the entire graph that denote states, and what makes an entity a state is whether it is the subject of a statement that ends with rdf:type usgovt:State. So what we need to do is loop through all of the statements that look like "____ rdf:type usgovt:State".

rdf = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")
usgovt = Namespace("tag:govshare.info,2005:rdf/usgovt/")

for statement in store.triples((None, rdf["type"], usgovt["State"])):
    state = statement[0] # the subject is the first element

The triples method returns an iterable of statements. Each statement is a three-element tuple: the first element is the subject, the second is the predicate, and the third is the object.
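Pattern matching with a wildcard can be sketched in a few lines of plain Python. This is only an illustration of the idea, not RDFLib's code: the `triples` function below is a hypothetical stand-in, and "ex:NY" and "ex:CA" are made-up identifiers.

```python
def triples(statements, pattern):
    """Yield each statement that matches pattern; None acts as a wildcard."""
    for statement in statements:
        if all(want is None or want == got
               for want, got in zip(pattern, statement)):
            yield statement

statements = [
    ("ex:NY", "rdf:type", "usgovt:State"),
    ("ex:CA", "rdf:type", "usgovt:State"),
    ("ex:NY", "dc:title", "New York"),
]

states = []
# Unpacking the tuple names its three parts instead of indexing with [0].
for subject, predicate, obj in triples(statements, (None, "rdf:type", "usgovt:State")):
    states.append(subject)
print(states)
```

Because the function is a generator, matching statements are produced one at a time as the loop asks for them.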

Now we can get the population as before and sum it up:

totalpop = 0
statecount = 0

for statement in store.triples((None, rdf["type"], usgovt["State"])):
    state = statement[0] # the subject is the first element
    population = store.value(state, census["population"], None)
    population = int(population) # convert from Literal to integer
    totalpop = totalpop + population
    statecount = statecount + 1

print(totalpop // statecount)  # average population, rounded down
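The loop above assumes every state has a population statement. If a lookup finds nothing and hands back None, converting it with int would raise an error. A minimal sketch of a more defensive average, in plain Python, follows. The `average_population` helper is hypothetical, and the sample values are the Census 2000 counts for California, Texas, and New York plus one deliberately missing entry.

```python
def average_population(populations):
    """Average the figures that are present; None marks a missing figure."""
    known = [int(p) for p in populations if p is not None]
    if not known:
        return None  # nothing to average
    # float() keeps the fractional part on both Python 2 and Python 3.
    return sum(known) / float(len(known))

sample = ["33871648", "20851820", "18976457", None]
average = average_population(sample)
print(average)
```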

The complete Python source is also posted.

Meshing Data Sources

Any information can be modeled with RDF, and RDF really shines when one program can access entirely different data sources with ease. So, what other information is related to some aspect of the census? The census contains information about U.S. states, and U.S. states have senators in Congress--and as you know from my last article, I've already created RDF files about the members of Congress.

Recall this snippet of RDF about Senator Charles Schumer of New York:

# people:S00148 is the URI for the senator
people:S00148 pol:hasRole [
    time:from [ time:at "2005-01-01" ] ;
    time:to   [ time:at "2010-12-31" ] ;
    pol:forOffice senate:ny ] .

# senate:ny represents the office of senator for New York
senate:ny pol:represents <tag:govshare.info,2005:data/us/NewYork> .

Since I created both of these data sets, I chose to use the same URI to denote New York in both sets. Having made that choice, the two data sets are connected through those entities. When loaded into the same Graph object, requests for triples can seamlessly span the data sets.
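The effect of a shared URI can be sketched with two small sets of tuples. This is an illustration in plain Python, not RDFLib: the set union stands in for loading both files into one Graph, the `objects` helper is hypothetical, and attaching the population figure (the Census 2000 count for New York) to this particular URI is an assumption.

```python
NEW_YORK = "tag:govshare.info,2005:data/us/NewYork"

# One statement from each data set; both mention the same URI.
census_data = {
    (NEW_YORK, "census:population", "18976457"),
}
congress_data = {
    ("senate:ny", "pol:represents", NEW_YORK),
}

# Loading both files into one graph is modeled here as a set union.
merged = census_data | congress_data

def objects(statements, subject, predicate):
    """Return the objects of all statements with this subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]

# Hop from the office (congress data) to the state,
# then from the state to its population (census data).
state = objects(merged, "senate:ny", "pol:represents")[0]
population = objects(merged, state, "census:population")[0]
print(population)
```

Neither data set alone can answer "what is the population of the state this office represents"; the merged set can, purely because the two sources agree on the identifier.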

Let's try asking for the senators who represent the most populous state. The examples are getting a little academic, I know, but as more information makes its way into RDF, it will become possible to answer more exciting questions.

As with finding the average state population, we will loop through the entities that are typed as a usgovt:State. But after finding the most populous state, we will also ask for its senators. This gets tricky because the entities denoting senators are only indirectly linked to their states. Starting with the state, the pol:represents predicate takes us backwards to an office that represents the state, then the pol:forOffice predicate takes us backwards to an abstract notion of a role held by a person, and lastly pol:hasRole takes us backwards to the person who holds that role. Then to get the name we follow the foaf:name predicate forward.
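The first half of that plan, finding the most populous state, can be sketched as follows. This is plain Python over toy tuples rather than an RDFLib Graph: "ex:CA", "ex:NY", and "ex:TX" are made-up identifiers, `population_of` is a hypothetical helper, and the figures are Census 2000 counts.

```python
statements = [
    ("ex:CA", "rdf:type", "usgovt:State"),
    ("ex:CA", "census:population", "33871648"),
    ("ex:NY", "rdf:type", "usgovt:State"),
    ("ex:NY", "census:population", "18976457"),
    ("ex:TX", "rdf:type", "usgovt:State"),
    ("ex:TX", "census:population", "20851820"),
]

def population_of(statements, state):
    """Return the state's population as an int, or None if it is missing."""
    for s, p, o in statements:
        if s == state and p == "census:population":
            return int(o)
    return None

maxstate = None
maxpop = -1
for s, p, o in statements:
    if p == "rdf:type" and o == "usgovt:State":
        pop = population_of(statements, s)
        # Keep the state with the largest population seen so far.
        if pop is not None and pop > maxpop:
            maxstate, maxpop = s, pop
print(maxstate)
```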

To follow predicates backwards, use the subjects method. It takes a predicate and an object and returns all of the subjects found in statements with that predicate and object.
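As a rough picture of what following a predicate backwards means, here is a plain-Python sketch over the Schumer statements. It is not RDFLib's implementation: the `subjects` function is a stand-in, and "_:role1" is a made-up label for the blank node that the square brackets create.

```python
def subjects(statements, predicate, obj):
    """Yield the subject of every statement with this predicate and object."""
    for s, p, o in statements:
        if p == predicate and o == obj:
            yield s

NEW_YORK = "tag:govshare.info,2005:data/us/NewYork"
statements = [
    ("people:S00148", "pol:hasRole", "_:role1"),
    ("_:role1", "pol:forOffice", "senate:ny"),
    ("senate:ny", "pol:represents", NEW_YORK),
]

# One hop backwards: which offices represent New York?
offices = list(subjects(statements, "pol:represents", NEW_YORK))
# A second hop backwards: which roles are for those offices?
roles = [r for office in offices
         for r in subjects(statements, "pol:forOffice", office)]
print(offices)
print(roles)
```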

Note how the structure of the RDF snippet above guides how the data is accessed:

# 'maxstate' holds the entity denoting the most populous state
for office in store.subjects(pol["represents"], maxstate):
    for role in store.subjects(pol["forOffice"], office):
        for person in store.subjects(pol["hasRole"], role):
            print(store.value(person, foaf["name"], None))

Running this prints out a whole bunch of people, most of whom aren't relevant: they used to represent California, the most populous state, but don't anymore. (There are also some errors in the data for former Congress members, causing some to show up erroneously.) I chose to model Congress historically, with dates within the RDF data, rather than as a snapshot of the current Congress. As a result, we have to look only at roles that aren't over yet, i.e. roles that will end at some point in the future rather than ones that have already ended.

for office in store.subjects(pol["represents"], maxstate):
    for role in store.subjects(pol["forOffice"], office):
        # follow time:to, then time:at, to reach the role's end date
        enddate1 = store.value(role, time["to"], None)
        enddate2 = store.value(enddate1, time["at"], None)
        # dates in YYYY-MM-DD form compare correctly as strings
        if str(enddate2) > "2006-03-27":
            for person in store.subjects(pol["hasRole"], role):
                print(store.value(person, foaf["name"], None))

Now the output is correct: Barbara Boxer and Dianne Feinstein, the senators from California.
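The string comparison in the date check works because dates written as YYYY-MM-DD sort alphabetically in the same order as they sort chronologically. A minimal sketch that compares real date objects instead, which would also cope with a changing reference date, might look like this. The helper names are hypothetical, and "2005-01-03" is a made-up end date for an expired role.

```python
from datetime import date, datetime

def parse_iso(text):
    """Parse a YYYY-MM-DD string into a date object."""
    return datetime.strptime(text, "%Y-%m-%d").date()

def role_is_current(end_text, today):
    """A role is current when it ends after the reference date."""
    return parse_iso(end_text) > today

today = date(2006, 3, 27)  # could be date.today() in a live program
print(role_is_current("2010-12-31", today))
print(role_is_current("2005-01-03", today))
```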
