
Query Census Data with RDF

April 12, 2006

Joshua Tauberer

Let's say you're curious to find out the population of the city you live in. Go to Google and ask it "population of philadelphia, pa." Ha! We're too smart for Google. It tries to answer the question but comes up with "Pennsylvania – Population: 12,281,054 ; 6th, 12/00." That's not what we were looking for. Now, Google understood the question. We can tell because it gave us a special answer. But it couldn't come up with the right answer because it can't reliably understand the pages containing answers. We can't blame Google for that, really.

Hold that thought for a while. At the end of this article I'll demonstrate a simple Python program using the Semantic Web that can answer questions about population and more. First, though, we need to find some population statistics.

The U.S. government is a treasure trove of structured information. In my last article I talked about legislative information, but there's much more--gigabytes upon gigabytes more. The 2000 Census compiled tons of population statistics. Let's get some of it into the Semantic Web.

RDFizing the Census

I'll grab one small 14MB slice of data out of the census records and turn it into RDF. The file is called usgeo.uf1, and it contains basic population and geographic data for the United States as a whole, the states, and their counties, towns, and "places." This file, like all of the census files, is downloadable from the Census Bureau's FTP site, where the documentation also lives.

This file isn't pretty. It has columns in a fixed-width format, with no indication in the file what the columns mean. If any file highlights the benefits of XML, this one does. Here's the entry for Philadelphia, broken down to fit the width of this page:

uSF1F US05000000  027049212234210121    0  61622377N61609999900
349881748      19544394Philadelphia County
CN  1517550   661958+39998012-07514479306

The census documentation isn't simple, but it is comprehensive. With a short Perl script, it's easy to extract all of the information. Each line gets split on certain columns:

@FIELDS = (FILEID, STUSAB, SUMLEV, GEOCOMP, ...);          # the name of each field, in order
%FIELDSIZE = (FILEID => 6, STUSAB => 2, SUMLEV => 3, ...); # the width of each field

$start = 0;
foreach $field (@FIELDS) {
    # pull out the next fixed-width column and record it by name
    $value = substr($line, $start, $FIELDSIZE{$field});
    $start += $FIELDSIZE{$field};
    $info{$field} = $value;
}

After that, %info contains all of the information for that record. Next, the information just gets written out in RDF. Turtle (or Notation 3, a related syntax) is easiest for this. The goal is to produce something like this for each of the 497,515 states, counties, etc. in the census file:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix census: <tag:govshare.info,2005:rdf/census/> .

<some URI representing Philadelphia>
    dc:title "Philadelphia" ;
    census:population "1517550" ;
    census:households "661958" .

(When in doubt, use a validator to see what triples are generated by the syntax.)

dc:title is a very common predicate. It's used in most RSS feeds to indicate the title of the feed, and it makes reasonable sense to reuse it here to relate a place to its name. For population and number of households, I'll just make up new predicates.

Before we can print the Turtle-formatted RDF, we need to create a URI to identify each of the regions covered by the census.

The first question to ask is whether there is an existing URI in use to represent U.S. states, counties, etc. A search on Swoogle shows one existing URI for Philadelphia, based on an RDF representation of WordNet, a machine-readable dictionary created at Princeton. If WordNet had an entry for every city, we might reuse those URIs. Since it doesn't, we'll just make up new URIs for everything.

Having two URIs for Philadelphia isn't the best situation, but it's not critical. The downside is that the two data sources (RDF-WordNet and the census) won't relate to each other, but on the other hand they never did.

Then, what URI should these things be given? It would take too long to assign each city a URI by hand, so we'll have Perl generate a URI for each entity by combining the URI of the thing containing the entity with a slash and then the name of the entity itself. If the United States gets the URI <tag:govshare.info,2005:data/us> (arbitrarily), then Pennsylvania gets <tag:govshare.info,2005:data/us/Pennsylvania> and Philadelphia gets <tag:govshare.info,2005:data/us/Pennsylvania/Philadelphia>. Since no state has two counties with the same name (and likewise at the other levels), this guarantees that no URI is accidentally used to represent two different things. (To guarantee that each result is a legitimate URI, the names are normalized by removing spaces and other problematic characters.)
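To make the scheme concrete, here is a small Python sketch of the URI construction. (The posted script does this in Perl; the normalize helper below is illustrative, not code from the script.)

import re

def normalize(name):
    # drop spaces and other characters that would make an illegal URI
    return re.sub(r'[^A-Za-z0-9]', '', name)

def make_uri(container_uri, name):
    # an entity's URI is its container's URI, a slash, and its normalized name
    return container_uri + "/" + normalize(name)

us = "tag:govshare.info,2005:data/us"
pa = make_uri(us, "Pennsylvania")      # tag:govshare.info,2005:data/us/Pennsylvania
philly = make_uri(pa, "Philadelphia")  # .../us/Pennsylvania/Philadelphia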

At first glance, the RDF looks redundant: Philadelphia appears twice, once in the URI and again as the object of dc:title. Couldn't a consumer get the name by just taking everything after the last slash? That would be bad design. It's important to avoid putting information within the URI, because URIs aren't structured in any meaningful way. A consumer might notice that the strings Pennsylvania and Philadelphia appear in the URI, but it wouldn't know why they're relevant to the entity.

I've posted the complete Perl script that loops through the data file and prints out RDF for each line in the file, as well as the input data file and the resulting RDF.

Getting Answers out of the RDF

With the data now in RDF, working with it becomes a lot simpler because RDF toolkits already exist for various languages. Compare that with the number of free libraries available for accessing raw census data (zero?).

One toolkit is RDFLib for Python. With it, we can load the RDF file in Turtle format into memory, find the entity that denotes (represents) Philadelphia, and then get that entity's population.

To load the file into memory, create a new Graph object and then call its parse method. A Graph is an in-memory place to put the RDF statements found in the file.

from rdflib import Graph, Namespace, Literal

store = Graph()
store.parse("census-data.n3", format="n3")
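As a quick sanity check, a Graph knows how many statements it holds (len works on RDFLib graphs), so you can confirm the parse found the triples you expected:

print len(store) # the number of statements now in the graph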

The next step is to get a reference to the entity denoting Philadelphia. There are two ways to do this. If you know the URI of Philadelphia already, you're done. But, let's say all we have is the name entered by a user. Then, what we need to do is ask the Graph for the entity whose dc:title is "Philadelphia". Here's the code:

dc = Namespace("http://purl.org/dc/elements/1.1/")
philadelphia = store.value(None, dc["title"], Literal("Philadelphia"))

The value method takes three arguments, a subject, a predicate, and an object, exactly one of which must be None. It then scans through the statements in the graph looking for ones that match the arguments, i.e. ones whose predicate is dc:title and object is the literal value "Philadelphia". When it finds a matching statement, it returns the field that was None, in this case the subject. Recall that the Turtle-formatted RDF really translates into three statements:

<tag:govshare.info,2005:...ia/Philadelphia> dc:title "Philadelphia" .
<tag:govshare.info,2005:...ia/Philadelphia> census:population "1517550" .
<tag:govshare.info,2005:...ia/Philadelphia> census:households "661958" .

The first statement matches the filter, so the method returns the subject, <tag:govshare.info,2005:data/us/Pennsylvania/Philadelphia>.

The next step is to get that entity's population. The same method is used, but this time the object part is None.

census = Namespace("tag:govshare.info,2005:rdf/census/")
population = store.value(philadelphia, census["population"], None)

Now the second statement matches, putting the literal value "1517550" into the population variable.

Let's do something a little more complicated, like finding the average population of the 50 states. A common part of an API for RDF is a method to loop through statements that match a pattern. In RDFLib, the method is aptly called triples. (A triple is another name for an RDF statement.) For this example, we need to find the entities in the entire graph that denote states, and what makes an entity a state is whether it is the subject of a statement that ends with rdf:type usgovt:State. So what we need to do is loop through all of the statements that look like "____ rdf:type usgovt:State".

rdf = Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#")
usgovt = Namespace("tag:govshare.info,2005:rdf/usgovt/")

for statement in store.triples((None, rdf["type"], usgovt["State"])):
    state = statement[0] # the subject is the first element

The triples method returns a sequence of matching statements. Each statement is represented as a tuple: the first element is the subject, the second the predicate, the third the object.
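Since a statement is just a three-element tuple, you can also unpack it right in the loop header, which reads a little more naturally:

for subj, pred, obj in store.triples((None, rdf["type"], usgovt["State"])):
    state = subj # same as statement[0] above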

Now we can get the population as before and sum it up:

totalpop = 0
statecount = 0

for statement in store.triples((None, rdf["type"], usgovt["State"])):
    state = statement[0] # the subject is the first element
    population = store.value(state, census["population"], None)
    population = int(population) # convert from Literal to integer
    totalpop = totalpop + population
    statecount = statecount + 1

print totalpop/statecount # integer division truncates, which is fine for a rough average

The complete Python source is also posted.

Meshing Data Sources

Any information can be modeled with RDF, and RDF really shines when one program can access entirely different data sources with ease. So, what other information is related to some aspect of the census? The census contains information about U.S. states, and U.S. states have senators in Congress--and as you know from my last article, I've already created RDF files about the members of Congress.

Recall this snippet of RDF about Senator Charles Schumer of New York:

# people:S00148 is the URI for the senator
people:S00148 pol:hasRole [
    time:from [ time:at "2005-01-01" ] ;
    time:to   [ time:at "2010-12-31" ] ;
    pol:forOffice senate:ny ] .

# senate:ny represents the office of senator for New York
senate:ny pol:represents <tag:govshare.info,2005:data/us/NewYork> .

Since I created both of these data sets, I chose to use the same URI to denote New York in both sets. Having made that choice, the two data sets are connected through those entities. When loaded into the same Graph object, requests for triples can seamlessly span the data sets.
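Meshing them takes no special machinery: parse both files into the same Graph and the statements accumulate. (census-data.n3 is the file from earlier; the name of the congress file here is just a placeholder for whichever of my posted RDF files you grab.)

store = Graph()
store.parse("census-data.n3", format="n3")
store.parse("congress.n3", format="n3") # placeholder name for the congress data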

Let's try asking for the senators who represent the most populous state. The examples are getting a little academic, I know, but as more information makes its way into RDF, it will become possible to answer more interesting questions.

As with finding the average state population, we will loop through the entities that are typed as a usgovt:State. But after finding the most populous state, we will also ask for its senators. This gets tricky because the entities denoting senators are only indirectly linked to their states. Starting with the state, the pol:represents predicate takes us backwards to an office that represents the state, then the pol:forOffice predicate takes us backwards to an abstract notion of a role held by a person, and lastly pol:hasRole takes us backwards to the person who holds that role. Then to get the name we follow the foaf:name predicate forward.

To follow predicates backwards, use the subjects method. It takes a predicate and an object and returns all of the subjects found in statements with that predicate and object.

Note how the structure of the RDF snippet above guides how the data is accessed:

# 'maxstate' holds the entity denoting the most populous state
# (pol and foaf are Namespace objects, defined like the others above)
for office in store.subjects(pol["represents"], maxstate):
    for role in store.subjects(pol["forOffice"], office):
        for person in store.subjects(pol["hasRole"], role):
            print store.value(person, foaf["name"], None)

Running this prints out a whole bunch of people, most of whom aren't relevant because they used to represent California but don't anymore. (There are also some errors in the data for former Congress members, causing some to erroneously show up.) I chose to model Congress historically, with dates within the RDF data, rather than as a snapshot of the current Congress. As a result, we have to look only at roles that aren't over yet, i.e. roles that will end at a point in the future (versus having already ended).

for office in store.subjects(pol["represents"], maxstate):
    for role in store.subjects(pol["forOffice"], office):
        # follow time:to and then time:at to find when the role ends
        enddate1 = store.value(role, time["to"], None)
        enddate2 = store.value(enddate1, time["at"], None)
        # keep only roles that end in the future, i.e. current roles
        if str(enddate2) > "2006-03-27":
            for person in store.subjects(pol["hasRole"], role):
                print store.value(person, foaf["name"], None)

Now the output is correct: Barbara Boxer and Dianne Feinstein, the senators from California.

Question-Answering

Question-answering — like asking Google for the population of Philadelphia — is where I see the Semantic Web making its most important contribution to the world. Remember that the problem Google had is that it can't understand the information on web pages. Clearly, if we want to build a system that can do that, that can understand knowledge spread throughout the Internet, we all need to be using some common framework for representing knowledge, like RDF.

So let's go ahead and write a little question-answering system over the census data we've been using. It should recognize questions like this:

what is the ____ of _____ ?

ex. what is the population of California?

It's actually quite easy to get something crude working. Using a regular expression, the two blanks in the question can be extracted:

import re
import sys

# the question comes in on the command line, e.g.:
#   python qa.py what is the population of California?
question = " ".join(sys.argv[1:])

m = re.search(r'what is the (.*) of (.*?)\??$', question)
if m:
    predicatename = m.group(1)
    entityname = m.group(2)
    # do more processing
else:
    print "I don't understand the question."

Then we have to find the RDF entities that match the predicate and entity names given in the question. For the entities, we can use the dc:title predicate:

entity = store.value(None, dc["title"], Literal(entityname))

To find the predicate, we don't have any RDF statements that relate a predicate to a human-readable name. That is, we lack this:

census:population rdfs:label "population" .

That's the kind of statement you would find in an RDF schema. If we had it available, we would use the same technique that we used with dc:title, just with rdfs:label.
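If we did have it, the lookup would be a one-liner (a hypothetical sketch, since the schema statement above isn't in our data):

rdfs = Namespace("http://www.w3.org/2000/01/rdf-schema#")
predicate = store.value(None, rdfs["label"], Literal(predicatename))

Since we don't, we can fall back to looking at the URIs of the predicates as a hint: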

predicate = None
for p in store.predicates():
    # compare the tail of the predicate URI against the name from the question,
    # lowercased with spaces removed: "USPS state code" becomes "uspsstatecode",
    # which matches a URI ending in uspsStateCode
    if p.lower().endswith(predicatename.lower().replace(' ', '')):
        predicate = p

Once we have the predicate and entity, there's just one more step to finding the corresponding value:

value = store.value(entity, predicate, None)
print entityname + "'s " + predicatename + " is " + value

The complete Python source for this program is posted.

Running the program yields:

# python qa.py what is the population of California?

California's population is 33871648

If this were the only question we wanted to ask, we wouldn't have written the program. Of course we can ask it for any state, county, or town that the census reported statistics for (provided we know the exact name the census used for it). But we can also use other predicates.

# python qa.py what is the USPS state code of Mississippi?

Mississippi's USPS state code is MS



# python qa.py what is the land area of New York?

New York's land area is 122283145776 m^2

Surprised? Haven't you ever forgotten a state's postal abbreviation? I hadn't mentioned it, but the RDFized census files that I posted also contain predicates named census:landArea and census:uspsStateCode alongside the population predicate. Maybe we got a little lucky that I chose predicate URIs that happen to line up with how the questions are phrased.

But it did work, after all.

That's the thing about RDF. We were able to write a totally generic question-answering program. It can only answer one form of question, but it isn't specific to any particular subject. Without revising the program, it could answer questions about practically anything, as long as it has the answers in RDF.