XML.com: XML From the Inside Out


Query Census Data with RDF

April 12, 2006

Let's say you're curious to find out the population of the city you live in. Go to Google and ask it "population of philadelphia, pa." Ha! We're too smart for Google. It tries to answer the question but comes up with "Pennsylvania – Population: 12,281,054 ; 6th, 12/00." That's not what we were looking for. Now, Google understood the question. We can tell because it gave us a special answer. But it couldn't come up with the right answer because it can't reliably understand the pages containing answers. We can't blame Google for that, really.

Hold that thought for a while. At the end of this article I'll demonstrate a simple Python program using the Semantic Web that can answer questions about population and more. First, though, we need to find some population statistics.

The U.S. government is a treasure trove of structured information. In my last article I talked about legislative information, but there's much more--gigabytes upon gigabytes more. The 2000 Census compiled tons of population statistics. Let's get some of it into the Semantic Web.

RDFizing the Census

So I'll grab one small 14MB slice of data out of the census records and turn it into RDF. The file is called usgeo.uf1 and it contains some basic population and geographic data for the United States as a whole, the states, and their counties, towns, and "places." You can get this file here. All of the census files are downloadable from FTP and are documented.

This file isn't pretty. It has columns in a fixed-width format, with no indication in the file what the columns mean. If any file highlights the benefits of XML, this one does. Here's the entry for Philadelphia, broken down to fit the width of this page:

uSF1F US05000000  027049212234210121    0  61622377N61609999900
349881748      19544394Philadelphia County
CN  1517550   661958+39998012-07514479306

The census documentation isn't simple, but it is comprehensive. With a short Perl script, it's easy to extract all of the information. Each line gets split on certain columns:

@FIELDS = ("FILEID", "STUSAB", "SUMLEV", "GEOCOMP", ...);    # name of each field, in file order
%FIELDSIZE = (FILEID => 6, STUSAB => 2, SUMLEV => 3, ...);   # width of each field
$start = 0;
foreach $field (@FIELDS) {
    # cut this field's columns out of the line, then move to the next field
    $value = substr($line, $start, $FIELDSIZE{$field});
    $start += $FIELDSIZE{$field};
    $info{$field} = $value;
}
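
For readers who prefer Python, the same column-splitting idea can be sketched roughly as follows. This is not the article's Perl script: the function name parse_record is made up for illustration, only the three field widths quoted above are included, and stripping the padding from each value is an added assumption.

```python
# Field names and widths, in file order. Only the three widths shown in the
# article are listed; the real record layout continues past SUMLEV.
FIELDS = [("FILEID", 6), ("STUSAB", 2), ("SUMLEV", 3)]


def parse_record(line, fields=FIELDS):
    """Split one fixed-width line into a dict of field name -> value."""
    info = {}
    start = 0
    for name, width in fields:
        # Slice out this field's columns; strip() drops the space padding
        # (an assumption, the Perl version keeps the raw substring).
        info[name] = line[start:start + width].strip()
        start += width
    return info


# The first eleven columns of the Philadelphia County entry quoted earlier.
record = parse_record("uSF1F US050")
print(record)
```

Keeping names and widths together in one ordered list avoids the two structures drifting apart as more columns are added.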

After that, %info contains all of the information for that record. Next, the information just gets written out in RDF. Turtle (or Notation 3, a related syntax) is easiest for this. The goal is to produce something like this for each of the 497,515 states, counties, etc. in the census file:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix census: <tag:govshare.info,2005:rdf/census/> .

<some URI representing Philadelphia>
    dc:title "Philadelphia" ;
    census:population "1517550" ;
    census:households "661958" .

(When in doubt, use a validator to see what triples are generated by the syntax.)

dc:title is a very common predicate. It's used in most RSS feeds to indicate the title of the feed, and it sort of makes sense to reuse it here to relate a place to its name. I'll just make up some predicates here for population and number of households.
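
Writing a record out in this shape takes only a few lines. The Python sketch below is illustrative rather than the article's script: the helper names turtle_literal and to_turtle are invented here, and the subject URI is a placeholder. The one detail worth getting right is escaping backslashes and quotes inside string literals, since place names are free text.

```python
PREFIXES = (
    "@prefix dc: <http://purl.org/dc/elements/1.1/> .\n"
    "@prefix census: <tag:govshare.info,2005:rdf/census/> .\n"
)


def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")   # backslashes first, so later escapes survive
            .replace('"', '\\"')
            .replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def to_turtle(uri, title, population, households):
    """Return the Turtle statements describing one census region."""
    return (
        f"<{uri}>\n"
        f"    dc:title {turtle_literal(title)} ;\n"
        f"    census:population {turtle_literal(str(population))} ;\n"
        f"    census:households {turtle_literal(str(households))} .\n"
    )


# Placeholder subject URI, used only for this illustration.
print(PREFIXES)
print(to_turtle("tag:example.org,2006:philadelphia", "Philadelphia", 1517550, 661958))
```

The numbers are emitted as plain string literals to match the sample above; typed literals would be a separate design decision.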

Before we can print the Turtle-formatted RDF, we need to create a URI to identify each of the regions covered by the census.

The first question to ask is whether there is an existing URI in use to represent U.S. states, counties, etc. A search on Swoogle shows one existing URI for Philadelphia, based on an RDF representation of WordNet, a machine-readable dictionary created at Princeton. If WordNet had an entry for every city, we might reuse those URIs. Since it doesn't, we'll just make up new URIs for everything.

Having two URIs for Philadelphia isn't the best situation, but it's not critical. The downside is that the two data sources (RDF-WordNet and the census) won't relate to each other, but on the other hand they never did.

Then, what URI should these things be given? It would take too long to assign each city a URI by hand, so we'll have Perl generate a URI for each entity by combining the URI of the thing containing the entity with a slash and then the name of the entity itself. If the United States gets the URI <tag:govtrack.us,2006:us> (arbitrarily), then Pennsylvania would get the URI <tag:govtrack.us,2006:us/Pennsylvania> and Philadelphia would get <tag:govtrack.us,2006:us/Pennsylvania/Philadelphia>. Since no state has two counties with the same name, for instance, this guarantees that no URI is accidentally used to represent two different things. (The resulting URI should also be guaranteed to be a legitimate URI, and to do that, the names are just normalized by removing spaces and other problematic characters.)
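
The naming scheme is easy to sketch. The Python below is an illustration under assumptions, not the article's Perl: the function name make_uri is made up, and "problematic characters" is taken to mean anything other than ASCII letters and digits, which may be stricter than what the real script does.

```python
import re


def make_uri(parent_uri, name):
    """Build a child URI from the containing region's URI and a place name."""
    # Assumption: keep only ASCII letters and digits, dropping spaces,
    # punctuation, and anything else that could be awkward in a URI.
    safe_name = re.sub(r"[^A-Za-z0-9]", "", name)
    return f"{parent_uri}/{safe_name}"


us = "tag:govtrack.us,2006:us"
pennsylvania = make_uri(us, "Pennsylvania")
philadelphia = make_uri(pennsylvania, "Philadelphia")
print(philadelphia)
```

One caveat with any normalization that deletes characters: two names differing only in the deleted characters would map to the same URI, so a real script may need to check for collisions.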

At first glance, the RDF looks redundant because Philadelphia appears in two places: in the URI and again as the object of dc:title. It might seem that a consumer who wants the name could just take everything after the last slash, and that the dc:title statement could be dropped. That would be bad design. It's important to avoid putting information within the URI because URIs aren't structured in any meaningful way. A consumer might be able to figure out that the strings Pennsylvania and Philadelphia were relevant to the entity, but it wouldn't know why.

I've posted the complete Perl script that loops through the data file and prints out RDF for each line in the file, as well as the input data file and the resulting RDF.
