Query Census Data with RDF
by Joshua Tauberer
Question-Answering
Question-answering, like asking Google for the population of Philadelphia, is where I see the Semantic Web making its most important contribution to the world. Remember that Google's problem is that it can't understand the information on web pages. Clearly, if we want to build a system that can understand knowledge spread throughout the Internet, we all need to use some common framework for representing knowledge, like RDF.
So let's go ahead and write a little question-answering system over the census data we've been using. It should recognize questions like this:
what is the ____ of _____ ?
ex. what is the population of California?
It's actually quite easy to get something crude working. Using a regular expression, the two blanks in the question can be extracted:
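import re

# The second group is non-greedy and the pattern is anchored at the end,
# so a trailing "?" is not captured as part of the entity name.
m = re.search(r'what is the (.*) of (.*?)\??$', question)
if m:
    predicatename = m.group(1)
    entityname = m.group(2)
    # do more processing
else:
    print("I don't understand the question.")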
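As a quick check of the extraction step, here is a minimal sketch that wraps the pattern in a function. The name parse_question is hypothetical. The pattern is also a slight variation on a bare (.*): it ignores case, and the second group is non-greedy and anchored at the end, on the assumption that a trailing question mark should not end up in the entity name.

```python
import re

# Hypothetical helper, not part of the posted program.
QUESTION = re.compile(r'what is the (.*) of (.*?)\s*\??\s*$', re.IGNORECASE)

def parse_question(question):
    """Return (predicate name, entity name), or None if the question doesn't fit."""
    m = QUESTION.search(question)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(parse_question("what is the population of California?"))
print(parse_question("What is the land area of New York"))
print(parse_question("how big is Texas?"))
```

Because the first group is greedy, a place name that itself contains " of " is split at the last " of ", so this crude pattern would likely need more work for such names.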
Then we have to find the RDF entities that match the predicate and entity names given in the question. For the entities, we can use the dc:title predicate:
entity = store.value(None, dc["title"], Literal(entityname))
Finding the predicate is harder, because we don't have any RDF statements that relate a predicate to a name for it. That is, we lack this:
census:population rdfs:label "population" .
That's the kind of statement you would find in an RDF schema. If we had that available, we would use the same technique that we used with dc:title, except with rdfs:label. Since we don't have that, we can fall back to looking at the URIs of the predicates as a hint:
predicate = None
for p in store.predicates():
    # Compare the end of the predicate URI with the name, ignoring case and spaces
    if p.lower().endswith(predicatename.lower().replace(' ', '')):
        predicate = p
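To see the URI-suffix heuristic in isolation, the sketch below applies it to plain strings instead of a live store. The namespace http://example.org/census# is a placeholder rather than the real one, and match_predicate is a hypothetical helper name. Only the three local names come from this article. Unlike the loop above, it returns the first match rather than the last.

```python
# Placeholder namespace, assumed for illustration only.
CENSUS = "http://example.org/census#"
PREDICATES = [CENSUS + "population", CENSUS + "landArea", CENSUS + "uspsStateCode"]

def match_predicate(predicatename, predicates):
    """Return the first URI whose tail matches the name, ignoring case and spaces."""
    wanted = predicatename.lower().replace(' ', '')
    for p in predicates:
        if p.lower().endswith(wanted):
            return p
    return None

print(match_predicate("land area", PREDICATES))
print(match_predicate("USPS state code", PREDICATES))
print(match_predicate("capital", PREDICATES))
```

A suffix test is deliberately loose: asking for just "area" would also match the landArea predicate here, which is one reason an rdfs:label lookup would be the more reliable route if the labels existed.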
Once we have the predicate and entity, there's just one more step to finding the corresponding value:
value = store.value(entity, predicate, None)
# Format instead of concatenating, so a missing value prints as None rather than raising
print("%s's %s is %s" % (entityname, predicatename, value))
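The two lookups can be tried without any RDF library by standing in a list of tuples for the store. This is only a sketch: the value function below is a hypothetical stand-in that imitates the "leave one slot as None" style of lookup, not the library's implementation. The census namespace and the California URI are placeholders. The population figure is the one reported in this article.

```python
# Stand-in for the RDF store: a list of (subject, predicate, object) tuples.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
CENSUS = "http://example.org/census#"        # placeholder namespace
CA = "http://example.org/geo/california"     # placeholder entity URI

TRIPLES = [
    (CA, DC_TITLE, "California"),
    (CA, CENSUS + "population", "33871648"),
]

def value(triples, s, p, o):
    """Return the slot left as None from the first triple matching the other two.

    Assumes exactly one of s, p, o is None.
    """
    for triple in triples:
        if all(q is None or q == t for q, t in zip((s, p, o), triple)):
            return triple[(s, p, o).index(None)]
    return None

# Step 1: name -> entity, via dc:title. Step 2: (entity, predicate) -> value.
entity = value(TRIPLES, None, DC_TITLE, "California")
answer = value(TRIPLES, entity, CENSUS + "population", None)
print("California's population is %s" % answer)
```

The design point is that neither step knows anything about censuses: the same pair of lookups works for any subject that has a title and any predicate that can be matched by name.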
The complete Python source for this program is posted.
Running the program yields:
# python qa.py what is the population of California?
California's population is 33871648
If this were the only question we wanted to ask, we wouldn't have written the program. Of course we can ask it for any state, county, or town that the census reported statistics for (provided we know the exact name the census used for it). But we can also use other predicates.
# python qa.py what is the USPS state code of Mississippi?
Mississippi's USPS state code is MS
# python qa.py what is the land area of New York?
New York's land area is 122283145776 m^2
Surprise, right? Haven't you ever forgotten a state abbreviation for postal mail? I hadn't mentioned it, but the RDFized census files that I posted contain predicates named census:landArea and census:uspsStateCode along with the population predicate. Maybe we got a little lucky that I chose good URIs for the predicates.
But it did work, after all.
That's the thing about RDF. We were able to write a totally generic question-answering program. It might only be able to answer a certain form of question, but it's not specific to any particular subject. Without revising the program, it could answer questions about really anything, as long as it has the answers in RDF.