XML.com: XML From the Inside Out


GovTrack.us, Public Data, and the Semantic Web
by Joshua Tauberer

XML has been good for the job, but putting lots of XML files together doesn't immediately get you anything special; code has to be written. That is how GovTrack got its way to browse bills by the subject terms assigned to them.

But I got an email the other day asking for legislation that falls into two categories, and I needed a way to write a simple query over the data, looking for bills that matched both subject terms. The simplest thing to do might have been to write a program that evaluates an XPath expression over each bill file:

count(bill/subjects/term[@name = "Medical care"]) > 0
and count(bill/subjects/term[@name = "Illegal aliens"]) > 0
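A program of that kind could be sketched in Python roughly as follows. This is only an illustration under assumptions: the two bill documents are made up to match the shape the expression implies (a bill element containing subjects/term elements with a name attribute), the function names are hypothetical, and the code uses the limited path syntax of Python's ElementTree rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Two made-up bill documents in the shape the XPath expression implies.
# Real GovTrack bill files may contain much more and may differ in detail.
BILLS = {
    "bill-a.xml": """<bill><subjects>
        <term name="Medical care"/>
        <term name="Illegal aliens"/>
    </subjects></bill>""",
    "bill-b.xml": """<bill><subjects>
        <term name="Medical care"/>
    </subjects></bill>""",
}


def has_subjects(bill_root, wanted):
    """Return True if the bill carries every subject term in `wanted`."""
    names = {term.get("name") for term in bill_root.findall("subjects/term")}
    return set(wanted) <= names


def matching_bills(bills, wanted):
    """Names of the bill documents that carry all of the wanted terms."""
    return [
        name
        for name, xml_text in bills.items()
        if has_subjects(ET.fromstring(xml_text), wanted)
    ]


print(matching_bills(BILLS, ["Medical care", "Illegal aliens"]))
```

Each document is tested in isolation, which is exactly the property that makes this approach easy to write and hard to extend.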

Sure, that would have gotten the job done. But if I stuck with XPath for all of my querying needs, I'd be very limited in the types of queries I could run over the data. An XPath expression can really involve only one document at a time: the only question it can answer is whether a given document matches the expression, and the answer depends on that document alone. (True, in XSLT you can use the document() function to cross documents, but only if you can get the name of the file.)

If tomorrow someone asks me for a list of bills that Bill Frist and John Kerry voted differently on, I'll be stuck. Each roll call vote file looks something like this:

<roll where="senate" year="2005" roll="00230">
    <question>On the Motion (Motion To Table)</question>
    <voter id="300001" vote="+"/>
    <voter id="300002" vote="+"/>
    <voter id="300003" vote="+"/>
    <voter id="400546" vote="-"/>
    <!-- ...remaining voter elements omitted... -->
</roll>

The names of the senators aren't in the file, so the first step would be to look up the IDs of Frist and Kerry. Then, iterate through the bill XML files, open up the related vote file for each, and finally use an XPath (or even XQuery) expression to test the vote file to see if the votes differed.
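To make the amount of glue concrete, the steps just described might be sketched like this in Python. Everything beyond the roll file format shown above is an assumption made for illustration: the name-to-ID table, the bill labels, the second roll, and the function names are hypothetical, and the IDs are simply reused from the sample rather than being the senators' real IDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical name-to-ID table. In practice this lookup would have to come
# from a separate people file. The IDs are reused from the sample roll above
# and are NOT the real IDs of these senators.
PEOPLE = {"Bill Frist": "300001", "John Kerry": "400546"}

# Stand-in for "the related vote file for each bill", keyed by made-up labels.
ROLLS_BY_BILL = {
    "bill-a": """<roll where="senate" year="2005" roll="00230">
        <question>On the Motion (Motion To Table)</question>
        <voter id="300001" vote="+"/>
        <voter id="400546" vote="-"/>
    </roll>""",
    "bill-b": """<roll where="senate" year="2005" roll="00231">
        <question>On Passage of the Bill</question>
        <voter id="300001" vote="+"/>
        <voter id="400546" vote="+"/>
    </roll>""",
}


def vote_of(roll_root, voter_id):
    """Return the vote recorded for voter_id, or None if there is none."""
    for voter in roll_root.findall("voter"):
        if voter.get("id") == voter_id:
            return voter.get("vote")
    return None


def bills_with_differing_votes(rolls_by_bill, people, name_a, name_b):
    # Step 1: translate names into the IDs used inside the vote files.
    id_a, id_b = people[name_a], people[name_b]
    differing = []
    # Step 2: open the vote file for each bill and compare the two votes.
    for bill, roll_xml in rolls_by_bill.items():
        roll = ET.fromstring(roll_xml)
        vote_a, vote_b = vote_of(roll, id_a), vote_of(roll, id_b)
        if vote_a is not None and vote_b is not None and vote_a != vote_b:
            differing.append(bill)
    return differing


print(bills_with_differing_votes(ROLLS_BY_BILL, PEOPLE, "Bill Frist", "John Kerry"))
```

None of this is difficult, but all of it is specific to one question and to the layout of these particular files.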

Or what if someone wants to know whether the votes on a bill were correlated with the representative's age, amount of campaign contributions, or geographic location of his or her district? Better yet, what about the question of whether the board of directors of Disney made contributions to representatives introducing legislation about copyrights? GovTrack has all of this information (it's all public information downloadable from the Census, Federal Election Commission, and Securities and Exchange Commission). When we can ask these types of questions easily, things start to get much more interesting.

Here's an interesting question I can answer today with a simple query: what's the population of each senator's state? This is the query:

PREFIX dc:  <http://purl.org/dc/elements/1.1/>
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX pol: <tag:govshare.info,2005:rdf/politico/>
PREFIX census: <tag:govshare.info,2005:rdf/census/>

SELECT ?name ?statename ?population WHERE {
  ?person foaf:name ?name .
  ?person pol:hasRole
      [ pol:forOffice [ pol:represents ?state ] ] .
  ?state dc:title ?statename .
  ?state census:population ?population . }

This SPARQL query returns a table of senators' names along with the corresponding state and its population. SPARQL is a new query language for information in RDF. Not to make a shameless plug, but I really recommend reading my own introduction to RDF. It goes beyond the old notion of RDF as an XML metadata format; RDF is more commonly thought of today as a general method for knowledge interchange. And for more about SPARQL itself, I recommend Leigh Dodds' SPARQL tutorial on XML.com, Introducing SPARQL: Querying the Semantic Web.

You can play around with queries over GovTrack's data here, but I don't want to talk about SPARQL in this article. I just wanted to show that the types of questions we can ask can easily grow in complexity and "interestingness" using RDF. No XPath or XQuery query is going to be nearly so concise for those questions.

Of course, it's possible to do this with XML rather than RDF, and the difference is just in where the effort must be applied to get the data sources to link together. In XML, the burden is on the person with the query to figure out how the elements and attributes in one XML file relate to the elements and attributes in another. Glue code has to be programmed to mesh the data. With RDF, the burden is on the people with the data to ensure that their identifiers for things overlap with other data sources. The difficulty in RDF is more of a design decision, and design decisions are tough too. But one of RDF's advantages in meshing disparate data is that the hard work is done once by the people who know the data best, rather than repeated by each programmer who has a new query to make.
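The point about overlapping identifiers can be illustrated with a toy model in Python. This is a sketch under loud assumptions, not how an RDF store or SPARQL engine is implemented: statements are plain (subject, predicate, object) tuples, the property names are borrowed from the query above as plain strings, and the identifiers, names, and population figure are made-up placeholders.

```python
# Triples from two independent, made-up sources. Both happen to use the same
# identifier for the state ("state:XX"), so merging them is just a set union.
POLITICAL = {
    ("person:1", "foaf:name", "Senator A"),
    ("person:1", "pol:hasRole", "_:role1"),
    ("_:role1", "pol:forOffice", "_:office1"),
    ("_:office1", "pol:represents", "state:XX"),
}
CENSUS = {
    ("state:XX", "dc:title", "State X"),
    ("state:XX", "census:population", 1000),  # placeholder figure
}


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def state_populations(graph):
    """Walk the same path as the SPARQL query above, by hand."""
    rows = []
    for person, predicate, name in graph:
        if predicate != "foaf:name":
            continue
        for role in objects(graph, person, "pol:hasRole"):
            for office in objects(graph, role, "pol:forOffice"):
                for state in objects(graph, office, "pol:represents"):
                    for title in objects(graph, state, "dc:title"):
                        for pop in objects(graph, state, "census:population"):
                            rows.append((name, title, pop))
    return sorted(rows)


merged = POLITICAL | CENSUS  # no glue code: the shared identifier does the work
print(state_populations(merged))
```

Run against either source alone, the function finds nothing; the answer only appears once the two sets of statements are combined, and combining them takes no knowledge of how either source was laid out.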
