GovTrack.us, Public Data, and the Semantic Web
by Joshua Tauberer
So in the rest of this article I'll go over some of the design of the RDF version of GovTrack's data (which you can also download to play around with).
Here's some biographical data for Senator Schumer from New York:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pol: <tag:govshare.info,2005:rdf/politico/> .
@prefix usgov: <tag:govshare.info,2005:rdf/usgovt/> .
@prefix people: <tag:govshare.info,2005:data/us/congress/people/> .
people:S000148
    rdf:type pol:Politician ;
    foaf:name "Charles Schumer" ;
    foaf:gender "male" ;
    usgov:party "Democrat" .
This is RDF data in a format called Notation 3, which is a nice alternative to the XML format of RDF. Since RDF is really an abstract way to represent information, rather than a particular data format, we're always free to choose the serialization syntax that's easiest to read (or scribble or exchange) for the task and data at hand. (When in doubt about the syntax, run it through a validator to see what the underlying triples are.) This data is intended to mean that the entity identified by the URI tag:govshare.info,2005:data/us/congress/people/S000148 (formed by simply concatenating the prefix URI with the local name) is a politician, has the name given, is a Democrat, etc.
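As a rough illustration of that concatenation step, here is a minimal Python sketch. The prefix table is copied from the example above; the `expand` helper is a hypothetical name used only for illustration, and a real N3 parser does considerably more than this.

```python
# Prefix table copied from the N3 example above.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pol": "tag:govshare.info,2005:rdf/politico/",
    "usgov": "tag:govshare.info,2005:rdf/usgovt/",
    "people": "tag:govshare.info,2005:data/us/congress/people/",
}


def expand(qname: str) -> str:
    """Expand 'prefix:local' to a full URI by plain string concatenation."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


print(expand("people:S000148"))
print(expand("rdf:type"))
```

Because the expansion is nothing more than string concatenation, the capitalization of the local name carries over into the full URI unchanged.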
In addition to the data above, there is RDF data about Schumer's role in Congress, including the state he represents. This is where some real modeling choices came in. There are a number of sensible ways to relate a politician to the region he or she represents. Here's one:
people:S000148 pol:represents "New York" .
This is very to-the-point. Schumer represents New York. It's accurate enough, but not particularly precise. The literal expression "New York" isn't very informative. New York State or New York City? We could get around this problem by stating in the pol: vocabulary that pol:represents only refers to states and not cities, and that would be a fine solution if that restriction were acceptable. But we can make a small change to make it better:
@prefix states: <tag:govshare.info,2005:data/us/> .
people:S000148 pol:represents states:ny .
Now it's very precise. Except, when a computer reads in the URI tag:govshare.info,2005:data/us/ny it has no idea what that means. So we have to list somewhere else:
@prefix dcterms: <http://purl.org/dc/terms/> .
states:ny rdf:type <tag:govshare.info,2005:rdf/usgovt/State> .
states:ny dcterms:isPartOf <tag:govshare.info,2005:data/us> .  # i.e., the United States
The computer may have no idea what tag:govshare.info,2005:rdf/usgovt/State means either, but at least it knows it's the same type of thing as the other states. Or the application writer can assign a special meaning to the URI tag:govshare.info,2005:rdf/usgovt/State.
Using a URI rather than a literal value also lets you, or others, contribute information about the entity. If I'm publishing information on Congress and someone else transforms some census data into this:
@prefix census: <tag:govshare.info,2005:rdf/census/> .
states:ny census:population "18976457" .
then one can immediately start writing queries that bridge the two data sets.
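To make the bridging concrete, here is a small Python sketch that treats each data set as a list of (subject, predicate, object) tuples and joins them on the shared URI. It is only an illustration under that assumption: the helper names `objects` and `population_represented` are made up for this example, and a real application would use an RDF store and query language rather than list comprehensions.

```python
# Each data set is a list of (subject, predicate, object) triples,
# written with the prefixed names used in the text.
congress_data = [
    ("people:S000148", "pol:represents", "states:ny"),
]
census_data = [
    ("states:ny", "census:population", "18976457"),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def population_represented(person):
    """Join the two data sets on the shared state URI."""
    merged = congress_data + census_data
    return {
        state: int(population)
        for state in objects(merged, person, "pol:represents")
        for population in objects(merged, state, "census:population")
    }


result = population_represented("people:S000148")
print(result)
```

The join works only because both publishers used the same URI for New York; had either used the literal "New York", there would be nothing reliable to match on.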
This is a fine way of representing the information. Beyond this point, the modeling choices become a real trade-off between simplicity and informativeness. There are two shortcomings in the representation of the pol:represents relation above. The first is that it misses the generalization that anyone who is a senator from New York represents New York. Or, rather, it's not an inherent property of Schumer that he represents New York; it holds in virtue of another property of his, which is holding the office of senator. So we should revise the information to this:
@prefix senate: <tag:govshare.info,2005:data/us/congress/senate/> .
people:S000148 pol:holdsOffice senate:ny .
senate:ny pol:represents states:ny .
That's more informative, at the cost of being more complex to create and query.
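The extra query cost can be seen in a short Python sketch, again assuming the triples are held as plain tuples. Answering "what does this person represent?" now takes two hops instead of one. The function names are hypothetical and exist only for this illustration.

```python
triples = [
    ("people:S000148", "pol:holdsOffice", "senate:ny"),
    ("senate:ny", "pol:represents", "states:ny"),
]


def objects(subject, predicate):
    """Return every object o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def represented_by(person):
    """Follow pol:holdsOffice first, then pol:represents."""
    return [
        region
        for office in objects(person, "pol:holdsOffice")
        for region in objects(office, "pol:represents")
    ]


def holders_representing(region):
    """Go the other way: find offices for a region, then their holders."""
    offices = [s for s, p, o in triples if p == "pol:represents" and o == region]
    return [s for s, p, o in triples if p == "pol:holdsOffice" and o in offices]


print(represented_by("people:S000148"))
print(holders_representing("states:ny"))
```

In exchange for the second hop, the fact that the office represents New York is stated once and applies to whoever holds that office.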
The second shortcoming is a pervasive problem in any representation of the real world, and it's that the world isn't static. There are two ways to look at this. First, it's not an inherent property of Schumer that he holds the office of senator. Compare that to the assertion above that New York is a part of the United States, which we could reasonably say is a time-invariant truth. The second perspective is that this information may be correct now, but it won't be when Schumer leaves office. So when we write RDF, are we asserting time-invariant information or information that's claimed to be true only at the time of writing?
The answer is, we don't know. Some predicates are time-sensitive, some are time-invariant. Lots of information in RDF out there on the internet is time-sensitive with no indication of the time that it was written, or how long it might be correct for. This is a problem we'll have to deal with in the future.
So while it would be appropriate to leave the design as time-sensitive, GovTrack goes a step further and models the time that someone holds an office:
@prefix time: <http://pervasive.semanticweb.org/ont/2004/06/time#> .
people:S000148 pol:hasRole
    [
        rdf:type pol:Term ;
        time:from [ time:at "2005-01-01" ] ;
        time:to [ time:at "2010-12-31" ] ;
        pol:forOffice senate:ny
    ] .
The practical benefit of this is that GovTrack can include historical data this way without it seeming like George Washington is still in office.
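Here is a minimal Python sketch of how an application might use those dates. It assumes each pol:Term has already been flattened into a (person, office, from, to) tuple and that both end dates are inclusive; neither assumption comes from GovTrack itself, and `holds_office_on` is a hypothetical helper.

```python
from datetime import date

# Each entry mirrors one pol:Term node: (person, office, from, to).
roles = [
    ("people:S000148", "senate:ny", date(2005, 1, 1), date(2010, 12, 31)),
]


def holds_office_on(person, office, day):
    """True if some recorded term for this person and office covers the day.

    Both endpoints are treated as inclusive (an assumption of this sketch).
    """
    return any(
        p == person and o == office and start <= day <= end
        for p, o, start, end in roles
    )


print(holds_office_on("people:S000148", "senate:ny", date(2006, 2, 1)))
print(holds_office_on("people:S000148", "senate:ny", date(2011, 1, 1)))
```

A historical term added to the same list would simply fail the date check for any day outside its range, which is how old records can coexist with current ones.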
In the Semantic Web, it's easy to get caught up in the theory. Modeling issues are fun to think about (at least for me), but it's good to have a practical application too, at least from time to time. Stay tuned for a future article where I'll get into the nuts and bolts of bringing government data onto the Semantic Web.
As a final note: as an American, my knowledge of my own government is fair and my knowledge of the outside world is, alas, minimal. I encourage you to post comments below about the Semantic Web and government in other countries.
- Some Congressional data is already in XML
2006-02-28 09:49:24 joe.carmel
GovTrack is a great achievement and an excellent example of private efforts to provide a value-added resource based on public information.
I'd like to specifically comment on your statement:
"GovTrack also fetches voting records and other documents and puts them into XML."
Congress has done quite a bit with XML (see http://xml.house.gov). The House has been posting voting records and legislation in XML for some time: voting records are available in XML back to 1990 (e.g., http://clerk.house.gov/evs/1990/roll010.xml) and legislation since the beginning of 2004 (check out http://thomas.loc.gov/home/gpoxmlc108/ and http://thomas.loc.gov/home/gpoxmlc109/).
These files use client-side XSL for rendering. The raw XML can be viewed by choosing View Source in the browser. The XML files have also been integrated into Thomas (http://thomas.loc.gov) where the public can search and retrieve Federal legislation. When a bill is displayed in Thomas, one of the options (at the top right) allows users to retrieve the corresponding XML file if the bill was created in XML. Congress has also added some links to the XML files for legislation, specifically for Public Laws going back to the 100th Congress (1987), US Code citations, and, when bills contain tables of content, the contents contain links to the appropriate locations within the file.
Congress doesn't have XML files for all legislation yet. Bills that pass both chambers are still missing and the Senate hasn't posted any of their legislation in XML yet, but they have been making good and steady progress.
The ids for Representatives you mentioned are already the same in the XML voting records and in the XML version of legislation:
VOTES
<recorded-vote>
<legislator name-id="B001244" sort-field="Bonner" unaccented-name="Bonner" party="R" state="AL" role="legislator">Bonner</legislator>
<vote>Aye</vote>
</recorded-vote>
LEGISLATION
<action-desc>
<sponsor name-id="I000056">Mr. Issa</sponsor> (for himself, <cosponsor name-id="C000059">Mr. Calvert</cosponsor>, and <cosponsor name-id="B001228">Mrs. Bono</cosponsor>) introduced the following bill; which was referred to the <committee-name committee-id="HII00">Committee on Resources</committee-name>
</action-desc>
These ids are also the same as the ids at http://bioguide.congress.gov which provides biographical information about all Senators and Representatives since the first Congress -- so there's definitely an id in common among these files. It certainly would be useful if Congress used the XML-XSL approach for the biographies that they are using for voting records and legislation. I'd bet having access to the biographical records in XML would further help tie the disparate information together.
I think Federal information is not in a single place because agencies and departments are responsible for various pieces of the information. Fortunately, sites like GovTrack can be useful to the public by pulling the official information sources together in new and creative ways. Thanks again for all your efforts.
- Some Congressional data is already in XML
2006-03-01 05:26:22 JoshuaTauberer
That's very true, and GovTrack actually uses the House's roll call votes in XML to get that information. The RDF URIs for people are also derived from the IDs used by Bioguide and in those roll call votes. The progress that you mentioned is good, but not having the same information and structure from the Senate makes it much less useful than it would otherwise be. If I have to screen-scrape *some* of the votes, or some of the text of legislation, I might as well be screen-scraping all of them, really. But, yeah, some XML is better than none.
- Josh Tauberer
- Swedish law xmlized by student, too
2006-02-10 01:43:08 Pär Lannerö
A law student who previously worked as a programmer has made something similar in Sweden. His site, http://lagen.nu/, is based on XML files generated from lots of raw text sources by a Python script. Besides the clean and efficient web interface, the XML files can be downloaded for anybody to process further. No RDF content, so far.
- Swedish law xmlized by student, too
2006-03-01 20:28:55 somegeek
hi
I was wondering if it is possible to use this kind of data (metadata) to build a search engine that is able to take in queries on a certain Senator or a certain bill and return that information to the user.
I was thinking of using the popular search engine Nutch and customizing it to do this kind of search.
Do you think this is possible?
thanks
ilango
- Making a search engine
2006-03-12 05:00:00 JoshuaTauberer
Yes, and some people have already tried to make search engines over semantic web data. Search on TAP (http://sp02.stanford.edu/) and Swoogle (http://swoogle.umbc.edu/) come to mind.
The limiting factor here is that there's simply very little data in RDF out there to use. Google is so great because there are 20 billion web pages indexed, each containing hundreds or thousands of "bits" of information (in this case, words). That's at least in the trillions of words. If I had to guess, there are probably fewer than 200 million RDF triples out there on the web (of real-life data, and not counting info on proteins from UniProt), the majority of which come from just a handful of sources.
But also consider that the semantic web makes new types of question-answering applications possible. Search engines can't answer 'questions' with any accuracy unless the question is "who is using this word on their page?". The SW provides the structure for accurately answering a new, larger class of questions. We're going to have to think up new and better user interfaces for getting answers out of the semantic web.
People are working on this in various places in various ways. The idea of question-answering (in general) is a big topic in computational linguistics.
