Screenscraping the Senate

September 1, 2004

Note: In this inaugural article of Paul Ford's new column, Hacking Congress, he introduces his plan to create an RDF description of the U.S. federal government. He starts by collecting data on U.S. Senators and converting it to RDF. Future columns will focus on the House of Representatives and the Executive branch. — Editor

The United States government and the Semantic Web are a perfect match: imagine all of those senators and representatives, each query-able by age, party affiliation, bills proposed, committee membership, and voting record. For the last few years, I've wanted to collect as much data on the U.S. government as I could, convert it to RDF, and build a site and a web service that make it possible to explore that data. This will be my goal over the next year, and I'll document my progress here on XML.com.

I am aware that I am reinventing the wheel with this project. Several other sites attempt to map the government, most notably the Open Government Information Awareness project, and Wikipedia has solid, cross-linked information on the current U.S. government. But I still think it's a worthy undertaking, because I'm curious to see whether the promise of the Semantic Web holds true.

Does creating a Semantic Web of data make it easier to analyze and explore that data in new ways? In addition to testing the Semantic Web concept, if all goes well, I'll have a nicely organized map of the U.S. government, structured using publicly available ontologies, available in a single, reliable format (RDF), which anyone can incorporate into their own Semantic Web projects. It seems worth trying.

There's also another reason: after years of reading and writing about the Semantic Web, I still can't tell you how to build a complete Semantic Web application from scratch. At first that was because the Semantic Web was only a vague set of half-finished specifications. But now, with publicly available triple stores like Redland and Kowari, and well-established specifications for ontology development and the like, it seems like a good time to start thinking in triples. Hopefully I can share my experience with other curious folk, and they can lower their own Semantic Web learning curve by following my progress and avoiding my mistakes.

In this inaugural installment, I'll take two kinds of publicly available data -- HTML from the Senate's web site, and a CSV list of senators -- and use those to generate data in RDF.

Screenscraping the Senate

In a perfect world, web sites would publish RDF versions of their content (and health care would be affordable). In an OK world, web sites would use XHTML in a consistent manner. In this world, the United States Senate creates some of the homeliest HTML I've ever seen, and its list of senators not only doesn't validate, but violates most rules of good HTML coding, to the point of leaving the ending ">" off of many of its tags.

A screen shot of the United States Senate site, taken in Firefox

Figure 1. The Senate's web site, proof that beauty is only skin deep.

Luckily, there are fine tools for turning bad HTML into something parseable. One of the best known is HTML Tidy, but, as I'm going to be doing my screen-scraping in XSLT, I'll use the HTML parser built into libxml/libxslt. This parser is quite accepting of error, even at the level of error seen on the Senate's web site. My goal is to have an XSLT script (called SenateToRDF.xsl) that will fetch a page from the Senate's site, parse it, and return a file called senators.rdf.

When libxslt's HTML parser slurps the Senate's HTML from the Web, it turns that HTML into an XPath-addressable document. So now I have a straightforward task: I need to figure out the structure of the list of senators on the web page, and then write an XSLT script that can slurp in one senator at a time, and produce appropriate RDF for each of them.

It turns out that the list of senators is inside a <table>, and the senators are separated by a horizontal line; in fact, an <img> of a horizontal line, to be entirely accurate. A typical senator's HTML, when rendered, looks like this:

A screenshot of a snippet of an HTML table, showing a single senator.

Figure 2. A single senator's rendered HTML.

And in its raw form, looks like this:




<TR>
  <TD align="left">
    <span class="contenttext">
      <a href="http://bennelson.senate.gov/">
        Nelson, Ben
      </a>
      - (D - NE)
    </span>
  </TD>
  <td align="right">
    <span class="contenttext">Class I</span>
  </td>
</TR>
<TR>
  <td colspan="2">
    <span class="contenttext">
      720 HART SENATE OFFICE BUILDING
      WASHINGTON DC 20510</span>
    </td>
  </TR>
  <TR>
    <td colspan="2" align="left">
      <span class="contenttext">(202) 224-6551</span>
    </td>
  </TR>
  <tr>
    <td colspan="2" align="left">
      <span class="contenttext">Web Form:  </span>
      <span class="contenttext">
<a href = 
"javascript:openwindow('http://bennelson.senate.gov/email.html');">
bennelson.senate.gov/email.html</a>
</span>
    </td>
  </tr>
  <tr>
    <td colspan="2">
      <img width="100%"
      src="/resources/graphic/horiz_content_break.gif"
      height="24">
    </td>
  </tr>

Figure 3. Typical HTML for a senator.

It's not pretty, but it's what we've got. First, I set up my XSLT file:


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://www.hackingcongress.org/ns/Politics#"
  version="1.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <!-- Create a wrapper rdf:RDF element -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:pol="http://www.hackingcongress.org/ns/Politics#">
      <!-- Apply templates against every "tr" element in the document -->
      <xsl:apply-templates select="//tr"/>
    </rdf:RDF>
  </xsl:template>

  <!-- More templates to come -->

</xsl:stylesheet>

Figure 4. The beginning stages of an XSLT file.

This is not the place to explain how XSLT works (Bob DuCharme's Transforming XML column is the place); I'll just say that the XSLT in Figure 4 will apply a template to every <tr> in the document. Next, I need to write that template. If it finds that we're in the vicinity of a senator, it should spit out RDF based on what it finds; otherwise, it should just ignore that <tr> entirely. The XPath statement in the following template gets us in the neighborhood.

  <xsl:template match="tr">
    <xsl:if test=
      "following-sibling::tr[1]/td/img[
       @src='/resources/graphic/horiz_content_break.gif']">
      <!-- We've got a senator! -->
    </xsl:if>
  </xsl:template>

If you're not used to XPath, that will look like nonsense, but here's what's happening: I'm looking for content that appears above a horizontal line. More specifically, I'm trying to match any <tr> that is immediately followed by another <tr> containing a <td> with an <img> inside of it, where that <img>'s src attribute is equal to "/resources/graphic/horiz_content_break.gif". Phew!
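For readers who find the XPath hard to follow, here is a rough equivalent of the same test in Python rather than XSLT. It is only a sketch: the sample table is a tidied, well-formed stand-in I made up for illustration (the real page would need a tolerant HTML parser first), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SEPARATOR_SRC = "/resources/graphic/horiz_content_break.gif"

# A tidied, well-formed stand-in for the Senate's table (an assumption made
# for illustration; the live page is not well-formed XML).
SAMPLE = """
<table>
  <tr><td>Nelson, Ben - (D - NE)</td></tr>
  <tr><td>(202) 224-6551</td></tr>
  <tr><td><img src="/resources/graphic/horiz_content_break.gif"/></td></tr>
  <tr><td>Some other row</td></tr>
</table>
""".strip()


def has_separator(row):
    """True if the row holds a td/img whose src is the horizontal-line graphic."""
    return any(img.get("src") == SEPARATOR_SRC for img in row.findall("td/img"))


def rows_above_separator(table):
    """Return each <tr> whose immediately following sibling <tr> is a separator."""
    rows = table.findall("tr")
    # Pair every row with the row that follows it, like following-sibling::tr[1].
    return [row for row, next_row in zip(rows, rows[1:]) if has_separator(next_row)]


table = ET.fromstring(SAMPLE)
for row in rows_above_separator(table):
    print("".join(row.itertext()).strip())
```

Only the row directly above the separator matches, which is why the rest of a senator's data has to be read from the rows that precede it.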

Since the HTML on the page is fairly consistent in how it lays out senators, we can be confident that, once we find the <tr> that contains the horizontal line, the preceding <tr> elements will contain reliably formatted data about that senator. And we can use that data to create the RDF.
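To give a feel for the kind of extraction involved, here is a minimal sketch in Python (not the XSLT of SenateToRDF.xsl) that groups rows by separator and pulls fields out of the first few rows. The sample is a condensed copy of Figure 3 without the web form row; the class name, the regular expression, and the party-code expansion are my own assumptions for illustration.

```python
import re
from html.parser import HTMLParser

SEPARATOR_SRC = "/resources/graphic/horiz_content_break.gif"
# Assumed expansion of party codes; the page itself only shows the letter.
PARTY_NAMES = {"D": "Democrat", "R": "Republican", "I": "Independent"}

# Matches text such as "Nelson, Ben - (D - NE) Class I".
HEADER = re.compile(
    r"^(?P<name>.+?)\s*-\s*\((?P<party>[A-Z]+)\s*-\s*(?P<state>[A-Z]{2})\)"
    r"\s*Class\s+(?P<cls>[IVX]+)$"
)

SAMPLE = """
<TR><TD align="left"><span class="contenttext">
<a href="http://bennelson.senate.gov/">Nelson, Ben</a> - (D - NE)</span></TD>
<td align="right"><span class="contenttext">Class I</span></td></TR>
<TR><td colspan="2"><span class="contenttext">720 HART SENATE OFFICE BUILDING
WASHINGTON DC 20510</span></td></TR>
<TR><td colspan="2" align="left"><span class="contenttext">(202) 224-6551</span></td></TR>
<tr><td colspan="2"><img width="100%"
src="/resources/graphic/horiz_content_break.gif" height="24"></td></tr>
"""


class SenatorRows(HTMLParser):
    """Collect the text of each <tr>; a separator image closes one senator."""

    def __init__(self):
        super().__init__()
        self.senators = []  # finished senators, each a list of row strings
        self._rows = []     # rows seen since the last separator
        self._text = []     # text fragments of the row being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TR> and <tr> look alike here.
        if tag == "tr":
            self._text = []
        elif tag == "img" and dict(attrs).get("src") == SEPARATOR_SRC:
            if self._rows:
                self.senators.append(self._rows)
            self._rows = []

    def handle_data(self, data):
        self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "tr":
            text = " ".join(" ".join(self._text).split())  # normalize whitespace
            if text:
                self._rows.append(text)


def describe(rows):
    """Turn one senator's row strings into a dictionary of fields."""
    match = HEADER.match(rows[0])
    return {
        "FullName": match["name"],
        "Party": PARTY_NAMES.get(match["party"], match["party"]),
        "State": match["state"],
        "SenateClass": match["cls"],
        "Address": rows[1],
        "Phone": rows[2],
    }


parser = SenatorRows()
parser.feed(SAMPLE)
parser.close()
record = describe(parser.senators[0])
print(record)
```

The approach leans on the same assumption the XSLT does: every senator's rows arrive in the same order, so position alone tells us which row is the address and which is the phone number.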

I'll leave out the ugly details of the rest of the script, but if you'd like, you can download the commented SenateToRDF.xsl script yourself and take a look. When all is said and done, all I need to do is run this command to get an RDF version of the Senate's site (I use xsltproc, the XSLT processor that comes with libxslt):

xsltproc --output senators.rdf --html SenateToRDF.xsl \
  http://www.senate.gov/general/contact_information/senators_cfm.cfm

Figure 5. The XSLT processor command to generate RDF from the Senate's web site.

Running this command spits out a large number of error messages regarding the Senate's HTML (libxslt is willing to work with bad HTML, but that doesn't mean it won't complain), and produces an RDF file. Our senator now looks much better:

  <USSenator rdf:about="http://billnelson.senate.gov/">
    <FullName>Nelson, Bill</FullName>
    <URI>http://billnelson.senate.gov/</URI>
    <Party>Democrat</Party>
    <State>FL</State>
    <Address>716 HART SENATE OFFICE BUILDING WASHINGTON DC 20510</Address>
    <Phone>(202) 224-5274</Phone>
    <SenateClass>I</SenateClass>
    <ContactURI>http://billnelson.senate.gov/contact/index.cfm#email</ContactURI>
  </USSenator>

Figure 6. An RDF representation of a senator.
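For comparison, a record like the one in Figure 6 can also be produced outside XSLT. The following Python sketch builds the same shape of RDF/XML with the standard library; the function name is hypothetical, and the output uses a pol: prefix where Figure 6 uses a default namespace, which names the same elements.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
POL = "http://www.hackingcongress.org/ns/Politics#"

# Ask ElementTree to use readable prefixes instead of ns0, ns1, ...
ET.register_namespace("rdf", RDF)
ET.register_namespace("pol", POL)


def senator_to_rdf(uri, fields):
    """Wrap one senator's fields in rdf:RDF and return the XML as a string."""
    root = ET.Element(f"{{{RDF}}}RDF")
    senator = ET.SubElement(root, f"{{{POL}}}USSenator", {f"{{{RDF}}}about": uri})
    for name, value in fields.items():
        # Text is escaped on serialization, so "&" and "<" are safe here.
        ET.SubElement(senator, f"{{{POL}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


rdf_xml = senator_to_rdf(
    "http://billnelson.senate.gov/",
    {"FullName": "Nelson, Bill", "Party": "Democrat", "State": "FL"},
)
print(rdf_xml)
```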

Of course screen-scraping is itself a dubious process. When the Senate decides to change its page design, moves the page, or alters the suffix, I'm out of luck. At the same time, it's hard to argue against the fact that the Senate's own web site is a definitive source for up-to-date, reliable information about the current composition of the Senate. This is a situation that we're likely to encounter again: the best, most reliable site to get some information is the worst place to get useful data. Hopefully, as we go forward, we'll have multiple sources of information on various members of the government, and can use them all together.

CSV for the USA

Screenscraping is not my only option. The Open Government Information Awareness project publishes much of its information on the U.S. government in flat-file formats like CSV. The relevant file is available on its Sources page as pvs-people.csv (1.3 Meg).

The CSV files for the Open Government project don't include all of the information gathered by screen scraping (like a senator's current address), but they do include other useful information that the Senate site does not provide in a machine-readable format. Thus it's worth our time to create RDF from both sources, with the idea that all of the data will eventually coexist happily in a triple store.

CSV is a familiar, malleable format, and most of the major high-level, dynamic languages have good libraries for working with it. For the specific problem at hand, though, the Mindswap lab at the University of Maryland offers a tool called ConvertToRDF, which converts CSV data to RDF.

ConvertToRDF is a small, alpha-quality command-line tool, written in Java. To run it, you create a DAML ontology that describes your data, and a "map" file (in plain text or RDF) that describes how different columns in the CSV document correspond to the RDF output. Then you run the tool.

Taking this route meant that I needed an ontology. That hadn't been part of the plan: I'd wanted to collect some more data before figuring out how to fit everything together. (Also, generating RDF is fairly easy, while creating ontologies is, in my view, harder.) But if I'm going to commit to a Semantic Web framework for my government-navigation site, I need to begin thinking in terms of ontological relationships. That way I'll know what data I'd like to collect and can think of ways to format it correctly.

The fact that the ontology needs to be defined according to the DAML specification, which is an older spec upon which the more up-to-date OWL was based, isn't too big a deal — DAML and OWL are roughly comparable, and what I learn at this stage will easily carry over when I create an OWL ontology. In addition, the RDF output by ConvertToRDF will work fine with either ontology.

So I took a quick stab at modeling a politics ontology, strictly with regard to senators. You can take a look at it, but be advised that it's only a rough sketch to enable the conversion of my CSV data. We'll come back to it at a later date.

The file I was working with listed not just Senators, but thousands of different people in government. With my very rough ontology created, I massaged the CSV file in a spreadsheet program to list just senators.
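The same filtering can be scripted instead of done by hand in a spreadsheet. Here is a minimal Python sketch, assuming the CSV has a header row with an "office" column whose value for senators is "US Senator" (both taken from the map file and output shown later in this article). The sample rows are placeholders, not real data from pvs-people.csv, and the function name is hypothetical.

```python
import csv
import io

# Placeholder input: one senator row and one made-up non-senator row.
SAMPLE_CSV = (
    "name,office,state\n"
    "Bill Nelson,US Senator,FL\n"
    "Example Person,Example Office,XX\n"
)


def keep_senators(source, target, office_column="office", wanted="US Senator"):
    """Copy the header and every row whose office matches; return the row count."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[office_column].strip() == wanted:
            writer.writerow(row)
            kept += 1
    return kept


source = io.StringIO(SAMPLE_CSV)
target = io.StringIO()
kept = keep_senators(source, target)
print(kept, "row(s) kept")
print(target.getvalue())
```

With real files, the two StringIO objects would be replaced by files opened with newline="" as the csv module's documentation recommends.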

Then I created my map file, called senators.txt:

IPT file:///C:/HackingCongress/Ontologies/Politics.rdf pol

USE pol:USSenator
MAP pol:FullName	"name"
MAP pol:Gender		"gender"
MAP pol:RepresentsState	"state"
MAP pol:Office		"office"
MAP pol:Party		"party"
MAP pol:Religion	"religion"
MAP pol:Birthday	"birthday"
MAP pol:ElectedDate	"elected"
MAP pol:FamilyDesc	"family"
MAP pol:Seat		"seat"
MAP ID			"name"

Figure 7. A map file that establishes the relationships between spreadsheet columns and RDF data.

The first line identifies the DAML ontology to use, and the namespace to use to prefix our elements. After that, the USE command tells the application to create <pol:USSenator> records. Then we come to a set of MAP statements, which map RDF properties to the corresponding columns of the CSV file. Finally, the MAP ID line tells the application which column to use as a unique ID for each record.
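To make the idea of a map file concrete, here is a Python sketch of how USE and MAP lines might be read and applied to one CSV row. This is my own illustration, not ConvertToRDF's code: the function names are hypothetical, other directives are simply skipped, column names are assumed to contain no spaces, and the space-to-underscore rule for IDs is inferred from the rdf:ID in the tool's output.

```python
# A shortened version of the map file, for illustration only.
MAP_TEXT = """
USE pol:USSenator
MAP pol:FullName "name"
MAP pol:Party    "party"
MAP ID           "name"
"""


def parse_map(text):
    """Return (record type, {property: column}, id column) from map-file text."""
    record_type, columns, id_column = None, {}, None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "USE":
            record_type = parts[1]
        elif parts[0] == "MAP":
            column = parts[2].strip('"')
            if parts[1] == "ID":
                id_column = column
            else:
                columns[parts[1]] = column
        # Any other directive is ignored in this sketch.
    return record_type, columns, id_column


def apply_map(row, columns, id_column):
    """Build (record id, {property: value}) for one CSV row."""
    record_id = row[id_column].replace(" ", "_")  # assumed ID rule
    properties = {prop: row[col] for prop, col in columns.items()}
    return record_id, properties


record_type, columns, id_column = parse_map(MAP_TEXT)
row = {"name": "Bill Nelson", "party": "Democrat"}
print(record_type, apply_map(row, columns, id_column))
```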

I ran the application with the following command:

java -classpath 'classes;jar\rdfparser.jar;jar\utilities.jar' \
  org.mindswap.utils.ConvertToRDF fileIn="Senators.csv" \
  fileOut="Senators_GIA.rdf" mapFile="senators.txt" \
  useRowHeader="true"

Figure 8. The java command to convert CSV data to RDF.

And here is what a senator looks like in my output, from the file Senators_GIA.rdf:

<USSenator rdf:ID="Bill_Nelson">
  <FamilyDesc>"Wife: Grace & 2 Children: Nan Ellen</FamilyDesc>
  <Gender>Male</Gender>
  <Seat>Junior Seat</Seat>
  <ElectedDate>2000</ElectedDate>
  <Party>Democrat</Party>
  <Religion>Episcopalian</Religion>
  <Birthday>1942.09.29</Birthday>
  <RepresentsState>FL</RepresentsState>
  <Office>US Senator</Office>
  <FullName>Bill Nelson</FullName>
</USSenator>

Figure 9. A second RDF representation of a senator.

The ConvertToRDF tool is handy and free, but not really appropriate for industrial-strength conversion: it ran into some problems with the CSV file, when individual fields contained commas; its use of DAML instead of OWL is a problem; and it generates problematic XML (it doesn't convert ampersands to entities, for instance). However, its approach — create a map file, set up your ontology correctly, and receive RDF as a result — is handy, and if I end up coding my own CSV to RDF converters, it would be good to use ConvertToRDF as a model.
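Two of those problems, commas inside fields and unescaped ampersands, are ones a home-grown converter could avoid fairly cheaply. Here is a small Python sketch of the idea, using a real CSV parser and escaping text on the way out. The field value is a placeholder rather than real data, the function name is hypothetical, and column names are assumed to be valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Placeholder row: the quoted field contains both a comma and an ampersand.
SAMPLE_CSV = 'name,family\n"Bill Nelson","Spouse: A & 2 Children: B, C"\n'


def rows_to_xml(csv_text, element="USSenator"):
    """Return one XML fragment per CSV row, with text content escaped."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # csv keeps a quoted field whole, so the comma inside it is preserved.
        children = "".join(
            f"  <{column}>{escape(value)}</{column}>\n"
            for column, value in row.items()
        )
        records.append(f"<{element}>\n{children}</{element}>")
    return records


record = rows_to_xml(SAMPLE_CSV)[0]
print(record)

# The fragment parses as XML, and the original text round-trips intact.
parsed = ET.fromstring(record)
print(parsed.find("family").text)
```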

Summing Up

So now I've created two RDF files describing the same set of people — members of the U.S. Senate — and a very bare-bones ontology to describe the relationships between the data in these files. But astute readers probably noticed a real problem: the RDF for the first uses the rdf:about attribute to identify a senator, and the RDF for the second uses the rdf:ID attribute, and both resolve as very different URIs. As far as the RDF specification is concerned, these senators are totally distinct.
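The mismatch is easy to see once both identifiers are resolved to full URIs. In RDF/XML, rdf:ID="x" is equivalent to rdf:about="#x" resolved against the document's base URI, while an absolute rdf:about is used as-is. The Python sketch below applies that rule; the base URI is hypothetical, since I haven't said where Senators_GIA.rdf will be published.

```python
from urllib.parse import urljoin

# Hypothetical base URI for the CSV-derived file.
BASE = "http://www.example.org/Senators_GIA.rdf"


def resolve_rdf_id(base, rdf_id):
    """rdf:ID names a fragment of the document it appears in."""
    return urljoin(base, "#" + rdf_id)


def resolve_rdf_about(base, about):
    """rdf:about is a URI reference; absolute values ignore the base."""
    return urljoin(base, about)


from_csv = resolve_rdf_id(BASE, "Bill_Nelson")
from_scrape = resolve_rdf_about(BASE, "http://billnelson.senate.gov/")

print(from_csv)
print(from_scrape)
print("same resource?", from_csv == from_scrape)
```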

Solving that problem will wait for the next column. Another problem I'll need to solve is how to manage data that is currently in string literal format. For instance, to say that a senator is a "Democrat" is a very different thing than associating a senator with a URI that represents the concept of Democrat. To keep true to the Semantic Web concept, I'll need to create more RDF that defines concepts like Democrat, Republican, Male, Female, and so forth, so that I can move away from string literals in my triples. This will make it more efficient to query and navigate my data.
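As a rough illustration of the difference, the Python sketch below swaps known string literals for URIs. Everything in it is an assumption: the concept URIs are invented in the article's Politics namespace and are not defined in any ontology yet, and the tuple-based triples stand in for whatever a real triple store would provide.

```python
from typing import NamedTuple

POL = "http://www.hackingcongress.org/ns/Politics#"


class Literal(NamedTuple):
    """A plain string value."""
    value: str


class Resource(NamedTuple):
    """A node identified by a URI."""
    uri: str


# Hypothetical concept URIs keyed by (property, literal value).
CONCEPTS = {
    ("Party", "Democrat"): Resource(POL + "Democrat"),
    ("Party", "Republican"): Resource(POL + "Republican"),
    ("Gender", "Male"): Resource(POL + "Male"),
    ("Gender", "Female"): Resource(POL + "Female"),
}


def link_concepts(triples):
    """Replace literals that name a known concept; leave the rest untouched."""
    linked = []
    for subject, predicate, obj in triples:
        if isinstance(obj, Literal):
            obj = CONCEPTS.get((predicate, obj.value), obj)
        linked.append((subject, predicate, obj))
    return linked


triples = [
    ("http://billnelson.senate.gov/", "Party", Literal("Democrat")),
    ("http://billnelson.senate.gov/", "FullName", Literal("Nelson, Bill")),
]
linked = link_concepts(triples)
for triple in linked:
    print(triple)
```

Once two senators point at the same party URI, finding everyone in that party becomes a matter of following a shared node rather than comparing strings.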

So that's what's next. I need to figure out how to make these two RDF representations of the Senate consistent, by coming up with a way to refer to individual senators as URIs. I need to take another look at my ontology, and take a stab at identifying the relationships between people and ideas that I'll need to create a map of the U.S. government. After that, I'll take on the House of Representatives.