Hacking Oscar!

March 23, 2005

XQuery is a rich and expressive language. I love exploring the types of questions you can pose using it. In fact, I enjoy exploring the types of queries you can pose almost as much as I enjoy discovering what those queries can discover (if you can parse that sentiment). I realized around Academy Awards time a year ago that the Oscars® were a rich and exciting domain that seemed to be crying out for XQuery exploration.

Think about it. Think of all the Oscar trivia sites on the web, and the newspaper columns that were appearing just a few short weeks ago, all focused on this year's awards. They're full of questions like:

What are the two most nominated films of all time? ("All About Eve" and "Titanic")
How many nominations did they each receive? (14)
What are the three movies that have won the most awards? ("Ben-Hur," "Titanic," and "The Lord of the Rings: The Return of the King")
How many awards did they each win? (11)
How many actors (male and female both) have been nominated for both Leading and Supporting roles in the same year? (10, including Jamie Foxx at this year's awards)
Which director has been nominated five times for Best Director but has never won? (Martin Scorsese)

Reading through questions like these, I suddenly had a minor epiphany. I realized that, given XQuery and a suitable XML database of Academy Award information, I'd be able to ask and answer all those questions myself. What power! Even better, I'd be able to make up trivia questions of my own, limited in scope only by my imagination and creativity. I started getting excited thinking about XPaths. (Hey, it's better than playing on the freeway!) I decided that automating an Oscars trivia database would be an interesting and challenging project.

Once I started playing around with hypothetical queries that could be posed against such a database, I quickly realized that the number of such hypotheticals was huge. And the database would be useful for many more things than simply asking and answering trivia questions. What about statistical analyses, say, of the factors correlating nominations and winners? What about "six degrees of separation"-type questions, but in an Oscars domain? What about adding Academy Award-based relationships to the semantic web?

I realized I'd probably be able to come up with some good trivia questions and ideas for interesting research. However, if I could provide a web-based front-end that made the data available via a query interface to other people as well, they'd probably be able to come up with far better trivia questions and research ideas than I ever could on my own. In short, thus was born many a sleepless night.

Once I'd decided to proceed, two questions immediately arose: What would such a database look like, and once I'd designed it, where would I get the data?

Structuring the Data

I thought about my requirements. Given the richness of the query domain, I realized I'd probably be doing a lot of ad hoc exploration at the keyboard. I decided that one of my main criteria would be query concision: I'd be doing a lot of typing, and the fewer keystrokes I had to enter, the better. This meant I'd probably also want a fairly simple schema. Sitting at the keyboard, I didn't want to have to deal with complex structures or remember a large number of attribute and element names.

Happily, it quickly became evident that every Academy Award nomination has at heart an exceedingly simple structure. Every nomination associates, in addition to the name of the award and the year it was awarded, just two basic items: A motion picture, and one or more people involved in that picture's production. Two of the key elements in my schema would thus be <picture> and <person>.

The role each <person> played in a particular nomination would be determined by the award category: In the case of a Best Picture nomination, for example, there might be multiple <person> entries, each one being a producer (and given Hollywood custom, there might be thousands of those :-), while in the case of Best Actor, the single <person> associated with each nomination would be the actor him- (or her-) self. If the award were for cinematography, the <person> would be a cinematographer. And so on. I couldn't think of a structure much simpler than that.

Being able to notate winners and losers was also important. Each of the bulleted questions above, for example, asks either directly or indirectly about a competitive result: Who was nominated, and who won and lost? So I decided that while there are a number of honorary and technical achievement awards given each year that don't have clear winners and losers (the Irving Thalberg Memorial award, the Scientific and Engineering Award, and the Jean Hersholt Humanitarian Award, to name just three), I wasn't interested in those and thus wouldn't attempt to be authoritative about everything Oscar. I'd let other sites enumerate that type of information; I just wanted to be able to ask, in interesting ways, who had won and who had lost in particular categories.

The schemas I came up with were all minor variations on a basic structure. Here's one showing the data for Best Actor for the 77th Academy Awards just held (or best performance by an actor in a leading role, as the Academy of Motion Picture Arts and Sciences likes to put it). I figured some hands-on querying would quickly show me whether this was a reasonable format or not. If it wasn't, no big deal: I could easily use XQuery to transform this structure into something more suitable.

<award year="2004">
    <actor><won>
        <person>Jamie Foxx</person>
        <picture>Ray</picture></won></actor>
    <actor><lost>
        <person>Don Cheadle</person>
        <picture>Hotel Rwanda</picture></lost></actor>
    <actor><lost>
        <person>Johnny Depp</person>
        <picture>Finding Neverland</picture></lost></actor>
    <actor><lost>
        <person>Leonardo DiCaprio</person>
        <picture>The Aviator</picture></lost></actor>
    <actor><lost>
        <person>Clint Eastwood</person>
        <picture>Million Dollar Baby</picture></lost></actor>
</award>

You'll notice I'm using elements in several instances where you might typically expect to find attributes (<actor> as opposed to <award name="actor">, for example, and <won> and <lost> instead of <award won="yes"> and <award won="no">). That's because such a structure makes for more easily typed (as in "keyboarded") queries, as shown below.
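To make the keystroke difference concrete, here is a rough side-by-side sketch of the same question ("who won Best Actor in 2004?") against the two designs. The second path assumes a hypothetical attribute-based layout of the form <award year="2004" name="actor" won="yes">, which is not the schema used in this article; it is shown only for comparison. The comma simply joins the two paths into one sequence so the block is a single valid expression.

```xquery
(: Element-based schema used in this article:
   each category and outcome is a bare step in the path. :)
//award[ @year="2004" ]/actor/won/person
,
(: Hypothetical attribute-based alternative (an assumption, not the
   article's schema): every distinction becomes a predicate to type. :)
//award[ @year="2004" ][ @name="actor" ][ @won="yes" ]/person
```

Against the element-based data, the second path selects nothing, since no <award> carries name or won attributes.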

Here's the corresponding data for Best Picture:

<award year="2004">
    <bestPicture><won>
        <picture>Million Dollar Baby</picture>
        <person>Clint Eastwood</person>
        <person>Albert S. Ruddy</person>
        <person>Tom Rosenberg</person></won></bestPicture>
    <bestPicture><lost>
        <picture>Finding Neverland</picture>
        <person>Richard N. Gladstein</person>
        <person>Nellie Bellflower</person></lost></bestPicture>
    <bestPicture><lost>
        <picture>The Aviator</picture>
        <person>Michael Mann</person>
        <person>Graham King</person></lost></bestPicture>
    <bestPicture><lost>
        <picture>Ray</picture>
        <person>Taylor Hackford</person>
        <person>Stuart Benjamin</person>
        <person>Howard Baldwin</person></lost></bestPicture>
    <bestPicture><lost>
        <picture>Sideways</picture>
        <person>Michael London</person></lost></bestPicture>
</award>
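If this element-heavy format had turned out to be awkward, a transformation along the following lines could recast it. This is only a sketch of one possible target: it assumes the attribute-based alternative mentioned above (name and won attributes on <award>), and it assumes every category element wraps exactly one <won> or <lost> child, as in the two samples.

```xquery
(: Recast each nomination as one <award> element whose category and
   outcome are attributes instead of nested elements. :)
for $award in //award
for $category in $award/*          (: e.g. <actor> or <bestPicture> :)
return
  <award year="{ $award/@year }"
         name="{ name( $category ) }"
         won="{ if ( $category/won ) then 'yes' else 'no' }">
    { $category/*/* }              (: copy the <person> and <picture> children :)
  </award>
```

Run against the two samples, this would yield ten <award> elements, one per nomination, each carrying the original <person> and <picture> children unchanged.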

An Oscars Trivia Sampler

Given the above structures, here are some of the trivia-type questions you might want to pose against this data:

List the nominees for Best Actor in 2004

string-join( //award[ @year="2004" ]/actor//person/text(), ", " )

=> Jamie Foxx, Don Cheadle, Johnny Depp, Leonardo DiCaprio, Clint Eastwood

How many nominees were there?

count( //award[ @year="2004" ]/actor )
=> 5

Who won?

//award[ @year="2004" ]/actor/won/person/text()
=> Jamie Foxx

What picture did he win for?

//award[ @year="2004" ]/actor/won/picture/text()
=> Ray

Has this actor previously been nominated for any other awards?

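let $actorName :=
    //award[ @year="2004" ]/actor/won/person/text()
return
    (: ftcontains yields a boolean, so it goes inside a predicate;
       exists() then tests whether any earlier <person> matched :)
    if ( exists( //award[ @year < "2004" ]//person[ . ftcontains $actorName ] ) )
    then "Yes!"
    else "No"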

=> No

This list provides just the barest hint of the many types of queries you could ask. The ftcontains expression in the last query, by the way, is from the XQuery Full-Text working draft published last July.
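The "most nominated" and "most awards" questions from the opening list could be approached with a tally like the one below. It is a sketch rather than a finished query: it assumes every category in the database follows the same nesting as the two samples above (category element, then <won> or <lost>, then <picture>), and it groups by title alone, so two different films that share a title would be counted together.

```xquery
(: Tally nominations and wins per film title, most-nominated first. :)
for $title in distinct-values( //award/*/*/picture )
(: a nomination is any category element whose <won> or <lost> names the film :)
let $nominations := count( //award/*[ */picture = $title ] )
let $wins        := count( //award/*[ won/picture = $title ] )
order by $nominations descending, $wins descending
return concat( $title, ": ", $nominations, " nominations, ", $wins, " wins" )
```

Adding the award year to the grouping key would be one way to keep same-titled films apart, at the cost of a slightly longer query.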

Populating the Database

Once I knew more or less what my data was going to look like, I went looking for a way to populate my database. One thing was clear: The Academy Awards encompass 77 years of data, and I was not eager to start practicing my typing skills again.

My first thought was IMDB. Terms on their website clearly forbade either screen-scraping their site or creating a database from their downloadable files without prior consent, so I requested permission by email. I never got a reply, but rather than pursuing that further I settled on my number two choice, grabbing my data from a small, privately maintained site known as The Oscar Guy.

I'd never done any screen-scraping before and was a bit nervous about the legal ramifications of what I was intending. A well-connected friend put me in contact with one of the world's leading experts on digital rights and technology, who assured me that I should be fine, since there's no copyright on the facts of who won which Oscar. And while there might be a "thin" degree of copyright on the selection and arrangement of material on the Oscar Guy site, there shouldn't be a problem as long as I was building my own database and wasn't merely duplicating that selection and arrangement. Feeling somewhat reassured (there's no such thing as certainty when it comes to the possibility of litigation), I pressed on.

My next question was: How does one screen-scrape? The answer (no surprise) again involved XQuery. But that's where I'll stop for the moment. I'll leave the meat of the technical discussion for my next installment, when I'll outline how I used XQuery and TagSoup to convert the Oscar Guy's source HTML into the XML format I required. I'll also summarize my experience with some handy tips on how to use XQuery for screen-scraping in general. And I'll publish my promised query front-end for the Oscars Trivia Website.

If this topic motivates you to come up with some interesting XQuery-based Academy Award trivia questions of your own, by the way, send them in to me. If they're sufficiently novel or illustrative of interesting things you can do with XQuery, I'll include them as part of the site. Judges are standing by.