Music and Metadata
by Chris Mitchell
November 22, 2006
Introduction
Music Events
There are many independent music events in London, where Shawn lives, and he often visits the website www.drownedinsound.com, which lists events for the coming week. This site also lists the artists' names and the types of artists played by the DJs. Another site, london.openguides.org, lists great places to eat in London. Both sites give the address of the music event or restaurant, and Google Maps can be used to show these places on a map. So a restaurant and a nearby music event can easily be chosen and, using the Web, the music event can be booked if the restaurant date is going well.
Now, on Shawn's first date, an artist's name came up in conversation. His date said something about trying to find out whether Zack de la Rocha's collaborations with Reprazent and DJ Shadow were going to be released. Shawn ran down to his local record shop the next day to find some of this music. He asked the assistant for help; the assistant heard the name de la Rocha and handed Shawn Rage Against the Machine's self-titled release from 1992. Shawn was grateful for the CD and for the clerk's help, as he knew nothing about dance music and had accidentally given the impression on his first date that he liked the artists she mentioned, too.
The website that lists local music events also lists the artists playing, the addresses, and the dates, but Shawn would also need to know related artists in case the few artists he'd heard of were not listed. It would also be useful to know the genres that make up dance music, since some events are listed not as dance music but under genres such as House or Jungle. He could find all this information on the Web, which would solve his dilemma but require a lot of cross-referencing and screen time. However, our guy has left this all to the last minute, arranging to meet his date at 8 p.m. and getting off the phone only an hour and a half before they are due to meet.
So what is practically different in using the Semantic Web to solve this problem? Like the Web, the Semantic Web describes information using labels that help to find the web pages needed to solve a problem. These labels normally include the title of the page and other metadata, such as location information or publication date. Unlike the Web, however, the Semantic Web lets the relationships between these labels be defined across multiple web pages, and it can also describe other things, such as people or places, that do not need an actual web page, only a common label. This is like the common column headings used to relate tables within a database or across multiple databases.
For example, suppose someone has labeled an event as playing Dixieland music, and someone else has defined both Dixieland and Bebop as kinds of jazz. Shawn knows that his date likes dance music, and he can use the Semantic Web to follow the relationships between labels, such as the ones that explicitly state that Dixieland and Bebop are types of jazz. The Semantic Web's power is its ability to add labels: it uses the information that Dixieland is a type of jazz to assume that the event playing Dixieland music can also be labeled as a jazz event. This statement is true even though the original author did not think to add the label to the event's description. This process, known as inference, is simple and adds assumed labels to web pages or other information on the Semantic Web. The inference system can be very compact, consisting of a number of rules of the form: if this label and another label are both present, then an additional label can be assumed to be present when querying. Now if our indie-loving guy knew absolutely nothing about music and was looking for "The Dixie Jazz Band" in the bargain bin of his favorite local retailer to see whether the band played dance music, this would be a great bit of direct information to know.
However, for our quirky but otherwise lovable hero, the direct use of this assumed information generated by the Semantic Web is hard to illustrate with such an example. This is because the Semantic Web's real advantage comes with scale. The advantage for music information comes when lots of people share lots of information about music, music events, and artists. Using the inference techniques, it can then be brought together by adding assumed labels to the original information, allowing the total body of information to be reused in ways not initially intended by its individual authors. This can generate a wealth of information that would otherwise require a lot of time to obtain. Now remember to keep thinking simply; the Semantic Web allows you to assume more labels describing web pages or other information than you would normally be able to obtain in a reasonable amount of time. This would most definitely help our panicking friend.
How can we collect and store semantic data from music events and get it into a Semantic Web format without spending hours rewriting information into compatible markup languages? When the Web started, many people wrote code to convert their existing information to the new web format. The same is true for the Semantic Web, for which information is generated from databases and numerous other formats. We can now also use web pages themselves as a source of information, which doesn't require the people running the site to give open access to their database. The way to convert these sources into Semantic Web-compatible markup is to use a screen scraper. Screen scrapers are simple programs that, in this case, use XPath queries to extract the relevant information from web pages by navigating the DOM and following external links. This raw information is then fed to a Semantic Web programming environment, which outputs documents written in RDF or OWL, two of the Semantic Web's markup languages. These documents can hold any newly inferred label information as well as any that is explicitly defined. They are just like the metadata labels used at the top of current web pages (title, author) or the table names and values in a database. Jena, the open source Java library that Piggy Bank (discussed later in this article) is built around, is the most functional of these libraries. However, there are also libraries written in C# for the more .NET-oriented users among you. These libraries can output Semantic Web documents and also house the inference systems that process multiple documents from different sources, adding the assumed labels.
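To make the idea of an RDF document more concrete, here is a minimal sketch that writes a couple of label statements in N-Triples, one of the plain-text RDF syntaxes. The toNTriples and escapeLiteral helpers, the example.org event URI, and the event title are hypothetical; a real project would normally leave serialization to a library such as Jena.

```javascript
// Escape the characters that N-Triples requires to be escaped in literals.
function escapeLiteral(text) {
  return text
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// Turn an array of {subject, predicate, object, isLiteral} statements
// into N-Triples text, one statement per line.
function toNTriples(statements) {
  return statements.map(function (s) {
    var object = s.isLiteral
      ? '"' + escapeLiteral(s.object) + '"'
      : "<" + s.object + ">";
    return "<" + s.subject + "> <" + s.predicate + "> " + object + " .";
  }).join("\n");
}

var RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
var DC_TITLE = "http://purl.org/dc/elements/1.1/title";
var eventUri = "http://example.org/events/1"; // hypothetical event URI

var ntriples = toNTriples([
  { subject: eventUri, predicate: RDF_TYPE,
    object: "http://www.drownedinsound.com/event", isLiteral: false },
  { subject: eventUri, predicate: DC_TITLE,
    object: 'A "Jungle" night', isLiteral: true } // made-up title
]);
```

Each line of the result is one statement: a subject, a label (the predicate), and a value. The flag that separates literal values from links to other resources plays the same role as the last argument of model.addStatement in the scraper listing.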
This all sounds like a lot of work to start producing Semantic Web content – especially considering that the Semantic Web is more useful as more people use it, much like the current Web. For those of us who are less technically savvy, this could be a major barrier. Fortunately, a collaborative project being undertaken by MIT, W3C and HP has written tools to simplify this task of extracting information. Collectively, these tools are published under the SIMILE project name [1].
The first part of our screen scraper requires us to write the XPath statements that will extract the information from the relevant web pages and, in this case, the JavaScript to run those XPath queries. We can use an application called Solvent [2], from the SIMILE project, to generate these queries and code. Solvent runs under Firefox. After navigating to a web page, you can scrape information by bringing up Solvent's interface (Figure 1) and selecting an item in the web browser, such as this week's music events list in London. If this information is part of a recurring pattern of entries, such as a table or list in the browser, Solvent writes the appropriate XPath and JavaScript to grab the recurring entries; if not, it just grabs the highlighted section(s).
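The pattern of running an XPath query over recurring entries can be sketched as follows. This is a hand-written illustration, not Solvent's actual output, and the collectLinks helper is a hypothetical name. It relies only on the standard DOM evaluate and iterateNext calls, and the result type is given as a number so the function can also be exercised outside a browser.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_ITERATOR_TYPE in the DOM XPath API.
var ORDERED_NODE_ITERATOR_TYPE = 5;

// Run an XPath query against a document and return the href of every
// matching link, in document order.
function collectLinks(doc, xpath) {
  var result = doc.evaluate(xpath, doc, null, ORDERED_NODE_ITERATOR_TYPE, null);
  var hrefs = [];
  var node = result.iterateNext();
  while (node) {
    hrefs.push(node.href);
    node = result.iterateNext();
  }
  return hrefs;
}

// In a browser, against a page laid out like the listing discussed here:
//   collectLinks(document, "//div[@id='maincol']/ul/li[@class='normal']/a");
```

Because the query describes a position in the page structure rather than one particular element, the same expression matches every entry in the list, which is what makes recurring entries easy to scrape.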

Figure 1. Solvent running in Firefox
Using this automatically generated screen scraper, you can grab the raw information from the web pages. This information can then be reused in another application known as Piggy Bank [3], again from the SIMILE project, which will eventually store the scraper's raw information in an RDF document or data store. All this is done without touching a line of code. Simplified screen scraper code for the coming week's London events on drownedinsound.com is listed in Figure 2. For information on how to install the full version into Piggy Bank, please see the SIMILE project notes [4] and the Resources section in this article.
// The names doc, browser, model, utilities, done, and wait are not defined in
// this listing; they are expected to come from the Piggy Bank scraper environment.

// Process one loaded page: look for its venue links and describe the page in RDF.
function processEntry(d, model, utilities, uris, uriToLocation) {
  var elmt = d.evaluate("//div[@id='maincol']/div[@class='detail']", d, null, XPathResult.ANY_TYPE, null);
  var urls1 = []; // venue links found on this page (not used further in this simplified listing)
  var aElmt = elmt.iterateNext();
  while (aElmt) {
    if (aElmt.innerHTML.indexOf("venue") != -1) {
      utilities.debugPrint(aElmt.childNodes[1]);
      urls1.unshift(aElmt.childNodes[1]);
    }
    aElmt = elmt.iterateNext();
  }

  // Namespaces used to build the RDF statements
  var rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  var rdfs = "http://www.w3.org/2000/01/rdf-schema#";
  var dc = "http://purl.org/dc/elements/1.1/";
  var drownedinsound = "http://www.drownedinsound.com/";
  var loc = "http://simile.mit.edu/2005/05/ontologies/location#";
  var uri = d.location.href;

  // The last argument is false when the value is a resource and true when it is a literal.
  model.addStatement(uri, rdf + "type", drownedinsound + "event", false);
  model.addStatement(uri, rdf + "type", drownedinsound + "venue", false);
  // NOTE: address is never assigned in this simplified listing; the step that
  // reads it from the page has been left out.
  model.addStatement(uri, loc + "address", address, true);
  uris.unshift(uri);
}

var uris = []; // URIs of the pages that have been described
var urls = []; // pages still to be loaded

// Collect the link of every entry in this week's listing.
var iterator = doc.evaluate("//div[@id='maincol']/ul/li[@class='normal']/a", doc, null, XPathResult.ANY_TYPE, null);
var aElement = iterator.iterateNext();
while (aElement) {
  urls.unshift(aElement.href);
  aElement = iterator.iterateNext();
}

utilities.processDocuments(
  browser, // current browser
  null, // first document to process if any
  urls, // array of urls to load asynchronously
  function(d, cont) { // function to process each document as it gets loaded
    try {
      // NOTE: uriToLocation is not defined in this simplified listing.
      processEntry(d, model, utilities, uris, uriToLocation);
    } catch (e) {
      utilities.debugPrint(e);
    }
    cont(); // continue with the iteration
  },
  done, // what to do when all documents have been processed
  function(e, url) { // error handler
    alert("Error scraping data from " + url + "\n" + e);
  }
);
wait(); // don't navigate to the collected data just yet
Figure 2. Simplified screen scraper code to get this week's music listings from drownedinsound.com
This screen scraper requests web pages and the venue pages to which the events link. These venue pages list the addresses at which the events are being held. The code calls the Semantic Web libraries to generate the RDF data that holds the information obtained from these pages, and it also calls another resource that provides a function not mentioned so far: it uses the venue's postcode and address to get its longitude and latitude. This will come in handy later. (I have included the full version of the files used in this example, including the screen-scraper file, its .n3 file [used in its installation], and the server-side code used to obtain the U.K. geocoordinates for the venue's address.) Piggy Bank, apart from storing all this information, also displays a navigable, searchable index of the Semantic Web data, which you can share with others. Figure 3 shows how RDF information is displayed in Piggy Bank. The data shown was not obtained from a screen scraper but directly from RDF sources at london.openguides.org.

Figure 3. Piggy Bank running in Firefox showing an Italian restaurant's RDF description
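The longitude and latitude step described above could be recorded with the same addStatement call that the scraper listing uses. The sketch below is an assumption-laden illustration: it uses the W3C Basic Geo (WGS84) vocabulary, which may not be what the full scraper uses, and the addCoordinates helper, the recording stand-in for Piggy Bank's model, the venue URI, and the coordinates are all made up.

```javascript
// W3C Basic Geo (WGS84) vocabulary; its use here is an assumption.
var GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#";

// Attach a latitude and longitude to a venue as literal values.
function addCoordinates(model, venueUri, latitude, longitude) {
  model.addStatement(venueUri, GEO + "lat", String(latitude), true);
  model.addStatement(venueUri, GEO + "long", String(longitude), true);
}

// Stand-in for Piggy Bank's model object: it only records what it is given,
// so the sketch can run outside the browser extension.
function makeRecordingModel() {
  var statements = [];
  return {
    statements: statements,
    addStatement: function (subject, predicate, object, isLiteral) {
      statements.push({ subject: subject, predicate: predicate,
                        object: object, isLiteral: isLiteral });
    }
  };
}

var demoModel = makeRecordingModel();
// Hypothetical venue URI and made-up coordinates.
addCoordinates(demoModel, "http://example.org/venues/1", 51.5, -0.12);
```

With coordinates stored as ordinary statements, a venue and a restaurant described by different sites can be compared by position without either site knowing about the other.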
