SIMILE: Practical Metadata for the Semantic Web

January 26, 2005

Stefano Mazzocchi, Stephen Garland, and Ryan Lee

Browsing digital libraries today can be a difficult process of navigating through different interfaces and different terminologies for each collection visited. The SIMILE Project is working to make it easier to wander from collection to collection and, more generally, to find your way around in the Semantic Web.

SIMILE (an acronym for Semantic Interoperability of Metadata In unLike Environments) was originally motivated by DSpace, a repository for storing, indexing, preserving, and redistributing digital assets, jointly developed by Hewlett-Packard Research Labs and the MIT Libraries. Many research-producing organizations, and often their libraries, use DSpace to manage digital data and to help researchers find that data. DSpace is available as open source software and is used and supported by the DSpace Federation community.

Like any good system developed in collaboration with a research library, DSpace maintains metadata about the content it manages and distributes on the web. However, its metadata support is currently limited to the general but relatively small Dublin Core descriptive metadata schema. In the future, DSpace needs to support additional metadata schemas for two purposes: finding digital research material described in various, domain-specific ways, and managing that digital content over time in order to preserve it. As DSpace expands to use new metadata schemas, it will have to deal with the problem of interoperability.

Enter the Semantic Web and extensible metadata. The Semantic Web core stack (RDF, RDFS, and OWL) enables people to create ontologies to describe their specialized metadata (perhaps building on existing, more general ontologies) and to make them generally reusable. But most people are not trained Semantic Web developers. They will need tools both to do this work and to assess whether they have done it correctly.

This is the problem space that SIMILE was started to address. A primary goal is to extend DSpace, enhancing support for arbitrary schemata and metadata and providing an architecture for disseminating digital assets. Although the problem domain of SIMILE originated in the library community, the tools we are developing should be readily reusable in other domains with similar problems. And because expertise in defining ontologies, creating RDF, and converting existing XML-based metadata into RDF is limited, it has been necessary to start with a secondary goal of creating the tools that metadata specialists (e.g., librarians) will need to produce good-quality RDF.

Pieces of the Puzzle

Longwell and Knowle

The ultimate concern in building software for librarians and researchers (and many others with related problems) is to provide the functions they expect in an attractive and easy-to-use package.

To make browsing RDF metadata practical for library users, SIMILE developed a suite of web applications that perform RDF browsing via standard web browsers. The suite is composed of Longwell, a faceted browser that targets end users by hiding the presence of the underlying RDF model, and Knowle (shipped as part of the Longwell distribution), a node-focused graph navigation browser that targets people who want to see or debug the underlying RDF model.

The browsing suite (named after the Longwell component) is written as Java servlets and is built around HP's Jena2 Semantic Web toolkit. The latest distribution of the Longwell suite can be downloaded from http://simile.mit.edu/dist/longwell/.

Figure 1: Longwell screenshot browsing an aggregated collection of art images.

Faceted browsing, as illustrated in the above screenshot, displays only the metadata fields that are configured to be 'facets' (i.e., fields considered important for users browsing data in one or more specific domains). The values of those fields serve as a means of zooming into a collection: selecting a particular field-value pair narrows the view to the items that carry it (e.g., 26 works of art in the example dataset have a subject of Abstract Expressionism). Faceted browsing thus provides a mechanism that allows users to explore different schemas from different domains with a unified interface and to discover the synergies across them. For example, the interface can be designed to show users (through relative screen placement) that one schema uses a "subject" facet while another uses a "topic" facet for similar information. The user from the first domain finds the familiar term ("subject") and sees the related term ("topic") next to it. Another way to achieve cross-schema discovery is to start with a keyword search (e.g., by supplying a name), see results from different collections that use different facets, and then browse those facets further to explore the unfamiliar terms and collections.
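The zooming mechanism described above can be illustrated with a minimal Python sketch. It is not Longwell's implementation (Longwell is a set of Java servlets built on Jena2); the records, field names, and the function names facet_counts and select are all invented for illustration. The sketch counts how many items carry each value of each configured facet, then narrows the collection to a chosen field-value pair.

```python
from collections import Counter

# Hypothetical records, invented for illustration: each item maps a
# metadata field to a list of values.
ITEMS = [
    {"title": "Number 1", "subject": ["Abstract Expressionism"], "medium": ["oil"]},
    {"title": "Untitled", "subject": ["Abstract Expressionism"], "medium": ["ink"]},
    {"title": "Water Lilies", "subject": ["Impressionism"], "medium": ["oil"]},
]

# Only the fields configured as facets are offered to the user.
FACETS = ["subject", "medium"]


def facet_counts(items, facets):
    """Return {facet: Counter mapping each value to the number of items with it}."""
    counts = {facet: Counter() for facet in facets}
    for item in items:
        for facet in facets:
            # set() ensures an item with a repeated value is counted once.
            for value in set(item.get(facet, [])):
                counts[facet][value] += 1
    return counts


def select(items, facet, value):
    """Zoom in: keep only the items that carry the given field-value pair."""
    return [item for item in items if value in item.get(facet, [])]


if __name__ == "__main__":
    for facet, counter in facet_counts(ITEMS, FACETS).items():
        print(facet, dict(counter))
    narrowed = select(ITEMS, "subject", "Abstract Expressionism")
    print([item["title"] for item in narrowed])
```

Because "title" is not listed in FACETS, it never appears in the counts, which mirrors the idea that only configured fields are exposed for browsing.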

(If this sounds vaguely familiar, you may recall the May 2004 XML.com WWW2004 Semantic Web Roundup article by Paul Ford which mentions SIMILE.)

Welkin

Configuring tools like Longwell requires a thorough understanding of the structure of the data being examined. More generally, it is hard to get a global overview of an RDF model, and there are relatively few tools for summarizing RDF and giving a quick mental model of the data being manipulated with a browser. So we created Welkin.

Welkin is an interactive graphical RDF browser that visualizes any RDF model without requiring prior configuration (like Knowle, but unlike Longwell) and displays RDF as a clustered set of nodes and arcs. Welkin (written with the outstanding volunteer contributions of Paolo Ciccarese of the University of Pavia) is particularly useful for understanding and mining the layout of unfamiliar datasets.

Unlike existing graph-drawing methods that focus on "getting it right" with complicated layout algorithms, Welkin tries to empower the user with an interactive approach, allowing users to mine, zoom, drag, select, cluster, filter, and highlight nodes and arcs. Development is recent and ongoing, but even at an early stage, it has been very useful. In the screenshot below, Welkin is used to browse a fragment of the metadata of the art image collection, converted to RDF.

Figure 2: Welkin screenshot browsing a fragment of the metadata of the collection of art images.

Gadget

Another problem we faced was the transformation of existing XML datasets into RDF. Again, the problem is a lack of tools that give you an at-a-glance overview of an XML dataset (or a collection of XML documents). Gadget helps data managers understand the structure of an XML dataset by providing a summary of the count, unique values, and percentage of unique values for XML attributes. It is also very helpful when a dataset comes with no schema, or when you need to understand which parts of a schema the dataset actually uses (which simplifies the transformation steps, since you can avoid transforming the parts of the schema that are never used).
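To make the kind of summary described above concrete, here is a minimal Python sketch using only the standard library. It is not Gadget itself; the sample document and the function name summarize_attributes are invented for illustration. For every element/attribute combination it reports the count, the number of unique values, and the percentage of unique values.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document, invented for illustration.
SAMPLE = """<records>
  <record id="1" lang="en"><title type="main">A</title></record>
  <record id="2" lang="en"><title type="main">B</title></record>
  <record id="3" lang="fr"><title type="alt">C</title></record>
</records>"""


def summarize_attributes(xml_text):
    """Map 'element/@attribute' to (count, unique values, percent unique)."""
    values = defaultdict(list)
    # iter() walks the root element and all of its descendants.
    for element in ET.fromstring(xml_text).iter():
        for name, value in element.attrib.items():
            values[f"{element.tag}/@{name}"].append(value)
    summary = {}
    for key, seen in values.items():
        unique = len(set(seen))
        summary[key] = (len(seen), unique, 100.0 * unique / len(seen))
    return summary


if __name__ == "__main__":
    for key, (count, unique, percent) in summarize_attributes(SAMPLE).items():
        print(f"{key}: count={count} unique={unique} ({percent:.1f}% unique)")
```

An attribute whose values are 100% unique (such as an identifier) and one with only a handful of distinct values (such as a language code) usually call for different treatment in a transformation, which is why even a crude summary like this is useful before writing any conversion code.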

RDFizers

The real strength of RDF lies in the support it provides for defining models and in the highly distributed nature that RDF models suggest. However, its RDF/XML serialization is considered a very unfriendly compromise by both the XML community and the community of potential users interested in leveraging the power of a model.

For this reason, we started a project to create and catalog software tools and scripts we call RDFizers, which are able to transform data from existing syntaxes into RDF. These tools are a practical way to resolve the chicken-and-egg problem from which the Semantic Web currently suffers ("not much RDF data will be created without a killer app, but no killer app will be created without more RDF data") by making it easier for specialists (like librarians and other metadata experts) to convert popular and widely available metadata sources into RDF.

RDFizers allow people to explore their existing data in available RDF browsing tools, showing the benefit of using RDF as the lingua franca, not only to Semantic Web advocates but also to people who just want to get their job done in the easiest possible way.
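As a rough illustration of what such a converter does, the following Python sketch turns a small XML record set into N-Triples. It is not one of the SIMILE RDFizers: the input layout, the base URI http://example.org/record/, and the function names are assumptions made for this example, and it further assumes that every record has an id attribute that is safe to place in a URI and that each child element name matches a Dublin Core element.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
BASE = "http://example.org/record/"      # hypothetical base URI for subjects

# Hypothetical input, invented for illustration.
SAMPLE = """<records>
  <record id="42"><title>Water "Lilies"</title><creator>Monet</creator></record>
</records>"""


def escape_literal(text):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def rdfize(xml_text):
    """Yield one N-Triples line per non-empty child element of each <record>."""
    for record in ET.fromstring(xml_text).iter("record"):
        # Assumption: the id attribute is present and URI-safe.
        subject = f"<{BASE}{record.get('id')}>"
        for field in record:
            text = (field.text or "").strip()
            if text:
                yield f'{subject} <{DC}{field.tag}> "{escape_literal(text)}" .'


if __name__ == "__main__":
    for line in rdfize(SAMPLE):
        print(line)
```

The output is plain N-Triples rather than RDF/XML because the line-oriented format is the simplest to generate correctly by hand; any RDF toolkit can read it and re-serialize it in another syntax.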

Fresnel

In addition to developing our own tools and software, SIMILE is leveraging work at MIT on Haystack, an extensible "universal information client" that enables users to manage diverse sources of information (e.g., email, calendars, address books, and web pages) by defining whichever arrangements of, connections between, and views of information they find most effective. At times, the interaction offered by a web-browser interface is too limited. The Haystack project is exploring a "rich client" interface that allows RDF data to be manipulated as well as navigated. It might be used by librarians who wish to manage the collections described with SIMILE-produced metadata or by users who want to collect and manage their own subsets of the SIMILE information.

Unlike Welkin, which displays information as a graph, Haystack aims for a Longwell-like presentation of information that is natural for naive end users. Haystack uses standard primitives like drag and drop and context menus to give users access to various operations on the data being viewed at any given time. Haystack is currently being repackaged as a plugin for the Eclipse platform.

In working on RDF browsing for both SIMILE and Haystack, we found that life would be easier if we had a general ontology governing how to display RDF, a kind of stylesheet for RDF that allows us to indicate how we would like to present some abstract data to the user. Together with other members of the Semantic Web development community, we're working on putting together Fresnel, a generic ontology for describing how to render RDF in a human-friendly manner.
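The "stylesheet for RDF" idea can be sketched in a few lines of Python. This does not use any Fresnel vocabulary, which is still being worked out; the resource, the prefixed property names, the DISPLAY_RULES structure, and the render function are all hypothetical. The point is only that the choice of which properties to show, in what order, and under what label lives in data that is separate from the resource being displayed.

```python
# Hypothetical resource, invented for illustration: property -> value.
RESOURCE = {
    "rdf:type": "ex:Painting",
    "dc:creator": "Monet",
    "dc:title": "Water Lilies",
    "ex:internalId": "42",
}

# Hypothetical display rules keyed by type: which properties to show,
# in which order, and under which human-readable label.
DISPLAY_RULES = {
    "ex:Painting": [("dc:title", "Title"), ("dc:creator", "Artist")],
}


def render(resource, rules):
    """Return display lines for a resource, following the rules for its type."""
    selected = rules.get(resource.get("rdf:type"), [])
    # Properties not named in the rules (e.g. ex:internalId) are not shown.
    return [f"{label}: {resource[prop]}"
            for prop, label in selected if prop in resource]


if __name__ == "__main__":
    for line in render(RESOURCE, DISPLAY_RULES):
        print(line)
```

Keeping the rules outside the rendering code means that two different browsers could, in principle, share the same presentation description.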

Behind the Curtain

Four groups support SIMILE: Hewlett-Packard Research Labs, the W3C, MIT Libraries, and MIT CSAIL. The principal investigators have included Mick Bass, Eric Miller, MacKenzie Smith, and David Karger. The developers are Stefano Mazzocchi, Stephen Garland, and Ryan Lee. The development team would like to acknowledge the amazing work done by Mark Butler to bootstrap the Longwell project, even though he is no longer involved on a day-to-day basis.

An Incomplete Picture

Although we've mentioned some of the interesting issues for metadata browsing, there are still many more.

For metadata specialists and system developers, what about editing RDF? What about building new ontologies? What about storing vast quantities of (potentially distributed) RDF and accessing it efficiently? What about using performance-enhancing techniques (such as caching) for RDF? What about quickly inferencing over RDF data?

For users, can we design faceted browsing interfaces that scale to dozens of RDF ontologies? How about improving navigation across the linkages between ontologies? How can we support searching that will start in one domain/ontology and expand into relevant related domains/ontologies?

We don't have easy answers to these questions. But we are investing thought and code in finding acceptable ways to answer them, and we could use help. Even though we work in a university environment, we don't want these tools to reside on an ivory tower shelf or sit forgotten in a university's vault; we want to see people using them!

If this brief description of our goals, status, and the issues we face sounds interesting, we are an open source project, and we welcome your insights and contributions. We have an advanced development infrastructure (drawn from one of our developers' long-term participation in the Apache Software Foundation), and we do our development in a collaborative environment under a commercial-friendly BSD license to maximize the pool of potential users and contributors.

Start with our website, download our software, and, if you wish, join forces on our mailing lists.