XML.com: XML From the Inside Out

Dublin Core in the Wild

October 25, 2000

The eighth meeting of the Dublin Core Metadata Initiative (DCMI) was held October 4-6 at the National Library of Canada in Ottawa, Canada. Bringing together about 150 participants from 20 countries, DC8 was as much about focusing the future work of this group as it was an opportunity to educate newcomers like us on the work that had already been accomplished. We were there to explore the relationship between RSS and Dublin Core.

The Director of DCMI, Stu Weibel of Online Computer Library Center (OCLC), got things started. Weibel was not the only one to mention that DCMI members are a passionate group -- a passion stemming from their conviction that metadata is the key to improving the state of the Web as an information resource.

Weibel explained that DCMI started in 1994 at the 2nd World Wide Web Conference, held in Chicago. Its name comes from Dublin, Ohio, home of OCLC, a library computing consortium. The original mission of DCMI was to improve resource discovery on the Web by establishing a minimal set of metadata constructs, and Weibel reaffirmed that mission in his opening talk. He said that DCMI has become an "open consensus-building initiative" dedicated to improving the ways users find things on the Web. While recognizing that DCMI is not the only group working on metadata standardization, Weibel noted that DCMI's approach has always been interdisciplinary, and its focus remains fixed on the Web.

The DCMI group has produced two specifications. The Dublin Core Metadata Element Set, Version 1.1 lists fifteen metadata elements that can be used to properly identify anything on the Web. Among these are such elements as Title, Creator, Description, and Subject. DCMI has also produced the Dublin Core Qualifiers, which includes two classes of qualifiers: those which further refine or narrow the meaning of an element, and those which reference a known encoding scheme such as an existing controlled vocabulary.

If resources on the Web provide structured metadata in accordance with the Dublin Core, then this information can be catalogued, processed, and retrieved in ways that assist users in locating what they want. Some sites already use the meta tag to supply Dublin Core elements for HTML documents. This metadata can also be placed in a separate file and referenced. Some search engines pay attention to the presence of metadata, but the impact of supplying metadata is not easily discernible for the average web site developer. Obviously, the success of Dublin Core is dependent upon achieving critical mass with site developers who must see the benefits of supplying metadata. It is also dependent upon having tools and services that make use of this metadata in innovative ways.
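As a rough illustration of the meta tag approach, the following Python sketch renders a few Dublin Core elements as HTML meta tags. The `DC.` name prefix and the `schema.DC` link follow a commonly used convention for embedding Dublin Core in HTML; the function name `dc_meta_tags` and the sample values are assumptions made for this example, not part of any specification.

```python
from html import escape

# Namespace URI of the Dublin Core Metadata Element Set, Version 1.1.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def dc_meta_tags(fields):
    """Render a dict of Dublin Core element names and values as HTML meta tags.

    `fields` maps lowercase element names (e.g. "title") to string values.
    This helper is hypothetical and exists only for illustration.
    """
    lines = ['<link rel="schema.DC" href="%s">' % DC_NAMESPACE]
    for name, value in fields.items():
        # Escape both name and value so quotes and ampersands stay valid HTML.
        lines.append('<meta name="DC.%s" content="%s">'
                     % (escape(name, quote=True), escape(value, quote=True)))
    return "\n".join(lines)


print(dc_meta_tags({
    "title": "OpenAL Explained",
    "creator": "Dave Phillips",
    "date": "2000-10-13",
}))
```

The output could be pasted into the head of an HTML page; a crawler that recognizes the convention could then read the title, creator, and date without parsing the body text.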

A talk by Eric Miller of the OCLC (co-authored by Daniel Brickley of the World Wide Web Consortium (W3C) and Rael Dornfest) gave an overview of the current and future Dublin Core Metadata landscape, the latter firmly rooted in Resource Description Framework (RDF). RDF is an XML application rapidly gaining acceptance as an effective way to express the relationships described by metadata. Miller made the point that Dublin Core Metadata could be expressed in the syntax of RDF as well as in simple English. Statements such as "The author is Alfred Jarry" and "The date of publication is March 1, 1999" may seem obvious to humans, but must be properly encoded (in RDF) so as to be understood by computers. When processed by a computer, RDF statements actually support fairly complex queries of a sort that we cannot perform easily in a search engine today: "Find me all articles by the person whose email address is dale@oreilly.com."
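To make the query idea concrete, here is a minimal Python sketch that treats metadata statements as (subject, property, value) triples, which is the shape RDF statements take. It is not an RDF parser or a real query language; the identifiers such as `article:1` and `person:dale`, the `email` property name, and the sample data are all invented for illustration.

```python
# Each statement is a (subject, property, value) triple, as in RDF.
# All identifiers and values below are made up for this example.
TRIPLES = [
    ("article:1", "dc:title", "Article One"),
    ("article:1", "dc:creator", "person:dale"),
    ("article:2", "dc:title", "Article Two"),
    ("article:2", "dc:creator", "person:someone"),
    ("person:dale", "email", "dale@oreilly.com"),
    ("person:someone", "email", "someone@example.com"),
]


def subjects_with(triples, prop, value):
    """Return the subjects of all triples matching the given property and value."""
    return [s for (s, p, v) in triples if p == prop and v == value]


def articles_by_email(triples, email):
    """Find articles whose creator is the person with the given email address."""
    articles = []
    # Step 1: find the person(s) who own the email address.
    for person in subjects_with(triples, "email", email):
        # Step 2: find everything that names that person as its creator.
        articles.extend(subjects_with(triples, "dc:creator", person))
    return articles


print(articles_by_email(TRIPLES, "dale@oreilly.com"))
```

The query joins two statements through a shared identifier, which is something a keyword search over page text cannot do reliably.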

Miller also pointed out that Dublin Core establishes a standard framework for metadata. It's a common foundation for all applications that can also be extended to serve the specialized needs of disparate groups. The virtue of sharing a common framework means that a lot of useful metadata can be shared across very different domains, and interoperability is there by design.

One only needs to look at Napster to see good and bad examples of effective metadata usage. Metadata is what allows one to search by Title, Artist, etc. Napster could be viewed as a system for managing the metadata associated with music files. However, Napster is also a bad example of metadata because users who upload an MP3 may be very loose in specifying this information. The same song may be listed under different titles. Nonetheless, Napster, like RSS, suggests that distributing metadata has commercial value.

In his plenary talk, "A Grammar of Dublin Core", Thomas Baker of the German National Research Center for Information Technology offered the view that the Dublin Core Metadata set was more than a card-catalog system for the Internet. He called it a "pidgin for digital tourists." A pidgin language is a specialized, small vocabulary that can be useful for speaking in simple but effective ways. He used sentence diagrams to demonstrate how Dublin Core statements work. The fifteen DC elements are a limited list of nouns, while the DC qualifiers provide a rich, yet standardised/restricted set of adjectives.

In some ways, metadata seems like an esoteric subject with potentially fractal qualities -- "If data about data is metadata, what is data about metadata?" Yet you can approach metadata from a very practical viewpoint. It's how we organize information every day in our calendars, address books, and organizational charts. An email message without To:, From:, and Subject: headers is practically useless; it's a draft of a message not ready for distribution. A specification like Dublin Core asks us to be more disciplined in how we think about organizing our everyday data. While most people title their Web pages, few think to add such tidbits as author, subject, publication date, language, etc. The Web page may live on your Web server, but it's the metadata that's picked up by search engines, screen scraped, and routed everywhere people might be looking for it. When it comes to metadata, each single addition brings a logarithmic increase in value; a little dab of metadata goes a long way.

RSS and Dublin Core

The reason for going to the Dublin Core conference was to strengthen the connection between the Dublin Core community and developers of RDF Site Summary (RSS). In many ways, RSS has already proved useful as a metadata testbed and validates many of the assumptions implicit in the Dublin Core efforts. RSS demonstrates that site developers will provide metadata, and that the aggregation and flow of metadata can increase a site's traffic. RSS originated at Netscape and was meant to support Dublin Core, but Netscape dropped that support from the specification at the last moment, to the dismay of the DCMI community. Instead, RSS 0.91 established a very small set of metadata constructs, essentially Title, Link, and Description. In managing metadata through our Meerkat aggregator, and as a publisher, we could see the limitations of the current RSS framework: we simply didn't know enough about the individual items flowing in an RSS channel. We believe that Dublin Core provides a much-needed metadata framework.

The new RSS 1.0 proposal provides a way to utilize Dublin Core as a common framework for sharing richer metadata. The goal is to bridge these two efforts so that Dublin Core can benefit from the experience of RSS developers and their tools, and RSS can benefit from the expertise of the Dublin Core community. One can continue to use the current RSS and provide only Title, Link, and Description; but for those who already have the metadata and want to make it available, we wanted to create a standard way to do so. That's the rationale behind RSS 1.0. Thus, the combination of RSS and Dublin Core can provide a powerful way of making this useful data available outside one's own content management system.

Like many companies, we use content management systems, and we have a lot of metadata about what we publish in our database. We generated an RSS file for O'Reilly Network that contains items that are Dublin Core compliant. Below is an example of one item, produced by our system.

<item rdf:about="http://www.oreillynet.com/pub/a/linux/2000/10/13/oa_openal.html">
<title>OpenAL Explained</title>
<link>http://www.oreillynet.com/pub/a/linux/2000/10/13/oa_openal.html</link>
<dc:description>
OpenAL is the Open Audio Library, a cross-platform, open source solution
for programming 2D and 3D audio.
</dc:description>
<dc:creator>Dave Phillips</dc:creator>
<dc:subject>Linux, APIs, Game Development, Gaming, Multimedia</dc:subject>
<dc:type>Technical Article</dc:type>
<dc:language>en-us</dc:language>
<dc:date>2000-10-13</dc:date>
<dc:format>text/html</dc:format>
<dc:rights>Copyright 2000, O'Reilly Network</dc:rights>
<dc:publisher>O'Reilly and Associates, Inc.</dc:publisher>
</item>

As you can see, in addition to Title, Link, and Description, we have supplied the following fields: Creator, who in this case is the author of the article; a list of Subject keywords; the Type of item, in this case, a technical article; the Language in which it is written; the Date it was published; the file Format; and a statement about who owns the Rights to this article as well as the name of its Publisher.
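For readers who want to consume such an item programmatically, the sketch below reads the Dublin Core fields with Python's standard `xml.etree.ElementTree` module. It assumes the item sits inside an `rdf:RDF` root that declares the RDF, RSS 1.0, and Dublin Core namespaces, which the excerpt above does not show; the abbreviated sample document and the helper name `dc_fields` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for RDF, RSS 1.0, and the Dublin Core element set.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abbreviated sample; the wrapping rdf:RDF element is an assumption.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://www.oreillynet.com/pub/a/linux/2000/10/13/oa_openal.html">
    <title>OpenAL Explained</title>
    <dc:creator>Dave Phillips</dc:creator>
    <dc:subject>Linux, APIs, Game Development, Gaming, Multimedia</dc:subject>
    <dc:date>2000-10-13</dc:date>
  </item>
</rdf:RDF>"""


def dc_fields(item):
    """Collect the Dublin Core child elements of an item into a dict."""
    prefix = "{%s}" % NS["dc"]
    return {
        child.tag[len(prefix):]: (child.text or "").strip()
        for child in item
        if child.tag.startswith(prefix)
    }


root = ET.fromstring(DOC)
item = root.find("rss:item", NS)
fields = dc_fields(item)
print(item.findtext("rss:title", namespaces=NS), fields["creator"], fields["date"])
```

Because the Dublin Core elements live in their own namespace, a consumer that only understands Title, Link, and Description can ignore them, while a richer consumer can pick them out by namespace as shown.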

As we've said, Dublin Core provides a common set of metadata constructs. One can go beyond and supply even more detailed metadata for specific applications. However, there's a reasonable benefit to supplying just this amount of metadata, which now opens the possibility that a user could search for a document by its author, date of publication, subject, and publisher. For example, a Linux site publishes an article on Apache. By using the Subject field to supply "Apache" as a keyword, an Apache web site, not interested in general Linux information, can locate that story and point to it. We can imagine applications that allow users to keep track of the metadata for documents they browse, which could be much more useful than bookmarks for retrieving something that has interested you.
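The Apache scenario might look something like the following Python sketch, which filters a list of items by Subject keyword. It assumes each item has already been reduced to a dict and that Subject holds a comma-separated keyword list, as in our example item; the function `has_subject` and the first sample item are hypothetical.

```python
def has_subject(item, keyword):
    """True if keyword is one of the item's comma-separated Subject terms.

    Matching is case-insensitive and on whole terms, so "Apache" does not
    match an item whose only related term is "Apache Ant".
    """
    terms = [t.strip().lower() for t in item.get("subject", "").split(",")]
    return keyword.lower() in terms


# Sample data; the first item is invented for illustration.
ITEMS = [
    {"title": "Hypothetical Apache article",
     "subject": "Linux, Apache, Web Servers"},
    {"title": "OpenAL Explained",
     "subject": "Linux, APIs, Game Development, Gaming, Multimedia"},
]

apache_items = [i["title"] for i in ITEMS if has_subject(i, "Apache")]
print(apache_items)
```

Both items carry the "Linux" keyword, but only the first would be picked up by a site interested solely in Apache.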

Once the RSS 1.0 proposal is solidified, O'Reilly Network will begin providing DC-compliant metadata via RSS. If you are interested in doing so as well, please let us know.