The Semantic Web: A Primer
November 1, 2000
The Semantic Web lies at the heart of Tim Berners-Lee's vision for the future of the World Wide Web. Along with others at the W3C, Berners-Lee is working on the infrastructure for this next stage of the Web's life. But the question "What is the Semantic Web?" is being asked with increasing frequency. While mainstream media is content with a high-level view, XML developers want to know more, and they want to discover the substance behind the vision.
Accusations of fuzziness about the Semantic Web (SW) have been levelled at Berners-Lee, and it is certainly true that he has yet to deliver the long-awaited "Semantic Web Whitepaper." However, there are interesting suggestions in his Semantic Web Roadmap text which give details about the direction he wants to go. Furthermore, SW activity at the W3C and MIT/LCS has been increasing in intensity, and community involvement with RDF, a key SW technology, has increased markedly over recent months.
In his Roadmap document, Berners-Lee contrasts the Semantic Web with the existing, merely human-readable Web: "the Semantic Web approach instead develops languages for expressing information in a machine processable form." This is perhaps the best way of summing up the Semantic Web -- technologies for enabling machines to make more sense of the Web, with the result of making the Web more useful for humans.
Given that goal, it's unsurprising that the scope of the Semantic Web vision is somewhat broad and ill-defined. There are many ways to solve the problem and many technologies that can be employed. Some XML developers have a "well-formed" prejudice against, as they cheerily call it, the "Pedantic Web" because of the strong links with RDF (not everyone's favorite technology) and the definite view taken on URIs. But to perceive the SW only in this light would be a mistake. Technical peeves aside, the value of the Semantic Web is to solve real problems in communication. First and foremost this means radically improving our ability to find, sort, and classify information: an activity that takes up a large part of our time.
The development of the Semantic Web is well underway, and it is proceeding on at least two fronts: the infrastructural, all-embracing approach espoused by the W3C, MIT, and other academically focused organizations, and a more directed, application-specific approach taken by those using web technologies for electronic business.
One of the fundamental contributions towards the Semantic Web to date has been the development of XML itself. Liberating data from opaque, inextensible formats as it does, XML provides an interoperable syntactical foundation upon which solutions to the larger issues of representing relationships and meaning can be built. It's an important center of agreement among individual developers and corporations. The face of the Web is changing, offering once again new possibilities for communication and interaction -- not because all of the underlying concepts are new per se, but because they can be combined on the Web and exposed to the opportunity and unpredictability of large-scale decentralization.
For the developer, however, the grand vision is irrelevant unless it can be put to work. The point of this article is to draw together the technological threads of the Semantic Web and introduce some tools available now that can be used as a basis for experimentation and development.
This section addresses some of the most important technologies for constructing the Semantic Web. By no means is this list exhaustive because, as I observe in the section addressing RDF, as long as there is some translation to a common data model, many syntaxes can be a source of structured information for a machine. However, I have included those technologies that are key in this stage of Semantic Web development.
Perhaps surprisingly, a powerful tool for the construction of the Semantic Web is HTML itself or, more properly, XHTML. Most people are acquainted with the "meta" tags, which can be used to embed metadata about the document as a whole (for more on metadata, see An Introduction to Dublin Core). Yet there are more powerful, granular techniques available too. Although largely unused by web authors, XHTML offers several facilities for introducing semantic hints into markup, allowing machines to infer more about the web page content than just the text. These tools include the "class" attribute, used most often with CSS stylesheets. Applied strictly, they allow a machine to extract data from a document intended for human consumption. For instance, consider the following example:
<p>
  For more information, contact:
  <span class="contact" id="edumbill">
    <span class="name">Edd Dumbill</span>,
    <span class="role">Managing Editor</span>,
    <span class="organization">XML.com</span>
  </span>
</p>
A program could easily construct from such an XHTML snippet a "Contact" object identified by the ID "edumbill" with properties "name", "role" and "organization."
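As a rough illustration of how little code such a program needs, here is a minimal Python sketch using only the standard library. It assumes the page is well-formed XHTML and, like the snippet above, carries no XML namespace declaration; the function name extract_contacts is invented for this example and is not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# The XHTML snippet from the article, unchanged apart from line breaks.
XHTML = """<p> For more information, contact:
<span class="contact" id="edumbill">
  <span class="name">Edd Dumbill</span>,
  <span class="role">Managing Editor</span>,
  <span class="organization">XML.com</span>
</span> </p>"""


def extract_contacts(markup):
    """Return {id: {property: text}} for every span with class="contact".

    Hypothetical helper: assumes well-formed, namespace-free XHTML.
    """
    root = ET.fromstring(markup)
    contacts = {}
    for element in root.iter("span"):
        if element.get("class") != "contact":
            continue
        # Each child span's class names a property; its text is the value.
        properties = {
            child.get("class"): (child.text or "").strip()
            for child in element.findall("span")
            if child.get("class")
        }
        contacts[element.get("id")] = properties
    return contacts


contacts = extract_contacts(XHTML)
print(contacts["edumbill"])
```

Because the script keys on class names rather than on the position of elements in the page, a redesign that keeps those class attributes should leave it working.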
Techniques similar to this, known colloquially as "screen scraping," have been used for some time on the Web. Common applications include the extraction of data from search engines for use in Perl scripts or the extraction of headline information from news sources. For these applications the problem has been the shifting nature of the design of HTML pages and, thus, the need to readjust the scrapers whenever the design changes. A page marked up using the technique shown above would enable reliable scripts to interface with the HTML.
As web application providers consider adding SOAP and similar interfaces to their systems to allow remote-application access, they could actually be saved the effort of maintaining twin APIs (browser and SOAP) by embedding machine-readable information in the HTML itself. There is still a lot of value and utility in simpler web technologies.
Once the richer information has been embedded in a page, a program still needs to transform it into the format it requires. At this point another W3C technology, XSLT, has a lot to offer. Given an XHTML page as input, it is useful for selecting and transforming the contents of that page. It provides an excellent bridge from older HTML technology to the nascent XML-based Semantic Web applications. A tool of singular utility when used in conjunction with an XSLT processor is Dave Raggett's "Tidy," which can take HTML and turn it into XHTML. As most web authoring tools still don't have XHTML support, HTML will be created by web authors for some time to come. Tidy facilitates the processing of normal HTML with XSLT, enabling authors of such documents to participate in the Semantic Web.
Although there have been several proposals for embedding RDF inside HTML pages, the technique of using XSLT transformations has a much broader appeal. Few people want to learn RDF, and so it presents a barrier to the creation of semantically rich web pages. Using XSLT provides a way for web developers to add semantic information with minimal extra effort. Dan Connolly of the W3C has conducted quite a number of experiments in this area, including HyperRDF, which extracts RDF statements from suitably marked-up XHTML pages.
Semantic Web Technologies (cont'd)
The W3C's Resource Description Framework is one of the cornerstones of Semantic Web work. While its somewhat unwieldy syntax often attracts negative attention from XML developers, the real value of RDF is the data model. It defines a very simple data model of triples (subject, predicate, object), where subject and predicate are URIs, and the object is either a URI or a literal. With this simple model, objects and their properties may be represented. Although the XML serialization of RDF (the "Syntax" of the RDF Model & Syntax specification) is referred to as RDF/XML, other syntaxes are being proposed to try to overcome the awkwardness of the existing syntax. For example, RDF models could just as easily be serialized using SOAP's serialization rules (see the presentation at WWW9 by Henrik Frystyk-Nielsen). It is in this simple data model that the power of RDF truly lies. As long as information on the web can be reduced to triples like this, it doesn't really matter which XML serialization format is used. What isn't negotiable here, though, is the role of the URI as a universal identifier.
The table below shows a hypothetical RDF/XML snippet, and the generated triples in the data model.
<contact rdf:about="edumbill">
  <name>Edd Dumbill</name>
  <role>Managing Editor</role>
  <organization>XML.com</organization>
</contact>
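To make the reduction to triples concrete, the following Python sketch walks the snippet and emits one (subject, predicate, object) tuple per property, plus a type statement for the enclosing element. It is an illustration under stated assumptions rather than a conforming RDF/XML parser: the rdf: namespace declaration is added here so that the snippet parses, the function name to_triples is invented, and bare names such as "edumbill" and "name" stand in for the full URIs a real parser would produce.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The article's snippet, with the rdf: prefix declared (an addition made
# here) so that an XML parser will accept it.
SNIPPET = (
    '<contact xmlns:rdf="' + RDF_NS + '" rdf:about="edumbill">'
    "<name>Edd Dumbill</name>"
    "<role>Managing Editor</role>"
    "<organization>XML.com</organization>"
    "</contact>"
)


def to_triples(markup):
    """Reduce one typed node to a list of (subject, predicate, object).

    Hypothetical helper: handles only a flat element with literal-valued
    child properties, and leaves names unresolved instead of expanding
    them to URIs.
    """
    node = ET.fromstring(markup)
    subject = node.get("{" + RDF_NS + "}about")
    # The element name of a typed node gives the resource's type.
    triples = [(subject, "rdf:type", node.tag)]
    for prop in node:
        triples.append((subject, prop.tag, (prop.text or "").strip()))
    return triples


triples = to_triples(SNIPPET)
for triple in triples:
    print(triple)
```

Any other serialization that produced the same four tuples would be equivalent as far as the data model is concerned.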
Once we have the data model, there's a need to describe the characteristics of the objects being modeled. For instance, we want to say that a "Contact" must have a name, role, and organization property. This is where RDF schemas come in -- they define an RDF vocabulary that can be used to express the "Contact" class. This allows all users of a resource of type "Contact" to have an agreed expectation of its properties and relationship to other resource types.
RDF schemas differ somewhat from XML schemas (such as DTDs or W3C XML Schemas) in that they do not define a permissible syntax but instead classes, properties, and their interrelation: they operate directly at the data model level, rather than the syntax level. Scaled up over the Web, RDF schemas are a key technology, as they allow machines to make inferences about the data collected from the web.
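The kind of inference this makes possible can be sketched in a few lines of Python. The example below applies two rules in the spirit of RDF Schema (a property's domain determines the class of its subjects, and class membership propagates up a subclass hierarchy) to a handful of triples. The vocabulary is hypothetical: "Person" and the schema statements are invented for illustration, and bare names again stand in for URIs.

```python
def infer_types(triples):
    """Apply two RDF Schema-style rules until no new triples appear.

    Illustrative only: real schema processing covers more rules and
    works on URIs rather than bare names.
    """
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # Rule 1: a property's rdfs:domain gives the class of
                # any resource that uses the property.
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, "rdf:type", o2))
                # Rule 2: membership propagates up rdfs:subClassOf.
                if p == "rdf:type" and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, "rdf:type", o2))
        if new <= facts:
            return facts
        facts |= new


# Hypothetical schema statements, invented for this example.
schema = [
    ("role", "rdfs:domain", "Contact"),
    ("Contact", "rdfs:subClassOf", "Person"),
]
# A single data triple; note that it never mentions a type.
data = [("edumbill", "role", "Managing Editor")]

inferred = infer_types(schema + data)
print(("edumbill", "rdf:type", "Contact") in inferred)
print(("edumbill", "rdf:type", "Person") in inferred)
```

The data says nothing about what "edumbill" is; both type statements follow from the schema alone, which is the sense in which a schema operates at the data model level rather than the syntax level.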
In fact, work is now underway to take RDF Schemas one step further in the description of ontologies. (An ontology is essentially a formal description of objects and their interrelationships.) MIT/LCS has begun to define DAML (DARPA Agent Markup Language), a language for expressing ontologies. Although DAML is very much a work in progress, real work can be done now with RDF Schemas; see the entry on Redfoot below.
The hardest problem in this area is not the infrastructure, but the actual ontologies themselves. Until an industry-wide ontology exists for, say, vehicle parts, there is a limit to the utility of the SW in the auto manufacturing industry. Organizations such as the Dublin Core Metadata Initiative have been developing such vocabularies for some time now, and they've made progress both in terms of the ontologies themselves and also tools to manage and maintain them.
Work on XML protocols -- the use of XML for messaging and remote procedure calls -- approaches the Semantic Web from the other end of the spectrum. Avoiding grand schemes for the classification of everything, it is focused on standardizing XML-based interactions between computers. A key component of XML protocol technology is the description and discovery of web services available via XML protocols such as SOAP, since systems require the ability to conduct electronic transactions with other systems of which they have no prior knowledge.
This requirement has led to the creation of technologies such as Web Services Description Language (WSDL), which describes the characteristics of the interface offered by a web service, and ADS, which allows the advertisement and discovery of such services. ADS, by offering techniques for embedding such descriptions inside normal web content, fits neatly into the Semantic Web vision. (For more on WSDL and ADS, see our XML Protocol Technology Reference.) The recently announced UDDI effort also provides an API for registries of web e-business services. Although the Semantic Web vision focuses on decentralized technology as opposed to centralized registries, the emphasis on machine discovery of resources is a common theme.
While the XML protocol-related technologies solve narrow problems in order to achieve results over the next year, they represent use-cases for the Semantic Web, and one expects that mature SW technologies will cater for the solution of problems such as these.
The major center of Semantic Web-related development thus far has been in the area of RDF. The creation of semantically-richer documents is a relatively easy task, so most of the effort has been concentrated on accumulating the data, storing it and querying it. RDF/XML provides a useful intermediate syntax which, when combined with tools like XSLT, allows multiple data sources to be combined.
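The accumulate, store, and query cycle can be sketched with a toy in-memory store. This is not the API of any of the frameworks listed below; the class name TripleStore, its methods, and the two "sources" are assumptions made for illustration. It shows why a common data model makes combining sources straightforward: triples from different origins simply merge into one set.

```python
class TripleStore:
    """A toy in-memory triple store (hypothetical, for illustration)."""

    def __init__(self):
        self._triples = set()

    def add_all(self, triples):
        """Accumulate triples; duplicates across sources collapse."""
        self._triples.update(triples)

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


# Two hypothetical sources describing the same resource, for example
# one scraped from an XHTML page and one parsed from RDF/XML.
from_page = [
    ("edumbill", "name", "Edd Dumbill"),
    ("edumbill", "organization", "XML.com"),
]
from_rdf = [
    ("edumbill", "organization", "XML.com"),
    ("edumbill", "role", "Managing Editor"),
]

store = TripleStore()
store.add_all(from_page)
store.add_all(from_rdf)

# Everything known about one resource, regardless of where it came from.
about_edd = store.match(subject="edumbill")
# Every resource associated with a given organization.
at_xmlcom = store.match(predicate="organization", obj="XML.com")
print(about_edd)
print(at_xmlcom)
```

The merge only works because both sources use the same identifier for the resource, which is the practical reason for insisting on the URI as a universal identifier.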
Further details on RDF tools and applications can be obtained from the W3C's RDF home page, and the RDF category of XMLhack. In this section I will concentrate on introducing tools useful for making a relatively speedy start with Semantic Web development.
Redland: Redland is an RDF application framework with C and Perl APIs. As a framework, most of its components are pluggable. For instance, you can choose which RDF parser you use (an important factor at this stage in RDF's development, where the emphasis on conformance for RDF parsers is not as high as it is for XML parsers). Storage mechanisms are also pluggable: currently, in-memory storage and Berkeley DB are supported. Beta-level software.
Redfoot: Redfoot is a 100% Python application framework for distributed RDF applications. It provides a web interface to its RDF import, editing, and viewing functions. It also has support for RDF Schemas. One of its more intriguing features is emerging support for peer-to-peer exchange of RDF data -- peered Redfoots (Redfeet?) will be able to discover the contents of each other's stores. Easy customization of the web interface makes this a good choice for experimentation with RDF. Alpha/beta-level software.
Wraf: The Web Resource Application Framework is another RDF application framework, this time written in Perl. It also offers a web interface to RDF storage, editing and querying. Alpha-level software.
RSS 1.0: This work on the next generation of web site metadata distribution employs RDF for its data model and syntax. Of particular interest is its use at the W3C, where XSLT is used to extract the RSS information from the front page. Dan Connolly has documented how this was done. If you want to experiment with scraping data from XHTML pages, this is an interesting starting point.
Describing and retrieving photos using RDF and HTTP: This note, written by W3C staff, describes the creation of a system allowing the description and retrieval of photographs using RDF. The RDF itself is embedded in the comment portion of JPEG files using a custom editor application, and it's retrieved through an extension to a web server. This illustrates another good starting point for doing Semantic Web development using existing web technologies: attempting to combine this work with a framework such as Redfoot would be an interesting line of investigation.
The Semantic Web has already been the subject of much bluster among the XML developer community and will doubtless continue to be so. Arguments rage over the usefulness of the technology, the difficulty of using RDF, and so on. However, the Semantic Web vision of a machine-readable web has possibilities for application in most web technology -- while some complain about its lack of definition, its broad scope properly reflects the quietly radical effect it will have on the Web.