The Semantic Web: A Primer
The Semantic Web lies at the heart of Tim Berners-Lee's vision for the future of the World Wide Web. Along with others at the W3C, Berners-Lee is working on the infrastructure for this next stage of the Web's life. But the question "What is the Semantic Web?" is being asked with increasing frequency. While the mainstream media is content with a high-level view, XML developers want to know more: they want to discover the substance behind the vision.
Accusations of fuzziness about the Semantic Web (SW) have been leveled at Berners-Lee, and it is certainly true that he has yet to deliver the long-awaited "Semantic Web Whitepaper." However, his Semantic Web Roadmap document offers interesting suggestions about the direction he wants to take. Furthermore, SW activity at the W3C and MIT/LCS has been increasing in intensity, and community involvement with RDF, a key SW technology, has grown markedly over recent months.
In his Roadmap document, Berners-Lee contrasts the Semantic Web with the existing, merely human-readable Web: "the Semantic Web approach instead develops languages for expressing information in a machine processable form." This is perhaps the best way of summing up the Semantic Web -- technologies for enabling machines to make more sense of the Web, with the result of making the Web more useful for humans.
Given that goal, it's unsurprising that the scope of the Semantic Web vision is somewhat broad and ill-defined. There are many ways to solve the problem and many technologies that can be employed. Some XML developers have a "well-formed" prejudice against, as they cheerily call it, the "Pedantic Web" because of the strong links with RDF (not everyone's favorite technology) and the definite view taken on URIs. But to perceive the SW only in this light would be a mistake. Technical peeves aside, the value of the Semantic Web is to solve real problems in communication. First and foremost this means radically improving our ability to find, sort, and classify information: an activity that takes up a large part of our time.
The development of the Semantic Web is well underway, in at least two areas: from the infrastructural, all-embracing position espoused by the W3C/MIT and other academically focused organizations, and in a more directed, application-specific fashion by those using web technologies for electronic business.
One of the fundamental contributions towards the Semantic Web to date has been the development of XML itself. Liberating data from opaque, inextensible formats as it does, XML provides an interoperable syntactical foundation upon which solutions to the larger issues of representing relationships and meaning can be built. It's an important center of agreement among individual developers and corporations. The face of the Web is changing, offering once again new possibilities for communication and interaction -- not because all of the underlying concepts are new per se, but because they can be combined on the Web and exposed to the opportunity and unpredictability of large-scale decentralization.
For the developer, however, the grand vision is irrelevant unless it can be put to work. The point of this article is to draw together the technological threads of the Semantic Web and introduce some tools available now that can be used as a basis for experimentation and development.
This section addresses some of the most important technologies for constructing the Semantic Web. By no means is this list exhaustive because, as I observe in the section addressing RDF, as long as there is some translation to a common data model, many syntaxes can be a source of structured information for a machine. However, I have included those technologies that are key in this stage of Semantic Web development.
Perhaps surprisingly, a powerful tool for the construction of the Semantic Web is HTML itself or, more properly, XHTML. Most people are acquainted with the "meta" tags that can be used to embed metadata about the document as a whole (for more on metadata, see An Introduction to Dublin Core). Yet more powerful, granular techniques are available too. Although largely unused by web authors, XHTML offers several facilities for introducing semantic hints into markup, allowing machines to infer more about a web page's content than just its text. These tools include the "class" attribute, used most often with CSS stylesheets. A disciplined application of such attributes can allow a machine to extract data from a document intended for human consumption. For instance, consider this example:
<p>
  For more information, contact:
  <span class="contact" id="edumbill">
    <span class="name">Edd Dumbill</span>,
    <span class="role">Managing Editor</span>,
    <span class="organization">XML.com</span>
  </span>
</p>
A program could easily construct from such an XHTML snippet a "Contact" object identified by the ID "edumbill", with the properties "name", "role", and "organization."
Techniques similar to this, known colloquially as "screen scraping," have been used for some time on the Web. Common applications include the extraction of data from search engines for use in Perl scripts and the extraction of headline information from news sources. For these applications the problem has been the shifting design of HTML pages and, thus, the need to readjust the scrapers whenever a design changes. A page marked up using the technique shown above would let scripts interface with the HTML reliably, since the class names, rather than the visual layout, carry the structure.
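To make the idea concrete, here is a minimal sketch of such a scraper using Python's standard-library HTML parser. The `ContactScraper` class and its field names are hypothetical, invented for this illustration; only the class/id markup convention comes from the example above.

```python
from html.parser import HTMLParser

class ContactScraper(HTMLParser):
    """Collect text inside <span class="name|role|organization">
    elements nested within a <span class="contact" id="...">."""

    FIELDS = {"name", "role", "organization"}

    def __init__(self):
        super().__init__()
        self.contacts = {}        # contact id -> {field: text}
        self._contact_id = None   # id of the contact being parsed
        self._field = None        # field currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class")
        if tag == "span" and cls == "contact":
            self._contact_id = attrs.get("id")
            self.contacts[self._contact_id] = {}
        elif tag == "span" and cls in self.FIELDS:
            self._field = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        if self._contact_id and self._field:
            record = self.contacts[self._contact_id]
            record[self._field] = record.get(self._field, "") + data

snippet = """<p> For more information, contact:
<span class="contact" id="edumbill">
<span class="name">Edd Dumbill</span>,
<span class="role">Managing Editor</span>,
<span class="organization">XML.com</span>
</span> </p>"""

scraper = ContactScraper()
scraper.feed(snippet)
print(scraper.contacts)
# {'edumbill': {'name': 'Edd Dumbill', 'role': 'Managing Editor',
#               'organization': 'XML.com'}}
```

Because the scraper keys on class names rather than on the page's visual design, a site redesign that preserves the markup convention leaves it working unchanged.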
As web application providers consider adding SOAP and similar interfaces to their systems to allow remote-application access, they could actually be saved the effort of maintaining twin APIs (browser and SOAP) by embedding machine-readable information in the HTML itself. There is still a lot of value and utility in simpler web technologies.
Once the richer information has been embedded in a page, a program still needs to transform it into the format it requires. At this point another W3C technology, XSLT, has a lot to offer. Given an XHTML page as input, it is useful for selecting and transforming the contents of that page. It provides an excellent bridge from older HTML technology to the nascent XML-based Semantic Web applications. A tool of singular utility when used in conjunction with an XSLT processor is Dave Raggett's "Tidy," which can take HTML and turn it into XHTML. As most web authoring tools still don't have XHTML support, HTML will be created by web authors for some time to come. Tidy facilitates the processing of normal HTML with XSLT, enabling authors of such documents to participate in the Semantic Web.
Although there have been several proposals for embedding RDF inside HTML pages, the technique of using XSLT transformations has a much broader appeal. Few people want to learn RDF, and so it presents a barrier to the creation of semantically rich web pages. Using XSLT provides a way for web developers to add semantic information with minimal extra effort. Dan Connolly of the W3C has conducted quite a number of experiments in this area, including HyperRDF, which extracts RDF statements from suitably marked-up XHTML pages.
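To suggest what such an extraction yields, the following sketch serializes the contact data from the earlier example as RDF statements in N-Triples style. It is written in Python rather than XSLT for brevity, and the subject and property URIs are invented for illustration; a tool like HyperRDF would instead use an agreed vocabulary.

```python
def to_ntriples(page_uri, contact_id, fields):
    """Serialize one contact record as N-Triples-style statements.
    The vocabulary namespace below is purely illustrative."""
    subject = f"<{page_uri}#{contact_id}>"
    triples = []
    for prop, value in fields.items():
        predicate = f"<http://example.org/contact-vocab#{prop}>"
        triples.append(f'{subject} {predicate} "{value}" .')
    return triples

fields = {"name": "Edd Dumbill",
          "role": "Managing Editor",
          "organization": "XML.com"}

for line in to_ntriples("http://example.org/staff", "edumbill", fields):
    print(line)
# <http://example.org/staff#edumbill> <http://example.org/contact-vocab#name> "Edd Dumbill" .
# ... and one statement each for "role" and "organization"
```

Each statement asserts a machine-readable fact about the page's subject, which is precisely the raw material the Semantic Web is built from.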