Building a Semantic Web Site
Bring Metadata Back to RSS
Even though the Semantic Web may still seem a remote dream, there are already tools one can use to take a small step forward by building "semantic web sites," which can be much easier to navigate than ordinary sites.
In this article, I will discuss how RSS 1.0 and its taxonomy module can be used as a central format to carry metadata collected in a classical news format, such as XMLNews-Story, to RDF or relational databases and XML Topic Maps. Readers should have basic familiarity with RSS and RDF, and a little topic maps knowledge would also help.
Overview

I have built XMLfr (http://xmlfr.org), a French site dedicated to and powered by XML, as a showcase for XML technologies and will use it as a real life example throughout this article. XMLfr is a dynamic site, using XML and XSLT, which stores its pages in the XMLNews-Story format.
The site structure is described by a set of RSS 1.0 channels, and the semantic information encoded in the rich XMLNews-Story inline markup is converted into RSS 1.0 taxonomy markup.
These RSS channels may be consolidated in an RDF database allowing ad hoc semantic queries on the global set of articles. They feed RDBMS tables for online, real-time queries that build a dynamic site index and include navigational information in the XHTML pages sent to the site users.
The RSS channels can be transformed into XTM Topic Maps, to be displayed by Topic Maps visualization systems, and be enriched by the statistics extracted from the database in order to propose topic associations.
About RSS
RSS stands for RDF (or Rich) Site Summary.
Netscape introduced RSS 0.9, one of the first RDF vocabularies, as a general site summary vocabulary in order to syndicate headlines on their "My.Netscape" portal. It was rapidly followed by RSS 0.91, which added syndication features but dropped the RDF syntax. Both releases are still widely used for syndication by portals such as Userland, Moreover, and Meerkat, but the vocabulary seemed to have reached a dead end by mid-2000.
After the additions of RSS 0.91, the language had lost its focus: many requests for improvement were made without any structure or selection process to advance them, and these requests were pushing in different directions, with a risk of losing still more focus.
More importantly, there was neither a plan nor a method for adding metadata.
The RSS 1.0 Working Group (Gabe Beged-Dov, Dan Brickley, Rael Dornfest, Ian Davis, Leigh Dodds, Jonathan Eisenzopf, David Galbraith, R.V. Guha, Ken MacLeod, Eric Miller, Aaron Swartz and myself, Eric van der Vlist) was created with the charter of defining an extensible specification, built on a refocused RDF core vocabulary and a mechanism facilitating the construction of specific modules.
The RSS 1.0 specification (http://purl.org/rss/1.0/) was published in December 2000, together with a Dublin Core module and a set of supporting tools. A taxonomy module is under discussion, and the format used by XMLfr is based on the current Working Draft.
From XMLNews-Story to RSS 1.0
XMLfr's RSS 1.0 channels are generated by an XSLT transformation out of three different sources of information:
- An RSS channel template without any items, which references
- the contents of a directory, stored as XML, which in turn points to
- the XMLNews-Story documents.
The XMLNews-Story element described by the path
/nitf/body/body.head contains information that is needed
to describe an RSS item, including Dublin Core (DC) elements such as
dc:creator, dc:date,
dc:description.
The interesting potential of using XMLNews-Story is the possible
use of the inline markup to generate more semantic information than is
simply specified in the header. XMLfr uses three of these elements
that are pertinent to its domain: org,
person, and object.title. Extracting these
elements allows the generation of dc:subject elements in
the RSS item's properties to provide a list of keywords for an
article. Here's a fragment from an article on XMLfr that shows these
elements in use:
<p> Le <a href="http://4suite.org/index.epy">site</a>  <object.title>4Suite</object.title> décrit <object.title>4Suite Server</object.title> comme un "une architecture pour services <object.title>XML</object.title>", n'étant pas destinée à être un serveur d'applications autonome mais plutôt à "coopérer étroitement avec d'autres technologies de serveurs d'applications". </p>
However, the literal keywords cannot be used as unique identifiers
by themselves. (A good example of this is the need to distinguish
between the Apache organization and the Apache web server.) The RSS
1.0 taxonomy module was defined to fix this issue by replacing the
words ordinarily used within dc:subject elements with
unique identifiers (URIs).
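The extraction step described above can be sketched in a few lines of Python. XMLfr itself does this with XSLT, so this is only an illustration of the idea; the function name and the sample paragraph are hypothetical and not taken from the site.

```python
import xml.etree.ElementTree as ET

# Inline elements that XMLfr treats as keywords (named in the article).
KEYWORD_TAGS = ("org", "person", "object.title")


def extract_keywords(story_xml):
    """Return (element name, text) pairs in document order, without duplicates."""
    root = ET.fromstring(story_xml)
    seen = []
    for element in root.iter():
        if element.tag in KEYWORD_TAGS:
            pair = (element.tag, "".join(element.itertext()).strip())
            if pair not in seen:
                seen.append(pair)
    return seen


# Hypothetical, shortened paragraph; not an actual XMLfr article.
SAMPLE = (
    "<p><person>Uche Ogbuji</person> a annoncé "
    "<object.title>4Suite</object.title> et "
    "<object.title>4Suite Server</object.title>, "
    "une mise à jour de <object.title>4Suite</object.title>.</p>"
)

keywords = extract_keywords(SAMPLE)
```

Keeping the element name next to the text is what later allows an "Apache" marked up as org to be told apart from an "Apache" marked up as object.title.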
The topic URIs are simply constructed by concatenation of a base URI, the element name, and the text content of the element. Here's an example of an item description using RSS 1.0 and the DC and taxonomy modules:
<item rdf:about="http://xmlfr.org/actualites/tech/010222-0001">
<title>Mises à jour 4Suite.</title>
<link>http://xmlfr.org/actualites/tech/010222-0001</link>
<dc:description>Uche Ogbuji a annoncé une .../...</dc:description>
<dc:creator>Par Michael Smith, xmlhack - traduit par Eric
van der Vlist, Dyomedea (vdv@dyomedea.com).</dc:creator>
<dc:date>2001-02-22</dc:date>
<dc:subject>4Suite Server, 4Suite, Uche Ogbuji, .../... </dc:subject>
<taxo:topics>
<rdf:Bag>
<rdf:li resource="http://xmlfr.org/index/object.title/4suite+server/"/>
<rdf:li resource="http://xmlfr.org/index/object.title/4suite/"/>
<rdf:li resource="http://xmlfr.org/index/person/uche+ogbuji/"/>
<rdf:li resource="http://xmlfr.org/index/object.title/python/"/>
.../...
</rdf:Bag>
</taxo:topics>
<dc:publisher>XMLfr</dc:publisher>
<dc:type>text</dc:type>
<dc:language>fr</dc:language>
</item>
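The concatenation rule for topic URIs can be sketched as follows. The lowercasing and the "+" for spaces are assumptions inferred from the URIs in the item above (for example 4suite+server); the function name is hypothetical and the real site builds these URIs in XSLT.

```python
from urllib.parse import quote_plus

# Base URI as it appears in the taxo:topics example.
BASE_URI = "http://xmlfr.org/index/"


def topic_uri(element_name, text, base_uri=BASE_URI):
    """Concatenate base URI, element name, and normalized text content."""
    # Assumption: text is lowercased and runs of whitespace are collapsed.
    normalized = " ".join(text.lower().split())
    return f"{base_uri}{element_name}/{quote_plus(normalized)}/"


server_uri = topic_uri("object.title", "4Suite Server")
person_uri = topic_uri("person", "Uche Ogbuji")
```

Because the element name is part of the URI, the same literal word yields different topics depending on its markup: `topic_uri("org", "Apache")` and `topic_uri("object.title", "Apache")` are distinct identifiers.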
This is fun, but do we have a use for such a document? The basic use of RSS channels today is to get the titles of stories on your site displayed by aggregators such as O'Reilly's Meerkat.
And in our case, XMLfr uses these channels internally to display its lists of articles, giving RSS its original meaning of "RDF Site Summary":

Aggregators of RSS information (Meerkat, Moreover, etc.) do not yet use taxonomy information, and a simple RSS channel would be sufficient to get your title displayed. So how can we utilize the extra metadata we extracted from the document body?
RDF Databases
One significant feature of RSS 1.0 is that it is fully compliant with RDF and can be directly loaded into RDF databases such as rdfDB or Squish. These let you query the data using an SQL-like query language and give you full access to the taxonomy information.
These languages are very convenient for walking through the entire set of RDF triples, letting you access all the information that is available by doing joins between related objects. The following example, using rdfDB, shows queries to find all the articles that mention the person Uche Ogbuji and then to display all the topics from articles that mention the person Uche Ogbuji.
load RDF file http://xmlfr.org/actualites/general.rss10 into newrss</>
0
0 </>
select ?item from newrss where
(http://purl.org/rss/1.0/modules/taxonomy/#topics ?item ?bag),
(http://www.w3.org/1999/02/22-rdf-syntax-ns##li
?bag http://xmlfr.org/index/person/uche+ogbuji/)
</>
?item
http://xmlfr.org/actualites/tech/010222-0001
0 </>
select ?topic from newrss where
(http://www.w3.org/1999/02/22-rdf-syntax-ns##li
?bag http://xmlfr.org/index/person/uche+ogbuji/)
(http://www.w3.org/1999/02/22-rdf-syntax-ns##li ?bag ?topic)
</>
?topic
http://xmlfr.org/index/org/fourthought/
.../...
http://xmlfr.org/index/object.title/python/
http://xmlfr.org/index/person/uche+ogbuji/
http://xmlfr.org/index/object.title/4suite/
http://xmlfr.org/index/object.title/4suite+server/
0 </>
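To make the join in these two queries explicit, here is a minimal in-memory sketch in Python. It is not rdfDB and does not use its API: the triples are written as (subject, predicate, object), whereas the session above puts the predicate first, and the short names (taxo:topics, rdf:li, item:, bag:) are hypothetical shorthand for the full URIs.

```python
# Hypothetical triples in (subject, predicate, object) order.
TRIPLES = [
    ("item:010222-0001", "taxo:topics", "bag:1"),
    ("bag:1", "rdf:li", "person/uche+ogbuji"),
    ("bag:1", "rdf:li", "object.title/4suite"),
    ("item:other", "taxo:topics", "bag:2"),
    ("bag:2", "rdf:li", "object.title/python"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Yield the triples that fit the pattern; None acts as a variable."""
    for s, p, o in triples:
        if subject in (None, s) and predicate in (None, p) and obj in (None, o):
            yield (s, p, o)


def items_mentioning(triples, topic):
    """First query: items whose topic bag contains the given topic."""
    bags = {s for s, _, _ in match(triples, predicate="rdf:li", obj=topic)}
    return sorted(s for s, _, o in match(triples, predicate="taxo:topics") if o in bags)


def topics_alongside(triples, topic):
    """Second query: every topic listed in a bag that contains the given topic."""
    bags = {s for s, _, _ in match(triples, predicate="rdf:li", obj=topic)}
    return sorted({o for s, _, o in match(triples, predicate="rdf:li") if s in bags})


items = items_mentioning(TRIPLES, "person/uche+ogbuji")
topics = topics_alongside(TRIPLES, "person/uche+ogbuji")
```

In both functions the shared variable is the bag: the first pattern binds it, and the second pattern is filtered on it, which is what the repeated ?bag does in the rdfDB queries.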
XMLfr has been running for several months with rdfDB as the backend storage for its dynamic index system, accessed through JrdfDB, a Java interface developed for this purpose that connects rdfDB to the XSLT processor XT.
Although rdfDB has been fast and reasonably stable, it lacks several capabilities that this application badly needs in order to scale and to grow. These include
- Sorting (to retrieve articles by date)
- Retrieving unique rows (to remove duplicate results)
- Setting a maximum number of rows (to paginate the results)
- Grouping
- Aggregates (to count a number of relevant matches)
- Administration (load/unload)
RDBMS
So if rdfDB won't perform, what next? A fully fledged RDF database is not strictly needed just to keep track of the relations between topics and pages (or "occurrences," to follow the vocabulary of Topic Maps), and a traditional RDBMS with a straightforward table design has all the qualities required to be used as online storage for this purpose.
XMLfr has migrated its dynamic index to a couple of PostgreSQL tables:
test=> \d topics
Table    = topics
+--------------------------------+----------------------------------+-------+
| Field                          | Type                             | Length|
+--------------------------------+----------------------------------+-------+
| channel                        | varchar()                        |   255 |
| item                           | varchar()                        |   255 |
| topic                          | varchar()                        |   255 |
+--------------------------------+----------------------------------+-------+
test=> \d items
Table    = items
+--------------------------------+----------------------------------+-------+
| Field                          | Type                             | Length|
+--------------------------------+----------------------------------+-------+
| item                           | varchar()                        |   255 |
| dcdate                         | date                             |     4 |
| title                          | varchar()                        |   255 |
| description                    | varchar()                        |   255 |
+--------------------------------+----------------------------------+-------+
These tables are loaded with data from text dumps, which are generated by two simple XSLT transformations run against the RSS 1.0 channels. The dynamic index system on XMLfr is reached through a table of keywords displayed with the articles:

These keywords are linked to pages from the dynamic index, displaying lists of articles found in the database:

These results are very similar to those we might obtain from the creation of a Topic Map.
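A page of the dynamic index amounts to one query over the two tables. The sketch below is an assumption about what such a query might look like, not the query XMLfr runs: it uses Python's sqlite3 module as a stand-in for PostgreSQL, and the rows, URLs, and function name are hypothetical. It relies on exactly the features listed as missing from rdfDB: sorting, unique rows, and a maximum number of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topics (channel VARCHAR(255), item VARCHAR(255), topic VARCHAR(255));
    CREATE TABLE items (item VARCHAR(255), dcdate DATE, title VARCHAR(255),
                        description VARCHAR(255));
""")
# Hypothetical rows; story A appears in two channels.
conn.executemany("INSERT INTO topics VALUES (?, ?, ?)", [
    ("general", "http://example.org/a", "person/x"),
    ("tech", "http://example.org/a", "person/x"),
    ("tech", "http://example.org/b", "person/x"),
    ("tech", "http://example.org/c", "org/y"),
])
conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", [
    ("http://example.org/a", "2001-02-22", "Story A", ""),
    ("http://example.org/b", "2001-03-01", "Story B", ""),
    ("http://example.org/c", "2001-03-05", "Story C", ""),
])


def articles_for_topic(conn, topic, page=0, page_size=10):
    """Return one page of (item, date, title) rows for a topic, newest first."""
    query = """
        SELECT DISTINCT i.item, i.dcdate, i.title
        FROM topics t JOIN items i ON i.item = t.item
        WHERE t.topic = ?
        ORDER BY i.dcdate DESC
        LIMIT ? OFFSET ?
    """
    return conn.execute(query, (topic, page_size, page * page_size)).fetchall()


first_page = articles_for_topic(conn, "person/x")
```

DISTINCT removes the duplicate that arises when the same story is listed in more than one channel, and LIMIT with OFFSET gives the pagination.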
Topic Maps
A Topic Map describes, within an XML document, a set of topics, their interrelations, and their occurrences.
An RSS 1.0 channel with taxonomy data happens to have all the information needed to generate an XTM 1.0 Topic Map, as the following example Topic Map fragment shows.
<topic id="person-uche+ogbuji">
<instanceOf>
<topicRef xlink:href="#person"/>
</instanceOf>
<baseName>
<baseNameString>uche ogbuji (person)</baseNameString>
</baseName>
<occurrence id="person-uche+ogbuji-1">
<instanceOf>
<topicRef xlink:href="#story"/>
</instanceOf>
<resourceRef xlink:href="http://xmlfr.org/actualites/tech/010222-0001"/>
</occurrence>
</topic>
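The mapping from a taxonomy topic URI to a topic element like the one above can be sketched in Python. XMLfr produces its Topic Map with XSLT, so this is only an illustration; the function name is hypothetical, and the id and base name conventions are inferred from the fragment rather than documented anywhere.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote_plus

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)
HREF = f"{{{XLINK}}}href"
BASE_URI = "http://xmlfr.org/index/"


def topic_element(topic_uri, occurrences):
    """Build a topic element from a topic URI and the URLs of its stories."""
    # "http://xmlfr.org/index/person/uche+ogbuji/" -> ("person", "uche+ogbuji")
    kind, slug = topic_uri[len(BASE_URI):].strip("/").split("/", 1)
    topic_id = f"{kind}-{slug}"
    topic = ET.Element("topic", id=topic_id)
    instance = ET.SubElement(topic, "instanceOf")
    ET.SubElement(instance, "topicRef", {HREF: f"#{kind}"})
    base_name = ET.SubElement(topic, "baseName")
    name = ET.SubElement(base_name, "baseNameString")
    name.text = f"{unquote_plus(slug)} ({kind})"
    for number, url in enumerate(occurrences, start=1):
        occurrence = ET.SubElement(topic, "occurrence", id=f"{topic_id}-{number}")
        occurrence_type = ET.SubElement(occurrence, "instanceOf")
        ET.SubElement(occurrence_type, "topicRef", {HREF: "#story"})
        ET.SubElement(occurrence, "resourceRef", {HREF: url})
    return topic


topic = topic_element(
    "http://xmlfr.org/index/person/uche+ogbuji/",
    ["http://xmlfr.org/actualites/tech/010222-0001"],
)
```

Calling `ET.tostring(topic, encoding="unicode")` serializes the element, with the xlink namespace declared on it.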
As a proof of concept, this Topic Map has been loaded into the empolis K42 Knowledge Server, a screenshot of which is shown below.

Although the screenshot looks a bit different from the list of articles on the web site, the difference is only one of presentation, which shows that the actual syntax doesn't matter -- the dynamic index is essentially a Topic Map. With some effort, and the features of an RDBMS, we can also go further by generating more information that describes how the topics across the site are related.
Aerial Photographs
A Topic Map of XMLfr maps the site content and gives the same picture -- in a different syntax -- as the dynamic index system available online.
This picture is directly derived from the markup used in the
articles published on the site, and adding a new keyword in a story
marked up as org, object.title, or
person is sufficient to create a new topic.
The obvious thing missing from this Topic Map is the topic associations, that is, the relationships between the topics in the map.
However, if we do not a priori know the nature of the topic associations, we may guess at their existence by looking at the most common associations found in the articles. This feature can easily be achieved using SQL grouping and aggregates; and it's been implemented on XMLfr through a very simple algorithm: for each topic, the list of the 15 other topics most often found associated with the current topic is displayed:

The accuracy of this technique in discovering related topics is surprising. As an example, Tim Berners-Lee is associated with XML, W3C, RDF, SVG, URI, W3C, XLink, DOM, HTML, HTTP, Java, SGML, Semantic Web, XPath and ISO, which is a fairly good description for such a simple algorithm.
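The grouping and aggregate query behind this algorithm might look like the sketch below. It is a guess at the shape of the query, not XMLfr's actual SQL: it uses Python's sqlite3 module in place of PostgreSQL, and the rows, topic names, and function name are hypothetical. Only the topics table layout is taken from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topics (channel VARCHAR(255), item VARCHAR(255), topic VARCHAR(255))"
)
# Hypothetical data: three stories, all mentioning person/a.
conn.executemany("INSERT INTO topics VALUES (?, ?, ?)", [
    ("tech", "story-1", "person/a"), ("tech", "story-1", "object.title/xml"),
    ("tech", "story-1", "org/w3c"),
    ("tech", "story-2", "person/a"), ("tech", "story-2", "object.title/xml"),
    ("tech", "story-2", "org/w3c"),
    ("tech", "story-3", "person/a"), ("tech", "story-3", "object.title/xml"),
    ("tech", "story-3", "object.title/rdf"),
])


def related_topics(conn, topic, limit=15):
    """Return the topics that most often share an article with the given topic."""
    query = """
        SELECT b.topic, COUNT(DISTINCT b.item) AS shared
        FROM topics a JOIN topics b ON a.item = b.item
        WHERE a.topic = ? AND b.topic <> ?
        GROUP BY b.topic
        ORDER BY shared DESC, b.topic
        LIMIT ?
    """
    return conn.execute(query, (topic, topic, limit)).fetchall()


related = related_topics(conn, "person/a")
```

The self-join pairs every topic of an article with every other topic of the same article; GROUP BY and COUNT then rank the pairs, and LIMIT keeps the top 15.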
This same algorithm couldn't, unfortunately, be used directly on current RDF databases, as they are missing aggregates and grouping.
The algorithm can be used to generate associations in our XTM Topic Map, as shown by the fragment below, which expresses a relationship between the person Uche Ogbuji and the object 4Suite.
<association id="assoc-person-uche+ogbuji-2">
<instanceOf>
<topicRef xlink:href="#related"/>
</instanceOf>
<member>
<roleSpec>
<topicRef xlink:href="#from"/>
</roleSpec>
<topicRef xlink:href="#person-uche+ogbuji"/>
</member>
<member>
<roleSpec>
<topicRef xlink:href="#to"/>
</roleSpec>
<topicRef xlink:href="#object.title-4suite"/>
</member>
</association>
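Generating such an association from a pair of related topics is mechanical. The Python sketch below mirrors the structure of the fragment above; the function name and the way the trailing number is chosen are assumptions, since the article does not say how the ids are numbered.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)
HREF = f"{{{XLINK}}}href"


def association_element(from_topic_id, to_topic_id, number):
    """Build an anonymous related/from/to association between two topics."""
    association = ET.Element("association", id=f"assoc-{from_topic_id}-{number}")
    instance = ET.SubElement(association, "instanceOf")
    ET.SubElement(instance, "topicRef", {HREF: "#related"})
    for role, topic_id in (("from", from_topic_id), ("to", to_topic_id)):
        member = ET.SubElement(association, "member")
        role_spec = ET.SubElement(member, "roleSpec")
        ET.SubElement(role_spec, "topicRef", {HREF: f"#{role}"})
        ET.SubElement(member, "topicRef", {HREF: f"#{topic_id}"})
    return association


association = association_element("person-uche+ogbuji", "object.title-4suite", 2)
members = association.findall("member")
```

A human editor could later replace the generic #related, #from, and #to references with topics that name the real nature of the relationship.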
These associations form patterns over the collection of articles, similar to the curves that can be seen on an aerial photograph: a human is needed to say if it's a road or a river and thus turn the photograph into a map. But I think the patterns should be usable as a first step for finding topic associations.
The associations have been created, in this Topic Map, as almost anonymous (related/from/to) and could be manually updated to transform the Topic Aerial Photograph into a Topic Map.
Tout se Transforme
Antoine-Laurent de Lavoisier, a French chemist, once said, "Rien ne se perd, rien ne se crée, tout se transforme," or "nothing is lost, nothing is created, everything is transformed"; and French people believe that this sentence is the foundation of modern chemistry. The real enabler for the work described in this article is of course XSLT, by which "everything is transformed".
Although, unlike Lavoisier's discovery, an XSLT transformation does allow the loss of content (this is sometimes referred to as a "semantic firewall"), it does not create anything, so this result wouldn't have been possible if the source documents hadn't been carefully tagged.
This clearly shows that even if new technologies are now available to manipulate semantic information, this information needs to be available in the original documents, manually added afterward, or automatically extracted -- this is one of the challenges of the Semantic Web.
Credits
Many thanks to
- The RSS 1.0 Working Group.
- Bénédicte Le Grand for a presentation at XML 2000 (Conceptual Exploration of Topic Maps) that gave me some hints on how to calculate the distance between topics.
- Empolis (previously known as Step UK) for kindly lending me a license for their k42 Topic Map Engine and patiently answering my questions.
- The authors of the many open source software products used to run XMLfr (Linux, Apache, Jserv, XT, PostgreSQL, etc.).
References
- XMLfr: http://xmlfr.org/
- XMLNews-Story: http://www.xmlnews.org/docs/xmlnews-story.html
- RSS 1.0: http://purl.org/rss/1.0/
- RDF: http://www.w3.org/TR/REC-rdf-syntax/
- rdfDB: http://web1.guha.com/rdfdb/
- JrdfDB: http://4xt.org/downloads/JrdfDB/
- Squish: http://swordfish.rdfweb.org/rdfquery/
- XTM 1.0: http://www.topicmaps.org/xtm/1.0/
- Empolis k42: http://www.empolis.co.uk/products/prod_k42.asp
- Le Grand, Bénédicte - Conceptual Exploration of Topic Maps (XML 2000 conference presentation)
Comment on this Article
- Isn't a taxonomy a hierarchy?
2004-12-27 06:35:27 janegil
The code sample gives both http://xmlfr.org/index/object.title/4suite/ and http://xmlfr.org/index/object.title/4suite+server/ as taxo:topics.
Shouldn't there be an external definition of the taxonomy to tell us that any page about 4suite+server/ is also about 4suite/?
Without the subset/superset relation between topics, this isn't a taxonomy at all, just a collection of unrelated topic IDs.
- RSS For document flows?
2001-08-06 13:42:52 Jonas Bosson
Does anyone know if there are any proposed schema/metadata standards for document flows (i.e., issue="phone error", status="open", flow="support" ..)?
We try to make our information as structured as possible
to make the XML or HTML document more than SQL...
Best regards,
Jonas Bosson
Btw: Our website at www.illuminet.se uses RSS to represent search results, news and navigation on every page, and documents are edited using ms-word over the net and converted to XML as a result of moddav-events. ;-)
RSS is really a great representation for search and navigation - the source is metadata/properties and contextual information through our search-engine.
- RDF Semantic Web Page
2001-07-30 05:06:38 Craig Stone
I am a student researcher looking at developing a browser based around RDF. Having found many articles on RDF and XML etc I feel this article
gives an excellent insight into building Semantic web pages.
Any further information would be greatly received
Regards
CS
- An Approach to the Semantic Web
2001-05-09 17:14:42 Scott Tsao
This is the first article I have read that gives fair and equal treatment to RDF and XTM. (Are there any others that I missed?) As a potential implementor and user of those SW technologies, I am not as interested in exactly HOW to use those technologies, as in WHAT business scenarios the author had in mind (examples in French did not help much ;-). Therefore, I would hope the author could provide a sequel to this article outlining the scenarios (or use cases) one could follow when accessing his Semantic Web Site.
Assuming we do have these scenarios (or use cases), one could then ask this question: WHICH technology can best solve the problem at hand (perhaps by stepping through the use cases)? The impression I got while reading the article was ... it all depends ... one could use either RDF or XTM in most of the cases. For RDF, the benefit may come from the fact that RSS is RDF-compliant, and there are existing tools available for queries. For XTM, the benefit may come from the easiness for semantic discovery and enrichment (e.g., discovering topics and adding associations). And, as we expand on the scenarios, we could probably find areas that neither RDF nor XTM fit the bill. Then perhaps we would resort to some other technologies (e.g., DAML+OIL) to get to where we want to be.
What I am proposing here is to look at those "SW technologies" through a different lens, not from the technologist's perspective, but from the end-user's perspective. I think the myriad SW-related articles flooding our screens today are mostly from the former, but very few from the latter. I would encourage people to continue writing SW-related articles but preferably following this outline:
1. What is the business problem?
2. What is the proposed business solution? (scenarios and use cases)
3. What are the potential technical approaches?
4. Which technical solution works best and why?
5. How can the proposed technical solution be developed and implemented?
6. What are the available standards and tools (if any)?
Any thoughts?
Regards,
Scott Tsao
The Boeing Company
- An Approach to the Semantic Web
2001-05-10 01:54:51 Eric van der Vlist
I will try to give this sequel keeping a "fair and equal treatment to RDF and XTM"...
As a preamble, I would say that the availability and quality of the meta information is more important than the serialization format (RDF, XTM or any other).
I see RDF and XTM as belonging to different levels, though and believe they should be more complementary than competing.
RDF is a very generic syntax to express facts as triples while XTM is an application describing "Topic Maps", i.e. the relations between topics and between topics and resources.
I believe that XTM could have used a RDF syntax, however since it is not the case with XTM 1.0 we have to make a choice and, I think that it depends on the application you want to build and the tools you want to use.
If your application is all about describing topics and relations between topics and resources you might want to use XTM and the tools that are available to build Topic Maps.
On the other hand, if you want to consolidate information between applications and, for example, link your site summary with annotations and conformance tests, the generic RDF model should be much easier to use since triples from different sources do merge automatically when you load them.
Developing new applications with XTM is of course possible (many papers have been published for instance to show how Topic Maps may be used to represent knowledge bases) but requires to put on "topic maps lenses" and to consider everything as Topic Map objects (i.e. topics, associations or occurrences) and that's not always very natural.
The border line I would personally draw is then very simple: if you need a Topic Map, then go for XTM, but if you want something more extensible, consider using RDF. And keep in mind that if you've taken care to include enough information, you will always be able to transform RDF into XTM or XTM into RDF.
- An Approach to the Semantic Web
2001-05-12 13:31:53 Scott Tsao
Eric wrote:
> I see RDF and XTM as belonging to different levels, though and
> believe they should be more complementary than competing.
I agree wholeheartedly with you here, and my attempt was trying to
find out WHERE they could be more complementary.
> I believe that XTM could have used a RDF syntax, however since it
> is not the case with XTM 1.0 we have to make a choice and, I think
> that it depends on the application you want to build and the tools
> you want to use.
I am not sure about this. I have also heard suggestions that RDF
model should be serialized in terms of XLink (which XTM is based
on). My guess is that this might be a tool issue, i.e., whether
tools are readily (and freely) available to process the serialized
data stream.
> If your application is all about describing topics and relations
> between topics and resources you might want to use XTM and the
> tools that are available to build Topic Maps.
Agreed. The nice and clean separation between the topic layer and
resource layer is a "user-friendly mental model" that helps me to
visualize in my mind how I would want to semantically organize my
myriad resources.
> On the other hand, if you want to consolidate information between
> applications and, for example, link your site summary with
> annotations and conformance tests, the generic RDF model should be
> much easier to use since triples from different sources do merge
> automatically when you load them.
Since I am not familiar with the details of RDF, I might be ignorant
here. What do you mean "link your ..."? Is this the same as the
XLink model (I thought RDF does not use XLink)? Also, you mentioned
in various places the strength of RDF's "automatic and implicit
merge" feature. Can you give a simple example of this? How would
you compare it with the XTM merge feature (I believe it is part of
the XTM Processing Model)?
> Developing new applications with XTM is of course possible (many
> papers have been published for instance to show how Topic Maps may
> be used to represent knowledge bases) but requires to put on "topic
> maps lenses" and to consider everything as Topic Map objects (i.e.
> topics, associations or occurrences) and that's not always very
> natural.
As a matter of fact (as I stated earlier) as a user I prefer to put
on the "topic maps lenses" (feels very natural to me). I can name a
couple of applications that this type of lenses fit naturally:
- controlled vocabularies (e.g., thesauri)
- metadata registry (and repository)
- Bible studies (as pointedly elaborated by Patrick Durusau, see
http://groups.yahoo.com/group/xtm-wg/message/2317)
> The border line I would personally draw is then very simple: if you
> need a Topic Map, then go for XTM, but if you want something more
> extensible, consider using RDF. And keep in mind that if you've
> taken care to include enough information, you will always be able
> to transform RDF into XTM or XTM into RDF.
As an implementor, I would hope that I will not have to pay the
penalty for this transformation. Also, I don't understand what you
mean by "more extensible" if I use RDF. Is it because of the fact that
more tools are available (especially those advocated by the W3C)?
From the semantic enrichment standpoint, I think XTM is more
extensible. We are probably talking about "extensibility" at two
different levels, which we both agreed from the start.
Thanks,
Scott Tsao
The Boeing Company
- Tower of Babel
2001-05-04 16:07:56 Tom Germond
Well, it seemed like it would be a great article, but it turns out I have to be bilingual to make any real use of it.
This isn't by any chance a sign of things to come as I get into XML, is it?
