Instant RDF?

August 30, 2000

Leigh Dodds

This week the XML Deviant returns from holiday to find that, although the W3C's Resource Description Framework (RDF) technology appears to be gaining supporters, developers still have many concerns about the complexity of its syntax.

Mining the Web

Complexity has been one criticism which RDF has had difficulty in shaking off. Both the RDF model and its serialization syntax have fallen foul of this issue at various points in RDF's development. Efforts to produce a simpler serialization syntax have led to several alternative proposals, including one from Tim Berners-Lee ("The Strawman Proposal") and one from Sergey Melnik ("Simplified Syntax for RDF"). For non-RDF-aficionados, the serialization syntax is the representation of the RDF data model as XML (although XML is only one possible means of representing this information).

While technical concerns have been raised about specific details of the RDF syntax, the main aim of simplification is to make it easier to generate RDF from existing (and future) XML documents--documents which were not produced with RDF applications in mind. Given the slow adoption of RDF, this seems a useful approach.

While discussion of the finer points of the RDF syntax is no doubt beneficial, for developers seeking to gain some benefit from using RDF this transitional step from XML to RDF is important. An increasingly large amount of XML data, coupled with a vast amount of HTML (suitably tidied for well-formedness), provides a rich data source for bootstrapping RDF applications.

A recent discussion on the W3C RDF Interest mailing list has highlighted some different viewpoints on how this might be achieved.

Simplifying the Syntax

Broadly speaking, there are two viewpoints in this debate. They differ in how much the structure of an XML document should be affected by RDF or, in other words, how much effort needs to be invested up front to allow a document type to be processed as RDF.

The conventional approach assumes that the XML document should contain RDF markup. A parser can then process the document directly, extracting the "triples", which are the core data items in RDF. (See "Abbreviated Syntax" in the RDF Specification.)
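
To make the idea of a triple concrete, here is a minimal Python sketch (not part of the list discussion) that reads a small document written in the basic RDF/XML form and lists its triples. The sample document, the example.org URL, and the function name extract_triples are illustrative assumptions. The sketch handles only rdf:Description elements with literal-valued properties, so it should not be mistaken for a conforming RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample document; the subject URL and values are illustrative.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/article">
    <dc:title>Instant RDF?</dc:title>
    <dc:creator>Leigh Dodds</dc:creator>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Return (subject, predicate, object) tuples.

    Only rdf:Description elements whose child properties hold plain
    text are handled; nested resources and abbreviated forms are ignored.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.iter("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            # ElementTree spells qualified names as "{namespace}local";
            # joining the two parts gives the predicate URI.
            namespace, _, local = prop.tag[1:].partition("}")
            triples.append((subject, namespace + local, (prop.text or "").strip()))
    return triples


if __name__ == "__main__":
    for triple in extract_triples(SAMPLE):
        print(triple)
```

Each tuple pairs the described resource with one property and its value, which is all a triple is. Most of the complexity complained about lies in the many alternative ways the XML syntax allows the same triples to be written.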

The other viewpoint suggests that RDF should not be allowed to impact the structure of XML documents at all. Instead, tools should be provided to generate RDF from these documents prior to their processing by an RDF application. This has been termed "screen scraping" in RDF circles. Aaron Swartz offered a proposal discussing this approach:

What is needed is a way to allow RDF parsers to extract RDF triples from regular XML. This would be an amazing boost for RDF, allowing any existing XML format to be easily used as RDF information.

Swartz suggests that XSLT transformations could be one mechanism suitable for generating the desired RDF. Ora Lassila, co-editor of the RDF specification, welcomed a simpler syntax but urged caution when transforming XML into RDF:

I feel that although a simpler, more intuitive syntax would be a good idea, transforming *any* XML to "something like RDF" is somewhat dangerous unless you make sure that any intended semantics is preserved. Syntactic transformation is only half of the battle...

Swartz's suggestion has parallels with recent work that uses similar "screen scraping" techniques to generate RDF. This includes work by Dan Connolly to generate RDF databases from email archives and to extract Dublin Core metadata from XHTML files. Dan Brickley has also posted research notes describing the use of XSLT in RDF screen scraping.
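
As a rough illustration of what such scraping involves, the following Python sketch pulls Dublin Core metadata out of a well-formed XHTML page and reports it as triples. It is not the XSLT approach used in the work mentioned above; Python simply stands in for the transformation step. The sample page, the page URI, and the function name scrape_dublin_core are assumptions made for the example, and it relies on the common convention of writing Dublin Core elements as meta tags whose names begin with "DC.".

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical XHTML page; it contains no RDF markup at all.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Instant RDF?</title>
    <meta name="DC.title" content="Instant RDF?"/>
    <meta name="DC.creator" content="Leigh Dodds"/>
    <meta name="keywords" content="rdf, xml"/>
  </head>
  <body><p>Article text</p></body>
</html>"""


def scrape_dublin_core(xhtml_text, page_uri):
    """Return (subject, predicate, object) tuples for DC.* meta tags.

    page_uri is supplied by the caller because the page itself does not
    say what resource its metadata describes.
    """
    root = ET.fromstring(xhtml_text)
    triples = []
    for meta in root.iter("{%s}meta" % XHTML_NS):
        name = meta.get("name", "")
        if name.startswith("DC."):
            # "DC.title" becomes the Dublin Core property URI for "title".
            triples.append((page_uri, DC_NS + name[3:], meta.get("content", "")))
    return triples


if __name__ == "__main__":
    for triple in scrape_dublin_core(PAGE, "http://example.org/article"):
        print(triple)
```

The "keywords" tag is skipped because nothing in the page says what it means in RDF terms. The mapping from markup to properties has to be supplied by whoever writes the scraper, which is where the worry about preserving intended semantics comes in.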

Of course, the two approaches are not mutually exclusive. RDF can be scraped from existing documents, while new formats can include RDF markup directly. Existing formats could also be revised to this end. It's likely that a combination of these techniques will yield the best results.

The recent proposal for a revised RSS format takes this approach--the format includes RDF markup, while associated tools allow the generation of RSS from XHTML documents. This gives authors an easy route to producing RSS documents without encumbering them with a new syntax.

Ease of Use

One area of clear agreement is that RDF needs to be user-friendly, as Jonathan Borden noted:

It should be easy for people to add RDF statements into otherwise mundane XML documents in ways that minimally interfere with the chosen document structure.

Charles McCathieNevile agreed, but saw better ways of tackling the problem than changing the syntax:

It should be a trivial matter of making the statements in their favourite authoring tool, or of using a simple point click drag interface to specify arcs and nodes of meaning. Sitting around writing pointy brackets is like telling the poor country astrophysicist to use only a slide rule because it's better - sure, it works, but there are better ways.

Ora Lassila also saw syntax difficulties as a minor issue:

Personally I am not opposed to a new RDF syntax (the current looks a bit like it was "designed by committee" :-). But ultimately the syntax shouldn't matter all that much since I am sure everyone is hoping that most of RDF will be both read and *written* by machines (not humans).

There are echoes here of other RDF debates in which the fabled "killer app" is seen as the most important goal, rather than a quest for simplicity. However, given the slow acceptance of RDF, several developers disagreed with Lassila's viewpoint, believing syntax to be very important at this stage of RDF's development. Greg FitzPatrick observed that simple syntax contributed to the success of HTML:

HTML is also read by machines. But if HTML had been difficult to comprehend and not mnemonic it would not have started a landslide.

Bill de Hora highlighted the use of RDF in the RSS 1.0 proposal, and commented that this is a good opportunity for RDF to increase its profile:

It would be shame to miss out on the opportunity to piggyback RDF on the popularity of RSS feeds, that is, to miss piggybacking on a network amplification effect, on the assumption that there is no pressing need to adapt the syntax because tools will appear to automate serialization of RDF anyway. That's not the case for legions of people using RSS now. So at this point in time, the syntax is perhaps very important: it is after all the concrete expression of the model and is what people will have to manipulate.

I'm less concerned about the precise syntax (once the model is invariant), than about missing a golden opportunity to seed RDF.

While it's too early to say whether RSS will be its proving ground, RDF's supporters are keen to see more adoption. Dan Brickley has suggested to developers that effort should be spent on producing interoperability tests for the increasing range of available RDF parsers:

The XSLT / Semantic Web Screenscraping threads on this list have shown how we can extract RDF models from all manner of well managed XML data. There are a fair number of RDF 1.0 parsers now, and significant effort has gone into creating these. I would rather see our time go on developing interoperability tests for these to get them up to production grade, learning through doing so about any grey areas in the syntax spec.