XML Europe 2004: Refactoring XML
May 5, 2004
The European XML conference (XML Europe) took place last month in Amsterdam. One of the presentations was titled "refactoring XML" and, without going that far, one of the main recurring themes was certainly refactoring the uses of XML.
Six years after the publication of XML 1.0, many people wonder if there is still room for XML conferences and if it isn't time to devote ourselves to XML applications. XML Europe 2004 proved them wrong and has shown that it's time to use the experience we have gained to optimize the use of XML. This will require important simplifications, a good deal of refactoring, and many more XML conferences to share our progress.
Web Services are Still Web Services
Jeff Barr, Web Services evangelist at Amazon.com, opened the conference by presenting Amazon.com's strategy and achievements in the domain of web services. In a very pragmatic way, Amazon.com is fulfilling the vision of the "extended company" we were trying to sell when I was working at Sybase, 10 years ago.
This is also the dominant vision in web services marketing presentations and you all know the picture: in an increasingly competitive and specialized world, organizations have to open their IT systems to their partners and customers so that they can use them and ultimately contribute to the benefits of the "extended enterprise".
To fulfill this vision, we might have thought that the full stack of web services standards (all those "ws-*" documents) was required, but Jeff Barr's presentation proved us wrong. Although Amazon.com's web services have a very wide scope, covering much more than the catalog and including selling and buying, the interfaces have been designed for simplicity.
These services are available either as SOAP or REST (that is, XML over HTTP). Much simpler, REST web services can be tested using a web browser. They account for 80% of the actual requests. In my view this is confirmation of the continuity between the Web and web services. Before anything else, web services are, as their name indicates, services accessible on the Web. They belong to the Web, and that's what makes them so interesting.
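To make the "XML over HTTP" idea concrete, here is a minimal sketch. The endpoint, the parameter names and the response format are all hypothetical (they are not Amazon.com's actual interface); the point is only that the whole request fits in a URL a browser could fetch, and that the answer is a plain XML document.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameter names, chosen only for illustration.
BASE_URL = "http://example.org/catalog"


def build_request_url(keyword: str, page: int = 1) -> str:
    """Encode the whole request in a URL that a browser could fetch."""
    return BASE_URL + "?" + urlencode({"keyword": keyword, "page": page})


def titles(response_xml: str) -> list:
    """Extract item titles from a (made-up) XML response."""
    root = ET.fromstring(response_xml)
    return [item.findtext("title") for item in root.iter("item")]


# A canned response stands in for the network call.
SAMPLE_RESPONSE = """<results>
  <item><title>XML in a Nutshell</title></item>
  <item><title>RELAX NG</title></item>
</results>"""

url = build_request_url("xml schema")
found = titles(SAMPLE_RESPONSE)
```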
The Importance of Notation
Steven Pemberton followed with a second plenary that turned out to be a brilliant digression -- almost impossible to sum up -- on the impact of tools and presentation on the content. Did you know that texts written on paper with a pen are typically shorter and of better quality than those written using a word processor?
This phenomenon doesn't spare computer scientists. The notations used to write documents have a major influence on the quality and the content of these documents. Those notations can't and shouldn't be totally hidden by tools from the users' view: as they are not all using the same tools it would be a design flaw to rely on the tools to hide the flaws of a notation.
Pemberton stressed that we should use the power of our computers to make our lives easier; that we should have no hesitation to use notations that are simple and readable even if not XML. It isn't an "X" in its name that makes a notation good. After all, parsing is easy, and there is no reason to reject simple text formats easily transformable to XML.
Isn't that the reason for the success of the compact syntax of RELAX NG and the many different Wiki formats?
If You Like SVG, You'll Love SVG 1.2
The feature that has most impressed me is definitely "sXBL", which enables the display of arbitrary XML documents as SVG through a transformation. The transformation language defined for that purpose is easier than XSLT and bi-directional so that modifications applied to the resulting SVG are brought back into the original XML.
Have a REST
We return to web services with Paul Prescod's presentation, which compared two REST implementations, Amazon.com and Atom.
As expected from a defender of the REST architectural style, Prescod's presentation started with a moving speech in favor of REST: "the document is what matters"; "we need resource oriented architecture rather than SOA [service oriented architectures]"; "XML is the solution to the problem, not the problem"; "the emphasis should be on resources" and "there should be a seamless web of information resources".
The comparison between the two implementations focused on the impact of their contexts on a touchy choice: the choice of the identifiers. As mentioned earlier, the purpose of Amazon.com's web services is to open its IT system to partners and customers and that vision stays very egocentric. Amazon.com is the center of the extended enterprise.
In this context, the identifiers used by those web services are the identifiers used in Amazon's databases, and the products are identified by centrally assigned "ASIN" numbers. On the other hand, Atom is a fully decentralized organization and uses URIs which are by definition "universal" and do not require any centralized administration.
Although the difference may seem to be minimal, that choice means that Amazon.com's web services couldn't easily scale to become distributed between different organizations while Atom is natively designed to do so.
Saxon's Internals Revealed
Michael Kay described the XSLT and XPath optimizations performed internally by SAXON. He claims that, despite what we may have thought, the optimization techniques used for XSLT 1.0 (which doesn't have any schema related information) and XSLT 2.0 (which may rely on the schema of the source documents) aren't fundamentally different, even if some of them can be more effective with XSLT 2.0.
The reason why the benefit of schemas for optimization purposes is so limited is that the source documents and the XSLT transformations by themselves already provide most of the information needed to perform the optimizations. Thinking about it, it occurs to me that it should have been obvious since my Examplotron has shown that instance documents can be considered schemas. Similarly, Kay explained that type information can be inferred from XSLT 1.0 transformations. For instance, if a parameter is initialized as "1" and later used to feed another parameter after being incremented, it is easy to guess that this parameter will always be an integer.
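The parameter example can be sketched as a tiny inference rule. This is only an illustration of the reasoning, not a description of how SAXON works: expressions are written in Python syntax as a stand-in for XPath, and the function names are invented for this sketch.

```python
import ast

# Operators that map integers to integers.
INTEGER_PRESERVING = (ast.Add, ast.Sub, ast.Mult)


def is_integer_expr(expr: str, integer_params: set) -> bool:
    """Return True if expr can only ever evaluate to an integer.

    Parameters are bare names, e.g. "n + 1".
    """
    def check(node) -> bool:
        if isinstance(node, ast.Constant):
            # bool is a subclass of int in Python, so exclude it explicitly.
            return isinstance(node.value, int) and not isinstance(node.value, bool)
        if isinstance(node, ast.Name):
            return node.id in integer_params
        if isinstance(node, ast.BinOp):
            return (isinstance(node.op, INTEGER_PRESERVING)
                    and check(node.left) and check(node.right))
        return False  # anything else: be conservative

    return check(ast.parse(expr, mode="eval").body)


def infer_integer_param(name: str, initial: str, updates: list) -> bool:
    """A parameter is an integer if its initial value is one and every
    expression later fed back into it preserves integers."""
    if not is_integer_expr(initial, set()):
        return False
    return all(is_integer_expr(update, {name}) for update in updates)
```

With these definitions, a parameter initialized as "1" and fed back as "n + 1" is inferred to be an integer, while "n / 2" defeats the inference.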
If you needed a reason for using XSLT 2.0, you'll have to look for another one.
The PSVI Exposed
The Post Schema Validation Infoset (PSVI) is the set of information gathered at validation time. This set is sparsely described in the W3C XML Schema recommendations and the PSVI remains for most of us a very abstract concept.
Elena Litani described an API defined by the Xerces project to expose the PSVI as well as the schema components through DOM or SAX. Published as a W3C Note, this API is currently implemented by the Java and C versions of Xerces and gives read access to the schema components and PSVI information items associated with elements or attributes from instance documents.
Formal Logic to Rescue
One of the problems with W3C XML Schema is that, unlike RELAX NG, it doesn't rest on a mathematical model that would have provided a coherent and ambiguity free formalism. Henry Thompson proposes to define logics to describe both the relations between schema components and the relation between schema components and instance documents. These logics consist of a syntax (or "sentential form"), a model, and an interpretation that relates the syntax to the model.
Although presented differently, I found this approach similar to what is presented in the specification of RELAX NG. The approach has been very beneficial to RELAX NG and is definitely a path worth following.
Beside the fact that the models are different -- which is to be expected, since they describe languages with different semantics -- Thompson insists on the need to give formal definitions of the three components of these logics, while RELAX NG has kept the third component (interpretation) as plain English in its specification. In that respect, I think that Thompson's proposal can be seen as a generalization of the RELAX NG approach and that it should be applicable to other languages.
Thompson lists many potential benefits and applications of this approach: a formal description is a normative reference that doesn't tolerate the ambiguities found in natural language prose. It should be possible to generate readable specifications from this formal description, and tools able to process these grammars should be able to check their coherence and provide reference implementations which, though probably very slow, may be able to distinguish between different human interpretations of the logic.
RELAX NG and XSL-FO
The XSL-FO recommendation doesn't provide any schema or DTD to describe its vocabulary, so Alexander Peshkov proposes a RELAX NG schema for XSL-FO. Comparing the usage of several schema languages (including XSLT) to validate XSL-FO documents, he concludes that RELAX NG is far superior to the other alternatives and even suggests using NRL (also known as DSDL Part 4) to validate the XSLT transformations that produce XSL-FO documents.
The main limitation encountered with RELAX NG during the development of this very complex schema is the inability to classify errors between severe errors and warnings. Peshkov wants to work around this limitation by performing two validations: one against a lax schema that will report only severe errors and another one against a strict schema that will also report warnings.
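The two-pass workaround can be sketched as follows. The two "schemas" here are hypothetical Python functions standing in for real RELAX NG validations, and their rules are made up for the example; only the classification logic matters.

```python
from typing import Callable, List

# A validator takes a document and returns a list of error messages.
Validator = Callable[[dict], List[str]]


def classify(document: dict, lax: Validator, strict: Validator):
    """Run both validations and split the results into errors and warnings.

    Anything the lax schema rejects is severe; anything only the strict
    schema rejects is a warning.
    """
    errors = lax(document)
    warnings = [msg for msg in strict(document) if msg not in errors]
    return errors, warnings


# Hypothetical stand-ins for a lax and a strict schema (illustrative rules).
def lax_schema(doc: dict) -> List[str]:
    return [] if "root" in doc else ["missing root element"]


def strict_schema(doc: dict) -> List[str]:
    problems = lax_schema(doc)
    if doc.get("extension"):
        problems.append("uses an extension attribute")
    return problems
```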
Interestingly, this is a feature that I am using in the editorial system of XMLfr and that I propose to include in DSDL Validation Management.
Ontology Driven Topic Maps
After a reminder of what Topic Maps and ontologies are, how they overlap but are also complementary, Bernard Vatant described a proposal for using OWL/RDF to define constraints on Topic Maps.
The idea isn't really new since using OWL (or its predecessor DAML+OIL) on Topic Maps had already been proposed by Nikita Ogievetsky at Extreme Markup Languages 2001 and by Eric Freese at XML 2002. These previous proposals were based on an explicit translation of Topic Maps in RDF and, although this translation seems simple enough that I proposed a first draft in early 2001, the RDF and Topic Map communities have not reached a consensus on this point.
To avoid this most controversial point Vatant said that his proposal isn't yet another RDF serialization of Topic Maps, and he has cautiously chosen to rely directly on URIs that seem to be the common denominator between the two communities. He proposes to use OWL/RDF to define constraints on Topic Maps topics, associations, roles and other "knowledge objects" manipulated by a Topic Map and identified by their URIs without attempting to explicitly model what these knowledge objects are in RDF.
This is enough to check if a Topic Map "commits" to an ontology, i.e. if all its classes, associations, roles and occurrence types are defined in the ontology and the assertions made in the Topic Map are consistent with the ontology. As minimal as it may appear to be, this proposal allows one to use all the expressive power of OWL and RDF Schema to express constraints on Topic Maps, and this goes well beyond the features envisioned for TMCL, the language that's currently being specified to express constraints on Topic Maps.
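The first half of that "commitment" check (every type used in the Topic Map is defined in the ontology) reduces to a set comparison on URIs. The sketch below covers only that half; checking that the assertions are consistent with the ontology would need an OWL reasoner and is not attempted here. The function names are invented for this illustration.

```python
def undefined_types(topic_map_types: set, ontology_terms: set) -> set:
    """Return the URIs used as types in the Topic Map but absent from the ontology."""
    return topic_map_types - ontology_terms


def uses_only_defined_types(topic_map_types: set, ontology_terms: set) -> bool:
    """True when every class, association, role or occurrence type URI is defined."""
    return not undefined_types(topic_map_types, ontology_terms)
```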
Test Driven XML Systems
The complexity of XML systems is often aggravated by the weakness of associated test tools; the interdependency between the various components and resources involved (instance documents, schemas, transformations, programs, etc.) makes their evolution very perilous. Brandon Jockman proposed to improve all that through a better practice of test suites, especially where schemas and transformations are involved.
To validate schemas, he suggests using sample documents. That's nothing less than inverting the role of schemas and instance documents: in this new scenario, the instance documents validate schemas. That's an idea with which I have been playing for a while and although I like it, I think that it can also add to the burden of maintaining consistent systems as we'd need to migrate both the schema and the sample documents. If we work with sample documents which, again, is the basic idea behind Examplotron, why not generate the schema from the samples instead of checking that it matches the samples?
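Both directions can be illustrated with a deliberately naive "schema": a map from each element name to the set of child element names it may contain. This is neither Examplotron nor Jockman's tooling, just a sketch of deriving a schema from samples and of using samples as a test suite for a schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def infer_schema(samples):
    """Derive {element name: set of child element names} from sample documents."""
    schema = defaultdict(set)
    for xml_text in samples:
        for element in ET.fromstring(xml_text).iter():
            # Leaf elements still get an (empty) entry.
            schema[element.tag].update(child.tag for child in element)
    return dict(schema)


def schema_accepts(schema, xml_text):
    """Check that every element and every parent/child pair is allowed."""
    for element in ET.fromstring(xml_text).iter():
        if element.tag not in schema:
            return False
        if any(child.tag not in schema[element.tag] for child in element):
            return False
    return True
```

A test suite in this style asserts that the schema accepts every sample and rejects known counter-examples, so a schema change that breaks a sample is caught immediately.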
Steve Cayzer proposed a more semantic approach to blogging. If adequate user interfaces encouraged weblog authors to add more metadata to their blogs, semantic blogs would be created in which it would be much easier to search and navigate.
Crawling the Semantic Web
Search engine bots or crawlers play a fundamental role in the Web, and the Semantic Web needs crawlers too. Matt Biddulph described his Semantic Web crawler that relies on the rdfs:seeAlso term, which is to the Semantic Web what hyperlinks are to the Web.
Beyond lessons drawn from his implementation, tests and advice to authors of RDF documents to facilitate the job of Semantic Web crawlers, Biddulph elaborated on points that are specific to the Semantic Web: the possibility of storing the result of the crawling in distributed databases, the need to store not only the RDF triples but also information on the sources, the necessity of managing the level of trust associated with the resources, and the possibility of using OWL to facilitate the integration of assertions using different identifiers for the same subjects (which is the common case with FOAF).
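The core loop of such a crawler can be sketched in a few lines. This is not Biddulph's implementation: it only handles rdfs:seeAlso written as an element with an rdf:resource attribute in RDF/XML, and the fetch argument is a hypothetical hook supplied by the caller so that the sketch runs without network access. Results are keyed by source URL, which keeps a trace of where each link was found.

```python
import xml.etree.ElementTree as ET
from collections import deque

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFS = "{http://www.w3.org/2000/01/rdf-schema#}"


def see_also_links(rdf_xml: str) -> list:
    """Return the targets of every rdfs:seeAlso element in an RDF/XML document."""
    root = ET.fromstring(rdf_xml)
    return [el.get(RDF + "resource")
            for el in root.iter(RDFS + "seeAlso")
            if el.get(RDF + "resource")]


def crawl(start_url: str, fetch) -> dict:
    """Breadth-first crawl; returns {url: links found in that document}.

    fetch maps a URL to RDF/XML text (caller-supplied, hypothetical hook).
    """
    visited = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # documents that point at each other are fetched once
        links = see_also_links(fetch(url))
        visited[url] = links
        queue.extend(link for link in links if link not in visited)
    return visited
```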
Uniting RDF and XPath
Adapting XPath to navigate graphs instead of trees was one of the hot and novel topics of Extreme Markup Languages 2003 with proposals from Steve Cassidy, who suggested adding new XPath axes, and Norman Walsh, who presented a tool that generates limited tree views from RDF graphs.
The new proposal by Damian Steer and its reference implementation based on SAXON is different from both since it lazily expands an RDF graph into local trees when the access to the nodes is first made. For instance, if "A" is linked to "B" which is linked to "C" which is linked back to "A", the XPath expression "B/C" will navigate from "A" to "C" if "A" is the context node and "C/A" will navigate from "B" to "A" if "B" is the context node.
Applications using this mechanism need to check that they do not produce endless loops ("B/C/A" loops back to "A" when "A" is the context node) and XSLT as harmless as the classical identity transformation would produce an endless loop. Applied to RDF, this mechanism allows access to RDF "nodes" using XPath expressions that become independent of the syntactical variations of the XML syntax used to express the RDF graph.
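The "A", "B", "C" example and the loop problem can be reproduced with a toy model. This is not Steer's implementation and involves no real XPath engine: paths are plain '/'-separated node names, and the depth limit is an assumption added to show one way of keeping an identity-style copy from recursing forever.

```python
# The cyclic graph from the text: A -> B -> C -> A.
GRAPH = {"A": ["B"], "B": ["C"], "C": ["A"]}


def navigate(context: str, path: str) -> str:
    """Follow a '/'-separated path of node names, expanding one link per step."""
    node = context
    for step in path.split("/"):
        if step not in GRAPH[node]:
            raise KeyError(f"{node} is not linked to {step}")
        node = step
    return node


def copy_tree(node: str, max_depth: int) -> tuple:
    """Identity-style recursive copy of the lazily expanded tree.

    Without the depth limit this recursion would never end on a cyclic
    graph, which is the endless loop described in the text.
    """
    if max_depth == 0:
        return (node, [])
    return (node, [copy_tree(child, max_depth - 1) for child in GRAPH[node]])
```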
Modernising Semantic Web Markup
The detractors of the XML serialization of RDF complain that it's more a macro-language to generate triples than a syntax for expressing triples. As if foreshadowing Steven Pemberton's call that "parsing is easy", a number of non-XML formats have been proposed for RDF, including N3 (by Tim Berners-Lee) and Turtle (by Dave Beckett) that are "real" syntaxes to express triples.
Dave Beckett, who pleasantly warned us "I can't stop myself inventing syntaxes", proposes "Regular XML RDF" (RxR), an XML serialization for RDF triples that is the complete opposite of the existing XML syntax for RDF and directly follows the RDF data model.
Since the RDF data model is about describing a graph composed of triples made of a subject, a predicate, and an object, the elements of the RxR vocabulary are "graph", "triple", "object", "predicate" and "subject". RxR is (almost) as simple as that. It appears to be an ideal syntax to represent an RDF graph in XML in a simple, easy to read and almost canonical fashion. But we must note that RxR documents are much more verbose than the RDF/XML syntax: sometimes a macro language can be beneficial.
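A serializer for such a triple-per-element format is short, which is part of its appeal. In the sketch below only the five element names come from the presentation; the "uri" attribute, the absence of a namespace and the restriction to URI objects (no literals) are assumptions made for illustration, not the RxR specification.

```python
import xml.etree.ElementTree as ET


def triples_to_rxr(triples) -> str:
    """Serialize (subject, predicate, object) URI triples, one element per part."""
    graph = ET.Element("graph")
    for subject, predicate, obj in triples:
        triple = ET.SubElement(graph, "triple")
        # The "uri" attribute is an assumption of this sketch.
        ET.SubElement(triple, "subject", uri=subject)
        ET.SubElement(triple, "predicate", uri=predicate)
        ET.SubElement(triple, "object", uri=obj)
    return ET.tostring(graph, encoding="unicode")
```

Every triple costs four elements here, which shows where the verbosity comes from.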
The abstract of Steve Newcomb's presentation left me curious what it was about. It was a presentation of the conclusions of a study done for the European Commission regarding R&D in the area of electronic publishing. Its conclusions are rather optimistic and enumerate a number of reasons why Europe could become a key player in the revolution of the publishing industry that is currently under way.
Steve Newcomb urges us to improve the dialogue between IT developers and the purveyors of knowledge and content: given the importance of this revolution, "cost saving isn't the vision" and content publishers need to be involved in the decisions. This should result in the improvement of the "smartness" of the content ("smartness" in that context being the "ability to participate fully at the semantic level to the intelligent space"). This smartness is, of course, welcome to use Topic Maps to express itself.
Noting that this effort needs a banner to be advertised and that the banner "Semantic Web" can't be used since publishing is not restricted to the Web, Steve Newcomb suggests that we rally behind the term "smart document".
ebXML registries, as described in the joint OASIS and UN/CEFACT ebXML registry specification, also published as an ISO 15000 standard, provide a highly generic distributed storage system that is independent of the other ebXML specifications. Even when they are used in a complete ebXML system to store documents such as CPP (Collaboration Protocol Profile), BP (Business Process specification) or Core Components, they have no specific knowledge of the semantics and properties of these documents and treat them as opaquely as they would a JPEG image.
The purpose of Farrukh Najmi's presentation was to show how ebXML registries could be used as a CMS to store documents for publication on the Web. After brief presentations of what CMSes are, on the one hand, and ebXML registries, on the other, Najmi toured the main features of ebXML registries (publication, search and management, document life cycle management, metadata, notification, security and federation) to explain how they might help implement a CMS.
It was Alex Brown who took up the challenge of raising this controversial subject. Among his complaints against XML are its verbosity (most of the features that could keep SGML documents concise have been removed from XML), the "attribute question", and a data model (the XML infoset) made more complex than it should be by the variety of its node types.
Claiming that a good design is elegant, minimal yet complete and separates data from presentation, Alex Brown notes that XML isn't elegant (the attribute question "smells bad"), isn't minimal, and doesn't separate the data, the XML infoset, from the syntax, which is imposed. To fix that, he proposes to define a new profile of SGML, simpler than XML, that would remain compatible with XML and would get rid of DTDs, PIs, comments, CDATA sections, attributes and namespaces. This proposal doesn't seem to have convinced the many W3C members attending the session, some of whom voiced objections during the Q&A session.
The last presentation was by Mark Birbeck, who proposed an XHTML 2.0 syntax for representing RDF triples recently published as a W3C Note. It's achieved by allowing "meta" elements and attributes pretty much everywhere in XHTML documents. Birbeck showed how different combinations of these elements and attributes could provide a flexible way to express RDF triples.
This work appears to be moving on at a fair pace and the principal remaining issue seems to be how to bind assertions to a fragment of a page: the current proposal relies on "id" attributes in a way that hasn't fully convinced me. The challenge is important since the goal is to reconcile XHTML and RDF.
The State of XML
The closing plenary was given by Edd Dumbill and was published in last week's issue of XML.com. Dumbill had chosen to pick the most important topics and comment upon them in detail; both the choice of these topics and his comments were in sync with the recurring theme of "refactoring the use of XML". Six years after the publication of XML 1.0, there are still many interesting things to discover, even in the "lowest" layers of the architecture and XML conferences still have an important role to play.