
XML 2003 Conference Diary

December 23, 2003

Eric van der Vlist, author of O'Reilly's books on RELAX NG and W3C XML Schema, shares his personal view of December's XML 2003 Conference, held in Philadelphia, PA, USA.

Schemas Everywhere, With a Soupçon of Semantic Web

I am on my way back from XML 2003 and it's time for me to draw some conclusions from this event which, year after year, remains the major conference of the markup community. For me, this year's conference was dominated by schema languages, but I am so biased that this probably doesn't prove anything. Schema languages have become my main focus and I see them everywhere!

The other notable thing I noticed this week is a rise in interest in the Semantic Web at large and an increasing number of presentations showing concrete issues solved by its technologies. This is something I have also noticed in my consulting work recently, and it can be interpreted as a sign that the Semantic Web could take off in the relatively near future.

Schemas were, of course, the main subject of discussion during our ISO DSDL meeting on Saturday and of my tutorial on XML Schema Languages on Sunday. Monday was a day off for me, without any schema except for the homework I am finishing for DSDL.

Even in SVG

I had decided to start the conference, on Tuesday, with something entertaining and went to see some SVG. An impromptu session by Mansfield was really good, and schema free! His next session, "Adding Another Dimension to Scalable Vector Graphics", was also very enlightening. I would have said perfect, but he had the funny idea of showing his 3D vocabulary with a W3C XML Schema! Did he want to make it look more complex than it was? I don't know, but document samples are so much easier to read and comment on that it looks insane to present schema snippets! If he really wanted to show a more formal representation, the compact syntax of RELAX NG, or even a DTD, would have been much more readable.
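
To show what I mean, a short document sample tells readers almost everything they need to know about a vocabulary at a glance. The sample below is entirely made up (it is not Mansfield's actual vocabulary; the element names and namespace are my invention), but compare how quickly it reads against any schema snippet describing the same thing:

    <!-- Hypothetical 3D document sample; element and attribute names invented. -->
    <scene xmlns="http://example.org/3d">
      <box width="2" height="1" depth="3">
        <position x="0" y="0" z="-5"/>
        <material color="#cc0000"/>
      </box>
    </scene>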

I feel that it's unfortunate when people give schemas such a central role, too. There is an increasing tendency to say that we "publish a schema" when we mean that we publish an XML vocabulary. This tendency is not new, and people used to "publish DTDs" before "publishing schemas", but it's just as wrong today with schemas as it was yesterday with DTDs! A schema by itself is pretty useless; it's only the tip of the iceberg. If Mansfield's proposal had only been a schema, his presentation wouldn't have been worth attending. What constitutes its value is the semantics attached to his elements and attributes, the data model associated with his documents, and the JavaScript implementation that converts his 3D description into SVG on the client side. And none of this is in the schema.

Associated with XForms

The next session, "Generating User Interfaces from Composite Schemas", was about transforming WXS schemas into XForms. This seems to be becoming a pretty common pattern: the separation between the data model and the user interface advocated by XForms is a nice concept, which must be very useful for big projects with clearly differentiated teams of interface designers and software developers; but many projects are tempted to group both the user interface and the data model in a single document, and XML schemas is the common choice for this purpose. Patrick Garvey and Bill French describe the benefits of such an approach:

The efforts described here fulfil this commitment in two ways. Our processor tightly couples user interfaces with an underlying data model expressed in an XML Schema. This helps ensure data integrity and consistency at all levels of the user application. Second, our processor represents a formal process for encoding user interfaces. The automated nature of the process allows us to develop complex user interfaces without the need to write much custom code. It also allows us to easily propagate changes in the data model up to the user interface level, again with a minimum of coding. We believe that our techniques are a simpler, more developer-friendly approach to creating and maintaining user interfaces. Our interfaces do not require complicated and obscure script code for validating and marshaling form data. We minimize the danger that the application's data model and user interface go out of sync while one changes and the other does not.

I wonder if this approach isn't overly "technology centric". It again makes the schema the center of development, with the user interface derived from the schema. Today's accepted trend is to make users the center of IT projects, and I'd like to see tools following that direction. Why not decide that the user interface is the center and derive schemas from XForms forms rather than derive the forms from the schema? I have recently tried to start a discussion on www-forms@w3.org on this topic.
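
To give an idea of what such generators do, here is a sketch of the mapping involved. This is only my own illustration, not Garvey and French's processor output, and the element name is invented:

    <!-- A (hypothetical) leaf of the data model, declared in W3C XML Schema... -->
    <xs:element name="email" type="xs:string"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"/>

    <!-- ...and the kind of XForms markup a generator might derive from it.
         Note that the label cannot come from the schema alone, which is
         precisely where a purely schema-driven approach shows its limits. -->
    <xforms:bind nodeset="email" type="xs:string" required="true()"
                 xmlns:xforms="http://www.w3.org/2002/xforms"
                 xmlns:xs="http://www.w3.org/2001/XMLSchema"/>
    <xforms:input ref="email" xmlns:xforms="http://www.w3.org/2002/xforms">
      <xforms:label>Email</xforms:label>
    </xforms:input>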

The next three sessions were remarkably schema free and, despite their interest, I will leave them out of this schema-oriented chronicle.

Back to DTDs

The next session I attended was "NLM's Public Domain DTDs: A 9-Month Update", and I really enjoyed the focus placed by Debbie Lapeyre and Jeff Beck on the design and documentation of their vocabulary. But, here again, I think that they do not do their work justice by calling it "DTDs". A good part of the design and most of the documentation work is independent of any schema language and would easily survive a migration to, let's say, RELAX NG. They have defined more than a set of DTDs: they've designed a vocabulary.

Why DTDs? Debbie Lapeyre had a slide to answer this question: W3C XML Schema was not used because there are still many interoperability issues between processors when schemas get complex, and NLM, which is required to be a superset of all the existing vocabularies for describing articles, would have made for a complex schema. They also considered that DTDs are easier to "modularize" than W3C XML Schema, and that was important for NLM, which is a set of modular DTDs that let you build custom subsets. And, yes, they are considering publishing schemas in other schema languages such as RELAX NG.
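
The "easier to modularize" point relies on a classic DTD trick: since the first declaration of a parameter entity wins, a customization layer can redefine an entity before pulling in the base module. The module and entity names below are hypothetical, not NLM's actual ones; this is just the general pattern:

    <!-- article-custom.dtd: a hypothetical customization layer.           -->
    <!-- Override first: the first declaration of a parameter entity wins, -->
    <!-- so this narrows the article types allowed by the base module.     -->
    <!ENTITY % article-types "research-article | review-article">

    <!-- Then pull in the (hypothetical) base module, which declares        -->
    <!-- article-types with a wider default and uses it in content models.  -->
    <!ENTITY % base-module SYSTEM "base-articles.dtd">
    %base-module;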

All this sounds very attractive, and I have promised myself to take a closer look at NLM to see what benefit it could bring to my web site XMLfr.org.

The DSDL Track

There was no formal DSDL track at XML 2003, but the next four sessions on my list were nevertheless dedicated to DSDL parts.

The first of these was James Clark's "Incremental XML Parsing and Validation in a Text Editor", a wonderful presentation of how RELAX NG (aka DSDL Part 2) can be used to guide XML editing. Although the talk described Clark's "nXML" mode for Emacs, the principles were generic and could apply to other XML editing tools.

What I liked most in this talk was the different perspective on XML parsing and validation. Traditionally, we differentiate parsing from validation and include the check for well-formedness in the parsing. This separation does not work well during the editing of XML documents. Rick Jelliffe had already shown that in an amazing session called "When Well-Formed is too much and Validity is too little" at XML Europe 2002. James Clark, who had already shown his interest in the concept by adding "feasible validation" to his RELAX NG processor "jing", is now following a similar path in nXML. An XML editor needs to be able to rapidly process the structure of the markup to provide syntax highlighting, and document-wide well-formedness is too much for that. Clark's nXML thus includes a parser which is limited to token recognition and does not check that tags are balanced, and a validator that checks well-formedness and validity against RELAX NG schemas when they are available.
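
A tiny example of my own makes the distinction concrete. In the fragment below every token (start-tags, attribute, text, end-tag) can be recognized and highlighted locally, yet the fragment is not well-formed; spotting that is the validator's job, not the tokenizer's:

    <!-- Tokenizes cleanly, so syntax highlighting works line by line,
         but <em> is never closed and </para> does not match the
         innermost open element: not well-formed. -->
    <para id="p1">An <em>unfinished sentence.</para>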

Even if the idea doesn't seem to be getting a lot of traction, I am convinced that XML deserves a cleaner processing model, as advocated on XML-DEV many times by Simon St. Laurent with his "half-parser" and even by myself. There are indeed many situations where "well-formed is too much and validity is too little", and our tools should do a better job of helping us in them.

The other thing that gave good food for thought in this presentation was James Clark's insistence that, during the whole process of parsing and validation, no tree is ever built in memory. This is further proof that RELAX NG meets its design requirement of allowing stream processing, and it offers another perspective on XML documents: we tend to see them as trees, while they can also be seen and processed as streams of events. This dual nature of XML is something we should not forget in our applications.

The next session was my own "ISO DSDL Overview and Update", and I won't cover it here. During the presentation, I felt that the message was relatively clear and that the audience was following the ten parts of DSDL, but it will probably take some time before the word spreads and people remember what those ten parts are.

Murata Makoto came next to present "Combining Multiple Vocabularies Without Tears", a high-level introduction to DSDL Part 4 and its "divide and validate" paradigm, complemented by James Clark's "Namespace Routing Language (NRL)" proposal. These two complementary talks described a new way to validate compound documents: rather than combining individual schemas, which often requires adapting them and forces them all to use the same schema language, NRL (which is the main input to DSDL Part 4) proposes a language that splits composite documents according to their namespaces and specifies which schemas must be used for each of these parts. Many examples were given during these two talks, including the validation of SOAP messages with their envelope and payload, and of XHTML documents embedding various namespaces, from SVG and XForms to RDF.
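
From memory, an NRL document for the SOAP case looks roughly like the sketch below. I am writing this from my notes rather than from the draft itself, so take the namespace URI and element names as approximate, and the schema file names as invented:

    <!-- Sketch of an NRL rule set; details should be checked against Clark's draft. -->
    <rules xmlns="http://www.thaiopensource.com/validate/nrl">
      <!-- Route the SOAP envelope namespace to one schema... -->
      <namespace ns="http://schemas.xmlsoap.org/soap/envelope/">
        <validate schema="soap-envelope.rng"/>
      </namespace>
      <!-- ...and whatever namespace the payload uses to another. -->
      <anyNamespace>
        <validate schema="payload.rng"/>
      </anyNamespace>
    </rules>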

Caching and Resolvers

Next on my list was Norm Walsh's "Caching in with Resolvers", a practical and realistic presentation of two options for dealing with the fact that the web is by nature unreliable. As Walsh puts it: "The real nice way to use URIs is to use absolute URIs on the network, but networks go down and latency is sometimes significant." This phenomenon doesn't only affect schemas and DTDs, but they are hit particularly hard by it. Norm Walsh went through the benefits and drawbacks of the two classical solutions to this issue: resolvers such as XML Catalogs and general-purpose caching proxies.

These two solutions are so different that we may not even think of them as comparable. Walsh's conclusion was that they are complementary, and that we can use an XML Catalogs resolver that relies on a caching proxy when delegating the resolution of sub-catalogs to external servers. That still seems complex, and maybe we just need XML Catalogs resolvers to act more like caching proxies!
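
For readers who have never looked inside one, an OASIS XML Catalog is itself a small XML document; the example below maps a public identifier to a local copy and delegates a whole family of system identifiers to a remote sub-catalog (all the identifiers and URIs here are made up for the example):

    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <!-- Map a public identifier to a local copy of the DTD. -->
      <public publicId="-//EXAMPLE//DTD Article V1.0//EN"
              uri="dtds/article.dtd"/>
      <!-- Delegate everything under this base URI to a sub-catalog kept on
           the network; this is where a caching proxy earns its keep. -->
      <delegateSystem systemIdStartString="http://example.org/schemas/"
                      catalog="http://example.org/schemas/catalog.xml"/>
    </catalog>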

Russian Dolls Escape from Schemas

The metaphor of Russian dolls has hit the XML world through schemas, and I was curious to see how it applied to multi-version documents. My next session was thus "Russian Dolls and XML: Designing Multi-Version XML Documents" by Robin La Fontaine and Thomas Nichols. Also called "DeltaXML Unified Delta", this is a proposal to serialize multiple versions of XML documents by highlighting their differences. The reason for this use of the metaphor is that the multiple versions are embedded in a single document like Russian dolls.

Beyond the metaphor is a simple vocabulary composed of "vset" attributes, identifying whether the nodes are modified between the different versions, and "PCDATA" and "attribute" elements to hold updated text nodes and attributes. A very clean proposal, whose main complexity comes from the lack of homogeneity with which XML treats elements, text nodes and attributes. The nice thing is that from a unified delta any version of the encoded document can be extracted by a simple XSLT transformation. Applications of this proposal include content management, internationalization, variant management and collaborative authoring.

But this simple proposal is quite a challenge for schema languages! How can you validate such a compound document? You could use NRL and your favourite schema language to validate the different pieces of the document independently, or use DSDL Part 10 (Validation Management) to extract a version through XSLT and validate it with any schema language, but there is no provision right now in DSDL to loop over all the versions embedded in a unified delta and validate them.
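
The extraction step itself is simple enough to fit in a few templates. The stylesheet below is only my rough sketch of the idea, not DeltaXML's actual transformation; it ignores the "PCDATA" and "attribute" elements, and I assume, for the sake of the example, that a vset attribute carries a comma-separated list of the versions an element belongs to:

    <?xml version="1.0"?>
    <!-- Hypothetical sketch: extract one version from a unified delta. -->
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- The version to extract; "2" is an arbitrary default. -->
      <xsl:param name="version" select="'2'"/>

      <!-- Identity copy by default. -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- Keep an element carrying a vset attribute only if its list
           mentions the requested version. -->
      <xsl:template match="*[@vset]">
        <xsl:if test="contains(concat(',', @vset, ','),
                               concat(',', $version, ','))">
          <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
          </xsl:copy>
        </xsl:if>
      </xsl:template>

      <!-- The delta markup itself does not belong in the extracted version. -->
      <xsl:template match="@vset"/>
    </xsl:stylesheet>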

Topic Maps

Three Topic Maps sessions were next on my list. The title of the first one, "Topic Map Design Patterns For Information Architecture", was a little bit scary, but Kal Ahmed did a good job of making a clear and practical presentation, from which I will remember that Topic Maps can be represented in UML and that design patterns work nicely with Topic Maps. According to Ahmed, Topic Map design patterns expressed in UML and serialized as topic maps provide a fourth way, neglected so far, of describing topic map models. We already had Published Subject Identifiers (PSIs), PSI metadata, and the Topic Map Constraint Language (TMCL), but we still lacked something that is both human readable (hence the UML) and prescriptive. These models can be expressed as topic maps, and they are simpler than ontologies. Because they become part of the topic map, design patterns can be built on top of other design patterns.

It was nice to see that existing practices (UML diagrams and design patterns) can be used with new technologies such as topic maps, and I think that this could be generalized to the Semantic Web at large and RDF in particular. Although I am not an RDF Schema expert, I think that RDF Schema and OWL should help a lot in writing RDF design patterns. A good candidate for an RDF design pattern could be a design pattern to implement... topic maps in RDF!

Nikita Ogievetsky and Roger Sperberg's "Book Builders: Content Repurposing with Topic Maps" was more concrete, showing how topic maps can be used to create Question and Answer books on demand from a corpus of existing questions and answers. Michel Biezunski's "Semantic Integration at the IRS: The Tax Map" was another case study showing nice ideas such as automatic discovery of topics in the resources. Both are a good indication that Semantic Web technologies are now being used to solve real world problems for real world customers.
