Reports from XML 2002

December 18, 2002

Eric van der Vlist

Eric van der Vlist is the author of XML Schema.

Microsoft Office Embraces XML

For many participants, the most memorable event of XML 2002 will be Jean Paoli's presentation of Office 11, which promises to deliver easier access to XML for hundreds of millions of workstations.

Those of us who had to connect a Windows PC to the Internet in the early 90s remember the difficulty of choosing and installing TCP/IP and the browser software necessary to access and browse the Web. At that time, Microsoft didn't believe in the future of the Internet, TCP/IP wasn't natively supported by Windows, Internet Explorer was a vague project in a laboratory, and many companies had developed market segments left uncovered by Microsoft.

This era was wiped away by a U-turn, which I have always considered a miracle for a company the size of Microsoft. Within months, it bundled a TCP/IP stack and Internet Explorer with Windows. Internet features were rapidly added to Office. Users applauded. Companies positioned in these segments either changed their strategy or disappeared, which is believed to have reduced the overall potential for innovation in some domains.

The impression given in the presentation by Jean Paoli, co-editor of the XML 1.0 recommendation and pilot of the "XMLization" of Office, is that Microsoft is doing with XML what it did with the Internet.

XML was already well-supported by Microsoft BackOffice products from Biztalk to CMS through XML Server, but a key piece was missing: tools to let users manipulate XML on their workstations. The issue of editing XML documents is still as difficult as the connection to the Internet in the early 90s: technically there is a solution and many XML editors are available, but the financial, institutional, and human impact of deploying these tools on a large scale is considered a major obstacle by many organizations.

"XML for the masses" is the target of Office 11, and Paoli's presentation suggested that Microsoft will likely meet the target. Without major innovations (except maybe the linking of XML documents to Excel spreadsheets using XPointer references), Office 11 appears to be XMLizing Office using largely standard technologies and enabling the manipulation of arbitrary XML documents using customer-chosen schemas:

  • Word 11 has been transformed into an XML editor and can be used to edit any XML document, assuming you can write a W3C XML Schema for it. The presentation can be configured, and validation is done on the fly, like spell-checking. A standard XML format has also been added for ordinary Word documents.

  • Excel XP already exports documents as XML. Excel 11 adds support for arbitrary documents and, maybe more interesting, the possibility to import values from XML documents as with a DBMS. The selection of the values to import is done through drag and drop; the links are stored as XPointer expressions.

  • XDocs is a new application that defines and uses document-oriented forms, more similar to the forms in Lotus Notes than those on the Web. User input may be stored together with the form in an XML document using a schema defined by Microsoft or in external XML documents with arbitrary schemas.

  • Access 11 can export its content as XML using a user-supplied schema.

  • Visio has its own XML format and can also read arbitrary XML documents and display their content in its drawings. Visio is the first Microsoft application to support SVG; it can load and save SVG documents.

  • Front Page 11 includes a WYSIWYG XSLT editor to define XSLT transformations through drag and drop.

PowerPoint is the only piece without an XML update, because of a lack of time according to Paoli. For the rest of the applications, XML can be used as a channel of information between front and back office applications, and also between front office applications, creating many new possibilities for the users, consultants, and integrators who will be able to exploit them.

"XML for the masses", the target of Microsoft, is not the same as the target of the XML editors we know today, and it will probably take some time for Word to become a serious challenger. Nevertheless, many will be tempted to use a tool which is installed on so many workstations. The effect of this may well be to reduce the opportunity for competition, innovation, and value-add in the existing XML editor world, yet also to create new market segments.

Often criticized for its "embrace and extend" strategy, Microsoft has finally decided to continue to play the game with XML, even though the extensibility of XML opens new and unpredictable possibilities. But Microsoft needs to control all the major market segments, for fear that XML might give the masses the possibility of emancipation from its domination.

While the deployment of XML on millions of workstations is good news in the short term, it will certainly modify the landscape of desktop XML. The shape of this new landscape, and the longer-term consequences, are difficult to foresee.

OpenOffice: the XML format for the masses

Jean Paoli for Microsoft and Daniel Vogelheim for OpenOffice both chose the same title "XML for the masses" for their presentations, a commonality which hides two very different approaches from the editors of two competing office productivity suites.

The strategy of OpenOffice is to focus on the XML format natively used to store the documents. In his presentation Daniel Vogelheim gave an overview of the OpenOffice XML format and justified the design decisions taken to meet the following requirements:

  • use existing standards -- don't reinvent the wheel
  • "transformability" -- the format must be usable outside of the office application
  • first class XML -- all structured content must be accessible through XML structures

The result is a complex format (over 450 elements and more than 1600 attributes), with a good amount of redundancy, but easily readable and easily transformable. To some extent it's an XML format for users more than for developers.

This format has been given as an input to the OpenOffice OASIS Technical Committee, which aims to create an "open, XML-based file format specification for office applications."

For Microsoft, the strategy appears to be to bring XML tools to each desktop, and leave each user free to choose appropriate schemas, rather than promoting an XML office format which will be specific to Microsoft, and for which a licensing model is still unclear.

If the target of Microsoft Office 11 is to deliver XML tools to the masses, the target of OpenOffice is to become the XML office format for the masses.

ISO DSDL on the move

Document Schema Definition Languages (DSDL) is a project of the ISO/IEC JTC 1/SC34/WG1 working group chaired by Charles Goldfarb. After the meetings held in Baltimore during XML 2002, this working group published a set of recommendations, all of which have been approved:

  • The Relax NG specification is published as "ISO/IEC FDIS 19757-2, DSDL Part 2: Regular-grammar-based validation - RELAX-NG". This stage is similar to a W3C Proposed Recommendation, and the specification should be approved as a final ISO standard within two months.
  • Part 0 (Overview) is published as a CD (i.e. a first Working Draft).
  • Part 4 (Selection of Validation Candidates) is published as a CD. Proposed by Murata Makoto, this part defines a language for selecting XML islands which can be validated through different schemas. The document is derived from prior work known as "Relax Namespaces".

DSDL has also advanced the definition of the remaining parts, and several of them should lead to publications during the next meeting at XML Europe 2003. The decisions taken on part 1 (Interoperability Framework) include the following:

  • The Interoperability Framework should allow one to use existing languages without modifications and thus to manipulate full documents instead of fragments. This approach will follow the principles of the Schemachine proposal by Rick Jelliffe, but it will be made more declarative and less procedural.

  • My own XVIF proposal is considered interesting and an extension mechanism will be added to Relax NG to support it in a more standard way.

A burst of schemas

For different reasons many XML 2002 presentations proposed the use of multiple validations and transformations for advanced needs, rather than using a single schema considered too complex or even impossible to write and maintain.

Perhaps influenced by my editorship of "DSDL part 1 -- Interoperability Framework", I observed the justification of such a framework returning as a leitmotif in many presentations:

  • Liora Alschuler, in her presentation "Layered Constraints: The Proposal for HL7 Healthcare Templates," explained why, in the face of the huge diversity of practices for health reports, HL7 has chosen to associate a very lax generic schema with templates, i.e. specific constraints which formalize the different local usages of the common schema.

  • Walter Hamscher in "XBRL: XML, XLink, and the Revolution in Corporate Reporting" explained why, for similar reasons, XBRL has chosen to express most of the structure of its reports as extended XLink link bases. Since these links cannot be validated using W3C XML Schema, a complete validation of XBRL documents requires a validation by the application layer, which could also be performed using a language such as Schematron.

  • Gabe Beged-Dov in "Normalized Metadata Format: RDF Meets XML Schema" showed how RDF documents may be "normalized" to facilitate their validation through W3C XML Schema.

  • Eric Freese in "Using DAML+OIL as a Constraint Language for Topic Maps" proposed a modification of the syntax of XTM Topic Maps documents (which could be done by an XSLT transformation) that enables their validation using RDF applications such as OWL or DAML+OIL.

  • Bob DuCharme in "Maintaining Schemas for Pipelined Stages" showed that the customization of generic W3C XML Schema or Relax NG schemas with added metadata could be performed through XSLT transformations more easily than by using the derivation techniques of these languages.

There are few commonalities between these presentations, but all of them show how, confronted with the issue of a complex validation in very different domains, projects have chosen to split the validation of their documents into different, easier to write, elementary steps.

That's also the approach taken by DSDL, the reason why it will be a multi-part standard, and the justification of its part 1, Interoperability Framework, which will define a language to describe the choreography of the elementary steps needed to perform a complex validation.
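The idea of splitting one complex validation into elementary steps can be sketched in a few lines of Python. This is only an illustration of the principle, not DSDL syntax or any of the frameworks mentioned above: the `report` vocabulary, the three check functions, and the `validate` driver are all hypothetical names invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical report vocabulary, used only for illustration.
REPORT = """
<report>
  <section id="s1"><title>Findings</title></section>
  <section id="s2"><title>Plan</title><ref target="s1"/></section>
  <section id="s3"><ref target="s9"/></section>
</report>
"""


def check_structure(root):
    """Step 1: a lax, generic structural check (every section has an id)."""
    return [
        "section without id"
        for section in root.iter("section")
        if "id" not in section.attrib
    ]


def check_titles(root):
    """Step 2: a stricter local rule layered on top of the generic one."""
    return [
        f"section {section.get('id')} has no title"
        for section in root.iter("section")
        if section.find("title") is None
    ]


def check_links(root):
    """Step 3: a cross-reference rule that grammars express poorly."""
    ids = {section.get("id") for section in root.iter("section")}
    return [
        f"ref to unknown section {ref.get('target')}"
        for ref in root.iter("ref")
        if ref.get("target") not in ids
    ]


def validate(xml_text, steps):
    """Run each elementary step in order and collect every message."""
    root = ET.fromstring(xml_text.strip())
    messages = []
    for step in steps:
        messages.extend(step(root))
    return messages


# The "choreography" is reduced here to an ordered list of steps.
PIPELINE = [check_structure, check_titles, check_links]

if __name__ == "__main__":
    for message in validate(REPORT, PIPELINE):
        print(message)
```

Each step stays small enough to write and maintain on its own, and a different community could swap in its own second or third step without touching the generic one.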

Impossible, Except For James Clark

The fact that the translation of Relax NG schemas into W3C XML Schema was considered impossible wasn't a good enough reason for James Clark, who presented at XML 2002 the latest progress of Trang, his multi-format schema converter.

While it is not possible to transform every Relax NG schema into W3C XML Schema, Clark considered that in most cases a converter smart enough to figure out the intent of the author should be able to write a readable W3C XML Schema. This would be as good as a schema written directly using W3C XML Schema and a good approximation of the original Relax NG schema. Clark's presentation showed that this goal is not far from being met.

To achieve his aim, Clark had to abstract away the syntaxes of the two schema languages and define a data model which is intermediate between Relax NG and W3C XML Schema. The translation is done in two steps: the Relax NG schema is parsed and feeds the data model, and criteria are applied to this data model to decide which W3C XML Schema constructs should be used to get the best results in terms of both approximation and readability.

The results from the examples presented by Clark are outstanding, and he acknowledged that writing Trang was much more complex than writing a Relax NG implementation.

There appears to be quite a demand for this tool too, as many developers are attracted by the simplicity of Relax NG while they need to publish W3C XML Schema (WXS) schemas to be used by an increasing number of tools (Office 11 will require WXS schemas). Clark's answer to this request is simple: "don't bother with W3C XML Schema, write your schemas with Relax NG and use Trang to convert them."

Literate Programming in XML

Norm Walsh presented at XML 2002 his implementation (XWEB) of Literate Programming in XML, available as part of his DocBook stylesheets, which enables the extraction of both documentation and code from common documents.

Introduced in 1984 by the father of TeX, Donald Knuth, Literate Programming is a software development methodology that focuses on documentation by embedding the source code in the documentation. Michael Sperberg-McQueen proposed its application to SGML back in 1993.

Its principle is to document code fragments (classes, methods, variables or XSLT templates are examples of code fragments), to include the code fragments in their documentation, and to link them into a "web". This web can then be transformed to produce, on one hand, readable documentation in any format and, on the other, the source code.
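A minimal sketch of this principle, assuming a made-up markup, might look as follows. The `web`, `fragment`, `doc`, `code`, and `use` elements are hypothetical and are not XWEB's actual vocabulary, and the `tangle` and `weave` functions simply borrow the traditional Literate Programming names for the two extractions.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each <fragment> holds its <doc> and its <code>;
# a <use ref="..."/> inside <code> links to another fragment.
WEB = """<web>
  <fragment id="main">
    <doc>The program greets, then says goodbye.</doc>
    <code>print("hello")
<use ref="bye"/></code>
  </fragment>
  <fragment id="bye">
    <doc>The farewell is kept in its own fragment.</doc>
    <code>print("goodbye")</code>
  </fragment>
</web>"""


def load(web_text):
    """Index the fragments of the web by id, in document order."""
    root = ET.fromstring(web_text)
    return {frag.get("id"): frag for frag in root.iter("fragment")}


def tangle(fragments, frag_id):
    """Produce source code: expand every <use> link recursively.

    No cycle detection is done; a real tool would need it.
    """
    code = fragments[frag_id].find("code")
    parts = [code.text or ""]
    for use in code:
        parts.append(tangle(fragments, use.get("ref")))
        parts.append(use.tail or "")
    return "".join(parts)


def weave(fragments):
    """Produce documentation: one line of prose per fragment."""
    return "\n".join(
        f"[{frag_id}] {frag.findtext('doc').strip()}"
        for frag_id, frag in fragments.items()
    )


if __name__ == "__main__":
    fragments = load(WEB)
    print(weave(fragments))
    print(tangle(fragments, "main"))
```

The same document feeds both outputs, which is the point of the approach: the documentation and the code cannot drift apart because neither exists independently of the web.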

The XML implementation presented by Walsh is thus based on mature concepts, and its current version was published in March 2002. It supports not only embedding traditional languages such as Perl or Java but also XML languages such as XSLT, schemas, or any other XML "code". Support for languages such as Python would probably require some adaptation since XWEB doesn't appear to care much about indentation.

Literate Programming isn't incompatible with other methodologies such as Extreme Programming; among the extensions which seem most interesting, one could probably easily add the definition of the unit tests within the documentation of the code fragments.

Purists will also note that, following tradition, XWEB is implemented through XWEB documents, a minimal XSLT transformation being provided to "bootstrap" these documents into usable pieces of code, which consist mostly of XSLT transformations.