
XML Europe 2002 Coverage

May 22, 2002

Leigh Dodds

Note: Leigh Dodds and Edd Dumbill are covering XML Europe 2002 this week, live from sunny Barcelona, Spain. The XML-Deviant will be updated during the week with conference coverage, news, and notes. -- EDITOR.

Topics on the European Map

In his opening keynote at the XML Europe 2002 conference, Peter Pappamikail, head of the European Union (EU) Information Resource Management Group, gave some background on the group's activities within European government and the unique challenges it faces in defining a new information architecture for the EU Parliament.

Pappamikail explained that a key policy of the group was to use XML as the underlying technology for their efforts to draw together data, editorial, and metadata standards, among other policy documents. Pappamikail also briefly outlined two current initiatives. The first, ParlML, will attempt to capture best practices and other XML "bricks and tricks" to inform the use of XML in various European parliamentary activities. The second project, MiREG, will define a common metadata framework and associated controlled vocabularies and topic maps for use in public administration across the EU.

It was obvious from Pappamikail's presentation, and the direction that both of these initiatives are taking, that the EU will be investing strongly in Semantic Web technologies and common vocabularies. Pappamikail explained that these could help to provide a semantic layer -- capturing common concepts and a shared understanding of them -- that could help manage the diversity of information that European government must provide to EU citizens. Enabling easy navigation to accessible content was described as a key requirement. Pappamikail described this desire for strong resource discovery as ensuring that there is "no wrong door" for accessing content.

This initial theme continued throughout the first day of the conference, with several presentations, particularly in the content management track, showing the continuing influence of Semantic Web technologies, especially Topic Maps. One vendor suggested that vanilla content management systems will quickly become a devalued commodity: knowledge management technologies were seen as an integral aspect of content management, needed to combat information overload or, as Steve Pepper calls it, "infoglut".

Published Subjects and Content Structures

Published Subjects are a Topic Map technology which may help solve some of the problems faced by Pappamikail's group, particularly the need for users of a controlled vocabulary to reach a common understanding of its terms: a vocabulary is only effective if it is applied consistently when authoring documents. Dealing with the many different EU languages was one issue which Pappamikail highlighted; the difficulty of reaching a shared understanding of controlled terms across multiple languages further complicates this scenario.

Published Subjects allow a subject to be defined in a Topic Map as a concept with an associated URI. One or more names may then be associated with the subject, and these can be localized for specific languages. This gives the freedom to support multiple languages while still providing a stable subject resource with which documents can be associated. The ability to attach human-readable documentation to a Published Subject is another important component that will drive shared understanding of controlled terms. The details of Published Subjects, and an overview of how they can be used to improve the reuse, indexing, and classification of documents, were the subject of Bernard Vatant's paper "Re-using technical documents beyond their original context".
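
A minimal XTM 1.0 sketch (my own illustration, using hypothetical topic identifiers and URIs rather than anything from Vatant's paper) shows the basic shape: a stable subject indicator URI identifies the concept, while language-scoped base names carry the localized labels.

  <topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
            xmlns:xlink="http://www.w3.org/1999/xlink">
    <!-- topics standing for the two languages used as name scopes -->
    <topic id="lang-en"/>
    <topic id="lang-es"/>
    <!-- the Published Subject: a stable identifying URI plus localized names -->
    <topic id="country-spain">
      <subjectIdentity>
        <subjectIndicatorRef xlink:href="http://psi.example.org/country/spain"/>
      </subjectIdentity>
      <baseName>
        <scope><topicRef xlink:href="#lang-en"/></scope>
        <baseNameString>Spain</baseNameString>
      </baseName>
      <baseName>
        <scope><topicRef xlink:href="#lang-es"/></scope>
        <baseNameString>España</baseNameString>
      </baseName>
    </topic>
  </topicMap>

Documents can then reference the subject indicator URI regardless of which language their authors work in.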

Published Subjects are still in their infancy, but there are already several OASIS Technical Committees (TCs) working with the technology. The tm-pubsubj committee is defining general recommendations and best practices for the publication of Published Subjects, while two other committees are working on concrete classifications for specific areas. The geolang TC is defining a classification for countries and languages, while the xmlvoc TC is defining classifications that can be applied to XML standards.

While Bernard Vatant's presentation discussed the separation of subject indexes from content using Published Subjects, Jean Delahousse presented a use case for the separation of content structure from content. In a talk entitled "Dynamic Publication Through Content Structure Management Tools", Delahousse described a real-world application of Topic Maps to extract content structures from documents in a bank intranet.

Content structures range from those inherent in the content itself (e.g. chapters and sections) to relationships between different kinds of content. Delahousse highlighted the benefits of identifying both: the former through analysis of the content, the latter by mining knowledge from the enterprise, i.e. from the content's creators, editors, and users.

Delahousse explained that the use of taxonomies and topic maps greatly enhanced the resource discovery capabilities of the intranet application. The direct benefit was allowing users to quickly home in on the content they need, or to have it filtered to their particular interests or skill levels. A further key benefit of modeling these structures separately from the actual content was that users could independently define their own content organization, e.g. to collate commonly used material. This delivers a high degree of customization.

Maps and Guides

While Topic Maps certainly show a great deal of promise in enhancing the management and delivery of content, there are important human and cost factors in deploying them. The success of a Topic Maps application seems predicated on the strength of the underlying model, which in turn requires high-quality editorial input and careful domain analysis. For example, Delahousse recommended reusing existing content structure models wherever possible, but noted that these would still need tailoring to individual business needs. Of necessity this requires significant input from the current creators and users of the content.

The high costs of moving to electronic, XML-based publishing have been recognized for some time. These costs are slowly amortized as more diverse publishing and business models are enabled. Adopting ontology-based Semantic Web applications may also incur high initial costs to solve the knotty modeling problems involved. Indeed, there may well be a revenue stream in providing standardized ontologies as products in their own right. This further strengthens the need for efforts like the OASIS Technical Committees and public sector initiatives, such as those being undertaken by the EU, to define free, publicly available ontologies.

Adaptive Graphics

"Separate content and presentation" is a mantra with which every XML developer is intimately familiar. It's a message that has been repeatedly delivered at numerous conferences and product demonstrations, to the point where it has now become an accepted tenet of any content management system. Yet it's a design pattern which is still most strongly associated with the delivery of textual content. But the pattern is one that can also be applied to other kinds of content publication, a message that was clearly emphasized in Benjamin Jung and John McKeown's "Adaptive Graphics" presentation at XML Europe 2002 this week.

The central theme of the presentation was the benefit of attempting to separate content and presentation elements of images to allow the delivery of adaptive or personalized graphical content. Tailoring graphical content to end user preference or need is a feature enabled by the use of SVG to describe an image. For example, it's vastly easier to alter a piece of text content in an SVG image than it is to attempt the same with a raster graphics format. This introduces the potential for on-the-fly image generation as a means to present complex data. The presentation included several compelling examples of how this functionality could be put to good use in a number of different application domains.

A key aspect of generating adaptive graphics is identifying the static and dynamic parts of an image. This typically involves identifying the background elements of an image, upon which dynamic elements will be overlaid. A distinction was then drawn between two types of changes that could be applied to an image.

Qualitative changes are most often associated with altering aspects of an image for localization purposes. The example used here was that of an online comic in which the text was translated into the end user's preferred language.

Quantitative changes involve changes to the actual data used to render an image and, therefore, demonstrate the greatest degree of dynamism. The examples here included entertainment maps (visualizing the locations and times of various events across a street map), weather maps, and rail planners that include current information on train times. Each example involves time-sensitive data, which means that content must be tailored to the user at the time of request; some cases involve a mixture of qualitative and quantitative changes.
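
A minimal SVG sketch (my own illustration, not one of the presenters' examples) shows how both kinds of change sit on top of a static background: the rectangle and line below are fixed for every user, while the text would be regenerated for each request, whether to translate the label (a qualitative change) or to update the departure data itself (a quantitative change).

  <svg xmlns="http://www.w3.org/2000/svg" width="320" height="120">
    <!-- static background: identical for every user and every request -->
    <rect width="320" height="120" fill="#dde7f0"/>
    <line x1="20" y1="90" x2="300" y2="90" stroke="#555555" stroke-width="3"/>
    <!-- dynamic overlay: generated at request time from the user's language
         preference and the current timetable data -->
    <text x="20" y="50" font-family="sans-serif" font-size="16">Next departure: 14:32</text>
  </svg>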

The presentation also touched on whether similar approaches could be applied to audio and video content. An example of generating musical notation directly from MusicML was briefly discussed, while the prospect of content and presentation separation for video applications (e.g. blue-screening, screen ratio alterations, etc.) prompted a lively exchange following the presentation.

While the content-presentation divide for graphics is not quite as clear cut as for full-text publishing, attempting to draw this distinction can yield some interesting results. This confirms the utility of the basic design pattern and points toward some further interesting applications for SVG.

The Markup Spectrum

The most interesting presentation I've attended so far was Rick Jelliffe's "When Well-formedness is Too Much, and Validity is Too Little".

Jelliffe's theme was an alternate view of the classes of XML document. The XML 1.0 specification enshrines two types or states for documents: well-formed and valid. Yet SGML allows a wider variety of states. For example, tag minimization and short references allow documents that aren't well-formed to still be successfully processed. And the greater power of SGML DTDs allows for a more restrictive class of valid documents than that defined in XML.

Jelliffe presented his alternate document types as a spectrum of states. "Feasibly", "Inferably", "Impliably", and "Amply Tagged" are states which are all less than well-formed. The latter state became possible in the SGML '98 revision and is actually already widely deployed. Amply Tagged is the state in which most HTML documents can be classified.
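
To give a concrete sense of what "less than well-formed" means in practice (my own example, not one taken from the talk), the following fragment is perfectly ordinary HTML: the list item and paragraph end tags are omitted and the empty img element is not self-closed, so it fails XML well-formedness, yet browsers and other HTML processors handle it without complaint.

  <ul>
    <li>First agenda item
    <li>Second agenda item
  </ul>
  <p>See the <b>programme</b> for details:
  <img src="programme.png" alt="Conference programme">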

Moving beyond well-formedness, Jelliffe's categorization progressed steadily toward validity through a similar sequence: Feasibly Valid, then Minimally Valid, and finally Valid documents. PSVI-valid documents were placed somewhere beyond valid, due to the more rigorous constraints that are typically applied.

The point of illustrating this spectrum of states was to highlight which states are useful for different kinds of applications, particularly editing. Jelliffe's thesis is that well-formedness is too restrictive for markup creation, and that one of the lesser states is actually more suitable. From this perspective, Jelliffe noted that XHTML was unlikely to be successful because the XML well-formedness rules are too strict; Amply Tagged HTML is much more user-friendly.

Although limited by time, Jelliffe was able to outline some techniques showing how these additional document states might still be reliably processed. For example, only islands of content within a document might be validated at any one time. Schematron's phase mechanism was also suggested as a means to apply flexible validation throughout a document's lifecycle. Schemas might also be automatically adapted to reduce their strength when applied during editing.
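
As a rough sketch of the phase idea (the rule content and phase names here are my own, not taken from the talk), a Schematron 1.5 schema might activate only a lightweight structural pattern while a document is being edited, and the full rule set at publication time.

  <schema xmlns="http://www.ascc.net/xml/schematron">
    <!-- while editing, only the basic structural checks are active -->
    <phase id="editing">
      <active pattern="structure"/>
    </phase>
    <!-- at publication time, the stricter completeness checks are added -->
    <phase id="publication">
      <active pattern="structure"/>
      <active pattern="completeness"/>
    </phase>
    <pattern name="Basic structure" id="structure">
      <rule context="chapter">
        <assert test="title">A chapter should carry a title.</assert>
      </rule>
    </pattern>
    <pattern name="Completeness" id="completeness">
      <rule context="chapter">
        <assert test="count(section) &gt; 0">A finished chapter must contain
        at least one section.</assert>
      </rule>
    </pattern>
  </schema>

Validating with the "editing" phase selected applies only the first pattern; switching to "publication" later in the lifecycle brings in the rest.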

These concepts have already guided the development of the new Topologi Collaborative Markup Editor, which is currently in late beta with very promising results. It will be interesting to see how these concepts might be applied to the ISO DSDL project, of which Jelliffe is a member. DSDL is currently undergoing a realignment to focus exclusively on publishing requirements, at least in its initial phases. W3C XML Schema will now be supported as an extension, rather than as one of the core languages, which include Schematron and RELAX NG. Jelliffe was keen to avoid any sense that there might be conflict between the two efforts, stressing that each is most useful in distinct, but possibly overlapping, application areas.

Case Studies

An essential aspect of any conference is the sharing of experience, both informally and during formal case study presentations, of which there were several at this year's conference.

Edwin van der Klaauw described a prototype system which will revolutionize the publishing capabilities of the Dutch Library for the Blind, which provides Braille, large print, and audio versions of conventionally printed books. The original production processes are heavily manual, and the processes for each of the different publishing formats require separate access to the original printed text.

The replacement system still requires initial scanning and OCR of the source text to produce the initial electronic inputs, but the content is then captured as XML using WorX for Word, a Microsoft Word plugin that provides an XML export feature. Several similar technologies were being demonstrated by vendors this year. The XML content enabled a much broader set of possible output formats to be adopted, including both HTML and Digital Talking Books (DTB). It also impacted the editorial processes: not only was there no longer any need for the original source text to be available beyond the initial scanning, but editors could now begin working in parallel even on the same text.

The study was a clear demonstration of how the high initial investment of moving a production system over to XML yielded returns by enabling not only different publishing formats but, potentially, different business models. The library is now able to syndicate its content using the DTB format. It's also a clear indication that introducing these new systems requires more than just deploying markup: a clear understanding of the current workflow is needed to realize all the potential benefits.

A second interesting case study came from Kluwer, a large European publisher, in a presentation given by Gerth van Wijk. The aim of the project was to improve the creation of book indexes, another labor-intensive process whose results, due to a lack of central coordination between indexers, are difficult to combine across Kluwer's entire corpus of material.

The presentation offered some interesting comparisons of the difficulties of modeling the data, which largely consisted of a thesaurus containing multiple parallel hierarchies of terms, along with keyword strings constructed from terms in the thesaurus. The finding was that the relational model was actually much better than XML for storing the data, due to its far better support for referential integrity and normalization. Unfortunately, the data proved difficult to manipulate using SQL, making a database-only application just as problematic.

The hybrid approach -- storing the data in a relational model and then presenting it as XML to higher layers in the application -- was also attempted. This too was rejected, because the problems encountered in integrating the two models made the application much more complex. The application suffered from the worst, rather than the best, of both worlds.

The ultimate solution was to build the system around Topic Maps. These offered the same modeling power as the relational model, with several additional layers of indirection. But even here shortcomings were encountered, particularly the lack of standardized constraint and query languages. TMCL and TMQL are still being developed, meaning that Kluwer is currently unable to meet its goal of a fully standards-compliant application.
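
To give a flavor of how a broader/narrower relationship from such a thesaurus might look as a topic map (a hypothetical XTM 1.0 sketch, not Kluwer's actual data model), two terms can be linked by a typed association, with the hierarchy expressed through the role each term plays.

  <topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
            xmlns:xlink="http://www.w3.org/1999/xlink">
    <!-- the term topics and the typing topics referenced below are assumed
         to be defined elsewhere in the map -->
    <association>
      <instanceOf><topicRef xlink:href="#broader-narrower"/></instanceOf>
      <member>
        <roleSpec><topicRef xlink:href="#broader-term"/></roleSpec>
        <topicRef xlink:href="#contract-law"/>
      </member>
      <member>
        <roleSpec><topicRef xlink:href="#narrower-term"/></roleSpec>
        <topicRef xlink:href="#employment-contracts"/>
      </member>
    </association>
  </topicMap>

Because a term can take part in any number of such associations, parallel hierarchies are represented directly, with each hierarchy distinguished by its association type.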

A Namespace Webcrawler

On the final day of the conference, Eric van der Vlist presented the results of his innovative attempts to harvest information about namespace usage on the web.

Van der Vlist explained the problems associated with finding authoritative information on existing XML vocabularies. While there are a number of schema repositories, these are isolated from one another with little or no attempt to compare equivalent or overlapping vocabularies produced by different communities. Van der Vlist was also interested in generating statistics on how often vocabularies were being used, e.g. to identify whether usage might be declining -- something that might influence whether a particular vocabulary would be used in a project.

The key insight was that publishing information on namespaces and vocabularies is basically web publishing, so the same tools and techniques can be applied to indexing the results. Van der Vlist therefore constructed an application, based around an open source web crawler and the 4Suite RDF database, which he used to search the web and identify pages in which namespaces were used. The system used regular expressions to perform an initial parse of the harvested documents, so that it could capture namespace usage data even from documents that weren't well-formed. Detailed analysis of well-formed documents was carried out using XSLT. The summarized statistics for individual namespaces, supplemented by references to the original documents, editorial commentary, and links to relevant news feeds, were then published using RDDL.
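
As a sketch of the XSLT stage (my own illustration; the stylesheets themselves were not presented in this level of detail), a small XSLT 1.0 transform can walk a well-formed document and print every namespace URI in scope on its elements, leaving de-duplication and counting to the database layer.

  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <!-- visit each element and report the namespace nodes in scope on it;
         duplicates are expected and are collapsed when the results are stored -->
    <xsl:template match="*">
      <xsl:for-each select="namespace::*[name() != 'xml']">
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
      <xsl:apply-templates select="*"/>
    </xsl:template>
  </xsl:stylesheet>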

The initial results, although not statistically significant (only a small sample of documents has been harvested so far), were intriguing: very few harvested documents used namespaces or were well-formed, the latter being not particularly surprising. The two most common namespaces were XHTML and those associated with Microsoft Office exports; none of the documents containing the latter was well-formed.

A more extensive crawl and subsequent tracking of namespace usage could provide a very useful additional information resource for developers. The results will not only provide information on activity in particular namespaces, but can also lead them directly to relevant material -- specifications, schemas, tutorials, even mailing list discussions in which the namespace has been quoted in an example. Van der Vlist issued a call for sponsors to help with this more ambitious, but worthy goal.