What You See Isn't What We Want

June 13, 2001

Leigh Dodds

Amid all the attention currently focused on XML as a means for business data interchange, it seems as if its original use -- marking up documents -- has been neglected. This week the XML-Deviant highlights a frequently asked question which shows that many people are still seeking how best to move to XML as a document format.

Frequently Sought Conversion

Frequently asked questions are the lifeblood of any Internet forum, to such an extent that they become enshrined in FAQ documents, thereby avoiding the need for old hands to explain some particular solution for the millionth time. They are also useful barometers for showing the sticking points in any new technology. If users keep encountering the same few problems, it might be a sign that there is an underlying problem, misconception, or fallacy that's being propagated, or there may be shortfalls in the availability of good tools and training.

The Deviant has noted that one of the most common FAQs at the moment is "how do I convert my Word documents to XML?" This one has cropped up in several forums and appears regularly on XML-DEV. And, like many seemingly simple questions, it has a variety of answers.

If the intention is simply to convert a small number of documents to XML or HTML suitable for publishing on the Web, then using the built-in Save As XML/HTML option is a good starting point. But the results are messy, to say the least: a great deal of Word-specific cruft is left in the resulting document. This has led to the production of numerous tools capable of cleaning up the mess, as well as others, like Omnimark, that provide an alternative conversion facility.

In some cases what is being asked is much more ambitious. Users may have a large number of documents that must be converted, and they may want to continue to use Word as an authoring tool for the generation of structured XML documents conforming to a particular schema, for use in a publishing system or document repository. These are the users who are plainly keen to gain some of the widely advertised advantages of XML by moving their documentation out of a proprietary format.

In these circumstances the received wisdom seems to be to roll up your sleeves and begin coding. The key technique is to use Word styles (user-defined formatting properties) as markers for particular document structures (paragraphs, lists, headings, etc.) and then use scripts or macros to generate markup based on this styling information. Further processing, with XSLT for example, can then refine the results to yield the desired format. This may come as a surprise to users who were hoping for an off-the-shelf solution.
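To make the style-driven technique concrete, here is a minimal sketch in Python. It is an illustration only, not a tool mentioned in the discussion. It assumes the paragraphs have already been extracted from Word as (style name, text) pairs, a step not shown here, and the style names, the `STYLE_TO_ELEMENT` table, the `paragraphs_to_xml` function, and the target element names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word style names to element names in a
# made-up target vocabulary; a real project would use its own schema.
STYLE_TO_ELEMENT = {
    "Heading 1": "title",
    "Body Text": "para",
    "List Bullet": "item",
}


def paragraphs_to_xml(paragraphs):
    """Turn (style_name, text) pairs into an XML element tree."""
    root = ET.Element("document")
    current_list = None  # the open <list> element, if any
    for style, text in paragraphs:
        tag = STYLE_TO_ELEMENT.get(style)
        attrs = {}
        if tag is None:
            # Unmapped style: keep the text, but flag it for post-editing.
            tag = "para"
            attrs["unmapped-style"] = style
        if tag == "item":
            # Consecutive list paragraphs share one <list> wrapper.
            if current_list is None:
                current_list = ET.SubElement(root, "list")
            parent = current_list
        else:
            current_list = None
            parent = root
        ET.SubElement(parent, tag, attrs).text = text
    return root


# Assumed sample input; the last paragraph uses a style with no mapping.
sample = [
    ("Heading 1", "Introduction"),
    ("Body Text", "Some opening text."),
    ("List Bullet", "First point"),
    ("List Bullet", "Second point"),
    ("My Odd Style", "A stray paragraph"),
]
print(ET.tostring(paragraphs_to_xml(sample), encoding="unicode"))
```

The flat sequence of styled paragraphs has to be turned into nested structure (here, wrapping adjacent list items in a single list element), and anything styled outside the mapping can only be flagged for a human to fix. An XSLT pass could then reshape this intermediate output into the final schema.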

Unfortunately, this approach isn't as simple as it may appear, as Marcus Carr recently noted during the latest discussion of Word-to-XML conversion issues:

As enticing as it sounds, it doesn't really work. If you rely on the consistent application of styles, you will be let down by people. If you rely on macros and/or interface prompting, you'll build something that's difficult to maintain and modify, assuming you can make it rigid enough that nobody can screw it up anyway. Structured authoring in Word is a bit of a holy grail, and the very short life of Microsoft's own 'SGML Author' in the early to mid-nineties indicates that it might truly be mythical.

This reliance on users applying styles correctly seems to be a key factor, as many contributors also observed. Writing to the XML-DOC mailing list, Pete Beazley made similar comments:

...the consistency of styles and formats in your Word documents is the most important factor in determining how much post-editing will be required -- certainly more important than the tool itself, at least in my opinion.

When you are moving content from Word to XML, you are moving from an unstructured document to a structured one. The paragraph styles, font weights, sizes, and color in the Word document all provide clues to the structure and semantics of that document. The quality of the conversion results will depend directly on how well you are able to define a mapping from style/format to structure and semantics and how consistent those styles and formats are.

Speaking from positive experience, Alan Kent believed that offering only a small number of available styles to choose from can improve user performance.

...we have done this on several projects successfully. Its not perfect, and keeping the supported styles low helped, but the users were much happier being able to use Word rather than having to learn to use a different editor for just one task they had to do.

...Bottom line is you can get it to work, but ... keeping the number of styles down to a minimum is important as its harder for users to make mistakes then.
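One way to hold authors to a small style set is to audit each document before conversion. The following Python sketch assumes the same kind of (style name, text) pairs extracted from Word by some other means; the `ALLOWED_STYLES` set and the `audit_styles` function are hypothetical names chosen for illustration.

```python
from collections import Counter

# Hypothetical short list of styles that the conversion supports.
ALLOWED_STYLES = frozenset({"Heading 1", "Body Text", "List Bullet"})


def audit_styles(paragraphs, allowed=ALLOWED_STYLES):
    """Count paragraphs whose style is outside the supported set."""
    return Counter(
        style for style, _text in paragraphs if style not in allowed
    )


# Assumed sample input with two paragraphs in an unsupported style.
sample = [
    ("Heading 1", "Introduction"),
    ("Normal + Bold", "A hand-formatted heading"),
    ("Body Text", "Some text."),
    ("Normal + Bold", "Another hand-formatted heading"),
]
for style, count in audit_styles(sample).most_common():
    print(f"{style}: {count} paragraph(s) need attention")
```

A report like this does not fix anything by itself, but it can tell an author which paragraphs to restyle before the conversion runs.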

The Best We Can Do?

It appears, then, that the current best practice is far from optimal. And there may be another sting in the tail. The discussion so far has considered only the authoring of documents. What about editing and maintaining a document corpus in this way? This is a natural part of any document life cycle, particularly for technical documentation. How does one go back from XML to a Word WYSIWYG interface? Indeed, does using this kind of tool make sense in these circumstances? Bob DuCharme claims that:

...if you really want to use Word to read an XML file, edit it, then save that as XML just like the original only with your changes, you're out of luck. Your best bet is to use another XML editor and then convert the output to Word as described above for anyone who needs it that way.

This is perhaps not quite the answer that people are looking for, especially those hoping to leverage an existing installed base of office application suites rather than switching to a dedicated XML editor. However, there are other considerations that must be taken into account and that might favor a less-than-optimal solution, as Soumitra Sengupta argued.

...the reality is that the cost of retraining millions of MS Office users to use powerful XML Editors is enormous. But the benefits of getting content into XML for reuse is huge. It can save organizations tons of money by reducing manual cutting and pasting and errors in re-entering. What is wrong if we can get a large part of the way there by using domain specific constraints and then installing a less laborious and less cumbersome process of "cleaning up markup errors". You can not argue that it does not have value although it may not be perfect? As for the programming for different types of documents, one way to address it is XSLT and a good XSLT development tool.

Marcus Carr expressed some reservations on this front, advocating alternative solutions.

...I wouldn't mind seeing some statistics on the cost of retraining users compared to building an application that leverages from their existing knowledge. Cleaning up markup errors is exactly what I spent almost all of the past decade doing, but if I was asked whether a company would be better off adopting that approach for the long-term or retraining their workforce and spending the money up front, I wouldn't need to spend a lot of time thinking about it.

In total agreement with this viewpoint, Michael Champion pointed out that customizing an XML editor is not only an easier task, it also offers more opportunities for reuse than a Word solution:

Products such as XMetaL are very easy for end-users to use once they've been setup with schemas, stylesheets, and some UI customization for the specific application. It's probably easier for a typical XML developer to learn how to customize XMetaL than it is to setup all the scripts, styles, templates etc. that a Word add-in would require.

Finally, "native XML everywhere" really does make it easy to build robust systems out of off-the-shelf components. If you use a Word addin, you're going to be building a lot of proprietary scripts, etc. and be locked in to specific vendors' tools. If you build a native XML editing application with XMetaL or something similar, you'll be able to re-use the DTDs, stylesheets, DOM code, HTML templates, webserver/database interfaces, etc. even if you switch products.

This seems almost like a no-brainer to me. We all promote our XML products because they're built on open, interoperable standards ... if we think our XML solutions are worth buying, we should be buying "synergistic" XML products from others, right? If we need XML content, use an XML authoring system! If you don't and want free-form content, use Word. If you need both, use the right tool for the job at hand.

While on the surface both options seem labor-intensive, the native XML route is certainly more open. And there may be other advantages to be gained if XML editors become sufficiently sophisticated. As John Turnbull pointed out, a decent editor does more than just generate pointy brackets:

The production of useful, valid XML that adheres to a prior agreement in a practical editing interface presents such a difficult set of problems that only a tiny handful of companies has even considered it. Two or three have succeeded.

But that's just the beginning. The more interesting problem is how you make a project report *act* differently from an RFP during its creation, or make a legal contract act differently for the user, than does a set of meeting notes. When you succeed at that, you are no longer just constraining your user, you are simplifying the authoring task while you assure yourself of input that your software can safely process.

The distinctions among editors that produce valid XML have very little to do with the production of a valid document, but a great deal to do with the degree to which they are open to customization and integration into larger XML systems. When these systems evolve, the customizations and integrations have to evolve with them. Good XML "editors" are actually developer tools that allow the creation of good XML editing interfaces -- for any kind of document. Most XML editors fail at both levels.

There is a certain irony in the fact that some big hurdles still face the widespread adoption of XML as a business and technical document format, despite the number of lessons that can be learned from SGML. Or perhaps the real lesson here is that these issues are resistant to simplification and there are no quick wins. The real low-hanging fruit seems to be in using XML to exchange simple data structures, which explains the flurry of activity in that sector at present.