From Word to XML
December 30, 2003
Among the most-asked XML questions of all are those which ask how to process XML using a client application with which the questioner is already familiar. The bulk of these questions, in turn, focus on XML's virtues as an open, structured-data medium: "How do I use XML in a database?" for instance, or "How can I convert my XML document into an Excel spreadsheet (or vice-versa)?"
But, especially given its roots in SGML and HTML, XML functions equally well as an open, structured-document medium. And that's where this month's question comes from.
Note: I don't pretend that my answer here is definitive or encyclopedic. It covers only one solution among a host of alternatives. If the response to past columns of this sort is any indication, within a week or two you'll be able to find numerous reader-supplied comments at the end of the article, giving you pointers to other options.
Q: How can I convert a Microsoft Word document into XML?
A: Recent versions of Word claim "save as XML" features of one kind or another. Maybe that "claim" is too harsh; they do create well-formed XML documents, after all. But it's XML of a spectacularly hideous form, even for simple documents -- nearly as gnarly and impenetrable to the human eye as XSL-FO.
(For a good idea of what to expect, see A. Russell Jones's recent article on devx.com, "Export Customized XML from Microsoft Word with VB.NET." Don't worry if you don't know or care anything about VB.NET; just check out that article's Figure 1 -- which shows how the document appears in Word -- and its Listing 1 as well. The latter is the output of the document coming from Word 2003's "save as XML" feature.)
Whether you like or don't like Word, or use it in your everyday working life, you may be called upon to convert a Word document to XML at some point. And if you don't even have Word in the first place, the quality of the word processor's "save as XML" output is moot anyway. What do you do then?
A good place to start searching when you're pretty sure software for processing XML must exist, but you don't know where to find it, is xmlsoftware.com. In this case, use the site menu to locate the "Conversion Tools" page.
As you can see, most XML-to/from-Word packages don't process "true" Word documents in the classic .doc form. Instead, they rely on Word's long-standing support for Rich Text Format (RTF). (RTF documents are "structured", after a fashion. But the language is intended primarily to support the display of textual matter -- not unlike Adobe's PDF. If you'd like to learn more about RTF, check the Microsoft site. Another good source is the interglacial.com site, put together by Sean M. Burke, author of The RTF Pocket Guide, published in 2003 by O'Reilly and Associates.)
upCast: Word to RTF to XML
At least one of the XML conversion tools on the xmlsoftware.com site does support native Word .doc conversion: upCast, from infinity-loop GmbH. In this column I'll take a look at how upCast (currently at version 4) does its work.
First, let's get the questions of platforms and licenses out of the way. upCast is Java-based and thus available cross-platform, with installers for Windows, Unix, and Macs. The licensing comes in a variety of flavors, including (among others) a commercial product, a free evaluation, and a "private" (single user, non-commercial) version.
After installing upCast and browsing through its documentation (and the infinity-loop site), you find that its .doc file support is limited in one sense: the .doc file(s) in question must have been created using Word 97 (or later), on a PC running Windows 95, 98, NT, or 2000. For other, earlier versions of Word and/or Windows, the document first must be saved as RTF; the RTF file then is fed into the upCast conversion process. This limitation shouldn't be a problem for most Windows users, but it is something to bear in mind.
The .doc support relies on one other requirement: it uses an add-in, provided with upCast, called WordLink; this add-in saves the binary .doc as a temporary RTF file, using a copy of Word which is installed on the user's machine. So WordLink isn't available for Mac- and Unix-based upCast users. Hence, upCast users on these platforms are limited to processing RTF files only.
Running upCast is fairly simple. The main dialog box consists of two sections:
- The upper section ("Import Settings") is for specifying input parameters, chief of which is the name of the source file to be converted:
Figure 1: upCast import settings
- The lower section ("Export Settings") lets you identify the name and properties of the output file:
Figure 2: upCast export settings
In the second screen shot, I've pulled down the selection list to show what you can do with upCast. By default, the program outputs an XML document using upCast's own built-in DTD. Here's a fragment of a resulting document in this vocabulary:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE document PUBLIC "-//infinity-loop//DTD upCast 4.0//EN"
<?xml-stylesheet type="text/css" href="helloworld.css"?>
style="widows: 0; orphans: 0; word-break-inside: normal; \-ilx-block-border-mode: merge;">
<property name="title" value="Hello" type="text" />
<property name="author" value="John Simpson" type="text" />
<property name="numberOfPages" value="1" type="integer" />
<part style="page: pageStyle1;">
<par class="Normal">Hello world!</par>
This has a number of interesting features (highlighted in bold, above).
First, note the xml-stylesheet PI. In order to capture not only the contents of the document (which appear later, as text strings within par elements), but also its look-and-feel, upCast extracts style information from the RTF document being processed and writes it to a Cascading Style Sheet. A small fragment of this style sheet looks like this:
/* Paragraph Properties: */
/* Character Properties: */
font-family: "Times New Roman", serif;
With this style sheet and the PI, a viewer (such as a browser capable of displaying XML via CSS) can render the document's contents in something like the way they appear in the source document. This rendering isn't 100% exact, of course -- CSS doesn't do everything a word processor does, in exactly the same way, and browsers are notoriously inconsistent in the extent to which they support CSS.
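If you want to find the generated style sheet programmatically rather than through a browser, the xml-stylesheet PI can be read with an ordinary XML parser. The following is a minimal sketch in Python, not anything supplied by upCast; the sample document is a hypothetical stand-in modeled on the fragment above, and the function name stylesheet_hrefs is my own.

```python
import re
from xml.dom import minidom

# Hypothetical stand-in for an upCast output file. The element names follow
# the fragment shown in the article, but this is not real upCast output.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="helloworld.css"?>
<document>
  <par class="Normal">Hello world!</par>
</document>
"""


def stylesheet_hrefs(xml_text):
    """Return the href of every xml-stylesheet PI at the top of the document."""
    doc = minidom.parseString(xml_text)
    hrefs = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The PI's data is plain text, so its pseudo-attributes
            # (type, href, ...) have to be picked apart by hand.
            pseudo = dict(re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', node.data))
            if "href" in pseudo:
                hrefs.append(pseudo["href"])
    return hrefs


if __name__ == "__main__":
    print(stylesheet_hrefs(SAMPLE))
```

The DOM parser is used here because it keeps processing instructions that appear before the root element, which some simpler tree APIs discard.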
The second thing to notice about the output document is the two namespace declarations. One declares that the html: namespace prefix is associated with HTML 4.0. The other (more interesting) one identifies an xlink: namespace prefix. How does upCast use XLink? In several ways, including these:
- Each hyperlink (including e-mail addresses) in the original Word document is converted to a link element with numerous XLink-specific attributes, such as:
<par class="Normal"[other attributes]>e-mail:
- Each Word "bookmark" is translated into a reference element, which (like link) takes a variety of XLink attributes. The xlink:href attribute uses a fragment identifier to locate a specific portion of the document:
<reference xlink:type="simple" xlink:show="other"
(Note also, by the way, the use of alternative values for the
- Each image embedded in the Word document is referenced with an empty XLinking image element:
<image xlink:type="simple" xlink:href="myImage01.jpg"
As I said, actually being able to use such XLinking markup presumes the availability of XLink-smart software. The Mozilla browser can handle simple XLinks in XML documents; for example, the email hyperlink in the first of the above three bullets displays correctly as:
Figure 3: Mozilla view of upCast link element
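Even without XLink-smart software, the XLink attributes are easy to get at from a script. Here is a minimal sketch in Python that lists every element carrying an xlink:href attribute. The element names (link, reference, image) come from the bullets above; the sample document, its attribute values, and the function name xlink_targets are invented for illustration, and I'm assuming the elements are not in a namespace of their own.

```python
import xml.etree.ElementTree as ET

# The W3C XLink namespace, to which the xlink: prefix is bound.
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical sample document; the link targets are made up.
SAMPLE = """<document xmlns:xlink="http://www.w3.org/1999/xlink">
  <par class="Normal">e-mail:
    <link xlink:type="simple"
          xlink:href="mailto:someone@example.com">someone@example.com</link>
  </par>
  <par class="Normal">See
    <reference xlink:type="simple" xlink:show="other"
               xlink:href="#bookmark1">the introduction</reference>.
  </par>
  <image xlink:type="simple" xlink:href="myImage01.jpg"/>
</document>
"""


def xlink_targets(xml_text):
    """Return (element name, xlink:href value) pairs in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attribute names as {namespace-uri}local.
    href_attr = "{%s}href" % XLINK_NS
    return [
        (element.tag, element.get(href_attr))
        for element in root.iter()
        if element.get(href_attr) is not None
    ]


if __name__ == "__main__":
    for name, target in xlink_targets(SAMPLE):
        print(name, target)
```

Values that begin with "#" are the fragment identifiers generated from Word bookmarks; the others point to external resources such as mail addresses or image files.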
Again, though, you needn't use upCast simply to generate documents in upCast's own XML dialect. As you can see from the second screen shot above, other output options include XHTML 1.0 (Strict) and DocBook 4.2. (DocBook support is only beta-level, although I found no problems with it. And one thing it allows you to do is to migrate a document from Word to PDF, using software which generates PDF output from DocBook input, without using Adobe Acrobat itself.) As with the output to the native upCast vocabulary, selecting either the XHTML or the DocBook output format causes a corresponding CSS style sheet to be generated.
I did encounter some surprises in the resulting XHTML display, but only for Word features with no precise or consistently-renderable CSS counterparts. On the whole, though, the display was remarkably close to the original. For instance, here's a portion of a screen capture from a Word document, as displayed in Word:
Figure 4: Original document opened in Word
And here's the corresponding output of the upCast-generated XHTML document, viewed in Mozilla:
Figure 5: upCast-output version of above document, viewed in Mozilla
Not perfect, but very good. A particularly neat touch is the translation of the Word document's bookmarks into true hypertext equivalents, using fragment identifiers which scroll the browser directly to the correct portion of the document.
I haven't covered in this column the use of upCast's other output filter options. Like the upCast XML, XHTML, and DocBook outputs, these other options seem to work smoothly and with few surprises. (My favorite of these is the "XSLT Processor" feature, which first generates an XML document and then transforms it to some other form, by way of a user-supplied style sheet and the Apache Xalan XSLT processor.) Nor have I covered the use of infinity-loop's parallel XML-to-Word product, unsurprisingly called downCast. If you're interested in straightforward translation back and forth between Word and various XML formats, though, I encourage you to investigate these other tools on your own. And of course, by all means take a look at the other software on xmlsoftware.com's "Conversion Tools" page.