Word to XML and Back Again

December 8, 2004

Peter Sefton

A recent article on the O'Reilly Network showed how to edit XML using Word 2003, as long as your target XML format was not too far removed from the built-in structural limitations of a word processor, and last year there was a survey of solutions on XML.com. But since Word 2000, it has been possible to "export as XML" if you are up for a little bit of post-processing.

In fact, Rick Jelliffe blogged about this year's Open Publish conference: "If I were to pick a theme or meme, it was that the decision on whether and how to support Word was by far the most critical decision for most large XML deployments." The "whether" is a big question, but here's something about how to support Word in an XML project.

In this article, I will show you how to take the frighteningly messy result of Word's "Save as Web Page" and turn it into well-formed XML, using a few lines of Python and a touch of XSLT. Grab the sample Python application, and if you have libxml2 installed, you can type:

python wordconverter.py mydoc.htm > mydoc.xml
python wordconverter.py mydoc.xml > mynewdoc.htm

(Ignore the complaints from the libxml2 parser.)

Even if you do use Word 2003 (and many of us don't), you may find that this is a more useful approach than WordprocessingML--the Word 2003 XML format--particularly if you are producing web pages. One major advantage of the hack I will show here is that it gives you pre-rendered, web-ready versions of your images, equations, graphs, and so on, nicely linked in img elements. You just have to remove the non-HTML parts.

I have been using this technique for more than four years, both with a commercial-but-free-to-use processor that I helped specify, and with a .NET version that I worked on for a former employer. These techniques are well tested on thousands of documents.

To be really useful, you will need to create templates for your authors so that you have predictable outputs to turn into XML. Unfortunately, good template design, and the benefits of basing your custom document types on HTML, are topics beyond the scope of a single article. There is more about template design, particularly for HTML output, at my site, in my Word Processor Interoperability project.

If you "Save as Web Page" in Word 2000 or beyond and open the result in a text editor, you will see something that is nearly XML, but with some craftily designed hacks to make the document accessible from a web browser (well, Internet Explorer, anyway) while containing enough embedded code to reconstitute a Word document in almost all of its glory. This widely reviled format is classic Microsoft. When first introduced, it used to crash or confuse competing browsers.

The goal for this little project is twofold: first, to figure out how to get from Microsoft's format to well-formed XML (we will not be validating this format with a schema or DTD), as from there it is straightforward to use XSLT or the language of your choice to transform the document, probably for rendering. For rendering, the trick is to simply discard most of the proprietary, undocumented Word features and transform the basic HTML paragraphs and tables into something useful, maybe even valid XHTML.

The second part of the goal is to be able to reverse the process, and turn the XML back into a Word document. You could use this to make minor changes to an existing document, such as changing metadata or incorporating comments from a web site. Or you could create entirely new documents, based on a shell, and use Word to render or edit them.

Now to work. The HTML head element starts off with some pretty standard stuff. All we need to do here is quote the attribute values and close the unclosed meta and link elements. For this, I have settled on libxml2's HTML document parser, as discussed recently on XML.com, over the more obvious alternative of Python's own standard sgmllib. The main problem with sgmllib is that it turns all characters in element names into lowercase, so to round-trip the document back into Microsoft's format, we would need to use a big lookup table or a hacked version of the library.

<head>
<meta http-equiv=Content-Type
   content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 11">
<meta name=Originator content="Microsoft Word 11">
<link rel=File-List
    href="word2xml_files/filelist.xml">
<link rel=Edit-Time-Data
    href="word2xml_files/editdata.mso">
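The case-folding problem described above for sgmllib is easy to reproduce. Python 3 no longer ships sgmllib, but the standard library's html.parser folds tag names the same way, so this minimal sketch uses it as a stand-in. The TagCollector class is made up for illustration and is not part of the sample application.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag name the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag names already folded to lowercase
        self.tags.append(tag)


collector = TagCollector()
collector.feed("<o:DocumentProperties><o:Author>Peter Sefton</o:Author>"
               "</o:DocumentProperties>")
collector.close()
print(collector.tags)
```

The mixed-case name DocumentProperties is lost at parse time, which is why a lowercasing parser cannot write Word's markup back out without a lookup table.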

Parsing an HTML document in libxml2 is one-line simple (assuming that doc contains your document as a string), and it deals with both attributes and empty elements with aplomb:

import libxml2
htmldoc = libxml2.htmlParseDoc(doc, None)

This gives you an XML document, htmldoc, that you can process like any other XML document. But it's not that simple. There are some limitations to libxml2's considerable powers, starting with the fact that it does not seem to understand the XML-style namespace declarations that Word puts in at the top of the document, even though they are fine in XML. You may also run into issues with encodings, particularly when using fonts such as Wingdings and the like. I have not attempted to deal with this issue in detail, but there is a commented hack in the sample file that should get you started.

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

So when libxml2 encounters bits of a document like this in the office namespace:

<o:p> </o:p>

it complains, and turns them into this:

<p/>

We can't have elements moving between namespaces, from the office namespace to the default HTML namespace, but there is a simple solution: replace the : character with a _ character, and then put it back later. The latter bit we'll do using XSLT.

import re

# Hide namespaces from libxml2's HTML parser.
# \w+ (rather than a single \w) also catches multi-character prefixes.
qualifiedname = r'<(/?)(\w+):(\w)'
hackedname = r'<\1\2_\3'
doc = re.sub(qualifiedname, hackedname, doc)
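To see the substitution at work outside the full converter, here is a small self-contained sketch. The function names hide_namespaces and restore_namespaces are made up for this illustration. The article restores the colon in XSLT, so the Python inverse is only here to check that the hack is reversible, and it assumes that no prefix and no plain HTML element name contains an underscore.

```python
import re


def hide_namespaces(doc):
    """Replace the colon in qualified element names with an underscore."""
    return re.sub(r'<(/?)(\w+):(\w)', r'<\1\2_\3', doc)


def restore_namespaces(doc):
    """Inverse of hide_namespaces; assumes prefixes contain no underscore."""
    return re.sub(r'<(/?)([A-Za-z0-9]+)_(\w)', r'<\1\2:\3', doc)


sample = '<p class=MsoNormal>Text<o:p> </o:p></p>'
hidden = hide_namespaces(sample)
print(hidden)
print(restore_namespaces(hidden) == sample)
```

Because the pattern is anchored on the opening angle bracket, namespace declarations such as xmlns:v in the html start tag are left alone.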

The namespace issue is a bit ugly, but nothing compared to the horror of what I like to call the Mutant Markup Declaration (MMD), which is the dirty trick used to hide proprietary Word data in an HTML file. There are two variants of the MMD.

The first kind of MMD, which starts with <!--[if some-condition ]> and ends with <![endif]-->, is a species of comment, used to hide things from "normal" software:

<!--[if gte mso 9]><xml>
<o:DocumentProperties>
…
<o:Author>Peter Sefton</o:Author>
…
</o:DocumentProperties>
</xml><![endif]-->

Ironically, inside of the comment is pure, well-formed XML, thoughtfully wrapped in <xml> tags to emphasize the point. This is just a couple of regular expressions away from being XML. But how to do it? The most obvious way would be to turn the MMDs into processing instructions (PIs), as that is really their function. Unfortunately, though, libxml2 ignores PIs when parsing HTML, so I settled on the ugly-but-safe approach of using empty elements, and made-up ones at that.

Two substitutions will fix the MMDs:

startComment = r"<!--\[(.*?)\]>"
startCommentReplace = r"<mmd value='\1' comment='start' /><div language='mso-conditional'>"
doc = re.sub(startComment, startCommentReplace, doc)

endComment = r"<!\[(.*?)\]-->"
endCommentReplace = r"</div><mmd value='\1' comment='end' />"
doc = re.sub(endComment, endCommentReplace, doc)
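As a rough check that the escaped markup really is well-formed, the sketch below runs the two comment substitutions (with a value attribute on both mmd elements) plus the namespace hack over the document-properties fragment shown earlier. It then parses the result with the standard library's ElementTree rather than libxml2, purely so that the example runs without extra installs. The wrapping root element is made up so the fragment has a single top-level element.

```python
import re
import xml.etree.ElementTree as ET

word_html = """<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>Peter Sefton</o:Author>
</o:DocumentProperties>
</xml><![endif]-->"""

# Turn the opening and closing halves of the conditional comment into elements
doc = re.sub(r"<!--\[(.*?)\]>",
             r"<mmd value='\1' comment='start' /><div language='mso-conditional'>",
             word_html)
doc = re.sub(r"<!\[(.*?)\]-->", r"</div><mmd value='\1' comment='end' />", doc)

# Hide the o: prefix, as in the namespace hack
doc = re.sub(r"<(/?)(\w+):(\w)", r"<\1\2_\3", doc)

# Made-up wrapper element so the fragment parses as one document
root = ET.fromstring("<root>" + doc + "</root>")
author = root.find(".//o_Author").text
print(author)
```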

Here's an example that illustrates a few more challenges. If you have a style that you use for lists in Word, called L1* (for list, first level, with a bullet), it might look something like this:

  • Bullet point
  • Bullet point
  • Bullet point

In Word's format, each paragraph looks like this (don't look at this if you're squeamish; it's not pretty):

<p class=L10>
<![if !supportLists]>
<span lang=EN-AU style='font-family: Symbol;
mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol'><span
style='mso-list:Ignore'>•<span style=
'font:7.0pt "Times New Roman"'>...some spaces...
</span></span></span>
<![endif]>
<span lang=EN-AU>Bullet point</span></p>

This is the second variant of the Mutant Markup Declaration, the one that is not wrapped in a comment. Here it marks the beginning and end of some rendering information that uses non-breaking spaces for rendering the list. This works (sort of) in conjunction with a CSS stylesheet embedded in the document's head. These MMDs are easily dealt with:

startMMD = r'<!\[(.*?)\]>'
# Raw string, so that \1 is a backreference rather than a control character
startMMDReplace = r"<mso-declaration value='\1' />"
doc = re.sub(startMMD, startMMDReplace, doc)

endMMD = r'<!\[endif\]>'
endMMDReplace = "<mso-declaration value='endif' />"
doc = re.sub(endMMD, endMMDReplace, doc)

Now we can put it all together:

def parsehtmlfile(self, htmfilename):
  self.htmfilename = htmfilename
  self.doc = open(htmfilename).read()
  #Remove mutant markup using regular expressions
  self.doc = EscapeMMD(self.doc)
  #Create a libxml2 XML document
  self.htmldoc = libxml2.htmlParseDoc(self.doc, None)
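The body of EscapeMMD is not shown here, so the following is only a guess at what such a function might contain, built from the substitutions described above. The name escape_mmd and the sample string are made up for the illustration, and the function in the sample application may well differ. The order matters: the comment-style declarations go first, so that the looser pattern for bare declarations cannot swallow a closing <![endif]-->.

```python
import re


def escape_mmd(doc):
    """Escape Word's conditional markup and namespace prefixes (a sketch)."""
    # 1. Comment-style MMDs: <!--[if ...]> ... <![endif]-->
    doc = re.sub(r"<!--\[(.*?)\]>",
                 r"<mmd value='\1' comment='start' />"
                 r"<div language='mso-conditional'>",
                 doc)
    doc = re.sub(r"<!\[(.*?)\]-->",
                 r"</div><mmd value='\1' comment='end' />",
                 doc)
    # 2. Bare MMDs: <![if ...]> and <![endif]>
    doc = re.sub(r"<!\[(.*?)\]>", r"<mso-declaration value='\1' />", doc)
    # 3. Hide namespace prefixes from the HTML parser
    doc = re.sub(r"<(/?)(\w+):(\w)", r"<\1\2_\3", doc)
    return doc


sample = (
    "<!--[if gte mso 9]><xml><o:Author>Peter Sefton</o:Author></xml><![endif]-->\n"
    "<p class=L10><![if !supportLists]><span>\u2022</span><![endif]>"
    "<span lang=EN-AU>Bullet point</span></p>"
)
escaped = escape_mmd(sample)
print(escaped)
```

Note that step 2 handles <![endif]> with the same pattern as <![if ...]>, so a separate endif substitution is not needed in this version.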

  

There is one more complication to deal with. The list items in the original Word document had the style L1*, but the paragraph here is marked as class="L10". We need to look in the CSS stylesheet, in the head, to resolve this indirection. Here you will find a CSS rule that contains the property we are looking for: mso-style-name. The trick here is to extract the stylesheet and build a lookup table of class names, so you can say getstylename('L10') and get the answer L1*.

p.L10, li.L10, div.L10
  {mso-style-name:L1*;
  mso-style-parent:B1;
  margin-top:6.0pt;
  ...
  mso-ansi-language:EN-AU;}

So we grab the contents of all of the style nodes:

  styleNodes = self.htmldoc.xpathEval("//*[local-name() = 'style']")
  styles = ''
  for styleNode in styleNodes:
      styles += styleNode.serialize()

And call something to extract the style names and store them in a dictionary:

   self.extractstyles(styles)
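The style-extraction step itself is not shown, so here is one way the lookup table could be built, offered as a sketch rather than as the sample application's code. The function name extract_styles is made up. The sketch assumes each rule is a simple selector list followed by a brace-delimited block, as in the excerpt above, and that a style name may arrive wrapped in quotes.

```python
import re


def extract_styles(css):
    """Map CSS class names to Word style names via mso-style-name (a sketch)."""
    table = {}
    # Each match is one rule: the selector list, then the declarations in braces
    for selectors, body in re.findall(r'([^{}]+)\{([^{}]*)\}', css):
        name = re.search(r'mso-style-name:\s*([^;]+)', body)
        if not name:
            continue
        # Assumption: names may be quoted, so strip surrounding quotes
        style_name = name.group(1).strip().strip('"\'')
        # Every .classname in the selector list gets the same style name
        for class_name in re.findall(r'\.([\w-]+)', selectors):
            table[class_name] = style_name
    return table


css = """
p.L10, li.L10, div.L10
  {mso-style-name:L1*;
  mso-style-parent:B1;
  margin-top:6.0pt;
  mso-ansi-language:EN-AU;}
"""
styles = extract_styles(css)
print(styles.get('L10'))
```

A getstylename method then reduces to a dictionary lookup that returns nothing for classes with no Word style behind them.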

Then it's a matter of using the magic of XPath to visit every node in the document that has a class attribute and, if the lookup table has an entry for that class, adding another made-up attribute: mso-style-name.

  classNodes = self.htmldoc.xpathEval('//*[@class]')
  for classy in classNodes:
      className = classy.prop('class')
      msoStyle = self.getstylename(className)
      if msoStyle:
          classy.newProp('mso-style-name', msoStyle)
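The same annotation step can be tried without libxml2. This sketch swaps in the standard library's ElementTree, whose limited XPath support includes the [@attribute] predicate. The style_names dictionary and the tiny body fragment are made up to stand in for the real lookup table and document.

```python
import xml.etree.ElementTree as ET

# Stand-in for the lookup table built from the embedded stylesheet
style_names = {'L10': 'L1*'}

body = ET.fromstring(
    '<body><p class="L10"><span>Bullet point</span></p><p>Plain</p></body>'
)

# Visit every element that carries a class attribute
for node in body.findall('.//*[@class]'):
    mso_style = style_names.get(node.get('class'))
    if mso_style:
        # Record the original Word style name alongside the CSS class
        node.set('mso-style-name', mso_style)

print(ET.tostring(body, encoding='unicode'))
```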

Now we have a libxml2 document object ready to serialize. The weird Microsoft markup has been escaped into mmd elements, and namespaces have been escaped. The final step is to use a little bit of XSLT to serialize the document. The only interesting part of this is the part that puts the namespaces back, by matching elements that have an underscore in their names and doing some string manipulation to reinstate the namespaces.

<xsl:template match="*[contains(local-name(),'_')]">
  <xsl:variable name="new-name" select="concat(substring-before(local-name(), '_'), ':',
       substring-after(local-name(), '_'))"/>
  <xsl:element name="{$new-name}">
    <xsl:apply-templates select="@*|node()" />
  </xsl:element>
</xsl:template>

Finally, as promised, the return trip. There are only a few lines of Python, because I did the work in XSLT. This makes it portable across programming languages.

import libxml2
import libxslt

class xmltoword:
   xmldoc = ''
   # wordxml2html is assumed to hold the XSLT stylesheet source as a string
   styledoc = libxml2.parseDoc(wordxml2html)
   style = libxslt.parseStylesheetDoc(styledoc)

   def __init__(self):
      pass

   def parsexmlfile(self, fileName):
      self.xmldoc = libxml2.parseFile(fileName)

   def output(self):
      return self.style.applyStylesheet(self.xmldoc, None).serialize()

This kind of hackish transformation is but one way to approach the issue of getting Word documents into XML. One limitation is that while the round trip gives you a document that Word will accept, there are changes to whitespace and character encodings, which means you cannot automate testing across a large set of documents. Building a higher-fidelity version would require a custom parser. There are also ways of extracting XML from .doc files, and macro-based approaches, even using the OpenOffice.org word processor, which can read Word files (give or take a bit) and natively saves documents in XML. But as far as I know, my approach gives you the best crack at round-tripping documents, rather than siphoning off XML, provided you leave the proprietary stuff intact.

Before using this technique or the sample code on too many documents, try it out with a representative sample of real material from your users. And do be careful: my doctor friends tell me that they are still seeing a lot of injuries from the sharp edges on the inside of Microsoft Office files.