XML.com: XML From the Inside Out

Adventures with OpenOffice and XML
by Matt Sergeant

Putting It All Together

As an illustration, you can download the source XML file of this article, which was written in OpenOffice build 605, saved as XML, and then transformed using the techniques below.

How can we put the XML generated from OpenOffice to good use? What XML geeks really want to see is a free equivalent of WYSIWYG XML editors like XMetaL or Adept. And here it is. If we restrict ourselves (or our customers) to using defined styles, OpenOffice can truly be a structured XML editor, and its users never need to know they are editing XML.

By processing the XML generated by OpenOffice, we can turn tags like <text:h text:style-name="P10"> into something significantly easier to work with like <Heading_3>. And for structured XML, we really don't need all the font and page settings. But some of the style information may be of interest; for example <span> tags may point to XSL FO styles -- which are almost identical to CSS styles -- so these might be useful in trying to get a similar look if we translate the page to HTML.

We could do this transformation with XSLT, but I prefer XPathScript. It feels more natural to me because I can use variables, define functions, and pass parameters.

The code below will only work on current releases of OpenOffice (and probably works best on files saved from build 605), due to the aforementioned changes in the automatic styles functionality.

From automatic style to the real style

First we need to find an XPath expression that will take us from the text's style name (which will be an automatic style name like "P1") to the real style name. This is actually rather simple.

 
/office:document
/office:automatic-styles
/style:style
[@style:name="P1"]
/@style:parent-style-name

It finds the style:parent-style-name attribute of the automatic style. I call this the "actual style".
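The same lookup can be sketched outside XPathScript. The following is a minimal Python illustration, not the article's stylesheet: the namespace URIs, the sample document, and the helper name `actual_style` are assumptions made for the example, so check them against a real OpenOffice file before relying on them.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumed here; adjust them to match your own file.
NS = {
    "office": "http://openoffice.org/2000/office",
    "style": "http://openoffice.org/2000/style",
}

# Tiny hand-written stand-in for an OpenOffice XML document.
SAMPLE = """\
<office:document xmlns:office="http://openoffice.org/2000/office"
                 xmlns:style="http://openoffice.org/2000/style">
  <office:automatic-styles>
    <style:style style:name="P1" style:parent-style-name="Heading 1"/>
    <style:style style:name="P2" style:parent-style-name="Text body"/>
  </office:automatic-styles>
</office:document>
"""

def actual_style(root, auto_name):
    """Return the parent (actual) style name of an automatic style, or None."""
    name_attr = "{%s}name" % NS["style"]
    parent_attr = "{%s}parent-style-name" % NS["style"]
    # root is the office:document element, so the path starts below it.
    for style in root.findall("office:automatic-styles/style:style", NS):
        if style.get(name_attr) == auto_name:
            return style.get(parent_attr)
    return None

root = ET.fromstring(SAMPLE)
print(actual_style(root, "P1"))  # Heading 1
```

The loop compares the attribute by hand because namespaced attribute predicates are awkward in ElementTree's limited XPath support; a full XPath engine could evaluate the expression above directly.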

We can translate the actual style to a string we can use for an element name by replacing spaces with underscores using XPath's translate() function; it will change "Heading 1" to "Heading_1".

A name mapping

Next we need to set up a name mapping to translate style names to a preferred form. For example, we translate "Text_body" to "para".

Mappings are trivial in Perl (and hence XPathScript): we simply set up a hash.

 
my %stylemap = (
 Text_body => "para",
);
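To show how the space replacement and the mapping fit together, here is a small Python sketch of the same idea. The function name `element_name` and the fallback behaviour (keep the translated name when no mapping exists) are assumptions for illustration, not taken from the article's stylesheet.

```python
# Hypothetical mapping, mirroring the %stylemap hash above.
STYLEMAP = {"Text_body": "para"}

def element_name(style_name):
    """Turn a style name into an element name, applying the mapping if present."""
    # Same effect as XPath's translate(name, ' ', '_').
    candidate = style_name.replace(" ", "_")
    # Styles without an entry keep their translated name.
    return STYLEMAP.get(candidate, candidate)

print(element_name("Text body"))  # para
print(element_name("Heading 1"))  # Heading_1
```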

Adding the metadata

Let's assume for now that we are only interested in Dublin Core metadata. To get this we use the simple XPath /office:document/office:meta/dc:*.
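As a rough Python equivalent of that expression, the sketch below collects every child of office:meta that lives in the Dublin Core namespace. The office and meta namespace URIs, the sample document, and the helper name `dublin_core` are assumptions for the example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
# The office namespace URI is assumed; adjust it to match your file.
NS = {"office": "http://openoffice.org/2000/office"}

# Hand-written stand-in for the metadata part of a document.
SAMPLE = """\
<office:document xmlns:office="http://openoffice.org/2000/office"
                 xmlns:meta="http://openoffice.org/2000/meta"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <meta:generator>OpenOffice</meta:generator>
    <dc:title>Test Example</dc:title>
    <dc:creator>Matt Sergeant</dc:creator>
  </office:meta>
</office:document>
"""

def dublin_core(root):
    """Return (local name, text) pairs for the dc:* children of office:meta."""
    meta = root.find("office:meta", NS)
    if meta is None:
        return []
    prefix = "{%s}" % DC
    # ElementTree stores tags as "{namespace}local", so filter on the prefix.
    return [(child.tag[len(prefix):], child.text)
            for child in meta if child.tag.startswith(prefix)]

root = ET.fromstring(SAMPLE)
print(dublin_core(root))
```

Elements from other namespaces, such as the generator entry in the sample, are skipped.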

Transformation results

The full stylesheet can be run using the Perl module XML::XPathScript, which you can download from CPAN. It comes with a command line utility, xpathscript.

The results of this transformation on a simple OpenOffice test document are

 
<article>
 <artheader xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title>Test Example</dc:title>
   <dc:creator>Matt Sergeant</dc:creator>
   <dc:date>2000-11-13T21:00:01</dc:date>
   <dc:language>en-US</dc:language>
 </artheader>
 <body>
   <Heading_1>Test</Heading_1>
   <para>Here is some text</para>
 </body>
</article>

The result is much simpler than the original. We can easily work with this to transform to HTML using more XPathScript or XSLT.

Flat structure

The document format follows HTML's style of headings followed by text. This is not my personal preference. I prefer DocBook, which models the document as a tree structure -- sections are contained within a <sect1> tag, and sub-sections are contained within the parent section, rather than just occurring in the main flow of tags. A tree structure makes it easier to manipulate the document. For example, generating a table of contents is a simple recursive loop. But with the flat format in OpenOffice it's more difficult as we have to maintain information about the current heading levels.

It would be ideal to make the stylesheet produce a tree-shaped document instead of a flat one. So that is what I did. Since it requires maintaining state information about the current heading level, the choice of XPathScript is vindicated again, since it's just Perl. I've written a stylesheet that gets very close to generating DocBook from OpenOffice XML files; it's what I used to provide this article to XML.com, followed by another transformation to generate HTML. I can do this in one step using AxKit pipelines. I save the file into the web document root, and AxKit transforms it to HTML for me.
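The state tracking involved can be sketched with a stack of open sections. This is a Python illustration of the general technique, not the stylesheet described above; the flat input format, the `sectN` and `title` output names, and the function `nest_sections` are assumptions, and the sketch copies only element text, ignoring inline markup.

```python
import re
import xml.etree.ElementTree as ET

# Flat input in the style of the simplified output shown earlier.
FLAT = """\
<body>
  <Heading_1>Intro</Heading_1>
  <para>First paragraph</para>
  <Heading_2>Details</Heading_2>
  <para>Nested paragraph</para>
  <Heading_1>Next</Heading_1>
  <para>Last paragraph</para>
</body>
"""

def nest_sections(flat_body):
    """Rebuild a flat heading/paragraph sequence as nested sectN elements."""
    root = ET.Element("article")
    # Each entry is (heading level, container element); level 0 is the root.
    stack = [(0, root)]
    for child in flat_body:
        match = re.fullmatch(r"Heading_(\d+)", child.tag)
        if match:
            level = int(match.group(1))
            # Close every open section at this level or deeper.
            while len(stack) > 1 and stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "sect%d" % level)
            ET.SubElement(section, "title").text = child.text
            stack.append((level, section))
        else:
            # Non-heading content goes into the innermost open section.
            ET.SubElement(stack[-1][1], child.tag).text = child.text
    return root

tree = nest_sections(ET.fromstring(FLAT))
print(ET.tostring(tree, encoding="unicode"))
```

Once the document is nested this way, a table of contents can be produced by recursing over the section elements and reading each title.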

OpenOffice for Content Management

As I mentioned earlier, my aim is to use OpenOffice as the editing component for a content management system (specifically, as an add-on for AxKit). The one thing that has thrown a monkey wrench into the works is OpenOffice's packaging format: you cannot pass ZIP archives to an XML parser. Fortunately, since XML application servers like AxKit and Cocoon allow the XML provider to be overridden, we can reach into those ZIP archives and extract the XML before further processing with stylesheets.
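Extracting the XML from a package takes only a few lines in most languages. The Python sketch below shows the general idea; it is not an AxKit or Cocoon provider, the member name `content.xml` is an assumption about the package layout, and `read_package_xml` is a hypothetical helper.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def read_package_xml(package, member="content.xml"):
    """Parse one XML member of a ZIP package and return its root element.

    `package` may be a file path or a file-like object; the default
    member name is an assumption about how the package is laid out.
    """
    with zipfile.ZipFile(package) as archive:
        with archive.open(member) as stream:
            return ET.parse(stream).getroot()

# Build a small in-memory package so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc><para>Here is some text</para></doc>")
buffer.seek(0)

root = read_package_xml(buffer)
print(root.find("para").text)  # Here is some text
```

A custom provider would do the equivalent of `read_package_xml` and hand the resulting XML to the stylesheet pipeline.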

In November at XML Dev Con in San Jose I gave a talk about the current state of XML applications for web developers in the open source world. My conclusion was that while the server side of XML processing was competitive with, if not better than, proprietary products, the client-editor side still had a long way to go. OpenOffice's XML format changes everything. Now you really can edit a richly formatted document in a WYSIWYG word processor and publish it directly to the Web. That's a huge step in the right direction for the open source community.

Other ideas that could be implemented include

  • convert a presentation file to Sun's XML slide format and then to SVG using their toolkit;

  • use stylesheets to generate OpenOffice's XML format from XML formats like DocBook or XHTML (or the output from the transformation above) to create a form of round-trip editing;

  • use stylesheets to generate XHTML directly, rather than an interim format.

Doubtless there are many more possibilities. I look forward to feedback about this article.