Adventures with OpenOffice and XML
by Matt Sergeant
Putting It All Together
As an illustration, you can download the source XML file of this article, which was written in OpenOffice build 605, saved as XML, and then transformed using the techniques below.
How can we put the XML generated from OpenOffice to good use? What XML geeks really want to see is a free WYSIWYG XML editor like XMetaL or Adept. And here it is. If we restrict ourselves (or our customers) to using defined styles, OpenOffice can truly be a structured XML editor, and its users need never know they are editing XML.
By processing the XML generated by OpenOffice, we can turn tags like <text:h text:style-name="P10"> into something significantly easier to work with, like <Heading_3>. And for structured XML, we really don't need all the font and page settings. But some of the style information may be of interest; for example, <span> tags may point to XSL FO styles (which are almost identical to CSS styles), so these might be useful in trying to get a similar look if we translate the page to HTML.
We could do this transformation with XSLT. But I prefer XPathScript, which feels more natural to me because I can use variables, define functions, and pass parameters.
The code below will only work on current releases of OpenOffice (and probably works best on files saved from build 605), due to the aforementioned changes in the automatic styles functionality.
From automatic style to the real style
First we need to find an XPath expression that will take us from the text's style name (which will be an automatic style name like "P1") to the real style name. This is actually rather simple.
/office:document/office:automatic-styles/style:style[@style:name="P1"]/@style:parent-style-name
It finds the style:parent-style-name attribute of the automatic style. I call this the "actual style".
We can translate the actual style to a string we can use for an element name by replacing spaces with underscores using XPath's translate() function; it will change "Heading 1" to "Heading_1".
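The lookup and the translate() step can be sketched outside XPathScript as well. The following is a minimal illustration in Python's standard library, not the article's stylesheet. The namespace URIs and the trimmed sample document are assumptions made for the example, so check them against the namespaces your own file declares.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions for illustration only.
NS = {
    "office": "http://openoffice.org/2000/office",
    "style": "http://openoffice.org/2000/style",
}

# Hypothetical, heavily trimmed stand-in for an OpenOffice document.
SAMPLE = """\
<office:document xmlns:office="http://openoffice.org/2000/office"
                 xmlns:style="http://openoffice.org/2000/style">
  <office:automatic-styles>
    <style:style style:name="P1" style:parent-style-name="Heading 1"/>
    <style:style style:name="P2" style:parent-style-name="Text body"/>
  </office:automatic-styles>
</office:document>
"""

def actual_style(root, auto_name):
    """Return the parent style name of an automatic style, or None."""
    attr_name = "{%s}name" % NS["style"]
    attr_parent = "{%s}parent-style-name" % NS["style"]
    for style in root.findall("office:automatic-styles/style:style", NS):
        if style.get(attr_name) == auto_name:
            return style.get(attr_parent)
    return None

def element_name(style_name):
    """Mirror XPath's translate(name, ' ', '_')."""
    return style_name.replace(" ", "_")

root = ET.fromstring(SAMPLE)
print(element_name(actual_style(root, "P1")))  # Heading_1
```

The attribute filter is written as a loop rather than an XPath predicate only to keep the namespace handling explicit.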
A name mapping
Next we need to set up a name mapping to translate style names to a preferred form. For example, we translate "Text_body" to "para".
Mappings are trivial in Perl (and hence in XPathScript): we simply set up a hash.
my %stylemap = (
    Text_body => "para",
);
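For comparison, the same mapping idea looks like this as a Python dictionary. This is only a sketch: the helper name mapped_name is hypothetical, and the fallback to the unmapped style name is an assumption, chosen because it would leave a name such as "Heading_1" unchanged.

```python
# Mapping from actual style names to preferred element names.
# Only Text_body -> para comes from the article.
STYLEMAP = {
    "Text_body": "para",
}

def mapped_name(style_name):
    """Return the preferred element name, or the style name itself if unmapped.

    The fallback is an assumption made for this sketch.
    """
    return STYLEMAP.get(style_name, style_name)

print(mapped_name("Text_body"))  # para
print(mapped_name("Heading_1"))  # Heading_1
```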
Adding the metadata
Let's assume for now that we are only interested in Dublin Core
metadata. To get this we use the simple XPath
/office:document/office:meta/dc:*.
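As a rough illustration of what that expression selects, here is a Python standard-library sketch that collects every Dublin Core child of office:meta. The office and meta namespace URIs and the sample document are assumptions for the example. The Dublin Core URI is the one that appears in the transformation output.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OFFICE_NS = "http://openoffice.org/2000/office"  # assumed namespace URI

# Hypothetical sample; the meta:generator element is there to show filtering.
SAMPLE = """\
<office:document xmlns:office="http://openoffice.org/2000/office"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <dc:title>Test Example</dc:title>
    <dc:creator>Matt Sergeant</dc:creator>
    <meta:generator xmlns:meta="http://openoffice.org/2000/meta">demo</meta:generator>
  </office:meta>
</office:document>
"""

def dublin_core(root):
    """Return (local name, text) pairs for each dc:* child of office:meta."""
    meta = root.find("{%s}meta" % OFFICE_NS)
    if meta is None:
        return []
    prefix = "{%s}" % DC_NS
    return [(child.tag[len(prefix):], child.text)
            for child in meta
            if child.tag.startswith(prefix)]

root = ET.fromstring(SAMPLE)
for name, value in dublin_core(root):
    print(name, value)
```

Elements in other namespaces, such as the generator entry in the sample, are skipped.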
Transformation results
The full stylesheet can be run using the Perl module XML::XPathScript, which you can download from CPAN. It comes with a command line utility, xpathscript.
The results of this transformation on a simple OpenOffice test document are:
<article>
  <artheader xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Test Example</dc:title>
    <dc:creator>Matt Sergeant</dc:creator>
    <dc:date>2000-11-13T21:00:01</dc:date>
    <dc:language>en-US</dc:language>
  </artheader>
  <body>
    <Heading_1>Test</Heading_1>
    <para>Here is some text</para>
  </body>
</article>
The result is much simpler than the original. We can easily work with this to transform to HTML using more XPathScript or XSLT.
Flat structure
The document format follows HTML's style of headings followed by
text. This is not my personal preference. I prefer DocBook, which
models the document as a tree structure -- sections are contained
within a <sect1> tag, and sub-sections are contained within the
parent section, rather than just occurring in the main flow of tags. A
tree structure makes it easier to manipulate the document. For
example, generating a table of contents is a simple recursive
loop. But with the flat format in OpenOffice it's more difficult as we
have to maintain information about the current heading levels.
It would be ideal to make the stylesheet produce a tree-shaped document instead of a flat one. So that is what I did. Since it requires maintaining state information about the current heading level, the choice of XPathScript is vindicated again, since it's just Perl. I've written a stylesheet that gets very close to generating DocBook from OpenOffice XML files; it's what I used to provide this article to XML.com, followed by another transformation to generate HTML. I can do this in one step using AxKit pipelines. I save the file into the web document root, and AxKit transforms it to HTML for me.
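The state that has to be maintained is essentially a stack of open sections. The sketch below shows one way to do that bookkeeping in Python; it is not the stylesheet described above. The input shape (a list of level and text pairs, with level 0 meaning body text) and the function name nest are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def nest(flat_items):
    """Turn (level, text) pairs into nested <sectN> elements.

    level 0 is body text; level N >= 1 is a heading of that level.
    Sections are numbered by nesting depth, so a skipped heading
    level still yields sect1, sect2, ... without gaps.
    """
    root = ET.Element("article")
    stack = [(0, root)]  # (heading level, element) for each open section
    for level, text in flat_items:
        if level == 0:
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "sect%d" % len(stack))
        ET.SubElement(section, "title").text = text
        stack.append((level, section))
    return root

flat = [
    (1, "Intro"),
    (0, "Hello"),
    (2, "Details"),
    (0, "More"),
    (1, "Next"),
]
print(ET.tostring(nest(flat), encoding="unicode"))
```

The root element sits at level 0 and is never popped, so the stack cannot run empty. Once the tree exists, a table of contents reduces to a recursive walk over the section titles.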
OpenOffice for Content Management
As I mentioned earlier, my aim is to use OpenOffice as the editing component for a content management system (specifically, as an add-on for AxKit). The one thing that has thrown a monkey wrench into the works is OpenOffice's packaging format: you cannot pass ZIP archives to an XML parser. Fortunately, since XML application servers like AxKit and Cocoon allow the XML provider to be overridden, we can reach into those ZIP archives to extract the XML before further processing with stylesheets.
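The extraction step itself is small. The following Python sketch shows the general idea of pulling an XML member out of a ZIP package and handing it to a parser; it is not an AxKit or Cocoon provider. The member name content.xml is an assumption about the package layout, and the in-memory archive is a stand-in for a real saved document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def read_package_xml(package, member="content.xml"):
    """Parse one XML member of a ZIP package (path or file-like object).

    The default member name is an assumption about the package layout.
    """
    with zipfile.ZipFile(package) as archive:
        with archive.open(member) as stream:
            return ET.parse(stream).getroot()

# Build a tiny stand-in package in memory instead of using a real document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc><p>Here is some text</p></doc>")
buffer.seek(0)

root = read_package_xml(buffer)
print(root.find("p").text)
```

A custom provider would do the equivalent of read_package_xml before the stylesheets run.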
In November at XML Dev Con in San Jose I gave a talk about the current state of XML applications for web developers in the open source world. My conclusion was that while the server side of XML processing is competitive with, if not better than, proprietary products, the client-editor side of things was a long way off. OpenOffice's XML format changes everything. Now you really can edit a richly formatted document in a WYSIWYG word processor and publish it directly to the Web. That's a huge step in the right direction for the open source community.
Other ideas that could be implemented include:
- convert a presentation file to Sun's XML slide format and then to SVG using their toolkit;
- use stylesheets to generate OpenOffice's XML format from XML formats like DocBook or XHTML (or the output from the transformation above) to create a form of round-trip editing;
- use stylesheets to generate XHTML directly, rather than an interim format.
Doubtless there are many more possibilities. I look forward to feedback about this article.