Adventures with OpenOffice and XML

February 7, 2001

At the Open Source conference in Monterey last year, Sun announced their plans to release the current source code for Star Office, renamed OpenOffice. In October they followed up on their plans, releasing both the source code and binaries for OpenOffice build 605. One of the features added since Star Office 5.2 was the ability to save files as XML.

Beyond the source code being open, saving as XML makes OpenOffice truly open. XML's self-documenting nature allows us to dive into the document format without having to dive into C++. More significantly, XML allows us to use simple, free tools to manipulate the documents themselves.

In this article we will examine the structure of the format. We will not go into great detail, as Sun has already done so in a 400-page specification. Instead we will focus on using the XML to generate something of potential interest to web developers and content editors.

It's important to note that OpenOffice isn't ready to be an everyday word processor. Components like printing and spell checking were removed in the migration to open source because Sun didn't own them. I expect they will be added back by the open source community as time goes by. When Sun releases Star Office 6 I expect they will include the proprietary spell checker and print engine again. Also worth noting is that OpenOffice is relatively unstable at the moment. I experienced several crashes and other serious problems while working on this article. Thanks to Daniel Vogelheim of Sun for helping me through those troubles.

XML Requirements

Having migrated the project to open source, Sun did the right thing in opening the development process. Hence all of Sun's decisions are open to public discussion on the OpenOffice mailing lists. Sun had a specific set of requirements when designing the XML format for OpenOffice. The short list of requirements can be found on the OpenOffice XML site, but let's review the list here.

Core Requirements:

The file format must be capable of being used as an office program's native file format. The format must be "non-lossy" and must support (at least) the full capability of a StarOffice/OpenOffice document. The format is likely to be used for document interchange but that alone is not enough.
Structured content should make use of XML's structuring capabilities and be represented in terms of XML elements and attributes.
The file format must be fully documented and have no "secret" features.
OpenOffice must be the reference implementation for this file format.

Sun plans for XML to become the default save format. That is not yet the case in build 605, where I have to select "Save As" to export to XML, but expect it to be the default when OpenOffice is finally released.

Packaging

OpenOffice documents can be compound -- that is, they can contain multiple documents of different formats. Sun examined the different ways of packaging up compound documents using XML and picked the ZIP format. Initially this choice surprised me, since I've always thought the most standard way to store binary data in XML was base64 encoding. However, the decision is explained in detail on the OpenOffice site. Two factors were vital: ZIP's indexing ability and the importance of being able to load and save on demand. This means that an OpenOffice file will be a ZIP file containing at least one XML file, along with other files of relevance to the document (such as images, and possibly other OpenOffice files).

The OpenOffice XML Format

What about the details of the XML format itself? A specification document is available online, but it's a big document, so I'll distill some of it here.

Document Root


<office:document>

The document root element is <office:document>. (I'm leaving namespace URIs out; OpenOffice appears to use namespaces to its advantage in a very clean manner, unlike other office suites.)

According to the specification, this is a generic document root. All OpenOffice documents have this document root. A spreadsheet and word processor file will have the same document root, allowing us to do some generic processing.

Metadata


<office:meta>
   <dc:title>Adventures with Open Office and XML</dc:title>

Document metadata is one of the more interesting features of the OpenOffice XML format. OpenOffice metadata is enclosed in the <office:meta> tag at the top level of the document, immediately following the document root. Sun chose Dublin Core for the majority of their metadata elements. Where Dublin Core did not have an available element, Sun created elements in their meta namespace, including

generator -- the application that created this file;
initial-creator -- the original author of the file (dc:creator is used for the person who last edited the file);
creation-date -- the date this file was first created (dc:date is used for the date of the last edit);
keywords -- can be edited in the document properties dialog.

How is this useful? We could write a Perl script to display the author of OpenOffice files.


use strict;
use warnings;
use XML::XPath;

# Print the dc:creator of every file named on the command line.
while (my $file = shift @ARGV) {
   next unless -f $file;
   eval {
     my $xp = XML::XPath->new(filename => $file);
     print $file, ": ", $xp->findvalue("//dc:creator"), "\n";
   };
}

If we call this script dcdir, the results on a directory full of OpenOffice XML files might be


$ dcdir *.sxw
test.so.xml.sxw: Matt Sergeant

This works regardless of the type of OpenOffice file we are examining. With a little more work we can ensure that the file is an XML file of the OpenOffice format (at the moment, the eval block simply traps the parse error, so a non-XML file is skipped without any message). See Kip Hampton's regular Perl and XML column for more details on using XML::XPath.

Styles


<office:styles>
   <style:style style:name="Source Code"
           style:family="paragraph"
           style:parent-style-name="Text body">
       <style:properties fo:font-family="Courier"
               fo:margin-left="0.25inch"
               fo:font-size="11pt"/>

OpenOffice formats text using text styles, allowing easy modification of a document's appearance. Styling information is saved in the XML format. The list of defined styles is enclosed within the <office:styles> element.

Each style, marked up with the <style:style> element, defines, in attributes, a style name; a style family (for example a paragraph style or text (inline) style, equivalent to <div> or <span> in HTML); a parent style (because styles inherit their parent style's attributes); and a class, which is used in the OpenOffice style dialog box to categorize styles.

Within the style element itself are style properties, which are stored in the attributes of the <style:properties> empty element. The properties of a style are inherited from the ancestor styles and only modified properties are stored (which saves space). The second interesting re-use of public XML schemas occurs in the use of XSL FO attributes (about which there's more in "Using Formatting Objects") to define style properties. Theoretically this means we should be able to do some formatting to produce an XSL FO document.
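
The inheritance rule can be sketched in a few lines of Perl. This is an illustration under stated assumptions, not how OpenOffice itself resolves styles: the %styles hash is a hand-built stand-in for a parsed <office:styles> section, the properties given for "Text body" are invented for the example, and effective_properties is a hypothetical helper.

```perl
use strict;
use warnings;

# Hand-built stand-in for parsed <office:styles> data. "Source Code"
# mirrors the sample above; the "Text body" values are assumed.
my %styles = (
    "Text body" => {
        parent => undef,
        props  => {
            "fo:font-family"   => "Times",
            "fo:margin-bottom" => "0.0835inch",
        },
    },
    "Source Code" => {
        parent => "Text body",
        props  => {
            "fo:font-family" => "Courier",
            "fo:margin-left" => "0.25inch",
            "fo:font-size"   => "11pt",
        },
    },
);

# Hypothetical helper: walk from the named style up through its
# ancestors. A property set nearer the style wins over an inherited one.
sub effective_properties {
    my ($styles, $name) = @_;
    my (%effective, %seen);
    while (defined $name && exists $styles->{$name} && !$seen{$name}++) {
        my $style = $styles->{$name};
        for my $prop (keys %{ $style->{props} }) {
            $effective{$prop} = $style->{props}{$prop}
                unless exists $effective{$prop};
        }
        $name = $style->{parent};
    }
    return \%effective;
}

my $props = effective_properties(\%styles, "Source Code");
print "$_ = $props->{$_}\n" for sort keys %$props;
```

Only the three modified properties are stored on "Source Code", yet the resolved set also carries the margin inherited from its parent. The %seen hash is there only to stop a malformed, circular parent chain from looping forever.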

Why would we want to do this when we can print directly from OpenOffice? I work in content management and application serving (see my XML.com article on AxKit), and some of my clients would like to be able to use an ordinary word processor to create content. By doing some preprocessing, and then passing the output to FOP or another XSL FO processor, we can generate PDF files automatically from content saved into the web hierarchy. (This functionality isn't yet available, but please get in touch if this sort of thing interests you.)

It is again worth noting here that where XSL FO did not have an equivalent attribute to the internal implementation in OpenOffice, Sun have defined their own attributes in one of the OpenOffice namespaces.

Automatic Styles


<office:automatic-styles>
 <style:style style:name="P1" style:family="paragraph"
         style:parent-style-name="Title">
   <style:properties fo:font-family="Arial"
           fo:font-style-name="" fo:font-size="18pt"
           fo:font-weight="bold" fo:text-align="end"
           style:justify-single-word="false"
           fo:margin-top="0.16inch"
           fo:margin-bottom="0.0835inch"/>

How do you allow people to use styles, yet also allow local modifications to the fonts, weight, and so on? OpenOffice does this by injecting an automatic style between the text and the real style, so that rather than

"Some text" -> is of style -> "Title"

we have

"Some text" -> is of auto style -> "P1" -> parent style -> "Title"

OpenOffice has a section called <office:automatic-styles> following the main styles definitions. Within the main body of the document, only automatic styles are used.

There are some changes going on in this area presently. The current CVS builds at Sun only use automatic styles when local modifications to the formatting have been made. For example, suppose we make some text bold within a paragraph. OpenOffice uses the automatic style to define that hard formatting with a span. The result might look something like


<text:p text:style-name="Text body">Some text with
a <text:span text:style-name="T1">bold</text:span> word
in it.</text:p>

Main Body Text


<office:body>

Finally we get to the main body of the document. Of all the sections, it's probably the simplest to follow. We will address each of the major tags in turn. Unlike other sections, the body text is free-form, so the following tags can appear anywhere within the <office:body> section.

Headings


<text:h text:style-name="P4" text:level="1">The format
itself</text:h>

Headings are defined with the <text:h> tag. Build 605 export uses an automatic style, whereas with later builds it will likely be


<text:h text:style-name="Heading 1">The format itself</text:h>

Paragraphs


<text:p text:style-name="P3">Some text</text:p>

Paragraphs of text are defined with the <text:p> tag. We can now start to see that some of the tags in the body are similar to HTML, albeit in a different namespace.

Spans


<text:span text:style-name="T6">spanned text</text:span>

OpenOffice spans are exactly the same as spans in HTML. They delimit an inline section of a paragraph, applying alternate styling to the spanned text.

Lists

Lists are defined by tags with names similar to those used in DocBook. Specifically these are <text:ordered-list>, <text:unordered-list> and <text:list-item>.

Graphics

Vector graphics can be embedded directly into the document with OpenOffice, which is a nice feature, but you will be even more pleased to know that OpenOffice uses SVG as its native vector graphics format. These vector graphics can occur directly within the flow of the body document. Daniel Vogelheim informed me, however, that the format is more accurately described as "mostly SVG": there are some things that OpenOffice can do with graphics that SVG does not define, so again Sun has extended the format using elements in its own namespace.

Putting It All Together

As an illustration, you can download the source XML file of this article which was written in OpenOffice build 605, saved as XML, and then transformed using the techniques below.

How can we put the XML generated from OpenOffice to good use? What XML geeks really want to see is a free WYSIWYG XML editor like XMetaL or Adept. And here it is. If we restrict ourselves (or our customers) to using defined styles, OpenOffice can truly be a structured XML editor, and the author never needs to know they are editing XML.

By processing the XML generated by OpenOffice, we can turn tags like <text:h text:style-name="P10"> into something significantly easier to work with like <Heading_3>. And for structured XML, we really don't need all the font and page settings. But some of the style information may be of interest; for example <span> tags may point to XSL FO styles -- which are almost identical to CSS styles -- so these might be useful in trying to get a similar look if we translate the page to HTML.

We could do this transformation with XSLT. But I prefer XPathScript because it's more natural to me since I can use variables, define functions and pass parameters.

The code below will only work on current releases of OpenOffice (and probably works best on files saved from build 605), due to the aforementioned changes in the automatic styles functionality.

From automatic style to the real style

First we need to find an XPath expression that will take us from the text's style name (which will be an automatic style name like "P1") to the real style name. This is actually rather simple.


/office:document
  /office:automatic-styles
  /style:style[@style:name="P1"]
  /@style:parent-style-name

It finds the style:parent-style-name attribute of the automatic style. I call this the "actual style".
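
As a rough sketch, the same lookup can be run from Perl with XML::XPath. The sample document is cut down to the one style we need, and its namespace URIs are placeholders, since the real ones are left out of this article. The helper name actual_style is hypothetical.

```perl
use strict;
use warnings;
use XML::XPath;

# Cut-down sample document. The namespace URIs are placeholders and
# are NOT the URIs OpenOffice actually uses.
my $xml = <<'XML';
<office:document xmlns:office="urn:example:office"
                 xmlns:style="urn:example:style">
  <office:automatic-styles>
    <style:style style:name="P1" style:family="paragraph"
                 style:parent-style-name="Title"/>
  </office:automatic-styles>
</office:document>
XML

# Hypothetical helper: map an automatic style name to its parent style.
sub actual_style {
    my ($xp, $auto_name) = @_;
    my $path = '/office:document/office:automatic-styles'
             . qq{/style:style[\@style:name="$auto_name"]}
             . '/@style:parent-style-name';
    return "" . $xp->findvalue($path);   # force a plain string
}

my $xp = XML::XPath->new(xml => $xml);
print actual_style($xp, "P1"), "\n";
```

The style name is interpolated straight into the expression, which is fine for generated names like "P1" but would need escaping for a name containing a double quote.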

We can translate the actual style to a string we can use for an element name by replacing spaces with underscores using XPath's translate() function; it will change "Heading 1" to "Heading_1".
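
In plain Perl the equivalent of that translate() call is a one-line tr///. A minimal sketch, with style_to_element as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper mirroring XPath's translate($name, ' ', '_'):
# every space in the style name becomes an underscore.
sub style_to_element {
    my ($style_name) = @_;
    (my $element = $style_name) =~ tr/ /_/;
    return $element;
}

print style_to_element("Heading 1"), "\n";   # prints "Heading_1"
print style_to_element("Text body"), "\n";   # prints "Text_body"
```

This handles spaces only. A style name containing other characters that are illegal in XML element names would need further cleaning.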

A name mapping

Next we need to set up a name mapping to translate style names to a more preferred form. For example, we translate "Text_body" to "para".

Mappings are trivial in Perl (and hence XPathScript): we simply set up a hash.


my %stylemap = (
 Text_body => "para",
);
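
One way the hash might be consulted is sketched here. The fallback rule, keeping the style name unchanged when it has no mapping, is an assumption, and element_for is a hypothetical helper name.

```perl
use strict;
use warnings;

my %stylemap = (
    Text_body => "para",
);

# Hypothetical helper: prefer the mapped name, otherwise keep the
# underscore form of the style name unchanged (assumed fallback).
sub element_for {
    my ($name) = @_;
    return exists $stylemap{$name} ? $stylemap{$name} : $name;
}

print element_for("Text_body"), "\n";   # prints "para"
print element_for("Heading_1"), "\n";   # prints "Heading_1"
```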

Adding the metadata

Let's assume for now that we are only interested in Dublin Core metadata. To get this we use the simple XPath /office:document/office:meta/dc:*.

Transformation results

The full stylesheet can be run using the Perl module XML::XPathScript, which you can download from CPAN. It comes with a command line utility, xpathscript.

The results of this transformation on a simple OpenOffice test document are


<article>
 <artheader xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title>Test Example</dc:title>
   <dc:creator>Matt Sergeant</dc:creator>
   <dc:date>2000-11-13T21:00:01</dc:date>
   <dc:language>en-US</dc:language>
 </artheader>
 <body>
   <Heading_1>Test</Heading_1>
   <para>Here is some text</para>
 </body>
</article>

The result is much simpler than the original. We can easily work with this to transform to HTML using more XPathScript or XSLT.

Flat structure

The document format follows HTML's style of headings followed by text. This is not my personal preference. I prefer DocBook, which models the document as a tree structure -- sections are contained within a <sect1> tag, and sub-sections are contained within the parent section, rather than just occurring in the main flow of tags. A tree structure makes it easier to manipulate the document. For example, generating a table of contents is a simple recursive loop. But with the flat format in OpenOffice it's more difficult as we have to maintain information about the current heading levels.

It would be ideal to make the stylesheet produce a tree-shaped document instead of a flat one. So that is what I did. Because this requires maintaining state information about the current heading level, the choice of XPathScript is vindicated again: it's just Perl. I've written a stylesheet that gets very close to generating DocBook from OpenOffice XML files; it's what I used to provide this article to XML.com, followed by another transformation to generate HTML. I can do this in one step using AxKit pipelines. I save the file into the web document root, and AxKit transforms it to HTML for me.
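
The state tracking involved can be sketched with a stack of open sections. This is not the stylesheet itself, only an illustration of the idea: the heading data is invented, paragraphs are ignored, heading levels are assumed to start at 1, and build_tree and outline are hypothetical helper names.

```perl
use strict;
use warnings;

# Flat input as it appears in the body: heading levels and titles in
# document order. The sample data is invented for illustration.
my @flat = (
    [1, "Introduction"],
    [2, "Background"],
    [2, "Scope"],
    [1, "The format itself"],
);

# Hypothetical helper: nest each heading under the nearest preceding
# heading of a lower level, tracking the open sections on a stack.
sub build_tree {
    my @headings = @_;
    my $root  = { level => 0, title => "", children => [] };
    my @stack = ($root);
    for my $h (@headings) {
        my ($level, $title) = @$h;
        # Close any open sections at the same or a deeper level.
        pop @stack while $stack[-1]{level} >= $level;
        my $node = { level => $level, title => $title, children => [] };
        push @{ $stack[-1]{children} }, $node;
        push @stack, $node;
    }
    return $root;
}

# With a tree in hand, a table of contents is a simple recursive walk.
sub outline {
    my ($node, $depth) = @_;
    my $text = "";
    for my $child (@{ $node->{children} }) {
        $text .= ("  " x $depth) . $child->{title} . "\n";
        $text .= outline($child, $depth + 1);
    }
    return $text;
}

my $tree = build_tree(@flat);
print outline($tree, 0);
```

The stack holds exactly the "current heading levels" state: its top is always the section that the next piece of content belongs to.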

OpenOffice for Content Management

As I mentioned earlier, my aim is to use OpenOffice as the editing component for a content management system (specifically, as an add-on for AxKit). The one thing that has thrown a monkeywrench into the works is OpenOffice's packaging format: you cannot pass ZIP archives to an XML parser. However, since XML application servers like AxKit and Cocoon allow the XML provider to be overridden, we can reach into those ZIP archives to extract the XML before further processing with stylesheets.
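
Outside of an application server, the extraction step might look like the sketch below, which uses the IO::Uncompress::Unzip module that ships with Perl. It is not AxKit provider code. The helper name xml_from_package is hypothetical, and the member name "content.xml" is an assumption made for the demonstration; check the package to see what the XML file is actually called.

```perl
use strict;
use warnings;
use IO::Compress::Zip     qw(zip $ZipError);
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: return one member of a ZIP package as a string.
# $package may be a filename or a reference to a scalar holding the ZIP
# data; $member is the name of the XML file inside the package.
sub xml_from_package {
    my ($package, $member) = @_;
    my $xml;
    unzip($package => \$xml, Name => $member)
        or die "Cannot read $member: $UnzipError\n";
    return $xml;
}

# Demonstration: build a tiny package in memory, then read it back.
# The member name "content.xml" is assumed, not taken from build 605.
my $source  = "<office:document/>";
my $zipdata = "";
zip(\$source => \$zipdata, Name => "content.xml")
    or die "zip failed: $ZipError\n";

print xml_from_package(\$zipdata, "content.xml"), "\n";
```

The returned string can then be handed to a parser, for example with XML::XPath->new(xml => $string).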

In November at XML Dev Con in San Jose I gave a talk about the current state of XML applications for web developers in the open source world. My conclusion was that while the server side of XML processing is competitive with, if not better than, proprietary products, the client-editor side of things was a long way off. OpenOffice's XML format changes everything. Now you really can edit a richly formatted document in a WYSIWYG word processor and publish it directly to the Web. That's a huge step in the right direction for the open source community.

Other ideas that could be implemented include

convert a presentation file to Sun's XML slide format and then to SVG using their toolkit;
use stylesheets to generate OpenOffice's XML format from XML formats like DocBook or XHTML (or the output from the transformation above) to create a form of round-trip editing;
use stylesheets to generate XHTML directly, rather than an interim format.

Doubtless there are many more possibilities. I look forward to feedback about this article.