XML.com: XML From the Inside Out

Adventures with OpenOffice and XML

February 07, 2001

At the Open Source conference in Monterey last year, Sun announced plans to release the current source code for StarOffice under the name OpenOffice. In October it followed through, releasing both the source code and binaries for OpenOffice build 605. One of the features added since StarOffice 5.2 is the ability to save files as XML.

Open source code is only part of the story: saving as XML is what makes OpenOffice truly open. XML's self-documenting nature allows us to dive into the document format without having to dive into C++. More significantly, XML allows us to use simple, free tools to manipulate the documents themselves.

In this article we will examine the structure of the format. We will not go into great detail, as Sun has already done so in a 400-page specification. Instead we will focus on using the XML to generate something of potential interest to web developers and content editors.

It's important to note that OpenOffice isn't ready to be an everyday word processor. Components like printing and spell checking were removed in the migration to open source because Sun didn't own them. I expect they will be added back by the open source community as time goes by, and that Sun will include the proprietary spell checker and print engine again when it releases StarOffice 6. Also worth noting is that OpenOffice is relatively unstable at the moment. I experienced several crashes and other serious problems while working on this article. Thanks to Daniel Vogelheim of Sun for helping me through those troubles.

XML Requirements

Having migrated the project to open source, Sun did the right thing in opening the development process as well. Hence all of Sun's decisions are open to public discussion on the OpenOffice mailing lists. Sun had a specific set of requirements when designing the XML format for OpenOffice. The short list of requirements can be found on the OpenOffice XML site, but let's review it here.

Core Requirements:

  1. The file format must be capable of being used as an office program's native file format. The format must be "non-lossy" and must support (at least) the full capability of a StarOffice/OpenOffice document. The format is likely to be used for document interchange but that alone is not enough.

  2. Structured content should make use of XML's structuring capabilities and be represented in terms of XML elements and attributes.

  3. The file format must be fully documented and have no "secret" features.

  4. OpenOffice must be the reference implementation for this file format.

Sun plans for XML to become the default save format. This is not yet the case in build 605, where I have to select "Save As" to export to XML, but expect XML to be the default when OpenOffice is finally released.

Packaging

OpenOffice documents can be compound -- that is, they can contain multiple documents of different formats. Sun examined the different ways of packaging up compound documents using XML and picked the ZIP format. Initially this choice surprised me, since I've always thought the most standard way to store binary data in XML was base64 encoding. However, the decision is explained in detail on the OpenOffice site. Two factors were vital: ZIP's indexing ability and the importance of being able to load and save on demand. This means that an OpenOffice file will be a ZIP file containing at least one XML file, along with other files of relevance to the document (such as images, and possibly other OpenOffice files).
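To make the packaging idea concrete, here is a minimal Python sketch of what ZIP's index buys us: we can list the parts of a package and load a single part on demand without unpacking the rest. The member names (document.xml, images/logo.png) are made up for illustration, since the names OpenOffice will actually use are not given here, and the package is built in memory rather than taken from a real OpenOffice file.

```python
import io
import zipfile


def list_package(data: bytes) -> dict:
    """Return {member name: uncompressed size} using only the ZIP index."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return {info.filename: info.file_size for info in package.infolist()}


def read_member(data: bytes, name: str) -> bytes:
    """Load a single member on demand, without unpacking the others."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.read(name)


def build_sample_package() -> bytes:
    """Build a tiny stand-in package in memory (member names are hypothetical)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("document.xml", "<office:document/>")
        package.writestr("images/logo.png", b"placeholder image bytes")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = build_sample_package()
    for name, size in list_package(sample).items():
        print(name, size)
    print(read_member(sample, "document.xml").decode("utf-8"))
```

With base64 inside a single XML file, by contrast, a reader would have to parse the whole document just to find out which images it contains.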

The OpenOffice XML Format

What about the details of the XML format itself? A specification document is available online, although it's a big document, so I'll distill some of it here.

Document Root

 
<office:document>

The document root element is <office:document>. (I'm leaving namespace URIs out; OpenOffice appears to use namespaces to its advantage in a very clean manner, unlike other office suites.)

According to the specification, this is a generic document root. All OpenOffice documents have this document root. A spreadsheet and word processor file will have the same document root, allowing us to do some generic processing.

Metadata

 
<office:meta>
   <dc:title>Adventures with Open Office and XML</dc:title>

Document metadata is one of the more interesting features of the OpenOffice XML format. OpenOffice metadata is enclosed in the <office:meta> tag at the top level of the document, immediately following the document root. Sun chose Dublin Core for the majority of their metadata elements. Where Dublin Core did not have an available element, Sun created elements in their meta namespace, including

  • generator -- the application that created this file;

  • initial-creator -- the original author of the file (dc:creator is used for the person who last edited the file);

  • creation-date -- the date this file was first created (dc:date is used for the date of the last edit);

  • keywords -- can be edited in the document properties dialog.

How is this useful? We could write a Perl script to display the author of OpenOffice files.

 
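#!/usr/bin/perl
use strict;
use warnings;
use XML::XPath;

# Print the dc:creator (last editor) of each file named on the command line.
while (my $file = shift @ARGV) {
   next unless -f $file;
   eval {
     my $xp = XML::XPath->new(filename => $file);
     print $file, ": ", $xp->findvalue("//dc:creator"), "\n";
   };
   warn "$file: $@" if $@;   # report files that could not be parsed
}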

If we call this script dcdir, the results on a directory full of OpenOffice XML files might be

 
$ dcdir *.sxw
test.so.xml.sxw: Matt Sergeant

This works regardless of the type of OpenOffice file we are examining. With a little more work we can ensure that each file really is XML in the OpenOffice format (at the moment, the eval only keeps the script from dying on a file that XML::XPath cannot parse; nothing checks the document root). See Kip Hampton's regular Perl and XML column for more details on using XML::XPath.
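The "little more work" might look something like the following sketch, written in Python rather than Perl so that it needs nothing beyond the standard library. It rejects input that is not well-formed XML or that lacks the generic <office:document> root, then reads a few metadata fields. The namespace URIs are assumptions on my part (they are left out of the examples above), so copy the real ones from the root element of your own files; read_metadata and SAMPLE are illustrative names, not part of any OpenOffice tool.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs: placeholders to be checked against real files.
NS = {
    "office": "http://openoffice.org/2000/office",
    "meta": "http://openoffice.org/2000/meta",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_metadata(xml_text):
    """Return a dict of metadata, or None if this is not an OpenOffice document."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # not well-formed XML
    if root.tag != "{%s}document" % NS["office"]:
        return None  # XML, but not the generic OpenOffice document root
    meta = root.find("office:meta", NS)
    if meta is None:
        return {}
    return {
        "title": meta.findtext("dc:title", default="", namespaces=NS),
        # dc:creator is the person who last edited the file.
        "last_editor": meta.findtext("dc:creator", default="", namespaces=NS),
        "original_author": meta.findtext(
            "meta:initial-creator", default="", namespaces=NS),
    }


SAMPLE = """<office:document
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:meta="http://openoffice.org/2000/meta"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <dc:title>Adventures with Open Office and XML</dc:title>
    <dc:creator>Matt Sergeant</dc:creator>
    <meta:initial-creator>Matt Sergeant</meta:initial-creator>
  </office:meta>
</office:document>"""

if __name__ == "__main__":
    print(read_metadata(SAMPLE))
    print(read_metadata("this is not XML"))
```

Because every OpenOffice document shares the same root, the same check should work for spreadsheets as well as word processor files.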

Styles

 
<office:styles>
   <style:style style:name="Source Code"
           style:family="paragraph"
           style:parent-style-name="Text body">
       <style:properties fo:font-family="Courier"
               fo:margin-left="0.25inch"
               fo:font-size="11pt"/>

OpenOffice formats text using text styles, allowing easy modification of a document's appearance. Styling information is saved in the XML format. The list of defined styles is enclosed within the <office:styles> element.

Each style, marked up with the <style:style> element, defines, in attributes, a style name; a style family (for example a paragraph style or text (inline) style, equivalent to <div> or <span> in HTML); a parent style (because styles inherit their parent style's attributes); and a class, which is used in the OpenOffice style dialog box to categorize styles.

Within the style element itself are style properties, which are stored in the attributes of the <style:properties> empty element. The properties of a style are inherited from the ancestor styles and only modified properties are stored (which saves space). The second interesting re-use of public XML schemas occurs in the use of XSL FO attributes (about which there's more in "Using Formatting Objects") to define style properties. Theoretically this means we should be able to do some formatting to produce an XSL FO document.
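Since only modified properties are stored, anything that consumes the styles has to rebuild the full property set by walking up the parent chain. Here is a rough Python sketch of that lookup. The namespace URIs are assumed, the properties given to "Text body" are invented for the example, and effective_properties is a hypothetical helper, not something OpenOffice provides.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI (not given in the article); check your own files.
STYLE = "http://openoffice.org/2000/style"


def q(namespace, local_name):
    """Build the {namespace}local form that ElementTree uses for names."""
    return "{%s}%s" % (namespace, local_name)


def effective_properties(root, style_name):
    """Merge properties from the oldest ancestor down to style_name."""
    by_name = {}
    for style in root.iter(q(STYLE, "style")):
        by_name[style.get(q(STYLE, "name"))] = style

    # Collect the style and its ancestors, guarding against loops.
    chain, seen = [], set()
    name = style_name
    while name is not None and name in by_name and name not in seen:
        seen.add(name)
        chain.append(by_name[name])
        name = by_name[name].get(q(STYLE, "parent-style-name"))

    merged = {}
    for style in reversed(chain):  # ancestors first, so children override
        props = style.find(q(STYLE, "properties"))
        if props is not None:
            for key, value in props.attrib.items():
                # Keep only the local name; this ignores the (unlikely) case
                # of fo: and style: attributes sharing a local name.
                merged[key.split("}")[-1]] = value
    return merged


SAMPLE = """<office:styles
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <style:style style:name="Text body" style:family="paragraph">
    <style:properties fo:font-family="Times" fo:font-size="12pt"
                      fo:margin-bottom="0.0835inch"/>
  </style:style>
  <style:style style:name="Source Code" style:family="paragraph"
               style:parent-style-name="Text body">
    <style:properties fo:font-family="Courier" fo:margin-left="0.25inch"
                      fo:font-size="11pt"/>
  </style:style>
</office:styles>"""

if __name__ == "__main__":
    print(effective_properties(ET.fromstring(SAMPLE), "Source Code"))
```

In this sample, "Source Code" ends up with its own font and left margin plus the bottom margin it inherits from "Text body".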

Why would we want to do this when we can print directly from OpenOffice? I work in content management and application serving (see my XML.com article on AxKit), and some of my clients would like to be able to use an ordinary word processor to create content. By doing some preprocessing, and then passing the output to FOP or another XSL FO processor, we can generate PDF files automatically from content saved into the web hierarchy. (This functionality isn't yet available, but please get in touch if this sort of thing interests you.)

It is again worth noting here that where XSL FO did not have an attribute equivalent to the internal implementation in OpenOffice, Sun has defined its own attributes in one of the OpenOffice namespaces.

Automatic Styles

 
<office:automatic-styles>
 <style:style style:name="P1" style:family="paragraph" 
         style:parent-style-name="Title">
   <style:properties fo:font-family="Arial" 
           fo:font-style-name="" fo:font-size="18pt" 
           fo:font-weight="bold" fo:text-align="end" 
           style:justify-single-word="false" 
           fo:margin-top="0.16inch" 
           fo:margin-bottom="0.0835inch"/>

How do you allow people to use styles, yet also allow local modifications to the fonts, weight, and so on? OpenOffice does this by injecting an automatic style between the text and the real style, so that rather than

"Some text" -> is of style -> "Title"

we have

"Some text" -> is of auto style -> "P1" -> parent style -> "Title"

OpenOffice has a section called <office:automatic-styles> following the main style definitions. Within the main body of the document, only automatic styles are used.

There are some changes going on in this area presently. The current CVS builds at Sun only use automatic styles when local modifications to the formatting have been made. For example, suppose we make some text bold within a paragraph. OpenOffice uses the automatic style to define that hard formatting with a span. The result might look something like

 
<text:p text:style-name="Text body">Some text with 
a <text:span text:style-name="T1">bold</text:span> word 
in it.</text:p>
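A tool that cares about the author's intent ("this is a Title") rather than the hard formatting needs to step back out of the automatic styles. The following Python sketch follows the parent links until it reaches a name that is not an automatic style. As before, the namespace URIs are assumed and named_style is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (not given in the article); check your own files.
OFFICE = "http://openoffice.org/2000/office"
STYLE = "http://openoffice.org/2000/style"


def q(namespace, local_name):
    """Build the {namespace}local form that ElementTree uses for names."""
    return "{%s}%s" % (namespace, local_name)


def named_style(doc_root, style_name):
    """Follow parent links out of the automatic styles to a named style.

    Returns None when an automatic style has no parent, which is what
    pure hard formatting (such as a bold span) looks like.
    """
    automatic = {}
    section = doc_root.find(q(OFFICE, "automatic-styles"))
    if section is not None:
        for style in section.findall(q(STYLE, "style")):
            name = style.get(q(STYLE, "name"))
            if name is not None:
                automatic[name] = style.get(q(STYLE, "parent-style-name"))

    seen = set()
    while style_name in automatic and style_name not in seen:
        seen.add(style_name)  # guard against a loop of parents
        style_name = automatic[style_name]
    return style_name


SAMPLE = """<office:document
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <office:automatic-styles>
    <style:style style:name="P1" style:family="paragraph"
                 style:parent-style-name="Title">
      <style:properties fo:font-size="18pt"/>
    </style:style>
    <style:style style:name="T1" style:family="text">
      <style:properties fo:font-weight="bold"/>
    </style:style>
  </office:automatic-styles>
</office:document>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for name in ("P1", "Text body", "T1"):
        print(name, "->", named_style(root, name))
```

Names that are already real styles, such as "Text body", pass through unchanged, so the same function should cope with both the build 605 output and the newer CVS behaviour.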

Main Body Text

 
<office:body>

Finally we get to the main body of the document. Of all the sections, it's probably the simplest to follow. We will address each of the major tags in turn. Unlike other sections, the body text is free-form, so the following tags can appear anywhere within the <office:body> section.

Headings

 
<text:h text:style-name="P4" text:level="1">The format
itself</text:h>

Headings are defined with the <text:h> tag. Build 605 export uses an automatic style, whereas with later builds it will likely be

 
<text:h text:style-name="Heading 1">The format itself</text:h>

Paragraphs

 
<text:p text:style-name="P3">Some text</text:p>

Paragraphs of text are defined with the <text:p> tag. We can now start to see that some of the tags in the body are similar to HTML, albeit in a different namespace.

Spans

 
<text:span text:style-name="T6">spanned text</text:span>

OpenOffice spans are exactly the same as spans in HTML. They delimit an inline section of a paragraph, applying alternate styling to the spanned text.

Lists

Lists are defined by tags with names similar to those used in DocBook. Specifically these are <text:ordered-list>, <text:unordered-list> and <text:list-item>.
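Because headings, paragraphs, spans and lists line up so closely with HTML, a first cut at a converter is mostly a lookup table. The Python sketch below is only an illustration of that mapping, not a complete converter: it assumes the text namespace URI shown, assumes list items wrap their content in <text:p>, drops style names entirely, and keeps the text of any element it does not recognise.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed namespace URI (not given in the article); check your own files.
TEXT = "http://openoffice.org/2000/text"

# OpenOffice text element -> HTML element (headings are handled separately).
HTML_TAGS = {
    "p": "p",
    "span": "span",
    "ordered-list": "ol",
    "unordered-list": "ul",
    "list-item": "li",
}


def to_html(element):
    """Recursively convert an element and its children to an HTML string."""
    tag = None
    if element.tag.startswith("{%s}" % TEXT):
        local = element.tag.split("}", 1)[1]
        if local == "h":
            level = int(element.get("{%s}level" % TEXT, "1"))
            tag = "h%d" % min(max(level, 1), 6)  # HTML only has h1 to h6
        else:
            tag = HTML_TAGS.get(local)

    inner = escape(element.text or "")
    for child in element:
        inner += to_html(child) + escape(child.tail or "")

    if tag is None:
        return inner  # unknown element: keep its text, drop the markup
    return "<%s>%s</%s>" % (tag, inner, tag)


SAMPLE = (
    '<office:body xmlns:office="http://openoffice.org/2000/office"'
    ' xmlns:text="http://openoffice.org/2000/text">'
    '<text:h text:style-name="P4" text:level="1">The format itself</text:h>'
    '<text:p text:style-name="P3">Some '
    '<text:span text:style-name="T6">spanned text</text:span> here.</text:p>'
    '<text:unordered-list><text:list-item>'
    '<text:p>One item</text:p>'
    '</text:list-item></text:unordered-list>'
    '</office:body>'
)

if __name__ == "__main__":
    print(to_html(ET.fromstring(SAMPLE)))
```

A fuller version would carry the style names across as class attributes, so that the styles section could be turned into a stylesheet.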

Graphics

Vector graphics can be embedded directly into an OpenOffice document, which is a nice feature, and you will be even more pleased to know that OpenOffice uses SVG as its native vector graphics format. These vector graphics can occur directly within the flow of the document body. Daniel Vogelheim informed me, however, that this is only mostly correct: the format is "mostly SVG". There are some things that OpenOffice can do with graphics that SVG does not define, so again Sun has extended the format using elements in its own namespace.
