XML.com: XML From the Inside Out

If you are using any word processor or editor in a group situation, such as a technical writing team or an office, then it will probably be in your interest to set up templates for authors to use. Templates ensure consistency, reduce effort, and help automate conversion of documents between formats, such as building web pages from office documents. If you are also trying to store and manipulate content in XML but want to use a word processing environment for authoring, then well-crafted templates are even more important.

In this article, I'm going to explore some of the ways that OpenOffice.org's Writer application (I'm using version 1.1.2 on Linux and 1.1.3 on Windows XP) is open to customization and configuration. I'll walk through some of the techniques I used to set up the first templates I built with the application in my quest for an interoperable, XHTML-ready system of templates and styles which will work across Microsoft Word and Writer.

Here are four techniques you might like to use if you are maintaining templates: (a) using an unzip tool to rip open the Writer file format and get at the parts, (b) using XSLT to automate production of a large set of styles, (c) adding a keyboard-accessible menu to apply those styles, and (d) automatically generating a number of macros to help in (c). I will illustrate the techniques using my application, but they are easily adaptable to other situations.

But first, here's a bit of background. Tim Bray wrote earlier this year about the state of word processing applications for the web.

If everyone's going to write for the web (and it looks like a lot of people are going to), we need the web equivalents of Word Perfect and Wordstar and Xywrite and Microsoft Word, and we need them right now.

Some discussion flowed around this, with some claiming that OpenOffice.org is an adequate solution right now (see Tim's addition to his page) and others speculating that a new application may be required. A wiki even appeared in which the issue could be discussed. I joined in the discussion and decided that OpenOffice.org and Word are both part of the solution until something better comes along, kicking off a project to create configuration layers for both Word and OpenOffice.org as general purpose XHTML editors for generic documents. In the course of the work, I have come to appreciate the open, XML-based goodies in OpenOffice.org, which is just as well, because I would not like to customize it using the Graphical User Interface (GUI) or deal with its macro language, although I look forward to decent Python scripting, which appears to be on the way.

OpenOffice.org is an open source office suite, which includes a pretty decent word processor, Writer. Like any decent word processor, it has a number of customization options, and like any software, it has its own set of strengths and weaknesses. It does have a customizable XSLT stylesheet that can be used to generate XHTML from any word processing document, but this produces far from ideal output unless you go to some lengths to customize it, as it is simply impossible to produce sensible mappings from word processing documents to XHTML in all cases. Templates are a necessity to enable authors to work with a set of styles that will map to XHTML. (Another major issue is that unless you actually run the XHTML export stylesheet manually after you have saved the document in the normal way and extracted the content, you do not even get access to the images in your document. So at this stage, I consider the XHTML export to be a work in progress.)

Hack 1: Unzipping and Manipulating Writer Files

Let's start with the basics: the file format. You can read about it in detail in a forthcoming O'Reilly title, which is available online in draft: OpenOffice.org XML Essentials: Using OpenOffice.org's XML Data Format. We're only concerned with the Writer application here, rather than spreadsheets and suchlike. We will be dealing with Writer documents (.sxw) and Writer templates (.stw), mostly the latter. These files are both actually ZIP files containing all your document data, with all the configuration and textual content in XML.

Three Ways to Unzip Files Using Windows

  1. On a Windows XP system, you can use the built-in zip function by changing the name of an OpenOffice.org file to end in .zip, at which point you can right-click on the file to explore it as though it is a directory or extract it to another location.

  2. Or use any old zip application, possibly adding to the file associations so that you can unzip OO.o files with a right-click.

  3. An approach I like is to grab the UnxUtils utilities, which are ports of GNU utilities to Windows.

    Download UnxUtils.zip, unpack to c:\Program Files\unxutils, then add the path to the binaries C:\Program Files\unxutils\usr\local\wbin to your system path.

    This gives you a selection of GNU staples for use on Windows, very handy for people like me who keep typing ls in Windows instead of dir, not to mention being able to use zip and unzip from the command line exactly as I have shown in this article.

First hack, a quick exercise:

  1. Create a new OO.o text document and type in it, something like "Hello world".

  2. Save your new one-paragraph epic as test.sxw.

  3. Unzip the content. On a Unix-esque system (Windows users, see the sidebar), you can probably type this: unzip -d test test.sxw

    And you will be rewarded with some component files in a directory called "test":

    
    extracting: test/mimetype
    inflating: test/content.xml
    inflating: test/styles.xml
    extracting: test/meta.xml
    inflating: test/settings.xml
    inflating: test/META-INF/manifest.xml
    
  4. Open up the content.xml file in a text editor or an XML editor. This is where the, um, content of your document is kept.

  5. Ignore everything except the part you just created. You'll find it in a text:p element, which is what? A paragraph.

  6. Duplicate your "Hello world" paragraph.

  7. Save content.xml

  8. And re-zip it back together as an OpenOffice.org document, possibly by changing into the test directory and typing zip -r ../newdoc.sxw * to give you a new document called newdoc.sxw.

  9. If you have been careful not to break the document, then you will have a new Writer document with "Hello world" in it twice.

Now you're hacking OpenOffice.org. Why? You might like to automate some kinds of document processing, create documents, or in an extreme situation, make changes to a document when you don't have a copy of OpenOffice.org. Try that with a Word ".doc" file! (Actually, don't. See my previous article on how to turn Word documents into XML.)
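
The exercise above can also be scripted. Here is a minimal sketch in Python, which is not part of the toolchain used in this article: it copies a Writer package entry by entry and duplicates the first text:p element whose text contains a given string. The function name duplicate_paragraph and its parameters are made up for illustration, and the code assumes that paragraphs carry the text: prefix seen in content.xml.

```python
import zipfile
from xml.dom import minidom


def duplicate_paragraph(src_path, dst_path, needle="Hello world"):
    """Copy a Writer package, duplicating the first text:p that contains needle."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "content.xml":
                dom = minidom.parseString(data)
                # Assumption: paragraphs use the "text:" prefix, as in the article.
                for para in dom.getElementsByTagName("text:p"):
                    text = "".join(
                        node.data
                        for node in para.childNodes
                        if node.nodeType == node.TEXT_NODE
                    )
                    if needle in text:
                        # Insert a deep copy right before the original paragraph.
                        para.parentNode.insertBefore(para.cloneNode(True), para)
                        break
                data = dom.toxml(encoding="UTF-8")
            # Reusing the ZipInfo keeps each entry's name and compression method.
            dst.writestr(info, data)
```

Calling duplicate_paragraph("test.sxw", "newdoc.sxw") should give the same result as steps 3 to 8, with the entries written back in their original order.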

Hack 2: Adding Styles to a Template

Next step is to do some real work, this time on a template. We're going to make a whole lot of styles. A style is a named set of formatting instructions, so you can make parts of your document look and function alike with the application of a single named label, rather than having to laboriously hand-format each part of the document. Instead of having to remember that all your headings are 18-point Helvetica, you assign a heading style to each and let the machine format them for you. This is (a) lazier, (b) easier to change when Helvetica goes out of fashion, (c) going to let you build a table of contents simply by harvesting anything labeled as a heading, (d) going to make generating XHTML easy, and (e) highly recommended.

So here's the spec for this application, where we want to transform Writer documents into XHTML. We need styles for headings, ordered and unordered lists with different flavors of numbering, block-quote styles for quoting blocks of text at different levels of indenting, and paragraphs that can be nested to continue a list item. Using these styles, we will be able to reliably create XHTML documents from both Microsoft Word and OpenOffice.org in a fairly consistent manner. Word processors are really only good at flat sequences of paragraphs, but we can use well-designed styles to create nesting for XHTML.

Family            Type                      Style names by level
                                            1     2     3     4     5
Paragraph (p)                               p
Heading (h)                                 h1    h2    h3    h4    h5
Heading (h)       Numbered (#)              h1#   h2#   h3#   h4#   h5#
List item (li)    Numbered (#)              li1#  li2#  li3#  li4#  li5#
List item (li)    Bullet (*)                li1*  li2*  li3*  li4*  li5*
List item (li)    Uppercase Alpha (A)       li1A  li2A  li3A  li4A  li5A
List item (li)    Lowercase Alpha (a)       li1a  li2a  li3a  li4a  li5a
List item (li)    Lowercase Roman (i)       li1i  li2i  li3i  li4i  li5i
List item (li)    Uppercase Roman (I)       li1I  li2I  li3I  li4I  li5I
List item (li)    Continuing paragraph (p)  li1p  li2p  li3p  li4p  li5p
Blockquote (bq)                             bq1   bq2   bq3   bq4   bq5
Definition List   Term (dt)                 dt1   dt2   dt3   dt4   dt5
Definition List   Description (dd)          dd1   dd2   dd3   dd4   dd5

I will leave detailed discussion of how this mapping from list styles to XHTML will be done for another time, but I do provide a couple of examples here so you get the flavor. The items in brackets are the style names that you would use in the word processor. The example would look pretty much the same in OO.o as it does here in XHTML give or take a bit; check out the source of this page to see the HTML:

  • (Style: li1*) A list bullet

  • (li1*) And another

    1. (li2#) And a numbered item

      (li2p) With a follow-on paragraph

    2. (li2#) And another numbered item

  • (li1*) And another list item introducing a quote:

    (bq2) From somebody else.

(These style names have been chosen for their brevity, regularity, and the fact that they do not overlap with built-in or "standard" styles in either OO.o or Word, making the job of converting between formats simpler.)
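
Because the names are so regular, a script can split them back into family, level, and type. The following Python sketch shows one way this might be done; the function name parse_style_name and the regular expression are illustrative assumptions based only on the table above, not part of the templates themselves.

```python
import re

# family, optional level digit, optional type marker, as laid out in the table
STYLE_NAME = re.compile(r"(p|h|li|bq|dt|dd)(\d)?([#*AaiIp])?")


def parse_style_name(name):
    """Return (family, level, type) for a style name, or None if it does not fit."""
    match = STYLE_NAME.fullmatch(name)
    if match is None:
        return None
    family, level, kind = match.groups()
    return family, int(level) if level else None, kind
```

For example, parse_style_name("li2#") gives ("li", 2, "#"), while a built-in style name such as "Heading 1" gives None, which is one way to tell these styles apart from the standard ones.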

That's a lot of styles to set up using the point'n'click method, way too much like work for me, so my approach was to create a blank template, open it up to see how it worked, and then use XSLT to hack the styles.xml inside a Writer template file (.stw) which contains, you guessed it, definitions of the styles for this template. I did create the heading and plain-paragraph styles by hand using the GUI, but the lists were too fiddly to do that way.

For this part of the exercise, we are going to be operating on a template rather than a document. To get a template:

  1. Open a blank document in OO.o.

  2. From the File menu, select Save As.

  3. From the "Save as type" drop-down, select "OpenOffice.org 1.0 Template".

  4. Type a name, template, and the result will be a new file called template.stw.

Unzip the template into a directory called template (unzip -d template template.stw).

To add styles, we want to transform styles.xml using a stylesheet which you can get here.

  1. Copy styles.xml to old-styles.xml

  2. On my Fedora Core 2 Linux machine, the transformation is a matter of typing:

    xsltproc --novalid add-styles.xsl old-styles.xml > styles.xml

    See the sidebar for advice about how to run transforms using Windows.

Using XSLT from the Command Line on Windows
or Elsewhere Using Java

The hardest part of writing this article was finding a simple way to use XSLT from the command line on Windows. The most promising candidate is called nxslt; it uses .NET, which is really easy to install using Windows Update, but for some reason it doesn't work for these OpenOffice.org hacks. So my best recommendation, if you don't want to go through a great deal of mucking around, is to take the advice in this xml.com article and use Saxon, which apparently means, in modern times, that you need to get yourself a recent Java runtime environment, probably from Sun. I navigated that maze, then downloaded Saxon 6.5.3, unzipped it into c:\Program Files\saxon, and I was able to run stylesheets like so (adjust all the paths as required):


java -jar "c:\Program Files\saxon\saxon.jar" old-styles.xml add-styles.xsl > styles.xml

No guarantees, but if you take this option, then all you need to do is reverse the order of the parameters in the examples here; input document first rather than stylesheet.

If you want to check out the result, then skip ahead to the part where you re-constitute a template.

Unfortunately, Saxon does not have an option to turn off validation in the source file. You will need to figure out how to get it to see the DTD files, possibly by the brute force approach of copying them into wherever you're working. Failing that, simply remove the DOCTYPE declaration from the source file to stop Saxon from looking for it, then put it back in the result. (We didn't call this article "Hacking Open Office" for nothing.) That is, cut and paste this bit:

<!DOCTYPE office:document-styles PUBLIC
"-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
"office.dtd">

If anybody has better ideas about simple-to-install XSLT processors, please comment below.
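
The DOCTYPE cut-and-paste described in this sidebar could be automated. This is a minimal Python sketch under stated assumptions: the function names strip_doctype and restore_doctype are hypothetical, and the pattern only handles a DOCTYPE without an internal subset, like the one shown above.

```python
import re

# Matches a DOCTYPE declaration that has no internal subset ("[ ... ]").
DOCTYPE = re.compile(r"<!DOCTYPE[^>\[]*>\s*")
XML_DECL = re.compile(r"\s*<\?xml[^>]*\?>\s*")


def strip_doctype(xml_text):
    """Return (text without the DOCTYPE, the removed DOCTYPE or '')."""
    match = DOCTYPE.search(xml_text)
    if match is None:
        return xml_text, ""
    return xml_text[:match.start()] + xml_text[match.end():], match.group(0)


def restore_doctype(xml_text, doctype):
    """Put a saved DOCTYPE back, just after the XML declaration if there is one."""
    if not doctype:
        return xml_text
    match = XML_DECL.match(xml_text)
    position = match.end() if match else 0
    return xml_text[:position] + doctype + xml_text[position:]
```

The idea would be to call strip_doctype on old-styles.xml before handing it to the XSLT processor, and restore_doctype on the generated styles.xml afterwards.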

I will cover only the highlights of the XSLT template here.

The first thing we need to do is to add style definitions. We do this by finding the beginning of the place where the outline styles are defined, using a template with an appropriate match attribute, and slipping in some other styles first.


<xsl:template match="text:outline-style">
  <!-- Add new paragraph styles here -->
  <xsl:call-template name="make-styles">
    <xsl:with-param name="family">li</xsl:with-param>
    <xsl:with-param name="type">*</xsl:with-param>
  </xsl:call-template>

This calls a named template make-styles, which takes as parameters the family and type of style, as set out in the table above. This template is used recursively to generate five levels of style definition.

The recursion starts with a default level parameter of 5, and then it calls itself, passing $level - 1 to the level parameter until at $level = 0 it stops. The result is the same as a construct like a for-loop.


<xsl:template name="make-styles">
  <xsl:param name="family" select="'li'"/>
  <xsl:param name="type" select="'*'"/>
  <xsl:param name="level">5</xsl:param>
  <xsl:param name="style-name" select="concat($family, $level, $type)"/>
  <xsl:choose>
    <xsl:when test="$level = 0">
      <!--We're done-->
    </xsl:when>
    <xsl:otherwise>
      <!--Recurse-->
      <xsl:call-template name="make-styles">
        <xsl:with-param name="level" select="$level - 1"/>
        <xsl:with-param name="type" select="$type"/>
        <xsl:with-param name="family" select="$family"/>
      </xsl:call-template>

Which is followed by the part that actually makes the style:


<style:style style:name="{$style-name}" style:family="paragraph"
    style:parent-style-name="Default" style:list-style-name="{$style-name}">
  <xsl:choose>
    <xsl:when test="$family = 'dt'">
      <xsl:attribute name="style:next-style-name">
        <xsl:value-of select="concat('dd',$level)"/>
      </xsl:attribute>
      <style:properties
          text:space-before="{($level - 1)}cm"
          fo:margin-left="{($level - 1)}cm"
          fo:margin-right="0cm"
          fo:text-indent="0cm"
          fo:font-weight="bold"
          style:auto-text-indent="false"/>
    </xsl:when>
    <xsl:when test="$family = 'bq'">
      <style:properties
          text:space-before="{$level}cm"
          fo:margin-left="{$level}cm"
          fo:margin-right="0cm"
          fo:text-indent="0cm"
          fo:font-style="italic"
          style:auto-text-indent="false"/>
    </xsl:when>
    <xsl:when test="$type = 'p' or $family='dd'">
      <style:properties
          text:space-before="{($level)}cm"
          fo:margin-left="{($level)}cm"
          fo:margin-right="0cm"
          fo:text-indent="0cm"
          style:auto-text-indent="false"/>
    </xsl:when>
  </xsl:choose>
</style:style>

This has an xsl:choose to select different formatting for different families of paragraph style. Bulleted and numbered styles don't get any formatting in this part, as their indenting and so on is set further down in the named template make-lists.
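
To make the logic of make-styles easier to follow, here is a rough Python restatement of the fragments above. It is only a sketch: the function names are invented, the properties are returned as a plain dictionary rather than written out as XML, and the style:next-style-name attribute for dt styles is left out.

```python
def style_properties(family, kind, level):
    """Mirror the xsl:choose: formatting for one style, or {} if it gets none here."""
    if family == "dt":
        indent, extra = level - 1, {"fo:font-weight": "bold"}
    elif family == "bq":
        indent, extra = level, {"fo:font-style": "italic"}
    elif kind == "p" or family == "dd":
        indent, extra = level, {}
    else:
        # Bulleted and numbered styles are formatted in make-lists instead.
        return {}
    props = {
        "text:space-before": f"{indent}cm",
        "fo:margin-left": f"{indent}cm",
        "fo:margin-right": "0cm",
        "fo:text-indent": "0cm",
        "style:auto-text-indent": "false",
    }
    props.update(extra)
    return props


def make_styles(family="li", kind="*", level=5):
    """Recurse from level down to 0, then emit the styles for levels 1 to level."""
    if level == 0:
        return []  # We're done
    styles = make_styles(family, kind, level - 1)  # Recurse first
    styles.append((f"{family}{level}{kind}", style_properties(family, kind, level)))
    return styles
```

Because the recursive call comes before the style is emitted, make_styles("li", "*") yields li1* through li5* in ascending order, which is the same result a for-loop over the levels would give.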

Hint: You can do a lot with OpenOffice.org via experimentation; use the GUI to set up some styles, save the document, and have a peek inside to see what happens. Then you can extract the relevant bits and use them in stylesheets or other code.

Writer not only has styles for paragraphs and sub-paragraph text-spans, but it has separate styles for lists. This can cause a few headaches, because the correspondence between the two is a bit fluid. You can link a paragraph style to a list style, but that does not prevent you from later choosing a different list style. And more problematically, each list can have multiple levels. (Yes, I have heard of conditional styles, and no, I don't think they will help in this case.)

For the project I'm presenting here, the two goals are to (a) inter-operate with Microsoft Word, via Word .doc files, which Writer is fairly good at reading and writing, and (b) create a template that can later be used to create good-quality XHTML. OO.o's list styles will cause problems for Word, which has a tighter mapping between list levels and paragraph styles and a looser way of combining them. There will also be trouble when creating XHTML. The problem is that in 'normal' use of OO.o, it is very easy to end up with paragraphs that are not formatted exclusively with styles. For example, if you want to mix unordered lists and blockquotes, then you could end up with a very complex set of interactions between list and paragraph styles and custom formatting that a stylesheet may not be able to reliably decode.

So, my approach is to try to work with a one-to-one mapping between paragraph styles and list styles. This is a compromise, but it means that authors can work with paragraph styles exclusively. This is achieved by creating a list for each of our paragraph styles that has bullets or numbering and then setting all the levels in that list to have the same formatting, so that it does not matter if they inadvertently get changed.

Working with Lists in Writer

When you have the insertion point inside a list, two things happen that you need to be aware of:

  1. To the right of the 'Object Bar' (toggle it on and off under View / Toolbars / Object Bar to see what I mean), a left-facing arrow will appear.

    Click the arrow, and the toolbar is replaced with a list-specific set of buttons for changing the level of list items within the list and for restarting numbering.

    You will need the Restart Numbering button, though, to force numbering to restart when you begin a new instance of a list. You may like to use View / Toolbars / Customize to add the restart numbering button to the main object bar (it's under the Numbering category).

  2. An item will appear in the status bar, bottom right of your Writer window, for example, Level 1 : li1*.

    I have designed all the list styles presented in this article to have the same formatting at all levels, so clicking the level-changing buttons will have no visible effect.

Finally, sometimes applying a paragraph style that is linked to a list style does not have the desired effect. In this case, you may need to click on the Numbering On/Off or Bullets On/Off buttons a couple of times to clear an existing list.

The final step in this hack is to re-constitute the template. Zip the contents of the directory back into a template:


cd template
zip -r ../new.stw *

Open the resulting new.stw using Writer, via File / Templates / Edit (not via File / Open, which will create a new instance document).
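
If zip is not available, the same re-packaging step can be sketched in Python. The function name zip_package is hypothetical. As a precaution that the zip -r command above does not take, this version writes the mimetype entry first and uncompressed, mirroring the earlier unzip listing in which mimetype is reported as "extracting" (stored) rather than "inflating".

```python
import os
import zipfile


def zip_package(src_dir, out_path):
    """Zip the contents of src_dir into out_path (keep out_path outside src_dir)."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as package:
        mimetype = os.path.join(src_dir, "mimetype")
        if os.path.exists(mimetype):
            # Stored, not deflated, and written before everything else.
            package.write(mimetype, "mimetype", compress_type=zipfile.ZIP_STORED)
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Archive names are relative to src_dir and use forward slashes.
                arcname = os.path.relpath(full_path, src_dir).replace(os.sep, "/")
                if arcname != "mimetype":
                    package.write(full_path, arcname)
```

For the template in this hack, that would be zip_package("template", "new.stw"), run from the directory that contains the template directory.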

An alternative technique you might like to consider is to import styles from your new template into an existing one, which means you could maintain several templates containing discrete sets of styles (lists, headings, character styles). To import, use Format / Style / Load, and browse to a file. You can select which kinds of styles to import and whether to overwrite existing ones.

We have now covered two techniques for OpenOffice.org customization: unzipping documents and templates and adding styles by hacking the styles file.

Hack 3: Adding a Styles Menu with Keyboard Shortcuts

Now that we have a new template, it is possible to apply the new paragraph styles using the 'Stylist' (hit F11 to toggle it on and off), and largely ignore the list styles unless you get into trouble. But applying styles in OO.o is painful. There's no simple way to map styles to keystrokes, and even the Stylist does not let you use the keyboard to help select the style you want. The next stage is to show how you can add a new 'Style' menu to the application, with keyboard shortcuts.

The first problem is that there is no way to add a style to a menu directly. First we have to add a macro and then call that from the menu. And not just one macro: we need a macro for each and every style. Fortunately, we can automate this process. We will tackle the problem by starting with the menu, then using the menu to generate the required macros. This approach means that if you want to hand-code all or part of a menu, you can still use the stylesheet here to generate macros for each style mentioned in the menu.

This is what the new menu will look like:

[Figure: the new hierarchical style menu, with keyboard access via ALT key combinations.]

OO.o has a configuration system for changing menus. It is very hard to use, and poorly implemented, so we will spend as little time in it as possible. All we need to do is make one small change to the main menu, and OO.o will save it as XML in the configuration directory, at which point you can grab it and hack it using XSLT, or manually add to it.

  • Open Writer.
  • From the Tools menu choose Configure.
  • Click the Menu tab.
  • Hit the New Menu button.
  • Close the dialog box.

What you have just done is make a change to OO.o's configuration, which it will write out into a configuration directory. Where that is will depend on your operating system. To find out where:

  • From the Tools menu, choose Options.
  • Under OpenOffice.org, in the list of categories at the left, select Paths.
  • Find and note down the path for User configuration.
  • Close OO.o completely, including the quick start application it leaves in the Windows system tray.
  • Find the user configuration directory you just wrote down, and there should be a file called writermenubar.xml

I have supplied a sample stylesheet that works in a way that is very similar to the setup stylesheet covered earlier. It generates a hierarchical menu of each of the families of styles, adding them to the old menu bar and spitting out a new menu bar. Parts of this are hard-coded to provide the menu hierarchy, but there are recursive parts to handle the repetition involved in creating all those macro calls. Here is a fragment of the stylesheet that creates the menu for 'li' styles; there is a level parameter used here as in the previous stylesheet.


<menu:menu menu:id="slot:{$level}" menu:label="Level ~{$level} - li{$level}">
<menu:menupopup>
<menu:menuitem 
    menu:id="macro://./Standard.WPInteropStyles.li{$level}bull()" 
    menu:label="Bullet {$level} - li{$level}~*"/>
<menu:menuitem 
    menu:id="macro://./Standard.WPInteropStyles.li{$level}num()" 
    menu:label="Numbered {$level} - li{$level}~#"/>
<menu:menuitem 
    menu:id="macro://./Standard.WPInteropStyles.li{$level}p()" 
    menu:label="Paragraph {$level} - li{$level}~p"/>
...
</menu:menupopup>
</menu:menu>

This stylesheet is designed to load, via document(), a data file (wp-interop-styles.xml) containing names for all the character or sub-paragraph text styles. I generated this list by grabbing all such element names from the XHTML recommendation and putting them into an XML data file. (To use this as-is, you will have to either set these styles up by hand, download the latest sample template from my web site, or add to the setup stylesheet covered earlier.)

The ~ character is used to indicate the appropriate keyboard shortcut.
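
The pattern in the menu fragment can be restated in a few lines of Python, which may help when checking the generated menu by hand. This is a sketch only: the function names are invented, and only the three menu items shown in the fragment are covered.

```python
MACRO_PREFIX = "macro://./Standard.WPInteropStyles."

# (menu title, macro name suffix, style type marker), as in the fragment above
LI_KINDS = [("Bullet", "bull", "*"), ("Numbered", "num", "#"), ("Paragraph", "p", "p")]


def li_menu_items(level):
    """Return (menu:id, menu:label) pairs for the 'li' styles at one level."""
    return [
        (f"{MACRO_PREFIX}li{level}{suffix}()", f"{title} {level} - li{level}~{mark}")
        for title, suffix, mark in LI_KINDS
    ]


def shortcut_key(label):
    """Return the character that follows ~ in a menu label, or None."""
    position = label.find("~")
    if position == -1 or position + 1 >= len(label):
        return None
    return label[position + 1]
```

Note that the macro names use bull and num because the * and # characters in the style names are not usable there, while the labels keep the real style names so that the shortcut key matches the type marker.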

  1. Rename writermenubar.xml to old-writermenubar.xml

  2. Run the stylesheet:

    xsltproc --novalid generate-menubar.xsl old-writermenubar.xml > writermenubar.xml

    adjusting the paths to the various files as necessary.

Now we have a new menu for OO.o which will always be visible. If you start up Writer (remember to shut down OpenOffice.org completely first), you will be able to point and click to apply styles or use the keyboard, starting with ALT-S and then hitting the underlined characters to delve into the menus (at least you will once we install the macros needed to apply styles).

Hack 4: Generating Macros

The final stylesheet uses the menubar we just generated as its input and creates a text-output (not XML) that can be pasted into OO.o's macro editor.

To install the macros generated by our stylesheet:

  • Run the make-macros.xsl stylesheet:

    xsltproc --novalid make-macros.xsl writermenubar.xml > macros.txt

    This creates a main subroutine called SetStyle which takes a style name as an argument and applies the style.

  • Open the template, via File / Templates / Edit (not via File / Open, which will create a new instance document).

  • From the Tools menu choose Macros, then Macro...

    You should now be looking at a tree control, showing the various documents and templates you have open.

  • Click on new template / Standard to select the standard library of macros. This will probably be empty depending on your starting setup.

  • Click New, to create a new module, and name it WPInteropStyles.

    You should now be looking at a macro-editor window.

  • Paste the contents of macros.txt into the macro editor, replacing all the boilerplate code that's in there.

  • Clickety-click your way out of the macro editor, and save the template.

A bit of detective work will show you where the macros live within the file format once you save. Hint: look in META-INF/manifest.xml to see where your macros are stored within the Writer file-package.
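
The make-macros.xsl stylesheet itself is not reproduced in this article. To give a feel for the idea, here is a rough Python sketch, under stated assumptions, of how per-style macro stubs might be derived from the menu bar file: it assumes the menu: prefix and the label format shown in the Hack 3 fragment, the function name macro_stubs is made up, and it does not generate the SetStyle subroutine.

```python
from xml.dom import minidom

MACRO_PREFIX = "macro://./Standard.WPInteropStyles."


def macro_stubs(menubar_xml):
    """Return Basic stubs, one per macro menu item, each calling SetStyle."""
    dom = minidom.parseString(menubar_xml)
    stubs = []
    # Assumption: menu items use the "menu:" prefix, as in the fragment shown earlier.
    for item in dom.getElementsByTagName("menu:menuitem"):
        menu_id = item.getAttribute("menu:id")
        if not menu_id.startswith(MACRO_PREFIX):
            continue  # an ordinary menu entry, not one of our macros
        macro_name = menu_id[len(MACRO_PREFIX):].rstrip("()")
        # The style name is the last part of the label, minus the ~ marker.
        label = item.getAttribute("menu:label")
        style_name = label.rsplit(" - ", 1)[-1].replace("~", "")
        stubs.append(f'Sub {macro_name}\n    SetStyle("{style_name}")\nEnd Sub')
    return "\n\n".join(stubs)
```

Each menu item such as "Bullet 1 - li1~*" would become a small subroutine named li1bull that passes the style name li1* to SetStyle.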

In this article, I have covered a few techniques that will be of interest to template maintainers working with OpenOffice.org Writer: how to crack open the file format, how to maintain large sets of styles, and how to customize menus and macros, all without using anything except standard tools: zip, an XSLT processor, and a text editor. All this can, of course, be further automated with a programming language of some kind, even a batch file. There are some changes coming in version 2 of OpenOffice.org, but all these techniques will be forwards compatible, although some things like the location and name of the menu-bar files look like they will change.


