Opening Open Formats with XSLT

February 4, 2004

Bob DuCharme

This month I'm taking a break from covering XSLT 2.0 to describe how the combination of XSLT 1.0 and an application with an open XML format solved a problem for me. I solved this problem so quickly and easily that it got me thinking about how the combination of XSLT 1.0 and the growing number of open XML formats is opening up a world of simple, valuable new applications and utilities for us to write.

At XML 2003 I was thinking about presentations, slides, public speaking, and speaker notes. I've used my own little DTD (now a RELAX NG Compact schema) for years to create slides and slide notes, and I've written an XSLT stylesheet to convert them to HTML files of slides. I've written another stylesheet that provides a feature I find sorely missing from all slideshow packages: it pulls out only slide titles and slide comments into a single HTML document that I can print and hold in my hand when speaking. The only formatting I need in these slide comments is paragraph breaks and in-line bolding, because bolding key phrases in the notes makes it easier for me to glance at them and find the important points when I'm giving a talk.

In Jon Udell's keynote speech at the conference, he mentioned that the only Microsoft Office 2003 application that would lack an XML output option was the mail program, but he forgot another one, PowerPoint. I was looking forward to some sort of Save As XML feature in PowerPoint so that I could create the kind of speaker notes that I like from XML versions of PowerPoint presentations, and it looks like I won't get this ability for a while, at least not directly from Microsoft.

I decided to try it with OpenOffice, the free, open source, multi-platform office application suite. Once I saw the XML that its slide presentation program created, it took me less time to write a stylesheet that did exactly what I wanted than it took to download OpenOffice over the conference hotel's T1 line.

Looking at OpenOffice's XML

After installing OpenOffice, I didn't bother with any tutorials or user manuals. I just started up Impress, its slide presentation program, wrote a few slides and some notes to accompany them, and saved the file. The saved file's extension was "sxi," but I had heard that these were zip files, so I unzipped it and found a file called content.xml. (Instead of diving into the zipped XML head-first the way I did, you might want to first check out The OpenOffice XML File Format page and the XML FAQ that it links to for a little background on XML's role in the application.)

The content.xml file had no carriage returns. Instead of looking for the DTD, which you can easily find from the "XML File Format" page listed above, I just indented the elements to see their nesting structure (in Emacs, this is Ctrl-Alt-\ for most major modes, including James Clark's nxml, which I heartily recommend; a very short stylesheet also does the same thing). In the indented version, it was clear that the document had a lot of header information, so I searched for the first text I had entered into the first slide and found it in the first draw:page element of the document's office:body element, shown here with white space added:

<draw:page draw:name="page1" draw:style-name="dp1" 
           draw:id="1" draw:master-page-name="Default" 
           presentation:presentation-page-layout-name="AL1T0">
  <draw:text-box presentation:style-name="pr1" 
        draw:text-style-name="P1" draw:layer="layout" 
        svg:width="23.912cm" svg:height="3.508cm" 
        svg:x="2.058cm" svg:y="1.743cm" 
        presentation:class="title">
    <text:p text:style-name="P1">Title of slide 1
    </text:p>
  </draw:text-box>
  <draw:text-box presentation:style-name="pr2" 
        draw:text-style-name="P1" draw:layer="layout" 
        svg:width="23.912cm" svg:height="13.231cm" 
        svg:x="2.058cm" svg:y="5.838cm" 
        presentation:class="subtitle">
    <text:p text:style-name="P1">Text of slide 1
    </text:p>
  </draw:text-box>
  <presentation:notes>
    <draw:page-thumbnail draw:style-name="gr1" 
          draw:layer="layout" svg:width="12.768cm" 
          svg:height="9.576cm" svg:x="4.411cm" 
          svg:y="2.794cm" draw:page-number="1" 
          presentation:class="page"/>
    <draw:text-box presentation:style-name="pr3" 
          draw:text-style-name="P2" draw:layer="layout" 
          svg:width="15.021cm" svg:height="10.63cm" 
          svg:x="3.292cm" svg:y="13.299cm" 
          presentation:class="notes">
      <text:p text:style-name="P2">First par of notes for slide 1. 
      </text:p>
      <text:p text:style-name="P2"/>
      <text:p text:style-name="P2">Second par of notes. 
      <text:span text:style-name="T1">Bolded
      </text:span>
      <text:span text:style-name="T2"> text right there. 
      </text:span>
      </text:p>
      <text:p text:style-name="P2">
        <text:span text:style-name="T2"/>
      </text:p>
      <text:p text:style-name="P2">
        <text:span text:style-name="T2">End of first test.
        </text:span>
      </text:p>
    </draw:text-box>
  </presentation:notes>
</draw:page>
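
The "very short stylesheet" for indenting mentioned earlier isn't shown in this column. A minimal sketch of one, assuming an XSLT 1.0 processor that honors the indent hint on xsl:output, is an identity transform with indenting turned on. Treat its output as something to read, not to process further, because stripping and adding white space can alter the text of mixed content such as the text:p elements:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Ask the processor to indent the result; how it does so is
       up to the processor. -->
  <xsl:output method="xml" indent="yes"/>

  <!-- Drop whitespace-only text nodes so they don't interfere
       with the indenting. -->
  <xsl:strip-space elements="*"/>

  <!-- Identity rule: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```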

I made some guesses about where a stylesheet would consistently find the titles and notes for each slide and wrote the following stylesheet:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:draw="http://openoffice.org/2000/drawing"
                xmlns:text="http://openoffice.org/2000/text"
                xmlns:presentation="http://openoffice.org/2000/presentation"
                version="1.0">

  <xsl:template match="/">
    <html><head><title>Speaker Notes</title>
    <style>
      <xsl:comment>
        p {font-size: 10pt}
        h1 {font-family: arial; font-size: 12pt; font-weight: bold}
      </xsl:comment>
    </style>
    </head>
    <body>
      <xsl:apply-templates/>
    </body>
    </html>
  </xsl:template>

  <xsl:template match="draw:page">
    <h1>slide <xsl:number/>:
    <xsl:value-of select="draw:text-box[1]/text:p[1]"/></h1>
    <xsl:apply-templates select="presentation:notes"/>
  </xsl:template>

  <xsl:template match="text:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <xsl:template match="text:span[@text:style-name='T1']">
    <b><xsl:apply-templates/></b>
  </xsl:template>

</xsl:stylesheet>

The first template rule, which fires upon seeing the source document's root, does some basic HTML setup and calls xsl:apply-templates for the rest of the source tree, surrounding the result in a body element. The elements before the first draw:page element don't seem to have any PCDATA between their tags, so the XSLT default template rules don't add anything to the result tree for them. For the template rule that handles the draw:page elements, each of which corresponds to a slide, I guessed that the first text:p element of the first draw:text-box was the slide's title, so I added that to the result tree inside of an HTML h1 header element, with a prefix showing the slide's position in the presentation. The template rule for the draw:page element finishes by calling xsl:apply-templates for only the presentation:notes elements, skipping the slide's contents, because the point of the result document is to be just slide titles and corresponding notes.
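
The guess that the first text box holds the title depends on document order. In the sample above, the title's draw:text-box also carries presentation:class="title", so a variation on the draw:page template rule could select on that attribute instead. This is only a sketch based on that one sample: it assumes that Impress marks every title box this way, and it relies on the namespace declarations in the stylesheet above.

```xml
<!-- Possible drop-in alternative to the draw:page template rule.
     Assumes each slide's title box has presentation:class="title",
     as the sample slide does. -->
<xsl:template match="draw:page">
  <h1>slide <xsl:number/>:
  <xsl:value-of
      select="draw:text-box[@presentation:class='title']/text:p[1]"/></h1>
  <xsl:apply-templates select="presentation:notes"/>
</xsl:template>
```

If a slide has no text box with that class, the h1 element would contain only the slide number, so the positional version may still be the safer fallback.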

The template rule for the text:p element converts each one to an HTML p element. Based on the markup I found around the word "Bolded" that I bolded in my sample slide show, I guessed that OpenOffice rendered bold text with a start-tag of <text:span text:style-name="T1">, so I converted that and its corresponding end-tag to an HTML b element.
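
The name T1 looks like one that OpenOffice generated for this particular document, so another file might use a different name for bold text. One possible refinement is to look up what the named style actually does. The untested sketch below makes two assumptions that are not shown in this column: that content.xml declares each automatic style as a style:style element whose style:properties child carries fo:font-weight="bold", and that the style and fo prefixes are bound to the namespace URIs listed in the comment. Check both against your own content.xml before relying on them.

```xml
<!-- Assumed additions to the xsl:stylesheet start-tag:
       xmlns:style="http://openoffice.org/2000/style"
       xmlns:fo="http://www.w3.org/1999/XSL/Format"       -->

<!-- Top-level element: index the bold style definitions by name. -->
<xsl:key name="bold-styles"
         match="style:style[style:properties/@fo:font-weight='bold']"
         use="@style:name"/>

<!-- Would replace the text:span template rule: match any span whose
     style name is one of the indexed bold styles. -->
<xsl:template match="text:span[key('bold-styles', @text:style-name)]">
  <b><xsl:apply-templates/></b>
</xsl:template>
```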

The stylesheet turned my complete sample content.xml file into this (for once, I've removed whitespace for greater readability):

<html xmlns:draw="http://openoffice.org/2000/drawing"
      xmlns:text="http://openoffice.org/2000/text"
      xmlns:presentation="http://openoffice.org/2000/presentation">
  <head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8">
      <title>
      Speaker Notes</title>
      <style>
        <!--   p {font-size: 10pt}
             h1 {font-family: arial;
             font-size: 12pt; font-weight: bold}   -->
      </style>
  </head>
  <body>
    <h1>slide 1: Title of slide 1</h1>
    <p>First par of notes for slide 1.</p>
    <p></p>
    <p>Second par of notes. <b>Bolded</b> text right there.</p>
    <p></p>
    <p>End of first test.</p>

    <h1>slide 2: Title of slide 2</h1>
    <p>Here are the notes for slide 2.</p>
    <h1>slide 3: Title of slide 3</h1>
    <p>Notes for slide 3 right here.</p>
  </body>
</html>

The empty p elements between the paragraphs of the notes for the first slide are not something I like to see; but remember that everything you've seen till now is the result of a quick hack, and it worked remarkably well.

Speaker Notes from PowerPoint

It worked so well, in fact, that I decided to really push my luck. OpenOffice applications can read binary files created by their Microsoft Office equivalents, so I found a binary PowerPoint file that I had created for an internal presentation at my place of employment with no thought of eventual XML conversion. I read it into OpenOffice Impress, saved it as an sxi file, unzipped that, and ran the stylesheet above with the content.xml file that was in this new sxi file.

    

The resulting speaker notes file came out beautifully. I never even had to go back and tweak the stylesheet. (If I had, I would have gotten rid of those empty p elements, perhaps by wrapping the third template rule's p element with an xsl:if element that has a test attribute of " . != '' ", which would have only added the node to the result tree if its contents were not equal to the empty string. The stylesheet may need additional tweaks to work well in all cases.) Then I thought "that was neat, I should write a column about it. But can I really write a whole column about such a simple little stylesheet?"
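
The xsl:if idea in the parenthetical above might look like the following sketch, a replacement for the third template rule that relies on the namespace declarations of the stylesheet shown earlier. It hasn't been tried against every kind of Impress file.

```xml
<xsl:template match="text:p">
  <!-- Add a p element to the result only when the paragraph has
       some text; a text:p holding just an empty text:span also
       has an empty string value and is skipped. -->
  <xsl:if test=". != ''">
    <p><xsl:apply-templates/></p>
  </xsl:if>
</xsl:template>
```

Changing the test to "normalize-space(.) != ''" would also skip paragraphs that contain nothing but white space.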

The point, though, is not about some special trick demonstrated by the stylesheet itself, but about the increasing amount of useful work that we can do with plain old 1999 XSLT 1.0, especially as more and more of the valuable data around us becomes available in XML. OpenOffice's ability to read PowerPoint, Word, and Excel binary files and to then save them as XML means that XSLT developers can take advantage of vast new sets of data. You can also create documents using OpenOffice instead of Word, Excel, or PowerPoint, and add XSLT stylesheets to your workflow to let coworkers do new things with their documents, spreadsheets, and presentations. Of course, soon you'll be able to do this with data saved natively from Microsoft Office 2003 -- if you want to spend the money on it, if you're running Windows 2000 with Service Pack 3 or later or Windows XP or later, and if you aren't looking for XML versions of PowerPoint presentations, like I was.

Microsoft Office and OpenOffice aren't the only applications making rich data available as XML. As you hear about more kinds of data becoming available in XML, ask yourself: if I know XSLT, what can I do with this data that I couldn't do before? I'm sure the answer will have many pleasant surprises.