
Opening Open Formats with XSLT
This month I'm taking a break from covering XSLT 2.0 to describe how the combination of XSLT 1.0 and an application with an open XML format solved a problem for me. I solved this problem so quickly and easily that it got me thinking about how XSLT 1.0 and the increasing number of open XML formats are opening up a world of simple, valuable new applications and utilities for us to write.
At XML 2003 I was thinking about presentations, slides, public speaking, and speaker notes. I've used my own little DTD (now a RELAX NG Compact schema) for years to create slides and slide notes, and I've written an XSLT stylesheet to convert them to HTML slide files. I've written another stylesheet that provides a feature I find sorely missing from all slideshow packages: it pulls only the slide titles and slide comments into a single HTML document that I can print and hold in my hand when speaking. The only formatting I need in these slide comments is paragraph breaks and in-line bolding, because bolding key phrases in the notes makes it easier to glance briefly at them and find the important points when I'm giving a talk.
In Jon Udell's keynote speech at the conference, he mentioned that the only Microsoft Office 2003 application that would lack an XML output option was the mail program, but he forgot another one, PowerPoint. I was looking forward to some sort of Save As XML feature in PowerPoint so that I could create the kind of speaker notes that I like from XML versions of PowerPoint presentations, and it looks like I won't get this ability for a while, at least not directly from Microsoft.
I decided to try it with OpenOffice, the free, open source, multi-platform office application suite. Once I saw the XML that its slide presentation program created, it took me less time to write a stylesheet that did exactly what I wanted than it took to download OpenOffice over the conference hotel's T1 line.
Looking at OpenOffice's XML
After installing OpenOffice, I didn't bother with any tutorials or user manuals. I just started up Impress, its slide presentation program, wrote a few slides and some notes to accompany them, and saved the file. The saved file's extension was "sxi," but I had heard that these files were zip archives, so I unzipped mine and found a file called content.xml. (Instead of diving into the zipped XML head-first the way I did, you might want to first check out The OpenOffice XML File Format page and the XML FAQ that it links to for a little background on XML's role in the application.)
The content.xml file had no carriage returns. Instead of
looking for the DTD, which you can easily find from the "XML File Format"
page listed above, I just indented the elements to see their nesting
structure (in Emacs, this is Ctrl-Alt-\ for most major modes, including
James Clark's nxml, which I heartily recommend; a very short stylesheet also does
the same thing). In the indented version, it was clear that the document
had a lot of header information, so I searched for the first text I had
entered into the first slide and found it in the first draw:page
element of the document's office:body element, shown here with
white space added:
<draw:page draw:name="page1" draw:style-name="dp1"
    draw:id="1" draw:master-page-name="Default"
    presentation:presentation-page-layout-name="AL1T0">
  <draw:text-box presentation:style-name="pr1"
      draw:text-style-name="P1" draw:layer="layout"
      svg:width="23.912cm" svg:height="3.508cm"
      svg:x="2.058cm" svg:y="1.743cm"
      presentation:class="title">
    <text:p text:style-name="P1">Title of slide 1
    </text:p>
  </draw:text-box>
  <draw:text-box presentation:style-name="pr2"
      draw:text-style-name="P1" draw:layer="layout"
      svg:width="23.912cm" svg:height="13.231cm"
      svg:x="2.058cm" svg:y="5.838cm"
      presentation:class="subtitle">
    <text:p text:style-name="P1">Text of slide 1
    </text:p>
  </draw:text-box>
  <presentation:notes>
    <draw:page-thumbnail draw:style-name="gr1"
        draw:layer="layout" svg:width="12.768cm"
        svg:height="9.576cm" svg:x="4.411cm"
        svg:y="2.794cm" draw:page-number="1"
        presentation:class="page"/>
    <draw:text-box presentation:style-name="pr3"
        draw:text-style-name="P2" draw:layer="layout"
        svg:width="15.021cm" svg:height="10.63cm"
        svg:x="3.292cm" svg:y="13.299cm"
        presentation:class="notes">
      <text:p text:style-name="P2">First par of notes for slide 1.
      </text:p>
      <text:p text:style-name="P2"/>
      <text:p text:style-name="P2">Second par of notes.
        <text:span text:style-name="T1">Bolded
        </text:span>
        <text:span text:style-name="T2"> text right there.
        </text:span>
      </text:p>
      <text:p text:style-name="P2">
        <text:span text:style-name="T2"/>
      </text:p>
      <text:p text:style-name="P2">
        <text:span text:style-name="T2">End of first test.
        </text:span>
      </text:p>
    </draw:text-box>
  </presentation:notes>
</draw:page>
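The "very short stylesheet" for indenting mentioned before the sample is not shown in this column. A minimal sketch of one, assuming an XSLT 1.0 processor that honors indent="yes" (the spec allows a processor to ignore it), is an identity transform with indented output:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Ask the serializer to add line breaks and indentation. -->
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity rule: copy every attribute and node through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because indent="yes" lets the processor add white space inside mixed content such as the text:p elements, the result is best treated as a reading aid rather than as a file to load back into OpenOffice.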
I made some guesses about where a stylesheet would consistently find the titles and notes for each slide and wrote the following stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:draw="http://openoffice.org/2000/drawing"
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:presentation="http://openoffice.org/2000/presentation"
    version="1.0">
  <xsl:template match="/">
    <html><head><title>Speaker Notes</title>
      <style>
        <xsl:comment>
          p {font-size: 10pt}
          h1 {font-family: arial; font-size: 12pt; font-weight: bold}
        </xsl:comment>
      </style>
    </head>
    <body>
      <xsl:apply-templates/>
    </body>
    </html>
  </xsl:template>
  <xsl:template match="draw:page">
    <h1>slide <xsl:number/>:
      <xsl:value-of select="draw:text-box[1]/text:p[1]"/></h1>
    <xsl:apply-templates select="presentation:notes"/>
  </xsl:template>
  <xsl:template match="text:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>
  <xsl:template match="text:span[@text:style-name='T1']">
    <b><xsl:apply-templates/></b>
  </xsl:template>
</xsl:stylesheet>
The first template rule, which fires upon seeing the source
document's root, does some basic HTML setup and calls
xsl:apply-templates for the rest of the source tree, surrounding
the result in a body element. The elements before the first
draw:page element don't seem to have any PCDATA between their
tags, so the XSLT default template rules don't add anything to the result
tree for them. For the template rule to handle the draw:page
elements, each of which corresponds to a slide, I guessed that the first
text:p element of the first draw:text-box was the
slide's title, so I added that to the result tree inside of an HTML
h1 header element, with a prefix showing the slide's position
in the presentation. The template rule for the
draw:page element finishes by calling
xsl:apply-templates on only the
presentation:notes elements, skipping the slide's contents,
because the point of the result document is to be just slide titles and
corresponding notes.
The template rule for the text:p element converts
each one to an HTML p element. Based on the markup I found around
the word "Bolded" that I bolded in my sample slide show, I guessed that
OpenOffice rendered bold text with a start-tag of <text:span
text:style-name="T1">, so I converted that and its corresponding
end-tag to an HTML b element.
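Automatic style names such as T1 are generated by the application, so nothing guarantees that T1 means bold in another file. A less guess-dependent rule could look the style up instead of hard-coding its name. The sketch below assumes, without having been checked against the DTD, that content.xml declares each automatic style as a style:style element whose style:properties child carries fo:font-weight="bold", and that the style and fo prefixes map to the namespace URIs shown. Treat all of these as assumptions to verify against your own content.xml.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:text="http://openoffice.org/2000/text"
                xmlns:style="http://openoffice.org/2000/style"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                version="1.0">
  <!-- Assumption: the style and fo namespace URIs above match the ones
       declared on the root element of content.xml. -->

  <!-- Index every style definition that sets a bold font weight,
       keyed by its style name (assumed element and attribute names). -->
  <xsl:key name="bold-style"
           match="style:style[style:properties/@fo:font-weight='bold']"
           use="@style:name"/>

  <!-- A span is bold if its style name is found in that index. -->
  <xsl:template match="text:span[key('bold-style', @text:style-name)]">
    <b><xsl:apply-templates/></b>
  </xsl:template>

  <!-- The other template rules from the original stylesheet would stay as they are. -->

</xsl:stylesheet>
```

The output shown next comes from the original stylesheet with its T1 rule, not from this variant.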
The stylesheet turned my complete sample content.xml file into this (for once, I've removed whitespace for greater readability):
<html xmlns:draw="http://openoffice.org/2000/drawing"
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:presentation="http://openoffice.org/2000/presentation">
  <head>
    <meta http-equiv="Content-Type"
        content="text/html; charset=utf-8">
    <title>
    Speaker Notes</title>
    <style>
      <!-- p {font-size: 10pt}
      h1 {font-family: arial;
      font-size: 12pt; font-weight: bold} -->
    </style>
  </head>
  <body>
    <h1>slide 1: Title of slide 1</h1>
    <p>First par of notes for slide 1.</p>
    <p></p>
    <p>Second par of notes. <b>Bolded</b> text right there.</p>
    <p></p>
    <p>End of first test.</p>
    <h1>slide 2: Title of slide 2</h1>
    <p>Here are the notes for slide 2.</p>
    <h1>slide 3: Title of slide 3</h1>
    <p>Notes for slide 3 right here.</p>
  </body>
</html>
The empty p elements between the paragraphs of the
notes for the first slide are not something I like to see; but remember
that everything you've seen till now is the result of a quick hack, and it
worked remarkably well.
Speaker Notes from PowerPoint
It worked so well, in fact, that I decided to really push my luck. OpenOffice applications can read binary files created by their Microsoft Office equivalents, so I found a binary PowerPoint file that I had created for an internal presentation at my place of employment with no thought of eventual XML conversion. I read it into OpenOffice Impress, saved it as an sxi file, unzipped that, and ran the stylesheet above with the content.xml file that was in this new sxi file.
The resulting speaker notes file came out beautifully. I
never even had to go back and tweak the stylesheet. (If I had, I would
have gotten rid of those empty p elements, perhaps by wrapping
the third template rule's p element with an xsl:if
element that has a test attribute of
" . != '' ", which would have only added the
node to the result tree if its contents were not equal to the empty
string. The stylesheet may need additional tweaks to work well in all
cases.) Then I thought "that was neat, I should write a column about
it. But can I really write a whole column about such a simple little
stylesheet?"
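The xsl:if tweak described in that parenthetical could be sketched as follows. This is an illustrative revision of the third template rule, not something the stylesheet above contains; the xsl:stylesheet wrapper is included only so that the text prefix is declared.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:text="http://openoffice.org/2000/text"
                version="1.0">

  <!-- The other template rules from the original stylesheet would stay as they are. -->

  <!-- Revised third rule: add a p element to the result tree only when
       the paragraph's string value is not the empty string. -->
  <xsl:template match="text:p">
    <xsl:if test=". != ''">
      <p><xsl:apply-templates/></p>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

A paragraph that holds only an empty text:span also has an empty string value, so it is dropped as well. Testing normalize-space(.) != '' instead would additionally drop paragraphs that contain nothing but white space.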
The point, though, is not about some special trick demonstrated by the stylesheet itself, but about the increasing amount of useful work that we can do with plain old 1999 XSLT 1.0, especially as more and more of the valuable data around us becomes available in XML. OpenOffice's ability to read PowerPoint, Word, and Excel binary files and to then save them as XML means that XSLT developers can take advantage of vast new sets of data. You can also create documents using OpenOffice instead of Word, Excel, or PowerPoint, and add XSLT stylesheets to your workflow to let coworkers do new things with their documents, spreadsheets, and presentations. Of course, soon you'll be able to do this with data saved natively from Microsoft Office 2003 -- if you want to spend the money on it, if you're running Windows 2000 with Service Pack 3 or later or Windows XP or later, and if you aren't looking for XML versions of PowerPoint presentations, like I was.
Microsoft Office and OpenOffice aren't the only applications making rich data available as XML. As you hear about more kinds of data becoming available in XML, ask yourself: if I know XSLT, what can I do with this data that I couldn't do before? I'm sure the answer will have many pleasant surprises.