Moving to OpenOffice: Batch Converting Legacy Documents
January 11, 2006
Most people know that you can use OpenOffice 2.0, the open source alternative to Microsoft Office, to open up a Microsoft Office file and then save it in OpenOffice's native format—a zipped file that includes the document's contents as an XML file conforming to the OASIS OpenDocument standard. This works for Word documents, Excel spreadsheets, and PowerPoint presentations. (See the XML.com article Opening Open Formats with XSLT for an example of the kind of simple new application that this makes possible: a stylesheet that extracts the slide titles and notes from a slideshow file and stores them in a single file that you can use as speaker notes when giving a presentation.) After opening one of these file types, you can even export them to an Acrobat PDF file.
Like a lot of single-file conversions, this involves opening up and filling out various dialog boxes. If you have hundreds of files to convert, this isn't very practical. What if you want to load XML versions of a large collection of Word files, Excel spreadsheets, or PowerPoint files into an XML-aware database where you can query the collection? What if you're the State of Massachusetts or IBM?
Like its Microsoft counterpart, OpenOffice has a macro language. You can start up
OpenOffice from the Linux or Windows command-line prompt with instructions to to run
particular macro, and you can even pass a filename as a parameter to that macro. Adding
-invisible switch to the command line tells OpenOffice to start up without
the graphical user interface (GUI). Put all these together, and you've got a command
that converts a Microsoft Office file to an OpenOffice file (or an Acrobat file) with
of the GUI. To convert a hundred files, you can use a Perl script or other scripting
language to create a batch file or shell script that has the hundred commands necessary
convert those files.
(Note: the macro code that I found that inspired this was clearly written for OpenOffice 1.x, but I put all this together and tested it using OpenOffice 2.0 for Windows and for Linux.)
Creating the Conversion Macros
All the macro subroutines you need are shown below. To create a macro module in OpenOffice to store these macros, pick Macros from the OpenOffice Tools menu and then pick Organize Macros, OpenOffice.org Basic, Organizer, and New to create a new module. Name your module MyConversions and click the Close button.
The new Module will show up as /My Macros/Standard/MyConversions in the macro module tree listing, as shown here:
Select your MyConversions module and then click the Edit button. In the script
that comes up, you'll find the lines
Sub Main and
End Main to get
you started. Replace them with the code shown here and then save your work:
' Based on code from http://www.oooforum.org/forum/viewtopic.phtml?t=3772 ' Save document as an Acrobat PDF file. Sub SaveAsPDF( cFile ) cURL = ConvertToURL( cFile ) ' Open the document. Just blindly assume that the document ' is of a type that OOo will correctly recognize and open ' without specifying an import filter. oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, _ Array(MakePropertyValue( "Hidden", True ),)) cFile = Left( cFile, Len( cFile ) - 4 ) + ".pdf" cURL = ConvertToURL( cFile ) ' Save the document using a filter. oDoc.storeToURL( cURL, Array(_ MakePropertyValue( "FilterName", "writer_pdf_Export" ),) oDoc.close( True ) End Sub ' Save document as a Microsoft Word file. Sub SaveAsDoc( cFile ) ' mostly a copy of SaveAsPDF cURL = ConvertToURL( cFile ) oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, (_ Array(MakePropertyValue( "Hidden", True ),)) cFile = Left( cFile, Len( cFile ) - 4 ) + ".doc" cURL = ConvertToURL( cFile ) oDoc.storeToURL( cURL, Array(_ MakePropertyValue( "FilterName", "MS WinWord 6.0" ),) oDoc.close( True ) End Sub ' Save document as an OpenOffice 2 file. Sub SaveAsOOO( cFile ) ' mostly a copy of SaveAsPDF. Save as an OpenOffice file. cURL = ConvertToURL( cFile ) oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, _ Array(MakePropertyValue( "Hidden", True ),)) ' Set output file extension based on lower-case ' version of input extension. Select Case LCase(Right(cFile,3)) Case "ppt" ' PowerPoint file. cFileExt = "odp" Case "doc" ' Word file. cFileExt = "odt" Case "xls" ' Excel file. cFileExt = "ods" Case Else cFileExt = "xxx" End Select cFile = Left( cFile, Len( cFile ) - 3 ) + cFileExt cURL = ConvertToURL( cFile ) oDoc.storeAsURL( cURL, Array() ) oDoc.close( True ) End Sub Function MakePropertyValue( Optional cName As String, Optional uValue ) _ As com.sun.star.beans.PropertyValue Dim oPropertyValue As New com.sun.star.beans.PropertyValue If Not IsMissing( cName ) Then oPropertyValue.Name = cName EndIf If Not IsMissing( uValue ) Then oPropertyValue.Value = uValue EndIf MakePropertyValue() = oPropertyValue End Function
OpenOffice Basic is a variation on Microsoft Office Basic, which is a variation on
Basic, which is a variation on the original BASIC that those of us of a certain age
as our first programming language. I never got too far into any of them, but it's
enough to look up the syntax that you need with a few Google searches. For example,
I had no
idea how to do the case statement in the
SaveAsOOO subroutine, but between the
online help and Google, I found what I needed.
The code defines three subroutines and a function. The
MakePropertyValue function come from a 2003 OpenOffice forum
SaveAsDoc subroutine is almost a direct copy of
SaveAsPDF, with a different parameter passed to
MakePropertyValue to identify the filter name. If you want to add subroutines
to save an OpenOffice spreadsheet as an Excel file or an OpenOffice Impress slideshow
PowerPoint file, you could copy these routines and change that parameter and the extension
of the saved file.
The great thing about the
SaveAsOOO subroutine is its versatility. You don't
have to tell it whether you're reading a Word file, an Excel file, or a PowerPoint
tell it the appropriate output format. It just knows. As you can see, I did code a
logic to ensure that the output file has the right file extension, but the rest is
close to the
Running the macro from a shell prompt should work whether you leave OpenOffice open or quit out of it first. The following shows the basic command line for converting a Word file to OpenOffice on a Windows computer, split onto two lines to fit here:
"C:\Program Files\OpenOffice.org 2.0\program\soffice" -invisible macro:///Standard.MyConversions.SaveAsOOO(c:\temp\sample.doc)
I don't have the soffice.exe executable in my path, so I had to include the full
path to it enclosed in quotes because of the space in the Program Files directory
-invisible switch tells OpenOffice not to bother with the startup
screen, a default document, or any of the GUI. (Try starting up soffice.exe from
the command line with a single parameter of
-? to see a list of interesting
options.) The macro is named in a URL-like format, with the path down the macro tree
structure to the macro to be run, and the file to be converted is included in parentheses
a parameter to the macro. There's no need to provide an output file name, because
infers it from the input filename and the requested action.
Because the macro code adds
http:// as a prefix to turn the input filename
into a URL, you must include the complete path to it, as shown above, or you'll get
error message "URL seems to be an unsupported one."
The linux version of the command line (again, split here) needs to use a different
name. The OpenOffice installation on my Ubuntu distribution put the
binary in my path, so I didn't have to say where it was when starting it. I did enclose
call to the macro in quotes, because otherwise the parentheses confused the shell.
Otherwise, the exact same macros installed with the procedure described above worked
ooffice2 -invisible "macro:///Standard.MyConversions.SaveAsOOO(/home/bob/temp/sample.doc)"
I tried converting several different files. The sample.doc file is a test file I've
kept around for a few years to test the mettle of any program or service that claims
convert Word files to XML. It uses built-in and newly-created block and newline styles,
nested bulleted lists, a BMP file, a table with spanning cells, an embedded spreadsheet,
a few other things that can throw off a conversion program.
SaveAsOOO did fine
Go Forth and Convert MS Office Files
Now that you've got a free, multi-platform tool that can convert new and old (well, at least as old as Office 97) MS Office files to an open XML standard, how can you best put it to good use? Anything that can be run from a command line can be used in an unattended, "lights out" workflow. A Perl script can take a list of filenames and create a batch file or shell script with a series of commands like those shown above to convert those files. If the raw XML is really what you're after, a script can also pull that XML out of the OpenOffice zip file and rename it to correspond with the input file, like in this shell script:
# Remember to include full path with # filename for $1 and to omit extension ooffice2 -invisible "macro:///Standard.MyConversions.SaveAsOOO($1.doc)" unzip -o $1.odt content.xml cp content.xml $1.xml
Windows batch file version:
REM Remember to include full path with REM filename for %1 and to omit extension set OooExe="C:\Program Files\OpenOffice.org 2.0\program\soffice" %OOOExe% -invisible macro:///Standard.MyConversions.SaveAsOOO(%1.doc) unzip -o %1.odt content.xml copy content.xml %1.xml
If you're going to make high volume conversion part of an ongoing daily workflow,
restarting of OpenOffice for every conversion will slow you down. In Windows, starting
soffice.exe in quickstart mode (with the
-quickstart switch on the
command line) before doing your conversions should make those conversions go faster.
To go a
few steps further, the
-accept switch specifies a Universal
Network Objects string that lets you communicate with the running OpenOffice process
via an API from a program written in C++, OpenOffice Basic, Python, Java, or other languages and
pass input documents to your OpenOffice process using API calls.
To me, the exciting part about this is not the ability to convert new Word or Excel files that people send me to OpenOffice XML, but the ability to convert old files. How many old Microsoft Office files do you have access to? What new applications would be possible if you could unlock the information in them by converting those files to a well-documented XML format and then using XML tools to mine that information? Considering that we can do all this with free software that runs on both Windows and Linux, there should be huge new opportunities to explore.