Moving to OpenOffice: Batch Converting Legacy Documents

January 11, 2006

Bob DuCharme

Most people know that you can use OpenOffice 2.0, the open source alternative to Microsoft Office, to open up a Microsoft Office file and then save it in OpenOffice's native format—a zipped file that includes the document's contents as an XML file conforming to the OASIS OpenDocument standard. This works for Word documents, Excel spreadsheets, and PowerPoint presentations. (See the article Opening Open Formats with XSLT for an example of the kind of simple new application that this makes possible: a stylesheet that extracts the slide titles and notes from a slideshow file and stores them in a single file that you can use as speaker notes when giving a presentation.) After opening one of these file types, you can even export them to an Acrobat PDF file.

Like a lot of single-file conversions, this involves opening up and filling out various dialog boxes. If you have hundreds of files to convert, this isn't very practical. What if you want to load XML versions of a large collection of Word files, Excel spreadsheets, or PowerPoint files into an XML-aware database where you can query the collection? What if you're the State of Massachusetts or IBM?

Like its Microsoft counterpart, OpenOffice has a macro language. You can start up OpenOffice from the Linux or Windows command-line prompt with instructions to to run a particular macro, and you can even pass a filename as a parameter to that macro. Adding the -invisible switch to the command line tells OpenOffice to start up without the graphical user interface (GUI). Put all these together, and you've got a command line that converts a Microsoft Office file to an OpenOffice file (or an Acrobat file) with no use of the GUI. To convert a hundred files, you can use a Perl script or other scripting language to create a batch file or shell script that has the hundred commands necessary to convert those files.

(Note: the macro code that I found that inspired this was clearly written for OpenOffice 1.x, but I put all this together and tested it using OpenOffice 2.0 for Windows and for Linux.)

Creating the Conversion Macros

All the macro subroutines you need are shown below. To create a macro module in OpenOffice to store these macros, pick Macros from the OpenOffice Tools menu and then pick Organize Macros, Basic, Organizer, and New to create a new module. Name your module MyConversions and click the Close button.

The new Module will show up as /My Macros/Standard/MyConversions in the macro module tree listing, as shown here:

OpenOffice Macro module dialog box

Select your MyConversions module and then click the Edit button. In the script that comes up, you'll find the lines Sub Main and End Main to get you started. Replace them with the code shown here and then save your work:

' Based on code from

' Save document as an Acrobat PDF file.
Sub SaveAsPDF( cFile )
   cURL = ConvertToURL( cFile )
   ' Open the document. Just blindly assume that the document 
   ' is of a type that OOo will correctly recognize and open 
   ' without specifying an import filter.
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, _
            Array(MakePropertyValue( "Hidden", True ),))

   cFile = Left( cFile, Len( cFile ) - 4 ) + ".pdf"
   cURL = ConvertToURL( cFile )
   ' Save the document using a filter.   
   oDoc.storeToURL( cURL, Array(_
            MakePropertyValue( "FilterName", "writer_pdf_Export" ),)
   oDoc.close( True )
End Sub

' Save document as a Microsoft Word file. 
Sub SaveAsDoc( cFile ) 
   ' mostly a copy of SaveAsPDF
   cURL = ConvertToURL( cFile )
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, (_
            Array(MakePropertyValue( "Hidden", True ),))

   cFile = Left( cFile, Len( cFile ) - 4 ) + ".doc"
   cURL = ConvertToURL( cFile )
   oDoc.storeToURL( cURL, Array(_
            MakePropertyValue( "FilterName", "MS WinWord 6.0" ),)
   oDoc.close( True )

End Sub

' Save document as an OpenOffice 2 file. 
Sub SaveAsOOO( cFile ) 
   ' mostly a copy of SaveAsPDF. Save as an OpenOffice file. 
   cURL = ConvertToURL( cFile )
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, _
            Array(MakePropertyValue( "Hidden", True ),))

   ' Set output file extension based on lower-case 
   ' version of input extension.
   Select Case LCase(Right(cFile,3))
     Case "ppt"         ' PowerPoint file.
       cFileExt = "odp"
     Case "doc"         ' Word file.
       cFileExt = "odt"
     Case "xls"         ' Excel file.
       cFileExt = "ods"
     Case Else
       cFileExt = "xxx"
    End Select
   cFile = Left( cFile, Len( cFile ) - 3 ) + cFileExt
   cURL = ConvertToURL( cFile )
   oDoc.storeAsURL( cURL, Array() )
   oDoc.close( True )

End Sub

Function MakePropertyValue( Optional cName As String, Optional uValue ) _
   Dim oPropertyValue As New
   If Not IsMissing( cName ) Then
      oPropertyValue.Name = cName
   If Not IsMissing( uValue ) Then
      oPropertyValue.Value = uValue
   MakePropertyValue() = oPropertyValue
End Function

OpenOffice Basic is a variation on Microsoft Office Basic, which is a variation on Visual Basic, which is a variation on the original BASIC that those of us of a certain age learned as our first programming language. I never got too far into any of them, but it's easy enough to look up the syntax that you need with a few Google searches. For example, I had no idea how to do the case statement in the SaveAsOOO subroutine, but between the online help and Google, I found what I needed.

The code defines three subroutines and a function. The SaveAsPDF subroutine and the MakePropertyValue function come from a 2003 OpenOffice forum posting. The SaveAsDoc subroutine is almost a direct copy of SaveAsPDF, with a different parameter passed to MakePropertyValue to identify the filter name. If you want to add subroutines to save an OpenOffice spreadsheet as an Excel file or an OpenOffice Impress slideshow as a PowerPoint file, you could copy these routines and change that parameter and the extension of the saved file.

The great thing about the SaveAsOOO subroutine is its versatility. You don't have to tell it whether you're reading a Word file, an Excel file, or a PowerPoint file, or tell it the appropriate output format. It just knows. As you can see, I did code a little logic to ensure that the output file has the right file extension, but the rest is pretty close to the SaveAsPDF subroutine.

Running It

Running the macro from a shell prompt should work whether you leave OpenOffice open or quit out of it first. The following shows the basic command line for converting a Word file to OpenOffice on a Windows computer, split onto two lines to fit here:

"C:\Program Files\ 2.0\program\soffice" 
  -invisible macro:///Standard.MyConversions.SaveAsOOO(c:\temp\sample.doc)

I don't have the soffice.exe executable in my path, so I had to include the full path to it enclosed in quotes because of the space in the Program Files directory name. The -invisible switch tells OpenOffice not to bother with the startup screen, a default document, or any of the GUI. (Try starting up soffice.exe from the command line with a single parameter of -? to see a list of interesting options.) The macro is named in a URL-like format, with the path down the macro tree structure to the macro to be run, and the file to be converted is included in parentheses as a parameter to the macro. There's no need to provide an output file name, because the macro infers it from the input filename and the requested action.

Because the macro code adds http:// as a prefix to turn the input filename into a URL, you must include the complete path to it, as shown above, or you'll get the error message "URL seems to be an unsupported one."

The linux version of the command line (again, split here) needs to use a different binary name. The OpenOffice installation on my Ubuntu distribution put the ooffice2 binary in my path, so I didn't have to say where it was when starting it. I did enclose the call to the macro in quotes, because otherwise the parentheses confused the shell. Otherwise, the exact same macros installed with the procedure described above worked perfectly:

ooffice2 -invisible 

I tried converting several different files. The sample.doc file is a test file I've kept around for a few years to test the mettle of any program or service that claims to convert Word files to XML. It uses built-in and newly-created block and newline styles, nested bulleted lists, a BMP file, a table with spanning cells, an embedded spreadsheet, and a few other things that can throw off a conversion program. SaveAsOOO did fine with it.

Go Forth and Convert MS Office Files

Now that you've got a free, multi-platform tool that can convert new and old (well, at least as old as Office 97) MS Office files to an open XML standard, how can you best put it to good use? Anything that can be run from a command line can be used in an unattended, "lights out" workflow. A Perl script can take a list of filenames and create a batch file or shell script with a series of commands like those shown above to convert those files. If the raw XML is really what you're after, a script can also pull that XML out of the OpenOffice zip file and rename it to correspond with the input file, like in this shell script:

# Remember to include full path with 
# filename for $1 and to omit extension
ooffice2 -invisible  "macro:///Standard.MyConversions.SaveAsOOO($1.doc)"
unzip -o $1.odt content.xml
cp content.xml $1.xml

Windows batch file version:

REM Remember to include full path with 
REM filename for %1 and to omit extension
set OooExe="C:\Program Files\ 2.0\program\soffice"
%OOOExe%  -invisible macro:///Standard.MyConversions.SaveAsOOO(%1.doc)
unzip -o %1.odt content.xml
copy content.xml %1.xml

If you're going to make high volume conversion part of an ongoing daily workflow, this restarting of OpenOffice for every conversion will slow you down. In Windows, starting up soffice.exe in quickstart mode (with the -quickstart switch on the command line) before doing your conversions should make those conversions go faster. To go a few steps further, the -accept switch specifies a Universal Network Objects string that lets you communicate with the running OpenOffice process via an API from a program written in C++, OpenOffice Basic, Python, Java, or other languages and pass input documents to your OpenOffice process using API calls.

To me, the exciting part about this is not the ability to convert new Word or Excel files that people send me to OpenOffice XML, but the ability to convert old files. How many old Microsoft Office files do you have access to? What new applications would be possible if you could unlock the information in them by converting those files to a well-documented XML format and then using XML tools to mine that information? Considering that we can do all this with free software that runs on both Windows and Linux, there should be huge new opportunities to explore.