XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.


Moving to OpenOffice: Batch Converting Legacy Documents

January 11, 2006

Most people know that you can use OpenOffice 2.0, the open source alternative to Microsoft Office, to open up a Microsoft Office file and then save it in OpenOffice's native format—a zipped file that includes the document's contents as an XML file conforming to the OASIS OpenDocument standard. This works for Word documents, Excel spreadsheets, and PowerPoint presentations. (See the XML.com article Opening Open Formats with XSLT for an example of the kind of simple new application that this makes possible: a stylesheet that extracts the slide titles and notes from a slideshow file and stores them in a single file that you can use as speaker notes when giving a presentation.) After opening one of these file types, you can even export them to an Acrobat PDF file.

Like a lot of single-file conversions, this involves opening up and filling out various dialog boxes. If you have hundreds of files to convert, this isn't very practical. What if you want to load XML versions of a large collection of Word files, Excel spreadsheets, or PowerPoint files into an XML-aware database where you can query the collection? What if you're the State of Massachusetts or IBM?

Like its Microsoft counterpart, OpenOffice has a macro language. You can start up OpenOffice from the Linux or Windows command-line prompt with instructions to to run a particular macro, and you can even pass a filename as a parameter to that macro. Adding the -invisible switch to the command line tells OpenOffice to start up without the graphical user interface (GUI). Put all these together, and you've got a command line that converts a Microsoft Office file to an OpenOffice file (or an Acrobat file) with no use of the GUI. To convert a hundred files, you can use a Perl script or other scripting language to create a batch file or shell script that has the hundred commands necessary to convert those files.

(Note: the macro code that I found that inspired this was clearly written for OpenOffice 1.x, but I put all this together and tested it using OpenOffice 2.0 for Windows and for Linux.)

Creating the Conversion Macros

All the macro subroutines you need are shown below. To create a macro module in OpenOffice to store these macros, pick Macros from the OpenOffice Tools menu and then pick Organize Macros, OpenOffice.org Basic, Organizer, and New to create a new module. Name your module MyConversions and click the Close button.

The new Module will show up as /My Macros/Standard/MyConversions in the macro module tree listing, as shown here:

OpenOffice Macro module dialog box

Select your MyConversions module and then click the Edit button. In the script that comes up, you'll find the lines Sub Main and End Main to get you started. Replace them with the code shown here and then save your work:

' Based on code from http://www.oooforum.org/forum/viewtopic.phtml?t=3772

' Save document as an Acrobat PDF file.
Sub SaveAsPDF( cFile )
   cURL = ConvertToURL( cFile )
   ' Open the document. Just blindly assume that the document 
   ' is of a type that OOo will correctly recognize and open 
   ' without specifying an import filter.
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, _
            Array(MakePropertyValue( "Hidden", True ),))

   cFile = Left( cFile, Len( cFile ) - 4 ) + ".pdf"
   cURL = ConvertToURL( cFile )
   ' Save the document using a filter.   
   oDoc.storeToURL( cURL, Array(_
            MakePropertyValue( "FilterName", "writer_pdf_Export" ),)
   oDoc.close( True )
End Sub

' Save document as a Microsoft Word file. 
Sub SaveAsDoc( cFile ) 
   ' mostly a copy of SaveAsPDF
   cURL = ConvertToURL( cFile )
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, (_
            Array(MakePropertyValue( "Hidden", True ),))

   cFile = Left( cFile, Len( cFile ) - 4 ) + ".doc"
   cURL = ConvertToURL( cFile )
   oDoc.storeToURL( cURL, Array(_
            MakePropertyValue( "FilterName", "MS WinWord 6.0" ),)
   oDoc.close( True )

End Sub

' Save document as an OpenOffice 2 file. 
Sub SaveAsOOO( cFile ) 
   ' mostly a copy of SaveAsPDF. Save as an OpenOffice file. 
   cURL = ConvertToURL( cFile )
   oDoc = StarDesktop.loadComponentFromURL( cURL, "_blank", 0, _
            Array(MakePropertyValue( "Hidden", True ),))

   ' Set output file extension based on lower-case 
   ' version of input extension.
   Select Case LCase(Right(cFile,3))
     Case "ppt"         ' PowerPoint file.
       cFileExt = "odp"
     Case "doc"         ' Word file.
       cFileExt = "odt"
     Case "xls"         ' Excel file.
       cFileExt = "ods"
     Case Else
       cFileExt = "xxx"
    End Select
   cFile = Left( cFile, Len( cFile ) - 3 ) + cFileExt
   cURL = ConvertToURL( cFile )
   oDoc.storeAsURL( cURL, Array() )
   oDoc.close( True )

End Sub

Function MakePropertyValue( Optional cName As String, Optional uValue ) _
   As com.sun.star.beans.PropertyValue
   Dim oPropertyValue As New com.sun.star.beans.PropertyValue
   If Not IsMissing( cName ) Then
      oPropertyValue.Name = cName
   If Not IsMissing( uValue ) Then
      oPropertyValue.Value = uValue
   MakePropertyValue() = oPropertyValue
End Function

OpenOffice Basic is a variation on Microsoft Office Basic, which is a variation on Visual Basic, which is a variation on the original BASIC that those of us of a certain age learned as our first programming language. I never got too far into any of them, but it's easy enough to look up the syntax that you need with a few Google searches. For example, I had no idea how to do the case statement in the SaveAsOOO subroutine, but between the online help and Google, I found what I needed.

The code defines three subroutines and a function. The SaveAsPDF subroutine and the MakePropertyValue function come from a 2003 OpenOffice forum posting. The SaveAsDoc subroutine is almost a direct copy of SaveAsPDF, with a different parameter passed to MakePropertyValue to identify the filter name. If you want to add subroutines to save an OpenOffice spreadsheet as an Excel file or an OpenOffice Impress slideshow as a PowerPoint file, you could copy these routines and change that parameter and the extension of the saved file.

The great thing about the SaveAsOOO subroutine is its versatility. You don't have to tell it whether you're reading a Word file, an Excel file, or a PowerPoint file, or tell it the appropriate output format. It just knows. As you can see, I did code a little logic to ensure that the output file has the right file extension, but the rest is pretty close to the SaveAsPDF subroutine.

Pages: 1, 2

Next Pagearrow