XML.com: XML From the Inside Out

Moving to OpenOffice: Batch Converting Legacy Documents
by Bob DuCharme

Running It

Running the macro from a shell prompt should work whether you leave OpenOffice open or quit out of it first. The following shows the basic command line for converting a Word file to OpenOffice format on a Windows computer, split onto two lines to fit here:


"C:\Program Files\OpenOffice.org 2.0\program\soffice" 
  -invisible macro:///Standard.MyConversions.SaveAsOOO(c:\temp\sample.doc)

I don't have the soffice.exe executable in my path, so I had to include the full path to it enclosed in quotes because of the space in the Program Files directory name. The -invisible switch tells OpenOffice not to bother with the startup screen, a default document, or any of the GUI. (Try starting up soffice.exe from the command line with a single parameter of -? to see a list of interesting options.) The macro is named in a URL-like format, with the path down the macro tree structure to the macro to be run, and the file to be converted is included in parentheses as a parameter to the macro. There's no need to provide an output file name, because the macro infers it from the input filename and the requested action.

Because the macro code adds file:/// as a prefix to turn the input filename into a URL, you must include the complete path to it, as shown above, or you'll get the error message "URL seems to be an unsupported one."
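Because the macro needs a complete path, a wrapper script can resolve a relative filename before handing it over. The following is only a minimal sketch for a POSIX shell; the function name abs_path is hypothetical and is not part of the macros described in this article, and it assumes the file's directory already exists.

```shell
#!/bin/sh
# abs_path: print an absolute version of the path given in $1.
# Hypothetical helper; assumes the file's directory exists and
# does not resolve symbolic links.
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;   # already a complete path
        *)  dir=$(CDPATH= cd "$(dirname "$1")" && pwd) || return 1
            [ "$dir" = "/" ] && dir=""   # avoid a doubled slash at the root
            printf '%s/%s\n' "$dir" "$(basename "$1")" ;;
    esac
}
```

The result can then be placed inside the parentheses of the macro call in place of a hand-typed path.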

The Linux version of the command line (again, split here) uses a different binary name. The OpenOffice installation on my Ubuntu distribution put the ooffice2 binary in my path, so I didn't have to say where it was when starting it. I did enclose the call to the macro in quotes, because otherwise the parentheses confused the shell. Apart from that, the exact same macros installed with the procedure described above worked perfectly:


ooffice2 -invisible 
  "macro:///Standard.MyConversions.SaveAsOOO(/home/bob/temp/sample.doc)"
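The quoting matters even more once the filename arrives in a variable instead of being typed by hand. As a rough sketch, the command above could be wrapped in two small shell functions; the names macro_url and convert_doc and the OOO_BIN override variable are assumptions made for illustration, not anything OpenOffice provides.

```shell
#!/bin/sh
# macro_url: build the macro URL for one input file (hypothetical helper).
macro_url() {
    printf 'macro:///Standard.MyConversions.SaveAsOOO(%s)\n' "$1"
}

# convert_doc: run the conversion for the full path given in $1.
# OOO_BIN is an assumed override for systems where the binary
# is not called ooffice2.
convert_doc() {
    "${OOO_BIN:-ooffice2}" -invisible "$(macro_url "$1")"
}
```

Calling convert_doc /home/bob/temp/sample.doc would then run the same command as shown above, with the macro URL passed as a single quoted argument.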

I tried converting several different files. The sample.doc file is a test file I've kept around for a few years to test the mettle of any program or service that claims to convert Word files to XML. It uses built-in and newly created block and inline styles, nested bulleted lists, a BMP file, a table with spanning cells, an embedded spreadsheet, and a few other things that can throw off a conversion program. SaveAsOOO did fine with it.

Go Forth and Convert MS Office Files

Now that you've got a free, multi-platform tool that can convert new and old (well, at least as old as Office 97) MS Office files to an open XML standard, how can you best put it to good use? Anything that can be run from a command line can be used in an unattended, "lights out" workflow. A Perl script can take a list of filenames and create a batch file or shell script with a series of commands like those shown above to convert those files. If the raw XML is really what you're after, a script can also pull that XML out of the OpenOffice zip file and rename it to correspond with the input file, as in this shell script:


# Remember to include full path with 
# filename for $1 and to omit extension
ooffice2 -invisible "macro:///Standard.MyConversions.SaveAsOOO($1.doc)"
unzip -o "$1.odt" content.xml
cp content.xml "$1.xml"
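When several conversions run from the same directory, each one writes a content.xml there, and the next run overwrites it. One possible variation, sketched here on the assumption that the Info-ZIP unzip program is installed, sends the archive member straight to the output file. The function name extract_xml is hypothetical.

```shell
#!/bin/sh
# extract_xml: write the content.xml stored inside BASE.odt to BASE.xml.
# $1 is the full path without extension, as in the script above.
# unzip -p sends the archive member to standard output, so no
# intermediate content.xml is left in the current directory.
extract_xml() {
    unzip -p "$1.odt" content.xml > "$1.xml"
}
```

For example, extract_xml /home/bob/temp/sample would read /home/bob/temp/sample.odt and write /home/bob/temp/sample.xml.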

Windows batch file version:


REM Remember to include full path with 
REM filename for %1 and to omit extension
set OOOExe="C:\Program Files\OpenOffice.org 2.0\program\soffice"
%OOOExe%  -invisible macro:///Standard.MyConversions.SaveAsOOO(%1.doc)
unzip -o %1.odt content.xml
copy content.xml %1.xml
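The idea mentioned earlier, turning a list of filenames into a script full of conversion commands, does not strictly need Perl. Here is a minimal sketch in plain shell; the function name gen_batch is hypothetical, and it assumes the input has one full path per line with the extension omitted and no double quotes or dollar signs in the names.

```shell
#!/bin/sh
# gen_batch: read base paths (full path, no extension) on standard
# input, one per line, and print one conversion command per path.
gen_batch() {
    # The "|| [ -n ... ]" part keeps a last line that lacks a newline.
    while IFS= read -r base || [ -n "$base" ]; do
        [ -n "$base" ] || continue   # skip blank lines
        printf 'ooffice2 -invisible "macro:///Standard.MyConversions.SaveAsOOO(%s.doc)"\n' "$base"
    done
}
```

Running gen_batch with a file list on standard input and redirecting the output to a file gives a script that can be reviewed before it is run.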

If you're going to make high-volume conversion part of an ongoing daily workflow, restarting OpenOffice for every conversion will slow you down. In Windows, starting up soffice.exe in quickstart mode (with the -quickstart switch on the command line) before doing your conversions should make those conversions go faster. To go a few steps further, the -accept switch specifies a Universal Network Objects (UNO) connection string that lets a program written in C++, OpenOffice Basic, Python, Java, or other languages communicate with the running OpenOffice process and pass input documents to it using API calls.
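A connection string for -accept typically names a transport, a host, a port, and a protocol. The sketch below builds one for a socket connection. The host localhost and port 2002 are assumed example values, and accept_arg is a hypothetical helper name, so check the options against your own installation.

```shell
#!/bin/sh
# accept_arg: print an -accept switch for a socket connection.
# $1 is the host and $2 the port; both default to assumed example values.
accept_arg() {
    printf '%s=socket,host=%s,port=%s;urp;\n' -accept "${1:-localhost}" "${2:-2002}"
}
```

The string contains semicolons, so it has to be quoted on the command line just like the macro call, for example as "$(accept_arg)" after the binary name.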

To me, the exciting part about this is not the ability to convert new Word or Excel files that people send me to OpenOffice XML, but the ability to convert old files. How many old Microsoft Office files do you have access to? What new applications would be possible if you could unlock the information in them by converting those files to a well-documented XML format and then using XML tools to mine that information? Considering that we can do all this with free software that runs on both Windows and Linux, there should be huge new opportunities to explore.


