@psd
Created January 4, 2010 14:47
OpenOffice Document Conversion
Experiments in running a headless OpenOffice as a document converter for TiddlyDocs, etc.
Running OpenOffice Headless:
$ cd /Applications/OpenOffice.org.app/Contents/program #Mac OSX
$ cd /usr/lib/openoffice.org.app/program #CentOS
$ ./soffice.bin -headless -invisible -nofirststartwizard -accept="socket,port=8100;urp;"
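Before pointing a client at the listener, it can help to check that something is accepting connections on the port. A minimal sketch, assuming a Python 3 interpreter (not the one bundled with OpenOffice) and the port 8100 from the command above. `is_listening` is a hypothetical helper name, and a successful TCP connect only shows that the port is open, not that the UNO bridge behind it is healthy.

```python
import socket


def is_listening(host="localhost", port=8100, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_listening()` with no arguments probes localhost:8100.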
init script:
see openoffice.sh and xvfb.sh for init.d scripts
uses Xvfb for virtual X11 DISPLAY:
$ yum install xorg-x11-fonts*
$ yum install Xvfb
http://www.oooforum.org/forum/viewtopic.phtml?t=11890
Art of Solving Java Open Office document conversion client:
http://www.artofsolving.com/opensource/jodconverter
Running from command line:
$ java -jar jodconverter-cli-2.2.2.jar <input-document> <output-document>
.. expands to multiple files, doesn't handle relative directories well
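One possible workaround for the relative-directory problem is to resolve every path before handing it to the CLI. A minimal sketch, assuming Python 3; `jodconverter_command` is a hypothetical helper, and whether absolute paths fully avoid the issue has not been verified here.

```python
import os


def jodconverter_command(jar, input_document, output_document):
    """Build the argument list for the CLI call, with every path made absolute."""
    return [
        "java", "-jar", os.path.abspath(jar),
        os.path.abspath(input_document),
        os.path.abspath(output_document),
    ]
```

The returned list can be passed to `subprocess.check_call` unchanged.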
Running as Tomcat:
bin/startup.sh
http://localhost:8080/converter/
$ wget http://localhost:8080/converter/service \
--post-file=document.odt \
--header="Content-Type: application/vnd.oasis.opendocument.text" \
--header="Accept: application/pdf" \
--output-document=document.pdf
.. doesn't cope well with a document that explodes into HTML plus images
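The same POST can be issued from Python. A minimal sketch, assuming Python 3 and the URL and headers from the wget call above; `build_conversion_request` and `convert` are hypothetical names, and the request is built separately so it can be inspected without a running Tomcat.

```python
import urllib.request

CONVERTER_URL = "http://localhost:8080/converter/service"


def build_conversion_request(document_bytes,
                             input_type="application/vnd.oasis.opendocument.text",
                             output_type="application/pdf"):
    """Build (but do not send) a POST carrying the document and both MIME types."""
    return urllib.request.Request(
        CONVERTER_URL,
        data=document_bytes,
        headers={"Content-Type": input_type, "Accept": output_type},
        method="POST",
    )


def convert(input_path, output_path):
    """POST input_path to the converter and write the response body to output_path."""
    with open(input_path, "rb") as source:
        request = build_conversion_request(source.read())
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as target:
        target.write(response.read())
```

`convert("document.odt", "document.pdf")` mirrors the wget example; like wget, it returns a single response body, so it does not help with the HTML-plus-images case.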
Sun Wiki Publisher Extension:
http://extensions.services.openoffice.org/project/wikipublisher?intcmp=1547
Source:
http://sw.openoffice.org/source/browse/sw/swext/mediawiki/
Python:
http://wiki.services.openoffice.org/wiki/Python
$ export DYLD_LIBRARY_PATH="/Applications/OpenOffice.org.app/Contents/program/"
os.system('DYLD_LIBRARY_PATH="/Applications/OpenOffice.org.app/Contents/program/" /usr/bin/python2.3 ooextract.py --pdf test.odt')
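The shell-string form above can also be written with an explicit environment dictionary, which avoids quoting problems. A minimal sketch, assuming Python 3 as the launching interpreter and the paths from the lines above; `uno_environment` and `run_ooextract` are hypothetical names.

```python
import os
import subprocess

PROGRAM_DIR = "/Applications/OpenOffice.org.app/Contents/program/"


def uno_environment(program_dir=PROGRAM_DIR, base=None):
    """Return a copy of the environment with DYLD_LIBRARY_PATH set to program_dir."""
    env = dict(os.environ if base is None else base)
    env["DYLD_LIBRARY_PATH"] = program_dir
    return env


def run_ooextract(document, python="/usr/bin/python2.3"):
    """Run ooextract.py --pdf on document with the library path set; return the exit status."""
    return subprocess.call([python, "ooextract.py", "--pdf", document],
                           env=uno_environment())
```

`run_ooextract("test.odt")` corresponds to the `os.system` call above.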
http://qa.openoffice.org/issues/long_list.cgi?issuelist=93084
"""
For Mac OS 10.4 (Tiger) the default python 2.3 interpreter works fine. The default python 2.5 interpreter that came with Mac OS 10.5 (Leopard) gave me the following error:
Fatal Python error: Interpreter not initialized (version mismatch?)
Abort trap
I tried running the script with python versions 2.3.4, 2.3.5, and 2.3.7; they all failed with the same error.
"""
http://reidransom.com/geek/scripting-openoffice-org-app-with-python-on-mac-os-x/
http://udk.openoffice.org/python/python-bridge.html#replacing
http://udk.openoffice.org/python/python-bridge.html
"""
UnoConv:
http://dag.wieers.com/home-made/unoconv/
unoconv converts between any document format that OpenOffice understands. It uses OpenOffice's UNO bindings for non-interactive conversion of documents.
Supported document formats include Open Document Format (.odt), MS Word (.doc), MS Office Open/MS OOXML (.xml), Portable Document Format (.pdf), HTML, XHTML, RTF, Docbook (.xml), and more.
"""
Building Python from source for Mac OSX:
Hacked pyconfig.h:
#undef _POSIX_C_SOURCE
#undef _XOPEN_SOURCE
#define HAVE_BROKEN_POSIX_SEMAPHORES
Hacked Makefile:
prefix= /System/Library/Frameworks/Python.framework/Versions/2.3.4
and removed "-u __dummy" from LINKFORSHARED line
Hacked ..openoffice.org .. basis-link/program/uno.py:
import sys
if sys.platform == 'darwin':
    # make sure libpyuno.dylib is found
    import os
    newpath = os.path.split( __file__ )[0]
    cwd = os.getcwd()
    os.chdir( newpath )
    import pyuno
    os.chdir( cwd )
else:
    import pyuno
import __builtin__
#
# PyODConverter (Python OpenDocument Converter) v1.1 - 2009-11-14
#
# This script converts a document from one office format to another by
# connecting to an OpenOffice.org instance via Python-UNO bridge.
#
# Copyright (C) 2008-2009 Mirko Nasato <mirko@artofsolving.com>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl-2.1.html
# - or any later version.
#
DEFAULT_OPENOFFICE_PORT = 8100
import uno
from os.path import abspath, isfile, splitext
from com.sun.star.beans import PropertyValue
from com.sun.star.task import ErrorCodeIOException
from com.sun.star.connection import NoConnectException
FAMILY_TEXT = "Text"
FAMILY_WEB = "Web"
FAMILY_SPREADSHEET = "Spreadsheet"
FAMILY_PRESENTATION = "Presentation"
FAMILY_DRAWING = "Drawing"
#---------------------#
# Configuration Start #
#---------------------#
# see http://wiki.services.openoffice.org/wiki/Framework/Article/Filter
# most formats are auto-detected; only those requiring options are defined here
IMPORT_FILTER_MAP = {
    "txt": {
        "FilterName": "Text (encoded)",
        "FilterOptions": "utf8"
    },
    "csv": {
        "FilterName": "Text - txt - csv (StarCalc)",
        "FilterOptions": "44,34,0"
    }
}
EXPORT_FILTER_MAP = {
    "pdf": {
        FAMILY_TEXT: { "FilterName": "writer_pdf_Export" },
        FAMILY_WEB: { "FilterName": "writer_web_pdf_Export" },
        FAMILY_SPREADSHEET: { "FilterName": "calc_pdf_Export" },
        FAMILY_PRESENTATION: { "FilterName": "impress_pdf_Export" },
        FAMILY_DRAWING: { "FilterName": "draw_pdf_Export" }
    },
    "html": {
        FAMILY_TEXT: { "FilterName": "HTML (StarWriter)" },
        FAMILY_SPREADSHEET: { "FilterName": "HTML (StarCalc)" },
        FAMILY_PRESENTATION: { "FilterName": "impress_html_Export" }
    },
    "odt": {
        FAMILY_TEXT: { "FilterName": "writer8" },
        FAMILY_WEB: { "FilterName": "writerweb8_writer" }
    },
    "doc": {
        FAMILY_TEXT: { "FilterName": "MS Word 97" }
    },
    "rtf": {
        FAMILY_TEXT: { "FilterName": "Rich Text Format" }
    },
    "txt": {
        FAMILY_TEXT: {
            "FilterName": "Text",
            "FilterOptions": "utf8"
        }
    },
    "ods": {
        FAMILY_SPREADSHEET: { "FilterName": "calc8" }
    },
    "xls": {
        FAMILY_SPREADSHEET: { "FilterName": "MS Excel 97" }
    },
    "csv": {
        FAMILY_SPREADSHEET: {
            "FilterName": "Text - txt - csv (StarCalc)",
            "FilterOptions": "44,34,0"
        }
    },
    "odp": {
        FAMILY_PRESENTATION: { "FilterName": "impress8" }
    },
    "ppt": {
        FAMILY_PRESENTATION: { "FilterName": "MS PowerPoint 97" }
    },
    "swf": {
        FAMILY_DRAWING: { "FilterName": "draw_flash_Export" },
        FAMILY_PRESENTATION: { "FilterName": "impress_flash_Export" }
    }
}
PAGE_STYLE_OVERRIDE_PROPERTIES = {
    FAMILY_SPREADSHEET: {
        #--- Scale options: uncomment 1 of the 3 ---
        # a) 'Reduce / enlarge printout': 'Scaling factor'
        "PageScale": 100,
        # b) 'Fit print range(s) to width / height': 'Width in pages' and 'Height in pages'
        #"ScaleToPagesX": 1, "ScaleToPagesY": 1000,
        # c) 'Fit print range(s) on number of pages': 'Fit print range(s) on number of pages'
        #"ScaleToPages": 1,
        "PrintGrid": False
    }
}
#-------------------#
# Configuration End #
#-------------------#
class DocumentConversionException(Exception):
    def __init__(self, message):
        self.message = message
    def __str__(self):
        return self.message
class DocumentConverter:
    def __init__(self, port=DEFAULT_OPENOFFICE_PORT):
        localContext = uno.getComponentContext()
        resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
        try:
            context = resolver.resolve("uno:socket,host=localhost,port=%s;urp;StarOffice.ComponentContext" % port)
        except NoConnectException:
            raise DocumentConversionException, "failed to connect to OpenOffice.org on port %s" % port
        self.desktop = context.ServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", context)
    def convert(self, inputFile, outputFile):
        inputUrl = self._toFileUrl(inputFile)
        outputUrl = self._toFileUrl(outputFile)
        loadProperties = { "Hidden": True }
        inputExt = self._getFileExt(inputFile)
        if IMPORT_FILTER_MAP.has_key(inputExt):
            loadProperties.update(IMPORT_FILTER_MAP[inputExt])
        document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, self._toProperties(loadProperties))
        try:
            document.refresh()
        except AttributeError:
            pass
        family = self._detectFamily(document)
        self._overridePageStyleProperties(document, family)
        outputExt = self._getFileExt(outputFile)
        storeProperties = self._getStoreProperties(document, outputExt)
        try:
            document.storeToURL(outputUrl, self._toProperties(storeProperties))
        finally:
            document.close(True)
    def _overridePageStyleProperties(self, document, family):
        if PAGE_STYLE_OVERRIDE_PROPERTIES.has_key(family):
            properties = PAGE_STYLE_OVERRIDE_PROPERTIES[family]
            pageStyles = document.getStyleFamilies().getByName('PageStyles')
            for styleName in pageStyles.getElementNames():
                pageStyle = pageStyles.getByName(styleName)
                for name, value in properties.items():
                    pageStyle.setPropertyValue(name, value)
    def _getStoreProperties(self, document, outputExt):
        family = self._detectFamily(document)
        try:
            propertiesByFamily = EXPORT_FILTER_MAP[outputExt]
        except KeyError:
            raise DocumentConversionException, "unknown output format: '%s'" % outputExt
        try:
            return propertiesByFamily[family]
        except KeyError:
            raise DocumentConversionException, "unsupported conversion: from '%s' to '%s'" % (family, outputExt)
    def _detectFamily(self, document):
        if document.supportsService("com.sun.star.text.WebDocument"):
            return FAMILY_WEB
        if document.supportsService("com.sun.star.text.GenericTextDocument"):
            # must be TextDocument or GlobalDocument
            return FAMILY_TEXT
        if document.supportsService("com.sun.star.sheet.SpreadsheetDocument"):
            return FAMILY_SPREADSHEET
        if document.supportsService("com.sun.star.presentation.PresentationDocument"):
            return FAMILY_PRESENTATION
        if document.supportsService("com.sun.star.drawing.DrawingDocument"):
            return FAMILY_DRAWING
        raise DocumentConversionException, "unknown document family: %s" % document
    def _getFileExt(self, path):
        ext = splitext(path)[1]
        if ext is not None:
            return ext[1:].lower()
    def _toFileUrl(self, path):
        return uno.systemPathToFileUrl(abspath(path))
    def _toProperties(self, dict):
        props = []
        for key in dict:
            prop = PropertyValue()
            prop.Name = key
            prop.Value = dict[key]
            props.append(prop)
        return tuple(props)
if __name__ == "__main__":
    from sys import argv, exit
    if len(argv) < 3:
        print "USAGE: python %s <input-file> <output-file>" % argv[0]
        exit(255)
    if not isfile(argv[1]):
        print "no such input file: %s" % argv[1]
        exit(1)
    try:
        converter = DocumentConverter()
        converter.convert(argv[1], argv[2])
    except DocumentConversionException, exception:
        print "ERROR! " + str(exception)
        exit(1)
    except ErrorCodeIOException, exception:
        print "ERROR! ErrorCodeIOException %d" % exception.ErrCode
        exit(1)
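The converter script above picks its export filter in two steps: first by the output file's extension, then by the family of the loaded document. A minimal sketch of that lookup on its own, assuming Python 3 and no OpenOffice connection. `EXPORT_FILTERS` is a trimmed copy of two entries from `EXPORT_FILTER_MAP`, and `file_ext` and `store_properties` are hypothetical stand-ins for `_getFileExt` and `_getStoreProperties`, raising `ValueError` instead of `DocumentConversionException`.

```python
from os.path import splitext

# Trimmed copy of two EXPORT_FILTER_MAP entries from the script above.
EXPORT_FILTERS = {
    "pdf": {
        "Text": {"FilterName": "writer_pdf_Export"},
        "Spreadsheet": {"FilterName": "calc_pdf_Export"},
    },
    "doc": {
        "Text": {"FilterName": "MS Word 97"},
    },
}


def file_ext(path):
    """Lower-cased extension without the leading dot ('' if there is none)."""
    return splitext(path)[1][1:].lower()


def store_properties(family, output_file):
    """Look up export properties by output extension, then by document family."""
    ext = file_ext(output_file)
    if ext not in EXPORT_FILTERS:
        raise ValueError("unknown output format: '%s'" % ext)
    by_family = EXPORT_FILTERS[ext]
    if family not in by_family:
        raise ValueError("unsupported conversion: from '%s' to '%s'" % (family, ext))
    return by_family[family]
```

The two error cases correspond to the two `KeyError` handlers in `_getStoreProperties`: an extension missing from the map, and an extension that exists but has no entry for the document's family (for example a spreadsheet saved as `.doc`).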
# OpenOffice utils.
#
# Based on code from:
# PyODConverter (Python OpenDocument Converter) v1.0.0 - 2008-05-05
# Copyright (C) 2008 Mirko Nasato <mirko@artofsolving.com>
# Licensed under the GNU LGPL v2.1 - or any later version.
# http://www.gnu.org/licenses/lgpl-2.1.html
#
import sys
import os
import time
import atexit
OPENOFFICE_PORT = 8100
# Find OpenOffice.
_oopaths=(
    ('/usr/lib64/ooo-2.0/program', '/usr/lib64/ooo-2.0/program'),
    ('/opt/openoffice.org3/program', '/opt/openoffice.org/basis3.0/program'),
)
for p in _oopaths:
    if os.path.exists(p[0]):
        OPENOFFICE_PATH = p[0]
        OPENOFFICE_BIN = os.path.join(OPENOFFICE_PATH, 'soffice')
        OPENOFFICE_LIBPATH = p[1]
        # Add to path so we can find uno.
        if sys.path.count(OPENOFFICE_LIBPATH) == 0:
            sys.path.insert(0, OPENOFFICE_LIBPATH)
        break
import uno
from com.sun.star.beans import PropertyValue
from com.sun.star.connection import NoConnectException
class OORunner:
    """
    Start, stop, and connect to OpenOffice.
    """
    def __init__(self, port=OPENOFFICE_PORT):
        """ Create OORunner that connects on the specified port. """
        self.port = port
    def connect(self, no_startup=False):
        """
        Connect to OpenOffice.
        If a connection cannot be established try to start OpenOffice.
        """
        localContext = uno.getComponentContext()
        resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
        context = None
        did_start = False
        n = 0
        while n < 6:
            try:
                context = resolver.resolve("uno:socket,host=localhost,port=%d;urp;StarOffice.ComponentContext" % self.port)
                break
            except NoConnectException:
                pass
            # If first connect failed then try starting OpenOffice.
            if n == 0:
                # Exit loop if startup not desired.
                if no_startup:
                    break
                self.startup()
                did_start = True
            # Pause and try again to connect
            time.sleep(1)
            n += 1
        if not context:
            raise Exception, "Failed to connect to OpenOffice on port %d" % self.port
        desktop = context.ServiceManager.createInstanceWithContext("com.sun.star.frame.Desktop", context)
        if not desktop:
            raise Exception, "Failed to create OpenOffice desktop on port %d" % self.port
        if did_start:
            _started_desktops[self.port] = desktop
        return desktop
    def startup(self):
        """
        Start a headless instance of OpenOffice.
        """
        args = [OPENOFFICE_BIN,
                '-accept=socket,host=localhost,port=%d;urp;StarOffice.ServiceManager' % self.port,
                '-norestore',
                '-nofirststartwizard',
                '-nologo',
                '-headless',
                ]
        env = {'PATH' : '/bin:/usr/bin:%s' % OPENOFFICE_PATH,
               'PYTHONPATH' : OPENOFFICE_LIBPATH,
               }
        try:
            pid = os.spawnve(os.P_NOWAIT, args[0], args, env)
        except Exception, e:
            # Format the exception itself: older Python 2 exceptions have no .message attribute.
            raise Exception, "Failed to start OpenOffice on port %d: %s" % (self.port, e)
        if pid <= 0:
            raise Exception, "Failed to start OpenOffice on port %d" % self.port
    def shutdown(self):
        """
        Shutdown OpenOffice.
        """
        try:
            if _started_desktops.get(self.port):
                _started_desktops[self.port].terminate()
                del _started_desktops[self.port]
        except Exception, e:
            pass
# Keep track of started desktops and shut them down on exit.
_started_desktops = {}
def _shutdown_desktops():
    """ Shutdown all OpenOffice desktops that were started by the program. """
    for port, desktop in _started_desktops.items():
        try:
            if desktop:
                desktop.terminate()
        except Exception, e:
            pass
atexit.register(_shutdown_desktops)
def oo_shutdown_if_running(port=OPENOFFICE_PORT):
    """ Shutdown OpenOffice if it's running on the specified port. """
    oorunner = OORunner(port)
    try:
        desktop = oorunner.connect(no_startup=True)
        desktop.terminate()
    except Exception, e:
        pass
def oo_properties(**args):
    """
    Convert args to OpenOffice property values.
    """
    props = []
    for key in args:
        prop = PropertyValue()
        prop.Name = key
        prop.Value = args[key]
        props.append(prop)
    return tuple(props)
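The control flow of `OORunner.connect` above (try to connect, start the server after the first failure, then sleep and retry up to six times) can be exercised without OpenOffice by injecting the two operations. A minimal sketch, assuming Python 3; `connect_with_retry` and its parameters are hypothetical names, and the built-in `ConnectionError` stands in for `NoConnectException`.

```python
import time


def connect_with_retry(try_connect, start_server, attempts=6, delay=1.0,
                       no_startup=False, sleep=time.sleep):
    """Call try_connect until it succeeds; start the server after the first failure.

    try_connect returns a connection or raises ConnectionError.
    Returns (connection, did_start); raises RuntimeError if nothing connects.
    """
    did_start = False
    for attempt in range(attempts):
        try:
            return try_connect(), did_start
        except ConnectionError:
            pass
        if attempt == 0:
            # Give up immediately when the caller does not want a server started.
            if no_startup:
                break
            start_server()
            did_start = True
        # Pause before the next attempt so the server has time to come up.
        sleep(delay)
    raise RuntimeError("failed to connect")
```

Returning `did_start` alongside the connection mirrors how `connect` only registers desktops it started itself for shutdown at exit.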
#!/bin/bash
OOo_HOME=/usr/bin
SOFFICE_PATH=$OOo_HOME/soffice
PIDFILE=/var/run/openoffice-server.pid
set -e
case "$1" in
    start)
        if [ -f $PIDFILE ]; then
            echo "OpenOffice headless server has already started."
            sleep 5
            exit
        fi
        echo "Starting OpenOffice headless server"
        # Redirect output first, then background; "& > /dev/null" would leave the server's output unredirected.
        $SOFFICE_PATH -display :1 -headless -nologo -nofirststartwizard -accept="socket,host=127.0.0.1,port=8100;urp" > /dev/null 2>&1 &
        touch $PIDFILE
        ;;
    stop)
        if [ -f $PIDFILE ]; then
            echo "Stopping OpenOffice headless server."
            killall -9 soffice && killall -9 soffice.bin
            rm -f $PIDFILE
            exit
        fi
        echo "OpenOffice headless server is not running."
        exit
        ;;
    *)
        echo "Usage: $0 {start|stop}"
        exit 1
esac
exit 0
#!/bin/sh
xvfb_start() {
    if [ -x /usr/bin/Xvfb ]; then
        echo "Starting Virtual Frame Buffer X Server (Xvfb) as local display :1.0"
        echo " /usr/bin/Xvfb :1 -screen 0 800x600x16 -fbdir /usr/src"
        /usr/bin/Xvfb :1 -screen 0 800x600x16 -fbdir /usr/src &
    else
        echo "Error: Could not find /usr/bin/Xvfb. Cannot start Xvfb."
    fi
}
xvfb_stop() {
    if [ -x /usr/bin/killall ]; then
        echo "Stopping Virtual Frame Buffer X Server (Xvfb) for local display :1.0"
        /usr/bin/killall Xvfb 2> /dev/null
    else
        echo "Error: Could not find /usr/bin/killall. Cannot stop Xvfb."
    fi
}
case "$1" in
    'start')
        xvfb_start
        ;;
    'stop')
        xvfb_stop
        ;;
    'restart')
        xvfb_stop
        sleep 1
        xvfb_start
        ;;
    *)
        if [ -x /usr/bin/basename ]; then
            echo "usage: `/usr/bin/basename $0` start|stop|restart"
        else
            echo "usage: $0 start|stop|restart"
        fi
esac