@mmulich
Last active December 16, 2015 00:29
cnx-transforms

JOD Service

JOD (Java Open Document converter) is the alternative to the *office headless mode that we wish to use in production. The *office headless mode was used in the past, but it is not ideal and is not suitable for concurrent builds. It will remain available to developers as a feasible alternative to the heavy-handed JOD service.

The JOD service was originally built into OERPub's SWORD implementation. Connexions came across it as a solution to issues raised by developers who were not running the GUI environments that a reliable headless *office needs. Around the same time we also realized that *office headless mode could become a bottleneck under concurrent tasks, a problem the JOD service had already solved.

The JOD service will give us the ability to do concurrent builds and have the conversion as a service.

Installation

These instructions assume you are using an Ubuntu system, because this would turn into a small book if we were to include instructions for every platform. Make your best effort to translate these instructions to your platform and feel free to ask the developers for assistance.

Installing MS Word -> *office dependencies

A *office installation will be required (LibreOffice, OpenOffice or StarOffice). You are free to choose whichever, but this documentation will use LibreOffice. You can adapt your environment to whichever you choose.

Installation of LibreOffice can be done using the following command:

$ sudo apt-get install libreoffice

A macro needs to be added to *office for the transform to be successful (see also: oerpub.rhaptoslabs.swordpushweb):

$ wget https://raw.github.com/oerpub/oerpub.rhaptoslabs.swordpushweb/develop/docs/office_macro/Module1.xba
$ mkdir -p ~/.config/.libreoffice/3/user/basic/Standard/
$ mv Module1.xba ~/.config/.libreoffice/3/user/basic/Standard/

Installing Python-JOD

Python-JOD is a poorly named project that provides an interface for the *office document conversions.

Clone and navigate into the project:

$ git clone https://github.com/oerpub/Python-JOD.git
$ cd Python-JOD

Note

Adjust PIPE_PATH in install.sh to reflect whichever flavor of *office you installed.

You'll need to install Maven in order to run the installer. On Ubuntu, this would be:

$ sudo apt-get install openjdk-6-jdk maven

Then issue the installer command:

$ sudo JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::") \
JRE_HOME=$(readlink -f /usr/bin/java | sed "s:/bin/java::") \
./install.sh

(see also: https://github.com/oerpub/Python-JOD/blob/master/jodconverter-webapp-build/README.txt)

Usage

The '--dev-mode' flag enables the *office headless mode rather than the full-blown JOD service. It is only meant to be used for development. Messages are written to standard error (stderr) because standard out (stdout) is used for piping output.
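
As a rough sketch of how a command could expose this flag while keeping stdout free for piped data, consider the following. Only the '--dev-mode' flag itself comes from this document; the parser, the function names and the message text are hypothetical and are not the actual cnxtransforms CLI.

```python
import argparse
import sys


def build_parser():
    """Build a hypothetical parser; only --dev-mode is taken from this README."""
    parser = argparse.ArgumentParser(description="Example transform command")
    parser.add_argument(
        "--dev-mode",
        action="store_true",
        help="use *office headless mode instead of the JOD service",
    )
    return parser


def report(message, stream=None):
    """Write a diagnostic message to stderr so stdout carries only data."""
    stream = sys.stderr if stream is None else stream
    stream.write(message + "\n")


def main(argv=None):
    """Pick a conversion backend from the flag and report the choice."""
    args = build_parser().parse_args(argv)
    backend = "headless *office" if args.dev_mode else "JOD service"
    report("converting with: " + backend)
    return backend
```

Calling main(["--dev-mode"]) selects the headless backend, and the diagnostic line goes to stderr, so it never ends up in a pipe between two transforms.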

Logging

Logging is done in two areas of this package: 1) at the library layer and 2) at the commandline interface layer.

Library Logger

The library logger is available as logger in cnxtransforms.reporting, and its logger name is cnxtransforms. The module is named reporting so that it doesn't conflict with the standard library's logging module.

>>> from cnxtransforms.reporting import logger

This is the logger that should be used throughout the transformation library functions.

The package provides a default logging configuration (in the package as default_logging.cfg) which will log all info level messages to standard error (stderr). The library user can customize the logging configuration by creating their own at /etc/cnx-transforms/logging.cfg or ~/.cnx-transforms/logging.cfg.

Note

Only one of these configuration files is used at a time and therefore it may be useful to copy the one provided in this package as a starting point.
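
The lookup could work roughly as sketched below. The two paths and the cnxtransforms logger name come from this document; the function names and the search order (user file before system file) are assumptions made for illustration, not the package's actual behavior.

```python
import logging
import logging.config
import os

# Candidate locations named in this README. The order in which they are
# checked (user file first, then system file) is an assumption.
USER_CONFIG = os.path.expanduser("~/.cnx-transforms/logging.cfg")
SYSTEM_CONFIG = "/etc/cnx-transforms/logging.cfg"


def find_logging_config(default_path, candidates=(USER_CONFIG, SYSTEM_CONFIG)):
    """Return the first candidate that exists, else the packaged default."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return default_path


def configure_logging(default_path):
    """Load exactly one configuration file and return the library logger."""
    config_path = find_logging_config(default_path)
    logging.config.fileConfig(config_path, disable_existing_loggers=False)
    return logging.getLogger("cnxtransforms")
```

Because only one file is ever loaded, a custom file has to be complete on its own, which is why copying the packaged default is a sensible starting point.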

Commandline Logger

The commandline logger is available in the cnxtransforms.cli module as logger with the logger name as cnxtransforms.cli.

This logger is set up similarly to the library logger. By default it reports at the info level to stderr.

String and Bytes interface

String inherits from io.StringIO with one major difference: it has a name parameter, which makes it a named StringIO buffer. This makes it possible to pass the buffer straight into a File without explicitly specifying a name.

>>> from cnxtransforms import String
>>> str_buf = String(u"The dingo ate my baby!",
...                  name="dingo.poo")
>>> str_buf
<String instance of 'dingo.poo'>

Likewise, Bytes has the same named buffer interface, except that it inherits from io.BytesIO rather than io.StringIO.

>>> from cnxtransforms import Bytes
>>> b_buf = Bytes(b"PK\x03\x04\x14\x00\x00\x08\x00\x00f\x1e\x8aB",
...               name="junk.bin")
>>> b_buf
<Bytes instance of 'junk.bin'>
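
For readers curious how such named buffers can be built, here is a minimal sketch. The class names NamedStringBuffer and NamedBytesBuffer are hypothetical stand-ins; the real String and Bytes classes may be implemented differently.

```python
import io


class NamedStringBuffer(io.StringIO):
    """Hypothetical stand-in for String: an io.StringIO that carries a name."""

    def __init__(self, initial_value="", name=None):
        super().__init__(initial_value)
        self.name = name

    def __repr__(self):
        return "<{} instance of {!r}>".format(type(self).__name__, self.name)


class NamedBytesBuffer(io.BytesIO):
    """Hypothetical stand-in for Bytes: an io.BytesIO that carries a name."""

    def __init__(self, initial_bytes=b"", name=None):
        super().__init__(initial_bytes)
        self.name = name

    def __repr__(self):
        return "<{} instance of {!r}>".format(type(self).__name__, self.name)
```

Keeping the name on the buffer itself means any consumer that needs a filename, such as an archive writer, can read it from the object instead of taking it as a separate argument.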

File interface

File inherits from io.FileIO and adds some context-specific properties.

>>> import os
>>> from cnxtransforms import File
>>> from cnxtransforms import word_to_ooo

>>> address = 'localhost:2002'
>>> filepath = os.path.join('cnxtransforms', 'test-data', 'test-document.docx')
>>> file = File(filepath)
>>> file.filepath
'/mnt/hgfs/cnx-transforms/cnxtransforms/test-data/test-document.docx'
>>> file.filename
'test-document.docx'
>>> file.basepath
'/mnt/hgfs/cnx-transforms/cnxtransforms/test-data'

Calling the transform returns the output, which is also a File object.

>>> output = word_to_ooo(file, server_address=address)
>>> output.filepath
'/mnt/hgfs/cnx-transforms/cnxtransforms/test-data/test-document.docx.odt'

By default, the function creates an output object when one isn't passed in.

Files can be created on the fly from any IOBase object using the from_io class method.

>>> import io
>>> io_value = io.BytesIO(b"PK\x03\x04\x14\x00\x00\x08\x00\x00f\x1e\x8aB^\xc62\x0c'\x00\x00\x00'\x00\x00\x00\x08\x00\x00\x00mimetypeapplication/vnd.oasis.opendocument.textPK\x03\x04\x14\x00\x00\x08\x00\x00f\x1e\x8aB\xeeY:\xc8\xd4\xee\x01\x00\xd4\xee\x01\x00-\x00\x00\x00Pictures/10000201000004AD0000020B937CE175.png\x89PNG\r\n")
>>> File.from_io(io_value)
<File instance of '/tmp/...'>
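
One way such a class method could work is to copy the stream into a temporary file and open that. The sketch below uses a hypothetical TempBackedFile class, handles binary streams only, and leaves cleanup of the temporary file to the caller; it is not the actual File implementation.

```python
import io
import os
import shutil
import tempfile


class TempBackedFile(io.FileIO):
    """Hypothetical stand-in for File with path-related properties."""

    @property
    def filepath(self):
        # Absolute path of the underlying file.
        return os.path.abspath(self.name)

    @property
    def filename(self):
        return os.path.basename(self.filepath)

    @property
    def basepath(self):
        return os.path.dirname(self.filepath)

    @classmethod
    def from_io(cls, source, suffix=""):
        """Copy a binary IOBase object into a temporary file and open it."""
        fd, path = tempfile.mkstemp(suffix=suffix)
        with os.fdopen(fd, "wb") as target:
            shutil.copyfileobj(source, target)
        # The temporary file is not removed automatically.
        return cls(path, "r")
```

A call such as TempBackedFile.from_io(io.BytesIO(data), suffix=".odt") yields a readable file object whose filepath points at the temporary copy.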

File Sequences interface

A FileSequence inherits from collections.abc.MutableSequence. It provides some specialization that will later allow us to do things like:

>>> from cnxtransforms import to_zipfile
>>> zip = to_zipfile(file_sequence)
>>> zip
<ZipFile object at 0x...>

In the near term this is useful when a transform produces more than one outcome. For example, the ODT to CNXML transform will split the content and resources (e.g. images), which will result in more than one output.

>>> from cnxtransforms import File
>>> file = File(os.path.join('cnxtransforms', 'test-data',
...                          'test-document.docx.odt'))
>>> from cnxtransforms import odt_to_cnxml
>>> cnxml = odt_to_cnxml(file)
>>> type(cnxml)
cnxtransforms.FileSequence
>>> cnxml
[<String instance of 'index.cnxml'>,
 <Bytes instance of 'Picture.png'>,
 <Bytes instance of 'Picture.jpg'>,
 <Bytes instance of 'graphics1.jpg'>]

This is also useful when working with zipfiles and other archives. The following is an example of instantiating a FileSequence using a zipfile.ZipFile instance.

>>> import zipfile
>>> zfile = zipfile.ZipFile('html.zip')
>>> from cnxtransforms import FileSequence
>>> archive = FileSequence.from_zipfile(zfile)
>>> archive
[<String instance of 'index.html'>,
 <Bytes instance of 'graphic.jpg'>]
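
To make the idea concrete, here is a minimal sketch of a MutableSequence of named buffers that can be read from, and written back to, a zip archive. NamedBytes, BufferSequence and this to_zipfile are hypothetical stand-ins written for illustration. Unlike the output above, the sketch reads every member as bytes and does not distinguish String from Bytes.

```python
import collections.abc
import io
import zipfile


class NamedBytes(io.BytesIO):
    """Hypothetical named buffer used only by this sketch."""

    def __init__(self, data=b"", name=None):
        super().__init__(data)
        self.name = name


class BufferSequence(collections.abc.MutableSequence):
    """Hypothetical stand-in for FileSequence: a list-like of named buffers."""

    def __init__(self, items=()):
        self._items = list(items)

    # The five methods MutableSequence requires; the mixin derives the rest
    # (append, extend, remove, iteration and so on) from them.
    def __getitem__(self, index):
        return self._items[index]

    def __setitem__(self, index, value):
        self._items[index] = value

    def __delitem__(self, index):
        del self._items[index]

    def __len__(self):
        return len(self._items)

    def insert(self, index, value):
        self._items.insert(index, value)

    @classmethod
    def from_zipfile(cls, archive):
        """Read every file member of an open zipfile.ZipFile into a buffer."""
        return cls(NamedBytes(archive.read(name), name=name)
                   for name in archive.namelist()
                   if not name.endswith("/"))  # skip directory entries


def to_zipfile(sequence):
    """Pack the named buffers of a sequence into an in-memory ZipFile."""
    storage = io.BytesIO()
    with zipfile.ZipFile(storage, "w") as archive:
        for buf in sequence:
            archive.writestr(buf.name, buf.getvalue())
    storage.seek(0)
    # Reopen read-only so the caller gets an archive it can inspect.
    return zipfile.ZipFile(storage, "r")
```

Building on the abstract base class keeps the container small: only the five required methods are written by hand, and the result can be passed anywhere a list of outputs is expected.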

Command-line Interface

The command-line interface (CLI) is set up to behave like the Docutils conversion tools: an input file path can be given, and an output file path can be given along with it. The output defaults to standard out (stdout). If the input comes from standard in (stdin), the output must go to stdout.

Each command is set up to facilitate one transformation that can be piped into another. For example:

$ cat test-document.docx | word2soffice | soffice2cnxml > content.zip
$ cnxml2html content.zip > html.zip
$ cat html.zip | html2cnxml > cnxml.zip
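
The input and output rules described above can be sketched with two optional positional arguments. This is an illustration only: parse_io_args, run and the bytes-in, bytes-out transform signature are hypothetical and do not describe the actual commands.

```python
import argparse
import sys


def parse_io_args(argv=None):
    """Parse an optional input path and an optional output path.

    Both arguments are positional, so an output path can only be given
    together with an input path; input from stdin implies output to stdout.
    """
    parser = argparse.ArgumentParser(description="Example transform command")
    parser.add_argument("input", nargs="?", default=None,
                        help="input file path (default: stdin)")
    parser.add_argument("output", nargs="?", default=None,
                        help="output file path (default: stdout)")
    return parser.parse_args(argv)


def run(transform, argv=None):
    """Apply transform (bytes in, bytes out) from the source to the sink."""
    args = parse_io_args(argv)
    source = open(args.input, "rb") if args.input else sys.stdin.buffer
    sink = open(args.output, "wb") if args.output else sys.stdout.buffer
    try:
        sink.write(transform(source.read()))
        sink.flush()
    finally:
        # Close only the streams this function opened itself.
        if args.input:
            source.close()
        if args.output:
            sink.close()
```

Reading and writing in binary mode matters here because the intermediate results in the pipelines above are zip archives rather than text.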

reedstrm commented Apr 9, 2013

I think I understand the behavior, but my question is why? What's the client and why does it help to specialize File this way?

mmulich commented Apr 9, 2013

What part of the behavior are you questioning?

File doesn't need to be specialized. I could just as easily use any IOBase object, but for debugging purposes the properties are helpful.

mmulich commented Apr 9, 2013

By the way, I'm planning to change this so that it doesn't depend on files at all. Ideally only the beginning and end points would do filesystem writes; everything else would be buffers, some of which may indirectly write to the filesystem.

mmulich commented Apr 10, 2013

Also, mimetypes will be added into the specializations to promote better error discovery as well as provide adaptation paths.
