@datacustodian
Last active December 7, 2023 15:22
Document Conversion

This document outlines some ideas for document conversion on Linux and Mac OS X platforms using command line tools. Distribute documents as plain text using UTF-8 encoding whenever possible. Everyone should embrace the mantra "plain text is beautiful".

Document Metadata

Use the file command to obtain basic metadata for most file formats. For image files, make sure ImageMagick is installed, then use its identify command to extract image metadata.
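
As a minimal illustration, the sketch below asks file for the MIME type and character set of a small sample file. The name sample.txt is a placeholder, and the identify line is left commented out because it assumes ImageMagick is installed and an image named photo.png exists.

```shell
# Create a small sample so the example is self-contained (placeholder name).
printf 'plain text is beautiful\n' > sample.txt

# -b omits the file name from the output; --mime-type prints only the MIME type.
mime_type=$(file -b --mime-type sample.txt)
echo "$mime_type"

# --mime-encoding reports the character set that file detects.
file -b --mime-encoding sample.txt

# For images (assumes ImageMagick is installed and photo.png exists):
# identify photo.png
```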

Encoding Conversion

Use the iconv command to convert plain text from one encoding to another. The basic usage is:

$ iconv -c -f <source_encoding> -t <target_encoding> input.txt > output.txt

The -c option discards characters that cannot be converted, and the angle brackets mark placeholders for required arguments. For a list of supported encodings, run

$ iconv -l
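
For instance, the sketch below writes a short string in ISO-8859-1 and converts it to UTF-8. The file names latin1.txt and utf8.txt are placeholders chosen for this example.

```shell
# Write "café" as ISO-8859-1: é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > latin1.txt

# Convert to UTF-8; -c silently drops any character that cannot be converted.
iconv -c -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt

# In UTF-8 the é takes two bytes (0xC3 0xA9), so the file grows by one byte.
cat utf8.txt
```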

Text Extraction

PDF

The Poppler library (https://poppler.freedesktop.org/), based on Xpdf, comes with a suite of PDF tools. Use its pdftotext command to extract text from a PDF file, assuming a text layer exists.
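
A minimal sketch of a wrapper around pdftotext follows. The function name pdf_to_text and the file name report.pdf are made up for illustration, and the choice of the -layout and -enc options is an assumption about what is usually wanted, not part of the original notes.

```shell
# Extract the text layer of a PDF into a .txt file next to it.
# Assumes pdftotext (from Poppler) is on PATH.
pdf_to_text() {
    input=$1
    output="${input%.pdf}.txt"   # report.pdf -> report.txt
    # -layout keeps the physical layout (columns, tables) as far as possible;
    # -enc UTF-8 sets the output encoding explicitly.
    pdftotext -layout -enc UTF-8 "$input" "$output"
}

# Usage (report.pdf is a placeholder):
# pdf_to_text report.pdf
```

If the PDF is a scan without a text layer, the output file will be empty or nearly so.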

HTML

Use the html2text command (http://www.mbayer.de/html2text/) to extract text from an HTML file.

DjVu

DjVuLibre (http://djvu.sourceforge.net/), an open source DjVu library and viewer, comes with a suite of command line utilities. Use the djvutxt command to extract text from a DjVu file, assuming a text layer exists.

RTF

Use the unrtf command from UnRTF (https://www.gnu.org/software/unrtf/) to extract text from RTF files.

XML

Install the xml-twig-tools package, which provides the xml_grep command.

Use xml_grep to extract text from an XML document:

$ xml_grep example.xml --text_only

To extract text only from the mytag element:

$ xml_grep 'mytag' example.xml --text_only

Formatting Plain Text

Mac OS X

Use the textutil command to convert plain text to rtf, rtfd, html, doc, docx, odt, wordml, and webarchive formats. The -info option extracts basic metadata from files of these formats. textutil is based on the Cocoa framework, so it isn't available on Linux.

Use the cupsfilter command to convert non-PDF formats to PDF.

Plain Text to PostScript

Use the enscript command (http://www.linuxfromscratch.org/blfs/view/svn/pst/enscript.html) to convert text files to PostScript, HTML, or RTF. Unfortunately, enscript does not support UTF-8 encoding.

Use the paps command (http://paps.sourceforge.net/) to format UTF-8 plain text files as PostScript. paps requires the Pango library (http://www.pango.org/).

pandoc

Use the pandoc command (http://pandoc.org/) to convert among popular markup formats.

Note that pandoc supports the newer XML-based docx MS Word format but not the older OLE-based doc MS Word format.
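
As a sketch of a typical pandoc call, the function below converts a docx file to Markdown. The function name docx_to_markdown and the file name draft.docx are hypothetical; -f, -t, and -o are pandoc's options for input format, output format, and output file.

```shell
# Convert a .docx file to Markdown, naming the formats explicitly
# instead of relying on pandoc's guess from the file extensions.
# Assumes pandoc is on PATH.
docx_to_markdown() {
    input=$1
    pandoc -f docx -t markdown "$input" -o "${input%.docx}.md"
}

# Usage (draft.docx is a placeholder):
# docx_to_markdown draft.docx
```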

Specific Document Format Conversions

Mac OS X

Use the textutil command to convert among txt, rtf, rtfd, html, doc, docx, odt, wordml, and webarchive formats.

Use the cupsfilter command to convert TXT to PDF and HTML to PDF.

LibreOffice in headless mode

If you have LibreOffice installed on your system, you can run the soffice command in headless mode to convert documents:

$ soffice --headless --convert-to <TargetFileExtension>[:<NameOfFilter>] input_file.xxx

The square brackets around :<NameOfFilter> mean that this part is optional. The output file will be named input_file.TargetFileExtension. On the Windows command line, the convert-to parameter uses only one dash.

Please refer to LibreOffice documentation for details: https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
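
A minimal batch sketch, assuming soffice is on PATH: the function below converts every .doc file in the current directory to PDF. The function name convert_docs_to_pdf and the output directory pdf are placeholders; --outdir is the LibreOffice option that sets where converted files are written.

```shell
# Convert every .doc file in the current directory to PDF.
# Results go to ./pdf (a placeholder directory name) via --outdir.
convert_docs_to_pdf() {
    for f in *.doc; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        soffice --headless --convert-to pdf --outdir pdf "$f"
    done
}

# Usage:
# convert_docs_to_pdf
```

Because LibreOffice reads the older OLE-based doc format, the same loop with docx as the target extension is one way to prepare such files for tools that only accept docx.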

PostScript to PDF

Use the pstopdf command to convert PostScript to PDF on Mac OS X. On Linux, the ps2pdf command from Ghostscript does the same job.

DjVu to XML

Use the djvutoxml command from the DjVuLibre library (http://djvu.sourceforge.net/) to convert DjVu to XML.

RTF to HTML

Use UnRTF to convert RTF files to HTML files. UnRTF also supports LaTeX and ASCII plain text output.
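
A short sketch of the HTML case, assuming the unrtf command is on PATH. The function name rtf_to_html and the file name letter.rtf are hypothetical. unrtf writes its result to standard output, so the function redirects it to a file.

```shell
# Convert an RTF file to HTML next to the original.
# --html selects HTML output; --text and --latex select the other formats.
rtf_to_html() {
    input=$1
    unrtf --html "$input" > "${input%.rtf}.html"
}

# Usage (letter.rtf is a placeholder):
# rtf_to_html letter.rtf
```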
