Convert a Word Document into MD

Converting a Word Document to Markdown in Two Moves

The Problem

A lot of important government documents are created and saved in Microsoft Word (*.docx). But Word's format is proprietary, and it isn't well suited to presenting documents on the web. So I wanted to find a way to convert a .docx file into Markdown.

The Solution

As it turns out, several open-source tools convert between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says, "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But although pandoc can convert from Markdown into .docx, at the time of writing it didn't work in the other direction.

Then I found unoconv. This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But unoconv has a downside too: it tries to keep a lot of the formatting that Word has embedded in a document, so the output is, well, messy.

By using unoconv and pandoc in combination, though, you can get a pretty clean output. And the best part is that it retains footnotes and other key syntax (italics, etc.).

Example

Say you have the Council Rules in a Word Document named "test.docx." (For a real-life example, visit http://github.com/vzvenyach/Council_Rules/). Now, you run the following at the command line:

unoconv -f html test.docx
pandoc -f html -t markdown -o test.md test.html

Out comes a beautiful Markdown file. Admittedly, there's a bit of junk at the top where the Table of Contents was. I deleted this before rendering the file with strapdown.js. In the end, here's my nicely rendered version of the Rules.
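If you run the two moves often, they can be wrapped in a small shell function. This is only a sketch: docx_to_md is a made-up helper name, and it assumes unoconv writes its HTML next to the source file, as it does in the example above.

```shell
#!/bin/sh
# Sketch only: docx_to_md is a hypothetical helper, not part of unoconv or pandoc.
docx_to_md() {
    src=$1
    base=${src%.docx}    # strip the .docx extension

    # Move 1: unoconv is assumed to write "$base.html" next to the source file.
    unoconv -f html "$src" || return 1

    # Move 2: pandoc turns that HTML into Markdown.
    pandoc -f html -t markdown -o "$base.md" "$base.html"
}
```

Calling `docx_to_md test.docx` runs the same two commands shown above and leaves test.html and test.md beside the original.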

Pandoc now supports docx to md conversion!

I've forked and updated your gist: https://gist.github.com/aembleton/1eb889bc443996a508df

The link to Pandoc needs changing from [Pandoc](johnmacfarlane.net/pandoc/) to [Pandoc](http://pandoc.org/). Because the current URL has no http:// scheme, it is treated as a relative link and points to a sub-directory of this gist.

Hi if you also want images, have a look here!

I took your excellent idea, and created a bash file which also extracts images. Awesome for a one-time conversion of word documents :)

See https://gist.github.com/jesperronn/ff5764274b3642bc7f2f
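For anyone on a pandoc version that reads .docx directly (as mentioned in an earlier comment), images can also be pulled out with pandoc's --extract-media option. The sketch below is not the linked script. The function name docx_to_md_with_images and the "-media" directory suffix are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: docx_to_md_with_images is a hypothetical helper name.
docx_to_md_with_images() {
    src=$1
    base=${src%.docx}    # strip the .docx extension

    # --extract-media saves embedded images under the given directory
    # and points the image links in the Markdown at the extracted files.
    pandoc -f docx -t markdown --extract-media="$base-media" -o "$base.md" "$src"
}
```

Running `docx_to_md_with_images report.docx` should produce report.md plus a report-media directory holding the images.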

Naphier commented Dec 9, 2015

For pandoc users, here's a simple batch file for converting to GitHub Markdown. When asked for the input file, just drag the file into the command shell window; do the same for the output file and change the extension to md. Enter. Done.

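@echo off
rem Prompt for the input and output paths (drag a file into the window to paste its path).
echo enter input file
set /p input=":"
echo enter output file
set /p output=":"
rem Dragged paths that contain spaces arrive already quoted, so they are passed through unchanged.
rem The pandoc path is quoted in case the local app-data folder contains spaces.
"%localappdata%\Pandoc\pandoc.exe" -w markdown_github -o %output% %input%
echo Created %output%
pause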

Cheers!

I have at least 20,000 pages of Word docs that I need to convert into markdown files in the next few months. I am new to this and am looking for suggestions of how to get the job done. I have some budget, but am trying to use the money I have carefully. Any input on software or other ideas would be appreciated.

niau commented Apr 6, 2016

I assume the best approach is to use the method described above, followed by some manual review and fixing after the conversion is done.
Ideally, you would write your own tooling to automate the conversion and to strip out the parts that don't look good in Markdown.
If you are looking for development help to work out how this could be done and to automate the process, please ping me.
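As a starting point for that kind of tooling, here is a rough sketch of a loop that converts every .docx under a directory. It assumes a pandoc version that reads .docx directly, and convert_tree is a made-up name. It also assumes file names contain no newlines, since it reads find's output line by line.

```shell
#!/bin/sh
# Sketch only: convert_tree is a hypothetical helper name.
convert_tree() {
    root=$1
    find "$root" -type f -name '*.docx' | while IFS= read -r src; do
        out=${src%.docx}.md

        # Skip files that already have a .md so an interrupted run can be resumed.
        [ -e "$out" ] && continue

        # Report failures and keep going instead of stopping the whole batch.
        if ! pandoc -f docx -t markdown -o "$out" "$src" < /dev/null; then
            echo "FAILED: $src" >&2
        fi
    done
}
```

Something like `convert_tree ~/docs 2> failed.log` would leave a list of the files that still need manual attention.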

It seems unoconv cannot handle file names that contain Chinese characters. Renaming the file to an ASCII name works around it:

➜ 下载 ls -l POS
-rw-rw-r-- 1 tsy tsy 15337 8月 6 03:56 POS机故障处理工单.docx
-rw-rw-r-- 1 tsy tsy 10828477 8月 4 11:34 POS机器安装教程.docx
➜ 下载 unoconv -f html POS机器安装教程.docx
Error: Unable to connect or start own listener. Aborting.
➜ 下载 mv POS机器安装教程.docx pos.docx
➜ 下载 unoconv -f html pos.docx
➜ 下载 pandoc -f html -t markdown -o pos.md pos.html
➜ 下载 ls -ltr
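The rename workaround above could be scripted roughly as follows, so the original file keeps its name. This is a sketch under the assumption that the non-ASCII name really is what trips unoconv. unoconv_ascii is a made-up helper name, and any image files unoconv might write beside the HTML are not carried over.

```shell
#!/bin/sh
# Sketch only: unoconv_ascii is a hypothetical helper name.
unoconv_ascii() {
    src=$1
    dir=$(dirname "$src")
    base=$(basename "$src" .docx)
    tmpdir=$(mktemp -d) || return 1

    # Work on an ASCII-named copy so the original file is left untouched.
    cp "$src" "$tmpdir/input.docx" || { rm -rf "$tmpdir"; return 1; }
    unoconv -f html "$tmpdir/input.docx" || { rm -rf "$tmpdir"; return 1; }

    # Give the HTML the original base name again, next to the source file.
    mv "$tmpdir/input.html" "$dir/$base.html"
    status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

After `unoconv_ascii POS机器安装教程.docx`, the pandoc step can be run on POS机器安装教程.html as usual.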

I think this should be mentioned -
pandoc -s example30.docx -t markdown -o example35.md

@jack17529 Thank you, your command works for my case: it picks up the title and subtitle of the doc, while the command suggested in the post doesn't.

Owner

vzvenyach commented Jul 10, 2017

@jack17529 I didn't realize that (a) this gist had had any traffic since I wrote it, and (b) pandoc now allows for this conversion! I'm going to test it today and update the gist.
