Convert a Word Document into MD

Converting a Word Document to Markdown in Two Moves

The Problem

A lot of important government documents are created and saved in Microsoft Word (.docx). But Word is proprietary software, and its file format isn't well suited to presenting documents on the web. So I wanted to find a way to convert a .docx file into Markdown.

The Solution

As it turns out, there are several open-source tools that convert between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But although pandoc could convert from Markdown into .docx when I wrote this, it couldn't go in the other direction.

Then I found unoconv. This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But unoconv has a downside too: it tries to keep a lot of the formatting that Word has embedded in a document, so the output is, well, messy.

But by using unoconv and pandoc in combination, you can get pretty clean output. And the best part is that it retains footnotes and other key syntax (italics, etc.).

Example

Say you have the Council Rules in a Word document named "test.docx". (For a real-life example, visit http://github.com/vzvenyach/Council_Rules/). Now, you run the following at the command line:

unoconv -f html test.docx
pandoc -f html -t markdown -o test.md test.html

Out comes a beautiful Markdown file, test.md. Admittedly, there's a bit of junk at the top where the Table of Contents was. I deleted this when I rendered it nicely with strapdown.js. In the end, here's my nicely rendered version of the Rules.
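If you have more than one file, the two commands above could be wrapped in a small shell script. This is only a sketch: the names `html_name`, `md_name` and `docx_to_md` are made up for illustration, and it assumes unoconv writes its HTML output next to the input file, which is what the example above relies on.

```shell
#!/bin/sh
# Sketch: run the unoconv + pandoc steps for each .docx given as an argument.
# html_name, md_name and docx_to_md are hypothetical helper names.

# Intermediate HTML name: test.docx -> test.html
html_name() {
    printf '%s\n' "${1%.docx}.html"
}

# Final Markdown name: test.docx -> test.md
md_name() {
    printf '%s\n' "${1%.docx}.md"
}

docx_to_md() {
    src=$1
    html=$(html_name "$src")
    md=$(md_name "$src")
    # Step 1: unoconv converts the Word document to HTML
    # (assumed to land beside the input file).
    unoconv -f html "$src" || return 1
    # Step 2: pandoc converts that HTML to Markdown.
    pandoc -f html -t markdown -o "$md" "$html"
}

# Convert every file named on the command line; report failures and carry on.
for f in "$@"; do
    docx_to_md "$f" || echo "failed: $f" >&2
done
```

Saved under a name of your choosing, say docx2md.sh, it would be run as `sh docx2md.sh test.docx other.docx`.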

aembleton Aug 11, 2015

Pandoc now supports docx to md conversion!

I've forked and updated your gist: https://gist.github.com/aembleton/1eb889bc443996a508df

dazfuller Nov 25, 2015

The link to Pandoc needs changing from [Pandoc](johnmacfarlane.net/pandoc/) to [Pandoc](http://pandoc.org/). Currently the link tries to direct to a sub-directory of this.

jesperronn Nov 30, 2015

Hi, if you also want images, have a look here!

I took your excellent idea and created a bash file which also extracts images. Awesome for a one-time conversion of Word documents :)

See https://gist.github.com/jesperronn/ff5764274b3642bc7f2f

Naphier Dec 9, 2015

For pandoc users, here's a simple batch file for converting to GitHub Markdown. When asked for the input file, just drag the file into the command shell; do the same for the output file and change the extension to md. Press Enter. Done.

@echo off
echo enter input file
set /p input=":"
echo enter output file
set /p output=":"
%localappdata%\Pandoc\pandoc.exe -w markdown_github -o %output% %input%
echo Created %output%
pause

Cheers!

pezbemine Mar 30, 2016

I have at least 20,000 pages of Word docs that I need to convert into markdown files in the next few months. I am new to this and am looking for suggestions of how to get the job done. I have some budget, but am trying to use the money I have carefully. Any input on software or other ideas would be appreciated.

niau Apr 6, 2016

I think the best approach is the one described above, followed by some manual review and fixes after the conversion.
Ideally, you would write your own tooling to automate the conversion and strip out anything that doesn't render well in Markdown.
If you are looking for a development service to build and automate this process, please ping me.
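
As a starting point for that kind of tooling, here is a minimal sketch of a batch run over a whole directory tree. It assumes a pandoc version that reads .docx directly, as mentioned earlier in this thread; `md_path` and `convert_tree` are hypothetical names, and the output would still need the manual review described above.

```shell
#!/bin/sh
# Sketch: convert every .docx under a directory to Markdown with pandoc.
# md_path and convert_tree are hypothetical helper names.

# Output path: docs/a.docx -> docs/a.md
md_path() {
    printf '%s\n' "${1%.docx}.md"
}

convert_tree() {
    root=$1
    # Note: file names containing newlines are not handled by this loop.
    find "$root" -type f -name '*.docx' | while IFS= read -r src; do
        out=$(md_path "$src")
        # </dev/null keeps the converter from consuming the file list on stdin.
        if pandoc -f docx -t markdown -o "$out" "$src" </dev/null; then
            echo "ok: $src"
        else
            echo "FAILED: $src" >&2
        fi
    done
}

# Run only when a directory is given, e.g.: sh convert.sh ./word-docs
if [ "$#" -gt 0 ]; then
    convert_tree "$1"
fi
```

Failures are printed to stderr, so redirecting stderr to a file gives a list of documents to check by hand.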

tiansiyuan Aug 6, 2016

It seems unoconv cannot handle file names containing Chinese characters.

➜ 下载 ls -l POS
-rw-rw-r-- 1 tsy tsy 15337 8月 6 03:56 POS机故障处理工单.docx
-rw-rw-r-- 1 tsy tsy 10828477 8月 4 11:34 POS机器安装教程.docx
➜ 下载 unoconv -f html POS机器安装教程.docx
Error: Unable to connect or start own listener. Aborting.
➜ 下载 mv POS机器安装教程.docx pos.docx
➜ 下载 unoconv -f html pos.docx
➜ 下载 pandoc -f html -t markdown -o pos.md pos.html
➜ 下载 ls -ltr

jack17529 May 23, 2017

I think this should be mentioned -
pandoc -s example30.docx -t markdown -o example35.md

issactoast Jun 17, 2017

@jack17529 Thank you, your command works for my case: it grabs the title and subtitle of the doc, while the command suggested in the post doesn't.

vzvenyach Jul 10, 2017

@jack17529 I didn't realize that (a) this gist had gotten any traffic since I wrote it, and (b) pandoc now allows for this conversion! I'm going to test it today and update the gist.

mangatram1 Dec 26, 2017

There seems to be an issue with tables when converting docx to md. Does anyone have a solution for this?
