Convert a Word Document into MD

Converting a Word Document to Markdown in One Move

The Problem

A lot of important government documents are created and saved in Microsoft Word (*.docx). But Word is proprietary software, and its .docx format is not well suited to presenting documents on the web. So I wanted to find a way to convert a .docx file into Markdown.

Installing Pandoc

On a Mac, you can install pandoc with Homebrew by running the command brew install pandoc.
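
To confirm that the install worked, a quick check along these lines may help. This is only a minimal sketch: have_cmd is a hypothetical helper name, not something pandoc provides. Printing the version is also a useful first diagnostic, since very old pandoc packages may not include the docx reader at all.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper (not part of pandoc):
# it succeeds when the named program can be found on the PATH.
have_cmd() {
  command -v "$1" > /dev/null 2>&1
}

if have_cmd pandoc; then
  # The first line of the version output names the installed release.
  pandoc --version | head -n 1
else
  echo "pandoc not found; install it first (for example: brew install pandoc)"
fi
```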

The Solution

As it turns out, there are several open-source tools that allow for conversion between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." Pandoc can convert from markdown into .docx, and it also works in the other direction.
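
Because the conversion works in both directions, one small wrapper can cover both cases by picking the pandoc format from the file extension. The sketch below makes that assumption explicit: format_for and convert are hypothetical helper names, and only the .docx and .md extensions are handled.

```shell
#!/bin/sh
# format_for is a hypothetical helper: it maps a file name to the pandoc
# format name used with -f / -t. Only .docx and .md are recognised here.
format_for() {
  case "$1" in
    *.docx) echo docx ;;
    *.md)   echo markdown ;;
    *)      return 1 ;;
  esac
}

# convert INPUT OUTPUT: run pandoc with formats derived from the extensions.
convert() {
  from=$(format_for "$1") || { echo "unsupported input: $1" >&2; return 1; }
  to=$(format_for "$2") || { echo "unsupported output: $2" >&2; return 1; }
  pandoc -f "$from" -t "$to" -o "$2" "$1"
}

# Usage (requires pandoc):
#   convert test.docx test.md    # Word to Markdown
#   convert test.md test.docx    # Markdown to Word
```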

Example

Say you have the Council Rules in a Word document named "test.docx" (for a real-life example, visit http://github.com/vzvenyach/Council_Rules/). Now run the following at the command line:

pandoc -f docx -t markdown -o test.md test.docx

Out comes a beautiful Markdown file. Admittedly, there's a bit of junk at the top from the Table of Contents. I deleted this when I rendered the file with strapdown.js. In the end, here's my nicely rendered version of the Rules.
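
If there are several Word files to convert, the same command can be run in a loop. The following is a rough sketch rather than a finished tool: md_path and convert_all are hypothetical helper names, and the loop assumes every .docx file sits in the current directory.

```shell
#!/bin/sh
# md_path is a hypothetical helper: it swaps a trailing .docx for .md,
# for example test.docx -> test.md.
md_path() {
  printf '%s\n' "${1%.docx}.md"
}

# convert_all runs the command shown above on every .docx file
# in the current directory.
convert_all() {
  for f in ./*.docx; do
    # When nothing matches, the glob stays literal; skip that case.
    [ -e "$f" ] || continue
    pandoc -f docx -t markdown -o "$(md_path "$f")" "$f"
  done
}

# Usage (requires pandoc):
#   convert_all
```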

#!/bin/bash
#
# Generate a Markdown version of a Word document. The output goes in a separate
# folder, since images are extracted and converted as well (a separate folder
# avoids naming clashes).
#
# REQUIREMENTS: pandoc
#
#
# with pandoc
# --extract-media=[media folder]
#
# USAGE:
#
# docx2md.sh filename #(no extension)
#
# This will generate a converted file in a subfolder, for example if you have a file
# `contract.docx`:
#
# generates:
# ```
# contract
# ├── contract.md
# └── media
#     ├── image1.png
#     ├── image2.png
#     └── image3.png
# ```
#
# Author: Jesper Rønn-Jensen 2015-11-30
# https://gist.github.com/jesperronn/ff5764274b3642bc7f2f
# Inspired by https://gist.github.com/aembleton/1eb889bc443996a508df
#
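# Abort early if pandoc is not on the PATH.
if ! command -v pandoc > /dev/null; then
  echo "FATAL missing pandoc. You can install with 'brew install pandoc' or similar"
  exit 9
fi

# A filename (without extension) is required.
if [ -z "$1" ]; then
  echo "Usage:"
  echo ""
  echo "  docx2md.sh [filename-no-extension]"
  exit 13
fi

if [ ! -f "$1.docx" ]; then
  echo "FATAL missing file '$1.docx'"
  exit 11
fi

# Work inside a folder named after the document so extracted images cannot clash.
mkdir -p "$1"
cd "$1" || exit 1

# Write <filename>.md so the result matches the layout described in the header.
pandoc -f docx -t markdown --extract-media="." -o "$1.md" "../$1.docx"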
@AndreyMZ

commented Apr 6, 2016

Looks like docx2md.sh contract generates contract/README.md instead of contract/contract.md.

@Christian-G

commented Jul 5, 2016

@asmita-varma

commented Nov 5, 2016

I am an amateur in this field. Please help me out.

After doing:
sudo apt-get install pandoc

I wrote:
pandoc -f docx -t markdown -o filename.md filename.docx

And I got the message:
pandoc: Unknown reader: docx

Where am I going terribly wrong? Please help.

@kguidonimartins

commented Dec 2, 2016

Nice!
I just replaced README.md with $1.md on line 55.

@deanpeters

commented Jan 10, 2017

Hi @asmita-varma,

I got good results with:

pandoc -s -f markdown filename.md -t docx -o filename.docx

Converting it back was OK, but with more linefeeds and escapes than I'd like:

pandoc -s -f docx filename.docx -t markdown -o filename2.md
diff filename.md filename2.md

YMMV,
Dean P

@Batman089

commented Mar 7, 2018

Thank you @deanpeters
