
@vdavez
Last active August 6, 2023 08:58
Convert a Word Document into MD

Converting a Word Document to Markdown in Two Moves

The Problem

A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

The Solution

As it turns out, there are several open-source tools that allow for conversion between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, when I wrote this, pandoc could convert from markdown into .docx and not in the other direction.

Then I found unoconv. This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.

But, by using unoconv and pandoc in combination, you can get a pretty clean output. And, the best part is that it retains footnotes and other key syntax (italics, etc.).

Example

Say you have the Council Rules in a Word Document named "test.docx." (For a real-life example, visit http://github.com/vzvenyach/Council_Rules/). Now, you run the following at the command line:

unoconv -f html test.docx
pandoc -f html -t markdown -o test.md test.html

Out comes a beautiful markdown file. Admittedly, there's a bit of junk at the top with the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. In the end, here's my nicely rendered version of the Rules.
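
The two moves above can be wrapped in a small shell function. This is only a sketch under a few assumptions: the names `swap_ext` and `docx_to_md` are made up for illustration (they are not part of unoconv or pandoc), and it assumes unoconv writes its HTML output next to the input file, as it does in the example above.

```shell
#!/bin/sh
# Sketch only: swap_ext and docx_to_md are hypothetical helper names,
# not part of unoconv or pandoc.

# Swap a trailing .docx for another extension (test.docx -> test.html).
swap_ext() {
    printf '%s.%s\n' "${1%.docx}" "$2"
}

# Run the two moves: Word -> HTML with unoconv, then HTML -> markdown with pandoc.
# Assumes unoconv leaves the .html file beside the input, as in the example.
docx_to_md() {
    src=$1
    html=$(swap_ext "$src" html)
    md=$(swap_ext "$src" md)
    unoconv -f html "$src" || return 1
    pandoc -f html -t markdown -o "$md" "$html"
}
```

After sourcing this, `docx_to_md test.docx` should leave test.html and test.md beside the original file.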

@dazfuller

The link to Pandoc needs changing from [Pandoc](johnmacfarlane.net/pandoc/) to [Pandoc](http://pandoc.org/). Currently the link tries to direct to a sub-directory of this.

@jesperronn

Hi if you also want images, have a look here!

I took your excellent idea, and created a bash file which also extracts images. Awesome for a one-time conversion of word documents :)

See https://gist.github.com/jesperronn/ff5764274b3642bc7f2f

@Naphier

Naphier commented Dec 9, 2015

For pandoc users, here's a simple batch file for converting to GitHub markdown. Just drag the file into the command shell when asked for the input file, do it again for the output file and change the extension to md. Enter. Done.

@echo off
echo enter input file
set /p input=":"
echo enter output file
set /p output=":"
"%localappdata%\Pandoc\pandoc.exe" -w markdown_github -o %output% %input%
echo Created %output%
pause

Cheers!

@pezbemine

I have at least 20,000 pages of Word docs that I need to convert into markdown files in the next few months. I am new to this and am looking for suggestions of how to get the job done. I have some budget, but am trying to use the money I have carefully. Any input on software or other ideas would be appreciated.

@niau

niau commented Apr 6, 2016

I assume the best approach is the one described above, followed by some manual review and fixes after the conversion is done.
Ideally you can write your own tooling to automate the conversion and strip out anything that might not look good in markup.
If you are looking for a development service to build this and automate the process, please ping me.
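
For a large pile of documents, the conversion from the gist can be run over a whole directory tree, with failures logged for the manual review pass. The following is a rough sketch, not a tested production tool: `md_path` and `convert_tree` are hypothetical names, the `failed.txt` log is an arbitrary choice, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Sketch only: md_path and convert_tree are hypothetical names.

# Map SRC_ROOT/sub/file.docx to OUT_ROOT/sub/file.md.
md_path() {
    src_root=${1%/}
    out_root=${2%/}
    file=$3
    rel=${file#"$src_root"/}
    printf '%s/%s.md\n' "$out_root" "${rel%.docx}"
}

# Convert every .docx under SRC_ROOT, recording failures in OUT_ROOT/failed.txt
# so they can be reviewed by hand afterwards.
convert_tree() {
    src_root=${1%/}
    out_root=${2%/}
    mkdir -p "$out_root" || return 1
    : > "$out_root/failed.txt"
    find "$src_root" -type f -name '*.docx' | while IFS= read -r file; do
        out=$(md_path "$src_root" "$out_root" "$file")
        mkdir -p "$(dirname "$out")"
        html=${file%.docx}.html
        # Same two moves as the gist: unoconv to HTML, then pandoc to markdown.
        # stdin is redirected so the converters cannot swallow the file list.
        if unoconv -f html "$file" </dev/null &&
            pandoc -f html -t markdown -o "$out" "$html" </dev/null; then
            rm -f "$html"
        else
            printf '%s\n' "$file" >> "$out_root/failed.txt"
        fi
    done
}
```

Something like `convert_tree ./word-docs ./markdown` would then mirror the folder layout under ./markdown, and failed.txt gives a starting list for the files that need attention by hand.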

@tiansiyuan

It seems unoconv cannot handle file names with Chinese characters.

➜ 下载 ls -l POS
-rw-rw-r-- 1 tsy tsy 15337 8月 6 03:56 POS机故障处理工单.docx
-rw-rw-r-- 1 tsy tsy 10828477 8月 4 11:34 POS机器安装教程.docx
➜ 下载 unoconv -f html POS机器安装教程.docx
Error: Unable to connect or start own listener. Aborting.
➜ 下载 mv POS机器安装教程.docx pos.docx
➜ 下载 unoconv -f html pos.docx
➜ 下载 pandoc -f html -t markdown -o pos.md pos.html
➜ 下载 ls -ltr
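
If the file name really is the cause, the rename workaround shown above can be scripted so the original file keeps its name. This is a hedged sketch only: `is_ascii` and `convert_via_ascii_copy` are hypothetical names, and it assumes unoconv writes its HTML next to the input file. Images referenced from the temporary HTML are not preserved.

```shell
#!/bin/sh
# Sketch only: is_ascii and convert_via_ascii_copy are hypothetical names.

# Succeed when the argument contains only ASCII bytes.
is_ascii() {
    [ -z "$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\177')" ]
}

# Convert FILE.docx to FILE.md, going through a temporary ASCII-named copy
# when the original path contains non-ASCII characters.
convert_via_ascii_copy() {
    src=$1
    md=${src%.docx}.md
    if is_ascii "$src"; then
        unoconv -f html "$src" &&
            pandoc -f html -t markdown -o "$md" "${src%.docx}.html"
        return
    fi
    tmp=$(mktemp -d) || return 1
    cp "$src" "$tmp/input.docx" &&
        unoconv -f html "$tmp/input.docx" &&
        pandoc -f html -t markdown -o "$md" "$tmp/input.html"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

With this, `convert_via_ascii_copy POS机器安装教程.docx` would write POS机器安装教程.md without renaming the source document.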

@jack17529

I think this should be mentioned -
pandoc -s example30.docx -t markdown -o example35.md
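
The same direct conversion can be wrapped in a function that also pulls embedded images out of the document. A small sketch, with assumptions labeled: `media_dir` and `docx_to_md_direct` are made-up names, the `_media` suffix is an arbitrary choice, and it relies on pandoc's `--extract-media` option and a pandoc version that can read .docx.

```shell
#!/bin/sh
# Sketch only: media_dir and docx_to_md_direct are hypothetical names.

# Pick a per-document folder for extracted images (report.docx -> report_media).
media_dir() {
    printf '%s_media\n' "${1%.docx}"
}

# Convert FILE.docx straight to FILE.md with pandoc, no unoconv step.
# -s keeps document metadata such as the title in the output.
docx_to_md_direct() {
    src=$1
    pandoc -s -f docx -t markdown \
        --extract-media="$(media_dir "$src")" \
        -o "${src%.docx}.md" "$src"
}
```

For example, `docx_to_md_direct example30.docx` should write example30.md and place any embedded images under example30_media.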

@issactoast

@jack17529 Thank you, your command works for my case: it grabs the title and subtitle of the doc, while the suggested command in the post doesn't.

@vdavez
Author

vdavez commented Jul 10, 2017

@jack17529 I didn't realize that (a) this gist had any traffic since I wrote it, and (b) that pandoc now allows for conversion! I'm going to test it today and update the gist.

@mangatram1

There seems to be an issue with tables when converting docx to md. Does anyone have a solution for this?

@JBMatthews

JBMatthews commented Jun 2, 2018

@jack17529 You deserve more recognition man. Thanks a lot! Worked like a charm. Cheers!

@petrk94

petrk94 commented Jun 25, 2018

Is there a solution to convert *.doc files (not docx) to md?

@esarks

esarks commented Aug 3, 2018

We are developing Word documents in chapters and pushing these to the Wiki. We are using Word because of its ability to track changes and add comments, and because users outside of our GitHub space can contribute. Here's a batch script that converts the latest version of the document in a folder and automatically pushes it to the Wiki. It uses pandoc and magick. It depends on a variable being set to the location where the images are stored in GitHub.

This can be viewed in our public Repo: department-of-veterans-affairs/ES-ASG

set aImage="https://github.com/department-of-veterans-affairs/ES-ASG/raw/master/Projects/ES%%20ASG/ES%%20ASG%%20API%%20Playbook%%20Project/Content/01.00%%20ASG_API%%20Playbook_Introduction_Section/media/"

rmdir /s /q media

rem get the last .docx file
for /R %%f in (*.docx) do set aFile=%%~nf

rem convert docx to mediawiki and extract images
pandoc --extract-media ./ -t mediawiki -o "%aFile%.mediawiki" "%aFile%.docx"

rem Fix up image URLs
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '\.emf', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '\.jpeg', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '\.jpg', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '\.gif', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace '\.tmp', '.png' | Out-File '%aFile%.mediawiki'" -encoding UTF8
powershell -Command "(gc '%aFile%.mediawiki' -encoding UTF8) -replace 'File:.//media/', '%aImage%' | Out-File '%aFile%.mediawiki'" -encoding UTF8

rem housekeeping and move Wiki for publishing
del *.bak
copy "%aFile%.mediawiki" "C:\GitHub\ES-ASG.wiki"

rem convert all image files to .png
cd media
for /R %%f in (*.emf) do (
magick %%~nf.emf %%~nf.png
)

for /R %%f in (*.jpeg) do (
magick %%~nf.jpeg %%~nf.png
)

for /R %%f in (*.jpg) do (
magick %%~nf.jpg %%~nf.png
)

for /R %%f in (*.tmp) do (
magick %%~nf.tmp %%~nf.png
)

for /R %%f in (*.gif) do (
magick %%~nf.gif %%~nf.png
)

rem push to GitHub Repo

git add -f --all
git commit -m "Publish"
git push --all

cd "C:\GitHub\ES-ASG.wiki"

git add -f --all
git commit -m "Publish"
git push --all

pause

@silverlightning926

Udacity's Intro to Programming - Back End: Log Analysis Project
This Log Analysis Project was assigned from Udacity's Intro to Programming Nanodegree (IPND), specifically the Back-End Development path.
The goal of this project is to create a reporting tool that prints out reports based on the data in the news database. This reporting tool is a Python program using the psycopg2 module to connect to the database.
The reporting tool answers the following question based on the data in the news database:
a) What are the most popular three articles of all time?
b) Who are the most popular article authors of all time?
c) On which days did more than 1% of requests lead to errors?
Procedures for Preparing the software environment and data:

  1. Refer to the below link Installing the Virtual Machine (Intro to Programming ND) for screen shots and instructions for installing and running the Virtual machine.
  2. Start by installing Oracle VM virtual box. VirtualBox is a free and open-source hosted hypervisor for x86 computers developed by Oracle Corporation. Virtualbox is the software used specifically for running the virtual machine. Vagrant, in the next step, will run VirtualBox. Download Virtual Box from https://www.virtualbox.org/. This project uses Oracle Virtual Machine version 5.1.30_118389.win.exe (October 16th, 2017) for Windows hosts.
  3. Install Vagrant. Vagrant configures the virtual machine installed in Step 1 above. This enables you to share files between your host computer and VM’s file system. Download Vagrant from https://www.vagrantup.com/. This project uses the current latest version of Vagrant 2.1.5
  4. Inside the VM, change the directory to /vagrant. Use the ls command to list the files. Files inside the VM’s /vagrant directory are shared with the vagrant folder on your computer. Other data, e.g. the PostgreSQL database, lives only inside the VM.
  5. Start the virtual machine by running vagrant up. Vagrant downloads and installs the Linux operating system. Give it a few minutes for this to complete. After vagrant up is finished, you can run vagrant ssh to log in to the newly installed Linux VM.
  6. Once inside the VM, change the directory to /vagrant and use ls to list the files.
  7. Next, download the news data from this link. Ensure that the files are downloadable from this link. You will need to unzip this file after downloading it.
  8. Load the data from newsdata.sql by using psql -d news -f newsdata.sql. Running this command will connect to your installed database server and execute the SQL commands in the downloaded file, creating tables and populating them with data. Note that you only need to run this command once. Type psql -d news to explore the database.
  9. Typing python NewLog_Analysis.py executes the NewLog_Analysis.py file. The NewLog_Analysis python file executes the SQL commands to the database file (newsdata.sql) and prints the answers to the above 3 questions.

@7Moses

7Moses commented May 7, 2019

@jack17529 Thank you, works for me, … tables too, likely excellent,
