@vdavez
Last active August 6, 2023 08:58
Convert a Word Document into MD

Converting a Word Document to Markdown in Two Moves

The Problem

A lot of important government documents are created and saved in Microsoft Word (*.docx). But Microsoft Word is a proprietary format, and it's not really useful for presenting documents on the web. So, I wanted to find a way to convert a .docx file into markdown.

The Solution

As it turns out, there are several open-source tools that allow for conversion between file types. Pandoc is one of them, and it's powerful. In fact, pandoc's website says "If you need to convert files from one markup format into another, pandoc is your swiss-army knife." But, although pandoc can convert from markdown into .docx, it didn't work in the other direction when I wrote this.

Then I found unoconv. This little tool takes advantage of OpenOffice's ability to convert a Word document into a bunch of different formats. But, unoconv too has a bit of a downside. Specifically, unoconv tries to keep a lot of the formatting that Word has embedded in a document. The output is, well, messy.

But, by using unoconv and pandoc in combination, you can get a pretty clean output. And, the best part is that it retains footnotes and other key syntax (italics, etc.).

Example

Say you have the Council Rules in a Word Document named "test.docx." (For a real-life example, visit http://github.com/vzvenyach/Council_Rules/). Now, you run the following at the command line:

unoconv -f html test.docx
pandoc -f html -t markdown -o test.md test.html

Out comes a beautiful markdown file. Admittedly, there's a bit of junk at the top from the Table of Contents. I deleted this when I rendered it nicely with strapdown.js. In the end, here's my nicely rendered version of the Rules.
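If you have more than one document to convert, the two moves can be wrapped in a small script. The following is only a sketch in Python, not part of the original workflow: the function names `conversion_commands` and `convert` are made up for illustration, and it assumes that unoconv writes the HTML file next to the input with the same base name (as the commands above imply) and that both tools are on your PATH.

```python
import subprocess
from pathlib import Path


def conversion_commands(docx_path):
    """Build the two commands from the example as argument lists."""
    docx = Path(docx_path)
    # Assumption: unoconv writes <name>.html next to <name>.docx.
    html = docx.with_suffix(".html")
    md = docx.with_suffix(".md")
    return [
        ["unoconv", "-f", "html", str(docx)],
        ["pandoc", "-f", "html", "-t", "markdown", "-o", str(md), str(html)],
    ]


def convert(docx_path):
    """Run unoconv, then pandoc, and return the path of the markdown file."""
    for cmd in conversion_commands(docx_path):
        # check=True raises CalledProcessError if either tool fails.
        subprocess.run(cmd, check=True)
    return Path(docx_path).with_suffix(".md")
```

Calling `convert("test.docx")` would run the same two commands as above and leave the intermediate test.html in place, which is handy when you want to see what unoconv produced.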

@silverlightning926

Udacity's Intro to Programming - Back End: Log Analysis Project
This Log Analysis Project was assigned from Udacity's Intro to Programming Nanodegree (IPND), specifically the Back-End Development path.
The goal of this project is to create a reporting tool that prints out reports based on the data in the news database. This reporting tool is a Python program using the psycopg2 module to connect to the database.
The reporting tool answers the following questions based on the data in the news database:
a) What are the most popular three articles of all time?
b) Who are the most popular article authors of all time?
c) On which days did more than 1% of requests lead to errors?
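To make question (c) concrete, here is a rough sketch of how the error-rate query could look. It is not the project's actual code: it uses Python's built-in sqlite3 module in place of PostgreSQL and psycopg2 so that it runs anywhere, and the table name `log`, its columns `time` and `status`, and the status strings are assumptions made for illustration.

```python
import sqlite3

# Assumed layout: one row per request, with a timestamp and a status string.
# The real news database may be structured differently.
QUERY = """
SELECT date(time) AS day,
       100.0 * SUM(CASE WHEN status <> '200 OK' THEN 1 ELSE 0 END) / COUNT(*) AS pct
FROM log
GROUP BY date(time)
HAVING 100.0 * SUM(CASE WHEN status <> '200 OK' THEN 1 ELSE 0 END) / COUNT(*) > 1.0
ORDER BY day
"""


def error_days(rows):
    """Return (day, error_percent) for days where more than 1% of requests failed."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE log (time TEXT, status TEXT)")
        conn.executemany("INSERT INTO log VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()
```

The idea is to count the failed requests and all requests per day in a single pass, then keep only the days where the ratio exceeds 1%. Multiplying by 100.0 forces floating-point division, so the percentage is not truncated to an integer.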
Procedures for Preparing the software environment and data:

  1. Refer to the link below, Installing the Virtual Machine (Intro to Programming ND), for screenshots and instructions for installing and running the virtual machine.
  2. Start by installing Oracle VM VirtualBox. VirtualBox is a free and open-source hosted hypervisor for x86 computers developed by Oracle Corporation. VirtualBox is the software that actually runs the virtual machine. Vagrant, in the next step, will run VirtualBox. Download VirtualBox from https://www.virtualbox.org/. This project uses VirtualBox version 5.1.30_118389.win.exe (October 16th, 2017) for Windows hosts.
  3. Install Vagrant. Vagrant configures the virtual machine that runs on the VirtualBox installation from Step 2 above, and lets you share files between your host computer and the VM’s file system. Download Vagrant from https://www.vagrantup.com/. This project uses Vagrant 2.1.5, the latest version at the time of writing.
  4. Inside the VM, change the directory to /vagrant and use the ls command to list the files. Files inside the VM’s /vagrant directory are shared with the vagrant folder on your computer. Other data is not shared; for example, the PostgreSQL database lives only inside the VM.
  5. Start the virtual machine by running vagrant up. Vagrant downloads and installs the Linux operating system. Give it a few minutes for this to complete. After vagrant up is finished, you can run vagrant ssh to log in to the newly installed Linux VM.
  6. Once inside the VM, change the directory to /vagrant and use ls to list the files.
  7. Next, download the news data from this link. Ensure that the files are downloadable from this link. You will need to unzip this file after downloading it.
  8. Load the data from newsdata.sql by using psql -d news -f newsdata.sql. Running this command will connect to your installed database server and execute the SQL commands in the downloaded file, creating tables and populating them with data. Note that you only need to run this command once. Type psql -d news to explore the database.
  9. Run python NewLog_Analysis.py to execute the reporting tool. NewLog_Analysis.py runs SQL queries against the news database (loaded from newsdata.sql) and prints the answers to the three questions above.
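For readers who want a feel for what such a reporting script does, the sketch below shows one possible shape. It is not the contents of NewLog_Analysis.py: the helper names `format_report` and `run_report` are invented for illustration, and the code assumes only a standard Python DB-API connection, which is what psycopg2 provides.

```python
def format_report(title, rows):
    """Render (label, value) rows as an indented plain-text report."""
    lines = [title]
    for label, value in rows:
        lines.append(f"  {label}: {value}")
    return "\n".join(lines)


def run_report(conn, title, query):
    """Run one query on a DB-API connection and return the formatted report."""
    cur = conn.cursor()
    try:
        cur.execute(query)
        return format_report(title, cur.fetchall())
    finally:
        cur.close()


# Inside the VM this might be used roughly as follows (requires psycopg2;
# some_query stands for whatever SQL answers the question):
#   import psycopg2
#   conn = psycopg2.connect(dbname="news")
#   print(run_report(conn, "Most popular articles", some_query))
#   conn.close()
```

Keeping the formatting separate from the database call means each of the three questions only needs its own title and query.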

@7Moses

7Moses commented May 7, 2019

@jack17529 Thank you, works for me, … tables too, likely excellent,
