@jacaetevha
Created January 4, 2012 22:40
a script to output txt from the word/document.xml portion of a .docx file
# Git config; pair with a .gitattributes entry such as: *.docx diff=word
[diff "word"]
	binary = true
	textconv = docx-to-txt.rb -t
#! /usr/bin/env ruby
# Simplistic DOCX to plain text converter, loosely based on the
# Simplistic OpenDocument Text (.odt) to plain text converter.
# Author: Jason Rogers <https://github.com/jacaetevha>
#
# Assumes that you have the unzip and tidy commands available for your system
require 'optparse'

options = {}
optparse = OptionParser.new do |opts|
  opts.banner = "Usage: #{File.basename __FILE__} [options] file"
  options[:text_only] = false
  opts.on('-t', '--text-only', 'Output less information') do
    options[:text_only] = true
  end
  opts.on('-h', '--help', 'Display this screen') do
    puts opts
    exit
  end
end
optparse.parse!
# Report errors on stderr so they never mix with the converted text on stdout.
if ARGV[0].nil?
  warn "No filename given!"
  warn "Usage: #{File.basename __FILE__} [options] file"
  exit 1
end
unless File.exist?(ARGV[0])
  warn "File does not exist!"
  warn "Usage: #{File.basename __FILE__} [options] file"
  exit 1
end
require 'shellwords'

# Shell-escape the filename so paths containing quotes or spaces are handled safely.
command = "unzip -qq -p #{Shellwords.escape(ARGV[0])} word/document.xml"
command += " | tidy -utf8 -xml -w 255 -i -c -q -asxml" unless options[:text_only]
content = `#{command}`
if options[:text_only]
  content.gsub!(/<[^>]+>/, '')     # remove all XML tags
  content.gsub!(/\n{2,}/, "\n\n")  # collapse runs of blank lines into one
  content.gsub!(/\A\n+/, '')       # remove leading blank lines
end
puts content
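The text-only branch strips every tag, so paragraph boundaries in word/document.xml are lost along with the markup. A minimal sketch of one way to keep them, assuming paragraphs are closed by `</w:p>` as in WordprocessingML: turn each closing tag into a newline before stripping the rest. The helper name `docx_xml_to_text` and the `ENTITIES` table are hypothetical and not part of the original script; this is an illustration, not the method used by any other gist.

```ruby
# Hypothetical helper (not in the original script): converts the raw
# word/document.xml string to plain text with one line per paragraph.

# The five predefined XML entities, decoded in a single pass.
ENTITIES = {
  '&lt;'   => '<',
  '&gt;'   => '>',
  '&quot;' => '"',
  '&apos;' => "'",
  '&amp;'  => '&'
}.freeze

def docx_xml_to_text(xml)
  text = xml.gsub(%r{</w:p>}, "\n")            # paragraph end -> newline
  text = text.gsub(/<[^>]+>/, '')              # remove all remaining XML tags
  text.gsub(/&(?:lt|gt|quot|apos|amp);/, ENTITIES) # decode predefined entities
end

sample = '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>' \
         '<w:p><w:r><w:t>A &amp; B</w:t></w:r></w:p></w:body>'
puts docx_xml_to_text(sample)
```

To try it on a real file, the `content` string read from `unzip -qq -p ... word/document.xml` could be passed to the helper in place of the three `gsub!` calls. Tabs, explicit line breaks, and self-closing empty paragraphs are deliberately left unhandled here.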
@towolf

towolf commented Feb 2, 2012

lame sauce.

@jacaetevha
Author

lame comment, but thanks for trying

@seba--

seba-- commented Aug 21, 2013

Very helpful, thanks! I made a small improvement to retain the paragraph structure of the docx document: https://gist.github.com/seba--/6294697.
