@tekkub
Created November 2, 2010 21:39
Word docx files suck!
#!/usr/bin/env ruby
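# Converts those nasty XML documents Word likes to create into something
# remotely resembling Markdown.
# Expects the raw XML as input (for a .docx, the word/document.xml inside the
# zip archive), not the zipped .docx itself.
# Fun side effect: you'll get every revision of the document!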
abort "No file name given" unless filename = ARGV.first
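# Tag names (without angle brackets) whose markup is dropped entirely.
# Each entry is interpolated into a regex, hence the escaped '\?xml'.
remove_tags = [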
'\?xml',
"w:rFonts",
"w:cols",
"w:docGrid",
"w:szCs",
"w:sz",
"w:p",
"w:proofErr",
"w:pgMar",
"w:body",
"/w:body",
"w:sectPr",
"/w:sectPr",
"w:pgSz",
"w:document",
"/w:document",
"w:r",
"/w:r",
"w:t",
"/w:t",
"w:rPr",
"/w:rPr",
"w:pPr",
"/w:pPr",
]
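# [pattern, replacement] pairs, applied in order after the tags are stripped.
gsubs = [
[/<\/w:p>/, "\n\n"],         # end of paragraph -> blank line
[/<w:br\/>/, " \n"],         # explicit line break
[/[“”]+/, '"'],              # curly quotes -> straight quotes
[/<w:i\/>(\w+) /, '*\1* '],  # italic flag followed by one word -> *word*
[/\n\n+/, "\n\n"],           # collapse runs of blank lines
[/\n\n+\Z/m, "\n"],          # leave a single trailing newline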
]
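# Read as UTF-8 (an assumption; Word's XML normally declares it) so the
# curly-quote pattern matches regardless of the locale's default encoding.
out = File.read(filename, encoding: "UTF-8")
# Strip each listed tag, with optional attributes or a self-closing slash.
remove_tags.each { |tag| out = out.gsub(/<#{tag}( .*?)?\/?>/, '') }
# Apply the Markdown-ish substitutions in order.
gsubs.each { |pattern, replacement| out = out.gsub(pattern, replacement) }
puts out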
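
To see what the two passes do without a real Word file, here is a minimal sketch that runs them over a tiny hand-written fragment. The method name `strip_word_xml`, the constants, the trimmed-down tag list, and the sample XML are illustrative assumptions, not part of the script above. The substitution rules are the same, except that the curly quotes are written as `\u` escapes.

```ruby
# Illustrative subset of the tag list (hypothetical constant name).
SAMPLE_REMOVE_TAGS = ["w:p", "w:r", "/w:r", "w:t", "/w:t", "w:rPr", "/w:rPr"]

# Same substitution rules as the script, with curly quotes written as escapes.
SAMPLE_GSUBS = [
  [/<\/w:p>/, "\n\n"],
  [/<w:br\/>/, " \n"],
  [/[\u201C\u201D]+/, '"'],
  [/<w:i\/>(\w+) /, '*\1* '],
  [/\n\n+/, "\n\n"],
  [/\n\n+\Z/m, "\n"],
]

# Hypothetical helper: strip the listed tags, then apply the substitutions.
def strip_word_xml(xml)
  out = xml
  SAMPLE_REMOVE_TAGS.each { |tag| out = out.gsub(/<#{tag}( .*?)?\/?>/, '') }
  SAMPLE_GSUBS.each { |pattern, replacement| out = out.gsub(pattern, replacement) }
  out
end

# Two hand-written paragraphs: one with an italic run, one with curly quotes.
sample = "<w:p><w:r><w:rPr><w:i/></w:rPr><w:t>Hello </w:t></w:r>" \
         "<w:r><w:t>world</w:t></w:r></w:p>" \
         "<w:p><w:r><w:t>Second \u201Cquoted\u201D line</w:t></w:r></w:p>"

puts strip_word_xml(sample)
```

Note that the italic rule only wraps the first word after `<w:i/>`, so a longer italic run comes out with just its first word starred.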