@cabo
Created January 10, 2011 20:43
quickly extract the text from a MOOX ("OOXML") .docx file
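#!/opt/local/bin/ruby1.9
# Written against rubyzip 0.9.x. On rubyzip >= 1.0, use require 'zip'
# and Zip::File in place of 'zip/zipfilesystem' and Zip::ZipFile.
require 'zip/zipfilesystem'
require 'nokogiri'

MS_S = "http://schemas.openxmlformats.org/"
MS_W = MS_S + "wordprocessingml/2006/main"
MS_W_OD = MS_S + "officeDocument/2006/relationships/officeDocument"

ARGV.each do |fn|
  Zip::ZipFile.open(fn) do |zf|
    # The package-level relationships file names the main document part.
    docrels = Nokogiri::XML(zf.read("_rels/.rels"))/"Relationship"
    docrels.each do |rel|
      if rel["Type"] == MS_W_OD
        doc = Nokogiri::XML(zf.read(rel["Target"]))
        # Print the content of every w:t (text run) element, one per line.
        puts doc.xpath(".//w:t", "w" => MS_W).map(&:text)
      end
    end
  end
end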