Skip to content

Instantly share code, notes, and snippets.

@shreyaskarnik
Created March 5, 2012 23:51
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save shreyaskarnik/1982168 to your computer and use it in GitHub Desktop.
Save shreyaskarnik/1982168 to your computer and use it in GitHub Desktop.
Python Code to Extract Highlighted Text from DOCX (Word 2007 and Up format)
#!usr/bin/python
# -*- coding: utf-8 -*-
from docx import *
document = opendocx(r'test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'
for word in words:
for rPr in word.findall(tag_rPr):
high=rPr.findall(tag_highlight)
for hi in high:
if hi.attrib[tag_val] == 'yellow':
print word.find(tag_t).text.encode('utf-8').lower()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment