@nmolivo
Created September 23, 2018 23:39
import html
import re
import zipfile

# Open the .docx (a zip archive) from the 'ec2docs' folder and extract its main XML part.
# `file` is the document's filename and must be set elsewhere (e.g. by an enclosing loop).
pathway = '/home/ec2-user/ec2docs/' + file
with zipfile.ZipFile(pathway) as document:
    xml_content = document.read('word/document.xml')
# Decode the bytes; str(xml_content) would yield the "b'...'" repr rather than the text.
# word/document.xml is normally UTF-8 encoded.
xml_str = xml_content.decode('utf-8')
# Build the link list by finding URLs that appear as text in the XML.
link_list = re.findall(r'>http.*?<', xml_str)  # each match starts with '>http' and ends with '<', inclusive
link_list = [x[1:-1] for x in link_list]  # strip the leading '>' and the trailing '<' from each match
# Replace &amp; with &, and unescape other HTML entities.
link_list = [html.unescape(x) for x in link_list]
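
The steps above can be wrapped in one function and tried without a real document. The following is a minimal sketch, not the author's code: `extract_links` is a hypothetical helper name, the sample XML is a made-up stand-in for a real `word/document.xml`, and the regex is a small variation that uses a capture group with `[^<]*` instead of slicing off the delimiters afterwards (unlike `.*?`, `[^<]*` also matches across newlines).

```python
import html
import io
import re
import zipfile


def extract_links(docx_source):
    """Return URLs that appear as visible text in a .docx file.

    docx_source may be a filesystem path or a binary file-like object.
    (Hypothetical helper, not part of the original snippet.)
    """
    with zipfile.ZipFile(docx_source) as document:
        # Assumes the XML part is UTF-8, which is the usual case for .docx.
        xml_str = document.read('word/document.xml').decode('utf-8')
    # Capture the text between '>' and '<' when it starts with 'http'.
    links = re.findall(r'>(http[^<]*)<', xml_str)
    # Turn entities such as &amp; back into plain characters.
    return [html.unescape(link) for link in links]


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
sample_xml = (
    '<w:document><w:body><w:p><w:r>'
    '<w:t>https://example.com/?a=1&amp;b=2</w:t>'
    '</w:r></w:p></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', sample_xml)
buffer.seek(0)

print(extract_links(buffer))
```

Like the original, this only finds URLs that are written out as text in the document body; it does not look anywhere else in the archive.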