@psorianom
Created February 20, 2019 13:50
Extract text from CAPP XMLs
import glob
import xml.etree.ElementTree as ET

# Collect every XML file extracted from the CAPP archive.
all_files = glob.glob('./extracted/*.xml')
n_files = len(all_files)

# UTF-8 output is an assumption here; it avoids locale-dependent encoding errors.
with open("all_capp_new.txt", "w", encoding="utf-8") as filo:
    for i, f in enumerate(all_files):
        print("Treating file {0} => {1}/{2}\n".format(f, i + 1, n_files))
        try:
            root = ET.parse(f).getroot()
            # Gather all text nodes under TEXTE/BLOC_TEXTUEL/CONTENU.
            contenu = root.find("TEXTE").find("BLOC_TEXTUEL").find("CONTENU")
            text = "\n".join(contenu.itertext())
            filo.write(text + "\n")
        except Exception as err:
            # A missing element raises AttributeError; malformed XML raises ParseError.
            print("Could not parse file {}\n because {}".format(f, err))
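
To see what the extraction step produces, here is a minimal sketch run on an in-memory document instead of files on disk. The sample XML, the DOCUMENT root name, and the extract_text helper are hypothetical and only mimic the TEXTE/BLOC_TEXTUEL/CONTENU nesting the script expects; real CAPP files contain many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the TEXTE/BLOC_TEXTUEL/CONTENU nesting mirrors the script.
SAMPLE = (
    "<DOCUMENT><TEXTE><BLOC_TEXTUEL>"
    "<CONTENU>Premier paragraphe.<br/>Second paragraphe avec <b>emphase</b>.</CONTENU>"
    "</BLOC_TEXTUEL></TEXTE></DOCUMENT>"
)


def extract_text(root):
    """Return the text under TEXTE/BLOC_TEXTUEL/CONTENU, or None if the path is absent."""
    # A single path expression returns None instead of raising when an element is missing.
    contenu = root.find("TEXTE/BLOC_TEXTUEL/CONTENU")
    if contenu is None:
        return None
    # itertext() yields each text and tail fragment in document order.
    return "\n".join(contenu.itertext())


result = extract_text(ET.fromstring(SAMPLE))
missing = extract_text(ET.fromstring("<DOCUMENT/>"))
print(repr(result))
print(missing)
```

Joining with "\n" puts a line break at every element boundary, so inline markup such as the b element above splits a sentence across lines. Joining with "" instead would preserve the original flow of the text, at the cost of losing the breaks implied by br elements.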