@mapmeld
Created December 20, 2017 12:37
Process Tekstoj XML for Neural Network
# pip3 install lxml
import os
from lxml import etree

directory = './tekstoj'
originalArticles = os.listdir(directory)
count = 0
total = len(originalArticles)

for article in originalArticles:
    count = count + 1
    # skip anything that is not an XML file
    if not article.endswith('.xml'):
        continue
    print('{}/{} {}'.format(count, total, article))

    # parse with lxml's lenient HTML parser and collect every <p> element
    with open(os.path.join(directory, article), 'r') as xmlsource:
        htmltree = etree.HTML(xmlsource.read())
    content = htmltree.findall('.//p')

    if len(content) > 0:
        # write each paragraph to a .txt file next to the source, blank line between
        txtpath = os.path.join(directory, article[:-len('.xml')] + '.txt')
        with open(txtpath, 'w') as txtsource:
            for para in content:
                # para.text holds only the text before the first child element
                if para.text is not None:
                    txtsource.write(para.text + "\n\n")
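
The script above writes `para.text`, which in lxml is only the text that precedes a paragraph's first child element, so anything inside or after nested tags is dropped. If the Tekstoj paragraphs contain inline markup (an assumption; the gist does not say), a minimal sketch of one alternative is to join `itertext()` instead. The helper name `paragraph_texts` and the sample markup are hypothetical and not part of the original gist.

```python
from lxml import etree


def paragraph_texts(markup):
    """Return the full text of every non-empty <p>, including nested tags.

    Hypothetical helper for illustration; not part of the original script.
    """
    tree = etree.HTML(markup)
    texts = []
    for para in tree.findall('.//p'):
        # itertext() yields the element's own text, then the text and
        # tail of every descendant, in document order
        text = ''.join(para.itertext()).strip()
        if text:
            texts.append(text)
    return texts


# Made-up sample: the first paragraph has inline markup, the second is empty.
# For the first paragraph, para.text alone would be just 'La '.
sample = '<div><p>La <em>rapida</em> vulpo.</p><p></p><p>Dua paragrafo.</p></div>'
for text in paragraph_texts(sample):
    print(text)
```

Swapping this into the loop would mean writing each returned string followed by `"\n\n"`, in place of the `para.text` check.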