Parse Wikipedia articles (after extraction with wikiextractor; if you strip punctuation from all tokens, this might also work with the raw Wikipedia XML export)
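# Read the whole wikiextractor output file into memory.
with open("./wikipedia_articles_text", encoding="utf-8") as f:
    article_text = f.read()

# Each article ends with a closing </doc> tag.
articles = article_text.split("</doc>")
documents = []
for i, article in enumerate(articles):
    lines = article.split("\n")
    if i == 0:
        # First chunk: <doc ...> header, title, blank line, then the text.
        title = lines[1]
        text = "\n".join(lines[3:])
    else:
        # Later chunks begin with the newline left over after </doc>,
        # so every field is shifted down by one line.
        title = None
        text = None
        if len(lines) > 3:
            title = lines[2]
            text = "\n".join(lines[4:]).strip("\n")
    if text is None:
        # Skip the empty remainder after the final </doc>.
        continue
    documents.append({
        "title": title,
        "text": text
    })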
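
The index-based parsing above depends on the exact line layout of each chunk. As a rough alternative sketch (not part of the original snippet), a regular expression can read the title straight from the `<doc ...>` header. The helper name `parse_wikiextractor`, the `DOC_PATTERN` constant, and the sample data are hypothetical, and the `title="..."` attribute format is an assumption based on typical wikiextractor output.

```python
import re

# Matches one <doc ...> ... </doc> block. The title="..." attribute is an
# assumption about the wikiextractor header format.
DOC_PATTERN = re.compile(
    r'<doc\b[^>]*\btitle="(?P<title>[^"]*)"[^>]*>\n(?P<body>.*?)</doc>',
    re.DOTALL,
)


def parse_wikiextractor(raw):
    """Return a list of {"title", "text"} dicts from wikiextractor-style text."""
    documents = []
    for match in DOC_PATTERN.finditer(raw):
        body_lines = match.group("body").strip("\n").split("\n")
        # The body usually repeats the title on its first line; drop it if so.
        if body_lines and body_lines[0].strip() == match.group("title"):
            body_lines = body_lines[1:]
        documents.append({
            "title": match.group("title"),
            "text": "\n".join(body_lines).strip("\n"),
        })
    return documents


# Hypothetical sample input in the assumed format.
sample = (
    '<doc id="1" url="https://example.org/1" title="Alpha">\n'
    "Alpha\n\nFirst article text.\n</doc>\n"
    '<doc id="2" url="https://example.org/2" title="Beta">\n'
    "Beta\n\nSecond article text.\n</doc>\n"
)
docs = parse_wikiextractor(sample)
```

One caveat: if a title contains characters that are escaped in the header attribute but not in the body, the repeated title line will not match and stays at the top of the text.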