@rafaljanwojcik
Created November 25, 2019 13:49
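from re import sub


def text_to_word_list(text, remove_polish_letters):
    '''Preprocess a text and convert it to a list of words.

    Inspired by a method from eliorc's GitHub repo:
    https://github.com/eliorc/Medium/blob/master/MaLSTM.ipynb
    '''
    # remove_polish_letters is a callable applied to the raw text first
    text = remove_polish_letters(text)
    text = str(text)
    text = text.lower()
    # Clean the text: replace every character outside letters, digits
    # and ^ , ! ? . / ' + with a space
    text = sub(r"[^A-Za-z0-9^,!?.\/'+]", " ", text)
    text = sub(r"\+", " plus ", text)
    text = sub(r",", " ", text)
    text = sub(r"\.", " ", text)
    # Keep "!" and "?" as separate tokens
    text = sub(r"!", " ! ", text)
    text = sub(r"\?", " ? ", text)
    text = sub(r"'", " ", text)
    # Note: ":" was already replaced by a space in the first substitution,
    # so this line has no effect as written
    text = sub(r":", " : ", text)
    # Collapse runs of whitespace, then split into words
    text = sub(r"\s{2,}", " ", text)
    text = text.split()
    return text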
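
The gist does not include the `remove_polish_letters` callable that `text_to_word_list` expects as its second argument. The sketch below is one possible way to write such a helper, assuming its job is to map Polish diacritic letters to their closest ASCII letters; the name `POLISH_TO_ASCII` and the mapping itself are assumptions, not the author's code.

```python
# Hypothetical helper: the gist does not show remove_polish_letters,
# so this character mapping is an assumption about what it does.
POLISH_TO_ASCII = str.maketrans(
    "ąćęłńóśźżĄĆĘŁŃÓŚŹŻ",
    "acelnoszzACELNOSZZ",
)


def remove_polish_letters(text):
    """Replace Polish diacritic letters with their closest ASCII letters."""
    # str() guards against non-string input such as NaN from a DataFrame
    return str(text).translate(POLISH_TO_ASCII)
```

With a helper like this, the function could be called as `text_to_word_list("Zażółć gęślą jaźń!", remove_polish_letters)`. Without such a transliteration step, the first regex substitution would replace every Polish letter with a space and split words apart.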