@ravishchawla
Created June 27, 2018 18:42
import re
import string


def clean_document(doco):
    '''
    Clean a document by removing unnecessary characters and splitting by space.
    Returns a list of lowercase words.
    '''
    # Treat every punctuation character and newline as a word separator.
    punctuation = string.punctuation + '\n'
    trans_table = str.maketrans(punctuation, ' ' * len(punctuation))
    # Replace hyphens first so hyphenated words split into their parts.
    doco_clean = doco.replace('-', ' ')
    # Collapse a non-word character followed by spaces into one space.
    # Substituting '' here would merge neighboring words ("a, b" -> "ab").
    doco_alphas = re.sub(r'\W +', ' ', doco_clean)
    doco_clean = doco_alphas.translate(trans_table)
    # Drop the empty strings left behind by runs of spaces.
    return [word.lower() for word in doco_clean.split(' ') if word]


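# Generate a cleaned reviews array from the original review texts.
# `reviews` is assumed to be a list of raw review strings defined elsewhere.
review_cleans = [clean_document(doc) for doc in reviews]
# Rejoin each word list into a single space-separated sentence.
sentences = [' '.join(r) for r in review_cleans]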
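
The snippet above depends on a `reviews` list that the gist does not show. As a rough, self-contained sketch of what the cleaning step produces, the example below runs a simplified stand-in on made-up input. Both `sample_reviews` and `clean_document_sketch` are hypothetical names introduced here for illustration; the stand-in only maps punctuation and newlines to spaces, splits on whitespace, and lowercases, so it is not a drop-in copy of `clean_document`.

```python
import string

# Hypothetical sample data; the gist's real `reviews` list is not shown.
sample_reviews = [
    "Great product - works well!",
    "Terrible.\nWould NOT buy again",
]


def clean_document_sketch(doco):
    # Simplified stand-in (an assumption, not the gist's function):
    # map punctuation and newlines to spaces, split on whitespace,
    # and lowercase each remaining word.
    punctuation = string.punctuation + '\n'
    trans_table = str.maketrans(punctuation, ' ' * len(punctuation))
    return [word.lower() for word in doco.translate(trans_table).split()]


sample_cleans = [clean_document_sketch(doc) for doc in sample_reviews]
sample_sentences = [' '.join(r) for r in sample_cleans]
print(sample_sentences)
# ['great product works well', 'terrible would not buy again']
```

Each review ends up as a list of lowercase tokens, and the rejoined sentences contain no punctuation, which is the shape the `sentences` list above is meant to have.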