@josht-jpg
Last active September 2, 2020 19:17
Removing stop words and tokenizing
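import re

import nltk
import pandas as pd

# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once.
stop_words = nltk.corpus.stopwords.words('english')

def clean(book, stop_words):
    book = book.lower()
    # tokenizing: keep runs of word characters, which drops punctuation and whitespace
    book_tokens_clean = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(book)
    book_clean = pd.DataFrame(book_tokens_clean, columns=['word'])
    # removing stop words (.copy() avoids pandas' SettingWithCopyWarning below)
    book_clean = book_clean[~book_clean['word'].isin(stop_words)].copy()
    # removing extraneous spaces (a safeguard; \w+ tokens should not contain any)
    book_clean['word'] = book_clean['word'].apply(lambda x: re.sub(' +', ' ', x))
    # dropping single-character tokens
    book_clean = book_clean[book_clean['word'].str.len() > 1]
    return book_clean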
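
To see what the `clean` function above does to a piece of text without installing NLTK or pandas, here is a minimal standard-library sketch of the same steps (lowercase, split on `\w+`, drop stop words, drop single-character tokens). The names `clean_tokens` and `SAMPLE_STOP_WORDS` are hypothetical and are not part of the gist. The tiny stop-word set is an assumption for illustration only, and `re.findall` is used here as a stand-in that should behave like `RegexpTokenizer(r'\w+')` for this pattern.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the gist uses NLTK's full English list instead.
SAMPLE_STOP_WORDS = {"the", "a", "of", "and", "it", "was"}


def clean_tokens(text, stop_words):
    """Lowercase, split on runs of word characters, drop stop words and 1-char tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words and len(t) > 1]


sample = "It was the best of times, it was the worst of times."
result = clean_tokens(sample, SAMPLE_STOP_WORDS)
print(result)  # ['best', 'times', 'worst', 'times']
```

With the gist's version, the equivalent call would be `clean(sample, stop_words)`, which returns a DataFrame with a single `word` column rather than a list.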