@ameyavilankar
Last active January 25, 2023 10:19
Removing Punctuation and Stop Words nltk
import string
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re

def preprocess(sentence):
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters, dropping punctuation
    tokens = tokenizer.tokenize(sentence)
    filtered_words = [w for w in tokens if w not in stopwords.words('english')]
    return " ".join(filtered_words)

sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good. French-Fries"
print(preprocess(sentence))

edenzik commented Mar 25, 2015

use:
filtered_words = filter(lambda token: token not in stopwords.words('english'), tokens)

@rakshithShetty

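Convert stopwords.words('english') to a set once, before using it in the filtered_words comprehension on line 11. It is currently a list, and the call is evaluated again for every token, so each membership test is a fresh linear scan. That is incredibly slow for large documents.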
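
A minimal sketch of the set-based suggestion, not the gist author's code. To keep it runnable without NLTK or its corpus data, it makes two assumptions: `STOP_WORDS` is a tiny hypothetical stand-in for `set(stopwords.words('english'))`, and `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`, which should tokenize the same way for this pattern.

```python
import re

# Hypothetical stand-in for the NLTK list. With NLTK and its stopwords corpus
# installed, build the set once instead:
#     STOP_WORDS = frozenset(stopwords.words('english'))
STOP_WORDS = frozenset({"at", "on", "o", "didn", "t", "very"})


def preprocess(sentence, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation, and remove stop words."""
    # Assumed equivalent of RegexpTokenizer(r'\w+').tokenize(...)
    tokens = re.findall(r"\w+", sentence.lower())
    # Set membership is a constant-time lookup on average, and the set is
    # built once rather than once per token.
    return " ".join(w for w in tokens if w not in stop_words)


sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good. French-Fries"
print(preprocess(sentence))
```

Passing the stop-word set as a parameter keeps the function easy to test and lets callers swap in another language's list.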