@sdoshi579
Last active January 21, 2019 18:17
Natural Language Processing basics: uses the NLTK package to demonstrate tokenization, stop-word removal, stemming, and lemmatization.
import nltk
from nltk.tokenize import RegexpTokenizer

# Download only the corpora this script needs; calling nltk.download()
# with no arguments opens an interactive downloader instead.
nltk.download('stopwords')
nltk.download('wordnet')
text = 'Citizens of India are known as Indians.'
# r'\w+' makes the RegexpTokenizer match runs of word characters only,
# so punctuation such as the trailing period is dropped.
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print(tokens)
# ['Citizens', 'of', 'India', 'are', 'known', 'as', 'Indians']
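The same tokenization can be reproduced with Python's built-in `re` module, which makes it clear what the `r'\w+'` pattern is doing (a quick sanity check, not part of the original gist):

```python
import re

text = 'Citizens of India are known as Indians.'

# re.findall with \w+ returns every maximal run of word characters,
# which is what RegexpTokenizer(r'\w+') produces as tokens.
tokens = re.findall(r'\w+', text)
print(tokens)
# ['Citizens', 'of', 'India', 'are', 'known', 'as', 'Indians']
```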
from nltk.corpus import stopwords

# NLTK's English stop-word list is all lowercase, so this comparison
# is case-sensitive.
sw = stopwords.words('english')
clean_tokens = [token for token in tokens if token not in sw]
print(clean_tokens)
# ['Citizens', 'India', 'known', 'Indians']
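Because NLTK's stop-word list is lowercase, a capitalized stop word such as 'The' would survive the filter above. A minimal sketch of a case-insensitive filter, using a small hand-rolled stop-word set as a stand-in for `stopwords.words('english')`:

```python
# Stand-in for NLTK's English stop-word list (assumption: a tiny subset).
sw = {'of', 'are', 'as', 'the'}

tokens = ['The', 'Citizens', 'of', 'India', 'are', 'known', 'as', 'Indians']

# Lowercase each token before the membership test so 'The' is removed too.
clean_tokens = [token for token in tokens if token.lower() not in sw]
print(clean_tokens)
# ['Citizens', 'India', 'known', 'Indians']
```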
from nltk.stem.porter import PorterStemmer

# The Porter stemmer strips suffixes heuristically; its output is
# lowercased and is not always a dictionary word.
pstemmer = PorterStemmer()
print([pstemmer.stem(token) for token in clean_tokens])
# ['citizen', 'india', 'known', 'indian']
from nltk.stem import WordNetLemmatizer

# Without a POS argument the lemmatizer treats every token as a noun, and
# the WordNet lookup is case-sensitive, so these capitalized tokens pass
# through unchanged.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(token) for token in clean_tokens])
# ['Citizens', 'India', 'known', 'Indians']