@dcollien
Last active August 22, 2018 20:47
Simple Text Classification using NLTK Naive Bayes and TextRank
import nltk
from summa.keywords import keywords


def get_features(text):
    # Get the top 80% of the phrases from the text, scored by relevance.
    # Returns a dict mapping each keyword phrase to its score.
    return dict(keywords(text, ratio=0.8, split=True, scores=True))


def train_texts(classified_texts):
    # Process the training set: an iterable of (classification, text) pairs.
    features = []
    for classification, text in classified_texts:
        features.append((get_features(text), classification))
    return nltk.NaiveBayesClassifier.train(features)


def classify(classifier, text):
    # Classify a document.
    return classifier.classify(get_features(text))


# Example (spam_text and ham_text are strings you supply):
classifier = train_texts([
    ('spam', spam_text),
    ('ham', ham_text),
])

is_spam = classify(classifier, ham_text) == 'spam'
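One possible refinement, not part of the original gist: NLTK's Naive Bayes classifier treats each feature value as a discrete label, so raw float scores from `get_features` will rarely match exactly between a training document and a new one. The sketch below maps each score to a coarse bucket before training or classifying. The helper name `bucket_scores` and the threshold values are assumptions made for illustration and would need tuning against real data.

```python
def bucket_scores(scored_keywords, thresholds=(0.1, 0.3)):
    """Map each keyword's float score to 'low', 'mid', or 'high'.

    `thresholds` is a hypothetical (low_cut, high_cut) pair, not a value
    taken from summa or NLTK.
    """
    low_cut, high_cut = thresholds
    bucketed = {}
    for keyword, score in scored_keywords.items():
        if score < low_cut:
            bucketed[keyword] = 'low'
        elif score < high_cut:
            bucketed[keyword] = 'mid'
        else:
            bucketed[keyword] = 'high'
    return bucketed


# Made-up scores in the same {keyword: score} shape that get_features() returns.
example = {'free offer': 0.42, 'meeting': 0.05, 'invoice': 0.2}
print(bucket_scores(example))
```

To try it, wrap the feature dict wherever it is built, for example `bucket_scores(get_features(text))` in both `train_texts` and `classify`, so that training and classification see the same kind of values.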