@griesmey
Created September 21, 2015 04:35
Quick bigram gist; you can use the DictVectorizer and the TfidfTransformer to generate your features.
from collections import Counter
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from itertools import islice, tee
from nltk.corpus import stopwords


def tokenize(sentence):
    words = re.findall("[a-zA-Z]+", sentence)
    bigram = []
    for gram in generate_ngrams(words, 2):
        bigram.append('{0} {1}'.format(gram[0], gram[1]))
    # take out stop words; building the set once avoids re-reading the
    # corpus on every iteration. Note the check is case-sensitive, so
    # capitalized stop words such as 'I' slip through; lowercase the
    # tokens first if that matters for your application.
    stop = set(stopwords.words("english"))
    words = [w for w in words if w not in stop]
    words.extend(bigram)
    return words


def generate_ngrams(lst, n):
    # Repeatedly tee the iterable: take the next n items as one n-gram,
    # then advance the duplicate by one so successive n-grams overlap
    # by n - 1 items. Stops when fewer than n items remain.
    ilst = lst
    while True:
        a, b = tee(ilst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            ilst = b
        else:
            break


print(tokenize('Hello there good guy. I will kill you'))
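The description mentions feeding the tokens into DictVectorizer and TfidfTransformer; a minimal sketch of that pipeline is below. It counts each document's tokens (unigrams plus bigrams) with Counter, vectorizes the count dicts with DictVectorizer, then reweights with TfidfTransformer. The `simple_tokenize` helper is a stand-in for the gist's `tokenize()` (no stop-word removal) so the example runs without the NLTK stopword corpus; the documents are made up for illustration.

```python
import re
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def simple_tokenize(sentence):
    # unigrams plus space-joined bigrams, mirroring tokenize() above
    # but without the NLTK stop-word filtering
    words = re.findall("[a-zA-Z]+", sentence)
    bigrams = ['{0} {1}'.format(a, b) for a, b in zip(words, words[1:])]
    return words + bigrams

docs = ["the cat sat", "the dog sat", "the cat ran"]

# one token-count dict per document
counts = [Counter(simple_tokenize(d)) for d in docs]

vec = DictVectorizer()
X_counts = vec.fit_transform(counts)            # sparse document-term matrix
X_tfidf = TfidfTransformer().fit_transform(X_counts)

print(X_tfidf.shape)  # → (3, 10): 5 distinct unigrams + 5 distinct bigrams
```

CountVectorizer (imported above) can replace the Counter/DictVectorizer pair in one step via its `tokenizer=` parameter, but the two-stage version makes it easy to mix in non-text features alongside the token counts.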
@griesmey (author) commented: Sentiment analysis
