import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit both vectorizers on the combined train + test text so the
# vocabularies cover both sets.
all_text = pd.concat([train['comment_text'], test['comment_text']])

# Word-level TF-IDF (unigrams, top 30k terms)
text_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1),
    max_features=30000)
text_vectorizer.fit(all_text)
# Use transform (not fit_transform) here: calling fit_transform would
# refit the vectorizer on the train text alone, discarding the fit above.
train_word_features = text_vectorizer.transform(train['comment_text'])
test_word_features = text_vectorizer.transform(test['comment_text'])

# Character-level TF-IDF (1- to 5-grams, top 35k features)
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    ngram_range=(1, 5),
    max_features=35000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train['comment_text'])
test_char_features = char_vectorizer.transform(test['comment_text'])

# Stack char and word features into single sparse CSR matrices
x = hstack([train_char_features, train_word_features]).tocsr()
x_test = hstack([test_char_features, test_word_features]).tocsr()
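The stacked sparse matrices are typically fed to a linear classifier. A minimal end-to-end sketch of the same pattern on toy data — the `toxic` label column and the tiny DataFrames are hypothetical stand-ins, not part of the original snippet:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the real train/test frames
train = pd.DataFrame({
    'comment_text': ['you are great', 'you are awful',
                     'nice work', 'terrible idea'],
    'toxic': [0, 1, 0, 1]})
test = pd.DataFrame({'comment_text': ['great idea', 'awful work']})

# Same recipe as above, scaled down: fit on combined text, then transform
word_vec = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}')
char_vec = TfidfVectorizer(analyzer='char', ngram_range=(1, 3))
all_text = pd.concat([train['comment_text'], test['comment_text']])
word_vec.fit(all_text)
char_vec.fit(all_text)

x = hstack([char_vec.transform(train['comment_text']),
            word_vec.transform(train['comment_text'])]).tocsr()
x_test = hstack([char_vec.transform(test['comment_text']),
                 word_vec.transform(test['comment_text'])]).tocsr()

# Logistic regression handles sparse input directly
clf = LogisticRegression(solver='liblinear')
clf.fit(x, train['toxic'])
probs = clf.predict_proba(x_test)[:, 1]  # probability of the positive class
```

Sparse `hstack` keeps memory manageable: neither feature block is ever densified, and `liblinear` trains directly on the CSR matrix.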