@tyokota
Created May 2, 2019 08:17
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF features (unigrams only, vocabulary capped at 30k terms).
text_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1),
    max_features=30000)
# Fit the vocabulary on the combined train and test text, then transform each split
# with that shared vocabulary (transform, not fit_transform, to keep the combined fit).
text_vectorizer.fit(pd.concat([train['comment_text'], test['comment_text']]))
train_word_features = text_vectorizer.transform(train['comment_text'])
test_word_features = text_vectorizer.transform(test['comment_text'])
# Character-level TF-IDF features (1- to 5-grams, vocabulary capped at 35k terms).
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    ngram_range=(1, 5),
    max_features=35000)
# Same pattern: fit on the combined text, transform each split separately.
char_vectorizer.fit(pd.concat([train['comment_text'], test['comment_text']]))
train_char_features = char_vectorizer.transform(train['comment_text'])
test_char_features = char_vectorizer.transform(test['comment_text'])
# Stack the sparse character and word matrices side by side; convert to CSR for fast row slicing.
x = hstack([train_char_features, train_word_features]).tocsr()
x_test = hstack([test_char_features, test_word_features]).tocsr()
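The stacked matrices can be fed straight into a linear classifier. Below is a minimal sketch, not part of the original gist, assuming train also contains a binary label column (called 'toxic' here purely for illustration).

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical target column; substitute whatever label your training data actually uses.
y = train['toxic']

# Simple L2-regularized logistic regression on the combined TF-IDF features.
clf = LogisticRegression(C=1.0, solver='liblinear')
print(cross_val_score(clf, x, y, cv=3, scoring='roc_auc').mean())

clf.fit(x, y)
test_preds = clf.predict_proba(x_test)[:, 1]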