@danemacaulay
Last active September 25, 2021 17:06
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
import scipy.sparse as sp

posts = pd.read_csv('posts.csv')

# Bag-of-words vectorizer for the text column (raw counts, not binary flags)
vectorizer = CountVectorizer(binary=False)
y = posts["score"].values.astype(np.float32)

# Stack the sparse text counts and the dense numeric features side by side
X = sp.hstack((vectorizer.fit_transform(posts.message),
               posts[['feature_1', 'feature_2']].values),
              format='csr')

# Column names in the same order as the columns of X
# (get_feature_names_out() requires scikit-learn >= 1.0)
X_columns = (vectorizer.get_feature_names_out().tolist() +
             posts[['feature_1', 'feature_2']].columns.tolist())

print(posts)
print(X_columns)
print(X.toarray())
ID  message              feature_1  feature_2  score
1   'This is the text'   4          7          10
2   'This is more text'  3          2          8
@shantanu778

Do I need to apply this function to the test set as well,
or can I just pass X_test to the predict function?
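
A minimal sketch of one way to handle the test set, assuming the model was trained on a matrix built like `X` in the gist above. The test rows need the same column layout, so they go through the already fitted vectorizer with `transform` (not `fit_transform`, which would learn a new vocabulary) and are then stacked with the same numeric columns. The helper `build_matrix`, the constant `FEATURE_COLS`, and the in-memory sample frames are hypothetical names introduced only for illustration.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Numeric columns appended after the text counts (same as in the gist)
FEATURE_COLS = ['feature_1', 'feature_2']


def build_matrix(vectorizer, frame, fit=False):
    """Stack text counts and numeric columns.

    The vocabulary is learned only when fit=True (training data);
    otherwise the existing vocabulary is reused (test data).
    """
    text = frame['message']
    counts = vectorizer.fit_transform(text) if fit else vectorizer.transform(text)
    return sp.hstack((counts, frame[FEATURE_COLS].values), format='csr')


# Hypothetical in-memory data standing in for posts.csv
train = pd.DataFrame({
    'message': ['This is the text', 'This is more text'],
    'feature_1': [4, 3],
    'feature_2': [7, 2],
})
test = pd.DataFrame({
    'message': ['More unseen text here'],
    'feature_1': [5],
    'feature_2': [1],
})

vectorizer = CountVectorizer(binary=False)
X_train = build_matrix(vectorizer, train, fit=True)  # learns the vocabulary
X_test = build_matrix(vectorizer, test)              # reuses it; unseen words are dropped

print(X_train.shape, X_test.shape)
```

Both matrices end up with the same number of columns, so `X_test` can be passed straight to the `predict` method of a model fitted on `X_train`. Words that appear only in the test messages are ignored, since they have no column in the training vocabulary.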