Created August 26, 2019 02:39
Ken-Kuroki/54dff9f526aac072a4de9cee8293e03b
Calculate TF-IDF from a count matrix
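import numpy as np
from sklearn.preprocessing import normalize


def tf_idf(X):
    # X: dense (n_documents, n_terms) count matrix. Corresponds to smooth_idf=True
    # and norm="l2" in sklearn.feature_extraction.text.TfidfVectorizer.
    tf = normalize(X, norm="l1", axis=1)  # per-document term frequencies
    N = len(X)  # number of documents
    df = np.count_nonzero(X, axis=0)  # number of documents containing each term
    idf = np.log((N + 1) / (df + 1)) + 1  # smoothed inverse document frequency
    return normalize(tf * idf, norm="l2")  # L2-normalize each document row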
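
A quick way to sanity-check the function is to compare it against scikit-learn's own `TfidfTransformer` on a small dense matrix. The sketch below is only an illustration: the `counts` matrix is made-up toy data, and `tf_idf` is repeated so the block runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize


def tf_idf(X):
    # Same computation as the function above, repeated so this sketch is self-contained.
    tf = normalize(X, norm="l1", axis=1)
    N = len(X)
    df = np.count_nonzero(X, axis=0)
    idf = np.log((N + 1) / (df + 1)) + 1
    return normalize(tf * idf, norm="l2")


# Hypothetical toy counts: 3 documents (rows) by 4 terms (columns).
counts = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 1, 0],
])

ours = tf_idf(counts)

# TfidfTransformer returns a sparse matrix, so convert it for comparison.
reference = TfidfTransformer(norm="l2", smooth_idf=True).fit_transform(counts).toarray()

matches = np.allclose(ours, reference)
print(matches)
```

The two results should agree even though scikit-learn multiplies raw counts by the IDF rather than L1-normalized frequencies: the L1 step only rescales each row by a positive constant, which the final L2 normalization cancels. Note that the function as written assumes a dense input; `len()` is not defined for SciPy sparse matrices.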