@jrjames83
Created December 4, 2022 15:19

Me: is tf-idf sensitive to word order?

ChatGPT: TF-IDF (term frequency-inverse document frequency) is a numerical statistic that is used to reflect the importance of a word in a document or a collection of documents. It is commonly used in text analysis and information retrieval to determine the relevance of a document to a particular query, and to determine the similarity between documents.

TF-IDF is based on the frequency of words within a document, and is therefore sensitive to variations in word order. This means that the TF-IDF score for a word can be different depending on where the word appears in the document, and whether it is next to other words with high or low TF-IDF scores.

For example, consider the following two sentences:

"The quick brown fox jumps over the lazy dog"
"The lazy dog is jumped over by the quick brown fox"

Both sentences contain the same words, but in different orders. If we calculate the TF-IDF scores for each word in each sentence, we will see that the scores for some words are different in the two sentences. For example, the word "quick" has a higher TF-IDF score in the first sentence than in the second sentence, because it is next to the high-scoring word "brown" in the first sentence, but not in the second sentence.

Overall, while TF-IDF is a useful tool for determining the relevance of a document to a particular query and for determining the similarity between documents, it is sensitive to variations in word order, and this can lead to inaccurate or misleading results.

Me: I wonder why it got this incorrect?

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'the cat in the hat',
    'the hat in the cat',
    'This document is the third document.'
]

tfidf_vectorizer = TfidfVectorizer()
t = tfidf_vectorizer.fit_transform(corpus)

# Rows 0 and 1 contain the same words in a different order; compare their TF-IDF vectors.
arr = t.toarray()
(arr[0] == arr[1]).all()
# True
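
Why the check comes out True: TF-IDF with single-word terms is built from per-document term counts, and counting throws away word positions. Below is a minimal pure-Python sketch of that idea. The helper names `toy_tfidf` and `bigram_counts` are made up for illustration, and the weighting (raw count times log(N / df)) is a simplified textbook form, not scikit-learn's exact smoothed and normalized formula.

```python
import math
from collections import Counter


def toy_tfidf(docs):
    """Toy TF-IDF: raw term count times log(N / df). Illustrative only."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # word order is discarded at this step
        rows.append([counts[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, rows


def bigram_counts(doc):
    """Counts of adjacent word pairs, which do depend on word order."""
    tokens = doc.lower().split()
    return Counter(zip(tokens, tokens[1:]))


corpus = [
    'the cat in the hat',
    'the hat in the cat',
    'This document is the third document.',
]

vocab, rows = toy_tfidf(corpus)
same_unigram_vectors = rows[0] == rows[1]
same_bigram_counts = bigram_counts(corpus[0]) == bigram_counts(corpus[1])
print(same_unigram_vectors, same_bigram_counts)
```

The first two documents get identical single-word vectors, while their word-pair counts differ. So order only starts to matter once the terms themselves span more than one word, for example by passing `ngram_range=(1, 2)` to `TfidfVectorizer`.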