@AaradhyaSaxena
Last active February 27, 2023 11:24
gensim

Gensim

List all the keys

list(model.wv.index_to_key)

List the n most similar words

sims = model.wv.most_similar('computer', topn=10)  # get other similar words
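
For a single query word, most_similar ranks every other key by the cosine similarity of its vector to the query's vector. The sketch below reproduces that ranking in plain Python so the idea is visible without a trained model. The vectors in toy_vectors are hypothetical values made up for illustration, and the helper functions are not part of gensim.

```python
import heapq
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
toy_vectors = {
    'computer': [0.9, 0.1, 0.0],
    'laptop': [0.8, 0.2, 0.1],
    'keyboard': [0.6, 0.4, 0.0],
    'banana': [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, topn=10):
    """Return the topn (word, score) pairs closest to `word`, excluding itself."""
    query = vectors[word]
    scored = (
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    )
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])

sims = most_similar('computer', toy_vectors, topn=2)
print(sims)
```

With these toy vectors, 'laptop' and 'keyboard' rank above 'banana' because their vectors point in nearly the same direction as the vector for 'computer'.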

Corpora

The corpora module provides the tools for creating a dictionary (a mapping between words and their integer ids), building a bag-of-words representation of documents, and serializing and deserializing corpora.

from gensim import corpora

texts = [['apple', 'banana', 'orange'], ['orange', 'juice'], ['banana', 'smoothie']]

# Create a dictionary from the texts
dictionary = corpora.Dictionary(texts)

# Create a bag-of-words representation of the texts
corpus = [dictionary.doc2bow(text) for text in texts]

print(dictionary)
print(corpus)
Dictionary(5 unique tokens: ['apple', 'banana', 'orange', 'juice', 'smoothie'])
[[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1)], [(1, 1), (4, 1)]]

In this example, the dictionary contains five unique tokens (words) from the texts: 'apple', 'banana', 'orange', 'juice', and 'smoothie'. The corpus is a list of bag-of-words representations, one per document, where each tuple pairs a word's integer id in the dictionary with the number of times that word occurs in the document. For instance, the third document, ['banana', 'smoothie'], becomes [(1, 1), (4, 1)].
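
To make the id mapping and the (id, count) pairs concrete, here is a minimal plain-Python sketch of the same idea. It is not gensim's implementation: build_token2id and to_bow are hypothetical helpers, and ids are simply assigned in order of first appearance, which is a simplifying assumption.

```python
from collections import Counter

texts = [['apple', 'banana', 'orange'], ['orange', 'juice'], ['banana', 'smoothie']]

def build_token2id(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the mapping."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

token2id = build_token2id(texts)
bow_corpus = [to_bow(doc, token2id) for doc in texts]

print(token2id)
print(bow_corpus)
```

Tokens that are missing from the mapping are dropped, so to_bow(['apple', 'kiwi', 'apple'], token2id) yields only the pair for 'apple', with a count of 2.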

Similarity

The similarities module provides tools for computing the similarity between documents from their bag-of-words representations, which are typically built with the corpora module.

from gensim import corpora, similarities

texts = [['apple', 'banana', 'orange'], ['orange', 'juice'], ['banana', 'smoothie']]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Create a similarity index
index = similarities.SparseMatrixSimilarity(corpus, num_features=len(dictionary))

# Query for documents similar to a new document
query = ['apple', 'juice']
query_bow = dictionary.doc2bow(query)
sims = index[query_bow]

# Print the document similarity scores
print(sims)
[0.40824828 0.5        0.        ]

We create a SparseMatrixSimilarity index, which stores the bag-of-words vectors as a sparse matrix and scores documents by cosine similarity. We then query the index with a new document, ['apple', 'juice']. The sims variable holds one score per corpus document: about 0.408 (1/√6) for the first document, which shares 'apple' with the query; 0.5 for the second, which shares 'juice'; and 0 for the third, which shares no words. The scores are float32 values, so the last printed digits may differ slightly.
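
The scores can be checked by hand, since cosine similarity on word counts is the dot product of the two count vectors divided by the product of their lengths. The sketch below does this in plain Python on the same documents and query. The cosine helper is hypothetical and works on raw token lists rather than on gensim objects.

```python
import math
from collections import Counter

texts = [['apple', 'banana', 'orange'], ['orange', 'juice'], ['banana', 'smoothie']]
query = ['apple', 'juice']

def cosine(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

scores = [cosine(query, doc) for doc in texts]
print([round(s, 4) for s in scores])
```

The first document shares one word with the query and has three words against the query's two, giving 1 / (√2 · √3). The second shares one word and has two words, giving 1 / (√2 · √2).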
