list(model.wv.index_to_key)
sims = model.wv.most_similar('computer', topn=10) # get other similar words
The corpora module provides the tools for creating a dictionary (a mapping between words and their integer ids), building a bag-of-words representation of documents, and serializing and deserializing corpora.
from gensim import corpora
texts = [['apple', 'banana', 'orange'], ['orange', 'juice'], ['banana', 'smoothie']]
# Create a dictionary from the texts
dictionary = corpora.Dictionary(texts)
# Create a bag-of-words representation of the texts
corpus = [dictionary.doc2bow(text) for text in texts]
print(dictionary)
print(corpus)
Dictionary<5 unique tokens: ['apple', 'banana', 'orange', 'juice', 'smoothie']>
[[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1)], [(1, 1), (4, 1)]]
In this example, the dictionary contains five unique tokens (words) from the texts: 'apple', 'banana', 'orange', 'juice', and 'smoothie'. The corpus is a list of bag-of-words representations, one per document. Each tuple pairs a word's integer id in the dictionary with the frequency of that word in the document, so the last document, ['banana', 'smoothie'], becomes [(1, 1), (4, 1)].
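To make the id/frequency pairing concrete, the mapping can be sketched in plain Python without gensim. This is an illustration only, not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and assigning ids to new tokens in sorted order within each document is an assumption made for this sketch.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer id to each distinct token (hypothetical helper)."""
    token2id = {}
    for text in texts:
        # Assumption: unseen tokens get ids in sorted order within a document.
        for token in sorted(set(text)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, frequency) pairs, ignoring unknown tokens."""
    counts = Counter(token for token in text if token in token2id)
    return sorted((token2id[token], freq) for token, freq in counts.items())

texts = [['apple', 'banana', 'orange'], ['orange', 'juice'], ['banana', 'smoothie']]
token2id = build_token2id(texts)
bow_corpus = [to_bow(text, token2id) for text in texts]

print(token2id)
# {'apple': 0, 'banana': 1, 'orange': 2, 'juice': 3, 'smoothie': 4}
print(bow_corpus)
# [[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1)], [(1, 1), (4, 1)]]
```

A token that appears twice gets a frequency of 2, and a token missing from the mapping is dropped: `to_bow(['apple', 'apple', 'kiwi'], token2id)` returns `[(0, 2)]`.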
The similarities module provides tools for computing the similarity between documents from their bag-of-words representations, which are typically built with the corpora module in gensim.
from gensim import corpora, similarities
texts = [['apple', 'banana', 'orange'], ['orange', 'juice'], ['banana', 'smoothie']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Create a similarity index
index = similarities.SparseMatrixSimilarity(corpus, num_features=len(dictionary))
# Query for documents similar to a new document
query = ['apple', 'juice']
query_bow = dictionary.doc2bow(query)
sims = index[query_bow]
# Print the document similarity scores
print(sims)
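[0.40824828 0.5        0.        ]
We create a SparseMatrixSimilarity index, which computes cosine similarity using a sparse matrix representation of the bag-of-words vectors. We then query the index for documents similar to a new document, represented by the query ['apple', 'juice']. The sims variable contains the similarity scores between the query and each document in the corpus, in corpus order. The query shares 'apple' with the first document (score about 0.41) and 'juice' with the second (score 0.5), and has no word in common with the third (score 0). The scores are float32 values, so the last printed digits may differ slightly.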
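The cosine similarity behind these scores can be checked by hand. The following is a minimal pure-Python sketch, not gensim's implementation: the `cosine` helper is hypothetical, and the bag-of-words vectors are written out literally on the assumption that 'apple', 'banana', 'orange', 'juice', and 'smoothie' have ids 0 to 4.

```python
import math

def cosine(bow_a, bow_b):
    """Cosine similarity between two sparse (id, weight) vectors."""
    a, b = dict(bow_a), dict(bow_b)
    dot = sum(weight * b.get(token_id, 0) for token_id, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Assumed ids: apple=0, banana=1, orange=2, juice=3, smoothie=4
corpus_bow = [[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1)], [(1, 1), (4, 1)]]
query_vec = [(0, 1), (3, 1)]  # ['apple', 'juice']

scores = [cosine(query_vec, doc) for doc in corpus_bow]
print([round(s, 4) for s in scores])
# [0.4082, 0.5, 0.0]
```

The first score is 1 / (sqrt(2) * sqrt(3)), since the query and the first document share one word out of two and three respectively. The second is 1 / (sqrt(2) * sqrt(2)) = 0.5.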