@lmcinnes
Last active November 9, 2023 04:31
Document Embeddings with the Vectorizers Library
@lmcinnes
Author

lmcinnes commented May 1, 2021

To run this notebook you will need a number of libraries installed, and getting them all playing together is not necessarily easy. Here is a recipe that should work assuming you are using conda.

conda create -n docmap python=3.7 scikit-learn seaborn datashader holoviews numba tensorflow-hub
conda activate docmap
conda install pytorch torchvision -c pytorch
conda install transformers tokenizers umap-learn sentence-transformers -c conda-forge
git clone https://github.com/TutteInstitute/vectorizers
cd vectorizers
pip install .
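
Once the environment is built, a quick check that everything is importable can save a failed notebook run. The following is a minimal sketch, not part of the original recipe: the helper name `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumptions about how each installed package is imported (for example, `umap-learn` is imported as `umap` and `scikit-learn` as `sklearn`). It only looks the modules up and does not actually import them.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the packages installed by the recipe.
# These are assumptions: a conda/pip package name and its import name can differ.
REQUIRED_MODULES = [
    "sklearn",                # scikit-learn
    "seaborn",
    "datashader",
    "holoviews",
    "numba",
    "tensorflow_hub",         # tensorflow-hub
    "torch",                  # pytorch
    "torchvision",
    "transformers",
    "tokenizers",
    "umap",                   # umap-learn
    "sentence_transformers",  # sentence-transformers
    "vectorizers",
]


def missing_modules(names):
    """Return the entries of `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it with the `docmap` environment activated; anything it reports as missing points at the install step that needs to be repeated.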

@cakiki

cakiki commented Jun 25, 2021

Absolutely fantastic work! Might be interesting to add the Transformer-encoder-based USE model to the comparison (https://tfhub.dev/google/universal-sentence-encoder-large/5)

@lmcinnes
Author

@cakiki I managed to get onto a big machine and rerun all of this with the Transformer-encoder-based USE as well as a newer and better state-of-the-art Sentence-BERT model (one specifically pre-trained for sentence similarity tasks). You can find the results here: https://gist.github.com/lmcinnes/ebc3966572c060ed1c44bfc71bf48771

The Sentence-BERT model improves dramatically, and USE definitely gets a bit of a boost, but, surprisingly, vectorizers manages to stay comparable.

@cakiki

cakiki commented Jul 1, 2021

Very interesting!
