@nbroad1881
Last active October 6, 2020 01:31
Here is an easy way to use a GPU with the DPR models in HuggingFace when computing embeddings for a Faiss index. I had mistakenly assumed the model would switch devices automatically. After moving the model and its inputs to the GPU explicitly, I was pleased to see the time to embed 100 examples drop from 38 seconds to 3 seconds. If you see ways to improve this, please comment!
# see here https://huggingface.co/docs/datasets/faiss_and_ea.html#adding-a-faiss-index
# I loaded my dataset from a Pandas dataframe
import pandas as pd
df = pd.read_csv("dataset.csv")
from transformers import DPRContextEncoder, DPRContextEncoderTokenizerFast
import torch
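# inference only, so turn off gradient tracking globally
torch.set_grad_enabled(False)
# use the first GPU when one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# move the model to that device; this does not happen automatically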
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").to(device)
ctx_tokenizer = DPRContextEncoderTokenizerFast.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
from datasets import Dataset
ds = Dataset.from_pandas(df)
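# move the tokenized inputs to the GPU with .to(device), run them through the ctx_encoder,
# then bring the embedding back to the CPU as a NumPy array
def embed(example):
    inputs = ctx_tokenizer(example["text"], return_tensors="pt").to(device)
    # the first [0] picks the pooled output, the second [0] picks the single example in the batch
    return {'embeddings': ctx_encoder(**inputs)[0][0].cpu().numpy()}

ds_with_embeddings = ds.map(embed)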
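
One possible improvement: the map call above embeds one example at a time, so the GPU handles a single short sequence per forward pass. Below is a minimal sketch of a batched variant. The helper name `make_batch_embedder`, the `text_column` and `max_length` parameters, and the value 512 are my own assumptions, not part of the original gist. It also assumes the tokenizer accepts a list of strings with `padding=True` and `truncation=True`, and that the encoder's first output is the pooled embedding, as in the script above.

```python
def make_batch_embedder(tokenizer, encoder, device, text_column="text", max_length=512):
    """Build a function for Dataset.map(..., batched=True).

    Hypothetical helper: the name and the default arguments are illustrative.
    """
    def embed_batch(batch):
        # Tokenize all texts in the batch together. Padding brings them to a
        # common length; truncation cuts texts longer than max_length tokens.
        inputs = tokenizer(
            batch[text_column],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
        ).to(device)
        # The first output is the pooled embedding, one row per input text.
        pooled = encoder(**inputs)[0]
        # Copy the whole batch back to the CPU in a single transfer.
        return {"embeddings": pooled.cpu().numpy()}

    return embed_batch
```

With the objects defined earlier, this could be used as `ds.map(make_batch_embedder(ctx_tokenizer, ctx_encoder, device), batched=True, batch_size=16)`. The batch size of 16 is an arbitrary starting point, and I have not measured how much faster this is than the per-example version.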