Last active
October 6, 2020 01:31
Here is an easy way to use a GPU when computing DPR embeddings with HuggingFace Transformers for a Faiss index. I had mistakenly assumed the model and its inputs would be moved to the GPU automatically; both have to be moved explicitly. After doing so, the time to embed 100 examples dropped from 38 seconds to 3 seconds. If you see ways to improve this, please comment!
# See https://huggingface.co/docs/datasets/faiss_and_ea.html#adding-a-faiss-index
import pandas as pd
import torch
from datasets import Dataset
from transformers import DPRContextEncoder, DPRContextEncoderTokenizerFast

# Inference only, so disable gradient tracking globally
torch.set_grad_enabled(False)

# Use the first GPU when one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# I loaded my dataset from a Pandas dataframe
df = pd.read_csv("dataset.csv")
ds = Dataset.from_pandas(df)

# Move the model to the chosen device
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").to(device)
ctx_tokenizer = DPRContextEncoderTokenizerFast.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def embed(example):
    # Move the tokenized inputs to the same device as the model, then bring the
    # pooled embedding back to the CPU as a NumPy array so the dataset can store it
    inputs = ctx_tokenizer(example["text"], return_tensors="pt").to(device)
    return {"embeddings": ctx_encoder(**inputs)[0][0].cpu().numpy()}

ds_with_embeddings = ds.map(embed)
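
One possible improvement, offered as a sketch rather than a measured result: the gist embeds one row per forward pass, and a GPU is usually better used by embedding several rows at once. The helper names `pick_device` and `make_batch_embedder`, the `text_column` parameter, and the batch size in the usage comment are all hypothetical and not part of the original gist. The sketch assumes gradients are already disabled with `torch.set_grad_enabled(False)`, as in the gist, and that `ctx_tokenizer` and `ctx_encoder` are the objects created there.

```python
def pick_device():
    """Return "cuda:0" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: nothing can run on a GPU anyway
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


def make_batch_embedder(tokenizer, encoder, device, text_column="text"):
    """Build a function intended for ds.map(..., batched=True).

    Hypothetical helper: `tokenizer` and `encoder` are expected to behave
    like the gist's ctx_tokenizer and ctx_encoder.
    """
    def embed_batch(batch):
        # Pad to the longest text in the batch so the inputs form one tensor,
        # and truncate texts that exceed the model's maximum length
        inputs = tokenizer(
            batch[text_column],
            return_tensors="pt",
            padding=True,
            truncation=True,
        ).to(device)
        # The first output holds one pooled embedding per row of the batch
        pooled = encoder(**inputs)[0]
        return {"embeddings": pooled.cpu().numpy()}

    return embed_batch


# Usage with the objects from the gist (batch_size=16 is an arbitrary guess):
# device = pick_device()
# embed_batch = make_batch_embedder(ctx_tokenizer, ctx_encoder, device)
# ds_with_embeddings = ds.map(embed_batch, batched=True, batch_size=16)
# ds_with_embeddings.add_faiss_index(column="embeddings")
```

A suitable batch size depends on GPU memory and text length, so it is worth timing a few values on the same 100 examples before settling on one.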