How to read and process a downloaded pre-trained GloVe word vector (turn it into a data.frame) in base R
#' A pre-trained word vector file is essentially a giant lookup table of words, where each word maps to a numeric
#' array that represents the semantic meaning of that word. This is useful because we can discover relationships
#' and analogies between words programmatically. The classic example: "king" minus "man" plus "woman" is most similar to "queen".
# function definition --------------------------------------------------------------------------
# input: character vector of lines from the raw .txt file; output: data.frame with one column per vocab word
proc_pretrained_vec <- function(p_vec) {

    # initialize space for values and the names of each word in vocab
    vals  <- vector(mode = "list", length(p_vec))
    names <- character(length(p_vec))

    # loop through to gather values and names of each word
    for (i in 1:length(p_vec)) {
        if (i %% 1000 == 0) {print(i)}
        this_vec          <- p_vec[i]
        this_vec_unlisted <- unlist(strsplit(this_vec, " "))
        this_vec_values   <- as.numeric(this_vec_unlisted[-1])  # everything after the word itself is its numeric vector
        this_vec_name     <- this_vec_unlisted[1]

        vals[[i]]  <- this_vec_values
        names[[i]] <- this_vec_name
    }

    # convert the list to a data.frame and attach the words as column names
    glove <- data.frame(vals)
    names(glove) <- names
    return(glove)
}
# using the function -------------------------------------------------------------------------
# here we are reading in the unzipped, raw, GloVe pre-trained word vector object (.txt)
# all you have to change is the file path to where your GloVe object has been unzipped
g6b_300 <- scan(file = "LARGE_FILES_pre_trained/glove.6B.300d.txt", what="", sep="\n")
# call the function to convert the raw GloVe vector to data.frame (extra lines are for wall-time reporting)
t_temp <- Sys.time()
glove.300 <- proc_pretrained_vec(g6b_300) # this is the actual function call
(t_elap_temp <- paste0(round(as.numeric(Sys.time() - t_temp, units="mins"), digits = 2), " minutes"))
print(dim(glove.300))
# [1] 300 400000
# NOTES: ------------------------------------------------------------------------------------------
#' I chose to use the 6 billion token, 300-dimension-per-word, 400k vocabulary word vector, so that
#' explains why the dimensions of this dataframe are 300 rows by 400k columns
#'
#' each column is a different word's numeric vector representation. It might be useful to transpose into a
#' matrix with t(glove.300) for some calculations, like sim2 from the text2vec package
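#'
#' a minimal sketch of that transpose + sim2 idea (assuming the text2vec package is installed and glove.300
#' was built with the function above; the BONUS section below wraps the same idea in a reusable function):
glove_mat <- t(as.matrix(glove.300))                                # rows = words, columns = dimensions
king_mat  <- glove_mat["king", , drop = FALSE]                      # 1 x 300 matrix for a single word
king_sims <- text2vec::sim2(x = glove_mat, y = king_mat, method = "cosine", norm = "l2")
head(sort(king_sims[, 1], decreasing = TRUE), 5)                    # 5 words most similar to "king"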
# BONUS MATERIAL: definition for finding similar word vectors ----------------------------------------------
# let's have some fun with this and try out the most common examples
# this section requires the "text2vec" library
# install.packages("text2vec") # uncomment and execute this if you don't have that package
find_sim_wvs <- function(this_wv, all_wvs, top_n_res=40) {
    # this_wv will be a numeric vector; all_wvs will be a data.frame with words as columns and dimensions as rows
    require(text2vec)
    this_wv_mat <- matrix(this_wv, ncol=length(this_wv), nrow=1)
    all_wvs_mat <- as.matrix(all_wvs)

    # sim2 needs words as rows, so transpose if the dimensions don't line up
    if(dim(this_wv_mat)[[2]] != dim(all_wvs_mat)[[2]]) {
        print("switching dimensions on the all_wvs_matrix")
        all_wvs_mat <- t(all_wvs_mat)
    }

    cos_sim <- sim2(x=all_wvs_mat, y=this_wv_mat, method="cosine", norm="l2")
    sorted_cos_sim <- sort(cos_sim[,1], decreasing = TRUE)
    return(head(sorted_cos_sim, top_n_res))
}
# try out the function - we're hoping that "queen" will be in the top 5 results here
this_word_vector <- glove.300[['king']] - glove.300[['man']] + glove.300[['woman']]
find_sim_wvs(this_word_vector, glove.300, top_n_res=5)
# "flock is to geese as bison is to ___________" (hoping for "herd")
# funny... "buffalo" tends to gravitate towards the city while "bison" is the animal
my_wv <- glove.300[['flock']] - glove.300[['geese']] + glove.300[['buffalo']] # all cities because "buffalo, NY"
find_sim_wvs(my_wv, glove.300, top_n_res=10)
my_wv <- glove.300[['flock']] - glove.300[['geese']] + glove.300[['bison']] # here we go, we got our "herds" we're looking for
find_sim_wvs(my_wv, glove.300, top_n_res=10)
PD1994 commented Aug 23, 2021

Hi @tjvananne,

thank you very much for your post. I am still relatively inexperienced in R and am therefore looking for some help. Is it also possible to use the pre-trained model for bigrams and trigrams, and how would the code need to change for this?

Thanks in advance and best regards

@tjvananne (Author) commented

Hey @PD1994,

Haven't had a chance to test this old code in a while, so results may vary. I know one naive strategy people have used when working with sentences is to take the element-wise average of all words within the sentence (so if you're using 300 dimension word vectors, you'd end up with one 300 dimension vector that represents your sentence). I would imagine you could do something similar for bigrams and trigrams. You'd have to also think about at what point (if at all) you want to remove stop words. That decision gets a bit trickier once you start working with bigrams and trigrams.
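For example, a rough sketch of that averaging idea (untested; avg_wv is a hypothetical helper, and it assumes glove.300 was built with the script above and that every token of the n-gram is in the vocabulary):

avg_wv <- function(tokens, wv_df) {
    # wv_df has words as columns and dimensions as rows, like glove.300 above
    rowMeans(as.matrix(wv_df[, tokens, drop = FALSE]))
}
bigram_wv <- avg_wv(c("new", "york"), glove.300)   # one 300-dimension vector representing the bigram
find_sim_wvs(bigram_wv, glove.300, top_n_res = 10)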

I think the next step up in complexity (and presumably, accuracy) would be dipping into sequence-based models like LSTMs and BERT. Those topics go far beyond the simple gist I wrote here, though. Huggingface might be a good resource to check out if you're looking for some state-of-the-art NLP. Hope that helps!

-Taylor

PD1994 commented Aug 24, 2021

Hi @tjvananne,

thank you very much for your reply, I appreciate any help or hints. Yes, I had already considered that approach too, but I am not sure it really suits my analysis. Also, I want to use bigrams both as input and as output. I have trained a model myself where this works quite well, but I don't know how to do this with the pre-trained models.

Thanks also for the hint about the more complex options. I have already toyed with the idea of using BERT. However, I have only been using R for a very short time, so I am afraid I won't be able to implement it. Do you have any helpful resources or code for this?
As an alternative, I have also considered ELMo, for which I have already found some trained models...

Best regards and thanks again.

IanniMuliterno commented Feb 4, 2022

Hello, since the data is way too big for my computer, I used g6b_300 <- data.table::fread('glove_s300.txt', data.table = F, encoding = 'UTF-8', header = F) to read it. But when I run glove.300 <- proc_pretrained_vec(g6b_300), R returns this error: Error in strsplit(this_vec, " ") : non-character argument.

There's another topic worth mentioning. I got that data: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download) from https://github.com/stanfordnlp/GloVe . Is that correct?

Edit: searching for answers, I came to suspect that all columns of g6b_300 must be character type. I will check this out, but that doesn't sound right.
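A hedged note on that error: data.table::fread returns a data.frame, so p_vec[i] inside proc_pretrained_vec picks out a column rather than a single character string, which is why strsplit complains about a non-character argument. Reading the raw lines into a character vector should give the function what it expects, e.g. (untested sketch, same file path as above):

g6b_300 <- readLines("glove_s300.txt", encoding = "UTF-8")
glove.300 <- proc_pretrained_vec(g6b_300)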
