Biomedical corpus:
- NCBI BioNLP (PubMed & MIMIC-III clinical notes): https://github.com/ncbi-nlp/BioSentVec
Common Crawl:
- GloVe (42B tokens, 1.9M vocab, uncased, 300d vectors): http://nlp.stanford.edu/data/glove.42B.300d.zip
- GloVe (840B tokens, 2.2M vocab, cased, 300d vectors): http://nlp.stanford.edu/data/glove.840B.300d.zip
Google News corpus:
- 3 million 300-dimensional English word vectors (word2vec format): https://github.com/mmihaltz/word2vec-GoogleNews-vectors
Twitter:
- GloVe (27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors): http://nlp.stanford.edu/data/glove.twitter.27B.zip
- Frederic Godin's Twitter embeddings: https://github.com/FredericGodin/TwitterEmbeddings
UMBC webBase corpus:
Wikipedia:
- Wikipedia2vec (12 languages): https://wikipedia2vec.github.io/wikipedia2vec/pretrained/
- GloVe (Wikipedia 2014 + Gigaword 5; 6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): http://nlp.stanford.edu/data/glove.6B.zip
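The GloVe archives above unpack to plain-text files where each line is a token followed by its space-separated vector components (no header line, unlike the word2vec text format). A minimal sketch of a loader, using a tiny hypothetical sample file in place of a real download such as `glove.6B.50d.txt`:

```python
def load_glove(path, encoding="utf-8"):
    """Parse a GloVe-format text file into a dict: token -> list of floats.

    Assumes one token per line followed by its vector components,
    with no header line (the standard GloVe distribution layout).
    """
    vectors = {}
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            token, *values = line.rstrip("\n").split(" ")
            vectors[token] = [float(v) for v in values]
    return vectors

# Hypothetical two-token, 3-dimensional sample standing in for a real GloVe file.
with open("glove_sample.txt", "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\nof 0.4 0.5 0.6\n")

vecs = load_glove("glove_sample.txt")
print(len(vecs), len(vecs["the"]))  # → 2 3
```

For the binary Google News vectors, a library such as gensim (`KeyedVectors.load_word2vec_format(path, binary=True)`) is the usual route; hand-rolling a binary parser is rarely worth it.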