Biomedical corpus:
- NCBI BioNLP (PubMed & MIMIC-III clinical notes): https://github.com/ncbi-nlp/BioSentVec
Common Crawl:
- GloVe (42B tokens, 1.9M vocab, uncased, 300d vectors): http://nlp.stanford.edu/data/glove.42B.300d.zip
- GloVe (840B tokens, 2.2M vocab, cased, 300d vectors): http://nlp.stanford.edu/data/glove.840B.300d.zip
Google News corpus:
- 3 million 300-dimensional English word vectors (word2vec format): https://github.com/mmihaltz/word2vec-GoogleNews-vectors
Twitter:
- GloVe (27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors): http://nlp.stanford.edu/data/glove.twitter.27B.zip
- Frederic Godin's Twitter embeddings: https://github.com/FredericGodin/TwitterEmbeddings
UMBC webBase corpus:
Wikipedia:
- Wikipedia2vec (12 languages): https://wikipedia2vec.github.io/wikipedia2vec/pretrained/
- GloVe (Wikipedia 2014 + Gigaword 5; 6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download): http://nlp.stanford.edu/data/glove.6B.zip
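The GloVe archives above unpack to plain-text files where each line is a token followed by its space-separated vector components (no header line, unlike the word2vec text format). A minimal sketch of a loader, using a tiny hypothetical sample file in place of a real download such as `glove.6B.50d.txt`:

```python
def load_glove(path, encoding="utf-8"):
    """Parse a GloVe-format text file into a dict: token -> list of floats.

    Assumes one token per line followed by its vector components,
    with no header line (the standard GloVe distribution layout).
    """
    vectors = {}
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            token, *values = line.rstrip("\n").split(" ")
            vectors[token] = [float(v) for v in values]
    return vectors

# Hypothetical two-token, 3-dimensional sample standing in for a real GloVe file.
with open("glove_sample.txt", "w", encoding="utf-8") as f:
    f.write("the 0.1 0.2 0.3\nof 0.4 0.5 0.6\n")

vecs = load_glove("glove_sample.txt")
print(len(vecs), len(vecs["the"]))  # → 2 3
```

For the binary Google News vectors, a library such as gensim (`KeyedVectors.load_word2vec_format(path, binary=True)`) is the usual route; hand-rolling a binary parser is rarely worth it.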