@jamesthomson
Created July 12, 2016 09:45
word2vec example using tweet data
import pandas as pd
import re
import numpy as np
import nltk
import gensim
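# Load the data: a tab-separated file with an identifier column and a 'tweet' column
# (DataFrame.from_csv has been removed from pandas; read_csv is the replacement)
tweets = pd.read_csv('tweets.txt', sep='\t', index_col=False)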
# Data prep: cleaning
# Lower-case every tweet
clean = tweets['tweet'].str.lower()
# Strip leftover HTML entity names (&amp;, &quot;); the word boundaries keep
# words such as "camp" or "quote" intact
clean = clean.str.replace(r'&?\b(?:amp|quot)\b;?', ' ', regex=True)
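# Keep only word characters, whitespace and apostrophes
# (regex=True is required because recent pandas treats the pattern as a literal by default)
clean = clean.str.replace(r"[^\w\s']", '', regex=True)
# Remove digits
clean = clean.str.replace(r'\d', '', regex=True)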
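# Tokenize each tweet (nltk.word_tokenize needs NLTK's Punkt tokenizer data, fetched with nltk.download)
sentences = clean.tolist()
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
# Train word2vec, ignoring words that occur fewer than 10 times
model = gensim.models.Word2Vec(tokenized_sentences, min_count=10)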
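# Query the trained vectors; current gensim releases expose them as model.wv
# Nearest neighbours, with optional negative terms
model.wv.most_similar(positive=['moon'], topn=1)
model.wv.most_similar(positive=['moon'], negative=['poor'], topn=5)
model.wv.most_similar(positive=['moon', 'bench'], topn=5)
# Cosine similarity between two words
model.wv.similarity('john', 'lewis')
model.wv.similarity('bench', 'moon')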
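Since tweets.txt is not included with the gist, the cleaning steps can be tried on a couple of in-memory strings instead. The following is only a minimal sketch using the standard library: the helper name clean_tweet and the two sample tweets are made up for illustration, the word-boundary handling of "amp" and "quot" is an assumption about what the entity-stripping step is meant to do, and a plain whitespace split stands in for nltk.word_tokenize.

```python
import re


def clean_tweet(text):
    """Clean one tweet with steps similar to the script (hypothetical helper)."""
    text = text.lower()
    # Drop leftover HTML entity names such as &amp; and &quot;
    # (assumption: only standalone "amp"/"quot" should go, not "camp" or "quote")
    text = re.sub(r"&?\b(?:amp|quot)\b;?", " ", text)
    # Keep only word characters, whitespace and apostrophes
    text = re.sub(r"[^\w\s']", "", text)
    # Remove digits
    text = re.sub(r"\d", "", text)
    return text


# Made-up sample tweets standing in for the 'tweet' column of tweets.txt
sample_tweets = [
    "Loved the John Lewis advert &amp; the man on the moon!",
    "Sat on a bench for 20 mins watching the moon #xmas",
]

# Whitespace split is a simplification of nltk.word_tokenize
tokenized = [clean_tweet(t).split() for t in sample_tweets]
print(tokenized)
```

A list of token lists in this shape is what gensim.models.Word2Vec takes as its first argument, although two sentences are far too few to train on, and min_count=10 would discard every word here.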
rajacsp commented Sep 23, 2018

If you can update the 'tweet.txt' file location, it would be great.
