@unglikteng
Created March 14, 2019 08:49
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word2Vec Analysis on the Gnadenhutten Massacre\n",
"\n",
"**Author**: [Ung, Lik Teng](https://github.com/unglikteng) <br>\n",
"**Class**: [DH150, Winter 2019](http://asandersgarcia.humspace.ucla.edu/courses/dh150w19/) <br>\n",
"**Instructor**: [Professor Ashley Sanders Garcia](http://asandersgarcia.humspace.ucla.edu/)\n",
"\n",
"Word2Vec is a popular word embedding, which is able to model words in high-dimensional space beyond frequency count. The advantage of Word2Vec is that it can capture the \"contexts\" of a word within a specific body of corpus. I trained a Word2Vec model on 9 newspaper articles on the Gnadenhutten Massacre that happened on March 8, 1782. I am interested in how different sides involved in this massacre were being discussed in public discourse. Specifically, I am interested in words that are most associated with the Moravian Indians and the American militia.\n",
"\n",
"**Table of Contents**\n",
"* [1.Documents Import](#import)\n",
"* [2.Text Preprocessing](#preprocessing)\n",
"* [3.Word2Vec Training](#training)\n",
"* [4.Word2Vec to Tensor](#tensor)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"import cython, os #ENSURE cython package is installed on computer/canopy\n",
"import string, re, collections\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from string import ascii_letters, digits\n",
"from smart_open import smart_open\n",
"\n",
"import gensim\n",
"from gensim.models import phrases \n",
"from gensim import corpora, models, similarities #calc all similarities at once, from http://radimrehurek.com/gensim/tut3.html\n",
"from gensim.models import Word2Vec, KeyedVectors\n",
"\n",
"from sklearn.manifold import TSNE\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot\n",
"# import plotly.offline as py \n",
"# py.init_notebook_mode(connected=True)\n",
"# import plotly.graph_objs as go\n",
"# import plotly.tools as tls\n",
"\n",
"import plotly.plotly as py\n",
"import plotly.tools as plotly_tools\n",
"import plotly.graph_objs as go\n",
"\n",
"plotly_tools.set_credentials_file(username='unglikteng', api_key='ho4TAl3mWMMNK6DnRSCL')\n",
"\n",
"from nltk import word_tokenize\n",
"from nltk.stem.snowball import SnowballStemmer\n",
"from nltk.corpus import stopwords\n",
"\n",
"plt.style.use('ggplot')\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"<a id=\"import\"></a>\n",
"## 1. Documents Import\n",
"\n",
"The directory path and filename are hardcoded here. Import your text documents if you would like to analyze them with Word2Vec.\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"primaries = []\n",
"primaryPath = os.path.join(os.path.realpath(\"\"),\"primary\")\n",
"for root, directories, file in os.walk('primary'):\n",
" for txt in file:\n",
" path = os.path.join(primaryPath, txt)\n",
" file = open(path, \"r\")\n",
" primaries.append(\"\".join(file.read().splitlines()))\n",
" file.close()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 9 Newspaper articles\n",
"len(primaries)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Centenary of GnadenhuttenInformation about the Old Ohio Moravian Settlement and its MassacreSpecial Correspondence of the Cincinnati Gazette Newark, O., April 27, - Gnadenhutten was established by a Christian Indian named Joshua, who brought with him a party of Mohicans, and proceeded to lay out the town on 24th day of September, 1772. It was on the west side of the Tuscarawas river, four miles above Schronbrunn (a Moravian village already established), and was called “Upper Town”. This location, however, was not satisfactory to the Netawatwees, then the reigning chief of the Delaware nation, who caused it to be removed to a point about eight miles below Schonbrunn, on the east side of the river. Here Gnadenhutten (Tents of Grace) was laid out October 9, 1772, by Joshua and his party, who were from the Moravian village of Friedenstadt (City of Peace), located on the Beaver river, in Pennsylvania. This village was subsequently removed and added to the villages of Gnadenhutten and Schonbrunn. Rev. David Zeisberger preached the first sermon in Gnadenhutten October 17, 1772.The British seized the Moravian Gnadenhutten, and with their horses, cattle, etc., drove them prisoners to the “Sandusky plains,” by Captain Matthew Elliott of the British army (a while American renegade, however), who had under his command at the time 300 hostile Indians. They were made captives September 11, 1781, and the party reached Sandusky river on the first day of October following, when they went into camp. The leaders of these Moravians at the time of the removal were Revs. Zeisberger, Senseman, and Jungman, of New Schonbrunn; Revs. Heckwelder and Jung, of Salem, and Rev. William Edwards, of Gnadenhutten. This camp, subsequently known as “Captives’ Town,” was located in the heart of the then hostile Wyandot country, on the Sandusky river, about a mile above the mouth of Broken Sword creek, and ten miles from the present town of Upper Sandusky. 
Here the captives were allowed to build huts and go into winter quarters. Late in October, 1781, leaders only were ordered to Detroit, there to go before the British commandant, Major DePeyster, to answer to the charge against them of aiding the Americans. They soon proved themselves innocent, and were sent back to “Captives’ Town,” on the Sandusky.'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get a sense of how the article look like \n",
"primaries[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"<a id=\"preprocessing\"></a>\n",
"## 2. Text Preprocessing\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Define our Text Preprocessor class\n",
"## Tokenize -> Remove stopwords -> Stemming \n",
"class Preprocessor:\n",
" def tokenize_word(self, sentence, to_token = None):\n",
" # all lower case \n",
" lower = sentence.strip().lower()\n",
" \n",
" # remove punctuation\n",
" punctuation_table = str.maketrans(string.punctuation, len(string.punctuation)*' ' )\n",
" noPunc = lower.translate(punctuation_table)\n",
" \n",
" # remove digit\n",
" nodigit = re.sub(r'\\d+', '', noPunc)\n",
" nodigit = re.sub(r'\\s+', ' ', nodigit).strip()\n",
" if to_token:\n",
" tokenized = word_tokenize(nodigit)\n",
" return tokenized\n",
" return nodigit \n",
" \n",
" def stem_word(self, tokens):\n",
" stemmer = SnowballStemmer(\"english\")\n",
" stemmed = []\n",
" for token in tokens:\n",
" stemmed.append(stemmer.stem(token))\n",
" return stemmed\n",
" \n",
" def remove_stopwords(self, tokens):\n",
" stopword_list = stopwords.words(\"english\")\n",
" filtered = [w for w in tokens if w not in stopword_list]\n",
" return filtered\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Define preprocessor object\n",
"preprocessor = Preprocessor()\n",
"\n",
"primariesToken = [preprocessor.stem_word(\n",
" preprocessor.remove_stopwords(\n",
" preprocessor.tokenize_word(line, to_token=True))) \n",
" for line in primaries]\n",
"\n",
"primariesUnstemmed = [preprocessor.remove_stopwords(\n",
" preprocessor.tokenize_word(line, to_token=True)) \n",
" for line in primaries]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Build stemming dictionary\n",
"# This dictionary will help us trace back to the unstemmed words\n",
"stem_dict = {}\n",
"for i, row in enumerate(primariesUnstemmed):\n",
" for j, token in enumerate(row):\n",
" stem_dict.update({primariesToken[i][j]:token})\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Visualize the corpus - Frequency Analysis\n",
"\n",
"def word_counter(list_of_doc):\n",
" countVec = CountVectorizer()\n",
" df_cv = countVec.fit_transform(list_of_doc)\n",
" word_freq = dict(zip(countVec.get_feature_names(), np.asarray(df_cv.sum(axis=0)).ravel()))\n",
" word_counter = collections.Counter(word_freq)\n",
" word_counter_df = pd.DataFrame(word_counter.most_common(20), columns = ['word','freq'])\n",
"\n",
" a4_dims = (15, 10)\n",
" fig, ax = plt.subplots(figsize = a4_dims)\n",
" sns.barplot(x=\"word\", y=\"freq\", data=word_counter_df, palette= \"PuBuGn_d\",ax=ax)\n",
" return word_counter"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1080x720 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"wc_primaries = word_counter([\" \".join(tokens) for tokens in primariesToken])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"<a id=\"training\"></a>\n",
"## 3. Word2Vec Training \n",
"***"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Text Preprocessing -> Phrase Detection with Gensim -> Word2Vec Training \n",
"## Phrase Detection\n",
"\n",
"bigram_transformer = phrases.Phrases(primariesToken) \n",
"bigram= phrases.Phraser(bigram_transformer)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Word2Vec Hyperparameters**:\n",
"* skip-gram method is used instead of CBOW (Continuous Bag of Words) since skip-gram generally performs better on small dataset\n",
"* Dimension of word vectors: 500\n",
"* min_count: since the corpus is pretty small, set min_count to 2 is reasonable\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"model_primaries = Word2Vec(bigram[primariesToken], workers=4, sg=1,size=500,window=5, min_count = 2, sample=1e-3)\n",
"\n",
"model_primaries.init_sims(replace=True) #Precompute L2-normalized vectors. If replace is set to TRUE, forget the original vectors and only keep the normalized ones. Saves lots of memory, but can't continue to train the model.\n",
"model_primaries.save(\"model_primaries\") #save your model for later use! change the name to something to remember the hyperparameters you trained it with"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Load the model\n",
"model_p = Word2Vec.load(\"model_primaries\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1318"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# There are 1318 words in the vocabulary\n",
"len(model_p.wv.vocab)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"dict_keys(['old', 'ohio', 'moravian', 'settlement', 'correspond', 'cincinnati', 'newark', 'april', 'gnadenhutten', 'establish', 'christian_indian', 'name', 'joshua', 'brought', 'parti', 'proceed', 'lay', 'town', 'th', 'day', 'septemb', 'west', 'side', 'tuscarawa_river', 'four', 'mile', 'villag', 'alreadi', 'call', '“', 'upper', '”', 'locat', 'howev', 'reign', 'chief', 'delawar', 'nation', 'caus', 'remov', 'point', 'eight', 'schonbrunn', 'east', 'river', 'tent', 'grace', 'laid', 'octob', 'peac', 'pennsylvania', 'subsequ', 'rev', 'david_zeisberg', 'preach', 'first', 'british', 'seiz', 'hors', 'cattl', 'etc', 'drove', 'prison', 'sanduski', 'plain', 'captain', 'matthew', 'elliott', 'armi', 'american', 'command', 'time', 'hostil', 'indian', 'made', 'captiv', 'reach', 'follow', 'went', 'camp', 'leader', 'zeisberg', 'senseman', 'new', 'heckweld', 'jung', 'salem', 'william', 'known', '’', 'heart', 'wyandot', 'countri', 'broken', 'creek', 'ten', 'present', 'allow', 'build', 'hut', 'go', 'winter', 'quarter', 'late', 'order', 'detroit', 'major', 'answer', 'charg', 'aid', 'soon', 'prove', 'innoc', 'sent', 'back', 'fear', 'massacr', 'interest', 'event', 'advoc', 'week', 'two', 'happen', 'tuscarawa', 'counti', 'took', 'visit', 'site', 'ancient', 'trust', 'brief', 'account', 'place', 'terribl', 'may', 'reader', 'conclud', 'make', 'subject', 'communic', 'deserv', 'among', 'respect', 'valuabl', 'church', 'great', 'britain', 'origin', 'brethren', 'law', 'christ', 'know', 'unit', 'one', 'peculiar', 'unusu', 'belief', 'say', 'exhibit', 'submit', 'import', 'concern', 'member', 'lot', 'consist', 'number', 'small', 'cylind', 'inch', 'long', 'half', 'construct', 'end', 'pull', 'apart', 'disclos', 'word', 'yes', 'case', 'alik', 'far', 'appear', 'contain', 'use', 'princip', 'matter', 'instanc', 'young', 'man', 'mind', 'would', 'like', 'certain', 'woman', 'wife', 'minist', 'state', 'take', 'littl', 'put', 'thorough', 'consid', 'provid', 'approv', 'match', 'reason', 'although', 'need', 
'marri', 'yet', 'also', 'much', 'decid', 'whether', 'accept', 'missionari', 'field', 'labor', 'character', 'hold', 'exampl', 'other', 'everi', 'bodi', 'christian', 'whatev', 'persuad', 'engag', 'mission', 'care', 'quarrel', 'carri', 'address', 'men', 'given', 'bethlehem', 'alway', 'still', 'center', 'unit_state', 'home', 'offic', 'societi', 'sever', 'year', 'revolutionari', 'war', 'wilder', 'convert', 'tribe', 'met', 'good', 'success', 'larg', 'savag', 'built', 'inhabit', 'three', 'within', 'goshen', 'washington', 'stand', 'beauti', 'situat', 'bank', 'south', 'philadelphia', 'columbus', 'railroad', 'hundr', 'faith', 'neat', 'meet', 'hous', 'pass', 'street', 'struck', 'quiet', 'throughout', 'modern', 'mani', 'gabl', 'ret', 'tradit', 'past', 'peopl', 'rush', 'hurri', 'world', 'around', 'simpl', 'tast', 'deepli', 'religi', 'earth', 'desir', 'daili', 'prayer', 'strife', 'sober', 'wish', 'never', 'learn', 'along', 'life', 'keep', 'way', 'attend', 'servic', 'sunday', 'inform', 'school', 'upon', 'congreg', 'born', 'seventi', 'year_ago', 'kind', 'accompani', 'spot', 'edg', 'eye', 'sacr', 'found', 'modest', 'foot', 'part', 'embrac', 'adjoin', 'graveyard', 'enclos', 'fenc', 'stood', 'occur', 'cruel', 'honor', 'grow', 'forest', 'tree', 'grown', 'sinc', 'seen', 'appl', 'plant', 'garden', 'ground', 'cellar', 'visibl', 'taken', 'char', 'corn', 'wood', 'pick', 'stone', 'piec', 'burn', 'red', 'hard', 'bore', 'mark', 'heat', 'lie', 'heap', 'purpos', 'erect', 'monument', 'last', 'rest', 'perhap', 'incid', 'histori', 'cruelti', 'equal', 'butcheri', 'pale', 'bloodi', 'slaughter', 'murder', 'king', 'excus', 'issu', 'mistak', 'true', 'addit', 'civil', 'nineti', 'act', 'without', 'even', 'grassi', 'could', 'scarc', 'horror', 'listen', 'recit', 'dark', 'deed', 'commit', 'sat', 'said', 'live', 'attach', 'wa', 'troubl', 'whose', 'territori', 'alli', 'look', 'leagu', 'white', 'hand', 'station', 'fort', 'pitt', 'pretend', 'harass', 'fire', 'summer', 'band', 'came', 'threat', 'promis', 
'safeti', 'leav', 'crop', 'drag', 'hear', 'governor', 'noth', 'discharg', 'suffer', 'untold', 'privat', 'cold', 'permit', 'return', 'women_children', 'gather', 'advis', 'near', 'starv', 'els', 'arm', 'hunt', 'depred', 'frontier', 'rob', 'famili', 'mingo', 'cloth', 'stolen', 'pursu', 'compani', 'immedi', 'rais', 'colonel', 'williamson', 'set', 'arriv', 'night', 'march', 'next', 'morn', 'discov', 'advanc', 'cross', 'see', 'accost', 'come', 'protect', 'give', 'began', 'differ', 'spirit', 'bound', 'shut', 'nineti_six', 'consult', 'held', 'done', 'soldier', 'form', 'line', 'col_williamson', 'question', 'favor', 'save', 'step', 'forward', 'death', 'eighteen', 'whole', 'resolv', 'blood', 'men_women', 'children', 'meantim', 'suspect', 'dread', 'result', 'prepar', 'fate', 'pray', 'sing', 'hymn', 'aw', 'execut', 'commenc', 'doom', 'victim', 'brain', 'cooper', 'mallet', 'continu', 'cours', 'left', 'work', 'accomplish', 'repeat', 'treacheri', 'tragedi', 'right', 'escap', 'unobserv', 'warn', 'stun', 'blow', 'scalp', 'recov', 'conscious', 'departur', 'tell', 'stori', 'anoth', 'boy', 'hide', 'confin', 'flame', 'third', 'beg', 'conceal', 'horrid', 'enact', 'god', 'disast', 'die', 'kill', 'lightn', 'fiendish', 'torn', 'least', 'us', 'crawford', 'autumn', 'reveng', 'inhuman', 'chin', 'seem', 'silent', 'think', 'join', 'martyr', 'love', 'well', 'bone', 'sad', 'gray', 'hair', 'forc', 'august', 'breez', 'chant', 'depart', 'dreami', 'hill', 'safe', 'breath', 'break', 'sleep', 'shriek', 'aros', 'pain', 'chicago', 'find', 'month', 'scene', 'horribl', 'record', 'detail', 'previous', 'compos', 'english', 'mask', 'friendship', 'endur', 'hardship', 'persecut', 'blame', 'thank', 'prais', 'divis', 'fifti', 'forsaken', 'greater', 'portion', 'fall', 'face', 'actor', 'notori', 'david', 'destroy', 'suppos', 'trade', 'wild', 'actual', 'wrong', 'busi', 'usual', 'captur', 'militari', 'pittsburgh', 'vote', 'eighti', 'twenti', 'determin', 'news', 'almost', 'implicit', 'confid', 'devout', 'serv', 
'offer', 'strength', 'encourag', 'infant', 'closer', 'mother', 'breast', 'brave', 'women', 'chosen', 'led', 'enter', 'butcher', 'pretti', 'poor', 'merci', 'dead', 'captor', 'manner', 'shot', 'tomahawk', 'various', 'attempt', 'rise', 'despatch', 'slaughter_hous', 'partial', 'consum', 'remain', 'buri', 'friend', 'person', 'perpetr', 'light', 'evid', 'fail', 'secur', 'kept', 'fortun', 'togeth', 'forti', 'thirti', 'former', 'six', 'acr', 'purchas', 'organ', 'object', 'perpetu', 'memori', 'exact', 'precious', 'nine', 'dollar', 'view', 'solicit', 'public', 'general', 'receiv', 'acknowledg', 'suggest', 'moravian_missionari', 'strong', 'beyond', 'doubt', 'grave', 'marbl', 'slab', 'commemor', 'tribun', 'histor', 'sun', 'testimoni', 'afterward', 'june', 'appropri', 'inscript', 'hope', 'midway', 'increas', 'popul', 'store', 'effort', 'fix', 'capit', 'today', 'celebr', 'white_settler', 'circumst', 'lead', 'render', 'induc', 'covert', 'land', 'clear', 'thrive', 'enjoy', 'comfort', 'outrag', 'treatment', 'endeavor', 'valley', 'refus', 'natur', 'thus', 'avoid', 'border', 'bring', 'start', 'human', 'expedit', 'plunder', 'pillag', 'influenc', 'knew', 'atroc', 'wallac', 'eastern', 'fled', 'toward', 'mrs', 'settler', 'pursuit', 'harbor', 'dress', 'earli', 'col', 'recent', 'barbar', 'finish', 'treat', 'harm', 'sooner', 'surrend', 'accus', 'told', 'propos', 'thirsti', 'share', 'decis', 'therefor', 'ask', 'moment', 'devot', 'tie', 'behind', 'scatter', 'room', 'progress', 'aliv', 'gun', 'knife', 'brutal', 'turn', 'pen', 'regist', 'regret', 'feel', 'fact', 'centenni', 'translat', 'mean', 'seven', 'fell', 'cheer', 'mingl', 'shout', 'signific', 'necessari', 'rare', 'pieti', 'heard', 'heathen', 'sold', 'trader', 'gentl', 'evil', 'effect', 'forcibl', 'becam', 'spread', 'flourish', 'convers', 'conspiraci', 'famous', 'ottawa', 'plot', 'though', 'harsh', 'ignor', 'canaanit', 'exist', '…', 'nativ', 'fit', 'fair', 'high', 'settl', 'fertil', 'soil', 'pleasant', 'prevail', 'pipe', 'coloni', 'broke', 
'teacher', 'seat', 'headquart', 'cruelli', 'accord', 'crime', 'move', 'head', 'surround', 'smile', 'final', 'conduct', 'molest', 'complet', 'deceiv', 'prevent', 'manifest', 'council', 'mode', 'excit', 'idea', 'futur', 'campaign', 'bleed', 'written', 'demonstr', 'must', 'morrow', 'affair', 'believ', 'show', 'realiz', 'fortitud', 'larger', 'watch', 'outsid', 'produc', 'babe', 'parent', 'etern', 'quick', 'cut', 'similar', 'dwell', 'stay', 'short', 'frequent', 'retir', 'canada', 'pious', 'sixti', 'reflect', 'imposs', 'except', 'st', 'proper', 'observ', 'approach', 'especi', 'presid', 'arthur', 'hay', 'invit', 'deliv', 'new_fairfield', 'particip', 'wyom', 'probabl', 'struggl', '–', 'gospel', 'mahon', 'lehighton', 'elev', 'posit', 'novemb', 'attack', 'burnt', 'inscrib', 'lord', 'renew', 'mohagan', 'maintain', 'north', 'lehigh', 'possess', 'separ', 'begin', 'road', 'path', 'mountain', 'warrior', 'plantat', 'martin', 'resid', 'succeed', 'polici', 'chang', 'might', 'alon', 'mohegan', 'languag', 'bishop', 'foundat', 'shawne', 'hatchet', 'french', 'resolut', 'chapel', 'weissport', 'cultur', 'defeat', 'open', 'neighbor', 'caution', 'enemi', 'possibl', 'compli', 'nov', 'georg', 'custard', 'expect', 'joseph', 'sturg', 'got', 'door', 'ran', 'partch', 'window', 'child', 'stair', 'best', 'jump', 'hid', 'stump', 'saw', 'abus', 'perish', 'twelv', 'stabl', 'five', 'parich', 'blanket', 'meant', 'abl', 'deliver', 'report', 'delay', 'contrari', 'bed', 'brother', 'notic', 'militia', 'ventur', 'troop', 'stockad', 'properti', 'strategi', 'quit', 'later', 'januari', 'benjamin', 'susanna', 'lost', 'surpris', 'moravian_convert', 'search', 'canadian', 'borderland', 'john', 'p', 'bow', 'journal', 'entri', 'narrow', 'lt', 'muskingum', 'develop', 'ohio_valley', 'close', 'communiti', 'eighteenth', 'centuri', 'aftermath', 'surviv', 'most', 'munse', 'refug', 'rumor', 'offici', 'earlier', 'complic', 'passag', 'decad', 'often', 'resist', 'presenc', 'lake', 'region', 'disappear', 'deal', 'particular', 
'symbol', 'rang', 'impact', 'fairfield', 'moraviantown', 'upper_canada', 'northern', 'thame_river', 'stabil', 'boundari', 'problem', 'incurs', 'immigr', 'intern', 'migrat', 'simpli', 'govern', 'resid_fairfield', 'difficult', 'demand', 'relat', 'british_offici', 'inde', 'southern', 'illustr', 'avail', 'option', 'nineteenth', 'articl', 'despit', 'pressur', 'examin', 'specif', 'individu', 'second', 'polit', 'elimin', 'tragic', 'backcountri', 'defin', 'appalachian', 'conflict', 'warfar', 'local', 'neutral', 'neither', 'violenc', 'readili', 'label', 'limit', 'mid', 'zone', 'grew', 'immin', 'conclus', 'agreement', 'affect', 'raid', 'constant', 'includ', 'impend', 'sometim', 'michigan_histor', 'review', 'headman', 'power', 'jaw', 'danger', 'battl', 'risk', 'play', 'spring', 'tri', 'reloc', 'want', 'across', 'sens', 'wit', 'movement', 'potenti', 'extent', 'convinc', 'juli', 'clinton', 'northwest', 'refuge', 'slowli', 'longer', 'thing', 'lennachgo', 'destruct', 'negat', 'ripen', 'ojibw', 'away', 'welcom', 'son', 'agre', 'father', 'instead', 'journey', 'josiah', 'harmar', 'headmen', 'alcohol', 'stop', 'huron', 'better', 'drop', 'shift', 'auglaiz', 'becom', 'milit', 'confederaci', 'gen', 'messag', 'wampum', 'singular', 'repres', 'advic', 'easi', 'declin', 'area', 'spiritu', 'assist', 'support', 'thought', 'remark', 'agent', 'mckee', 'treati', 'sign', 'unfortun', 'volatil', 'jay', 'term', 'paper', 'econom', 'thame', 'cultiv', 'suffici', 'travel', 'knowledg', 'lieuten', 'simco', 'request', 'bushel', 'mere', 'cano', 'vital', 'season', 'fur', 'heighten', 'pari', 'contend', 'perform', 'western', 'fort_malden', 'negoti', 'somewhat', 'uneasi', 'america', 'worri', 'threaten', 'tension', 'capt', 'mississippi', 'compar', 'anxieti', 'surfac', 'niagara', 'nevertheless', 'begun', 'period', 'regul', 'strict', 'rule', 'requir', 'sin', 'social', 'standard', 'reveal', 'dramat', 'initi', 'primari', 'practic', 'deterior', 'group', 'less', 'diari', 'read', 'michael', 'diarist', 'complain', 
'difficulti', 'relationship', 'gift', 'expans', 'wrote', 'count', 'enough', 'connect', 'respons', 'chesapeak', 'alleg', 'desert', 'drew', 'elliot', 'prophet', 'indiana', 'tecumseh', 'amherstburg', 'indic', 'revit', 'fellow', 'onim', 'either', 'gave', 'belong', 'stream', 'malden', 'denk', 'deep', 'hospit', 'chao', 'henri', 'schnall', 'food', 'dispers', 'daughter', 'meanwhil', 'cass', 'kinship', 'charl', 'killbuck', 'gelemend', 'perfect', 'jstor', 'www', 'org', 'scienc', 'art', 'heckeweld', 'guidanc', 'secret', 'verifi', 'exhort', 'resign', 'applic', 'liberti', 'acquaint', 'untim', 'impress', 'justic', 'mr_heckeweld', 'charact', 'intercours', 'marietta', 'amidst', 'full', 'publish', 'putnam', 'wabash', 'fever', 'common', 'barg', 'alarm', 'chimney', 'pocket', 'assur', 'releas', 'intellig', 'messeng', 'shebosh', 'wound', 'latter', 'display', 'glorious', 'youth', 'abel'])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model_p.wv.vocab.keys()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# model_p.wv.most_similar(positive = [\"white\", \"american\"],\n",
"# negative = [\"british\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$\\overrightarrow{Dimension_g} = \\overrightarrow{white} + \\overrightarrow{american} - \\overrightarrow{british}$"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"whiteAmerican_british = [('moravian', 0.9998588562011719),\n",
" ('indian', 0.9998587369918823),\n",
" ('fairfield', 0.9998579621315002),\n",
" ('mani', 0.9998571872711182),\n",
" ('live', 0.9998558163642883),\n",
" ('murder', 0.9998539686203003),\n",
" ('day', 0.9998528957366943),\n",
" ('missionari', 0.9998528957366943),\n",
" ('massacr', 0.9998518824577332),\n",
" ('ohio', 0.999851405620575)]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"linkText": "Export to plot.ly",
"plotlyServerURL": "https://plot.ly",
"showLink": false
},
"data": [
{
"mode": "markers+text",
"name": "Similar to White American",
"text": [
"moravian",
"indians",
"fairfield",
"many",
"lives",
"murder",
"day",
"missionary",
"massacre",
"ohio"
],
"textposition": "bottom center",
"type": "scatter",
"uid": "d99dc882-fc0a-43d7-a522-e0bc37f373fb",
"x": [
-13.265983581542969,
-3.1503617763519287,
13.101271629333496,
5.719395637512207,
-17.57438850402832,
-5.6888861656188965,
27.676631927490234,
33.34973907470703,
-47.03227615356445,
-37.47283172607422
],
"y": [
-45.59456253051758,
-60.21220397949219,
-47.7032470703125,
-93.46844482421875,
-76.33037567138672,
-22.85382080078125,
-70.43335723876953,
-27.5128231048584,
-71.16353607177734,
-33.79647445678711
]
}
],
"layout": {
"paper_bgcolor": "rgb(243,243,243)",
"plot_bgcolor": "rgb(243,243,243)",
"title": {
"text": "Most similar Words to White American"
},
"xaxis": {
"title": {
"text": "Dimension1"
}
},
"yaxis": {
"title": {
"text": "Dimension2"
}
}
}
},
"text/html": [
"<div id=\"21ebc5bc-2926-4fee-9131-033b18843374\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"21ebc5bc-2926-4fee-9131-033b18843374\", [{\"mode\": \"markers+text\", \"name\": \"Similar to White American\", \"text\": [\"moravian\", \"indians\", \"fairfield\", \"many\", \"lives\", \"murder\", \"day\", \"missionary\", \"massacre\", \"ohio\"], \"textposition\": \"bottom center\", \"x\": [-13.265983581542969, -3.1503617763519287, 13.101271629333496, 5.719395637512207, -17.57438850402832, -5.6888861656188965, 27.676631927490234, 33.34973907470703, -47.03227615356445, -37.47283172607422], \"y\": [-45.59456253051758, -60.21220397949219, -47.7032470703125, -93.46844482421875, -76.33037567138672, -22.85382080078125, -70.43335723876953, -27.5128231048584, -71.16353607177734, -33.79647445678711], \"type\": \"scatter\", \"uid\": \"d99dc882-fc0a-43d7-a522-e0bc37f373fb\"}], {\"paper_bgcolor\": \"rgb(243,243,243)\", \"plot_bgcolor\": \"rgb(243,243,243)\", \"title\": {\"text\": \"Most similar Words to White American\"}, \"xaxis\": {\"title\": {\"text\": \"Dimension1\"}}, \"yaxis\": {\"title\": {\"text\": \"Dimension2\"}}}, {\"showLink\": false, \"linkText\": \"Export to plot.ly\", \"plotlyServerURL\": \"https://plot.ly\"})});</script><script type=\"text/javascript\">window.addEventListener(\"resize\", function(){window._Plotly.Plots.resize(document.getElementById(\"21ebc5bc-2926-4fee-9131-033b18843374\"));});</script>"
],
"text/vnd.plotly.v1+html": [
"<div id=\"21ebc5bc-2926-4fee-9131-033b18843374\" style=\"height: 525px; width: 100%;\" class=\"plotly-graph-div\"></div><script type=\"text/javascript\">require([\"plotly\"], function(Plotly) { window.PLOTLYENV=window.PLOTLYENV || {};window.PLOTLYENV.BASE_URL=\"https://plot.ly\";Plotly.newPlot(\"21ebc5bc-2926-4fee-9131-033b18843374\", [{\"mode\": \"markers+text\", \"name\": \"Similar to White American\", \"text\": [\"moravian\", \"indians\", \"fairfield\", \"many\", \"lives\", \"murder\", \"day\", \"missionary\", \"massacre\", \"ohio\"], \"textposition\": \"bottom center\", \"x\": [-13.265983581542969, -3.1503617763519287, 13.101271629333496, 5.719395637512207, -17.57438850402832, -5.6888861656188965, 27.676631927490234, 33.34973907470703, -47.03227615356445, -37.47283172607422], \"y\": [-45.59456253051758, -60.21220397949219, -47.7032470703125, -93.46844482421875, -76.33037567138672, -22.85382080078125, -70.43335723876953, -27.5128231048584, -71.16353607177734, -33.79647445678711], \"type\": \"scatter\", \"uid\": \"d99dc882-fc0a-43d7-a522-e0bc37f373fb\"}], {\"paper_bgcolor\": \"rgb(243,243,243)\", \"plot_bgcolor\": \"rgb(243,243,243)\", \"title\": {\"text\": \"Most similar Words to White American\"}, \"xaxis\": {\"title\": {\"text\": \"Dimension1\"}}, \"yaxis\": {\"title\": {\"text\": \"Dimension2\"}}}, {\"showLink\": false, \"linkText\": \"Export to plot.ly\", \"plotlyServerURL\": \"https://plot.ly\"})});</script><script type=\"text/javascript\">window.addEventListener(\"resize\", function(){window._Plotly.Plots.resize(document.getElementById(\"21ebc5bc-2926-4fee-9131-033b18843374\"));});</script>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## Visualize words most similar to White American\n",
"my_word_list=[]\n",
"my_word_vectors=[]\n",
"# label=[]\n",
"\n",
"for i in whiteAmerican_british: \n",
" if my_word_list not in my_word_list:\n",
" my_word_list.append(i[0])\n",
" my_word_vectors.append(model_p.wv[i[0]])\n",
" \n",
"tsne_model = TSNE(perplexity=5, n_components=2, init='pca', n_iter=3000, random_state=23) #you may need to tune these, epsecially the perplexity. #Use PCA to reduce dimensionality to 2-D, an \"X\" and a \"Y \n",
"new_values = tsne_model.fit_transform(my_word_vectors)\n",
"\n",
"x = []\n",
"y = []\n",
"for value in new_values:\n",
" x.append(value[0])\n",
" y.append(value[1])\n",
"\n",
"\n",
"trace1 = go.Scatter(\n",
" x = x,\n",
" y = y,\n",
" mode = 'markers+text',\n",
" name = \"Similar to White American\",\n",
" text = [stem_dict[word] if stem_dict.get(word) else word for word in my_word_list],\n",
" textposition='bottom center'\n",
")\n",
"\n",
"\n",
"data = [trace1]\n",
"\n",
"layout = go.Layout(dict(title = \"Most similar Words to White American\",\n",
" yaxis = dict(title = \"Dimension2\"),\n",
" xaxis = dict(title = \"Dimension1\"),\n",
" plot_bgcolor = \"rgb(243,243,243)\",\n",
" paper_bgcolor = \"rgb(243,243,243)\",\n",
" )\n",
" )\n",
"\n",
"fig = go.Figure(data=data,layout=layout)\n",
"py.iplot(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Some of the significant words that come up from this specific dimension include**;\n",
"* Massacre\n",
"* Moravian Indians\n",
"* War \n",
"* Missionary \n",
"* Ohio\n",
"\n",
"These words construct what we know about the Gnadenhutten Massacre - \"The Moravian Indians, who were not allies of Britain, were massacred by American militia in Gnadenhutte, Ohio. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***\n",
"<a id=\"tensor\"></a>\n",
"## 4. Word2Vec to Tensor\n",
"\n",
"I am using [Google Embedding Projector](https://projector.tensorflow.org/) to visualize my Word2Vec model. Since it is built on tensorflow, we need to convert our Word2Vec output to the tensor format. The function below does just that.\n",
"\n",
"The final visualization of this Word2Vec model can be found [here](https://projector.tensorflow.org/?config=https://gist.githubusercontent.com/unglikteng/3d31526c9e090ff8123c4e7a1b07d2bb/raw/74fa1165c92c98a890d7837b97f33599146ebb0f/projector_config.json).\n",
"***"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def word2vec2tensor(model, tensor_filename):\n",
" outfiletsv = tensor_filename + '_tensor.tsv'\n",
" outfiletsvmeta = tensor_filename + '_metadata.tsv'\n",
" \n",
" with smart_open(outfiletsv, 'wb') as file_vector, smart_open(outfiletsvmeta, 'wb') as file_metadata:\n",
" for word in model.wv.index2word:\n",
" word = stem_dict[word] if stem_dict.get(word) else word\n",
" file_metadata.write(gensim.utils.to_utf8(word) + gensim.utils.to_utf8('\\n'))\n",
" vector_row = '\\t'.join(str(x) for x in model.wv.__getitem__(word))\n",
" file_vector.write(gensim.utils.to_utf8(vector_row) + gensim.utils.to_utf8('\\n'))"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# word2vec2tensor(model_p, \"model_primaries\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
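
The vector arithmetic used in section 3 (white + american - british) can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional `toy_vectors` are invented and are not taken from the trained model, and the helper names `unit`, `cosine`, and `analogy` are hypothetical. It mirrors, roughly, what a `most_similar` query with positive and negative words computes: normalize the input vectors, add the positives, subtract the negatives, and rank the remaining words by cosine similarity.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# They are NOT taken from the trained model, whose vectors have 500 dimensions.
toy_vectors = {
    "white":    [1.0, 0.0, 0.0],
    "american": [0.0, 1.0, 0.0],
    "british":  [0.0, 0.0, 1.0],
    "militia":  [1.0, 1.0, -1.0],
    "settler":  [0.9, 0.8, 0.1],
    "river":    [0.2, 0.1, 0.9],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The query words themselves are left out of the ranking
    excluded = set(positive) | set(negative)
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = analogy(toy_vectors, positive=["white", "american"], negative=["british"])
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

With these toy vectors the scores are clearly separated. In the notebook's recorded result, by contrast, all ten scores fall between 0.99985 and 0.99986, so the order of those neighbours carries little information by itself.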