{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Experiment with word embeddings similarity\n",
"**Author:** Mattias Östmar, mattiasostmar@gmail.com, www.mattiasostmar.se\n",
"**Date:** 2018-03-28\n",
"\n",
"\n",
"### Summary\n",
"\n",
"Inspired by the reasoning about (psychological) state in Hanzi Freinacht's book The Listening Society, I set up this simplistic exploratory experiment to see whether a Twitter-trained word2vec model can be used to find words similar to high and low state words. This experiment has not been peer-reviewed or intersubjectively verified, since that doesn't seem to be worth the time.\n",
"\n",
"Using a [word2vec model](https://osf.io/j5rgz/) trained on multilingual tweets from a statistically non-significant sample of over 460,000 Twitter accounts during 2016, I find that only 30% of the 50 words most similar to the high state word examples listed in The Listening Society can be judged to belong to the category of high state words. The corresponding number for low state words is 78%.\n",
"\n",
"The word embedding model used here doesn't seem to be very useful for differentiating between high and low state words, although it is better at finding low state words than high state words.\n",
"\n",
"### Currently applied research within the field of computerized analysis of Metamodern-related theories\n",
"\n",
"**Hierarchical Complexity**\n",
"\n",
"There is at least one example of the application of computerized linguistic analysis relevant to the stage theories presented in Hanzi Freinacht's book The Listening Society. The [non-profit Lectica](https://lecticalive.org/) has developed what they call the [Lectical Assessment System (LAS)](https://lecticalive.org/about/who-we-are), which they claim is partly built on the [theories of hierarchical complexity](https://lecticalive.org/about/hierarchical-complexity). Their research is based on a combination of human judges and computer assistance and is operationalized in the form of a word lexicon consisting of [200,000+ categorized words](https://lecticalive.org/about/clas).\n",
"\n",
"**High and Low State**\n",
"LIWC is without question the most well-researched psychological text analysis lexicon. It contains categories that relate to the state or mood of the writer. It has been used to research linguistic features of \"dismissing texts\" in [A Linguistic Inquiry and Word Count Analysis of the Adult Attachment Interview in Two Large Corpora](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4824682/). From the abstract: \n",
"\n",
" First, regression analyses revealed that dismissing states of mind were associated with transcripts that \n",
" were more truncated and deemphasized discussion of the attachment relationship whereas preoccupied states of \n",
" mind were associated with longer, more conflicted, and angry narratives. Second, in aggregate, LIWC variables\n",
" accounted for over a third of the variation in AAI dismissing and preoccupied states of mind, with regression\n",
" weights cross-validating across samples. Third, LIWC-derived dismissing and preoccupied state of mind \n",
" dimensions were associated with direct observations of maternal and paternal sensitivity as well as infant\n",
" attachment security in childhood\n",
" \n",
"[IBM Tone Analyzer](https://www.ibm.com/watson/services/tone-analyzer/) is used to \"Predict whether they are happy, sad, confident, and more.\" \n",
"\n",
"The purpose of this initial exploratory experiment is to see whether the word embeddings I have already created from tweets can be used to support, or in any way help, the further progression of the metamodern project presented in the book."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Works when running in the conda environment 'memeticscience' and using the kernel py34gensim on MOSMBP\n",
"# The model needs to be loaded with Python 3.4, which it was created with, due to a quirk in Gensim\n",
"# Also the links between libgccc_.... need to work, which is a pain to get working across different conda envs\n",
"\n",
"import gensim"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"big_model_150_mincount_30_no_stops\r\n",
"big_model_150_mincount_30_no_stops.syn0.npy\r\n",
"big_model_150_mincount_30_no_stops.syn1neg.npy\r\n"
]
}
],
"source": [
"!ls ../models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the [pre-trained word embedding model](https://osf.io/j5rgz/) trained on tweets from 2016."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"model = gensim.models.Word2Vec.load(\"../models/big_model_150_mincount_30_no_stops\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's add Hanzi Freinacht's examples of high and low state words from the chapter High States, Low States of the book [The Listening Society](https://www.amazon.com/Listening-Society-Metamodern-Politics-Guides-ebook/dp/B074MKQ4LR)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# High state words"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# I split \"radiant emptiness\" into two words\n",
"# I suppose \"choseness\" is a typo for \"closeness\" and added the latter instead\n",
"high_words = [\n",
" \"magnificence\",\"majesty\",\"vastness\",\"greatness\",\"splendor\",\"love\",\"joy\",\"clarity\",\"openness\",\n",
" \"compassion\",\"certitude\",\"flow\",\"jubilation\",\"playfulness\",\"fulness\",\"enlightenment\",\"lightness\",\n",
" \"peace\",\"presence\",\"power\",\"realness\",\"humility\",\"freedom\",\"creation\",\"freshness\",\"birth\",\"wonder\",\n",
" \"victory\",\"serenity\",\"divinity\",\"purity\",\"meaning\",\"unity\",\"union\",\"communion\",\"uniqueness\",\"closeness\",\n",
" \"awe\",\"fulfillment\",\"insight\",\"grace\",\"refinement\",\"subtlety\",\"simplicity\",\"gratefulness\",\"substance\",\n",
" \"radiant\",\"emptiness\"\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we test the model's most_similar method on one of the words: do we get other words likely to be \"high state\"?"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('divine', 0.6461964249610901),\n",
" ('blessing', 0.6008683443069458),\n",
" ('friendship', 0.5945286154747009),\n",
" ('praise', 0.5944042205810547),\n",
" ('brilliance', 0.59123694896698),\n",
" ('eternal', 0.5894177556037903),\n",
" ('perfection', 0.5870571136474609),\n",
" ('sacrifice', 0.5861201286315918),\n",
" ('sorrow', 0.5844243168830872),\n",
" ('faith', 0.5791643857955933)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar(\"grace\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The method returns the 10 most similar words in the vector space. Most of the words, I would say, are \"high state\", but we also get the word \"sorrow\", which I'd call a \"low state\" word. Can we get a better result by triangulating all the words - those that are in the word embedding model - with each other? "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All high words: 48\n",
" magnificence majesty vastness greatness splendor love joy clarity openness compassion certitude flow jubilation playfulness fulness enlightenment lightness peace presence power realness humility freedom creation freshness birth wonder victory serenity divinity purity meaning unity union communion uniqueness closeness awe fulfillment insight grace refinement subtlety simplicity gratefulness substance radiant emptiness\n",
"\n",
"High words present in the word embedding model: 42\n",
" magnificence majesty greatness splendor love joy clarity openness compassion flow playfulness enlightenment lightness peace presence power realness humility freedom creation freshness birth wonder victory serenity divinity purity meaning unity union communion uniqueness closeness awe fulfillment insight grace refinement simplicity substance radiant emptiness\n"
]
}
],
"source": [
"present_high_words = []\n",
"for word in high_words:\n",
" if word in model.vocab:\n",
" present_high_words.append(word)\n",
"print(\"All high words: {}\\n {}\\n\\nHigh words present in the word embedding model: {}\\n {}\".format(\n",
" len(high_words),\n",
" \" \".join(high_words),\n",
" len(present_high_words),\n",
" \" \".join(present_high_words)))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('contemplation', 0.7986831068992615),\n",
" ('virtue', 0.7952798008918762),\n",
" ('respecting', 0.7897941470146179),\n",
" ('clout', 0.7893505096435547),\n",
" ('abundance', 0.789014995098114),\n",
" ('conjunction', 0.7857810854911804),\n",
" ('adversity', 0.7847736477851868),\n",
" ('belief', 0.7775998711585999),\n",
" ('purpose', 0.7770401835441589),\n",
" ('irrespective', 0.7762966752052307)]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar(present_high_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The only words that seem to fit the category high state to me are \"contemplation\", \"virtue\", \"abundance\" and, with a little good will, \"purpose\". The only low state word to me is \"adversity\". But words like \"clout\" and \"conjunction\" seem pretty non-emotional or neutral to me. Let's look at a few more."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('maturity', 0.7755938768386841),\n",
" ('creativity', 0.7741824388504028),\n",
" ('preference', 0.7735363245010376),\n",
" ('wherewith', 0.7729119062423706),\n",
" ('coupled', 0.7726088762283325),\n",
" ('remedial', 0.7719709873199463),\n",
" ('essence', 0.7714344263076782),\n",
" ('thine', 0.7696318626403809),\n",
" ('happiness', 0.7690586447715759),\n",
" ('relentless', 0.7682967185974121),\n",
" ('consciousness', 0.7681272625923157),\n",
" ('mankind', 0.7675865888595581),\n",
" ('anent', 0.767257571220398),\n",
" ('significance', 0.7670107483863831),\n",
" ('generosity', 0.7652686834335327),\n",
" ('emotion', 0.7651517391204834),\n",
" ('consistent', 0.7648339867591858),\n",
" ('masses', 0.7615814805030823),\n",
" ('merely', 0.7607730627059937),\n",
" ('unto', 0.7603210806846619),\n",
" ('midst', 0.7594066262245178),\n",
" ('regard', 0.7592296004295349),\n",
" ('delusion', 0.7588174343109131),\n",
" ('embrace', 0.7586248517036438),\n",
" ('always', 0.75837242603302),\n",
" ('intellectual', 0.7572551965713501),\n",
" ('movements', 0.7564965486526489),\n",
" ('nature', 0.755846381187439),\n",
" ('boundaries', 0.7550290822982788),\n",
" ('courage', 0.7549212574958801),\n",
" ('enormous', 0.7548866271972656),\n",
" ('gratitude', 0.7544490098953247),\n",
" ('necessity', 0.7538563013076782),\n",
" ('baksheesh', 0.7536888718605042),\n",
" ('determination', 0.7513951063156128),\n",
" ('inasmuch', 0.7506810426712036),\n",
" ('flexibility', 0.7504537105560303),\n",
" ('weakness', 0.7494118213653564),\n",
" ('both', 0.7493100166320801),\n",
" ('reflection', 0.7492861747741699)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar(present_high_words, topn=50)[10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To me, the following words don't seem to match the category high state words, but rather seem pretty neutral: \n",
"\n",
"\"wherewith\", \"coupled\", \"thine\", \"relentless\", \"consciousness\", \"mankind\", \"anent\", \"significance\", \"emotion\", \"consistent\", \"masses\", \"merely\", \"unto\", \"midst\", \"regard\", \"always\", \"intellectual\", \"movements\", \"nature\", \"boundaries\", \"enormous\", \"baksheesh\", \"inasmuch\", \"flexibility\", \"both\", \"reflection\"\n",
"\n",
"And the following seem to fit in the category low state words:\n",
"\n",
"\"delusion\", \"necessity\", \"weakness\"\n",
"\n",
"That leaves a total of 15/50 high state words, or 30% of the similar words in the same category. "
]
},
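{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sanity check of the manual tally above, the precision figure can be computed with a small helper. This is a minimal sketch; `judged_precision` is a hypothetical helper name, and the counts are the manual judgements from the text, not model output.\n",
"\n",
"```python\n",
"def judged_precision(similar_words, in_category):\n",
"    # Fraction of the retrieved similar words that the human judge\n",
"    # placed in the target category.\n",
"    hits = sum(1 for w in similar_words if w in in_category)\n",
"    return hits / len(similar_words)\n",
"\n",
"# Toy illustration: 15 of 50 retrieved words judged 'high state'\n",
"retrieved = ['w%d' % i for i in range(50)]\n",
"high_judged = set(retrieved[:15])\n",
"print(judged_precision(retrieved, high_judged))  # 0.3\n",
"```"
]
},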
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Low state words"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# I skipped \"dead meat\" and added \"dead\" and \"death\" instead\n",
"# I split \"painful emptiness\" into \"painful\" and \"emptiness\"\n",
"# I split \"unbearable pain\" into \"unbearable\" and \"pain\"\n",
"low_words = [\n",
" \"anxiety\",\"angst\",\"rage\",\"hatred\",\"bitterness\",\"imprisonment\",\"slavery\",\"humiliation\",\"loss\",\"loneliness\",\n",
" \"meaninglessness\",\"unrealness\",\"confusion\",\"boredom\",\"filth\",\"torture\",\"oppression\",\"suffocation\",\n",
" \"unsettlement\",\"darkness\",\"dead\",\"death\",\"rot\",\"painful\",\"emptiness\",\"powerlessness\",\"heaviness\",\n",
" \"uncertainty\",\"grossness\",\"hopelessness\",\"unlife\",\"half-life\",\"screaming\",\"woundedness\",\"suffering\",\n",
" \"unbearable\",\"pain\",\"complication\",\"terror\",\"exposedness\"\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All low words: 40\n",
" anxiety angst rage hatred bitterness imprisonment slavery humiliation loss loneliness meaninglessness unrealness confusion boredom filth torture oppression suffocation unsettlement darkness dead death rot painful emptiness powerlessness heaviness uncertainty grossness hopelessness unlife half-life screaming woundedness suffering unbearable pain complication terror exposedness\n",
"\n",
"Low words present in the word embedding model: 31\n",
" anxiety angst rage hatred bitterness imprisonment slavery humiliation loss loneliness confusion boredom filth torture oppression suffocation darkness dead death rot painful emptiness heaviness uncertainty hopelessness screaming suffering unbearable pain complication terror\n"
]
}
],
"source": [
"present_low_words = []\n",
"for word in low_words:\n",
" if word in model.vocab:\n",
" present_low_words.append(word)\n",
"print(\"All low words: {}\\n {}\\n\\nLow words present in the word embedding model: {}\\n {}\".format(\n",
" len(low_words),\n",
" \" \".join(low_words),\n",
" len(present_low_words),\n",
" \" \".join(present_low_words)))"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('diagnose', 0.6504509449005127),\n",
" ('depresjon', 0.6346930861473083),\n",
" ('risiko', 0.630301833152771),\n",
" ('følelser', 0.615696907043457),\n",
" ('apropos', 0.6063698530197144),\n",
" ('lidelser', 0.5917198061943054),\n",
" ('sykdom', 0.5901660323143005),\n",
" ('lidelse', 0.5886660814285278),\n",
" ('hjernen', 0.5846033692359924),\n",
" ('provokerende', 0.5839634537696838)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar(\"angst\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apparently \"angst\" is a common Norwegian word. That we get results in several languages is because the model was trained on tweets from a lot of Nordic-language accounts, as well as quite a few English-speaking accounts, some of which are from the Nordic countries but still tweet a lot in English.\n",
"\n",
"Let's see what happens if we \"triangulate\" the words by adding them all."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('prejudice', 0.826143741607666),\n",
" ('distress', 0.8099950551986694),\n",
" ('greed', 0.7975993156433105),\n",
" ('despair', 0.7955959439277649),\n",
" ('excessive', 0.791906476020813),\n",
" ('merely', 0.7915235757827759),\n",
" ('resentment', 0.7886461615562439),\n",
" ('fear', 0.7879273891448975),\n",
" ('stupidity', 0.7865918278694153),\n",
" ('ignorance', 0.7819235920906067)]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar(present_low_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the top 10 words seem to be in the category \"low state\", as far as I can judge. Let's look at a few more."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('misogyny', 0.781486988067627),\n",
" ('vicious', 0.77909916639328),\n",
" ('betrayal', 0.7752048373222351),\n",
" ('existence', 0.7750952243804932),\n",
" ('sympathy', 0.7743609547615051),\n",
" ('ineffective', 0.7720231413841248),\n",
" ('midst', 0.7718374133110046),\n",
" ('relentless', 0.7716357111930847),\n",
" ('sickness', 0.7716209292411804),\n",
" ('cowardly', 0.7692660689353943),\n",
" ('denial', 0.7683166861534119),\n",
" ('incompetence', 0.7668265700340271),\n",
" ('unnecessary', 0.7667227387428284),\n",
" ('destructive', 0.7662035822868347),\n",
" ('troubles', 0.7633748650550842),\n",
" ('nonsense', 0.7621120810508728),\n",
" ('bigotry', 0.7589126825332642),\n",
" ('danger', 0.7585600018501282),\n",
" ('consequences', 0.7584181427955627),\n",
" ('doctrine', 0.7584144473075867),\n",
" ('otherwise', 0.757522702217102),\n",
" ('hostility', 0.7573735117912292),\n",
" ('agony', 0.7568016648292542),\n",
" ('violent', 0.7559362649917603),\n",
" ('illness', 0.7552935481071472),\n",
" ('patriarchy', 0.7547674775123596),\n",
" ('respecting', 0.7547410726547241),\n",
" ('cynical', 0.7519647479057312),\n",
" ('uncontrolled', 0.7504923939704895),\n",
" ('starvation', 0.7503988146781921),\n",
" ('masculinity', 0.7502506375312805),\n",
" ('ideological', 0.7491779327392578),\n",
" ('virtue', 0.749147355556488),\n",
" ('escalation', 0.7483734488487244),\n",
" ('endure', 0.747993528842926),\n",
" ('grief', 0.7465096116065979),\n",
" ('guilt', 0.7463783025741577),\n",
" ('murderous', 0.7462527751922607),\n",
" ('dignity', 0.7456189393997192),\n",
" ('systematically', 0.7448133826255798)]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.most_similar(present_low_words, topn=50)[10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Except for \"sympathy\", \"midst\", \"relentless\", \"doctrine\", \"otherwise\", \"respecting\", \"masculinity\", \"ideological\", \"virtue\", \"dignity\" and \"systematically\", the words seem low state to me. That is 39/50 words in the right category, or 78% of the similar words in the correct category."
]
},
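{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible next step for actually differentiating between the two categories would be a nearest-centroid classifier: average the vectors of the high and low seed words and label a new word by whichever centroid its vector is closer to in cosine similarity. Below is a minimal sketch with made-up 2-d stand-ins for the real 150-d word vectors; `centroid` and `classify` are hypothetical helper names, and with the real model `model[word]` would supply the vectors.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def cosine(a, b):\n",
"    # Cosine similarity between two 2-d vectors\n",
"    dot = sum(x * y for x, y in zip(a, b))\n",
"    return dot / (math.hypot(*a) * math.hypot(*b))\n",
"\n",
"def centroid(vectors):\n",
"    # Elementwise mean of a list of equal-length vectors\n",
"    n = len(vectors)\n",
"    return [sum(dims) / n for dims in zip(*vectors)]\n",
"\n",
"def classify(vec, high_c, low_c):\n",
"    # Label by the closer centroid in cosine similarity\n",
"    return 'high' if cosine(vec, high_c) >= cosine(vec, low_c) else 'low'\n",
"\n",
"# Toy 2-d stand-ins for the seed word vectors\n",
"high_c = centroid([[1.0, 0.1], [0.9, 0.2]])\n",
"low_c = centroid([[0.1, 1.0], [0.2, 0.9]])\n",
"\n",
"print(classify([0.8, 0.3], high_c, low_c))  # high\n",
"```"
]
},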
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "py34gensim",
"language": "python",
"name": "py34gensim"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}