Jupyter notebook for a small study on embedding methods on a Danish word intrusion task
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combining embedding methods for a word intrusion task\n",
"=====================================================\n",
"We report a new baseline for a Danish word intrusion task by\n",
"combining pre-trained off-the-shelf word, subword and knowledge graph\n",
"embedding models."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Standard python modules\n",
"from functools import lru_cache\n",
"from os.path import expanduser, join\n",
"\n",
"# Extra python modules\n",
"from bert_serving.client import BertClient\n",
"from bpemb import BPEmb\n",
"from IPython.display import display\n",
"from gensim.models.fasttext import FastText\n",
"import numpy as np\n",
"import pandas as pd\n",
"import requests"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Wembedder may be installed from https://github.com/fnielsen/wembedder\n",
"# Here we include it from its directory\n",
"import sys\n",
"sys.path.append(expanduser(\"~/projects/wembedder/\"))\n",
"import wembedder.model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The evaluation data and the machine learning parameters must be set up."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# The data we will work on is from the `dasem` toolbox available at https://github.com/fnielsen/dasem\n",
"# Only the four_words_2.csv file is needed here, not the rest of `dasem`.\n",
"filename_four_words = expanduser('~/projects/dasem/dasem/data/four_words_2.csv')\n",
"\n",
"# The fasttext model is available from https://fasttext.cc/docs/en/crawl-vectors.html\n",
"filename_fasttext_model = expanduser(join('~', 'data', 'fasttext', 'cc.da.300.bin'))\n",
"\n",
"# multi_cased_L-12_H-768_A-12 may be downloaded from\n",
"# https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip\n",
"\n",
"# BPEmb models are downloaded automatically"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read data\n",
"---------\n",
"The data is loaded into a Pandas dataframe. The fourth column (`word4`) is the outlier. Results will be appended as new columns of this dataframe.\n",
"\n",
"* Nielsen, F.Å., Hansen, L.K.: Open semantic analysis: The case of word level semantics in Danish. LTC (2017), http://ltc.amu.edu.pl/book/papers/SEM2-2.pdf"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"four_words = pd.read_csv(filename_four_words)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('kradser af', 'trompet')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# These words were wrong in the first dataset\n",
"four_words.loc[80, 'word2'], four_words.loc[16, 'word2']"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word1</th>\n",
" <th>word2</th>\n",
" <th>word3</th>\n",
" <th>word4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>æble</td>\n",
" <td>pære</td>\n",
" <td>kirsebær</td>\n",
" <td>stol</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>stol</td>\n",
" <td>bord</td>\n",
" <td>reol</td>\n",
" <td>græs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>græs</td>\n",
" <td>træ</td>\n",
" <td>blomst</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>bil</td>\n",
" <td>cykel</td>\n",
" <td>tog</td>\n",
" <td>vind</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vind</td>\n",
" <td>regn</td>\n",
" <td>solskin</td>\n",
" <td>mandag</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word1 word2 word3 word4\n",
"0 æble pære kirsebær stol\n",
"1 stol bord reol græs\n",
"2 græs træ blomst bil\n",
"3 bil cykel tog vind\n",
"4 vind regn solskin mandag"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"four_words.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"FastText\n",
"--------\n",
"FastText is used through Gensim. The `doesnt_match` method is used to determine the outlier.\n",
"\n",
"* https://radimrehurek.com/gensim/\n",
"* Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning Word Vectors for 157 Languages. LREC (2018), https://arxiv.org/pdf/1802.06893.pdf"
]
},
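{
"cell_type": "markdown",
"metadata": {},
"source": [
"As far as we understand Gensim's implementation, `doesnt_match` normalises the word vectors, averages them, and returns the word that is least similar to that mean. The block below is a minimal sketch of that idea, not Gensim's own code. The function name `odd_one_out` and the 2-d toy vectors are hypothetical and only serve as illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def odd_one_out(words, vectors):\n",
"    # Return the word whose unit vector is least similar to the group mean\n",
"    vectors = np.asarray(vectors, dtype=float)\n",
"    # Normalise each word vector to unit length\n",
"    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)\n",
"    # Mean direction of the group, also normalised\n",
"    mean = unit.mean(axis=0)\n",
"    mean /= np.linalg.norm(mean)\n",
"    # Cosine similarity of each word to the mean direction\n",
"    similarities = unit @ mean\n",
"    return words[int(np.argmin(similarities))]\n",
"\n",
"\n",
"# Hypothetical toy vectors, not taken from any real model\n",
"toy_words = ['æble', 'pære', 'kirsebær', 'stol']\n",
"toy_vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0]]\n",
"print(odd_one_out(toy_words, toy_vectors))\n",
"```\n",
"\n",
"Because the mean is computed over all four words, including the intruder, a strong intruder pulls the mean towards itself. This may be one reason why the method can fail on rows where the three related words are only loosely related."
]
},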
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"model = FastText.load_fasttext_format(filename_fasttext_model)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py:730: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n",
" vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n"
]
}
],
"source": [
"# Identify outlier\n",
"outliers = []\n",
"for idx, words in four_words.iterrows():\n",
"    outlier = model.wv.doesnt_match(words.values[:4])\n",
"    outliers.append(outlier)\n",
"\n",
"four_words['fasttext'] = outliers"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.78"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Accuracy\n",
"np.mean(four_words.word4 == outliers)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word1</th>\n",
" <th>word2</th>\n",
" <th>word3</th>\n",
" <th>word4</th>\n",
" <th>fasttext</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>æble</td>\n",
" <td>pære</td>\n",
" <td>kirsebær</td>\n",
" <td>stol</td>\n",
" <td>stol</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>stol</td>\n",
" <td>bord</td>\n",
" <td>reol</td>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>græs</td>\n",
" <td>træ</td>\n",
" <td>blomst</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>bil</td>\n",
" <td>cykel</td>\n",
" <td>tog</td>\n",
" <td>vind</td>\n",
" <td>tog</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vind</td>\n",
" <td>regn</td>\n",
" <td>solskin</td>\n",
" <td>mandag</td>\n",
" <td>mandag</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>mandag</td>\n",
" <td>tirsdag</td>\n",
" <td>søndag</td>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>tømrer</td>\n",
" <td>vvs-mand</td>\n",
" <td>snedker</td>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>barn</td>\n",
" <td>far</td>\n",
" <td>mormor</td>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>lampe</td>\n",
" <td>stearinlys</td>\n",
" <td>lommelygte</td>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>jern</td>\n",
" <td>guld</td>\n",
" <td>magnesium</td>\n",
" <td>sjov</td>\n",
" <td>sjov</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>sjov</td>\n",
" <td>dårlig</td>\n",
" <td>vanvittig</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>papir</td>\n",
" <td>ringbind</td>\n",
" <td>blyant</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>vagt</td>\n",
" <td>politimand</td>\n",
" <td>fængselsbetjent</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>by</td>\n",
" <td>landsby</td>\n",
" <td>købstad</td>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>småkage</td>\n",
" <td>citronmåne</td>\n",
" <td>kringle</td>\n",
" <td>dør</td>\n",
" <td>dør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>dør</td>\n",
" <td>væg</td>\n",
" <td>vindue</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>klaver</td>\n",
" <td>trompet</td>\n",
" <td>blokfløjte</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>fandens</td>\n",
" <td>fuck</td>\n",
" <td>sgu</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>vand</td>\n",
" <td>jord</td>\n",
" <td>ild</td>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>hukommelse</td>\n",
" <td>intelligens</td>\n",
" <td>emotion</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Niels Bohr</td>\n",
" <td>H.C. Ørsted</td>\n",
" <td>Ole Rømer</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Ole Rømer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Poul Nyrup Rasmussen</td>\n",
" <td>Anders Fogh Rasmussen</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Anders Fogh Rasmussen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Kasper Schmeichel</td>\n",
" <td>Brian Laudrup</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Steffi Graf</td>\n",
" <td>Serena Williams</td>\n",
" <td>Monaco</td>\n",
" <td>Serena Williams</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Monaco</td>\n",
" <td>Paris</td>\n",
" <td>Milano</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Pia</td>\n",
" <td>Lone</td>\n",
" <td>Marianne</td>\n",
" <td>Ole</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>bold</td>\n",
" <td>fjerbold</td>\n",
" <td>puck</td>\n",
" <td>mave</td>\n",
" <td>mave</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>mave</td>\n",
" <td>bryst</td>\n",
" <td>ryg</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>hat</td>\n",
" <td>kasket</td>\n",
" <td>hue</td>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>ishockey</td>\n",
" <td>skiløb</td>\n",
" <td>skihop</td>\n",
" <td>fodbold</td>\n",
" <td>skiløb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>gå</td>\n",
" <td>løbe</td>\n",
" <td>kravle</td>\n",
" <td>sidde</td>\n",
" <td>løbe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>rød</td>\n",
" <td>blå</td>\n",
" <td>violet</td>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>Finland</td>\n",
" <td>Sverige</td>\n",
" <td>Norge</td>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Kina</td>\n",
" <td>Japan</td>\n",
" <td>Sydkorea</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>humor</td>\n",
" <td>komedie</td>\n",
" <td>comedy</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>vaskemaskine</td>\n",
" <td>strygejern</td>\n",
" <td>tørretumbler</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>restaurant</td>\n",
" <td>café</td>\n",
" <td>bar</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>øl</td>\n",
" <td>vin</td>\n",
" <td>spiritus</td>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>køkken</td>\n",
" <td>baderum</td>\n",
" <td>stue</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>wing</td>\n",
" <td>back</td>\n",
" <td>forward</td>\n",
" <td>vinge</td>\n",
" <td>vinge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>vinge</td>\n",
" <td>landingsstel</td>\n",
" <td>propel</td>\n",
" <td>kartoffel</td>\n",
" <td>kartoffel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>kartoffel</td>\n",
" <td>frikadelle</td>\n",
" <td>salat</td>\n",
" <td>pejs</td>\n",
" <td>pejs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>Viborg</td>\n",
" <td>Randers</td>\n",
" <td>Hobro</td>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>Kattegat</td>\n",
" <td>Øresund</td>\n",
" <td>Alssund</td>\n",
" <td>Sjælland</td>\n",
" <td>Alssund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>eg</td>\n",
" <td>lærketræ</td>\n",
" <td>æbletræ</td>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>hugorm</td>\n",
" <td>pyton</td>\n",
" <td>snog</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>ko</td>\n",
" <td>so</td>\n",
" <td>hest</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>ugle</td>\n",
" <td>krage</td>\n",
" <td>måge</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>hund</td>\n",
" <td>ræv</td>\n",
" <td>ulv</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>spilletid</td>\n",
" <td>halvleg</td>\n",
" <td>dommer</td>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>tv</td>\n",
" <td>radio</td>\n",
" <td>telefon</td>\n",
" <td>klud</td>\n",
" <td>klud</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51</th>\n",
" <td>ondt</td>\n",
" <td>forfærdeligt</td>\n",
" <td>skrækkeligt</td>\n",
" <td>herligt</td>\n",
" <td>herligt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>hoppende</td>\n",
" <td>dansende</td>\n",
" <td>løbende</td>\n",
" <td>døende</td>\n",
" <td>løbende</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>saver</td>\n",
" <td>hamrer</td>\n",
" <td>skruer</td>\n",
" <td>aer</td>\n",
" <td>aer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>går</td>\n",
" <td>spadserer</td>\n",
" <td>vandrer</td>\n",
" <td>siger</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>gange</td>\n",
" <td>dividere</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" <td>lægge sammen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56</th>\n",
" <td>mener</td>\n",
" <td>tror</td>\n",
" <td>ved</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57</th>\n",
" <td>fire</td>\n",
" <td>fem</td>\n",
" <td>sytten</td>\n",
" <td>aldrig</td>\n",
" <td>aldrig</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>Nielsen</td>\n",
" <td>Jensen</td>\n",
" <td>Olsen</td>\n",
" <td>kassen</td>\n",
" <td>kassen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>mega</td>\n",
" <td>kæmpe</td>\n",
" <td>enorm</td>\n",
" <td>smule</td>\n",
" <td>kæmpe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>kufferter</td>\n",
" <td>tasker</td>\n",
" <td>bæreposer</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>landstræner</td>\n",
" <td>håndboldekspert</td>\n",
" <td>mål</td>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>trup</td>\n",
" <td>gruppe</td>\n",
" <td>hold</td>\n",
" <td>sti</td>\n",
" <td>sti</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>grafik</td>\n",
" <td>figur</td>\n",
" <td>plot</td>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>januar</td>\n",
" <td>maj</td>\n",
" <td>juni</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>angiveligt</td>\n",
" <td>muligvis</td>\n",
" <td>sandsynligvis</td>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>66</th>\n",
" <td>oplysninger</td>\n",
" <td>data</td>\n",
" <td>informationer</td>\n",
" <td>fjerner</td>\n",
" <td>fjerner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67</th>\n",
" <td>fire minutter</td>\n",
" <td>tre timer</td>\n",
" <td>en uge</td>\n",
" <td>to piger</td>\n",
" <td>tre timer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>instrumentbrættet</td>\n",
" <td>motorer</td>\n",
" <td>cockpit</td>\n",
" <td>sagen</td>\n",
" <td>sagen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>anmeldelse</td>\n",
" <td>politi</td>\n",
" <td>forbrydelse</td>\n",
" <td>kaffe</td>\n",
" <td>anmeldelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>instrueret</td>\n",
" <td>organiseret</td>\n",
" <td>ledet</td>\n",
" <td>skuffet</td>\n",
" <td>skuffet</td>\n",
" </tr>\n",
" <tr>\n",
" <th>71</th>\n",
" <td>billede</td>\n",
" <td>foto</td>\n",
" <td>tegning</td>\n",
" <td>skål</td>\n",
" <td>skål</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72</th>\n",
" <td>kapitel</td>\n",
" <td>paragraf</td>\n",
" <td>sektion</td>\n",
" <td>park</td>\n",
" <td>park</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>virksomhed</td>\n",
" <td>firma</td>\n",
" <td>selskab</td>\n",
" <td>sovs</td>\n",
" <td>sovs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>tres</td>\n",
" <td>60</td>\n",
" <td>LX</td>\n",
" <td>3</td>\n",
" <td>tres</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>1864</td>\n",
" <td>1807</td>\n",
" <td>1940</td>\n",
" <td>1909</td>\n",
" <td>1807</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>diplom</td>\n",
" <td>udmærkelse</td>\n",
" <td>pris</td>\n",
" <td>øremærke</td>\n",
" <td>øremærke</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>bange</td>\n",
" <td>urolig</td>\n",
" <td>nervøs</td>\n",
" <td>ordentlig</td>\n",
" <td>ordentlig</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>norsk</td>\n",
" <td>engelsk</td>\n",
" <td>spansk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>mus</td>\n",
" <td>tastatur</td>\n",
" <td>skærm</td>\n",
" <td>bræt</td>\n",
" <td>bræt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>dør</td>\n",
" <td>kradser af</td>\n",
" <td>udånder</td>\n",
" <td>åbner</td>\n",
" <td>kradser af</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>og</td>\n",
" <td>samt</td>\n",
" <td>endvidere</td>\n",
" <td>sin</td>\n",
" <td>og</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>hans</td>\n",
" <td>sit</td>\n",
" <td>vores</td>\n",
" <td>vises</td>\n",
" <td>vises</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>stod og råbte</td>\n",
" <td>lå og sov</td>\n",
" <td>sad og så</td>\n",
" <td>mand og kvinde</td>\n",
" <td>stod og råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>frokost</td>\n",
" <td>morgenmad</td>\n",
" <td>brunch</td>\n",
" <td>måne</td>\n",
" <td>måne</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" <td>individer</td>\n",
" <td>gange</td>\n",
" <td>mænd</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>kidnappe</td>\n",
" <td>røve</td>\n",
" <td>stjæle</td>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>råbte</td>\n",
" <td>skreg</td>\n",
" <td>larmede</td>\n",
" <td>vuggede</td>\n",
" <td>råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>kontorist</td>\n",
" <td>embedsmand</td>\n",
" <td>bureaukrat</td>\n",
" <td>spisebord</td>\n",
" <td>spisebord</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>vegetation</td>\n",
" <td>krat</td>\n",
" <td>bed</td>\n",
" <td>skur</td>\n",
" <td>skur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>cyklist</td>\n",
" <td>bilist</td>\n",
" <td>chauffør</td>\n",
" <td>ekspedient</td>\n",
" <td>ekspedient</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>bibliotek</td>\n",
" <td>bog</td>\n",
" <td>låner</td>\n",
" <td>flag</td>\n",
" <td>låner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>halvsyg</td>\n",
" <td>forkølelse</td>\n",
" <td>hoster</td>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>musik</td>\n",
" <td>node</td>\n",
" <td>rytme</td>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>rapport</td>\n",
" <td>sagsakt</td>\n",
" <td>artikel</td>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>tekande</td>\n",
" <td>vinflaske</td>\n",
" <td>slikskål</td>\n",
" <td>racerbil</td>\n",
" <td>racerbil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>forhører</td>\n",
" <td>spørger</td>\n",
" <td>anmoder</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>fremtidige</td>\n",
" <td>fortidige</td>\n",
" <td>nutidige</td>\n",
" <td>havdige</td>\n",
" <td>havdige</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>kanal</td>\n",
" <td>flod</td>\n",
" <td>bæk</td>\n",
" <td>spejl</td>\n",
" <td>spejl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>kanal</td>\n",
" <td>program</td>\n",
" <td>udsendelse</td>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word1 word2 word3 \\\n",
"0 æble pære kirsebær \n",
"1 stol bord reol \n",
"2 græs træ blomst \n",
"3 bil cykel tog \n",
"4 vind regn solskin \n",
"5 mandag tirsdag søndag \n",
"6 tømrer vvs-mand snedker \n",
"7 barn far mormor \n",
"8 lampe stearinlys lommelygte \n",
"9 jern guld magnesium \n",
"10 sjov dårlig vanvittig \n",
"11 papir ringbind blyant \n",
"12 vagt politimand fængselsbetjent \n",
"13 by landsby købstad \n",
"14 småkage citronmåne kringle \n",
"15 dør væg vindue \n",
"16 klaver trompet blokfløjte \n",
"17 fandens fuck sgu \n",
"18 vand jord ild \n",
"19 hukommelse intelligens emotion \n",
"20 Niels Bohr H.C. Ørsted Ole Rømer \n",
"21 Lars Løkke Rasmussen Poul Nyrup Rasmussen Anders Fogh Rasmussen \n",
"22 Peter Schmeichel Kasper Schmeichel Brian Laudrup \n",
"23 Caroline Wozniacki Steffi Graf Serena Williams \n",
"24 Monaco Paris Milano \n",
"25 Pia Lone Marianne \n",
"26 bold fjerbold puck \n",
"27 mave bryst ryg \n",
"28 hat kasket hue \n",
"29 ishockey skiløb skihop \n",
"30 gå løbe kravle \n",
"31 rød blå violet \n",
"32 Finland Sverige Norge \n",
"33 Kina Japan Sydkorea \n",
"34 humor komedie comedy \n",
"35 vaskemaskine strygejern tørretumbler \n",
"36 restaurant café bar \n",
"37 øl vin spiritus \n",
"38 køkken baderum stue \n",
"39 wing back forward \n",
"40 vinge landingsstel propel \n",
"41 kartoffel frikadelle salat \n",
"42 Viborg Randers Hobro \n",
"43 Kattegat Øresund Alssund \n",
"44 eg lærketræ æbletræ \n",
"45 hugorm pyton snog \n",
"46 ko so hest \n",
"47 ugle krage måge \n",
"48 hund ræv ulv \n",
"49 spilletid halvleg dommer \n",
"50 tv radio telefon \n",
"51 ondt forfærdeligt skrækkeligt \n",
"52 hoppende dansende løbende \n",
"53 saver hamrer skruer \n",
"54 går spadserer vandrer \n",
"55 gange dividere lægge sammen \n",
"56 mener tror ved \n",
"57 fire fem sytten \n",
"58 Nielsen Jensen Olsen \n",
"59 mega kæmpe enorm \n",
"60 kufferter tasker bæreposer \n",
"61 landstræner håndboldekspert mål \n",
"62 trup gruppe hold \n",
"63 grafik figur plot \n",
"64 januar maj juni \n",
"65 angiveligt muligvis sandsynligvis \n",
"66 oplysninger data informationer \n",
"67 fire minutter tre timer en uge \n",
"68 instrumentbrættet motorer cockpit \n",
"69 anmeldelse politi forbrydelse \n",
"70 instrueret organiseret ledet \n",
"71 billede foto tegning \n",
"72 kapitel paragraf sektion \n",
"73 virksomhed firma selskab \n",
"74 tres 60 LX \n",
"75 1864 1807 1940 \n",
"76 diplom udmærkelse pris \n",
"77 bange urolig nervøs \n",
"78 norsk engelsk spansk \n",
"79 mus tastatur skærm \n",
"80 dør kradser af udånder \n",
"81 og samt endvidere \n",
"82 hans sit vores \n",
"83 stod og råbte lå og sov sad og så \n",
"84 frokost morgenmad brunch \n",
"85 mænd personer individer \n",
"86 kidnappe røve stjæle \n",
"87 råbte skreg larmede \n",
"88 kontorist embedsmand bureaukrat \n",
"89 vegetation krat bed \n",
"90 cyklist bilist chauffør \n",
"91 bibliotek bog låner \n",
"92 halvsyg forkølelse hoster \n",
"93 musik node rytme \n",
"94 rapport sagsakt artikel \n",
"95 tekande vinflaske slikskål \n",
"96 forhører spørger anmoder \n",
"97 fremtidige fortidige nutidige \n",
"98 kanal flod bæk \n",
"99 kanal program udsendelse \n",
"\n",
" word4 fasttext \n",
"0 stol stol \n",
"1 græs græs \n",
"2 bil bil \n",
"3 vind tog \n",
"4 mandag mandag \n",
"5 tømrer tømrer \n",
"6 barn barn \n",
"7 lampe lampe \n",
"8 jern jern \n",
"9 sjov sjov \n",
"10 papir papir \n",
"11 vagt vagt \n",
"12 by by \n",
"13 småkage småkage \n",
"14 dør dør \n",
"15 klaver klaver \n",
"16 fandens fandens \n",
"17 vand vand \n",
"18 hukommelse hukommelse \n",
"19 Niels Bohr Niels Bohr \n",
"20 Lars Løkke Rasmussen Ole Rømer \n",
"21 Peter Schmeichel Anders Fogh Rasmussen \n",
"22 Caroline Wozniacki Caroline Wozniacki \n",
"23 Monaco Serena Williams \n",
"24 Pia Pia \n",
"25 Ole Pia \n",
"26 mave mave \n",
"27 hat hat \n",
"28 ishockey ishockey \n",
"29 fodbold skiløb \n",
"30 sidde løbe \n",
"31 himmel himmel \n",
"32 Kina Kina \n",
"33 Irland Irland \n",
"34 beskidt beskidt \n",
"35 beskidt beskidt \n",
"36 øl øl \n",
"37 køkken køkken \n",
"38 øl øl \n",
"39 vinge vinge \n",
"40 kartoffel kartoffel \n",
"41 pejs pejs \n",
"42 Kattegat Kattegat \n",
"43 Sjælland Alssund \n",
"44 slange slange \n",
"45 hund hund \n",
"46 krappe krappe \n",
"47 hund hund \n",
"48 krappe krappe \n",
"49 ræv ræv \n",
"50 klud klud \n",
"51 herligt herligt \n",
"52 døende løbende \n",
"53 aer aer \n",
"54 siger går \n",
"55 vandrer lægge sammen \n",
"56 går går \n",
"57 aldrig aldrig \n",
"58 kassen kassen \n",
"59 smule kæmpe \n",
"60 styrelser styrelser \n",
"61 rum rum \n",
"62 sti sti \n",
"63 lån lån \n",
"64 ur ur \n",
"65 nutidigt nutidigt \n",
"66 fjerner fjerner \n",
"67 to piger tre timer \n",
"68 sagen sagen \n",
"69 kaffe anmeldelse \n",
"70 skuffet skuffet \n",
"71 skål skål \n",
"72 park park \n",
"73 sovs sovs \n",
"74 3 tres \n",
"75 1909 1807 \n",
"76 øremærke øremærke \n",
"77 ordentlig ordentlig \n",
"78 falsk falsk \n",
"79 bræt bræt \n",
"80 åbner kradser af \n",
"81 sin og \n",
"82 vises vises \n",
"83 mand og kvinde stod og råbte \n",
"84 måne måne \n",
"85 gange mænd \n",
"86 køre køre \n",
"87 vuggede råbte \n",
"88 spisebord spisebord \n",
"89 skur skur \n",
"90 ekspedient ekspedient \n",
"91 flag låner \n",
"92 vej vej \n",
"93 leder leder \n",
"94 spand spand \n",
"95 racerbil racerbil \n",
"96 banker banker \n",
"97 havdige havdige \n",
"98 spejl spejl \n",
"99 vask vask "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"with pd.option_context(\"display.max_rows\", 100):\n",
"    display(four_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"BERT\n",
"----\n",
"\n",
"From the command line, start bert-as-service with:\n",
"```\n",
"bert-serving-start -model_dir multi_cased_L-12_H-768_A-12/ -num_worker 1\n",
"```\n",
"\n",
"* Devlin, J., Chang, M.W., Lee, K., Toutanova, K.N.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018), https://arxiv.org/pdf/1810.04805.pdf"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"bc = BertClient(ip='localhost')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Identify outlier\n",
"corrcoef_outliers, cov_outliers, dot_outliers = [], [], []\n",
"for idx, words in four_words.iterrows():\n",
"    vectors = bc.encode(list(words.values[:4]))\n",
"\n",
"    # Outlier: the word with the lowest summed correlation to the four words\n",
"    R = np.corrcoef(vectors)\n",
"    indices = np.argsort(R.sum(axis=0))\n",
"    outlier = words.iloc[indices[0]]\n",
"    corrcoef_outliers.append(outlier)\n",
"\n",
"    # Same criterion with the covariance matrix\n",
"    C = np.cov(vectors)\n",
"    indices = np.argsort(C.sum(axis=0))\n",
"    outlier = words.iloc[indices[0]]\n",
"    cov_outliers.append(outlier)\n",
"\n",
"    # Same criterion with raw dot products\n",
"    D = np.dot(vectors, vectors.T)\n",
"    indices = np.argsort(D.sum(axis=0))\n",
"    outlier = words.iloc[indices[0]]\n",
"    dot_outliers.append(outlier)\n",
"\n",
"\n",
"four_words['bert-corrcoef'] = corrcoef_outliers\n",
"four_words['bert-cov'] = cov_outliers\n",
"four_words['bert-dot'] = dot_outliers"
]
},
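{
"cell_type": "markdown",
"metadata": {},
"source": [
"The BERT cell above scores each word by the sum of its row in a pairwise similarity matrix and takes the word with the smallest sum as the outlier. The block below is a minimal, self-contained sketch of that criterion. The helper name `least_similar` and the 3-d toy vectors are hypothetical and are not produced by BERT.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def least_similar(words, vectors, similarity=np.corrcoef):\n",
"    # Pairwise similarity matrix between the rows of `vectors`\n",
"    S = similarity(np.asarray(vectors, dtype=float))\n",
"    # A word that is unlike the rest has a small total similarity\n",
"    return words[int(np.argmin(S.sum(axis=0)))]\n",
"\n",
"\n",
"# Hypothetical toy vectors, chosen only for illustration\n",
"toy_words = ['bil', 'cykel', 'tog', 'vind']\n",
"toy_vectors = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [0.9, 2.2, 3.1], [3.0, 2.0, 1.0]]\n",
"\n",
"# Correlation-based criterion\n",
"print(least_similar(toy_words, toy_vectors))\n",
"# Dot-product criterion\n",
"print(least_similar(toy_words, toy_vectors, similarity=lambda X: X @ X.T))\n",
"```\n",
"\n",
"Raw dot products depend on the length of the vectors, whereas correlation coefficients do not, so the two criteria need not agree on real embeddings."
]
},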
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.32"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(four_words.word4 == four_words['bert-corrcoef'])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.31"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(four_words.word4 == four_words['bert-cov'])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.31"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(four_words.word4 == four_words['bert-dot'])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word1</th>\n",
" <th>word2</th>\n",
" <th>word3</th>\n",
" <th>word4</th>\n",
" <th>bert-corrcoef</th>\n",
" <th>bert-dot</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>æble</td>\n",
" <td>pære</td>\n",
" <td>kirsebær</td>\n",
" <td>stol</td>\n",
" <td>kirsebær</td>\n",
" <td>kirsebær</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>stol</td>\n",
" <td>bord</td>\n",
" <td>reol</td>\n",
" <td>græs</td>\n",
" <td>reol</td>\n",
" <td>græs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>græs</td>\n",
" <td>træ</td>\n",
" <td>blomst</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>bil</td>\n",
" <td>cykel</td>\n",
" <td>tog</td>\n",
" <td>vind</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vind</td>\n",
" <td>regn</td>\n",
" <td>solskin</td>\n",
" <td>mandag</td>\n",
" <td>solskin</td>\n",
" <td>regn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>mandag</td>\n",
" <td>tirsdag</td>\n",
" <td>søndag</td>\n",
" <td>tømrer</td>\n",
" <td>søndag</td>\n",
" <td>søndag</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>tømrer</td>\n",
" <td>vvs-mand</td>\n",
" <td>snedker</td>\n",
" <td>barn</td>\n",
" <td>vvs-mand</td>\n",
" <td>vvs-mand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>barn</td>\n",
" <td>far</td>\n",
" <td>mormor</td>\n",
" <td>lampe</td>\n",
" <td>barn</td>\n",
" <td>far</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>lampe</td>\n",
" <td>stearinlys</td>\n",
" <td>lommelygte</td>\n",
" <td>jern</td>\n",
" <td>stearinlys</td>\n",
" <td>stearinlys</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>jern</td>\n",
" <td>guld</td>\n",
" <td>magnesium</td>\n",
" <td>sjov</td>\n",
" <td>guld</td>\n",
" <td>guld</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>sjov</td>\n",
" <td>dårlig</td>\n",
" <td>vanvittig</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>papir</td>\n",
" <td>ringbind</td>\n",
" <td>blyant</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" <td>ringbind</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>vagt</td>\n",
" <td>politimand</td>\n",
" <td>fængselsbetjent</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" <td>fængselsbetjent</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>by</td>\n",
" <td>landsby</td>\n",
" <td>købstad</td>\n",
" <td>småkage</td>\n",
" <td>by</td>\n",
" <td>landsby</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>småkage</td>\n",
" <td>citronmåne</td>\n",
" <td>kringle</td>\n",
" <td>dør</td>\n",
" <td>dør</td>\n",
" <td>småkage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>dør</td>\n",
" <td>væg</td>\n",
" <td>vindue</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>klaver</td>\n",
" <td>trompet</td>\n",
" <td>blokfløjte</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" <td>klaver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>fandens</td>\n",
" <td>fuck</td>\n",
" <td>sgu</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>vand</td>\n",
" <td>jord</td>\n",
" <td>ild</td>\n",
" <td>hukommelse</td>\n",
" <td>vand</td>\n",
" <td>ild</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>hukommelse</td>\n",
" <td>intelligens</td>\n",
" <td>emotion</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Niels Bohr</td>\n",
" <td>H.C. Ørsted</td>\n",
" <td>Ole Rømer</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Poul Nyrup Rasmussen</td>\n",
" <td>Anders Fogh Rasmussen</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Peter Schmeichel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Kasper Schmeichel</td>\n",
" <td>Brian Laudrup</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Steffi Graf</td>\n",
" <td>Serena Williams</td>\n",
" <td>Monaco</td>\n",
" <td>Serena Williams</td>\n",
" <td>Serena Williams</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Monaco</td>\n",
" <td>Paris</td>\n",
" <td>Milano</td>\n",
" <td>Pia</td>\n",
" <td>Milano</td>\n",
" <td>Milano</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Pia</td>\n",
" <td>Lone</td>\n",
" <td>Marianne</td>\n",
" <td>Ole</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>bold</td>\n",
" <td>fjerbold</td>\n",
" <td>puck</td>\n",
" <td>mave</td>\n",
" <td>fjerbold</td>\n",
" <td>fjerbold</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>mave</td>\n",
" <td>bryst</td>\n",
" <td>ryg</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" <td>mave</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>hat</td>\n",
" <td>kasket</td>\n",
" <td>hue</td>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>ishockey</td>\n",
" <td>skiløb</td>\n",
" <td>skihop</td>\n",
" <td>fodbold</td>\n",
" <td>skiløb</td>\n",
" <td>skiløb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>gå</td>\n",
" <td>løbe</td>\n",
" <td>kravle</td>\n",
" <td>sidde</td>\n",
" <td>sidde</td>\n",
" <td>sidde</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>rød</td>\n",
" <td>blå</td>\n",
" <td>violet</td>\n",
" <td>himmel</td>\n",
" <td>violet</td>\n",
" <td>blå</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>Finland</td>\n",
" <td>Sverige</td>\n",
" <td>Norge</td>\n",
" <td>Kina</td>\n",
" <td>Norge</td>\n",
" <td>Norge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Kina</td>\n",
" <td>Japan</td>\n",
" <td>Sydkorea</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>humor</td>\n",
" <td>komedie</td>\n",
" <td>comedy</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>vaskemaskine</td>\n",
" <td>strygejern</td>\n",
" <td>tørretumbler</td>\n",
" <td>beskidt</td>\n",
" <td>vaskemaskine</td>\n",
" <td>vaskemaskine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>restaurant</td>\n",
" <td>café</td>\n",
" <td>bar</td>\n",
" <td>øl</td>\n",
" <td>restaurant</td>\n",
" <td>restaurant</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>øl</td>\n",
" <td>vin</td>\n",
" <td>spiritus</td>\n",
" <td>køkken</td>\n",
" <td>vin</td>\n",
" <td>vin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>køkken</td>\n",
" <td>baderum</td>\n",
" <td>stue</td>\n",
" <td>øl</td>\n",
" <td>baderum</td>\n",
" <td>baderum</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>wing</td>\n",
" <td>back</td>\n",
" <td>forward</td>\n",
" <td>vinge</td>\n",
" <td>vinge</td>\n",
" <td>forward</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>vinge</td>\n",
" <td>landingsstel</td>\n",
" <td>propel</td>\n",
" <td>kartoffel</td>\n",
" <td>landingsstel</td>\n",
" <td>kartoffel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>kartoffel</td>\n",
" <td>frikadelle</td>\n",
" <td>salat</td>\n",
" <td>pejs</td>\n",
" <td>salat</td>\n",
" <td>salat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>Viborg</td>\n",
" <td>Randers</td>\n",
" <td>Hobro</td>\n",
" <td>Kattegat</td>\n",
" <td>Randers</td>\n",
" <td>Kattegat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>Kattegat</td>\n",
" <td>Øresund</td>\n",
" <td>Alssund</td>\n",
" <td>Sjælland</td>\n",
" <td>Øresund</td>\n",
" <td>Sjælland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>eg</td>\n",
" <td>lærketræ</td>\n",
" <td>æbletræ</td>\n",
" <td>slange</td>\n",
" <td>eg</td>\n",
" <td>lærketræ</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>hugorm</td>\n",
" <td>pyton</td>\n",
" <td>snog</td>\n",
" <td>hund</td>\n",
" <td>pyton</td>\n",
" <td>pyton</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>ko</td>\n",
" <td>so</td>\n",
" <td>hest</td>\n",
" <td>krappe</td>\n",
" <td>so</td>\n",
" <td>so</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>ugle</td>\n",
" <td>krage</td>\n",
" <td>måge</td>\n",
" <td>hund</td>\n",
" <td>krage</td>\n",
" <td>krage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>hund</td>\n",
" <td>ræv</td>\n",
" <td>ulv</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>ræv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>spilletid</td>\n",
" <td>halvleg</td>\n",
" <td>dommer</td>\n",
" <td>ræv</td>\n",
" <td>spilletid</td>\n",
" <td>spilletid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>tv</td>\n",
" <td>radio</td>\n",
" <td>telefon</td>\n",
" <td>klud</td>\n",
" <td>klud</td>\n",
" <td>tv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51</th>\n",
" <td>ondt</td>\n",
" <td>forfærdeligt</td>\n",
" <td>skrækkeligt</td>\n",
" <td>herligt</td>\n",
" <td>ondt</td>\n",
" <td>forfærdeligt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>hoppende</td>\n",
" <td>dansende</td>\n",
" <td>løbende</td>\n",
" <td>døende</td>\n",
" <td>døende</td>\n",
" <td>døende</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>saver</td>\n",
" <td>hamrer</td>\n",
" <td>skruer</td>\n",
" <td>aer</td>\n",
" <td>saver</td>\n",
" <td>hamrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>går</td>\n",
" <td>spadserer</td>\n",
" <td>vandrer</td>\n",
" <td>siger</td>\n",
" <td>går</td>\n",
" <td>spadserer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>gange</td>\n",
" <td>dividere</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" <td>lægge sammen</td>\n",
" <td>lægge sammen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56</th>\n",
" <td>mener</td>\n",
" <td>tror</td>\n",
" <td>ved</td>\n",
" <td>går</td>\n",
" <td>ved</td>\n",
" <td>mener</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57</th>\n",
" <td>fire</td>\n",
" <td>fem</td>\n",
" <td>sytten</td>\n",
" <td>aldrig</td>\n",
" <td>sytten</td>\n",
" <td>sytten</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>Nielsen</td>\n",
" <td>Jensen</td>\n",
" <td>Olsen</td>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" <td>Olsen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>mega</td>\n",
" <td>kæmpe</td>\n",
" <td>enorm</td>\n",
" <td>smule</td>\n",
" <td>kæmpe</td>\n",
" <td>smule</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>kufferter</td>\n",
" <td>tasker</td>\n",
" <td>bæreposer</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>landstræner</td>\n",
" <td>håndboldekspert</td>\n",
" <td>mål</td>\n",
" <td>rum</td>\n",
" <td>landstræner</td>\n",
" <td>landstræner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>trup</td>\n",
" <td>gruppe</td>\n",
" <td>hold</td>\n",
" <td>sti</td>\n",
" <td>gruppe</td>\n",
" <td>gruppe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>grafik</td>\n",
" <td>figur</td>\n",
" <td>plot</td>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>januar</td>\n",
" <td>maj</td>\n",
" <td>juni</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>angiveligt</td>\n",
" <td>muligvis</td>\n",
" <td>sandsynligvis</td>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>66</th>\n",
" <td>oplysninger</td>\n",
" <td>data</td>\n",
" <td>informationer</td>\n",
" <td>fjerner</td>\n",
" <td>fjerner</td>\n",
" <td>oplysninger</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67</th>\n",
" <td>fire minutter</td>\n",
" <td>tre timer</td>\n",
" <td>en uge</td>\n",
" <td>to piger</td>\n",
" <td>to piger</td>\n",
" <td>to piger</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>instrumentbrættet</td>\n",
" <td>motorer</td>\n",
" <td>cockpit</td>\n",
" <td>sagen</td>\n",
" <td>sagen</td>\n",
" <td>instrumentbrættet</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>anmeldelse</td>\n",
" <td>politi</td>\n",
" <td>forbrydelse</td>\n",
" <td>kaffe</td>\n",
" <td>kaffe</td>\n",
" <td>kaffe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>instrueret</td>\n",
" <td>organiseret</td>\n",
" <td>ledet</td>\n",
" <td>skuffet</td>\n",
" <td>instrueret</td>\n",
" <td>instrueret</td>\n",
" </tr>\n",
" <tr>\n",
" <th>71</th>\n",
" <td>billede</td>\n",
" <td>foto</td>\n",
" <td>tegning</td>\n",
" <td>skål</td>\n",
" <td>tegning</td>\n",
" <td>foto</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72</th>\n",
" <td>kapitel</td>\n",
" <td>paragraf</td>\n",
" <td>sektion</td>\n",
" <td>park</td>\n",
" <td>paragraf</td>\n",
" <td>kapitel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>virksomhed</td>\n",
" <td>firma</td>\n",
" <td>selskab</td>\n",
" <td>sovs</td>\n",
" <td>virksomhed</td>\n",
" <td>virksomhed</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>tres</td>\n",
" <td>60</td>\n",
" <td>LX</td>\n",
" <td>3</td>\n",
" <td>LX</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>1864</td>\n",
" <td>1807</td>\n",
" <td>1940</td>\n",
" <td>1909</td>\n",
" <td>1909</td>\n",
" <td>1909</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>diplom</td>\n",
" <td>udmærkelse</td>\n",
" <td>pris</td>\n",
" <td>øremærke</td>\n",
" <td>pris</td>\n",
" <td>øremærke</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>bange</td>\n",
" <td>urolig</td>\n",
" <td>nervøs</td>\n",
" <td>ordentlig</td>\n",
" <td>bange</td>\n",
" <td>ordentlig</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>norsk</td>\n",
" <td>engelsk</td>\n",
" <td>spansk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" <td>engelsk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>mus</td>\n",
" <td>tastatur</td>\n",
" <td>skærm</td>\n",
" <td>bræt</td>\n",
" <td>tastatur</td>\n",
" <td>tastatur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>dør</td>\n",
" <td>kradser af</td>\n",
" <td>udånder</td>\n",
" <td>åbner</td>\n",
" <td>kradser af</td>\n",
" <td>kradser af</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>og</td>\n",
" <td>samt</td>\n",
" <td>endvidere</td>\n",
" <td>sin</td>\n",
" <td>sin</td>\n",
" <td>sin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>hans</td>\n",
" <td>sit</td>\n",
" <td>vores</td>\n",
" <td>vises</td>\n",
" <td>sit</td>\n",
" <td>sit</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>stod og råbte</td>\n",
" <td>lå og sov</td>\n",
" <td>sad og så</td>\n",
" <td>mand og kvinde</td>\n",
" <td>stod og råbte</td>\n",
" <td>mand og kvinde</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>frokost</td>\n",
" <td>morgenmad</td>\n",
" <td>brunch</td>\n",
" <td>måne</td>\n",
" <td>morgenmad</td>\n",
" <td>morgenmad</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" <td>individer</td>\n",
" <td>gange</td>\n",
" <td>personer</td>\n",
" <td>personer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>kidnappe</td>\n",
" <td>røve</td>\n",
" <td>stjæle</td>\n",
" <td>køre</td>\n",
" <td>kidnappe</td>\n",
" <td>stjæle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>råbte</td>\n",
" <td>skreg</td>\n",
" <td>larmede</td>\n",
" <td>vuggede</td>\n",
" <td>skreg</td>\n",
" <td>vuggede</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>kontorist</td>\n",
" <td>embedsmand</td>\n",
" <td>bureaukrat</td>\n",
" <td>spisebord</td>\n",
" <td>bureaukrat</td>\n",
" <td>bureaukrat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>vegetation</td>\n",
" <td>krat</td>\n",
" <td>bed</td>\n",
" <td>skur</td>\n",
" <td>vegetation</td>\n",
" <td>skur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>cyklist</td>\n",
" <td>bilist</td>\n",
" <td>chauffør</td>\n",
" <td>ekspedient</td>\n",
" <td>cyklist</td>\n",
" <td>cyklist</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>bibliotek</td>\n",
" <td>bog</td>\n",
" <td>låner</td>\n",
" <td>flag</td>\n",
" <td>bibliotek</td>\n",
" <td>bibliotek</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>halvsyg</td>\n",
" <td>forkølelse</td>\n",
" <td>hoster</td>\n",
" <td>vej</td>\n",
" <td>forkølelse</td>\n",
" <td>forkølelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>musik</td>\n",
" <td>node</td>\n",
" <td>rytme</td>\n",
" <td>leder</td>\n",
" <td>node</td>\n",
" <td>musik</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>rapport</td>\n",
" <td>sagsakt</td>\n",
" <td>artikel</td>\n",
" <td>spand</td>\n",
" <td>artikel</td>\n",
" <td>artikel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>tekande</td>\n",
" <td>vinflaske</td>\n",
" <td>slikskål</td>\n",
" <td>racerbil</td>\n",
" <td>tekande</td>\n",
" <td>tekande</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>forhører</td>\n",
" <td>spørger</td>\n",
" <td>anmoder</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" <td>anmoder</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>fremtidige</td>\n",
" <td>fortidige</td>\n",
" <td>nutidige</td>\n",
" <td>havdige</td>\n",
" <td>fortidige</td>\n",
" <td>havdige</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>kanal</td>\n",
" <td>flod</td>\n",
" <td>bæk</td>\n",
" <td>spejl</td>\n",
" <td>kanal</td>\n",
" <td>kanal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>kanal</td>\n",
" <td>program</td>\n",
" <td>udsendelse</td>\n",
" <td>vask</td>\n",
" <td>udsendelse</td>\n",
" <td>udsendelse</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word1 word2 word3 \\\n",
"0 æble pære kirsebær \n",
"1 stol bord reol \n",
"2 græs træ blomst \n",
"3 bil cykel tog \n",
"4 vind regn solskin \n",
"5 mandag tirsdag søndag \n",
"6 tømrer vvs-mand snedker \n",
"7 barn far mormor \n",
"8 lampe stearinlys lommelygte \n",
"9 jern guld magnesium \n",
"10 sjov dårlig vanvittig \n",
"11 papir ringbind blyant \n",
"12 vagt politimand fængselsbetjent \n",
"13 by landsby købstad \n",
"14 småkage citronmåne kringle \n",
"15 dør væg vindue \n",
"16 klaver trompet blokfløjte \n",
"17 fandens fuck sgu \n",
"18 vand jord ild \n",
"19 hukommelse intelligens emotion \n",
"20 Niels Bohr H.C. Ørsted Ole Rømer \n",
"21 Lars Løkke Rasmussen Poul Nyrup Rasmussen Anders Fogh Rasmussen \n",
"22 Peter Schmeichel Kasper Schmeichel Brian Laudrup \n",
"23 Caroline Wozniacki Steffi Graf Serena Williams \n",
"24 Monaco Paris Milano \n",
"25 Pia Lone Marianne \n",
"26 bold fjerbold puck \n",
"27 mave bryst ryg \n",
"28 hat kasket hue \n",
"29 ishockey skiløb skihop \n",
"30 gå løbe kravle \n",
"31 rød blå violet \n",
"32 Finland Sverige Norge \n",
"33 Kina Japan Sydkorea \n",
"34 humor komedie comedy \n",
"35 vaskemaskine strygejern tørretumbler \n",
"36 restaurant café bar \n",
"37 øl vin spiritus \n",
"38 køkken baderum stue \n",
"39 wing back forward \n",
"40 vinge landingsstel propel \n",
"41 kartoffel frikadelle salat \n",
"42 Viborg Randers Hobro \n",
"43 Kattegat Øresund Alssund \n",
"44 eg lærketræ æbletræ \n",
"45 hugorm pyton snog \n",
"46 ko so hest \n",
"47 ugle krage måge \n",
"48 hund ræv ulv \n",
"49 spilletid halvleg dommer \n",
"50 tv radio telefon \n",
"51 ondt forfærdeligt skrækkeligt \n",
"52 hoppende dansende løbende \n",
"53 saver hamrer skruer \n",
"54 går spadserer vandrer \n",
"55 gange dividere lægge sammen \n",
"56 mener tror ved \n",
"57 fire fem sytten \n",
"58 Nielsen Jensen Olsen \n",
"59 mega kæmpe enorm \n",
"60 kufferter tasker bæreposer \n",
"61 landstræner håndboldekspert mål \n",
"62 trup gruppe hold \n",
"63 grafik figur plot \n",
"64 januar maj juni \n",
"65 angiveligt muligvis sandsynligvis \n",
"66 oplysninger data informationer \n",
"67 fire minutter tre timer en uge \n",
"68 instrumentbrættet motorer cockpit \n",
"69 anmeldelse politi forbrydelse \n",
"70 instrueret organiseret ledet \n",
"71 billede foto tegning \n",
"72 kapitel paragraf sektion \n",
"73 virksomhed firma selskab \n",
"74 tres 60 LX \n",
"75 1864 1807 1940 \n",
"76 diplom udmærkelse pris \n",
"77 bange urolig nervøs \n",
"78 norsk engelsk spansk \n",
"79 mus tastatur skærm \n",
"80 dør kradser af udånder \n",
"81 og samt endvidere \n",
"82 hans sit vores \n",
"83 stod og råbte lå og sov sad og så \n",
"84 frokost morgenmad brunch \n",
"85 mænd personer individer \n",
"86 kidnappe røve stjæle \n",
"87 råbte skreg larmede \n",
"88 kontorist embedsmand bureaukrat \n",
"89 vegetation krat bed \n",
"90 cyklist bilist chauffør \n",
"91 bibliotek bog låner \n",
"92 halvsyg forkølelse hoster \n",
"93 musik node rytme \n",
"94 rapport sagsakt artikel \n",
"95 tekande vinflaske slikskål \n",
"96 forhører spørger anmoder \n",
"97 fremtidige fortidige nutidige \n",
"98 kanal flod bæk \n",
"99 kanal program udsendelse \n",
"\n",
" word4 bert-corrcoef bert-dot \n",
"0 stol kirsebær kirsebær \n",
"1 græs reol græs \n",
"2 bil bil bil \n",
"3 vind bil bil \n",
"4 mandag solskin regn \n",
"5 tømrer søndag søndag \n",
"6 barn vvs-mand vvs-mand \n",
"7 lampe barn far \n",
"8 jern stearinlys stearinlys \n",
"9 sjov guld guld \n",
"10 papir papir papir \n",
"11 vagt vagt ringbind \n",
"12 by by fængselsbetjent \n",
"13 småkage by landsby \n",
"14 dør dør småkage \n",
"15 klaver klaver klaver \n",
"16 fandens fandens klaver \n",
"17 vand vand vand \n",
"18 hukommelse vand ild \n",
"19 Niels Bohr Niels Bohr Niels Bohr \n",
"20 Lars Løkke Rasmussen Niels Bohr Niels Bohr \n",
"21 Peter Schmeichel Peter Schmeichel Peter Schmeichel \n",
"22 Caroline Wozniacki Caroline Wozniacki Caroline Wozniacki \n",
"23 Monaco Serena Williams Serena Williams \n",
"24 Pia Milano Milano \n",
"25 Ole Pia Pia \n",
"26 mave fjerbold fjerbold \n",
"27 hat hat mave \n",
"28 ishockey ishockey ishockey \n",
"29 fodbold skiløb skiløb \n",
"30 sidde sidde sidde \n",
"31 himmel violet blå \n",
"32 Kina Norge Norge \n",
"33 Irland Irland Irland \n",
"34 beskidt beskidt beskidt \n",
"35 beskidt vaskemaskine vaskemaskine \n",
"36 øl restaurant restaurant \n",
"37 køkken vin vin \n",
"38 øl baderum baderum \n",
"39 vinge vinge forward \n",
"40 kartoffel landingsstel kartoffel \n",
"41 pejs salat salat \n",
"42 Kattegat Randers Kattegat \n",
"43 Sjælland Øresund Sjælland \n",
"44 slange eg lærketræ \n",
"45 hund pyton pyton \n",
"46 krappe so so \n",
"47 hund krage krage \n",
"48 krappe krappe ræv \n",
"49 ræv spilletid spilletid \n",
"50 klud klud tv \n",
"51 herligt ondt forfærdeligt \n",
"52 døende døende døende \n",
"53 aer saver hamrer \n",
"54 siger går spadserer \n",
"55 vandrer lægge sammen lægge sammen \n",
"56 går ved mener \n",
"57 aldrig sytten sytten \n",
"58 kassen Nielsen Olsen \n",
"59 smule kæmpe smule \n",
"60 styrelser styrelser styrelser \n",
"61 rum landstræner landstræner \n",
"62 sti gruppe gruppe \n",
"63 lån lån lån \n",
"64 ur ur ur \n",
"65 nutidigt nutidigt nutidigt \n",
"66 fjerner fjerner oplysninger \n",
"67 to piger to piger to piger \n",
"68 sagen sagen instrumentbrættet \n",
"69 kaffe kaffe kaffe \n",
"70 skuffet instrueret instrueret \n",
"71 skål tegning foto \n",
"72 park paragraf kapitel \n",
"73 sovs virksomhed virksomhed \n",
"74 3 LX 60 \n",
"75 1909 1909 1909 \n",
"76 øremærke pris øremærke \n",
"77 ordentlig bange ordentlig \n",
"78 falsk falsk engelsk \n",
"79 bræt tastatur tastatur \n",
"80 åbner kradser af kradser af \n",
"81 sin sin sin \n",
"82 vises sit sit \n",
"83 mand og kvinde stod og råbte mand og kvinde \n",
"84 måne morgenmad morgenmad \n",
"85 gange personer personer \n",
"86 køre kidnappe stjæle \n",
"87 vuggede skreg vuggede \n",
"88 spisebord bureaukrat bureaukrat \n",
"89 skur vegetation skur \n",
"90 ekspedient cyklist cyklist \n",
"91 flag bibliotek bibliotek \n",
"92 vej forkølelse forkølelse \n",
"93 leder node musik \n",
"94 spand artikel artikel \n",
"95 racerbil tekande tekande \n",
"96 banker banker anmoder \n",
"97 havdige fortidige havdige \n",
"98 spejl kanal kanal \n",
"99 vask udsendelse udsendelse "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"with pd.option_context(\"display.max_rows\", 100):\n",
"    display(four_words[['word1', 'word2', 'word3', 'word4', 'bert-corrcoef', 'bert-dot']])"
]
},
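{
"cell_type": "markdown",
"metadata": {},
"source": [
"BPEmb\n",
"-----\n",
"BPEmb provides pre-trained subword embeddings based on byte-pair encoding. Each word is represented here by the mean of its subword vectors.\n",
"\n",
"* Heinzerling, B., Strube, M.: BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. LREC (2018), http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf"
]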
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Evaluate a grid of BPEmb embedding dimensions and vocabulary sizes\n",
"dimensions = [25, 50, 100, 200, 300]\n",
"vocabulary_sizes = [1000, 3000, 5000, 10000, 25000, 50000, 100000, 200000]\n",
"for dimension in dimensions:\n",
"    for vocabulary_size in vocabulary_sizes:\n",
"        bpemb = BPEmb(lang=\"da\", vs=vocabulary_size, dim=dimension)\n",
"\n",
"        outliers = []\n",
"        for idx, words in four_words.iterrows():\n",
"            # Represent each word by the mean of its subword vectors\n",
"            vectors = np.zeros((4, dimension))\n",
"            for j, word in enumerate(words.iloc[:4]):\n",
"                vectors[j, :] = bpemb.embed(word).mean(axis=0)\n",
"\n",
"            # Identify the outlier as the word least correlated with the others\n",
"            R = np.corrcoef(vectors)\n",
"            indices = np.argsort(R.sum(axis=0))\n",
"            outlier = words.iloc[indices[0]]\n",
"            outliers.append(outlier)\n",
"\n",
"        four_words['bpemb-' + str(vocabulary_size) + '-' + str(dimension)] = outliers"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.36 bpemb-1000-25\n",
"0.45 bpemb-3000-25\n",
"0.52 bpemb-5000-25\n",
"0.56 bpemb-10000-25\n",
"0.58 bpemb-25000-25\n",
"0.58 bpemb-50000-25\n",
"0.58 bpemb-100000-25\n",
"0.60 bpemb-200000-25\n",
"0.34 bpemb-1000-50\n",
"0.42 bpemb-3000-50\n",
"0.50 bpemb-5000-50\n",
"0.59 bpemb-10000-50\n",
"0.58 bpemb-25000-50\n",
"0.63 bpemb-50000-50\n",
"0.63 bpemb-100000-50\n",
"0.64 bpemb-200000-50\n",
"0.34 bpemb-1000-100\n",
"0.48 bpemb-3000-100\n",
"0.51 bpemb-5000-100\n",
"0.59 bpemb-10000-100\n",
"0.62 bpemb-25000-100\n",
"0.65 bpemb-50000-100\n",
"0.63 bpemb-100000-100\n",
"0.67 bpemb-200000-100\n",
"0.36 bpemb-1000-200\n",
"0.47 bpemb-3000-200\n",
"0.54 bpemb-5000-200\n",
"0.63 bpemb-10000-200\n",
"0.63 bpemb-25000-200\n",
"0.69 bpemb-50000-200\n",
"0.69 bpemb-100000-200\n",
"0.67 bpemb-200000-200\n",
"0.33 bpemb-1000-300\n",
"0.47 bpemb-3000-300\n",
"0.55 bpemb-5000-300\n",
"0.59 bpemb-10000-300\n",
"0.67 bpemb-25000-300\n",
"0.69 bpemb-50000-300\n",
"0.69 bpemb-100000-300\n",
"0.64 bpemb-200000-300\n"
]
}
],
"source": [
"# Accuracy: fraction of rows where the detected outlier equals the intruder (word4)\n",
"for column in four_words.columns:\n",
"    if column.startswith('bpemb-'):\n",
"        print(\"{:.02f} {}\".format(np.mean(four_words.word4 == four_words[column]), column))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word1</th>\n",
" <th>word2</th>\n",
" <th>word3</th>\n",
" <th>word4</th>\n",
" <th>bpemb-200000-300</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>æble</td>\n",
" <td>pære</td>\n",
" <td>kirsebær</td>\n",
" <td>stol</td>\n",
" <td>stol</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>stol</td>\n",
" <td>bord</td>\n",
" <td>reol</td>\n",
" <td>græs</td>\n",
" <td>reol</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>græs</td>\n",
" <td>træ</td>\n",
" <td>blomst</td>\n",
" <td>bil</td>\n",
" <td>blomst</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>bil</td>\n",
" <td>cykel</td>\n",
" <td>tog</td>\n",
" <td>vind</td>\n",
" <td>vind</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vind</td>\n",
" <td>regn</td>\n",
" <td>solskin</td>\n",
" <td>mandag</td>\n",
" <td>mandag</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>mandag</td>\n",
" <td>tirsdag</td>\n",
" <td>søndag</td>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>tømrer</td>\n",
" <td>vvs-mand</td>\n",
" <td>snedker</td>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>barn</td>\n",
" <td>far</td>\n",
" <td>mormor</td>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>lampe</td>\n",
" <td>stearinlys</td>\n",
" <td>lommelygte</td>\n",
" <td>jern</td>\n",
" <td>lommelygte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>jern</td>\n",
" <td>guld</td>\n",
" <td>magnesium</td>\n",
" <td>sjov</td>\n",
" <td>sjov</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>sjov</td>\n",
" <td>dårlig</td>\n",
" <td>vanvittig</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>papir</td>\n",
" <td>ringbind</td>\n",
" <td>blyant</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>vagt</td>\n",
" <td>politimand</td>\n",
" <td>fængselsbetjent</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>by</td>\n",
" <td>landsby</td>\n",
" <td>købstad</td>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>småkage</td>\n",
" <td>citronmåne</td>\n",
" <td>kringle</td>\n",
" <td>dør</td>\n",
" <td>dør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>dør</td>\n",
" <td>væg</td>\n",
" <td>vindue</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>klaver</td>\n",
" <td>trompet</td>\n",
" <td>blokfløjte</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>fandens</td>\n",
" <td>fuck</td>\n",
" <td>sgu</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>vand</td>\n",
" <td>jord</td>\n",
" <td>ild</td>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>hukommelse</td>\n",
" <td>intelligens</td>\n",
" <td>emotion</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Niels Bohr</td>\n",
" <td>H.C. Ørsted</td>\n",
" <td>Ole Rømer</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Ole Rømer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Poul Nyrup Rasmussen</td>\n",
" <td>Anders Fogh Rasmussen</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Peter Schmeichel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Kasper Schmeichel</td>\n",
" <td>Brian Laudrup</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Steffi Graf</td>\n",
" <td>Serena Williams</td>\n",
" <td>Monaco</td>\n",
" <td>Monaco</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Monaco</td>\n",
" <td>Paris</td>\n",
" <td>Milano</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Pia</td>\n",
" <td>Lone</td>\n",
" <td>Marianne</td>\n",
" <td>Ole</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>bold</td>\n",
" <td>fjerbold</td>\n",
" <td>puck</td>\n",
" <td>mave</td>\n",
" <td>puck</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>mave</td>\n",
" <td>bryst</td>\n",
" <td>ryg</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>hat</td>\n",
" <td>kasket</td>\n",
" <td>hue</td>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>ishockey</td>\n",
" <td>skiløb</td>\n",
" <td>skihop</td>\n",
" <td>fodbold</td>\n",
" <td>skihop</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>gå</td>\n",
" <td>løbe</td>\n",
" <td>kravle</td>\n",
" <td>sidde</td>\n",
" <td>sidde</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>rød</td>\n",
" <td>blå</td>\n",
" <td>violet</td>\n",
" <td>himmel</td>\n",
" <td>violet</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>Finland</td>\n",
" <td>Sverige</td>\n",
" <td>Norge</td>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Kina</td>\n",
" <td>Japan</td>\n",
" <td>Sydkorea</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>humor</td>\n",
" <td>komedie</td>\n",
" <td>comedy</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>vaskemaskine</td>\n",
" <td>strygejern</td>\n",
" <td>tørretumbler</td>\n",
" <td>beskidt</td>\n",
" <td>strygejern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>restaurant</td>\n",
" <td>café</td>\n",
" <td>bar</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>øl</td>\n",
" <td>vin</td>\n",
" <td>spiritus</td>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>køkken</td>\n",
" <td>baderum</td>\n",
" <td>stue</td>\n",
" <td>øl</td>\n",
" <td>baderum</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>wing</td>\n",
" <td>back</td>\n",
" <td>forward</td>\n",
" <td>vinge</td>\n",
" <td>vinge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>vinge</td>\n",
" <td>landingsstel</td>\n",
" <td>propel</td>\n",
" <td>kartoffel</td>\n",
" <td>kartoffel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>kartoffel</td>\n",
" <td>frikadelle</td>\n",
" <td>salat</td>\n",
" <td>pejs</td>\n",
" <td>pejs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>Viborg</td>\n",
" <td>Randers</td>\n",
" <td>Hobro</td>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>Kattegat</td>\n",
" <td>Øresund</td>\n",
" <td>Alssund</td>\n",
" <td>Sjælland</td>\n",
" <td>Alssund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>eg</td>\n",
" <td>lærketræ</td>\n",
" <td>æbletræ</td>\n",
" <td>slange</td>\n",
" <td>eg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>hugorm</td>\n",
" <td>pyton</td>\n",
" <td>snog</td>\n",
" <td>hund</td>\n",
" <td>pyton</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>ko</td>\n",
" <td>so</td>\n",
" <td>hest</td>\n",
" <td>krappe</td>\n",
" <td>hest</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>ugle</td>\n",
" <td>krage</td>\n",
" <td>måge</td>\n",
" <td>hund</td>\n",
" <td>ugle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>hund</td>\n",
" <td>ræv</td>\n",
" <td>ulv</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>spilletid</td>\n",
" <td>halvleg</td>\n",
" <td>dommer</td>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>tv</td>\n",
" <td>radio</td>\n",
" <td>telefon</td>\n",
" <td>klud</td>\n",
" <td>klud</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51</th>\n",
" <td>ondt</td>\n",
" <td>forfærdeligt</td>\n",
" <td>skrækkeligt</td>\n",
" <td>herligt</td>\n",
" <td>herligt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>hoppende</td>\n",
" <td>dansende</td>\n",
" <td>løbende</td>\n",
" <td>døende</td>\n",
" <td>løbende</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>saver</td>\n",
" <td>hamrer</td>\n",
" <td>skruer</td>\n",
" <td>aer</td>\n",
" <td>hamrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>går</td>\n",
" <td>spadserer</td>\n",
" <td>vandrer</td>\n",
" <td>siger</td>\n",
" <td>spadserer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>gange</td>\n",
" <td>dividere</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" <td>vandrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56</th>\n",
" <td>mener</td>\n",
" <td>tror</td>\n",
" <td>ved</td>\n",
" <td>går</td>\n",
" <td>ved</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57</th>\n",
" <td>fire</td>\n",
" <td>fem</td>\n",
" <td>sytten</td>\n",
" <td>aldrig</td>\n",
" <td>sytten</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>Nielsen</td>\n",
" <td>Jensen</td>\n",
" <td>Olsen</td>\n",
" <td>kassen</td>\n",
" <td>kassen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>mega</td>\n",
" <td>kæmpe</td>\n",
" <td>enorm</td>\n",
" <td>smule</td>\n",
" <td>mega</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>kufferter</td>\n",
" <td>tasker</td>\n",
" <td>bæreposer</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>landstræner</td>\n",
" <td>håndboldekspert</td>\n",
" <td>mål</td>\n",
" <td>rum</td>\n",
" <td>landstræner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>trup</td>\n",
" <td>gruppe</td>\n",
" <td>hold</td>\n",
" <td>sti</td>\n",
" <td>sti</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>grafik</td>\n",
" <td>figur</td>\n",
" <td>plot</td>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>januar</td>\n",
" <td>maj</td>\n",
" <td>juni</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>angiveligt</td>\n",
" <td>muligvis</td>\n",
" <td>sandsynligvis</td>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>66</th>\n",
" <td>oplysninger</td>\n",
" <td>data</td>\n",
" <td>informationer</td>\n",
" <td>fjerner</td>\n",
" <td>fjerner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67</th>\n",
" <td>fire minutter</td>\n",
" <td>tre timer</td>\n",
" <td>en uge</td>\n",
" <td>to piger</td>\n",
" <td>en uge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>instrumentbrættet</td>\n",
" <td>motorer</td>\n",
" <td>cockpit</td>\n",
" <td>sagen</td>\n",
" <td>sagen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>anmeldelse</td>\n",
" <td>politi</td>\n",
" <td>forbrydelse</td>\n",
" <td>kaffe</td>\n",
" <td>kaffe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>instrueret</td>\n",
" <td>organiseret</td>\n",
" <td>ledet</td>\n",
" <td>skuffet</td>\n",
" <td>skuffet</td>\n",
" </tr>\n",
" <tr>\n",
" <th>71</th>\n",
" <td>billede</td>\n",
" <td>foto</td>\n",
" <td>tegning</td>\n",
" <td>skål</td>\n",
" <td>skål</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72</th>\n",
" <td>kapitel</td>\n",
" <td>paragraf</td>\n",
" <td>sektion</td>\n",
" <td>park</td>\n",
" <td>park</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>virksomhed</td>\n",
" <td>firma</td>\n",
" <td>selskab</td>\n",
" <td>sovs</td>\n",
" <td>sovs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>tres</td>\n",
" <td>60</td>\n",
" <td>LX</td>\n",
" <td>3</td>\n",
" <td>LX</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>1864</td>\n",
" <td>1807</td>\n",
" <td>1940</td>\n",
" <td>1909</td>\n",
" <td>1864</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>diplom</td>\n",
" <td>udmærkelse</td>\n",
" <td>pris</td>\n",
" <td>øremærke</td>\n",
" <td>øremærke</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>bange</td>\n",
" <td>urolig</td>\n",
" <td>nervøs</td>\n",
" <td>ordentlig</td>\n",
" <td>ordentlig</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>norsk</td>\n",
" <td>engelsk</td>\n",
" <td>spansk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>mus</td>\n",
" <td>tastatur</td>\n",
" <td>skærm</td>\n",
" <td>bræt</td>\n",
" <td>bræt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>dør</td>\n",
" <td>kradser af</td>\n",
" <td>udånder</td>\n",
" <td>åbner</td>\n",
" <td>kradser af</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>og</td>\n",
" <td>samt</td>\n",
" <td>endvidere</td>\n",
" <td>sin</td>\n",
" <td>endvidere</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>hans</td>\n",
" <td>sit</td>\n",
" <td>vores</td>\n",
" <td>vises</td>\n",
" <td>vises</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>stod og råbte</td>\n",
" <td>lå og sov</td>\n",
" <td>sad og så</td>\n",
" <td>mand og kvinde</td>\n",
" <td>mand og kvinde</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>frokost</td>\n",
" <td>morgenmad</td>\n",
" <td>brunch</td>\n",
" <td>måne</td>\n",
" <td>brunch</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" <td>individer</td>\n",
" <td>gange</td>\n",
" <td>individer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>kidnappe</td>\n",
" <td>røve</td>\n",
" <td>stjæle</td>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>råbte</td>\n",
" <td>skreg</td>\n",
" <td>larmede</td>\n",
" <td>vuggede</td>\n",
" <td>larmede</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>kontorist</td>\n",
" <td>embedsmand</td>\n",
" <td>bureaukrat</td>\n",
" <td>spisebord</td>\n",
" <td>spisebord</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>vegetation</td>\n",
" <td>krat</td>\n",
" <td>bed</td>\n",
" <td>skur</td>\n",
" <td>bed</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>cyklist</td>\n",
" <td>bilist</td>\n",
" <td>chauffør</td>\n",
" <td>ekspedient</td>\n",
" <td>chauffør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>bibliotek</td>\n",
" <td>bog</td>\n",
" <td>låner</td>\n",
" <td>flag</td>\n",
" <td>låner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>halvsyg</td>\n",
" <td>forkølelse</td>\n",
" <td>hoster</td>\n",
" <td>vej</td>\n",
" <td>hoster</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>musik</td>\n",
" <td>node</td>\n",
" <td>rytme</td>\n",
" <td>leder</td>\n",
" <td>node</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>rapport</td>\n",
" <td>sagsakt</td>\n",
" <td>artikel</td>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>tekande</td>\n",
" <td>vinflaske</td>\n",
" <td>slikskål</td>\n",
" <td>racerbil</td>\n",
" <td>racerbil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>forhører</td>\n",
" <td>spørger</td>\n",
" <td>anmoder</td>\n",
" <td>banker</td>\n",
" <td>forhører</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>fremtidige</td>\n",
" <td>fortidige</td>\n",
" <td>nutidige</td>\n",
" <td>havdige</td>\n",
" <td>havdige</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>kanal</td>\n",
" <td>flod</td>\n",
" <td>bæk</td>\n",
" <td>spejl</td>\n",
" <td>spejl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>kanal</td>\n",
" <td>program</td>\n",
" <td>udsendelse</td>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word1 word2 word3 \\\n",
"0 æble pære kirsebær \n",
"1 stol bord reol \n",
"2 græs træ blomst \n",
"3 bil cykel tog \n",
"4 vind regn solskin \n",
"5 mandag tirsdag søndag \n",
"6 tømrer vvs-mand snedker \n",
"7 barn far mormor \n",
"8 lampe stearinlys lommelygte \n",
"9 jern guld magnesium \n",
"10 sjov dårlig vanvittig \n",
"11 papir ringbind blyant \n",
"12 vagt politimand fængselsbetjent \n",
"13 by landsby købstad \n",
"14 småkage citronmåne kringle \n",
"15 dør væg vindue \n",
"16 klaver trompet blokfløjte \n",
"17 fandens fuck sgu \n",
"18 vand jord ild \n",
"19 hukommelse intelligens emotion \n",
"20 Niels Bohr H.C. Ørsted Ole Rømer \n",
"21 Lars Løkke Rasmussen Poul Nyrup Rasmussen Anders Fogh Rasmussen \n",
"22 Peter Schmeichel Kasper Schmeichel Brian Laudrup \n",
"23 Caroline Wozniacki Steffi Graf Serena Williams \n",
"24 Monaco Paris Milano \n",
"25 Pia Lone Marianne \n",
"26 bold fjerbold puck \n",
"27 mave bryst ryg \n",
"28 hat kasket hue \n",
"29 ishockey skiløb skihop \n",
"30 gå løbe kravle \n",
"31 rød blå violet \n",
"32 Finland Sverige Norge \n",
"33 Kina Japan Sydkorea \n",
"34 humor komedie comedy \n",
"35 vaskemaskine strygejern tørretumbler \n",
"36 restaurant café bar \n",
"37 øl vin spiritus \n",
"38 køkken baderum stue \n",
"39 wing back forward \n",
"40 vinge landingsstel propel \n",
"41 kartoffel frikadelle salat \n",
"42 Viborg Randers Hobro \n",
"43 Kattegat Øresund Alssund \n",
"44 eg lærketræ æbletræ \n",
"45 hugorm pyton snog \n",
"46 ko so hest \n",
"47 ugle krage måge \n",
"48 hund ræv ulv \n",
"49 spilletid halvleg dommer \n",
"50 tv radio telefon \n",
"51 ondt forfærdeligt skrækkeligt \n",
"52 hoppende dansende løbende \n",
"53 saver hamrer skruer \n",
"54 går spadserer vandrer \n",
"55 gange dividere lægge sammen \n",
"56 mener tror ved \n",
"57 fire fem sytten \n",
"58 Nielsen Jensen Olsen \n",
"59 mega kæmpe enorm \n",
"60 kufferter tasker bæreposer \n",
"61 landstræner håndboldekspert mål \n",
"62 trup gruppe hold \n",
"63 grafik figur plot \n",
"64 januar maj juni \n",
"65 angiveligt muligvis sandsynligvis \n",
"66 oplysninger data informationer \n",
"67 fire minutter tre timer en uge \n",
"68 instrumentbrættet motorer cockpit \n",
"69 anmeldelse politi forbrydelse \n",
"70 instrueret organiseret ledet \n",
"71 billede foto tegning \n",
"72 kapitel paragraf sektion \n",
"73 virksomhed firma selskab \n",
"74 tres 60 LX \n",
"75 1864 1807 1940 \n",
"76 diplom udmærkelse pris \n",
"77 bange urolig nervøs \n",
"78 norsk engelsk spansk \n",
"79 mus tastatur skærm \n",
"80 dør kradser af udånder \n",
"81 og samt endvidere \n",
"82 hans sit vores \n",
"83 stod og råbte lå og sov sad og så \n",
"84 frokost morgenmad brunch \n",
"85 mænd personer individer \n",
"86 kidnappe røve stjæle \n",
"87 råbte skreg larmede \n",
"88 kontorist embedsmand bureaukrat \n",
"89 vegetation krat bed \n",
"90 cyklist bilist chauffør \n",
"91 bibliotek bog låner \n",
"92 halvsyg forkølelse hoster \n",
"93 musik node rytme \n",
"94 rapport sagsakt artikel \n",
"95 tekande vinflaske slikskål \n",
"96 forhører spørger anmoder \n",
"97 fremtidige fortidige nutidige \n",
"98 kanal flod bæk \n",
"99 kanal program udsendelse \n",
"\n",
" word4 bpemb-200000-300 \n",
"0 stol stol \n",
"1 græs reol \n",
"2 bil blomst \n",
"3 vind vind \n",
"4 mandag mandag \n",
"5 tømrer tømrer \n",
"6 barn barn \n",
"7 lampe lampe \n",
"8 jern lommelygte \n",
"9 sjov sjov \n",
"10 papir papir \n",
"11 vagt vagt \n",
"12 by by \n",
"13 småkage småkage \n",
"14 dør dør \n",
"15 klaver klaver \n",
"16 fandens fandens \n",
"17 vand vand \n",
"18 hukommelse hukommelse \n",
"19 Niels Bohr Niels Bohr \n",
"20 Lars Løkke Rasmussen Ole Rømer \n",
"21 Peter Schmeichel Peter Schmeichel \n",
"22 Caroline Wozniacki Caroline Wozniacki \n",
"23 Monaco Monaco \n",
"24 Pia Pia \n",
"25 Ole Pia \n",
"26 mave puck \n",
"27 hat hat \n",
"28 ishockey ishockey \n",
"29 fodbold skihop \n",
"30 sidde sidde \n",
"31 himmel violet \n",
"32 Kina Kina \n",
"33 Irland Irland \n",
"34 beskidt beskidt \n",
"35 beskidt strygejern \n",
"36 øl øl \n",
"37 køkken køkken \n",
"38 øl baderum \n",
"39 vinge vinge \n",
"40 kartoffel kartoffel \n",
"41 pejs pejs \n",
"42 Kattegat Kattegat \n",
"43 Sjælland Alssund \n",
"44 slange eg \n",
"45 hund pyton \n",
"46 krappe hest \n",
"47 hund ugle \n",
"48 krappe krappe \n",
"49 ræv ræv \n",
"50 klud klud \n",
"51 herligt herligt \n",
"52 døende løbende \n",
"53 aer hamrer \n",
"54 siger spadserer \n",
"55 vandrer vandrer \n",
"56 går ved \n",
"57 aldrig sytten \n",
"58 kassen kassen \n",
"59 smule mega \n",
"60 styrelser styrelser \n",
"61 rum landstræner \n",
"62 sti sti \n",
"63 lån lån \n",
"64 ur ur \n",
"65 nutidigt nutidigt \n",
"66 fjerner fjerner \n",
"67 to piger en uge \n",
"68 sagen sagen \n",
"69 kaffe kaffe \n",
"70 skuffet skuffet \n",
"71 skål skål \n",
"72 park park \n",
"73 sovs sovs \n",
"74 3 LX \n",
"75 1909 1864 \n",
"76 øremærke øremærke \n",
"77 ordentlig ordentlig \n",
"78 falsk falsk \n",
"79 bræt bræt \n",
"80 åbner kradser af \n",
"81 sin endvidere \n",
"82 vises vises \n",
"83 mand og kvinde mand og kvinde \n",
"84 måne brunch \n",
"85 gange individer \n",
"86 køre køre \n",
"87 vuggede larmede \n",
"88 spisebord spisebord \n",
"89 skur bed \n",
"90 ekspedient chauffør \n",
"91 flag låner \n",
"92 vej hoster \n",
"93 leder node \n",
"94 spand spand \n",
"95 racerbil racerbil \n",
"96 banker forhører \n",
"97 havdige havdige \n",
"98 spejl spejl \n",
"99 vask vask "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"with pd.option_context(\"display.max_rows\", 100):\n",
"    display(four_words[['word1', 'word2', 'word3', 'word4', 'bpemb-200000-300']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wembedder\n",
"---------\n",
"* Nielsen, F.Å.: Wembedder: Wikidata entity embedding web service (2017), https://arxiv.org/pdf/1710.04099"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Create a utility function to search with the Wikidata API.\n",
"\n",
"HEADERS = {'User-Agent': 'Finn Årup Nielsen, http://people.compute.dtu.dk/faan/'}\n",
"\n",
"# This code has been modified from Scholia - https://github.com/fnielsen/scholia\n",
"# There is a cache to avoid searching multiple times.\n",
"# The results should have been stored if we wanted to have full reproducibility! :(\n",
"@lru_cache(maxsize=500)\n",
"def search(query, limit=10):\n",
"    \"\"\"Search Wikidata.\n",
"\n",
"    Parameters\n",
"    ----------\n",
"    query : str\n",
"        Query string.\n",
"    limit : int, optional\n",
"        Maximum number of search results to return.\n",
"        Note that it is not forwarded to the API in this version,\n",
"        so the API's own default limit applies.\n",
"\n",
"    Returns\n",
"    -------\n",
"    results : list of dict\n",
"        Each dict has the keys 'q' (Wikidata Q-identifier) and 'description'.\n",
"\n",
"    \"\"\"\n",
"    params = {\n",
"        \"action\": \"wbsearchentities\",\n",
"        \"search\": query,\n",
"        \"format\": \"json\",\n",
"        \"language\": \"da\",\n",
"        \"uselang\": \"da\",\n",
"        \"type\": \"item\"}\n",
"\n",
"    # Query the Wikidata API\n",
"    response = requests.get(\n",
"        \"https://www.wikidata.org/w/api.php\",\n",
"        params=params,\n",
"        headers=HEADERS)\n",
"\n",
"    # Convert the response\n",
"    response_data = response.json()\n",
"    items = response_data['search']\n",
"    results = [\n",
"        {'q': item['title'],\n",
"         'description': item.get('description', '')}\n",
"        for item in items]\n",
"\n",
"    return results"
]
},
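{
"cell_type": "markdown",
"metadata": {},
"source": [
"The comment above notes that the search results were not stored. A minimal sketch of how they could be persisted is shown below. It is not part of the original analysis: the file name `wikidata_search_cache.json` and the helpers `load_cache` and `cached_search` are hypothetical, and the search function is passed in as an argument.\n",
"\n",
"```python\n",
"import json\n",
"from pathlib import Path\n",
"\n",
"# Hypothetical location of the on-disk cache\n",
"CACHE_PATH = Path('wikidata_search_cache.json')\n",
"\n",
"\n",
"def load_cache(path=CACHE_PATH):\n",
"    # Return previously stored results, or an empty dict on the first run\n",
"    if path.exists():\n",
"        return json.loads(path.read_text(encoding='utf-8'))\n",
"    return {}\n",
"\n",
"\n",
"def cached_search(query, search_func, cache, path=CACHE_PATH):\n",
"    # Call the API only for unseen queries and write the cache back to disk\n",
"    if query not in cache:\n",
"        cache[query] = search_func(query)\n",
"        path.write_text(\n",
"            json.dumps(cache, ensure_ascii=False, indent=2),\n",
"            encoding='utf-8')\n",
"    return cache[query]\n",
"```\n",
"\n",
"With such a cache, a lookup could be written as `cached_search(word, search, cache)` with `cache = load_cache()`, so that a rerun reuses the stored Q-identifiers even if the Wikidata search index has changed in the meantime."
]
},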
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'q': 'Q12284', 'description': 'Kunstigt skabt vandløb'},\n",
" {'q': 'Q1432092', 'description': 'fissure, especially on the Moon'},\n",
" {'q': 'Q3435496', 'description': 'municipality of Slovenia'},\n",
" {'q': 'Q37516453', 'description': 'efternavn'},\n",
" {'q': 'Q1012996', 'description': 'human settlement'},\n",
" {'q': 'Q52676370',\n",
" 'description': 'contemporary art museum in Brussels, Belgium'},\n",
" {'q': 'Q42314', 'description': 'archipelago in the English Channel'}]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check to see whether the Wikidata API search returns something appropriate\n",
"search('kanal')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"wembedder_model = wembedder.model.Model.load()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'filename': 'wikidata-20170613-truthy-BETA-cbow-size=100-window=1-min_count=20-iter=25'}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wembedder_model.metadata"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Utility function and dictionaries to convert between Wikidata Q-identifier and word\n",
"q_to_word = {}\n",
"word_to_q = {}\n",
"\n",
"def words_to_qs(words):\n",
"    \"\"\"Map each word to the Q-identifier of its first Wikidata search hit ('' if none).\"\"\"\n",
"    qs = []\n",
"    for word in words:\n",
"        result = search(word)\n",
"        if len(result) > 0:\n",
"            q = result[0]['q']\n",
"            word_to_q[word] = q\n",
"            q_to_word[q] = word\n",
"        else:\n",
"            q = ''\n",
"        qs.append(q)\n",
"    return qs"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Q89', 'Q434', 'Q190545', 'Q15026']\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py:730: FutureWarning: arrays to stack must be passed as a \"sequence\" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.\n",
" vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"kirsebær\n",
"['Q15026', 'Q14748', 'Q2637814', 'Q43238']\n",
"græs\n",
"['Q43238', 'Q287', 'Q506', 'Q1420']\n",
"græs\n",
"['Q1420', 'Q11442', 'Q870', 'Q8094']\n",
"bil\n",
"['Q8094', 'Q7925', 'Q1974425', 'Q105']\n",
"solskin\n",
"['Q105', 'Q127', 'Q132', 'Q154549']\n",
"tømrer\n",
"['Q154549', '', 'Q326358', 'Q7569']\n",
"barn\n",
"['Q7569', 'Q7565', 'Q9235758', 'Q1138737']\n",
"lampe\n",
"['Q1138737', 'Q79746', 'Q235783', 'Q677']\n",
"jern\n",
"['Q677', 'Q897', 'Q660', 'Q1139267']\n",
"magnesium\n",
"['Q1139267', 'Q844819', 'Q10712726', 'Q11472']\n",
"papir\n",
"['Q11472', 'Q421904', 'Q14674', 'Q680928']\n",
"vagt\n",
"['Q680928', '', 'Q311396', 'Q515']\n",
"by\n",
"['Q515', 'Q532', 'Q18511725', 'Q13266']\n",
"småkage\n",
"['Q13266', 'Q12306402', 'Q2098139', 'Q36794']\n",
"småkage\n",
"['Q36794', 'Q42948', 'Q35473', 'Q5994']\n",
"klaver\n",
"['Q5994', 'Q8338', 'Q187851', 'Q30024']\n",
"fandens\n",
"['Q30024', 'Q31928', 'Q194213', 'Q283']\n",
"vand\n",
"['Q283', 'Q36133', 'Q3196', 'Q492']\n",
"hukommelse\n",
"['Q492', 'Q83500', 'Q9415', 'Q7085']\n",
"Niels Bohr\n",
"['Q7085', 'Q44412', 'Q160187', 'Q182397']\n",
"Lars Løkke Rasmussen\n",
"['Q182397', 'Q311063', 'Q46052', 'Q182314']\n",
"Peter Schmeichel\n",
"['Q182314', 'Q295797', 'Q212854', 'Q30767']\n",
"Caroline Wozniacki\n",
"['Q30767', 'Q11662', 'Q11459', 'Q235']\n",
"Monaco\n",
"['Q235', 'Q90', 'Q490', 'Q2300138']\n",
"Pia\n",
"['Q2300138', 'Q4569738', 'Q18760860', 'Q2097883']\n",
"Marianne\n",
"['Q18545', 'Q874669', 'Q15662', 'Q1029907']\n",
"bold\n",
"['Q1029907', 'Q9103', 'Q133279', 'Q80151']\n",
"hat\n",
"['Q80151', 'Q2236437', 'Q36167', 'Q41466']\n",
"hue\n",
"['Q41466', 'Q130949', 'Q7718', 'Q2736']\n",
"fodbold\n",
"['Q861', 'Q27046', 'Q1622332', 'Q1144593']\n",
"løbe\n",
"['Q3142', 'Q1088', 'Q428124', 'Q527']\n",
"himmel\n",
"['Q33', 'Q34', 'Q20', 'Q148']\n",
"Kina\n",
"['Q148', 'Q17', 'Q884', 'Q27']\n",
"Irland\n",
"['Q35874', 'Q40831', 'Q145806', 'Q867805']\n",
"komedie\n",
"['Q124441', 'Q483634', 'Q496334', 'Q867805']\n",
"vaskemaskine\n",
"['Q11707', 'Q30022', 'Q187456', 'Q44']\n",
"restaurant\n",
"['Q44', 'Q282', 'Q17562878', 'Q43164']\n",
"køkken\n",
"['Q43164', '', 'Q475018', 'Q44']\n",
"øl\n",
"['Q45364', 'Q285676', 'Q280658', 'Q161358']\n",
"wing\n",
"['Q161358', 'Q263421', 'Q205451', 'Q10998']\n",
"vinge\n",
"['Q10998', 'Q251265', 'Q9266', 'Q3578987']\n",
"salat\n",
"['Q21176', 'Q27168', 'Q927713', 'Q131716']\n",
"Kattegat\n",
"['Q131716', 'Q104662', 'Q323759', 'Q25535']\n",
"Sjælland\n",
"['Q12004', 'Q25618', 'Q104819', 'Q2102']\n",
"slange\n",
"['Q192056', 'Q271218', 'Q170713', 'Q144']\n",
"hund\n",
"['Q830', 'Q1045', 'Q726', 'Q165159']\n",
"so\n",
"['Q25222', 'Q43365', 'Q27589', 'Q144']\n",
"hund\n",
"['Q144', 'Q8331', 'Q18498', 'Q165159']\n",
"hund\n",
"['', '', 'Q16533', 'Q8331']\n",
"ræv\n",
"['Q672', 'Q872', 'Q11035', 'Q3715160']\n",
"tv\n",
"['Q2555285', '', '', 'Q12059508']\n",
"ondt\n",
"['', 'Q47485094', 'Q22570681', 'Q267505']\n",
"hoppende\n",
"['Q22741', '', 'Q2281138', 'Q156944']\n",
"saver\n",
"['Q131596', '', 'Q184224', 'Q12335209']\n",
"går\n",
"['Q40276', 'Q1226939', '', 'Q184224']\n",
"vandrer\n",
"['Q23995085', 'Q23574732', 'Q37218447', 'Q131596']\n",
"går\n",
"['Q202', 'Q203', 'Q40118', 'Q21075795']\n",
"fire\n",
"['Q16511256', 'Q16871164', 'Q12042571', 'Q37436530']\n",
"Nielsen\n",
"['Q107205', 'Q3707571', 'Q57692091', 'Q7546644']\n",
"kæmpe\n",
"['', 'Q37005674', '', 'Q188628']\n",
"styrelser\n",
"['', '', 'Q18530', 'Q107']\n",
"rum\n",
"['Q18031231', 'Q83478', 'Q327245', 'Q628179']\n",
"trup\n",
"['Q1027879', 'Q3744866', 'Q1758354', 'Q2914547']\n",
"figur\n",
"['Q108', 'Q119', 'Q120', 'Q376']\n",
"ur\n",
"['', 'Q21070568', '', '']\n",
"muligvis\n",
"['Q52593763', 'Q42848', '', '']\n",
"data\n",
"['', '', 'Q19827313', 'Q18602508']\n",
"to piger\n",
"['', 'Q44167', 'Q194156', 'Q36855091']\n",
"cockpit\n",
"['Q265158', 'Q35535', 'Q83267', 'Q8486']\n",
"forbrydelse\n",
"['', 'Q46952', 'Q6512631', '']\n",
"organiseret\n",
"['Q478798', 'Q11982', 'Q93184', 'Q153988']\n",
"tegning\n",
"['Q1980247', 'Q1931107', 'Q6497253', 'Q22698']\n",
"kapitel\n",
"['Q4830453', 'Q168678', 'Q783794', 'Q178359']\n",
"virksomhed\n",
"['Q79998', 'Q79998', 'Q156776', 'Q201']\n",
"LX\n",
"['Q7704', 'Q6909', 'Q18633', 'Q2057']\n",
"1909\n",
"['Q217577', 'Q618779', 'Q160151', 'Q597512']\n",
"udmærkelse\n",
"['Q18982825', 'Q46959', 'Q209522', 'Q7100393']\n",
"bange\n",
"['Q9043', 'Q1860', 'Q1321', 'Q36348']\n",
"falsk\n",
"['Q7987', 'Q250', 'Q5290', 'Q1369158']\n",
"skærm\n",
"['Q36794', '', '', 'Q9138312']\n",
"dør\n",
"['Q1307', 'Q186030', '', 'Q13267']\n",
"sin\n",
"['Q632842', 'Q8045218', 'Q1378288', 'Q18978454']\n",
"sit\n",
"['', '', '', 'Q20441088']\n",
"stod og råbte\n",
"['Q12896105', 'Q80973', 'Q734263', 'Q2537']\n",
"morgenmad\n",
"['Q874078', 'Q215627', 'Q706611', 'Q40276']\n",
"personer\n",
"['Q2349094', 'Q279373', '', 'Q9176']\n",
"køre\n",
"['Q53275988', 'Q41796514', '', '']\n",
"råbte\n",
"['Q738142', 'Q212238', 'Q572700', 'Q10578291']\n",
"embedsmand\n",
"['Q187997', 'Q2741825', 'Q42177', 'Q721931']\n",
"vegetation\n",
"['Q2125610', 'Q468821', 'Q216541', '']\n",
"chauffør\n",
"['Q7075', 'Q571', 'Q1304483', 'Q14660']\n",
"låner\n",
"['', 'Q12125', 'Q37542164', 'Q34442']\n",
"vej\n",
"['Q638', 'Q263478', 'Q170406', 'Q2462658']\n",
"leder\n",
"['Q10870555', '', 'Q191067', 'Q47107']\n",
"spand\n",
"['Q245005', 'Q23490', '', 'Q673687']\n",
"tekande\n",
"['', 'Q268493', '', 'Q22687']\n",
"banker\n",
"['Q1439309', '', '', '']\n",
"fremtidige\n",
"['Q12284', 'Q4022', 'Q47521', 'Q35197']\n",
"flod\n",
"['Q12284', 'Q40056', '', 'Q23841']\n",
"vask\n"
]
}
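],
"source": [
"# Note: this list shadows the imported `wembedder` module from here on;\n",
"# the model itself is already loaded in `wembedder_model`.\n",
"wembedder = []\n",
"for idx, words in four_words.iterrows():\n",
"    words = words.values[:4]\n",
"\n",
"    # This may take some time to download from the Wikidata API\n",
"    qs = words_to_qs(words)\n",
"    print(qs)\n",
"\n",
"    try:\n",
"        q = wembedder_model.wv.doesnt_match(qs)\n",
"        outlier = q_to_word.get(q, words[0])\n",
"    except ValueError:\n",
"        # Raised e.g. when none of the Q-identifiers is in the model\n",
"        # vocabulary; fall back to the first word\n",
"        outlier = words[0]\n",
"    wembedder.append(outlier)\n",
"\n",
"    print(outlier)"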
]
},
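{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `doesnt_match` call in the cell above picks the odd one out among the four Q-identifiers. In the gensim version used here it appears to average the unit-normalised vectors and return the item whose vector has the lowest cosine similarity with that mean. The NumPy sketch below only illustrates this idea; `odd_one_out` is a hypothetical helper and not the gensim code itself.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def odd_one_out(vectors):\n",
"    # Normalise each row to unit length\n",
"    vectors = np.asarray(vectors, dtype=float)\n",
"    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)\n",
"\n",
"    # Unit-length mean direction of all rows\n",
"    mean = unit.mean(axis=0)\n",
"    mean = mean / np.linalg.norm(mean)\n",
"\n",
"    # The row least aligned with the mean is taken as the intruder\n",
"    return int(np.argmin(unit @ mean))\n",
"```\n",
"\n",
"For example, `odd_one_out([[1, 0], [0.9, 0.1], [0.95, 0.05], [0, 1]])` returns `3`, the index of the vector that points away from the other three."
]
},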
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"four_words['wembedder'] = wembedder"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.47"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(four_words.word4 == four_words['wembedder'])"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word4</th>\n",
" <th>wembedder</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>stol</td>\n",
" <td>kirsebær</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>bil</td>\n",
" <td>græs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>vind</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>mandag</td>\n",
" <td>solskin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>sjov</td>\n",
" <td>magnesium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>dør</td>\n",
" <td>småkage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Peter Schmeichel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>Monaco</td>\n",
" <td>Monaco</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Ole</td>\n",
" <td>Marianne</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>mave</td>\n",
" <td>bold</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>ishockey</td>\n",
" <td>hue</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>fodbold</td>\n",
" <td>fodbold</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>sidde</td>\n",
" <td>løbe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>beskidt</td>\n",
" <td>komedie</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>beskidt</td>\n",
" <td>vaskemaskine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>øl</td>\n",
" <td>restaurant</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>vinge</td>\n",
" <td>wing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>kartoffel</td>\n",
" <td>vinge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>pejs</td>\n",
" <td>salat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>Sjælland</td>\n",
" <td>Sjælland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>krappe</td>\n",
" <td>so</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>krappe</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>klud</td>\n",
" <td>tv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51</th>\n",
" <td>herligt</td>\n",
" <td>ondt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>døende</td>\n",
" <td>hoppende</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>aer</td>\n",
" <td>saver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>siger</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>vandrer</td>\n",
" <td>vandrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56</th>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57</th>\n",
" <td>aldrig</td>\n",
" <td>fire</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>smule</td>\n",
" <td>kæmpe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>sti</td>\n",
" <td>trup</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>lån</td>\n",
" <td>figur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>nutidigt</td>\n",
" <td>muligvis</td>\n",
" </tr>\n",
" <tr>\n",
" <th>66</th>\n",
" <td>fjerner</td>\n",
" <td>data</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67</th>\n",
" <td>to piger</td>\n",
" <td>to piger</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>sagen</td>\n",
" <td>cockpit</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>kaffe</td>\n",
" <td>forbrydelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>skuffet</td>\n",
" <td>organiseret</td>\n",
" </tr>\n",
" <tr>\n",
" <th>71</th>\n",
" <td>skål</td>\n",
" <td>tegning</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72</th>\n",
" <td>park</td>\n",
" <td>kapitel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>sovs</td>\n",
" <td>virksomhed</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>3</td>\n",
" <td>LX</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>1909</td>\n",
" <td>1909</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>øremærke</td>\n",
" <td>udmærkelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>ordentlig</td>\n",
" <td>bange</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>bræt</td>\n",
" <td>skærm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>åbner</td>\n",
" <td>dør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>sin</td>\n",
" <td>sin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>vises</td>\n",
" <td>sit</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>mand og kvinde</td>\n",
" <td>stod og råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>måne</td>\n",
" <td>morgenmad</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>gange</td>\n",
" <td>personer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>vuggede</td>\n",
" <td>råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>spisebord</td>\n",
" <td>embedsmand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>skur</td>\n",
" <td>vegetation</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>ekspedient</td>\n",
" <td>chauffør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>flag</td>\n",
" <td>låner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>racerbil</td>\n",
" <td>tekande</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>havdige</td>\n",
" <td>fremtidige</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>spejl</td>\n",
" <td>flod</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word4 wembedder\n",
"0 stol kirsebær\n",
"1 græs græs\n",
"2 bil græs\n",
"3 vind bil\n",
"4 mandag solskin\n",
"5 tømrer tømrer\n",
"6 barn barn\n",
"7 lampe lampe\n",
"8 jern jern\n",
"9 sjov magnesium\n",
"10 papir papir\n",
"11 vagt vagt\n",
"12 by by\n",
"13 småkage småkage\n",
"14 dør småkage\n",
"15 klaver klaver\n",
"16 fandens fandens\n",
"17 vand vand\n",
"18 hukommelse hukommelse\n",
"19 Niels Bohr Niels Bohr\n",
"20 Lars Løkke Rasmussen Lars Løkke Rasmussen\n",
"21 Peter Schmeichel Peter Schmeichel\n",
"22 Caroline Wozniacki Caroline Wozniacki\n",
"23 Monaco Monaco\n",
"24 Pia Pia\n",
"25 Ole Marianne\n",
"26 mave bold\n",
"27 hat hat\n",
"28 ishockey hue\n",
"29 fodbold fodbold\n",
"30 sidde løbe\n",
"31 himmel himmel\n",
"32 Kina Kina\n",
"33 Irland Irland\n",
"34 beskidt komedie\n",
"35 beskidt vaskemaskine\n",
"36 øl restaurant\n",
"37 køkken køkken\n",
"38 øl øl\n",
"39 vinge wing\n",
"40 kartoffel vinge\n",
"41 pejs salat\n",
"42 Kattegat Kattegat\n",
"43 Sjælland Sjælland\n",
"44 slange slange\n",
"45 hund hund\n",
"46 krappe so\n",
"47 hund hund\n",
"48 krappe hund\n",
"49 ræv ræv\n",
"50 klud tv\n",
"51 herligt ondt\n",
"52 døende hoppende\n",
"53 aer saver\n",
"54 siger går\n",
"55 vandrer vandrer\n",
"56 går går\n",
"57 aldrig fire\n",
"58 kassen Nielsen\n",
"59 smule kæmpe\n",
"60 styrelser styrelser\n",
"61 rum rum\n",
"62 sti trup\n",
"63 lån figur\n",
"64 ur ur\n",
"65 nutidigt muligvis\n",
"66 fjerner data\n",
"67 to piger to piger\n",
"68 sagen cockpit\n",
"69 kaffe forbrydelse\n",
"70 skuffet organiseret\n",
"71 skål tegning\n",
"72 park kapitel\n",
"73 sovs virksomhed\n",
"74 3 LX\n",
"75 1909 1909\n",
"76 øremærke udmærkelse\n",
"77 ordentlig bange\n",
"78 falsk falsk\n",
"79 bræt skærm\n",
"80 åbner dør\n",
"81 sin sin\n",
"82 vises sit\n",
"83 mand og kvinde stod og råbte\n",
"84 måne morgenmad\n",
"85 gange personer\n",
"86 køre køre\n",
"87 vuggede råbte\n",
"88 spisebord embedsmand\n",
"89 skur vegetation\n",
"90 ekspedient chauffør\n",
"91 flag låner\n",
"92 vej vej\n",
"93 leder leder\n",
"94 spand spand\n",
"95 racerbil tekande\n",
"96 banker banker\n",
"97 havdige fremtidige\n",
"98 spejl flod\n",
"99 vask vask"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"with pd.option_context(\"display.max_rows\", 100):\n",
" display(four_words[['word4', 'wembedder']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"FastText + Wembedder\n",
"--------------------\n",
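"Here we combine FastText and Wembedder.\n",
"FastText is used by default.\n",
"If the initial characters of more than two entries in a row equal their upper-case versions (they are probably proper nouns), we switch to Wembedder.\n",
"Digits pass the same test, since `'3'.upper() == '3'`, so rows that consist mostly of numerals can also be routed to Wembedder."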
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
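"# For each row, pick the prediction from either FastText or Wembedder\n",
"outliers = []\n",
"for idx, words in four_words.iterrows():\n",
"    # Count entries in the row whose first character equals its upper-case form\n",
"    upper_count = 0\n",
"    for word in words:\n",
"        if word[0].upper() == word[0]:\n",
"            upper_count += 1\n",
"    if upper_count > 2:\n",
"        # Probably proper nouns: use the knowledge graph embedding\n",
"        outlier = four_words.loc[idx, 'wembedder']\n",
"    else:\n",
"        outlier = four_words.loc[idx, 'fasttext']\n",
"    outliers.append(outlier)\n",
"\n",
"four_words['fasttext-wembedder'] = outliers"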
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.82"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(four_words.word4 == four_words['fasttext-wembedder'])"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word4</th>\n",
" <th>fasttext-wembedder</th>\n",
" <th>fasttext</th>\n",
" <th>wembedder</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>stol</td>\n",
" <td>stol</td>\n",
" <td>stol</td>\n",
" <td>kirsebær</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" <td>græs</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>vind</td>\n",
" <td>tog</td>\n",
" <td>tog</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>mandag</td>\n",
" <td>mandag</td>\n",
" <td>mandag</td>\n",
" <td>solskin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>sjov</td>\n",
" <td>sjov</td>\n",
" <td>sjov</td>\n",
" <td>magnesium</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>dør</td>\n",
" <td>dør</td>\n",
" <td>dør</td>\n",
" <td>småkage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Ole Rømer</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Anders Fogh Rasmussen</td>\n",
" <td>Peter Schmeichel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>Monaco</td>\n",
" <td>Monaco</td>\n",
" <td>Serena Williams</td>\n",
" <td>Monaco</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Ole</td>\n",
" <td>Marianne</td>\n",
" <td>Pia</td>\n",
" <td>Marianne</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>mave</td>\n",
" <td>mave</td>\n",
" <td>mave</td>\n",
" <td>bold</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" <td>hue</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>fodbold</td>\n",
" <td>skiløb</td>\n",
" <td>skiløb</td>\n",
" <td>fodbold</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>sidde</td>\n",
" <td>løbe</td>\n",
" <td>løbe</td>\n",
" <td>løbe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>komedie</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>vaskemaskine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>restaurant</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>vinge</td>\n",
" <td>vinge</td>\n",
" <td>vinge</td>\n",
" <td>wing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>kartoffel</td>\n",
" <td>kartoffel</td>\n",
" <td>kartoffel</td>\n",
" <td>vinge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>pejs</td>\n",
" <td>pejs</td>\n",
" <td>pejs</td>\n",
" <td>salat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>Sjælland</td>\n",
" <td>Sjælland</td>\n",
" <td>Alssund</td>\n",
" <td>Sjælland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>so</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>hund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>klud</td>\n",
" <td>klud</td>\n",
" <td>klud</td>\n",
" <td>tv</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51</th>\n",
" <td>herligt</td>\n",
" <td>herligt</td>\n",
" <td>herligt</td>\n",
" <td>ondt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>døende</td>\n",
" <td>løbende</td>\n",
" <td>løbende</td>\n",
" <td>hoppende</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>aer</td>\n",
" <td>aer</td>\n",
" <td>aer</td>\n",
" <td>saver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>siger</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>vandrer</td>\n",
" <td>lægge sammen</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56</th>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57</th>\n",
" <td>aldrig</td>\n",
" <td>aldrig</td>\n",
" <td>aldrig</td>\n",
" <td>fire</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>smule</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>sti</td>\n",
" <td>sti</td>\n",
" <td>sti</td>\n",
" <td>trup</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" <td>figur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" <td>muligvis</td>\n",
" </tr>\n",
" <tr>\n",
" <th>66</th>\n",
" <td>fjerner</td>\n",
" <td>fjerner</td>\n",
" <td>fjerner</td>\n",
" <td>data</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67</th>\n",
" <td>to piger</td>\n",
" <td>tre timer</td>\n",
" <td>tre timer</td>\n",
" <td>to piger</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>sagen</td>\n",
" <td>sagen</td>\n",
" <td>sagen</td>\n",
" <td>cockpit</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>kaffe</td>\n",
" <td>anmeldelse</td>\n",
" <td>anmeldelse</td>\n",
" <td>forbrydelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>skuffet</td>\n",
" <td>skuffet</td>\n",
" <td>skuffet</td>\n",
" <td>organiseret</td>\n",
" </tr>\n",
" <tr>\n",
" <th>71</th>\n",
" <td>skål</td>\n",
" <td>skål</td>\n",
" <td>skål</td>\n",
" <td>tegning</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72</th>\n",
" <td>park</td>\n",
" <td>park</td>\n",
" <td>park</td>\n",
" <td>kapitel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>sovs</td>\n",
" <td>sovs</td>\n",
" <td>sovs</td>\n",
" <td>virksomhed</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>3</td>\n",
" <td>LX</td>\n",
" <td>tres</td>\n",
" <td>LX</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>1909</td>\n",
" <td>1909</td>\n",
" <td>1807</td>\n",
" <td>1909</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>øremærke</td>\n",
" <td>øremærke</td>\n",
" <td>øremærke</td>\n",
" <td>udmærkelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>ordentlig</td>\n",
" <td>ordentlig</td>\n",
" <td>ordentlig</td>\n",
" <td>bange</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>bræt</td>\n",
" <td>bræt</td>\n",
" <td>bræt</td>\n",
" <td>skærm</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>åbner</td>\n",
" <td>kradser af</td>\n",
" <td>kradser af</td>\n",
" <td>dør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>sin</td>\n",
" <td>og</td>\n",
" <td>og</td>\n",
" <td>sin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>vises</td>\n",
" <td>vises</td>\n",
" <td>vises</td>\n",
" <td>sit</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>mand og kvinde</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>måne</td>\n",
" <td>måne</td>\n",
" <td>måne</td>\n",
" <td>morgenmad</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>gange</td>\n",
" <td>mænd</td>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>vuggede</td>\n",
" <td>råbte</td>\n",
" <td>råbte</td>\n",
" <td>råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>spisebord</td>\n",
" <td>spisebord</td>\n",
" <td>spisebord</td>\n",
" <td>embedsmand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>skur</td>\n",
" <td>skur</td>\n",
" <td>skur</td>\n",
" <td>vegetation</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>ekspedient</td>\n",
" <td>ekspedient</td>\n",
" <td>ekspedient</td>\n",
" <td>chauffør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>flag</td>\n",
" <td>låner</td>\n",
" <td>låner</td>\n",
" <td>låner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>racerbil</td>\n",
" <td>racerbil</td>\n",
" <td>racerbil</td>\n",
" <td>tekande</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>havdige</td>\n",
" <td>havdige</td>\n",
" <td>havdige</td>\n",
" <td>fremtidige</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>spejl</td>\n",
" <td>spejl</td>\n",
" <td>spejl</td>\n",
" <td>flod</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word4 fasttext-wembedder fasttext \\\n",
"0 stol stol stol \n",
"1 græs græs græs \n",
"2 bil bil bil \n",
"3 vind tog tog \n",
"4 mandag mandag mandag \n",
"5 tømrer tømrer tømrer \n",
"6 barn barn barn \n",
"7 lampe lampe lampe \n",
"8 jern jern jern \n",
"9 sjov sjov sjov \n",
"10 papir papir papir \n",
"11 vagt vagt vagt \n",
"12 by by by \n",
"13 småkage småkage småkage \n",
"14 dør dør dør \n",
"15 klaver klaver klaver \n",
"16 fandens fandens fandens \n",
"17 vand vand vand \n",
"18 hukommelse hukommelse hukommelse \n",
"19 Niels Bohr Niels Bohr Niels Bohr \n",
"20 Lars Løkke Rasmussen Lars Løkke Rasmussen Ole Rømer \n",
"21 Peter Schmeichel Peter Schmeichel Anders Fogh Rasmussen \n",
"22 Caroline Wozniacki Caroline Wozniacki Caroline Wozniacki \n",
"23 Monaco Monaco Serena Williams \n",
"24 Pia Pia Pia \n",
"25 Ole Marianne Pia \n",
"26 mave mave mave \n",
"27 hat hat hat \n",
"28 ishockey ishockey ishockey \n",
"29 fodbold skiløb skiløb \n",
"30 sidde løbe løbe \n",
"31 himmel himmel himmel \n",
"32 Kina Kina Kina \n",
"33 Irland Irland Irland \n",
"34 beskidt beskidt beskidt \n",
"35 beskidt beskidt beskidt \n",
"36 øl øl øl \n",
"37 køkken køkken køkken \n",
"38 øl øl øl \n",
"39 vinge vinge vinge \n",
"40 kartoffel kartoffel kartoffel \n",
"41 pejs pejs pejs \n",
"42 Kattegat Kattegat Kattegat \n",
"43 Sjælland Sjælland Alssund \n",
"44 slange slange slange \n",
"45 hund hund hund \n",
"46 krappe krappe krappe \n",
"47 hund hund hund \n",
"48 krappe krappe krappe \n",
"49 ræv ræv ræv \n",
"50 klud klud klud \n",
"51 herligt herligt herligt \n",
"52 døende løbende løbende \n",
"53 aer aer aer \n",
"54 siger går går \n",
"55 vandrer lægge sammen lægge sammen \n",
"56 går går går \n",
"57 aldrig aldrig aldrig \n",
"58 kassen Nielsen kassen \n",
"59 smule kæmpe kæmpe \n",
"60 styrelser styrelser styrelser \n",
"61 rum rum rum \n",
"62 sti sti sti \n",
"63 lån lån lån \n",
"64 ur ur ur \n",
"65 nutidigt nutidigt nutidigt \n",
"66 fjerner fjerner fjerner \n",
"67 to piger tre timer tre timer \n",
"68 sagen sagen sagen \n",
"69 kaffe anmeldelse anmeldelse \n",
"70 skuffet skuffet skuffet \n",
"71 skål skål skål \n",
"72 park park park \n",
"73 sovs sovs sovs \n",
"74 3 LX tres \n",
"75 1909 1909 1807 \n",
"76 øremærke øremærke øremærke \n",
"77 ordentlig ordentlig ordentlig \n",
"78 falsk falsk falsk \n",
"79 bræt bræt bræt \n",
"80 åbner kradser af kradser af \n",
"81 sin og og \n",
"82 vises vises vises \n",
"83 mand og kvinde stod og råbte stod og råbte \n",
"84 måne måne måne \n",
"85 gange mænd mænd \n",
"86 køre køre køre \n",
"87 vuggede råbte råbte \n",
"88 spisebord spisebord spisebord \n",
"89 skur skur skur \n",
"90 ekspedient ekspedient ekspedient \n",
"91 flag låner låner \n",
"92 vej vej vej \n",
"93 leder leder leder \n",
"94 spand spand spand \n",
"95 racerbil racerbil racerbil \n",
"96 banker banker banker \n",
"97 havdige havdige havdige \n",
"98 spejl spejl spejl \n",
"99 vask vask vask \n",
"\n",
" wembedder \n",
"0 kirsebær \n",
"1 græs \n",
"2 græs \n",
"3 bil \n",
"4 solskin \n",
"5 tømrer \n",
"6 barn \n",
"7 lampe \n",
"8 jern \n",
"9 magnesium \n",
"10 papir \n",
"11 vagt \n",
"12 by \n",
"13 småkage \n",
"14 småkage \n",
"15 klaver \n",
"16 fandens \n",
"17 vand \n",
"18 hukommelse \n",
"19 Niels Bohr \n",
"20 Lars Løkke Rasmussen \n",
"21 Peter Schmeichel \n",
"22 Caroline Wozniacki \n",
"23 Monaco \n",
"24 Pia \n",
"25 Marianne \n",
"26 bold \n",
"27 hat \n",
"28 hue \n",
"29 fodbold \n",
"30 løbe \n",
"31 himmel \n",
"32 Kina \n",
"33 Irland \n",
"34 komedie \n",
"35 vaskemaskine \n",
"36 restaurant \n",
"37 køkken \n",
"38 øl \n",
"39 wing \n",
"40 vinge \n",
"41 salat \n",
"42 Kattegat \n",
"43 Sjælland \n",
"44 slange \n",
"45 hund \n",
"46 so \n",
"47 hund \n",
"48 hund \n",
"49 ræv \n",
"50 tv \n",
"51 ondt \n",
"52 hoppende \n",
"53 saver \n",
"54 går \n",
"55 vandrer \n",
"56 går \n",
"57 fire \n",
"58 Nielsen \n",
"59 kæmpe \n",
"60 styrelser \n",
"61 rum \n",
"62 trup \n",
"63 figur \n",
"64 ur \n",
"65 muligvis \n",
"66 data \n",
"67 to piger \n",
"68 cockpit \n",
"69 forbrydelse \n",
"70 organiseret \n",
"71 tegning \n",
"72 kapitel \n",
"73 virksomhed \n",
"74 LX \n",
"75 1909 \n",
"76 udmærkelse \n",
"77 bange \n",
"78 falsk \n",
"79 skærm \n",
"80 dør \n",
"81 sin \n",
"82 sit \n",
"83 stod og råbte \n",
"84 morgenmad \n",
"85 personer \n",
"86 køre \n",
"87 råbte \n",
"88 embedsmand \n",
"89 vegetation \n",
"90 chauffør \n",
"91 låner \n",
"92 vej \n",
"93 leder \n",
"94 spand \n",
"95 tekande \n",
"96 banker \n",
"97 fremtidige \n",
"98 flod \n",
"99 vask "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"with pd.option_context(\"display.max_rows\", 100):\n",
" display(four_words[['word4', 'fasttext-wembedder', 'fasttext', 'wembedder']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"FastText + Wembedder + BERT\n",
"---------------------------\n",
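"Combination of FastText, Wembedder and BERT.\n",
"Wembedder is still used for rows that look like proper nouns.\n",
"BERT is used for the remaining rows where at least one of the four entries is a multi-word phrase (contains a space); FastText handles everything else."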
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# For each row, pick the prediction from Wembedder, BERT or FastText\n",
"outliers = []\n",
"for idx, words in four_words.iterrows():\n",
"    # Count entries in the row whose first character equals its upper-case form\n",
"    upper_count = 0\n",
"    for word in words:\n",
"        if word[0].upper() == word[0]:\n",
"            upper_count += 1\n",
"    if upper_count > 2:\n",
"        # Probably proper nouns: use the knowledge graph embedding\n",
"        outlier = four_words.loc[idx, 'wembedder']\n",
"    elif any(' ' in word for word in words.iloc[:4]):\n",
"        # A phrase among the first four columns: use BERT\n",
"        outlier = four_words.loc[idx, 'bert-corrcoef']\n",
"    else:\n",
"        outlier = four_words.loc[idx, 'fasttext']\n",
"    outliers.append(outlier)\n",
"\n",
"four_words['fasttext-wembedder-bert'] = outliers"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.83"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.mean(four_words.word4 == four_words['fasttext-wembedder-bert'])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word1</th>\n",
" <th>word2</th>\n",
" <th>word3</th>\n",
" <th>word4</th>\n",
" <th>fasttext-wembedder-bert</th>\n",
" <th>fasttext</th>\n",
" <th>wembedder</th>\n",
" <th>bert-corrcoef</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>æble</td>\n",
" <td>pære</td>\n",
" <td>kirsebær</td>\n",
" <td>stol</td>\n",
" <td>stol</td>\n",
" <td>stol</td>\n",
" <td>kirsebær</td>\n",
" <td>kirsebær</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>stol</td>\n",
" <td>bord</td>\n",
" <td>reol</td>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" <td>græs</td>\n",
" <td>reol</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>græs</td>\n",
" <td>træ</td>\n",
" <td>blomst</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" <td>græs</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>bil</td>\n",
" <td>cykel</td>\n",
" <td>tog</td>\n",
" <td>vind</td>\n",
" <td>tog</td>\n",
" <td>tog</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>vind</td>\n",
" <td>regn</td>\n",
" <td>solskin</td>\n",
" <td>mandag</td>\n",
" <td>mandag</td>\n",
" <td>mandag</td>\n",
" <td>solskin</td>\n",
" <td>solskin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>mandag</td>\n",
" <td>tirsdag</td>\n",
" <td>søndag</td>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" <td>tømrer</td>\n",
" <td>søndag</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>tømrer</td>\n",
" <td>vvs-mand</td>\n",
" <td>snedker</td>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" <td>barn</td>\n",
" <td>vvs-mand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>barn</td>\n",
" <td>far</td>\n",
" <td>mormor</td>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" <td>lampe</td>\n",
" <td>barn</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>lampe</td>\n",
" <td>stearinlys</td>\n",
" <td>lommelygte</td>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" <td>jern</td>\n",
" <td>stearinlys</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>jern</td>\n",
" <td>guld</td>\n",
" <td>magnesium</td>\n",
" <td>sjov</td>\n",
" <td>sjov</td>\n",
" <td>sjov</td>\n",
" <td>magnesium</td>\n",
" <td>guld</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>sjov</td>\n",
" <td>dårlig</td>\n",
" <td>vanvittig</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" <td>papir</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>papir</td>\n",
" <td>ringbind</td>\n",
" <td>blyant</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" <td>vagt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>vagt</td>\n",
" <td>politimand</td>\n",
" <td>fængselsbetjent</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" <td>by</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>by</td>\n",
" <td>landsby</td>\n",
" <td>købstad</td>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" <td>småkage</td>\n",
" <td>by</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>småkage</td>\n",
" <td>citronmåne</td>\n",
" <td>kringle</td>\n",
" <td>dør</td>\n",
" <td>dør</td>\n",
" <td>dør</td>\n",
" <td>småkage</td>\n",
" <td>dør</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>dør</td>\n",
" <td>væg</td>\n",
" <td>vindue</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" <td>klaver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>klaver</td>\n",
" <td>trompet</td>\n",
" <td>blokfløjte</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" <td>fandens</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>fandens</td>\n",
" <td>fuck</td>\n",
" <td>sgu</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" <td>vand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>vand</td>\n",
" <td>jord</td>\n",
" <td>ild</td>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" <td>hukommelse</td>\n",
" <td>vand</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>hukommelse</td>\n",
" <td>intelligens</td>\n",
" <td>emotion</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>Niels Bohr</td>\n",
" <td>H.C. Ørsted</td>\n",
" <td>Ole Rømer</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Ole Rømer</td>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Niels Bohr</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>Lars Løkke Rasmussen</td>\n",
" <td>Poul Nyrup Rasmussen</td>\n",
" <td>Anders Fogh Rasmussen</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Anders Fogh Rasmussen</td>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Peter Schmeichel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>Peter Schmeichel</td>\n",
" <td>Kasper Schmeichel</td>\n",
" <td>Brian Laudrup</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Caroline Wozniacki</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>Caroline Wozniacki</td>\n",
" <td>Steffi Graf</td>\n",
" <td>Serena Williams</td>\n",
" <td>Monaco</td>\n",
" <td>Monaco</td>\n",
" <td>Serena Williams</td>\n",
" <td>Monaco</td>\n",
" <td>Serena Williams</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>Monaco</td>\n",
" <td>Paris</td>\n",
" <td>Milano</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" <td>Pia</td>\n",
" <td>Milano</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Pia</td>\n",
" <td>Lone</td>\n",
" <td>Marianne</td>\n",
" <td>Ole</td>\n",
" <td>Marianne</td>\n",
" <td>Pia</td>\n",
" <td>Marianne</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>bold</td>\n",
" <td>fjerbold</td>\n",
" <td>puck</td>\n",
" <td>mave</td>\n",
" <td>mave</td>\n",
" <td>mave</td>\n",
" <td>bold</td>\n",
" <td>fjerbold</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>mave</td>\n",
" <td>bryst</td>\n",
" <td>ryg</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" <td>hat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>hat</td>\n",
" <td>kasket</td>\n",
" <td>hue</td>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" <td>ishockey</td>\n",
" <td>hue</td>\n",
" <td>ishockey</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>ishockey</td>\n",
" <td>skiløb</td>\n",
" <td>skihop</td>\n",
" <td>fodbold</td>\n",
" <td>skiløb</td>\n",
" <td>skiløb</td>\n",
" <td>fodbold</td>\n",
" <td>skiløb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>gå</td>\n",
" <td>løbe</td>\n",
" <td>kravle</td>\n",
" <td>sidde</td>\n",
" <td>løbe</td>\n",
" <td>løbe</td>\n",
" <td>løbe</td>\n",
" <td>sidde</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>rød</td>\n",
" <td>blå</td>\n",
" <td>violet</td>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" <td>himmel</td>\n",
" <td>violet</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32</th>\n",
" <td>Finland</td>\n",
" <td>Sverige</td>\n",
" <td>Norge</td>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" <td>Kina</td>\n",
" <td>Norge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>33</th>\n",
" <td>Kina</td>\n",
" <td>Japan</td>\n",
" <td>Sydkorea</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" <td>Irland</td>\n",
" </tr>\n",
" <tr>\n",
" <th>34</th>\n",
" <td>humor</td>\n",
" <td>komedie</td>\n",
" <td>comedy</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>komedie</td>\n",
" <td>beskidt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>35</th>\n",
" <td>vaskemaskine</td>\n",
" <td>strygejern</td>\n",
" <td>tørretumbler</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>beskidt</td>\n",
" <td>vaskemaskine</td>\n",
" <td>vaskemaskine</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36</th>\n",
" <td>restaurant</td>\n",
" <td>café</td>\n",
" <td>bar</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>restaurant</td>\n",
" <td>restaurant</td>\n",
" </tr>\n",
" <tr>\n",
" <th>37</th>\n",
" <td>øl</td>\n",
" <td>vin</td>\n",
" <td>spiritus</td>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" <td>køkken</td>\n",
" <td>vin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>38</th>\n",
" <td>køkken</td>\n",
" <td>baderum</td>\n",
" <td>stue</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>øl</td>\n",
" <td>baderum</td>\n",
" </tr>\n",
" <tr>\n",
" <th>39</th>\n",
" <td>wing</td>\n",
" <td>back</td>\n",
" <td>forward</td>\n",
" <td>vinge</td>\n",
" <td>vinge</td>\n",
" <td>vinge</td>\n",
" <td>wing</td>\n",
" <td>vinge</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40</th>\n",
" <td>vinge</td>\n",
" <td>landingsstel</td>\n",
" <td>propel</td>\n",
" <td>kartoffel</td>\n",
" <td>kartoffel</td>\n",
" <td>kartoffel</td>\n",
" <td>vinge</td>\n",
" <td>landingsstel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41</th>\n",
" <td>kartoffel</td>\n",
" <td>frikadelle</td>\n",
" <td>salat</td>\n",
" <td>pejs</td>\n",
" <td>pejs</td>\n",
" <td>pejs</td>\n",
" <td>salat</td>\n",
" <td>salat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>42</th>\n",
" <td>Viborg</td>\n",
" <td>Randers</td>\n",
" <td>Hobro</td>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" <td>Kattegat</td>\n",
" <td>Randers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43</th>\n",
" <td>Kattegat</td>\n",
" <td>Øresund</td>\n",
" <td>Alssund</td>\n",
" <td>Sjælland</td>\n",
" <td>Sjælland</td>\n",
" <td>Alssund</td>\n",
" <td>Sjælland</td>\n",
" <td>Øresund</td>\n",
" </tr>\n",
" <tr>\n",
" <th>44</th>\n",
" <td>eg</td>\n",
" <td>lærketræ</td>\n",
" <td>æbletræ</td>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" <td>slange</td>\n",
" <td>eg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>45</th>\n",
" <td>hugorm</td>\n",
" <td>pyton</td>\n",
" <td>snog</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>pyton</td>\n",
" </tr>\n",
" <tr>\n",
" <th>46</th>\n",
" <td>ko</td>\n",
" <td>so</td>\n",
" <td>hest</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>so</td>\n",
" <td>so</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47</th>\n",
" <td>ugle</td>\n",
" <td>krage</td>\n",
" <td>måge</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>hund</td>\n",
" <td>krage</td>\n",
" </tr>\n",
" <tr>\n",
" <th>48</th>\n",
" <td>hund</td>\n",
" <td>ræv</td>\n",
" <td>ulv</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>krappe</td>\n",
" <td>hund</td>\n",
" <td>krappe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49</th>\n",
" <td>spilletid</td>\n",
" <td>halvleg</td>\n",
" <td>dommer</td>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" <td>ræv</td>\n",
" <td>spilletid</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>tv</td>\n",
" <td>radio</td>\n",
" <td>telefon</td>\n",
" <td>klud</td>\n",
" <td>klud</td>\n",
" <td>klud</td>\n",
" <td>tv</td>\n",
" <td>klud</td>\n",
" </tr>\n",
" <tr>\n",
" <th>51</th>\n",
" <td>ondt</td>\n",
" <td>forfærdeligt</td>\n",
" <td>skrækkeligt</td>\n",
" <td>herligt</td>\n",
" <td>herligt</td>\n",
" <td>herligt</td>\n",
" <td>ondt</td>\n",
" <td>ondt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>hoppende</td>\n",
" <td>dansende</td>\n",
" <td>løbende</td>\n",
" <td>døende</td>\n",
" <td>løbende</td>\n",
" <td>løbende</td>\n",
" <td>hoppende</td>\n",
" <td>døende</td>\n",
" </tr>\n",
" <tr>\n",
" <th>53</th>\n",
" <td>saver</td>\n",
" <td>hamrer</td>\n",
" <td>skruer</td>\n",
" <td>aer</td>\n",
" <td>aer</td>\n",
" <td>aer</td>\n",
" <td>saver</td>\n",
" <td>saver</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>går</td>\n",
" <td>spadserer</td>\n",
" <td>vandrer</td>\n",
" <td>siger</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>gange</td>\n",
" <td>dividere</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" <td>lægge sammen</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" <td>lægge sammen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>56</th>\n",
" <td>mener</td>\n",
" <td>tror</td>\n",
" <td>ved</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>ved</td>\n",
" </tr>\n",
" <tr>\n",
" <th>57</th>\n",
" <td>fire</td>\n",
" <td>fem</td>\n",
" <td>sytten</td>\n",
" <td>aldrig</td>\n",
" <td>aldrig</td>\n",
" <td>aldrig</td>\n",
" <td>fire</td>\n",
" <td>sytten</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>Nielsen</td>\n",
" <td>Jensen</td>\n",
" <td>Olsen</td>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" <td>Nielsen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>mega</td>\n",
" <td>kæmpe</td>\n",
" <td>enorm</td>\n",
" <td>smule</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60</th>\n",
" <td>kufferter</td>\n",
" <td>tasker</td>\n",
" <td>bæreposer</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" <td>styrelser</td>\n",
" </tr>\n",
" <tr>\n",
" <th>61</th>\n",
" <td>landstræner</td>\n",
" <td>håndboldekspert</td>\n",
" <td>mål</td>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" <td>rum</td>\n",
" <td>landstræner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62</th>\n",
" <td>trup</td>\n",
" <td>gruppe</td>\n",
" <td>hold</td>\n",
" <td>sti</td>\n",
" <td>sti</td>\n",
" <td>sti</td>\n",
" <td>trup</td>\n",
" <td>gruppe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>63</th>\n",
" <td>grafik</td>\n",
" <td>figur</td>\n",
" <td>plot</td>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" <td>lån</td>\n",
" <td>figur</td>\n",
" <td>lån</td>\n",
" </tr>\n",
" <tr>\n",
" <th>64</th>\n",
" <td>januar</td>\n",
" <td>maj</td>\n",
" <td>juni</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" <td>ur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>65</th>\n",
" <td>angiveligt</td>\n",
" <td>muligvis</td>\n",
" <td>sandsynligvis</td>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" <td>nutidigt</td>\n",
" <td>muligvis</td>\n",
" <td>nutidigt</td>\n",
" </tr>\n",
" <tr>\n",
" <th>66</th>\n",
" <td>oplysninger</td>\n",
" <td>data</td>\n",
" <td>informationer</td>\n",
" <td>fjerner</td>\n",
" <td>fjerner</td>\n",
" <td>fjerner</td>\n",
" <td>data</td>\n",
" <td>fjerner</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67</th>\n",
" <td>fire minutter</td>\n",
" <td>tre timer</td>\n",
" <td>en uge</td>\n",
" <td>to piger</td>\n",
" <td>to piger</td>\n",
" <td>tre timer</td>\n",
" <td>to piger</td>\n",
" <td>to piger</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68</th>\n",
" <td>instrumentbrættet</td>\n",
" <td>motorer</td>\n",
" <td>cockpit</td>\n",
" <td>sagen</td>\n",
" <td>sagen</td>\n",
" <td>sagen</td>\n",
" <td>cockpit</td>\n",
" <td>sagen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>anmeldelse</td>\n",
" <td>politi</td>\n",
" <td>forbrydelse</td>\n",
" <td>kaffe</td>\n",
" <td>anmeldelse</td>\n",
" <td>anmeldelse</td>\n",
" <td>forbrydelse</td>\n",
" <td>kaffe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>instrueret</td>\n",
" <td>organiseret</td>\n",
" <td>ledet</td>\n",
" <td>skuffet</td>\n",
" <td>skuffet</td>\n",
" <td>skuffet</td>\n",
" <td>organiseret</td>\n",
" <td>instrueret</td>\n",
" </tr>\n",
" <tr>\n",
" <th>71</th>\n",
" <td>billede</td>\n",
" <td>foto</td>\n",
" <td>tegning</td>\n",
" <td>skål</td>\n",
" <td>skål</td>\n",
" <td>skål</td>\n",
" <td>tegning</td>\n",
" <td>tegning</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72</th>\n",
" <td>kapitel</td>\n",
" <td>paragraf</td>\n",
" <td>sektion</td>\n",
" <td>park</td>\n",
" <td>park</td>\n",
" <td>park</td>\n",
" <td>kapitel</td>\n",
" <td>paragraf</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>virksomhed</td>\n",
" <td>firma</td>\n",
" <td>selskab</td>\n",
" <td>sovs</td>\n",
" <td>sovs</td>\n",
" <td>sovs</td>\n",
" <td>virksomhed</td>\n",
" <td>virksomhed</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>tres</td>\n",
" <td>60</td>\n",
" <td>LX</td>\n",
" <td>3</td>\n",
" <td>LX</td>\n",
" <td>tres</td>\n",
" <td>LX</td>\n",
" <td>LX</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75</th>\n",
" <td>1864</td>\n",
" <td>1807</td>\n",
" <td>1940</td>\n",
" <td>1909</td>\n",
" <td>1909</td>\n",
" <td>1807</td>\n",
" <td>1909</td>\n",
" <td>1909</td>\n",
" </tr>\n",
" <tr>\n",
" <th>76</th>\n",
" <td>diplom</td>\n",
" <td>udmærkelse</td>\n",
" <td>pris</td>\n",
" <td>øremærke</td>\n",
" <td>øremærke</td>\n",
" <td>øremærke</td>\n",
" <td>udmærkelse</td>\n",
" <td>pris</td>\n",
" </tr>\n",
" <tr>\n",
" <th>77</th>\n",
" <td>bange</td>\n",
" <td>urolig</td>\n",
" <td>nervøs</td>\n",
" <td>ordentlig</td>\n",
" <td>ordentlig</td>\n",
" <td>ordentlig</td>\n",
" <td>bange</td>\n",
" <td>bange</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78</th>\n",
" <td>norsk</td>\n",
" <td>engelsk</td>\n",
" <td>spansk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" <td>falsk</td>\n",
" </tr>\n",
" <tr>\n",
" <th>79</th>\n",
" <td>mus</td>\n",
" <td>tastatur</td>\n",
" <td>skærm</td>\n",
" <td>bræt</td>\n",
" <td>bræt</td>\n",
" <td>bræt</td>\n",
" <td>skærm</td>\n",
" <td>tastatur</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>dør</td>\n",
" <td>kradser af</td>\n",
" <td>udånder</td>\n",
" <td>åbner</td>\n",
" <td>kradser af</td>\n",
" <td>kradser af</td>\n",
" <td>dør</td>\n",
" <td>kradser af</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>og</td>\n",
" <td>samt</td>\n",
" <td>endvidere</td>\n",
" <td>sin</td>\n",
" <td>og</td>\n",
" <td>og</td>\n",
" <td>sin</td>\n",
" <td>sin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>hans</td>\n",
" <td>sit</td>\n",
" <td>vores</td>\n",
" <td>vises</td>\n",
" <td>vises</td>\n",
" <td>vises</td>\n",
" <td>sit</td>\n",
" <td>sit</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>stod og råbte</td>\n",
" <td>lå og sov</td>\n",
" <td>sad og så</td>\n",
" <td>mand og kvinde</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>frokost</td>\n",
" <td>morgenmad</td>\n",
" <td>brunch</td>\n",
" <td>måne</td>\n",
" <td>måne</td>\n",
" <td>måne</td>\n",
" <td>morgenmad</td>\n",
" <td>morgenmad</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" <td>individer</td>\n",
" <td>gange</td>\n",
" <td>mænd</td>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" <td>personer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>kidnappe</td>\n",
" <td>røve</td>\n",
" <td>stjæle</td>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" <td>køre</td>\n",
" <td>kidnappe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>råbte</td>\n",
" <td>skreg</td>\n",
" <td>larmede</td>\n",
" <td>vuggede</td>\n",
" <td>råbte</td>\n",
" <td>råbte</td>\n",
" <td>råbte</td>\n",
" <td>skreg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>kontorist</td>\n",
" <td>embedsmand</td>\n",
" <td>bureaukrat</td>\n",
" <td>spisebord</td>\n",
" <td>spisebord</td>\n",
" <td>spisebord</td>\n",
" <td>embedsmand</td>\n",
" <td>bureaukrat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>vegetation</td>\n",
" <td>krat</td>\n",
" <td>bed</td>\n",
" <td>skur</td>\n",
" <td>skur</td>\n",
" <td>skur</td>\n",
" <td>vegetation</td>\n",
" <td>vegetation</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>cyklist</td>\n",
" <td>bilist</td>\n",
" <td>chauffør</td>\n",
" <td>ekspedient</td>\n",
" <td>ekspedient</td>\n",
" <td>ekspedient</td>\n",
" <td>chauffør</td>\n",
" <td>cyklist</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>bibliotek</td>\n",
" <td>bog</td>\n",
" <td>låner</td>\n",
" <td>flag</td>\n",
" <td>låner</td>\n",
" <td>låner</td>\n",
" <td>låner</td>\n",
" <td>bibliotek</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>halvsyg</td>\n",
" <td>forkølelse</td>\n",
" <td>hoster</td>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" <td>vej</td>\n",
" <td>forkølelse</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>musik</td>\n",
" <td>node</td>\n",
" <td>rytme</td>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" <td>leder</td>\n",
" <td>node</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>rapport</td>\n",
" <td>sagsakt</td>\n",
" <td>artikel</td>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" <td>spand</td>\n",
" <td>artikel</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>tekande</td>\n",
" <td>vinflaske</td>\n",
" <td>slikskål</td>\n",
" <td>racerbil</td>\n",
" <td>racerbil</td>\n",
" <td>racerbil</td>\n",
" <td>tekande</td>\n",
" <td>tekande</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>forhører</td>\n",
" <td>spørger</td>\n",
" <td>anmoder</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" <td>banker</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>fremtidige</td>\n",
" <td>fortidige</td>\n",
" <td>nutidige</td>\n",
" <td>havdige</td>\n",
" <td>havdige</td>\n",
" <td>havdige</td>\n",
" <td>fremtidige</td>\n",
" <td>fortidige</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>kanal</td>\n",
" <td>flod</td>\n",
" <td>bæk</td>\n",
" <td>spejl</td>\n",
" <td>spejl</td>\n",
" <td>spejl</td>\n",
" <td>flod</td>\n",
" <td>kanal</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>kanal</td>\n",
" <td>program</td>\n",
" <td>udsendelse</td>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" <td>vask</td>\n",
" <td>udsendelse</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word1 word2 word3 \\\n",
"0 æble pære kirsebær \n",
"1 stol bord reol \n",
"2 græs træ blomst \n",
"3 bil cykel tog \n",
"4 vind regn solskin \n",
"5 mandag tirsdag søndag \n",
"6 tømrer vvs-mand snedker \n",
"7 barn far mormor \n",
"8 lampe stearinlys lommelygte \n",
"9 jern guld magnesium \n",
"10 sjov dårlig vanvittig \n",
"11 papir ringbind blyant \n",
"12 vagt politimand fængselsbetjent \n",
"13 by landsby købstad \n",
"14 småkage citronmåne kringle \n",
"15 dør væg vindue \n",
"16 klaver trompet blokfløjte \n",
"17 fandens fuck sgu \n",
"18 vand jord ild \n",
"19 hukommelse intelligens emotion \n",
"20 Niels Bohr H.C. Ørsted Ole Rømer \n",
"21 Lars Løkke Rasmussen Poul Nyrup Rasmussen Anders Fogh Rasmussen \n",
"22 Peter Schmeichel Kasper Schmeichel Brian Laudrup \n",
"23 Caroline Wozniacki Steffi Graf Serena Williams \n",
"24 Monaco Paris Milano \n",
"25 Pia Lone Marianne \n",
"26 bold fjerbold puck \n",
"27 mave bryst ryg \n",
"28 hat kasket hue \n",
"29 ishockey skiløb skihop \n",
"30 gå løbe kravle \n",
"31 rød blå violet \n",
"32 Finland Sverige Norge \n",
"33 Kina Japan Sydkorea \n",
"34 humor komedie comedy \n",
"35 vaskemaskine strygejern tørretumbler \n",
"36 restaurant café bar \n",
"37 øl vin spiritus \n",
"38 køkken baderum stue \n",
"39 wing back forward \n",
"40 vinge landingsstel propel \n",
"41 kartoffel frikadelle salat \n",
"42 Viborg Randers Hobro \n",
"43 Kattegat Øresund Alssund \n",
"44 eg lærketræ æbletræ \n",
"45 hugorm pyton snog \n",
"46 ko so hest \n",
"47 ugle krage måge \n",
"48 hund ræv ulv \n",
"49 spilletid halvleg dommer \n",
"50 tv radio telefon \n",
"51 ondt forfærdeligt skrækkeligt \n",
"52 hoppende dansende løbende \n",
"53 saver hamrer skruer \n",
"54 går spadserer vandrer \n",
"55 gange dividere lægge sammen \n",
"56 mener tror ved \n",
"57 fire fem sytten \n",
"58 Nielsen Jensen Olsen \n",
"59 mega kæmpe enorm \n",
"60 kufferter tasker bæreposer \n",
"61 landstræner håndboldekspert mål \n",
"62 trup gruppe hold \n",
"63 grafik figur plot \n",
"64 januar maj juni \n",
"65 angiveligt muligvis sandsynligvis \n",
"66 oplysninger data informationer \n",
"67 fire minutter tre timer en uge \n",
"68 instrumentbrættet motorer cockpit \n",
"69 anmeldelse politi forbrydelse \n",
"70 instrueret organiseret ledet \n",
"71 billede foto tegning \n",
"72 kapitel paragraf sektion \n",
"73 virksomhed firma selskab \n",
"74 tres 60 LX \n",
"75 1864 1807 1940 \n",
"76 diplom udmærkelse pris \n",
"77 bange urolig nervøs \n",
"78 norsk engelsk spansk \n",
"79 mus tastatur skærm \n",
"80 dør kradser af udånder \n",
"81 og samt endvidere \n",
"82 hans sit vores \n",
"83 stod og råbte lå og sov sad og så \n",
"84 frokost morgenmad brunch \n",
"85 mænd personer individer \n",
"86 kidnappe røve stjæle \n",
"87 råbte skreg larmede \n",
"88 kontorist embedsmand bureaukrat \n",
"89 vegetation krat bed \n",
"90 cyklist bilist chauffør \n",
"91 bibliotek bog låner \n",
"92 halvsyg forkølelse hoster \n",
"93 musik node rytme \n",
"94 rapport sagsakt artikel \n",
"95 tekande vinflaske slikskål \n",
"96 forhører spørger anmoder \n",
"97 fremtidige fortidige nutidige \n",
"98 kanal flod bæk \n",
"99 kanal program udsendelse \n",
"\n",
" word4 fasttext-wembedder-bert fasttext \\\n",
"0 stol stol stol \n",
"1 græs græs græs \n",
"2 bil bil bil \n",
"3 vind tog tog \n",
"4 mandag mandag mandag \n",
"5 tømrer tømrer tømrer \n",
"6 barn barn barn \n",
"7 lampe lampe lampe \n",
"8 jern jern jern \n",
"9 sjov sjov sjov \n",
"10 papir papir papir \n",
"11 vagt vagt vagt \n",
"12 by by by \n",
"13 småkage småkage småkage \n",
"14 dør dør dør \n",
"15 klaver klaver klaver \n",
"16 fandens fandens fandens \n",
"17 vand vand vand \n",
"18 hukommelse hukommelse hukommelse \n",
"19 Niels Bohr Niels Bohr Niels Bohr \n",
"20 Lars Løkke Rasmussen Lars Løkke Rasmussen Ole Rømer \n",
"21 Peter Schmeichel Peter Schmeichel Anders Fogh Rasmussen \n",
"22 Caroline Wozniacki Caroline Wozniacki Caroline Wozniacki \n",
"23 Monaco Monaco Serena Williams \n",
"24 Pia Pia Pia \n",
"25 Ole Marianne Pia \n",
"26 mave mave mave \n",
"27 hat hat hat \n",
"28 ishockey ishockey ishockey \n",
"29 fodbold skiløb skiløb \n",
"30 sidde løbe løbe \n",
"31 himmel himmel himmel \n",
"32 Kina Kina Kina \n",
"33 Irland Irland Irland \n",
"34 beskidt beskidt beskidt \n",
"35 beskidt beskidt beskidt \n",
"36 øl øl øl \n",
"37 køkken køkken køkken \n",
"38 øl øl øl \n",
"39 vinge vinge vinge \n",
"40 kartoffel kartoffel kartoffel \n",
"41 pejs pejs pejs \n",
"42 Kattegat Kattegat Kattegat \n",
"43 Sjælland Sjælland Alssund \n",
"44 slange slange slange \n",
"45 hund hund hund \n",
"46 krappe krappe krappe \n",
"47 hund hund hund \n",
"48 krappe krappe krappe \n",
"49 ræv ræv ræv \n",
"50 klud klud klud \n",
"51 herligt herligt herligt \n",
"52 døende løbende løbende \n",
"53 aer aer aer \n",
"54 siger går går \n",
"55 vandrer lægge sammen lægge sammen \n",
"56 går går går \n",
"57 aldrig aldrig aldrig \n",
"58 kassen Nielsen kassen \n",
"59 smule kæmpe kæmpe \n",
"60 styrelser styrelser styrelser \n",
"61 rum rum rum \n",
"62 sti sti sti \n",
"63 lån lån lån \n",
"64 ur ur ur \n",
"65 nutidigt nutidigt nutidigt \n",
"66 fjerner fjerner fjerner \n",
"67 to piger to piger tre timer \n",
"68 sagen sagen sagen \n",
"69 kaffe anmeldelse anmeldelse \n",
"70 skuffet skuffet skuffet \n",
"71 skål skål skål \n",
"72 park park park \n",
"73 sovs sovs sovs \n",
"74 3 LX tres \n",
"75 1909 1909 1807 \n",
"76 øremærke øremærke øremærke \n",
"77 ordentlig ordentlig ordentlig \n",
"78 falsk falsk falsk \n",
"79 bræt bræt bræt \n",
"80 åbner kradser af kradser af \n",
"81 sin og og \n",
"82 vises vises vises \n",
"83 mand og kvinde stod og råbte stod og råbte \n",
"84 måne måne måne \n",
"85 gange mænd mænd \n",
"86 køre køre køre \n",
"87 vuggede råbte råbte \n",
"88 spisebord spisebord spisebord \n",
"89 skur skur skur \n",
"90 ekspedient ekspedient ekspedient \n",
"91 flag låner låner \n",
"92 vej vej vej \n",
"93 leder leder leder \n",
"94 spand spand spand \n",
"95 racerbil racerbil racerbil \n",
"96 banker banker banker \n",
"97 havdige havdige havdige \n",
"98 spejl spejl spejl \n",
"99 vask vask vask \n",
"\n",
" wembedder bert-corrcoef \n",
"0 kirsebær kirsebær \n",
"1 græs reol \n",
"2 græs bil \n",
"3 bil bil \n",
"4 solskin solskin \n",
"5 tømrer søndag \n",
"6 barn vvs-mand \n",
"7 lampe barn \n",
"8 jern stearinlys \n",
"9 magnesium guld \n",
"10 papir papir \n",
"11 vagt vagt \n",
"12 by by \n",
"13 småkage by \n",
"14 småkage dør \n",
"15 klaver klaver \n",
"16 fandens fandens \n",
"17 vand vand \n",
"18 hukommelse vand \n",
"19 Niels Bohr Niels Bohr \n",
"20 Lars Løkke Rasmussen Niels Bohr \n",
"21 Peter Schmeichel Peter Schmeichel \n",
"22 Caroline Wozniacki Caroline Wozniacki \n",
"23 Monaco Serena Williams \n",
"24 Pia Milano \n",
"25 Marianne Pia \n",
"26 bold fjerbold \n",
"27 hat hat \n",
"28 hue ishockey \n",
"29 fodbold skiløb \n",
"30 løbe sidde \n",
"31 himmel violet \n",
"32 Kina Norge \n",
"33 Irland Irland \n",
"34 komedie beskidt \n",
"35 vaskemaskine vaskemaskine \n",
"36 restaurant restaurant \n",
"37 køkken vin \n",
"38 øl baderum \n",
"39 wing vinge \n",
"40 vinge landingsstel \n",
"41 salat salat \n",
"42 Kattegat Randers \n",
"43 Sjælland Øresund \n",
"44 slange eg \n",
"45 hund pyton \n",
"46 so so \n",
"47 hund krage \n",
"48 hund krappe \n",
"49 ræv spilletid \n",
"50 tv klud \n",
"51 ondt ondt \n",
"52 hoppende døende \n",
"53 saver saver \n",
"54 går går \n",
"55 vandrer lægge sammen \n",
"56 går ved \n",
"57 fire sytten \n",
"58 Nielsen Nielsen \n",
"59 kæmpe kæmpe \n",
"60 styrelser styrelser \n",
"61 rum landstræner \n",
"62 trup gruppe \n",
"63 figur lån \n",
"64 ur ur \n",
"65 muligvis nutidigt \n",
"66 data fjerner \n",
"67 to piger to piger \n",
"68 cockpit sagen \n",
"69 forbrydelse kaffe \n",
"70 organiseret instrueret \n",
"71 tegning tegning \n",
"72 kapitel paragraf \n",
"73 virksomhed virksomhed \n",
"74 LX LX \n",
"75 1909 1909 \n",
"76 udmærkelse pris \n",
"77 bange bange \n",
"78 falsk falsk \n",
"79 skærm tastatur \n",
"80 dør kradser af \n",
"81 sin sin \n",
"82 sit sit \n",
"83 stod og råbte stod og råbte \n",
"84 morgenmad morgenmad \n",
"85 personer personer \n",
"86 køre kidnappe \n",
"87 råbte skreg \n",
"88 embedsmand bureaukrat \n",
"89 vegetation vegetation \n",
"90 chauffør cyklist \n",
"91 låner bibliotek \n",
"92 vej forkølelse \n",
"93 leder node \n",
"94 spand artikel \n",
"95 tekande tekande \n",
"96 banker banker \n",
"97 fremtidige fortidige \n",
"98 flod kanal \n",
"99 vask udsendelse "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"with pd.option_context(\"display.max_rows\", 100):\n",
"    display(four_words[['word1', 'word2', 'word3', 'word4', 'fasttext-wembedder-bert', 'fasttext', 'wembedder', 'bert-corrcoef']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine details\n",
"---------------\n",
"Show the rows where the combined `fasttext-wembedder-bert` model misidentifies the intruder, `word4`."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>word1</th>\n",
" <th>word2</th>\n",
" <th>word3</th>\n",
" <th>word4</th>\n",
" <th>fasttext-wembedder-bert</th>\n",
" <th>fasttext</th>\n",
" <th>wembedder</th>\n",
" <th>bert-corrcoef</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>bil</td>\n",
" <td>cykel</td>\n",
" <td>tog</td>\n",
" <td>vind</td>\n",
" <td>tog</td>\n",
" <td>tog</td>\n",
" <td>bil</td>\n",
" <td>bil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>Pia</td>\n",
" <td>Lone</td>\n",
" <td>Marianne</td>\n",
" <td>Ole</td>\n",
" <td>Marianne</td>\n",
" <td>Pia</td>\n",
" <td>Marianne</td>\n",
" <td>Pia</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>ishockey</td>\n",
" <td>skiløb</td>\n",
" <td>skihop</td>\n",
" <td>fodbold</td>\n",
" <td>skiløb</td>\n",
" <td>skiløb</td>\n",
" <td>fodbold</td>\n",
" <td>skiløb</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>gå</td>\n",
" <td>løbe</td>\n",
" <td>kravle</td>\n",
" <td>sidde</td>\n",
" <td>løbe</td>\n",
" <td>løbe</td>\n",
" <td>løbe</td>\n",
" <td>sidde</td>\n",
" </tr>\n",
" <tr>\n",
" <th>52</th>\n",
" <td>hoppende</td>\n",
" <td>dansende</td>\n",
" <td>løbende</td>\n",
" <td>døende</td>\n",
" <td>løbende</td>\n",
" <td>løbende</td>\n",
" <td>hoppende</td>\n",
" <td>døende</td>\n",
" </tr>\n",
" <tr>\n",
" <th>54</th>\n",
" <td>går</td>\n",
" <td>spadserer</td>\n",
" <td>vandrer</td>\n",
" <td>siger</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" <td>går</td>\n",
" </tr>\n",
" <tr>\n",
" <th>55</th>\n",
" <td>gange</td>\n",
" <td>dividere</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" <td>lægge sammen</td>\n",
" <td>lægge sammen</td>\n",
" <td>vandrer</td>\n",
" <td>lægge sammen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>58</th>\n",
" <td>Nielsen</td>\n",
" <td>Jensen</td>\n",
" <td>Olsen</td>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" <td>kassen</td>\n",
" <td>Nielsen</td>\n",
" <td>Nielsen</td>\n",
" </tr>\n",
" <tr>\n",
" <th>59</th>\n",
" <td>mega</td>\n",
" <td>kæmpe</td>\n",
" <td>enorm</td>\n",
" <td>smule</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" <td>kæmpe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>69</th>\n",
" <td>anmeldelse</td>\n",
" <td>politi</td>\n",
" <td>forbrydelse</td>\n",
" <td>kaffe</td>\n",
" <td>anmeldelse</td>\n",
" <td>anmeldelse</td>\n",
" <td>forbrydelse</td>\n",
" <td>kaffe</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>tres</td>\n",
" <td>60</td>\n",
" <td>LX</td>\n",
" <td>3</td>\n",
" <td>LX</td>\n",
" <td>tres</td>\n",
" <td>LX</td>\n",
" <td>LX</td>\n",
" </tr>\n",
" <tr>\n",
" <th>80</th>\n",
" <td>dør</td>\n",
" <td>kradser af</td>\n",
" <td>udånder</td>\n",
" <td>åbner</td>\n",
" <td>kradser af</td>\n",
" <td>kradser af</td>\n",
" <td>dør</td>\n",
" <td>kradser af</td>\n",
" </tr>\n",
" <tr>\n",
" <th>81</th>\n",
" <td>og</td>\n",
" <td>samt</td>\n",
" <td>endvidere</td>\n",
" <td>sin</td>\n",
" <td>og</td>\n",
" <td>og</td>\n",
" <td>sin</td>\n",
" <td>sin</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>stod og råbte</td>\n",
" <td>lå og sov</td>\n",
" <td>sad og så</td>\n",
" <td>mand og kvinde</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" <td>stod og råbte</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" <td>individer</td>\n",
" <td>gange</td>\n",
" <td>mænd</td>\n",
" <td>mænd</td>\n",
" <td>personer</td>\n",
" <td>personer</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>råbte</td>\n",
" <td>skreg</td>\n",
" <td>larmede</td>\n",
" <td>vuggede</td>\n",
" <td>råbte</td>\n",
" <td>råbte</td>\n",
" <td>råbte</td>\n",
" <td>skreg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>bibliotek</td>\n",
" <td>bog</td>\n",
" <td>låner</td>\n",
" <td>flag</td>\n",
" <td>låner</td>\n",
" <td>låner</td>\n",
" <td>låner</td>\n",
" <td>bibliotek</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" word1 word2 word3 word4 \\\n",
"3 bil cykel tog vind \n",
"25 Pia Lone Marianne Ole \n",
"29 ishockey skiløb skihop fodbold \n",
"30 gå løbe kravle sidde \n",
"52 hoppende dansende løbende døende \n",
"54 går spadserer vandrer siger \n",
"55 gange dividere lægge sammen vandrer \n",
"58 Nielsen Jensen Olsen kassen \n",
"59 mega kæmpe enorm smule \n",
"69 anmeldelse politi forbrydelse kaffe \n",
"74 tres 60 LX 3 \n",
"80 dør kradser af udånder åbner \n",
"81 og samt endvidere sin \n",
"83 stod og råbte lå og sov sad og så mand og kvinde \n",
"85 mænd personer individer gange \n",
"87 råbte skreg larmede vuggede \n",
"91 bibliotek bog låner flag \n",
"\n",
" fasttext-wembedder-bert fasttext wembedder bert-corrcoef \n",
"3 tog tog bil bil \n",
"25 Marianne Pia Marianne Pia \n",
"29 skiløb skiløb fodbold skiløb \n",
"30 løbe løbe løbe sidde \n",
"52 løbende løbende hoppende døende \n",
"54 går går går går \n",
"55 lægge sammen lægge sammen vandrer lægge sammen \n",
"58 Nielsen kassen Nielsen Nielsen \n",
"59 kæmpe kæmpe kæmpe kæmpe \n",
"69 anmeldelse anmeldelse forbrydelse kaffe \n",
"74 LX tres LX LX \n",
"80 kradser af kradser af dør kradser af \n",
"81 og og sin sin \n",
"83 stod og råbte stod og råbte stod og råbte stod og råbte \n",
"85 mænd mænd personer personer \n",
"87 råbte råbte råbte skreg \n",
"91 låner låner låner bibliotek "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"four_words.loc[four_words['word4'] != four_words['fasttext-wembedder-bert'], ['word1', 'word2', 'word3', 'word4', 'fasttext-wembedder-bert', 'fasttext', 'wembedder', 'bert-corrcoef']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the Wembedder correlation matrix for the row of years, where the intruder is the year 1909.\n",
"The Wembedder model is not specific to Danish, so it is surprising that it selects 1909."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 1864 (Q7704) 1807 (Q6909) 1940 (Q18633) 1909 (Q2057)\n",
"1864 (Q7704) 1.000000 0.987776 0.954457 0.952429\n",
"1807 (Q6909) 0.987776 1.000000 0.954275 0.952641\n",
"1940 (Q18633) 0.954457 0.954275 1.000000 0.980791\n",
"1909 (Q2057) 0.952429 0.952641 0.980791 1.000000\n",
"\n",
" 1864 (Q7704) 1807 (Q6909) 1940 (Q18633) 1909 (Q2057)\n",
"count 4.000000 4.000000 4.000000 4.000000\n",
"mean 0.973666 0.973673 0.972381 0.971465\n",
"std 0.023893 0.023879 0.022231 0.023223\n",
"min 0.952429 0.952641 0.954275 0.952429\n",
"25% 0.953950 0.953866 0.954412 0.952588\n",
"50% 0.971117 0.971026 0.967624 0.966716\n",
"75% 0.990832 0.990832 0.985594 0.985594\n",
"max 1.000000 1.000000 1.000000 1.000000\n"
]
}
],
"source": [
"# Row 75 holds the four years: 1864, 1807, 1940 and the intruder 1909\n",
"words = four_words.iloc[75, :4].values.tolist()\n",
"\n",
"# words.append(\"Finn\") # Appending something completely different is also possible\n",
"\n",
"# Find the Wikidata Q-items for the words\n",
"qs = words_to_qs(words)\n",
"labels = [\"{} ({})\".format(word, q) for word, q in zip(words, qs)]\n",
"\n",
"# Embed each Q-item with Wembedder, one row per item\n",
"vector = np.array([wembedder_model.wv[q] for q in qs])\n",
"\n",
"# Show the correlation matrix between the embeddings and its per-column summary\n",
"R = np.corrcoef(vector)\n",
"df = pd.DataFrame(R, columns=labels, index=labels)\n",
"print(df)\n",
"print()\n",
"print(df.describe())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}