Deep Learning from Scratch 2 — Chapter 2: Natural Language and Distributed Representations of Words
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"コンピュータに単語の意味を理解させるためには。\n",
"- シソーラスによる手法\n",
"- カウントベースの手法\n",
"- 推論ベースの手法 (word2vec) (これは次章)\n",
"\n",
"## 2.2 シソーラス\n",
"\n",
"シソーラス、同じ意味の単語が同じグループに分類されている辞書。\n",
"この手法の問題は、人手で辞書を作成しなければならないこと。\n",
"- 時代の変化に対応するのが困難\n",
"- 人な作業コストが高い\n",
"- 単語の細かなニュアンスを表現できない\n"
]
},
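{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (a sketch, not part of the book's code), the cell below shows what a thesaurus lookup looks like using NLTK's WordNet interface. It assumes the `nltk` package is installed and that the WordNet corpus has been downloaded with `nltk.download('wordnet')`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: query WordNet, a typical thesaurus, via NLTK.\n",
"# Assumes: pip install nltk, and nltk.download('wordnet') has been run.\n",
"from nltk.corpus import wordnet\n",
"\n",
"for synset in wordnet.synsets('car'):\n",
"    # each synset is a group of words that share one meaning\n",
"    print(synset.name(), synset.lemma_names())"
]
},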
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2.3 カウントベースの手法\n",
"\n",
"- コーパスとは?大量のテキストデータが、自然言語処理の研究やアプリケーションのために目的をもって収集されたテキストデータ。\n",
" - Wikipedia, Google News, シェイクスピア, 夏目漱石\n",
"\n",
"ここでは、\"You say goodby and I say hello.\" という文章を使用する。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'you say goodby and i say hello .'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text = 'You say goodby and I say hello.'\n",
"text = text.lower()\n",
"text = text.replace('.', ' .')\n",
"text"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['you', 'say', 'goodby', 'and', 'i', 'say', 'hello', '.']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"words = text.split()\n",
"words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"次に、Pythonのディクショナリを作成して、単語にIDを振ることにする。最後に、文章をIDリストに変換する。"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{0: 'you', 1: 'say', 2: 'goodby', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}\n",
"{'you': 0, 'say': 1, 'goodby': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}\n"
]
},
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4, 1, 5, 6])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_to_id = {}\n",
"id_to_word = {}\n",
"for word in words:\n",
" if word not in word_to_id:\n",
" new_id = len(word_to_id)\n",
" word_to_id[word] = new_id\n",
" id_to_word[new_id] = word\n",
"print(id_to_word)\n",
"print(word_to_id)\n",
"\n",
"import numpy as np\n",
"corpus = [word_to_id[w] for w in words]\n",
"corpus = np.array(corpus)\n",
"corpus"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def preprocess(text):\n",
" text = text.lower()\n",
" text = text.replace('.', ' .')\n",
" words = text.split()\n",
"\n",
" word_to_id = {}\n",
" id_to_word = {}\n",
"\n",
" for word in words:\n",
" if word not in word_to_id:\n",
" new_id = len(word_to_id)\n",
" word_to_id[word] = new_id\n",
" id_to_word[new_id] = word\n",
" \n",
" corpus = np.array([word_to_id[w] for w in words])\n",
" return corpus, word_to_id, id_to_word\n",
"\n",
"text = 'You say goodby and I say hello.'\n",
"corpus, word_to_id, id_to_word = preprocess(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3.2 単語の分散表現\n",
"単語を単語として表現するのではなく、よりコンパクトで理にかなったベクトルとして表現することを、「単語の分散表現」と呼ぶ。\n",
"### 2.3.3 分布仮説\n",
"「単語の意味は、周囲の単語によって形成される」という仮説を「分布仮説」と呼ぶ。\n",
"- コンテキスト\n",
"注目する単語に対して、その周囲に存在する単語を「コンテキスト」と呼ぶ。\n",
"- ウィンドウサイズ\n",
"注目する単語に対する、コンテキストのサイズ。左右の2つの単語までコンテキストに含むなら、ウィンドウサイズは2である。\n",
"\n",
"### 2.3.4 共起行列\n",
"単語をベクトルで表す方法として素直な方法は、周囲の単語をカウントすること。\n",
"例えば上記の例であれば、7つの単語が登場しているので、行列として周囲の単語をカウントする。"
]
},
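{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the matrix, the small sketch below (not from the book) prints the context words of every position in the corpus for a given window size, to make the terms \"context\" and \"window size\" concrete. It reuses the `corpus` and `id_to_word` produced by `preprocess` above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: list the context of each word in the corpus.\n",
"def show_contexts(corpus, id_to_word, window_size=1):\n",
"    for idx, word_id in enumerate(corpus):\n",
"        left = corpus[max(0, idx - window_size):idx]       # words to the left\n",
"        right = corpus[idx + 1:idx + 1 + window_size]      # words to the right\n",
"        context = [id_to_word[i] for i in list(left) + list(right)]\n",
"        print(id_to_word[word_id], '->', context)\n",
"\n",
"show_contexts(corpus, id_to_word, window_size=1)"
]
},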
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 1 2 3 4 1 5 6]\n",
"{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}\n"
]
}
],
"source": [
"text = 'You say goodbye and I say hello.'\n",
"corupus, word_to_id, id_to_word = preprocess(text)\n",
"print (corpus)\n",
"print (id_to_word)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0, 1, 0, 0, 0, 0, 0],\n",
" [1, 0, 1, 0, 1, 1, 0],\n",
" [0, 1, 0, 1, 0, 0, 0],\n",
" [0, 0, 1, 0, 1, 0, 0],\n",
" [0, 1, 0, 1, 0, 0, 0],\n",
" [0, 1, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 1, 0]], dtype=int32)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def create_co_matrix(corpus, vocab_size, window_size=1):\n",
" corpus_size = len(corpus)\n",
" co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)\n",
" \n",
" for idx, word_id in enumerate(corpus):\n",
" for i in range(1, window_size+1):\n",
" left_idx = idx - i\n",
" right_idx = idx + i\n",
" \n",
" if left_idx >= 0:\n",
" left_word_id = corpus[left_idx]\n",
" co_matrix[word_id, left_word_id] += 1\n",
" if right_idx < corpus_size:\n",
" right_word_id = corpus[right_idx]\n",
" co_matrix[word_id, right_word_id] += 1\n",
" \n",
" return co_matrix\n",
"\n",
"create_co_matrix(corpus, len(id_to_word))"
]
},
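{
"cell_type": "markdown",
"metadata": {},
"source": [
"A brief check (a sketch, not from the book): each row of the co-occurrence matrix is the vector for one word, and with a symmetric window the matrix itself is symmetric."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"C = create_co_matrix(corpus, len(id_to_word))\n",
"print(C[word_to_id['goodbye']])   # the vector for 'goodbye': [0 1 0 1 0 0 0]\n",
"print(np.all(C == C.T))           # True: a symmetric window gives a symmetric matrix"
]
},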
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3.5 ベクトル間の類似度\n",
"ベクトル間の類似度を計測する方法は様々あるが、ここでは、「コサイン類似度」を使用する。\n",
"\n",
"下記の`cos_similarity`の実装において、epsを指定しているのはゼロ除算を避けるため。"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.7071067691154799\n"
]
}
],
"source": [
"def cos_similarity(x, y, eps=1e-8):\n",
" nx = x / (np.sqrt(np.sum(x**2)) + eps)\n",
" ny = y / (np.sqrt(np.sum(y**2)) + eps)\n",
" return np.dot(nx, ny)\n",
"\n",
"vocab_size = len(word_to_id)\n",
"C = create_co_matrix(corpus, vocab_size)\n",
"c0 = C[word_to_id['you']]\n",
"c1 = C[word_to_id['i']]\n",
"print(cos_similarity(c0, c1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"上記の結果から、'you'と'i'の類似度は0.70...となり、比較的高いことが分かる。\n",
"\n",
"### 2.3.6 類似単語のランキング表示\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"[query] you\n",
" goodbye 0.7071067691154799\n",
" i 0.7071067691154799\n",
" hello 0.7071067691154799\n",
" say 0.0\n",
" and 0.0\n"
]
}
],
"source": [
"def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):\n",
" if query not in word_to_id:\n",
" print('%s is not found' % query)\n",
" return\n",
" print('\\n[query] ' + query)\n",
" query_id = word_to_id[query]\n",
" query_vec = word_matrix[query_id]\n",
" \n",
" vocab_size = len(id_to_word)\n",
" similarity = np.zeros(vocab_size)\n",
" for i in range(vocab_size):\n",
" similarity[i] = cos_similarity(word_matrix[i], query_vec)\n",
" \n",
" count = 0\n",
" for i in (-1 * similarity).argsort():\n",
" if id_to_word[i] == query:\n",
" continue\n",
" print(' %s %s' % (id_to_word[i], similarity[i]))\n",
" \n",
" count += 1\n",
" if count >= top:\n",
" return\n",
" \n",
"most_similar('you', word_to_id, id_to_word, C, top=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"上記の手法では、'goodbye' や 'hello' に類似度があるのは感覚とズレがあるため、これを改善する。\n",
"\n",
"## 2.4 カウントベース手法の改善\n",
"\n",
"### 2.4.1 相互情報量\n",
"相互情報量というのは、単語xと単語yの発生確率と、xyが同時に共起する確率から以下のように表現される。\n",
"\n",
"[tex: PMI(x,y) = \\log_2\\dfrac{P(x,y)}{P(x)P(y)}]\n",
"\n",
"これを使ってCorpusから相互情報量の行列を作成する。\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"covariance matrix\n",
"[[0 1 0 0 0 0 0]\n",
" [1 0 1 0 1 1 0]\n",
" [0 1 0 1 0 0 0]\n",
" [0 0 1 0 1 0 0]\n",
" [0 1 0 1 0 0 0]\n",
" [0 1 0 0 0 0 1]\n",
" [0 0 0 0 0 1 0]]\n",
"--------------------------------------------------\n",
"PPMI\n",
"[[0. 1.807 0. 0. 0. 0. 0. ]\n",
" [1.807 0. 0.807 0. 0.807 0.807 0. ]\n",
" [0. 0.807 0. 1.807 0. 0. 0. ]\n",
" [0. 0. 1.807 0. 1.807 0. 0. ]\n",
" [0. 0.807 0. 1.807 0. 0. 0. ]\n",
" [0. 0.807 0. 0. 0. 0. 2.807]\n",
" [0. 0. 0. 0. 0. 2.807 0. ]]\n"
]
}
],
"source": [
"def ppmi(C, verbose=False, eps = 1e-8):\n",
" '''PPMI(正の相互情報量)の作成\n",
"\n",
" :param C: 共起行列\n",
" :param verbose: 進行状況を出力するかどうか \n",
" :return:\n",
" '''\n",
" M = np.zeros_like(C, dtype=np.float32)\n",
" N = np.sum(C)\n",
" S = np.sum(C, axis=0)\n",
" total = C.shape[0] * C.shape[1]\n",
" cnt = 0\n",
"\n",
" for i in range(C.shape[0]):\n",
" for j in range(C.shape[1]):\n",
" pmi = np.log2(C[i, j] * N / (S[j]*S[i]) + eps)\n",
" M[i, j] = max(0, pmi)\n",
"\n",
" if verbose:\n",
" cnt += 1\n",
" if cnt % (total//100) == 0:\n",
" print('%.1f%% done' % (100*cnt/total))\n",
" return M\n",
"\n",
"W = ppmi(C)\n",
"\n",
"np.set_printoptions(precision=3)\n",
"print('covariance matrix')\n",
"print(C)\n",
"print('-'*50)\n",
"print('PPMI')\n",
"print(W)"
]
},
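{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (a sketch, not from the book), the PMI of ('you', 'say') can be computed by hand from the co-occurrence matrix `C` and compared with the corresponding entry of the PPMI matrix `W`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hand-computed PMI('you', 'say') from the counts in C.\n",
"N = np.sum(C)                      # total number of co-occurrences (14 here)\n",
"S = np.sum(C, axis=0)              # per-word co-occurrence counts\n",
"i, j = word_to_id['you'], word_to_id['say']\n",
"pmi = np.log2(C[i, j] * N / (S[i] * S[j]))\n",
"print(pmi)        # log2(1 * 14 / (1 * 4)) = log2(3.5), about 1.807\n",
"print(W[i, j])    # should agree with the ppmi() result above"
]
},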
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PPMI行列を作成したが、この作成には時間がかかる。また、0となる空間が多いので、次にベクトルの削減を行う。\n",
"\n",
"### 2.4.2 次元削減\n",
"\n",
"次元削減を行う手法の一つとして、特異値分解(Singlar Value Decomposition:SVD)を行う。\n",
"\n",
"[tex: X = USV^{T}]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 1 0 0 0 0 0]\n",
"[0. 1.807 0. 0. 0. 0. 0. ]\n",
"[ 3.409e-01 0.000e+00 -1.205e-01 -3.886e-16 -9.323e-01 -1.110e-16\n",
" -2.426e-17]\n"
]
}
],
"source": [
"# SVDによる次元削減\n",
"\n",
"U, S, V = np.linalg.svd(W)\n",
"print(C[0])\n",
"print(W[0])\n",
"print(U[0])"
]
},
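{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small check (a sketch, not from the book): `np.linalg.svd` returns `U`, `S`, `V` such that `U @ np.diag(S) @ V` reconstructs `W`, and truncating `U` to its first columns gives the reduced word vectors. A loose tolerance is used because `W` is float32."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the factorization numerically and show the 2-D truncation used below.\n",
"print(np.allclose(U @ np.diag(S) @ V, W, atol=1e-4))  # True: full reconstruction\n",
"word_vecs_2d = U[:, :2]    # keep only the first 2 columns as 2-D word vectors\n",
"print(word_vecs_2d.shape)  # (7, 2)"
]
},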
{
"cell_type": "markdown",
"metadata": {},
"source": [
"各単語を2次元のベクトルで表し、それをグラフにプロットする。\n",
"\n",
"'i' と 'you', 'goodbye' と 'hello'が近いので、ある程度直観に近い。"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.collections.PathCollection at 0x7f0613bcfbe0>"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAD8CAYAAACRkhiPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAGqJJREFUeJzt3H90VfWZ7/H3QxJMRuQEUUMKRrCiVRMQOCjWilp+5ba2Qqm/WilqaSrqTNt765IuXK1WZy5W7lXrsNqJXn5ovSMDXJXRyhBQi/hjJNgEQdSIYiGNwVITBQMCee4f2aSHzAkJ7pNzQvbntVZW9nefZ+/vk53D+WTvfQ7m7oiISDT1ynQDIiKSOQoBEZEIUwiIiESYQkBEJMIUAiIiEaYQEBGJMIWAiEiEKQRERCJMISAiEmHZmW6gPSeccIIPHjw4022IiBxV1q9f/xd3P7Gz9d02BAYPHkxlZWWm2xAROaqY2ftHUq/LQSIiEaYQEBGJMIWAiEiEKQRERCJMISAiEmEKAZGj3Je//OWU73Pr1q0UFxcDsHDhQm6++eaUzyGHSjzmnXH77bczd+5cAK699lqWLl36ueZVCIgc5V566aVMtyBHMYWAyGH8/Oc/57777msdz549m/vvv59bbrmF4uJiSkpKWLx4MQDPP/88l156aWvtzTffzMKFC7u8xz59+nDnnXdyxhln8JWvfIWrr76auXPnUlVVxZgxYxg2bBhTpkzho48+Amh3/fr16xk+fDjDhw9n3rx5h8yxbds2Lr74YoYOHcodd9wBtH9sAO655x5Gjx7NsGHD+MUvftHlx6CnOHDgAD/4wQ84++yzmThxIk1NTWzZsoXS0lJGjRrFhRdeyJtvvtnRbo4zsz+a2etmNt/MjjlcsUJA5DCuv/56Hn74YQCam5t57LHHGDRoEFVVVVRXV7Nq1SpuueUW6urqMtZjc3Mzy5Yto7q6mmeeeab1Q5bf+973uPvuu9mwYQMlJSWtL97trb/uuut44IEHqK6u/i9zvPrqqyxbtowNGzawZMkSKisrkx6ba665hpUrV1JTU8Orr75KVVUV69evZ82aNWk6Gke3mpoabrrpJjZt2kR+fj7Lli2jrKyMBx54gPXr1zN37lxuvPHGdrffs2cPwBDgSncvoeUDwTMPN2dKPjFsZqXA/UAW8JC7z2nz+DHAw8AoYGfQ4NZUzC3SFTbXNbJiYz21DU3sJo9lK9dwbPOnjBgxgrVr13L11VeTlZVFQUEBF110EevWraNv375p6+/pDbUsevlP1H+8h72f7eesMZeQm5tLbm4u3/jGN9i9ezcNDQ1cdNFFAEyfPp3LL7+cxsbGpOsbGhpoaGhg7NixAEybNo1nnnmmdb4JEybQv39/AL71rW+xdu1afvzjH9O/f3/++Mc/Ul9fz4gRI+jfvz8rV65k5cqVjBgxAoBdu3ZRU1PTum/5m8TnWd6enQwsOoVzzjkHgFGjRrF161ZeeuklLr/88tZt9u7d2+7+3nrrLYC97v52sGoRcBNwX3vbhA4BM8sC5gETgO3AOjNb7u5vJJR9H/jI3U8zs6uAu4Erw84t0hU21zVSvuY9Ynk5FMZyKRk3hbvu/S0Dcvbw9zfMoKKiIul22dnZNDc3t46Dv8pS7ukNtcx55i2OPSabk/r0xoG17+zk6Q21fH3YwC6Z08ySjmfMmMHChQv54IMPuP766wFwd372s5/xwx/+sEt66SnaPs+2Nexn9z5jc10jZxbGyMrKor6+nvz8fKqqqrqsj1RcDjoXeMfd33X3z4DHgMva1FxGSyIBLAXGWdtnlUg3sWJjPbG8HGJ5OfQy47xLStm24WVeXbeOSZMmceGFF7J48WIOHDjAhx9+yJo1azj33HM55ZRTeOONN9i7dy8NDQ2sXr26S/pb9PKfOPaY7Jb+evWiV69eNLz5CvPX1LBr1y6eeuopjj32WPr168cLL7wAwCOPPMJFF11ELBZLuj4/P5/8/HzWrl0LwKOPPnrInBUVFfz1r3+lqamJJ554ggsuuACAKVOmsGLFCtYFxwZg0qRJzJ8/n127dgFQW1vLjh07uuRYHM3aPs+Oy82mVy9jxcb61pq+ffsyZMgQlixZArQEbLLLdQedccYZAL3N7LRg1TTgD4frIxWXgwYC2xLG24Hz2qtx9/1m1gj0B/6SWGRmZUAZQFFRUQpaEzlytQ1NFMZyW8fZOb0Zes55HMj5O7KyspgyZQovv/wyw4cPx8z41a9+xYABAwC44oorKC4uZsiQIa2XQ1Kt/uM9nNSnd+vYevVi0PCv8Mwd0/hviwdTUlJCLBZj0aJF3HDDDXz66aeceuqpLFiwAKDd9QsWLOD666/HzJg4ceIhc5577rlMnTqV7du3c8011xCPxwHo3bs3l1xyCfn5+WRlZQEwceJENm/ezPnnnw+03Lj+3e9+x0knndQlx+No1fZ5BtDLjNqGpkPWPfroo8ycOZO77rqLffv2cdVVVzF8+PCk+8zNzQXYCiwxs2xgHfDbw/Vh7v65fwgAM/s2UOruM4LxNOA8d785oWZjULM9GG8Jav6SbJ8A8Xjc9b+ISibcW/E2jU37iOXlAC03Pe+ZOZnrf/5r/unaiR1s3fWu+JeX+TihP4CdDY0cnx9j4bThjB07lvLyckaOHNnlvTQ3NzNy5EiWLFnC0KFDu3y+nqTt8wxoHf9kwumfe79mtt7d452tT8XloFrg5ITxoGBd0pognWK03CAW6XZKiwtobNpHY9M+/ry1hrumT2DgWaOZNqntCW5mTD+/iN1799PYtI/m5mYam/ax4V/vofLeGYwcOZKpU6emJQDeeOMNTjvtNMaNG6cA+BwSn2fN7q3LpcUFae0jFWcC2cDbwDhaXuzXAd9x900JNTcBJe5+Q3Bj+FvufsXh9qszAcmkxHdtDMzPo7S4gDMLY5luq1Xiu4MK+uYy/fyiLrspLF2nK55nR3omEDoEgkm/RstbkLKA+e7+j2b2S6DS3ZebWS7wCDAC+Ctwlbu/e7h9KgRERI7ckYZASj4n4O6/B37fZt3PE5b3AJe33U5ERDJLnxgWEYkwhYCISIQpBEREIkwhICISYQoBEZEIUwiIiESYQkBEJMIUAiIiEaYQEBGJMIWAiEiEKQRERCJMISAiEmEKARGRCFMIiIhEmEJARCTCFAIiIhGmEBARiTCFgIhIhCkEREQiTCEgIhJhCgERkQgLFQJmdryZVZhZTfC9Xzt1K8yswcyeCjOfiIikVtgzgVnAancfCqwOxsncA0wLOZeIiKRY2BC4DFgULC8CJicrcvfVwCch5xIRkRQLGwIF7l4XLH8AFITcn4iIpFF2RwVmtgoYkOSh2YkDd3cz8zDNmFkZUAZQVFQUZlciItIJHYaAu49v7zEzqzezQnevM7NCYEeYZty9HCgHiMfjoQJFREQ6FvZy0HJgerA8HXgy5P5ERCSNwobAHGCCmdUA44MxZhY3s4cOFpnZC8ASYJyZbTezSSHnFRGRFOjwctDhu
PtOYFyS9ZXAjITxhWHmERGRrqFPDIuIRJhCQEQkwhQCIiIRphAQEYkwhYCISIQpBEREIkwhICISYQoBEZEIUwiIiESYQkBEJMIUAiIiEaYQEBGJMIWAiEiEKQRERCJMISAiEmEKARGRCFMIiIhEmEJARCTCFAIiIhGmEBARiTCFgIhIhIUKATM73swqzKwm+N4vSc05ZvaymW0ysw1mdmWYOUVEJHXCngnMAla7+1BgdTBu61Pge+5+NlAK3Gdm+SHnFRGRFAgbApcBi4LlRcDktgXu/ra71wTLfwZ2ACeGnFdERFIgbAgUuHtdsPwBUHC4YjM7F+gNbAk5r4iIpEB2RwVmtgoYkOSh2YkDd3cz88PspxB4BJju7s3t1JQBZQBFRUUdtSYiIiF1GALuPr69x8ys3swK3b0ueJHf0U5dX+BpYLa7v3KYucqBcoB4PN5uoIiISGqEvRy0HJgeLE8HnmxbYGa9gceBh919acj5REQkhcKGwBxggpnVAOODMWYWN7OHgporgLHAtWZWFXydE3JeERFJAXPvnldd4vG4V1ZWZroNEZGjipmtd/d4Z+v1iWERkQhTCIiIRJhCQEQkwhQCIiIRphAQEYkwhYCISIQpBEREIkwhICISYQoBEZEIUwiIiESYQkBEJMIUAiIiEaYQEBGJMIWAiEiEKQRERCJMISAiEmEKARGRCFMIiIhEmEJARCTCFAIiIhGmEBARibBQIWBmx5tZhZnVBN/7Jak5xcxeM7MqM9tkZjeEmVNERFIn7JnALGC1uw8FVgfjtuqA8939HOA8YJaZfSHkvCIikgJhQ+AyYFGwvAiY3LbA3T9z973B8JgUzCkiIikS9gW5wN3rguUPgIJkRWZ2spltALYBd7v7n0POKyIiKZDdUYGZrQIGJHloduLA3d3MPNk+3H0bMCy4DPSEmS119/okc5UBZQBFRUWdaF9ERMLoMATcfXx7j5lZvZkVunudmRUCOzrY15/NbCNwIbA0yePlQDlAPB5PGigiIpI6YS8HLQemB8vTgSfbFpjZIDPLC5b7AV8B3go5r4iIpEDYEJgDTDCzGmB8MMbM4mb2UFBzJvCfZlYN/AGY6+6vh5xXRERSoMPLQYfj7juBcUnWVwIzguUKYFiYeUREpGvo7ZoiIhGmEBARiTCFgIhIhCkEREQiTCEgIhJhCgERkQhTCIiIRJhCQEQkwhQCIiIRphAQEYkwhYCISIQpBEREIkwhICISYQoBEZEIUwiIiESYQkBEJMIUAiIiEaYQEBGJMIWAiEiEKQRERCJMISAiEmGhQsDMjjezCjOrCb73O0xtXzPbbmb/HGZOERFJnbBnArOA1e4+FFgdjNtzJ7Am5HwiIpJCYUPgMmBRsLwImJysyMxGAQXAypDziYhICoUNgQJ3rwuWP6Dlhf4QZtYL+F/AT0POJSIiKZbdUYGZrQIGJHloduLA3d3MPEndjcDv3X27mXU0VxlQBlBUVNRRayIiElKHIeDu49t7zMzqzazQ3evMrBDYkaTsfOBCM7sR6AP0NrNd7v5f7h+4ezlQDhCPx5MFioiIpFCHIdCB5cB0YE7w/cm2Be7+3YPLZnYtEE8WACIikn5h7wnMASaYWQ0wPhhjZnEzeyhscyIi0rXMvXtedYnH415ZWZnpNkREjipmtt7d452t1yeGRUQiTCEgIhJhCgERkQhTCIiIRJhCQEQkwhQCIiIRphAQEYkwhYCISIQpBEREIkwhICISYQoBEZEIUwiIiESYQkBEJMIUAiIiEaYQEBGJMIWAiEiEKQRERCJMIdBJffr0yXQLIiIppxAQEYmwSIXA5MmTGTVqFGeffTbl5eVAy1/4s2fPZvjw4YwZM4b6+noA3nvvPc4//3xKSkq47bbbMtm2iEiXiVQIzJ8/n/Xr11NZWcmvf/1rdu7cye7duxkzZgzV1dWMHTuWBx98EIAf/ehHzJw5k9dff53CwsIMdy4i0jWyw2xsZscDi4HBwFbgCnf/KEndAeD1YPgnd/9mmHk7a3NdIys21lPb0MTA/DzeWTGftaueAWDbtm3U1NTQu3dvLr30UgBGjRpFRUUFAC+++CLLli0DYNq0adx6663paFlEJK3CngnMAla7+1BgdTBOpsndzwm+0hYA5Wveo7FpH4WxXKpffZEnnv4PFvy/FVRXVzNixAj27NlDTk4OZgZAVlYW+/fvb93HwfUiIj1V2BC4DFgULC8CJofcX8qs2FhPLC+HWF4OvczI2t9En74x/vDuJ7z55pu88sorh93+ggsu4LHHHgPg0UcfTUfLIiJpFzYECty9Llj+AChopy7XzCrN7BUzS0tQ1DY0cVzu3652fSk+FvNm7rqulFmzZjFmzJjDbn///fczb948SkpKqK2t7ep2RUQywtz98AVmq4ABSR6aDSxy9/yE2o/cvV+SfQx091ozOxV4Fhjn7luS1JUBZQBFRUWj3n///SP6YRLdW/E2jU37iOXltK47OP7JhNM/935FRLozM1vv7vHO1nd4JuDu4929OMnXk0C9mRUGExcCO9rZR23w/V3geWBEO3Xl7h539/iJJ57Y2Z8hqdLiAhqb9tHYtI9m99bl0uL2TlZERKIn7OWg5cD0YHk68GTbAjPrZ2bHBMsnABcAb4Sct0NnFsYoGzuEWF4OdY17iOXlUDZ2CGcWxrp6ahGRo0aot4gCc4B/M7PvA+8DVwCYWRy4wd1nAGcC/2JmzbSEzhx37/IQgJYg0Iu+iEj7QoWAu+8ExiVZXwnMCJZfAkrCzCMiIl0jUp8YFhGRQykEREQiTCEgIhJhCgERkQhTCIiIRJhCQEQkwhQCIiIRphAQEYkwhYCISIQpBEREIkwhICISYQoBEZEIUwiIiESYQkBEJMIUAiIiEaYQEBGJMIWAiEiEKQRERCJMISAiEmGRCYHdu3fz9a9/neHDh1NcXMzixYv55S9/yejRoykuLqasrAx3Z8uWLYwcObJ1u5qamkPGIiI9SWRCYMWKFXzhC1+gurqajRs3Ulpays0338y6devYuHEjTU1NPPXUU3zxi18kFotRVVUFwIIFC7juuusy3L2ISNfo0SGwua6Reyve5qdLqqn8uA+/X/Ef3HrrrbzwwgvEYjGee+45zjvvPEpKSnj22WfZtGkTADNmzGDBggUcOHCAxYsX853vfCfDP4mISNfIDrOxmR0PLAYGA1uBK9z9oyR1RcBDwMmAA19z961h5u7I5rpGyte8Rywvh8JYLp8cM4hv3v4Ixze9xW233ca4ceOYN28elZWVnHzyydx+++3s2bMHgKlTp3LHHXfw1a9+lVGjRtG/f/+ubFVEJGPCngnMAla7+1BgdTBO5mHgHnc/EzgX2BFy3g6t2FhPLC+HWF4Ovczg07/SP3Ycvc+4mFtuuYXXXnsNgBNOOIFdu3axdOnS1m1zc3OZNGkSM2fO1KUgEenRQp0JAJcBFwfLi4DngVsTC8zsLCDb3SsA3H1XyDk7pbahicJYbuu47r23+fcHf8X+ZjjlxL785je/4YknnqC4uJgBAwYwevToQ7b/7ne/
y+OPP87EiRPT0a6ISEaYu3/+jc0a3D0/WDbgo4PjhJrJwAzgM2AIsAqY5e4HkuyvDCgDKCoqGvX+++9/7t7urXibxqZ9xPJyWtcdHP9kwukdbj937lwaGxu58847P3cPIiLpZmbr3T3e2foOzwTMbBUwIMlDsxMH7u5mlixRsoELgRHAn2i5h3At8H/aFrp7OVAOEI/HP386AaXFBZSveQ+A43Kz+WTPfhqb9nHl6EEdbjtlyhS2bNnCs88+G6YFEZFur8MQcPfx7T1mZvVmVujudWZWSPJr/duBKnd/N9jmCWAMSUIglc4sjFE2dggrNtZT29DEwPw8rhw9iDMLYx1u+/jjj3dlayIi3UbYewLLgenAnOD7k0lq1gH5Znaiu38IfBWoDDlvp5xZGOvUi76ISFSFfXfQHGCCmdUA44MxZhY3s4cAgmv/PwVWm9nrgAEPhpxXRERSINSZgLvvBMYlWV9Jy83gg+MKYFiYuUREJPXCXg7q1jbXNR5yT6C0uECXh0REEvTY/zbi4CeGG5v2URjLpbFpH+Vr3mNzXWOmWxMR6TZ6bAi0/cRwLC+Hp+75Bxb/YUOmWxMR6TZ6bAjUNjRxXO6hV7tu+KcH2ZXVN0MdiYh0Pz02BAbm5/HJnv2HrPtkz34G5udlqCMRke6nx4ZAaXEBjU37aGzaR7N763JpcUGmWxMR6TZ6bAgc/MRwLC+HusY9xPJyKBs7RO8OEhFJ0KPfIqpPDIuIHF6PPRMQEZGOKQRERCJMISAiEmEKARGRCFMIiIhEmEJARCTCFAIiIhGmEBARiTCFgIhIhJm7Z7qHpMzsQ+D9FO3uBOAvKdpXV1KfqaU+U0t9pk5X9niKu5/Y2eJuGwKpZGaV7h7PdB8dUZ+ppT5TS32mTnfqUZeDREQiTCEgIhJhUQmB8kw30EnqM7XUZ2qpz9TpNj1G4p6AiIgkF5UzARERSaJHhYCZlZrZW2b2jpnNSvL4MWa2OHj8P81scPq77FSfY83sNTPbb2bfzkSPQR8d9fnfzewNM9tgZqvN7JRu2ucNZva6mVWZ2VozO6s79plQN9XM3MzS/u6RThzLa83sw+BYVpnZjHT32Jk+g5orgufnJjP7v+nuMeiho+N5b8KxfNvMGtLepLv3iC8gC9gCnAr0BqqBs9rU3Aj8Nli+CljcTfscDAwDHga+3Y2P5yXA3wXLM7vx8eybsPxNYEV37DOoOw5YA7wCxLtbj8C1wD9n4jl5hH0OBf4I9AvGJ3XHPtvU/z0wP9199qQzgXOBd9z9XXf/DHgMuKxNzWXAomB5KTDOzCyNPUIn+nT3re6+AWhOc2+JOtPnc+7+aTB8BRiU5h6hc31+nDA8FsjEjbDOPD8B7gTuBvaks7lAZ3vMtM70+QNgnrt/BODuO9LcIxz58bwa+Ne0dJagJ4XAQGBbwnh7sC5pjbvvBxqB/mnpLkkPgWR9dgdH2uf3gWe6tKPkOtWnmd1kZluAXwH/kKbeEnXYp5mNBE5296fT2ViCzv7OpwaXAJea2cnpae0QnenzdOB0M3vRzF4xs9K0dfc3nf43FFxKHQI8m4a+DtGTQkAyxMyuAeLAPZnupT3uPs/dvwjcCtyW6X7aMrNewP8G/keme+nAvwOD3X0YUMHfzqy7m2xaLgldTMtf2A+aWX5GOzq8q4Cl7n4g3RP3pBCoBRL/KhkUrEtaY2bZQAzYmZbukvQQSNZnd9CpPs1sPDAb+Ka7701Tb4mO9Hg+Bkzu0o6S66jP44Bi4Hkz2wqMAZan+eZwh8fS3Xcm/J4fAkalqbdEnfmdbweWu/s+d38PeJuWUEinI3luXkUGLgUBPerGcDbwLi2nVAdvwpzdpuYmDr0x/G/dsc+E2oVk7sZwZ47nCFpufA3t5r/3oQnL3wAqu2OfbeqfJ/03hjtzLAsTlqcAr3THYwmUAouC5RNouSzTv7v1GdR9CdhK8LmttB/PTEzahQf9a7Qk/hZgdrDul7T8lQqQCywB3gFeBU7tpn2OpuUvmd20nKls6qZ9rgLqgarga3k37fN+YFPQ43OHe/HNZJ9tatMeAp08lv8zOJbVwbH8Unc8loDRcnntDeB14Kru2Gcwvh2Yk4n+3F2fGBYRibKedE9ARESOkEJARCTCFAIiIhGmEBARiTCFgIhIhCkEREQiTCEgIhJhCgERkQj7/6+ej9jUWAXBAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"for word, word_id in word_to_id.items():\n",
" plt.annotate(word, (U[word_id, 0], U[word_id, 1]))\n",
"plt.scatter(U[:,0], U[:,1], alpha=0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4.4 PTBデータセット\n",
"\n",
"PTBデータセットは、Penn Treebankとyばれるコーパス。本格的なコーパスである。"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading ptb.train.txt ... \n",
"Done\n",
"corpus size: 929589\n",
"corpus[:30]: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23\n",
" 24 25 26 27 28 29]\n",
"\n",
"id_to_word[0]: aer\n",
"id_to_word[1]: banknote\n",
"id_to_word[2]: berlitz\n",
"\n",
"word_to_id['car']: 3856\n",
"word_to_id['happy']: 4428\n",
"word_to_id['lexus']: 7426\n"
]
}
],
"source": [
"# coding: utf-8\n",
"import sys\n",
"sys.path.append('..')\n",
"from dataset import ptb\n",
"\n",
"\n",
"corpus, word_to_id, id_to_word = ptb.load_data('train')\n",
"\n",
"print('corpus size:', len(corpus))\n",
"print('corpus[:30]:', corpus[:30])\n",
"print()\n",
"print('id_to_word[0]:', id_to_word[0])\n",
"print('id_to_word[1]:', id_to_word[1])\n",
"print('id_to_word[2]:', id_to_word[2])\n",
"print()\n",
"print(\"word_to_id['car']:\", word_to_id['car'])\n",
"print(\"word_to_id['happy']:\", word_to_id['happy'])\n",
"print(\"word_to_id['lexus']:\", word_to_id['lexus'])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.4.5 PTBデータセットでの評価\n",
"\n",
"PTBデータセットを使ってカウントベースの手法を評価する。\n",
"SVDは自前のものを使ってもよいが、高速化するためにsklearnモジュールを使用する。\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"counting co-occurrence ...\n",
"calculating PPMI ...\n",
"1.0% done\n",
"2.0% done\n",
"3.0% done\n",
"4.0% done\n",
"5.0% done\n",
"6.0% done\n",
"7.0% done\n",
"8.0% done\n",
"9.0% done\n",
"10.0% done\n",
"11.0% done\n",
"12.0% done\n",
"13.0% done\n",
"14.0% done\n",
"15.0% done\n",
"16.0% done\n",
"17.0% done\n",
"18.0% done\n",
"19.0% done\n",
"20.0% done\n",
"21.0% done\n",
"22.0% done\n",
"23.0% done\n",
"24.0% done\n",
"25.0% done\n",
"26.0% done\n",
"27.0% done\n",
"28.0% done\n",
"29.0% done\n",
"30.0% done\n",
"31.0% done\n",
"32.0% done\n",
"33.0% done\n",
"34.0% done\n",
"35.0% done\n",
"36.0% done\n",
"37.0% done\n",
"38.0% done\n",
"39.0% done\n",
"40.0% done\n",
"41.0% done\n",
"42.0% done\n",
"43.0% done\n",
"44.0% done\n",
"45.0% done\n",
"46.0% done\n",
"47.0% done\n",
"48.0% done\n",
"49.0% done\n",
"50.0% done\n",
"51.0% done\n",
"52.0% done\n",
"53.0% done\n",
"54.0% done\n",
"55.0% done\n",
"56.0% done\n",
"57.0% done\n",
"58.0% done\n",
"59.0% done\n",
"60.0% done\n",
"61.0% done\n",
"62.0% done\n",
"63.0% done\n",
"64.0% done\n",
"65.0% done\n",
"66.0% done\n",
"67.0% done\n",
"68.0% done\n",
"69.0% done\n",
"70.0% done\n",
"71.0% done\n",
"72.0% done\n",
"73.0% done\n",
"74.0% done\n",
"75.0% done\n",
"76.0% done\n",
"77.0% done\n",
"78.0% done\n",
"79.0% done\n",
"80.0% done\n",
"81.0% done\n",
"82.0% done\n",
"83.0% done\n",
"84.0% done\n",
"85.0% done\n",
"86.0% done\n",
"87.0% done\n",
"88.0% done\n",
"89.0% done\n",
"90.0% done\n",
"91.0% done\n",
"92.0% done\n",
"93.0% done\n",
"94.0% done\n",
"95.0% done\n",
"96.0% done\n",
"97.0% done\n",
"98.0% done\n",
"99.0% done\n",
"100.0% done\n",
"calculating SVD ...\n",
"\n",
"[query] you\n",
" i: 0.700317919254303\n",
" we: 0.6367185115814209\n",
" anybody: 0.565764307975769\n",
" do: 0.563567042350769\n",
" 'll: 0.5127798318862915\n",
"\n",
"[query] year\n",
" month: 0.6961644291877747\n",
" quarter: 0.6884941458702087\n",
" earlier: 0.6663320660591125\n",
" last: 0.6281364560127258\n",
" next: 0.6175755858421326\n",
"\n",
"[query] car\n",
" luxury: 0.6728832125663757\n",
" auto: 0.6452109813690186\n",
" vehicle: 0.6097723245620728\n",
" cars: 0.6032834053039551\n",
" corsica: 0.5698372721672058\n",
"\n",
"[query] toyota\n",
" motor: 0.7585658431053162\n",
" nissan: 0.7148030996322632\n",
" motors: 0.6926157474517822\n",
" lexus: 0.6583304405212402\n",
" honda: 0.6350275278091431\n"
]
}
],
"source": [
"# coding: utf-8\n",
"import sys\n",
"sys.path.append('..')\n",
"import numpy as np\n",
"from common.util import most_similar, create_co_matrix, ppmi\n",
"from dataset import ptb\n",
"\n",
"\n",
"window_size = 2\n",
"wordvec_size = 100\n",
"\n",
"corpus, word_to_id, id_to_word = ptb.load_data('train')\n",
"vocab_size = len(word_to_id)\n",
"print('counting co-occurrence ...')\n",
"C = create_co_matrix(corpus, vocab_size, window_size)\n",
"print('calculating PPMI ...')\n",
"W = ppmi(C, verbose=True)\n",
"\n",
"print('calculating SVD ...')\n",
"try:\n",
" # truncated SVD (fast!)\n",
" from sklearn.utils.extmath import randomized_svd\n",
" U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,\n",
" random_state=None)\n",
"except ImportError:\n",
" # SVD (slow)\n",
" U, S, V = np.linalg.svd(W)\n",
"\n",
"word_vecs = U[:, :wordvec_size]\n",
"\n",
"querys = ['you', 'year', 'car', 'toyota']\n",
"for query in querys:\n",
" most_similar(query, word_to_id, id_to_word, word_vecs, top=5)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"sklearnの`randomized_svd()`というメソッドを使用する。Truncated SVDを使用し、乱数を使うので実行結果は毎回異なる。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}