Last active
October 2, 2016 15:57
如法会 2 (⊃ LT駆動開発30) http://nyoho.connpass.com/event/39977/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Meeting *Machine Learning Meets Python*\n",
"\n",
"\n",
"## Sunday, October 2, 2016\n",
"\n",
"## @24motz (Takuya Nishimoto)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# PyCon mini Hiroshima 2016\n",
"\n",
"\n",
"http://hiroshima.pycon.jp\n",
"\n",
"* Saturday, November 12, 2016\n",
"* Calling for speakers and attendees\n",
"* Co-hosted with IoTLT Hiroshima"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# *Machine Learning Meets Python*\n",
"\n",
"## Original\n",
"\n",
"* https://github.com/tkamishima/mlmpy\n",
"* http://www.kamishima.net/mlmpyja/\n",
"\n",
"## Related reading\n",
"\n",
"* O'Reilly: 実践 機械学習システム (Japanese edition of *Building Machine Learning Systems with Python*)\n",
"* O'Reilly: Pythonによるデータ分析入門 (Japanese edition of *Python for Data Analysis*)\n",
"* Gijutsu-Hyoron: 科学技術計算のためのPython入門 (an introduction to Python for scientific and technical computing)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Just try it\n",
"\n",
"* Port nbayes1.py to Python 3 (replace xrange with range)\n",
"\n",
"## What is naive Bayes with categorical features?\n",
"\n",
"* The naive Bayes classifier from Chapter 6 of 実践 機械学習システム"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Predicting negative/positive from a word sequence\n",
"\n",
"Assume that no simple rule can decide the label\n",
"\n",
"```\n",
"awesome ----- => posi\n",
"awesome ----- => posi\n",
"awesome crazy => posi\n",
"------- crazy => posi\n",
"------- crazy => nega\n",
"------- crazy => nega\n",
"\n",
"awesome crazy => posi\n",
"awesome crazy => nega\n",
"------- ----- => posi\n",
"------- ----- => nega\n",
"```\n",
"\n",
"* The bottom four cases are split 50/50"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Encoding as categories\n",
"\n",
"## Features\n",
"\n",
"* word absent = 0\n",
"* word present = 1\n",
"\n",
"## Classes\n",
"\n",
"* nega = 0\n",
"* posi = 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Sentiment generates the words\n",
"\n",
"* There is a posi model and a nega model\n",
"* The posi model generates awesome/crazy with some probability\n",
"* The nega model generates awesome/crazy with some probability\n",
"\n",
"## What we want to know (prediction)\n",
"\n",
"* The probability of posi/nega given that awesome/crazy is present (or absent)\n",
"* Take the more probable class as the estimate\n",
"\n",
"## What happened in the past (statistics)\n",
"\n",
"* Whether awesome/crazy was present (or absent) when the label was posi\n",
"* Whether awesome/crazy was present (or absent) when the label was nega"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Bayes' theorem\n",
"\n",
"http://www.kamishima.net/mlmpyja/nbayes1/nbayes.html\n",
"\n",
"* Eq. (1), right-hand side: (how likely posi/nega is) * (product of the per-word occurrence probabilities)\n",
"* Eq. (4): (occurrence probability of a class) = (number of samples in that class) / (total number of samples)\n",
"* Eq. (5): occurrence probability of the word awesome given class nega = (number of samples with awesome and nega) / (number of nega samples)\n",
"* Eq. (6), left-hand side: posterior => (probability of posi/nega after observing whether awesome/crazy is present)\n",
"* Eq. (6), right-hand side: prior * likelihood => (probability of posi/nega) * (probability that awesome/crazy is present/absent given posi/nega)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\t0\t1\r\n",
"1\t0\t1\r\n",
"1\t1\t1\r\n",
"0\t1\t1\r\n",
"0\t1\t0\r\n",
"0\t1\t0\r\n",
"1\t1\t1\r\n",
"1\t1\t0\r\n",
"0\t0\t1\r\n",
"0\t0\t0\r\n"
]
}
],
"source": [
"%cat tweet2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"from nbayes1 import NaiveBayes1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data = np.genfromtxt('tweet2.tsv', dtype=int)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(10, 3)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 0, 1],\n",
"       [1, 0, 1],\n",
"       [1, 1, 1],\n",
"       [0, 1, 1],\n",
"       [0, 1, 0],\n",
"       [0, 1, 0],\n",
"       [1, 1, 1],\n",
"       [1, 1, 0],\n",
"       [0, 0, 1],\n",
"       [0, 0, 0]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"X = data[:, :-1]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 0],\n",
"       [1, 0],\n",
"       [1, 1],\n",
"       [0, 1],\n",
"       [0, 1],\n",
"       [0, 1],\n",
"       [1, 1],\n",
"       [1, 1],\n",
"       [0, 0],\n",
"       [0, 0]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(10, 2)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"y = data[:, -1]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(10,)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.shape"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"clr = NaiveBayes1()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"clr.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predict_y = clr.predict(X)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1 1\n",
"1 1 1\n",
"2 1 1\n",
"3 1 0\n",
"4 0 0\n",
"5 0 0\n",
"6 1 1\n",
"7 0 1\n",
"8 1 1\n",
"9 0 1\n"
]
}
],
"source": [
"for i in range(len(y)):\n",
"    print(i, y[i], predict_y[i])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1 1\n",
"1 1 1\n",
"2 1 1\n",
"3 1 0\n",
"4 0 0\n",
"5 0 0\n",
"6 1 1\n",
"7 0 1\n",
"8 1 1\n",
"9 0 1\n"
]
}
],
"source": [
"for i, yi in enumerate(y):\n",
"    print(i, yi, predict_y[i])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Discussion\n",
"\n",
"* Rows 3, 7, and 9 are misclassified (3 errors)\n",
"* 70% accuracy\n",
"* Note that this is a closed test (training data = evaluation data)\n",
"* scikit-learn BernoulliNB"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Efficiency of writing and running code\n",
"\n",
"* Fancy indexing (indexing with an integer array)\n",
"* Boolean indexing (indexing with a boolean array)\n",
"* Universal functions (ufunc): can be created with frompyfunc/vectorize\n",
"* Broadcasting (operating on arrays after aligning their shapes)\n",
"* Cython"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Summary\n",
"\n",
"* Solvable, solved, likely solvable (formulation)\n",
"* Language features and speed-up techniques (numpy)\n",
"* Slide authoring environment (LaTeX in jupyter)\n",
"\n",
"## Two Pythons\n",
"\n",
"* 2 and 3\n",
"* pip and conda\n"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
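
The counting behind the notebook's `clr.fit(X, y)` and `clr.predict(X)` calls can be sketched without NumPy or the `nbayes1` module. What follows is a minimal sketch, not the `NaiveBayes1` implementation itself. It assumes maximum-likelihood estimates with no smoothing, inlines the ten rows of `tweet2.tsv` as printed in the notebook output, and uses hypothetical helper names (`fit`, `joint`, `predict`) chosen only for this illustration.

```python
from fractions import Fraction

# Rows of tweet2.tsv as printed in the notebook: (awesome, crazy, class)
DATA = [
    (1, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1), (0, 1, 0),
    (0, 1, 0), (1, 1, 1), (1, 1, 0), (0, 0, 1), (0, 0, 0),
]


def fit(rows):
    """Estimate class priors and per-feature likelihoods by counting."""
    n_features = len(rows[0]) - 1
    classes = sorted({row[-1] for row in rows})
    prior = {}
    likelihood = {}  # likelihood[(j, value, c)] = Pr[x_j = value | y = c]
    for c in classes:
        members = [row for row in rows if row[-1] == c]
        # Class prior: samples in the class / all samples
        prior[c] = Fraction(len(members), len(rows))
        for j in range(n_features):
            for value in (0, 1):
                hits = sum(1 for row in members if row[j] == value)
                # Likelihood: samples with this feature value in the class / class size
                likelihood[(j, value, c)] = Fraction(hits, len(members))
    return prior, likelihood


def joint(x, c, prior, likelihood):
    """Prior times the product of feature likelihoods (unnormalized posterior)."""
    p = prior[c]
    for j, value in enumerate(x):
        p *= likelihood[(j, value, c)]
    return p


def predict(x, prior, likelihood):
    """Return the class with the largest unnormalized posterior."""
    return max(prior, key=lambda c: joint(x, c, prior, likelihood))


prior, likelihood = fit(DATA)
predictions = [predict(row[:-1], prior, likelihood) for row in DATA]
accuracy = sum(p == row[-1] for p, row in zip(predictions, DATA)) / len(DATA)

print(predictions)  # [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
print(accuracy)     # 0.7
```

Under these assumptions the sketch gives the same ten predictions as the notebook output, with rows 3, 7, and 9 misclassified. Agreement on this small data set does not show that the two implementations are equivalent in general. `Fraction` keeps the arithmetic exact, so a likelihood of zero would simply zero out the product instead of requiring a logarithm of zero.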