@nishimotz
Last active October 2, 2016 15:57
如法会 2 (⊃ LT駆動開発30) http://nyoho.connpass.com/event/39977/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# An Encounter with \"Machine Learning Meets Python\"\n",
"\n",
"\n",
"## Sunday, October 2, 2016\n",
"\n",
"## @24motz (Takuya Nishimoto)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# PyCon mini Hiroshima 2016\n",
"\n",
"\n",
"http://hiroshima.pycon.jp\n",
"\n",
"* Saturday, November 12, 2016\n",
"* Speakers and attendees are being recruited\n",
"* Co-hosted with IoTLT Hiroshima"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
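},
"source": [
"# \"Machine Learning Meets Python\"\n",
"\n",
"## Original\n",
"\n",
"* https://github.com/tkamishima/mlmpy\n",
"* http://www.kamishima.net/mlmpyja/\n",
"\n",
"## Related books\n",
"\n",
"* O'Reilly Japan: \"Building Machine Learning Systems with Python\" (実践 機械学習システム)\n",
"* O'Reilly Japan: \"Python for Data Analysis\" (Pythonによるデータ分析入門)\n",
"* Gijutsu-Hyoron: \"Introduction to Python for Scientific Computing\" (科学技術計算のためのPython入門)"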
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
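},
"source": [
"# Just try it\n",
"\n",
"* Port nbayes1.py to Python 3 (replace xrange with range)\n",
"\n",
"## What is naive Bayes with categorical features?\n",
"\n",
"* The naive Bayes classifier from chapter 6 of \"Building Machine Learning Systems with Python\""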
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Predicting negative/positive from a word sequence\n",
"\n",
"Assume that no simple rule can decide the label\n",
"\n",
"```\n",
"awesome ----- => posi\n",
"awesome ----- => posi\n",
"awesome crazy => posi\n",
"------- crazy => posi\n",
"------- crazy => nega\n",
"------- crazy => nega\n",
"\n",
"awesome crazy => posi\n",
"awesome crazy => nega\n",
"------- ----- => posi\n",
"------- ----- => nega\n",
"```\n",
"\n",
"* In the bottom four rows, posi and nega are equally likely (50/50)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Defining the categories\n",
"\n",
"## Features\n",
"\n",
"* word absent = 0\n",
"* word present = 1\n",
"\n",
"## Classes\n",
"\n",
"* nega = 0\n",
"* posi = 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
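},
"source": [
"# Sentiment generates the words\n",
"\n",
"* There is a posi model and a nega model\n",
"* The posi model generates awesome/crazy with some probability\n",
"* The nega model generates awesome/crazy with some probability\n",
"\n",
"## What we want to know (prediction)\n",
"\n",
"* The probability of posi/nega given that awesome/crazy is present (or absent)\n",
"* The class with the higher probability is taken as the estimate\n",
"\n",
"## What happened in the past (statistics)\n",
"\n",
"* When the label was posi, was awesome/crazy present (or absent)?\n",
"* When the label was nega, was awesome/crazy present (or absent)?"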
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
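},
"source": [
"# Bayes' theorem\n",
"\n",
"http://www.kamishima.net/mlmpyja/nbayes1/nbayes.html\n",
"\n",
"* Eq. (1), right-hand side: (how likely posi/nega is) * (product of the per-word occurrence probabilities)\n",
"* Eq. (4): (prior probability of a class) = (number of samples in that class) / (total number of samples)\n",
"* Eq. (5): probability that the word awesome occurs given class nega = (number of samples with awesome and nega) / (number of nega samples)\n",
"* Eq. (6), left-hand side: posterior probability => (probability of posi/nega after observing the presence/absence of awesome/crazy)\n",
"* Eq. (6), right-hand side: prior * likelihood => (probability of posi/nega) * (probability that awesome/crazy is present/absent given posi/nega)"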
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\t0\t1\r\n",
"1\t0\t1\r\n",
"1\t1\t1\r\n",
"0\t1\t1\r\n",
"0\t1\t0\r\n",
"0\t1\t0\r\n",
"1\t1\t1\r\n",
"1\t1\t0\r\n",
"0\t0\t1\r\n",
"0\t0\t0\r\n"
]
}
],
"source": [
"%cat tweet2.tsv"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"from nbayes1 import NaiveBayes1"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data = np.genfromtxt('tweet2.tsv', dtype=int)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(10, 3)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 0, 1],\n",
" [1, 0, 1],\n",
" [1, 1, 1],\n",
" [0, 1, 1],\n",
" [0, 1, 0],\n",
" [0, 1, 0],\n",
" [1, 1, 1],\n",
" [1, 1, 0],\n",
" [0, 0, 1],\n",
" [0, 0, 0]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"X=data[:, :-1]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 0],\n",
" [1, 0],\n",
" [1, 1],\n",
" [0, 1],\n",
" [0, 1],\n",
" [0, 1],\n",
" [1, 1],\n",
" [1, 1],\n",
" [0, 0],\n",
" [0, 0]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(10, 2)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.shape"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"y=data[:, -1]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(10,)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.shape"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 1, 1, 0, 0, 1, 0, 1, 0])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"clr = NaiveBayes1()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"clr.fit(X, y)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"predict_y = clr.predict(X)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1 1\n",
"1 1 1\n",
"2 1 1\n",
"3 1 0\n",
"4 0 0\n",
"5 0 0\n",
"6 1 1\n",
"7 0 1\n",
"8 1 1\n",
"9 0 1\n"
]
}
],
"source": [
"# columns: sample index, true label, predicted label\n",
"for i in range(len(y)):\n",
"    print(i, y[i], predict_y[i])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 1 1\n",
"1 1 1\n",
"2 1 1\n",
"3 1 0\n",
"4 0 0\n",
"5 0 0\n",
"6 1 1\n",
"7 0 1\n",
"8 1 1\n",
"9 0 1\n"
]
}
],
"source": [
"# same table, written with enumerate instead of range(len(y))\n",
"for i, yi in enumerate(y):\n",
"    print(i, yi, predict_y[i])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Discussion\n",
"\n",
"* Three samples are misclassified: 3, 7, and 9\n",
"* Accuracy is 70% (7 of 10)\n",
"* Note that this is a closed test (training data = evaluation data)\n",
"* See also BernoulliNB in scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
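},
"source": [
"# Efficiency of writing and execution\n",
"\n",
"* Fancy indexing (indexing with an integer array)\n",
"* Boolean indexing (indexing with a boolean array)\n",
"* Universal functions (ufunc): can be created with frompyfunc/vectorize\n",
"* Broadcasting (operating on arrays after aligning their shapes)\n",
"* Cython"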
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
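},
"source": [
"# Summary\n",
"\n",
"* Solvable, already solved, likely solvable (formulation)\n",
"* Language specification and speed-up techniques (numpy)\n",
"* Slide authoring environment (LaTeX in jupyter)\n",
"\n",
"## Two Pythons\n",
"\n",
"* 2 and 3\n",
"* pip and conda\n"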
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}