atakemura/stfuawsc_itg_day_18.md

## stfuawsc_itg_day_18.md

      
    Raw
  

              stfuawsc_itg_day_18.md
            
          
    はじめに

隠しても仕方ないのであらかじめ書いておきますと、この記事に出てくるコードはUFLDLの演習問題用に書いたOctaveコードをPythonに書き直しただけです。
Q. 実用度は？

A. ほぼありません。判別精度を求めるなら別の方法で組むべきです。

Q. 車輪の再発明？

A. 理解を深めるための意図的な再発明です。

Q. scikit-learnで(ry

(ちゃぶ台がひっくり返る音)
データの準備

分かる人はfetch_mldataでもいいです。

MNISTデータセットといって0-9の手書きの数字の画像のデータセットを使います。ラベル付き教師データが6万件、ラベルなしのテストデータが1万件です。

適当にフォルダをつくり、そこに{train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, t10k-labels-idx1-ubyte.gz}の4ファイルを保存します。

前にも使ったことはありますが、下記の説明の内容がイミフで加工したものを使ってました。
[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000803(2051) magic number 
0004     32 bit integer  60000            number of images 
0008     32 bit integer  28               number of rows 
0012     32 bit integer  28               number of columns 
0016     unsigned byte   ??               pixel 
0017     unsigned byte   ??               pixel 
........ 
xxxx     unsigned byte   ??               pixel
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

いろいろなところを読んでいるうちに、やっと"ビッグエンディアンで4ブロック読んだあとは全部配列にそのまま突っ込めばいい"ということがようやくわかった。(ラベルファイルは2ブロック)

文字列データをPythonで読むのには標準ライブラリのstructを利用
# coding: utf-8

import gzip
import struct
import numpy as np
import scipy.sparse as sp_sparse
import scipy.optimize as sp_optimize


def mnist_data_parser(image_filename, label_filename):
    """
    解凍めんどいからgzipのままで読み込んでnumpy arrayにして吐き出す
    :rtype : nparray
    中身の仕様: http://yann.lecun.com/exdb/mnist/
    image file: 4ブロック=マジックナンバー、画像の数、行数、列数
    残りは0(白)から255(黒)までのint
    label file: 2ブロック=マジックナンバー、ラベル数
    残りは0から9までの整数
    """
    with gzip.open(image_filename, "rb") as image_file:
        _, num_images, num_rows, num_cols = \
            struct.unpack(">IIII", image_file.read(16))
        images = (np.fromstring(image_file.read(), dtype=np.ubyte)
                  .reshape((num_images, num_rows * num_cols))
                  .transpose()
                  .astype(np.float64) / 255)

    with gzip.open(label_filename, "rb") as label_file:
        _, num_labels = struct.unpack(">II", label_file.read(8))
        labels = (np.fromstring(label_file.read(), dtype=np.ubyte))

    return images, labels


# TODO: fix os.path resolution in real script
def load_training_mnist_data():
    return mnist_data_parser("train-images-idx3-ubyte.gz",
                             "train-labels-idx1-ubyte.gz")


def load_testing_mnist_data():
    return mnist_data_parser("t10k-images-idx3-ubyte.gz",
                             "t10k-labels-idx1-ubyte.gz")
これで、
train_images, train_labels = load_training_mnist_data()
とかするとgz形式のままで読んでくれます。次からはおとなしくfetch_mldataつかいます。たぶん。
ソフトマックス回帰の実装

工夫ポイント

ラベルの数字のインデックスを1に、それ以外を0とする、正解データを保有する行列を作るところに疎行列生成の方法を応用した scipy.sparse.csr_matrix

ラベルの数だけ1の配列をつくり、行列のインデックスを指定し、0で埋める範囲を指定。あとはふつうの行列に戻すときれいに0/1の行列ができる
ソフトマックス関数の値を計算するときにオーバーフローしないようにウェイトの最大値を引き算する

class SoftmaxRegression(object):
    def __init__(self, num_classes, input_size, lambda_):
        self.num_classes = num_classes
        self.input_size = input_size
        self.lambda_ = lambda_
        return

    @staticmethod
    def get_indicator_matrix(labels):
        labels = np.array(labels).flatten()
        indptr = np.arange(len(labels) + 1)
        indicator = np.array(
            sp_sparse.csr_matrix((np.ones(len(labels)), labels, indptr))
            .todense().transpose())
        return indicator

    def cost(self, theta, input_data, labels):
        theta = theta.reshape(self.num_classes, self.input_size)
        num_data = input_data.shape[1]
        theta_data = np.dot(theta, input_data)
        theta_data -= np.max(theta)
        class_probability = \
            np.exp(theta_data) / np.sum(np.exp(theta_data), axis=0)
        binary_indicator = self.get_indicator_matrix(labels)

        cost = ((- 1 / num_data) *
                np.sum(binary_indicator * np.log(class_probability)) +
                (self.lambda_ / 2) * np.sum(theta * theta))

        gradient = ((- 1 / num_data) *
                    np.dot((binary_indicator - class_probability),
                           input_data.transpose()) + self.lambda_ * theta)
        gradient = gradient.flatten()
        return cost, gradient

    def predict(self, theta, input_data):
        theta = theta.reshape(self.num_classes, self.input_size)
        theta_ = np.dot(theta, input_data)
        prob = np.exp(theta_) / np.sum(np.exp(theta_), axis=0)
        pred = np.argmax(prob, axis=0)
        return pred

    def train(self, input_data, labels):
        theta = 0.005 * np.random.randn(self.num_classes * self.input_size)
        J = lambda x: self.cost(x, input_data, labels)
        result = sp_optimize.minimize(J, theta, method="L-BFGS-B",
                                      jac=True,
                                      options={"maxiter": 100, "disp": True})
        print result
        opt_theta = result.x
        return opt_theta
勾配が計算できているかチェック

なにをしているかというと、compute_gradientでコスト関数Jの数値的な勾配を出し、SoftmaxRegression.costを叩いた結果の勾配との差を見ています。costメソッドのほうは微分した結果を返しているので、これと数値的な勾配の差が大きければ微分間違えている(か実装ミス)ということがわかります。

勾配の導出をミスるとあとあと最小化するときにうまくいかなかったりするので、いまのうちにチェックしておきます。
def compute_gradient(J, theta):
    epsilon = 0.0001

    gradient = np.zeros(theta.shape)
    for i in range(theta.shape[0]):
        theta_epsilon_plus = np.array(theta, dtype=np.float64)
        theta_epsilon_plus[i] = theta[i] + epsilon
        theta_epsilon_minus = np.array(theta, dtype=np.float64)
        theta_epsilon_minus[i] = theta[i] - epsilon

        gradient[i] = ((J(theta_epsilon_plus)[0] - J(theta_epsilon_minus)[0]) /
                       (2 * epsilon))
        if i % 100 == 0:
            print "Computing gradient for input:", i

    return gradient

# debug statement
def debug_gradient():
    input_size = 100
    n_classes = 10
    input_data = np.random.randn(input_size, 100)
    labels = np.random.randint(n_classes, size=100)
    theta = 0.005 * np.random.randn(n_classes * input_size)

    a = SoftmaxRegression(n_classes, input_size, 0.0001)

    a.cost(theta, input_data, labels)
    c, grad = a.cost(theta, input_data, labels)
    J = lambda x: a.cost(x, input_data, labels)
    num_grad = compute_gradient(J, theta)
    diff = np.linalg.norm(num_grad - grad) / np.linalg.norm(num_grad + grad)
    print diff
    return

# debug
debug_gradient()
下記のような結果になります。
Computing gradient for input: 0
Computing gradient for input: 100
Computing gradient for input: 200
Computing gradient for input: 300
Computing gradient for input: 400
Computing gradient for input: 500
Computing gradient for input: 600
Computing gradient for input: 700
Computing gradient for input: 800
Computing gradient for input: 900
8.13352733761e-10 #<=差分、1^-10なので十分近い

走らせる

# running
images, labels = load_training_mnist_data()

num_classes = 10 # 分類先のクラス数、0から9の数字なので10
input_size = 28 * 28 # ピクセル数。今回は縦横それぞれ28ピクセル白黒
lambda_ = 0.0005 # L2正規化のパラメータ

regressor = SoftmaxRegression(num_classes, input_size, lambda_)
opt_theta = regressor.train(images, labels) # L-BFGS-Bでコスト最小化

test_images, test_labels = load_testing_mnist_data()
prediction = regressor.predict(opt_theta, test_images) # テストデータを使い予測
print "Accuracy: {0:.4f}".format(
    np.sum(prediction == test_labels, dtype=np.float64) / test_labels.shape[0])
  status: 1
 success: False
    nfev: 111
     fun: 14684.10216380824
       x: array([ 0.00264343, -0.00265375,  0.00378028, ...,  0.00128026,
       -0.01140009,  0.00664116])
 message: 'STOP: TOTAL NO. of ITERATIONS EXCEEDS LIMIT'
     jac: array([  1.32171431e-06,  -1.32687540e-06,   1.89013923e-06, ...,
         6.40128052e-07,  -5.70004480e-06,   3.32058197e-06])
     nit: 101
Accuracy: 0.9245

結果は約92%でした
あとがき

こんなので92%だからそんな難しくないのかと思いきや、MNIST公式ページにはVirtualSVM(正答率99.44%)とかConvNet(正答率99.77%)のような化け物が掲載されていました。