@regonn
Last active December 19, 2018 23:59
Julia 1.0 Hyperopt DecisionTree MNIST
using Hyperopt
using DecisionTree
using MLDatasets
using Statistics
train_x, train_y = MNIST.traindata(Float32)
test_x, test_y = MNIST.testdata(Float32)
train_features = Array(transpose(MNIST.convert2features(train_x)))
test_features = Array(transpose(MNIST.convert2features(test_x)))
# Numeric targets would make DecisionTree fit a regression, so convert the labels to strings.
# Broadcasting `string` over the label vectors avoids hard-coding the dataset sizes.
train_y_str = string.(train_y)
test_y_str = string.(test_y)
# Default values
n_folds = 3; n_subfeatures = -1; n_trees = 10; partial_sampling = 0.7; max_depth = -1
min_samples_leaf = 5; min_samples_split = 2; min_purity_increase = 0.0
# Try cross-validation
accuracy = nfoldCV_forest(train_y_str, train_features,
                          n_folds,
                          n_subfeatures,
                          n_trees,
                          partial_sampling,
                          max_depth,
                          min_samples_leaf,
                          min_samples_split,
                          min_purity_increase)
mean(accuracy)
# 0.9747499999999999
model = build_forest(train_y_str, train_features,
                     n_subfeatures,
                     n_trees,
                     partial_sampling,
                     max_depth,
                     min_samples_leaf,
                     min_samples_split,
                     min_purity_increase)
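# Named `predictions` so it does not clash with the `predict` function that DecisionTree exports
predictions = apply_forest(model, test_features)
mean(test_y_str .== predictions)
# 0.941 ← the goal is to beat this accuracy, obtained with the default settings
# In Julia, taking the mean of a Boolean array gives the accuracy directly.
# I didn't know that and spent a while looking for libraries that compute accuracy...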
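# A minimal sketch of the point above, using made-up labels (`demo_truth` and `demo_guess`
# are hypothetical names, not part of the original script): `.==` yields a Boolean array,
# and `mean` counts `true` as 1 and `false` as 0, so the result is the fraction of matches.
using Statistics
demo_truth = ["1", "2", "3", "4"]
demo_guess = ["1", "2", "0", "4"]
demo_accuracy = mean(demo_truth .== demo_guess)
# 3 of the 4 labels match, so demo_accuracy is 0.75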
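# Search for the best values in the neighborhood of the defaults. The variables to vary are
# declared first, in the loop header, and the last expression of the body is the value used
# for evaluation. Hyperopt can currently only minimize, so the body returns 1 minus the
# accuracy, which makes smaller values better.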
ho_forest = @hyperopt for i = 50,
        sampler = RandomSampler(),
        n_folds = 3,
        n_subfeatures = 25:30,
        n_trees = 5:15,
        partial_sampling = 0.6:0.01:0.8,
        max_depth = 5:30,
        min_samples_leaf = 2:10,
        min_samples_split = 2:5,
        min_purity_increase = 0.0
    accuracy = nfoldCV_forest(train_y_str, train_features,
                              n_folds,
                              n_subfeatures,
                              n_trees,
                              partial_sampling,
                              max_depth,
                              min_samples_leaf,
                              min_samples_split,
                              min_purity_increase)
    println(mean(accuracy))  # one accuracy per line instead of a single run-on line
    1 - mean(accuracy)       # objective to minimize
end
# This shows the parameter values that gave the lowest objective among the runs
minimum(ho_forest)
# (Real[3, 26, 14, 0.69, 27, 2, 4, 0.0], 0.006933333333333347)
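# The objective was 1 - mean(accuracy), so the second element of the tuple above converts
# back to a cross-validation accuracy. A small sketch with the value copied from that output
# (`best_loss` and `best_cv_accuracy` are illustrative names, not part of the original script).
# Note that this is the accuracy on the cross-validation folds, not on the test set.
best_loss = 0.006933333333333347
best_cv_accuracy = 1 - best_loss
# roughly 0.9931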
# Re-set the parameters to the best values found
n_folds = 3; n_subfeatures = 26; n_trees = 14; partial_sampling = 0.69; max_depth = 27
min_samples_leaf = 2; min_samples_split = 4; min_purity_increase = 0.0
new_model = build_forest(train_y_str, train_features,
                         n_subfeatures,
                         n_trees,
                         partial_sampling,
                         max_depth,
                         min_samples_leaf,
                         min_samples_split,
                         min_purity_increase)
new_predict = apply_forest(new_model, test_features)
mean(test_y_str .== new_predict)
# 0.953 (with the default settings: 0.941)