
Makoto YUI myui

@myui
myui / sklearn-sparselr-spark-hdfs.py
Last active December 20, 2015 15:19 — forked from MLnick/sklearn-lr-spark.py
Forked to handle large, sparse datasets on HDFS.
import sys
from pyspark.context import SparkContext
from numpy import array, random as np_random
from sklearn import linear_model as lm
from sklearn.base import copy
from scipy import sparse as sp
#MAX_FEATURES = 1000
MAX_FEATURES = 16777216  # 2**24 hashed feature dimensions
@myui
myui / sklearn-denselr-spark-hdfs.py
Last active December 20, 2015 15:29 — forked from MLnick/sklearn-lr-spark.py
Forked to handle large, dense datasets on HDFS.
import sys
from pyspark.context import SparkContext
from numpy import array, random as np_random
from sklearn import linear_model as lm
from sklearn.base import copy
ITERATIONS = 5
np_random.seed(seed=42)
#! /usr/bin/env python
import sys
from sklearn.externals import joblib
from scipy import sparse as sp
MAX_FEATURES = 16777216  # 2**24 hashed feature dimensions
def predict(sgd, line):
@myui
myui / file0.sql
Last active June 29, 2016 13:30
Estimating ad click-through rate (CTR) with Hive/Hivemall ref: http://qiita.com/myui/items/f726ca3dcc48410abe45
create or replace view training2 as
select
rowid,
clicks,
(impression - clicks) as noclick,
mhash(concat("1_", displayurl)) as displayurl,
mhash(concat("2_", adid)) as adid,
...
-1 as bias
from (
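The `mhash()` calls above hash namespaced feature strings into a fixed-size index space so that a fixed-width model can be trained over arbitrary categorical values. A rough Python stand-in, assuming the 16777216 = 2^24 feature space used elsewhere in these gists (Hivemall's `mhash` is MurmurHash3-based; md5 is used here only to keep the sketch dependency-free):

```python
import hashlib

NUM_FEATURES = 16777216  # 2**24, matching the constants used in these gists

def mhash_like(feature, num_features=NUM_FEATURES):
    """Illustrative stand-in for Hivemall's mhash(): map a namespaced feature
    string (e.g. "1_<displayurl>") to a bounded integer index."""
    digest = hashlib.md5(feature.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

# The "1_", "2_" prefixes in the view namespace the columns, so identical
# raw values in displayurl and adid hash to (almost surely) distinct indices.
idx = mhash_like("1_" + "http://example.com")
```

The prefix trick matters because different columns can share raw values (e.g. numeric ids); without the namespace they would collide into the same hashed feature.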
@myui
myui / reservoir_sampling.java
Last active January 2, 2016 13:59
reservoir sampling
T add(T item) {
    T old = null;
    if (position < numSamples) { // reservoir not yet full, just append
        samples[position] = item;
    } else { // find an item to replace
        int replaceIndex = rand.nextInt(position + 1);
        if (replaceIndex < numSamples) { // replacement probability decreases over time
            old = samples[replaceIndex];
            samples[replaceIndex] = item;
        }
    }
    position++; // count every item seen so far
    return old; // the evicted item, or null if nothing was replaced
}
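The same method as a self-contained Python sketch (Vitter's Algorithm R, mirroring the Java above): the first k items fill the reservoir, and each later item replaces a uniformly random slot with probability k/t, so every stream element has an equal chance of surviving.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of k items from a stream of unknown length (Algorithm R)."""
    samples = []
    for position, item in enumerate(stream):
        if position < k:                 # reservoir not yet full, just append
            samples.append(item)
        else:                            # replacement chance shrinks as k/(position+1)
            replace_index = rng.randint(0, position)  # inclusive, like rand.nextInt(position + 1)
            if replace_index < k:
                samples[replace_index] = item
    return samples
```

A single pass and O(k) memory, which is why it suits sampling from large HDFS streams.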
public final class TrainNewsGroups {
public static void main(String[] args) throws IOException {
File base = new File(args[0]);
Multiset<String> overallCounts = HashMultiset.create();
int leakType = 0;
if (args.length > 1) {
leakType = Integer.parseInt(args[1]);
@myui
myui / train10k.scala
Last active August 29, 2015 14:02
Spark training on the KDD Cup 2012 Track 2 dataset
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
val training = MLUtils.loadLibSVMFile(sc, "hdfs://dm01:8020/user/hive/warehouse/kdd12track2.db/training_libsvmfmt_10k", multiclass = false, numFeatures = 16777216, minPartitions = 64)
//val training = MLUtils.loadLibSVMFile(sc, "hdfs://dm01:8020/user/hive/warehouse/kdd12track2.db/training_libsvmfmt_10k", multiclass = false)
val model = LogisticRegressionWithSGD.train(training, numIterations = 1)
//val model = LogisticRegressionWithSGD.train(training, numIterations = 20)
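For intuition, the per-example update that LogisticRegressionWithSGD iterates over the cluster is plain logistic-regression SGD. A toy, single-machine sketch (not Spark code; the function name and data are illustrative):

```python
import numpy as np

def lr_sgd_epoch(X, y, w, lr=0.1):
    """One full pass of plain SGD for logistic regression (labels in {0, 1});
    a toy analogue of one LogisticRegressionWithSGD iteration."""
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi.dot(w)))  # predicted probability of the positive class
        w = w - lr * (p - yi) * xi            # gradient step on the log loss
    return w
```

In MLlib the same gradient is computed per partition and the weight updates are aggregated on the driver each iteration, which is why `numIterations = 1` above is already one full pass over the data.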
@myui
myui / news20b-mllib_logress.md
Last active September 26, 2017 12:35
Classification of news20.binary dataset by LogisticRegressionWithSGD (Spark 1.0 MLlib)
@myui
myui / liblinear_on_spark.md
Last active August 29, 2015 14:02
LIBLINEAR on Spark
@myui
myui / vw-mr.md
Last active August 29, 2015 14:02

Increasing LBFGS passes

[GD: 1 iters LBFGS: 20 iters mapper: 215]

> real    8m13.805s
> AUC  : 0.707109
> NWMAE: 0.049646
> WRMSE: 0.158077
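An AUC figure like the one above can be reproduced offline with scikit-learn; `pairwise_auc` below re-derives the same quantity from its probabilistic definition. The labels and scores are made-up toy values, not the experiment's data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    """AUC as P(random positive scores above a random negative); ties count half."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy predicted CTRs vs. binary click labels (illustrative values only)
y_true = np.array([1, 0, 0, 1, 0])
y_score = np.array([0.8, 0.3, 0.4, 0.9, 0.2])
auc = roc_auc_score(y_true, y_score)
```

Because AUC only depends on the ranking of scores, it is a natural metric for CTR estimation, where predicted probabilities are used to order ad candidates.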