http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
head -1000 news20.binary | sed 's/^+1/1/; s/^-1/0/' > news20.binary.1000
sort -R news20.binary > news20.random
head -1000 news20.random | sed 's/^+1/1/; s/^-1/0/' > news20.random.1000
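As a side note, GNU `sort -R` orders lines by a hash of their contents (so duplicate lines stay adjacent); if a true uniform random permutation is wanted, `shuf` is an alternative. A small sketch, assuming GNU coreutils and a hypothetical tiny sample file:

```shell
# Create a tiny libsvm-style sample (hypothetical file names, for illustration).
printf '+1 1:0.5\n-1 2:0.3\n+1 3:0.7\n' > sample.libsvm

# shuf performs a uniform random permutation of the input lines,
# unlike sort -R, which sorts by a hash of each line.
shuf sample.libsvm > sample.random

# Relabel +1/-1 to 1/0, anchored to the start of the line so that
# feature indices and values are never touched.
sed 's/^+1/1/; s/^-1/0/' sample.random > sample.random.01
```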
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
//val training = MLUtils.loadLibSVMFile(sc, "hdfs://dm01:8020/dataset/news20-binary/raw/news20.binary.1000", multiclass=false)
val training = MLUtils.loadLibSVMFile(sc, "hdfs://dm01:8020/dataset/news20-binary/raw/news20.random.1000", multiclass=false)
// val training = MLUtils.loadLibSVMFile(sc, "hdfs://dm01:8020/dataset/news20-binary/raw/news20.random.1000", multiclass=false, numFeatures = 1354731 , minPartitions = 32)
val numFeatures = training.take(1)(0).features.size
//numFeatures: Int = 178560 for news20.binary.1000
//numFeatures: Int = 1354731 for news20.random.1000
val model = LogisticRegressionWithSGD.train(training, numIterations=1)
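For scale, note that even when the training examples are sparse, the weight vector that SGD maintains (and the gradients aggregated each iteration) are dense over all numFeatures dimensions, which for 1,354,731 features is roughly 10 MB per vector. A quick back-of-envelope check in plain Scala (runnable in the spark-shell; the figures are illustrative, not measured):

```scala
// Dense weight/gradient vectors have one Double (8 bytes) per feature,
// regardless of how sparse the individual training examples are.
val numFeatures = 1354731             // dimension observed for news20.random.1000
val bytesPerVector = numFeatures * 8L // one Double per feature
println(f"${bytesPerVector / 1024.0 / 1024.0}%.1f MB per dense vector")
```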
I have evaluated LogisticRegressionWithSGD from Spark 1.0 MLlib on Hadoop 0.20.2-cdh3u6, and it does not work for a sparse dataset even though the number of training examples is only 1,000.
It works fine for news20.binary.1000, which has 178,560 features. However, it does not work for news20.random.1000, which has far more features (1,354,731), even though we load the data as sparse vectors via MLUtils.loadLibSVMFile().
Execution appears to hang: no error is reported in the spark-shell, nor in the stdout/stderr of the executors.
We used 32 executors, each with 7 GB of working memory (2 GB of which is reserved for RDD storage).