pschatzmann/NaiveBayesMultinomial.ipynb

## NaiveBayesMultinomial.ipynb
{"metadata":{"kernelspec":{"display_name":"Scala","language":"scala","name":"scala"},"language_info":{"codemirror_mode":"text/x-scala","file_extension":".scala","mimetype":"","name":"Scala","nbconverter_exporter":"","version":"2.11.12"}},"nbformat_minor":2,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Naive Bayes with Weka\n\nIf you do machine learning on the JVM, you should not forget about the good old [Weka](https://www.cs.waikato.ac.nz/ml/weka/). It was designed primarily to be used through a Swing GUI, but it can also be used as an API. In terms of documentation, I can recommend [this manual](http://statweb.stanford.edu/~lpekelis/13_datafest_cart/WekaManual-3-7-8.pdf) and the [javadoc](http://weka.sourceforge.net/doc.dev/).\n\nIn this demo I use the NaiveBayesMultinomial classifier with the iris dataset, which is loaded directly from the Internet.\n\n## Setup\nWe add the necessary Maven dependency:","metadata":{}},{"cell_type":"code","source":"%%classpath add mvn \nnz.ac.waikato.cms.weka:weka-stable:3.8.3\n","metadata":{"trusted":true},"execution_count":1,"outputs":[{"output_type":"display_data","data":{"application/vnd.jupyter.widget-view+json":{"model_id":"","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{}},{"output_type":"display_data","data":{"application/vnd.jupyter.widget-view+json":{"model_id":"293368ef-75ac-486c-95f6-b2d65f20ae4f","version_major":2,"version_minor":0},"method":"display_data"},"metadata":{}}]},{"cell_type":"markdown","source":"## Data Preparation\nWe can use the CSVLoader to load the data from the Internet. We then set the index of the class (category) attribute, which holds the classification result. 
\n\nFinally we shuffle the data. The dataset has 150 records:","metadata":{}},{"cell_type":"code","source":"import weka.core.converters.CSVLoader;\nimport java.net.URL\n\nvar url = new URL(\"https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv\")\nvar loader = new CSVLoader();\nloader.setSource(url.openStream);\nvar dataSet = loader.getDataSet();\ndataSet.setClassIndex(4);\n\n// Shuffle in place with Weka's randomize; resample would draw with replacement and duplicate records\ndataSet.randomize(new java.util.Random())\n\ndataSet.size","metadata":{"trusted":true},"execution_count":30,"outputs":[{"execution_count":30,"output_type":"execute_result","data":{"text/plain":"150"},"metadata":{}}]},{"cell_type":"code","source":"dataSet.getClass.getName","metadata":{"trusted":true},"execution_count":31,"outputs":[{"execution_count":31,"output_type":"execute_result","data":{"text/plain":"weka.core.Instances"},"metadata":{}}]},{"cell_type":"markdown","source":"Just to double-check the data, we show the first 10 records:","metadata":{}},{"cell_type":"code","source":"import scala.collection.JavaConversions._\n\ndataSet.subList(0,10).foreach(println(_))","metadata":{"trusted":true},"execution_count":32,"outputs":[{"name":"stdout","text":"5,3.4,1.5,0.2,Setosa\n6.7,3,5,1.7,Versicolor\n6.2,2.2,4.5,1.5,Versicolor\n4.9,2.4,3.3,1,Versicolor\n5.7,3.8,1.7,0.3,Setosa\n6.7,3.3,5.7,2.1,Virginica\n6.7,3.1,4.7,1.5,Versicolor\n5.5,2.4,3.7,1,Versicolor\n5.5,2.6,4.4,1.2,Versicolor\n5,3.6,1.4,0.2,Setosa\n","output_type":"stream"},{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"markdown","source":"We want to split the data into a training and a testing dataset. 
We will use 90% of the data for training.\nSo we calculate the number of training records:","metadata":{}},{"cell_type":"code","source":"(dataSet.size * 0.9).toInt","metadata":{"trusted":true},"execution_count":33,"outputs":[{"execution_count":33,"output_type":"execute_result","data":{"text/plain":"135"},"metadata":{}}]},{"cell_type":"markdown","source":"With this we can split the data into two new Instances objects:","metadata":{}},{"cell_type":"code","source":"import weka.core.Instances;\n\nvar trainingDataSet = new Instances(dataSet,0,135)\nvar testingDataSet = new Instances(dataSet,135,15)\n\ndataSet.size + \" = \" +trainingDataSet.size + \" / \" + testingDataSet.size","metadata":{"trusted":true},"execution_count":34,"outputs":[{"execution_count":34,"output_type":"execute_result","data":{"text/plain":"150 = 135 / 15"},"metadata":{}}]},{"cell_type":"markdown","source":"## Defining and Training the Classifier\n\nWe create a new NaiveBayesMultinomial object and train it by calling the buildClassifier method with the training data.\nPrinting the classifier shows some basic information about the trained model.","metadata":{}},{"cell_type":"code","source":"import weka.classifiers.bayes.NaiveBayesMultinomial;\n\nvar classifier = new NaiveBayesMultinomial()\nclassifier.buildClassifier(trainingDataSet)\n\nclassifier","metadata":{"trusted":true},"execution_count":35,"outputs":[{"execution_count":35,"output_type":"execute_result","data":{"text/plain":"The independent probability of a class\n--------------------------------------\nSetosa\t0.36\nVersicolor\t0.33\nVirginica\t0.31\n\nThe probability of a word given the class\n-----------------------------------------\n\tSetosa\tVersicolor\tVirginica\t\nsepal.length\t0.49\t0.41\t0.38\t\nsepal.width\t0.34\t0.19\t0.18\t\npetal.length\t0.14\t0.3\t0.32\t\npetal.width\t0.03\t0.09\t0.12\t\n"},"metadata":{}}]},{"cell_type":"markdown","source":"## Evaluate the Model with the Test Data\nFinally we check how well our classifier performs by testing it with our test 
data","metadata":{}},{"cell_type":"code","source":"import weka.classifiers.Evaluation;\n\nvar eval = new Evaluation(trainingDataSet)\neval.evaluateModel(classifier, testingDataSet)\neval.toSummaryString()\n","metadata":{"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"\nCorrectly Classified Instances          15              100      %\nIncorrectly Classified Instances         0                0      %\nKappa statistic                          1     \nMean absolute error                      0.2738\nRoot mean squared error                  0.3341\nRelative absolute error                 61.5984 %\nRoot relative squared error             70.7939 %\nTotal Number of Instances               15     \n"},"metadata":{}}]},{"cell_type":"markdown","source":"## Predicting Data\nFinally, I demonstrate how to classify new data, because this is a little bit tricky.\nYou need to create a new DenseInstance with the same number of attributes as the original dataset and populate the numerical input values.\n\nThe prediction instance needs to be assigned a dataset! 
We use the original dataset in our example, but we could also use the training set.\n\nThen we use the classifyInstance method of the classifier to predict the numerical class index, which is converted back to a String with the value method of the class attribute:\n","metadata":{}},{"cell_type":"code","source":"import weka.core.DenseInstance;\n\nfor (rec <- testingDataSet) {\n    print(rec)\n    var predict = new DenseInstance(dataSet.numAttributes());\n    predict.setValue(0,rec.value(0))\n    predict.setValue(1,rec.value(1))\n    predict.setValue(2,rec.value(2))\n    predict.setValue(3,rec.value(3))\n    predict.setDataset(dataSet); \n    \n    var index = classifier.classifyInstance(predict).toInt;\n    var className = trainingDataSet.attribute(4).value(index);\n    println(\" -> \" +className)\n}","metadata":{"trusted":true},"execution_count":18,"outputs":[{"name":"stdout","output_type":"stream","text":"5.7,2.8,4.1,1.3,Versicolor -> Virginica\n5.9,3.2,4.8,1.8,Versicolor -> Virginica\n6.5,3,5.2,2,Virginica -> Virginica\n6,2.2,4,1,Versicolor -> Virginica\n6.2,2.8,4.8,1.8,Virginica -> Virginica\n6.7,2.5,5.8,1.8,Virginica -> Virginica\n6,2.7,5.1,1.6,Versicolor -> Virginica\n4.6,3.2,1.4,0.2,Setosa -> Setosa\n5,3,1.6,0.2,Setosa -> Setosa\n5.6,3,4.1,1.3,Versicolor -> Virginica\n6.5,3,5.5,1.8,Virginica -> Virginica\n7.7,3.8,6.7,2.2,Virginica -> Virginica\n6.7,3,5.2,2.3,Virginica -> Virginica\n4.3,3,1.1,0.1,Setosa -> Setosa\n6.4,3.1,5.5,1.8,Virginica -> Virginica\n"},{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"markdown","source":"Finally, just for our peace of mind, we double-check that we didn't change our original dataSet:","metadata":{}},{"cell_type":"code","source":"dataSet.size","metadata":{"trusted":true},"execution_count":19,"outputs":[{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"150"},"metadata":{}}]}]}