@pschatzmann
Created November 5, 2018 17:43
# H2O Sparkling Water - Distributed Random Forest

In this document we demonstrate how [H2O](https://www.h2o.ai/) can be used with Scala & Spark to run a Distributed Random Forest classification.

H2O and the related documentation for Sparkling Water can be found at http://h2o-release.s3.amazonaws.com/h2o/rel-xia/1/index.html

We are using Jupyter with the BeakerX Scala kernel.

## Setup

We install the full Sparkling Water package, which also includes Spark, with the help of Maven:

```scala
%%classpath add mvn
ai.h2o:sparkling-water-package_2.11:2.3.17
```

In order to prevent subsequent runtime errors we must take care of the following:

- We did not install the Scala REPL, so we disable the H2O REPL as well.
- There are currently some class conflicts with Jetty, so we deactivate the REST API.

```scala
import water.H2O

H2O.ARGS.disable_web = true
System.setProperty("spark.ext.h2o.repl.enabled", "false")

System.getProperty("spark.ext.h2o.repl.enabled")
// false
```

Now we are ready to start Spark ...

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Iris NaiveBayes")
  .master("local")
  .config("spark.ui.enabled", "false")
  .getOrCreate()
```

... and we can create the related H2O context:

```scala
import org.apache.spark.h2o._

val h2oContext = H2OContext.getOrCreate(spark)
```

```
Sparkling Water Context:
 * H2O name: sparkling-water-beakerx_local-1541437783196
 * cluster size: 1
 * list of used nodes:
   (executorId, host, port)
   ------------------------
   (driver,928b6a866c15,54323)
   ------------------------
```

## Data Preparation

We start by loading iris.csv directly into an H2OFrame and double-check the data types: the variety column has been converted to an Enum!

```scala
import water.fvec.H2OFrame
import java.net.URL

val h2oFrame = new H2OFrame(new URL("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv").toURI)

h2oFrame.names
// [sepal.length, sepal.width, petal.length, petal.width, variety]
```

We rename the columns to camel case:

```scala
h2oFrame.rename(0, "sepalLength")
h2oFrame.rename(1, "sepalWidth")
h2oFrame.rename(2, "petalLength")
h2oFrame.rename(3, "petalWidth")

h2oFrame.names
// [sepalLength, sepalWidth, petalLength, petalWidth, variety]
```

```scala
h2oFrame.typesStr
// [Numeric, Numeric, Numeric, Numeric, Enum]
```

```scala
var domains = h2oFrame.vec("variety").domain
// [Setosa, Versicolor, Virginica]
```

We can display the data by converting it to a DataFrame:

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
val ds = h2oContext.asDataFrame(h2oFrame)(sqlContext).toDF

ds.show()
```

```
+-----------+----------+-----------+----------+-------+
|sepalLength|sepalWidth|petalLength|petalWidth|variety|
+-----------+----------+-----------+----------+-------+
|        5.1|       3.5|        1.4|       0.2| Setosa|
|        4.9|       3.0|        1.4|       0.2| Setosa|
|        4.7|       3.2|        1.3|       0.2| Setosa|
|        4.6|       3.1|        1.5|       0.2| Setosa|
|        5.0|       3.6|        1.4|       0.2| Setosa|
+-----------+----------+-----------+----------+-------+
only showing top 20 rows
```

Then we split the frame into a training and a test frame. The call to splitFrame also shuffles the data!

```scala
import water.support.H2OFrameSupport

val keys = Array("train.hex", "test.hex")
val ratios = Array(0.9, 0.1)

val Array(train, test) = H2OFrameSupport.splitFrame(h2oFrame, keys, ratios)
val h2oTraining = h2oContext.asH2OFrame(train)
val h2oTest = h2oContext.asH2OFrame(test)
```

```
Frame key: train.hex  cols: 5  rows: 131  chunks: 1  size: 3177
Frame key: test.hex   cols: 5  rows: 19   chunks: 1  size: 1625
```

## Training of the Classifier

We define the response column and provide both the training and the test frame, so that the system reports metrics for both datasets. We obtain the trained model by calling trainModel.get:

```scala
import hex.tree.drf.DRF
import hex.tree.drf.DRFModel
import h2oContext.implicits._

var parameters = new DRFModel.DRFParameters()
parameters._train = h2oTraining
parameters._valid = h2oTest
parameters._response_column = "variety"

var model = new DRF(parameters).trainModel.get
```

## Evaluation

The toString method of the model provides the evaluation of the training and test datasets:

```scala
model.toString
```

Key excerpts of the output (the per-tree scoring history is omitted here):

```
Model Metrics Type: Multinomial
 Description: Metrics reported on Out-Of-Bag training samples
 frame id: train.hex
 MSE: 0.04673963  RMSE: 0.2161935  logloss: 0.39659226
 mean_per_class_error: 0.06150583
 CM: Confusion Matrix (Row labels: Actual class; Column labels: Predicted class):
            Setosa Versicolor Virginica  Error     Rate
    Setosa      44          0         0 0.0000   0 / 44
Versicolor       0         37         4 0.0976   4 / 41
 Virginica       0          4        42 0.0870   4 / 46
    Totals      44         41        46 0.0611  8 / 131

Model Metrics Type: Multinomial
 frame id: test.hex
 MSE: 0.009293742  RMSE: 0.09640405  logloss: 0.059128653
 mean_per_class_error: 0.0
 CM: Confusion Matrix (Row labels: Actual class; Column labels: Predicted class):
            Setosa Versicolor Virginica  Error    Rate
    Setosa       6          0         0 0.0000   0 / 6
Versicolor       0          9         0 0.0000   0 / 9
 Virginica       0          0         4 0.0000   0 / 4
    Totals       6          9         4 0.0000  0 / 19

Variable Importances:
    Variable Relative Importance Scaled Importance Percentage
 petalLength         1795.945801          1.000000   0.474394
  petalWidth         1578.936157          0.879167   0.417072
 sepalLength          327.999115          0.182633   0.086640
  sepalWidth           82.884659          0.046151   0.021894

Model Summary:
 Number of Trees: 50  Number of Internal Trees: 150  Model Size in Bytes: 19906
 Min. Depth: 1  Max. Depth: 8  Mean Depth: 3.62667
 Min. Leaves: 2  Max. Leaves: 12  Mean Leaves: 5.91333
```

## Prediction

We display the input data. We remove the variety column to make sure that the prediction works with only the 4 input columns:

```scala
import org.apache.spark.sql.SaveMode

h2oTest.remove("variety")

var predictionDF = h2oContext.asDataFrame(h2oTest)(sqlContext).toDF

predictionDF.write.format("csv").option("header", "true").mode(SaveMode.Overwrite).save("prediction.csv")
predictionDF.show()
```

```
+-----------+----------+-----------+----------+
|sepalLength|sepalWidth|petalLength|petalWidth|
+-----------+----------+-----------+----------+
|        5.0|       3.4|        1.5|       0.2|
|        4.9|       3.1|        1.5|       0.1|
|        5.7|       4.4|        1.5|       0.4|
...
|        6.2|       3.4|        5.4|       2.3|
+-----------+----------+-----------+----------+
```

We can execute a prediction by calling the score method on the model:

```scala
val predictionResult = model.score(h2oTest)
```

Finally we convert the result to a Spark DataFrame so that we can display it:

```scala
val df1 = h2oContext.asDataFrame(predictionResult)(sqlContext).toDF

df1.show()
```

```
+----------+--------------------+-------------------+--------------------+
|   predict|              Setosa|         Versicolor|           Virginica|
+----------+--------------------+-------------------+--------------------+
|    Setosa|  0.9975669100660882|                0.0|0.002433089933911768|
|    Setosa|  0.9975669100660882|                0.0|0.002433089933911768|
|Versicolor|0.001875732701114...| 0.7561547479537011| 0.24196951934518435|
|Versicolor|                 0.0| 0.8972644377915675| 0.10273556220843251|
| Virginica|0.001863209378640...|                0.0|  0.9981367906213595|
| Virginica|0.001774360491032...|0.09537187674828396|  0.9028537627606835|
+----------+--------------------+-------------------+--------------------+
```

## Saving and Loading Binary Models

```scala
import water.support.ModelSerializationSupport

ModelSerializationSupport.exportH2OModel(model, "model.bin", true)
var modelLoaded: DRFModel = ModelSerializationSupport.loadH2OModel("model.bin")

modelLoaded.getClass
// class hex.tree.drf.DRFModel
```

Scoring with the reloaded model yields the same predictions as above:

```scala
val predictionResult = modelLoaded.score(h2oTest)
val df1 = h2oContext.asDataFrame(predictionResult)(sqlContext).toDF

df1.show()
```

## Saving and Loading MOJO Models in Spark

```scala
import org.apache.spark.ml.h2o.models._

ModelSerializationSupport.exportMOJOModel(model, "model.mojo", true)
val mojoModelLoaded = H2OMOJOModel.createFromMojo("model.mojo")

mojoModelLoaded.getClass
// class org.apache.spark.ml.h2o.models.H2OMOJOModel
```

```scala
import org.apache.spark.sql.functions._

val df = mojoModelLoaded.transform(predictionDF)
  .withColumn("probabilities", expr("prediction_output.probabilities"))
  .drop("prediction_output")

df.show
```

```
+-----------+----------+-----------+----------+--------------------+
|sepalLength|sepalWidth|petalLength|petalWidth|       probabilities|
+-----------+----------+-----------+----------+--------------------+
|        5.0|       3.4|        1.5|       0.2|[0.99756691006608...|
|        6.9|       3.1|        4.9|       1.5|[0.00187573270111...|
|        6.5|       3.2|        5.1|       2.0|[0.00186320937864...|
+-----------+----------+-----------+----------+--------------------+
```

To translate the probability vector back into a class label, we build a lookup table from the variety domain:

```scala
import spark.implicits._

var domains = h2oFrame.vec("variety").domain.toSeq
println(domains)

var domainsDF = spark.sparkContext.parallelize(domains).zipWithIndex.toDF("variety", "index")
domainsDF.show
```

```
WrappedArray(Setosa, Versicolor, Virginica)
+----------+-----+
|   variety|index|
+----------+-----+
|    Setosa|    0|
|Versicolor|    1|
| Virginica|    2|
+----------+-----+
```

```scala
def maxIndex: (collection.mutable.WrappedArray[Double] => Int) = { array => array.indexOf(array.max) }
spark.udf.register("maxIndex", maxIndex)

var predictionResultDF = df.select(df.col("*"), callUDF("maxIndex", df.col("probabilities")).name("index"))
  .join(domainsDF, Seq("index"))

predictionResultDF.show
```

```
+-----+-----------+----------+-----------+----------+--------------------+----------+
|index|sepalLength|sepalWidth|petalLength|petalWidth|       probabilities|   variety|
+-----+-----------+----------+-----------+----------+--------------------+----------+
|    0|        5.0|       3.4|        1.5|       0.2|[0.99756691006608...|    Setosa|
|    1|        6.9|       3.1|        4.9|       1.5|[0.00187573270111...|Versicolor|
|    2|        6.5|       3.2|        5.1|       2.0|[0.00186320937864...| Virginica|
+-----+-----------+----------+-----------+----------+--------------------+----------+
```

## Deployment of Standalone Functionality To Production

The MOJO model can be deployed to production without access to a running H2O or Spark instance. All you need to do is add the following dependency:

```xml
<dependency>
  <groupId>ai.h2o</groupId>
  <artifactId>h2o-genmodel</artifactId>
  <version>3.10.4.2</version>
</dependency>
```

Further details can be found at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html

We read back the test data that we saved earlier:

```scala
val csv = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("prediction.csv/*.csv")

csv.show
```

```scala
import _root_.hex.genmodel.GenModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.easy.prediction
import _root_.hex.genmodel.MojoModel
```

```scala
def predict(easyModel: EasyPredictModelWrapper, values: Array[Double]): String = {
  var row = new RowData()
  row.put("sepalLength", values(0).toString)
  row.put("sepalWidth", values(1).toString)
  row.put("petalLength", values(2).toString)
  row.put("petalWidth", values(3).toString)

  var p = easyModel.predictMultinomial(row)
  p.label
}

var easyModel = new EasyPredictModelWrapper(MojoModel.load("model.mojo"))
predict(easyModel, Array(5.0, 3.4, 1.4, 0.2))
// Setosa
```

```scala
csv.collect
  .map(line => Array(line.getDouble(0), line.getDouble(1), line.getDouble(2), line.getDouble(3)))
  .foreach(a => println(s"${a(0)},${a(1)},${a(2)},${a(3)} => ${predict(easyModel, a)}"))
```

```
5.0,3.4,1.5,0.2 => Setosa
4.9,3.1,1.5,0.1 => Setosa
5.7,4.4,1.5,0.4 => Setosa
5.4,3.4,1.7,0.2 => Setosa
4.8,3.1,1.6,0.2 => Setosa
5.0,3.5,1.3,0.3 => Setosa
6.9,3.1,4.9,1.5 => Versicolor
4.9,2.4,3.3,1.0 => Versicolor
5.0,2.0,3.5,1.0 => Versicolor
6.2,2.2,4.5,1.5 => Versicolor
5.6,2.5,3.9,1.1 => Versicolor
6.6,3.0,4.4,1.4 => Versicolor
6.8,2.8,4.8,1.4 => Versicolor
5.5,2.6,4.4,1.2 => Versicolor
5.6,2.7,4.2,1.3 => Versicolor
6.5,3.2,5.1,2.0 => Virginica
6.1,3.0,4.9,1.8 => Virginica
6.7,3.1,5.6,2.4 => Virginica
6.2,3.4,5.4,2.3 => Virginica
```
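The standalone prediction returns only the winning label, but the easy-predict API also exposes the per-class probabilities: `MultinomialModelPrediction.classProbabilities` is aligned with the response domain returned by `EasyPredictModelWrapper.getResponseDomainValues`. The following sketch (the helper name `predictWithProbabilities` is our own; it assumes the same `model.mojo` file and the h2o-genmodel dependency shown above) pairs the two up:

```scala
import _root_.hex.genmodel.MojoModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Hypothetical helper: returns the predicted label together with a
// probability per class instead of just the label.
def predictWithProbabilities(easyModel: EasyPredictModelWrapper,
                             values: Array[Double]): (String, Map[String, Double]) = {
  val row = new RowData()
  row.put("sepalLength", values(0).toString)
  row.put("sepalWidth", values(1).toString)
  row.put("petalLength", values(2).toString)
  row.put("petalWidth", values(3).toString)

  val p = easyModel.predictMultinomial(row)
  // classProbabilities has one entry per level of the response domain,
  // in the same order (Setosa, Versicolor, Virginica).
  val probs = easyModel.getResponseDomainValues.zip(p.classProbabilities).toMap
  (p.label, probs)
}

val easyModel = new EasyPredictModelWrapper(MojoModel.load("model.mojo"))
val (label, probs) = predictWithProbabilities(easyModel, Array(5.0, 3.4, 1.4, 0.2))
println(s"$label: $probs")
```

This is handy when the deployed service should report a confidence alongside the label, e.g. to route low-confidence predictions to a manual review step.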