Created November 5, 2018 17:43
Gist: pschatzmann/b3d2035a20e07d09ff21d92e6e468c4b
{"metadata":{"kernelspec":{"display_name":"Scala","language":"scala","name":"scala"},"language_info":{"codemirror_mode":"text/x-scala","file_extension":".scala","mimetype":"","name":"Scala","nbconverter_exporter":"","version":"2.11.12"}},"nbformat_minor":2,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# H2O Sparkling Water - Distributed Random Forest\n\nIn this document we demonstrate how [H2O](https://www.h2o.ai/) can be used with Scala & Spark to run a 'Distributed Random Forest' classification.\n\nH2O and the related documentation for Sparkling Water can be found at http://h2o-release.s3.amazonaws.com/h2o/rel-xia/1/index.html\n\nWe are using Jupyter with the BeakerX Scala kernel.\n\n## Setup\nWe install the full Sparkling Water package, which also includes Spark, with the help of Maven.","metadata":{}},{"cell_type":"code","source":"%%classpath add mvn \nai.h2o:sparkling-water-package_2.11:2.3.17\n","metadata":{"trusted":true},"execution_count":1,"outputs":[{"output_type":"display_data","data":{"method":"display_data","application/vnd.jupyter.widget-view+json":{"version_minor":0,"model_id":"","version_major":2}},"metadata":{}},{"output_type":"display_data","data":{"method":"display_data","application/vnd.jupyter.widget-view+json":{"version_minor":0,"model_id":"995aa09b-06ff-4cd4-8310-4e42d175fef1","version_major":2}},"metadata":{}}]},{"cell_type":"markdown","source":"In order to prevent subsequent runtime errors we must take care of the following:\n- We did not install the Scala REPL, so we want to disable the H2O REPL as well. 
\n- There are currently some class conflicts with jetty and we therefore deactivate the REST API ","metadata":{}},{"cell_type":"code","source":"import water.H2O\n\nH2O.ARGS.disable_web = true\nSystem.setProperty(\"spark.ext.h2o.repl.enabled\",\"false\")\n\nSystem.getProperty(\"spark.ext.h2o.repl.enabled\")","metadata":{"trusted":true},"execution_count":2,"outputs":[{"execution_count":2,"output_type":"execute_result","data":{"text/plain":"false"},"metadata":{}}]},{"cell_type":"markdown","source":"Now we are ready to start Spark ...","metadata":{}},{"cell_type":"code","source":"import org.apache.spark.sql.SparkSession\n\nval spark = SparkSession.builder()\n .appName(\"Iris NaiveBayes\")\n .master(\"local\")\n .config(\"spark.ui.enabled\", \"false\")\n .getOrCreate()\n\n","metadata":{"trusted":true},"execution_count":3,"outputs":[{"execution_count":3,"output_type":"execute_result","data":{"text/plain":"org.apache.spark.sql.SparkSession@1f1d9b1a"},"metadata":{}}]},{"cell_type":"markdown","source":"... 
and we can create the related H2O Context","metadata":{}},{"cell_type":"code","source":"import org.apache.spark.h2o._\n\nval h2oContext = H2OContext.getOrCreate(spark)\n","metadata":{"trusted":true},"execution_count":4,"outputs":[{"execution_count":4,"output_type":"execute_result","data":{"text/plain":"\nSparkling Water Context:\n * H2O name: sparkling-water-beakerx_local-1541437783196\n * cluster size: 1\n * list of used nodes:\n (executorId, host, port)\n ------------------------\n (driver,928b6a866c15,54323)\n ------------------------\n\n Open H2O Flow in browser: http://:54323 (CMD + click in Mac OSX)\n\n "},"metadata":{}}]},{"cell_type":"markdown","source":"## Data Preparation\nWe start by loading the iris.csv directly into a H2OFrame and double check the data types: The variety has been converted to an Enum!","metadata":{}},{"cell_type":"code","source":"import water.fvec.H2OFrame\nimport java.net.URL\n\nval h2oFrame = new H2OFrame(new URL(\"https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv\").toURI)\n\nh2oFrame.names","metadata":{"trusted":true},"execution_count":5,"outputs":[{"execution_count":5,"output_type":"execute_result","data":{"text/plain":"[sepal.length, sepal.width, petal.length, petal.width, variety]"},"metadata":{}}]},{"cell_type":"code","source":"h2oFrame.rename(0,\"sepalLength\")\nh2oFrame.rename(1,\"sepalWidth\")\nh2oFrame.rename(2,\"petalLenth\")\nh2oFrame.rename(3,\"petalWidth\")\n\nh2oFrame.names","metadata":{"trusted":true},"execution_count":6,"outputs":[{"execution_count":6,"output_type":"execute_result","data":{"text/plain":"[sepalLength, sepalWidth, petalLenth, petalWidth, variety]"},"metadata":{}}]},{"cell_type":"code","source":"h2oFrame.typesStr","metadata":{"trusted":true},"execution_count":7,"outputs":[{"execution_count":7,"output_type":"execute_result","data":{"text/plain":"[Numeric, Numeric, Numeric, Numeric, Enum]"},"metadata":{}}]},{"cell_type":"code","source":"var domains = 
h2oFrame.vec(\"variety\").domain","metadata":{"trusted":true},"execution_count":8,"outputs":[{"execution_count":8,"output_type":"execute_result","data":{"text/plain":"[Setosa, Versicolor, Virginica]"},"metadata":{}}]},{"cell_type":"markdown","source":"We can display the data by converting it to a DataFrame:","metadata":{}},{"cell_type":"code","source":"val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)\nval ds = h2oContext.asDataFrame(h2oFrame)(sqlContext).toDF\n\nds.show()","metadata":{"trusted":true},"execution_count":9,"outputs":[{"name":"stdout","text":"+-----------+----------+----------+----------+-------+\n|sepalLength|sepalWidth|petalLenth|petalWidth|variety|\n+-----------+----------+----------+----------+-------+\n| 5.1| 3.5| 1.4| 0.2| Setosa|\n| 4.9| 3.0| 1.4| 0.2| Setosa|\n| 4.7| 3.2| 1.3| 0.2| Setosa|\n| 4.6| 3.1| 1.5| 0.2| Setosa|\n| 5.0| 3.6| 1.4| 0.2| Setosa|\n| 5.4| 3.9| 1.7| 0.4| Setosa|\n| 4.6| 3.4| 1.4| 0.3| Setosa|\n| 5.0| 3.4| 1.5| 0.2| Setosa|\n| 4.4| 2.9| 1.4| 0.2| Setosa|\n| 4.9| 3.1| 1.5| 0.1| Setosa|\n| 5.4| 3.7| 1.5| 0.2| Setosa|\n| 4.8| 3.4| 1.6| 0.2| Setosa|\n| 4.8| 3.0| 1.4| 0.1| Setosa|\n| 4.3| 3.0| 1.1| 0.1| Setosa|\n| 5.8| 4.0| 1.2| 0.2| Setosa|\n| 5.7| 4.4| 1.5| 0.4| Setosa|\n| 5.4| 3.9| 1.3| 0.4| Setosa|\n| 5.1| 3.5| 1.4| 0.3| Setosa|\n| 5.7| 3.8| 1.7| 0.3| Setosa|\n| 5.1| 3.8| 1.5| 0.3| Setosa|\n+-----------+----------+----------+----------+-------+\nonly showing top 20 rows\n\n","output_type":"stream"},{"execution_count":9,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"markdown","source":"Then we split the frame into a training and a test frame. 
The call to splitFrame also shuffles the data!","metadata":{}},{"cell_type":"code","source":"import water.support.H2OFrameSupport\n\nval keys = Array(\"train.hex\", \"test.hex\")\nval ratios = Array(0.9, 0.1)\n\nval Array(train, test) = H2OFrameSupport.splitFrame(h2oFrame,keys, ratios)\nval h2oTraining = h2oContext.asH2OFrame(train)\nval h2oTest = h2oContext.asH2OFrame(test)\n\nh2oTraining+\"-----------\\n\"+ h2oTest","metadata":{"trusted":true},"execution_count":10,"outputs":[{"execution_count":10,"output_type":"execute_result","data":{"text/plain":"Frame key: train.hex\n cols: 5\n rows: 131\n chunks: 1\n size: 3177\n-----------\nFrame key: test.hex\n cols: 5\n rows: 19\n chunks: 1\n size: 1625\n"},"metadata":{}}]},{"cell_type":"markdown","source":"## Training of the Classifier\nWe define the response column and provide both the training and the test data so that the system reports metrics for both datasets.\n\nWe get the trained model by calling trainModel.get","metadata":{}},{"cell_type":"code","source":"import hex.tree.drf.DRF\nimport hex.tree.drf.DRFModel\nimport h2oContext.implicits._\n\nvar parameters = new DRFModel.DRFParameters()\nparameters._train = h2oTraining\nparameters._valid = h2oTest\nparameters._response_column = \"variety\"\n\nvar model = new DRF(parameters).trainModel.get\n","metadata":{"trusted":true},"execution_count":11,"outputs":[{"execution_count":11,"output_type":"execute_result","data":{"text/plain":"org.apache.spark.h2o.H2OContext$implicits$@5a3d5b41"},"metadata":{}}]},{"cell_type":"markdown","source":"## Evaluation\n","metadata":{}},{"cell_type":"markdown","source":"The toString method of the model provides the evaluation of the training and test datasets:","metadata":{}},{"cell_type":"code","source":"model.toString","metadata":{"trusted":true},"execution_count":12,"outputs":[{"execution_count":12,"output_type":"execute_result","data":{"text/plain":"Model Metrics Type: Multinomial\n Description: Metrics reported on 
Out-Of-Bag training samples\n model id: DRF_model_1541437780529_1\n frame id: train.hex\n MSE: 0.04673963\n RMSE: 0.2161935\n logloss: 0.39659226\n mean_per_class_error: 0.06150583\n hit ratios: [0.9389313, 1.0, 1.0]\n CM: Confusion Matrix (Row labels: Actual class; Column labels: Predicted class):\n Setosa Versicolor Virginica Error Rate\n Setosa 44 0 0 0.0000 0 / 44\nVersicolor 0 37 4 0.0976 4 / 41\n Virginica 0 4 42 0.0870 4 / 46\n Totals 44 41 46 0.0611 8 / 131\nModel Metrics Type: Multinomial\n Description: N/A\n model id: DRF_model_1541437780529_1\n frame id: test.hex\n MSE: 0.009293742\n RMSE: 0.09640405\n logloss: 0.059128653\n mean_per_class_error: 0.0\n hit ratios: [1.0, 1.0, 1.0]\n CM: Confusion Matrix (Row labels: Actual class; Column labels: Predicted class):\n Setosa Versicolor Virginica Error Rate\n Setosa 6 0 0 0.0000 0 / 6\nVersicolor 0 9 0 0.0000 0 / 9\n Virginica 0 0 4 0.0000 0 / 4\n Totals 6 9 4 0.0000 0 / 19\nVariable Importances:\n Variable Relative Importance Scaled Importance Percentage\n petalLenth 1795.945801 1.000000 0.474394\n petalWidth 1578.936157 0.879167 0.417072\nsepalLength 327.999115 0.182633 0.086640\n sepalWidth 82.884659 0.046151 0.021894\nModel Summary:\n Number of Trees Number of Internal Trees Model Size in Bytes Min. Depth Max. Depth Mean Depth Min. Leaves Max. 
Leaves Mean Leaves\n 50 150 19906 1 8 3.62667 2 12 5.91333\nScoring History:\n Timestamp Duration Number of Trees Training RMSE Training LogLoss Training Classification Error Validation RMSE Validation LogLoss Validation Classification Error\n 2018-11-05 18:09:55 0.063 sec 0 NaN NaN NaN NaN NaN NaN\n 2018-11-05 18:09:55 0.316 sec 1 0.16366 0.64152 0.03571 0.22942 1.81783 0.05263\n 2018-11-05 18:09:55 0.367 sec 2 0.21490 1.31760 0.06250 0.07647 0.02134 0.00000\n 2018-11-05 18:09:55 0.395 sec 3 0.21937 1.40235 0.06000 0.05735 0.01514 0.00000\n 2018-11-05 18:09:55 0.416 sec 4 0.21750 1.27528 0.05405 0.04588 0.01174 0.00000\n 2018-11-05 18:09:55 0.435 sec 5 0.25178 1.52270 0.07627 0.08550 0.03094 0.00000\n 2018-11-05 18:09:55 0.450 sec 6 0.24614 1.49340 0.06667 0.07328 0.02582 0.00000\n 2018-11-05 18:09:55 0.471 sec 7 0.22902 0.92432 0.06400 0.08767 0.03961 0.00000\n 2018-11-05 18:09:55 0.488 sec 8 0.22442 0.90776 0.06299 0.07795 0.03560 0.00000\n 2018-11-05 18:09:55 0.509 sec 9 0.22097 0.89179 0.06202 0.07976 0.03716 0.00000\n---\n 2018-11-05 18:09:55 0.912 sec 41 0.21637 0.39776 0.06107 0.09405 0.05657 0.00000\n 2018-11-05 18:09:55 0.922 sec 42 0.21665 0.39893 0.06107 0.09180 0.05507 0.00000\n 2018-11-05 18:09:55 0.931 sec 43 0.21674 0.39871 0.06107 0.09459 0.05633 0.00000\n 2018-11-05 18:09:55 0.968 sec 44 0.21532 0.39679 0.06107 0.09285 0.05539 0.00000\n 2018-11-05 18:09:55 0.975 sec 45 0.21595 0.39729 0.06107 0.09262 0.05581 0.00000\n 2018-11-05 18:09:55 0.984 sec 46 0.21673 0.39842 0.06107 0.09600 0.05824 0.00000\n 2018-11-05 18:09:55 0.992 sec 47 0.21713 0.39887 0.06107 0.09899 0.05947 0.00000\n 2018-11-05 18:09:55 1.000 sec 48 0.21770 0.39978 0.06107 0.09932 0.06035 0.00000\n 2018-11-05 18:09:55 1.010 sec 49 0.21687 0.39822 0.06107 0.09737 0.05916 0.00000\n 2018-11-05 18:09:55 1.018 sec 50 0.21619 0.39659 0.06107 0.09640 0.05913 0.00000\n"},"metadata":{}}]},{"cell_type":"markdown","source":"## Prediction\nWe display the input data. 
We remove the variety column to make sure that the prediction works by providing only the 4 input columns:","metadata":{}},{"cell_type":"code","source":"import org.apache.spark.sql.SaveMode\n\nh2oTest.remove(\"variety\")\n\nvar predictionDF = h2oContext.asDataFrame(h2oTest)(sqlContext).toDF\n\npredictionDF.write.format(\"csv\").option(\"header\", \"true\").mode(SaveMode.Overwrite).save(\"prediction.csv\")\npredictionDF.show()\n","metadata":{"trusted":true},"execution_count":13,"outputs":[{"name":"stdout","text":"+-----------+----------+----------+----------+\n|sepalLength|sepalWidth|petalLenth|petalWidth|\n+-----------+----------+----------+----------+\n| 5.0| 3.4| 1.5| 0.2|\n| 4.9| 3.1| 1.5| 0.1|\n| 5.7| 4.4| 1.5| 0.4|\n| 5.4| 3.4| 1.7| 0.2|\n| 4.8| 3.1| 1.6| 0.2|\n| 5.0| 3.5| 1.3| 0.3|\n| 6.9| 3.1| 4.9| 1.5|\n| 4.9| 2.4| 3.3| 1.0|\n| 5.0| 2.0| 3.5| 1.0|\n| 6.2| 2.2| 4.5| 1.5|\n| 5.6| 2.5| 3.9| 1.1|\n| 6.6| 3.0| 4.4| 1.4|\n| 6.8| 2.8| 4.8| 1.4|\n| 5.5| 2.6| 4.4| 1.2|\n| 5.6| 2.7| 4.2| 1.3|\n| 6.5| 3.2| 5.1| 2.0|\n| 6.1| 3.0| 4.9| 1.8|\n| 6.7| 3.1| 5.6| 2.4|\n| 6.2| 3.4| 5.4| 2.3|\n+-----------+----------+----------+----------+\n\n","output_type":"stream"},{"execution_count":13,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"markdown","source":"We can execute a prediction by calling the score method on the model.","metadata":{}},{"cell_type":"code","source":"val predictionResult = model.score(h2oTest)\n","metadata":{"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":"Frame key: _bbef6b85a84cad502123f5977b4548d1\n cols: 4\n rows: 19\n chunks: 1\n size: 1371\n"},"metadata":{}}]},{"cell_type":"markdown","source":"Finally we convert the result to a Spark Dataset so that we can display it:","metadata":{}},{"cell_type":"code","source":"val df1 = 
h2oContext.asDataFrame(predictionResult)(sqlContext).toDF\n\ndf1.show()","metadata":{"trusted":true},"execution_count":15,"outputs":[{"name":"stdout","text":"+----------+--------------------+-------------------+--------------------+\n| predict| Setosa| Versicolor| Virginica|\n+----------+--------------------+-------------------+--------------------+\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.997470677599468| 0.0|0.002529322400532...|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n|Versicolor|0.001875732701114...| 0.7561547479537011| 0.24196951934518435|\n|Versicolor| 0.0| 0.8972644377915675| 0.10273556220843251|\n|Versicolor| 0.0| 0.9529729730689217| 0.04702702693107835|\n|Versicolor|0.001654777154018...| 0.8182873057105341| 0.1800579171354474|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n|Versicolor|0.001820664535787...| 0.7633136094726328| 0.23486572599157932|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n| Virginica|0.001863209378640...| 0.0| 0.9981367906213595|\n| Virginica|0.001774360491032...|0.09537187674828396| 0.9028537627606835|\n| Virginica|0.001863209378640...| 0.0| 0.9981367906213595|\n| Virginica|0.001725005390582263|0.07417523207136159| 0.9240997625380561|\n+----------+--------------------+-------------------+--------------------+\n\n","output_type":"stream"},{"execution_count":15,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"markdown","source":"## Saving and Loading Binary Models","metadata":{}},{"cell_type":"code","source":"import 
water.support.ModelSerializationSupport\n\nModelSerializationSupport.exportH2OModel(model,\"model.bin\",true)\nvar modelLoaded:DRFModel = ModelSerializationSupport.loadH2OModel(\"model.bin\")\n\nmodelLoaded.getClass","metadata":{"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":"class hex.tree.drf.DRFModel"},"metadata":{}}]},{"cell_type":"code","source":"val predictionResult = model.score(h2oTest)\nval df1 = h2oContext.asDataFrame(predictionResult)(sqlContext).toDF\n\ndf1.show()","metadata":{"trusted":true},"execution_count":17,"outputs":[{"name":"stdout","text":"+----------+--------------------+-------------------+--------------------+\n| predict| Setosa| Versicolor| Virginica|\n+----------+--------------------+-------------------+--------------------+\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.997470677599468| 0.0|0.002529322400532...|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n| Setosa| 0.9975669100660882| 0.0|0.002433089933911768|\n|Versicolor|0.001875732701114...| 0.7561547479537011| 0.24196951934518435|\n|Versicolor| 0.0| 0.8972644377915675| 0.10273556220843251|\n|Versicolor| 0.0| 0.9529729730689217| 0.04702702693107835|\n|Versicolor|0.001654777154018...| 0.8182873057105341| 0.1800579171354474|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n|Versicolor|0.001820664535787...| 0.7633136094726328| 0.23486572599157932|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n|Versicolor|0.001852500303911956| 0.9957189170620183|0.002428582634069...|\n| Virginica|0.001863209378640...| 0.0| 0.9981367906213595|\n| Virginica|0.001774360491032...|0.09537187674828396| 0.9028537627606835|\n| Virginica|0.001863209378640...| 
0.0| 0.9981367906213595|\n| Virginica|0.001725005390582263|0.07417523207136159| 0.9240997625380561|\n+----------+--------------------+-------------------+--------------------+\n\n","output_type":"stream"},{"execution_count":17,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"markdown","source":"## Saving and Loading Mojo Models in Spark","metadata":{}},{"cell_type":"code","source":"import org.apache.spark.ml.h2o.models._\n\nModelSerializationSupport.exportMOJOModel(model,\"model.mojo\",true)\nval mojoModelLoaded = H2OMOJOModel.createFromMojo(\"model.mojo\")\n\nmojoModelLoaded.getClass","metadata":{"trusted":true},"execution_count":18,"outputs":[{"execution_count":18,"output_type":"execute_result","data":{"text/plain":"class org.apache.spark.ml.h2o.models.H2OMOJOModel"},"metadata":{}}]},{"cell_type":"code","source":"import org.apache.spark.sql.functions._\n\nval df = mojoModelLoaded.transform(predictionDF)\n .withColumn(\"probabilities\", expr(\"prediction_output.probabilities\"))\n .drop(\"prediction_output\")\n\ndf.show","metadata":{"trusted":true},"execution_count":19,"outputs":[{"name":"stdout","text":"+-----------+----------+----------+----------+--------------------+\n|sepalLength|sepalWidth|petalLenth|petalWidth| probabilities|\n+-----------+----------+----------+----------+--------------------+\n| 5.0| 3.4| 1.5| 0.2|[0.99756691006608...|\n| 4.9| 3.1| 1.5| 0.1|[0.99756691006608...|\n| 5.7| 4.4| 1.5| 0.4|[0.99747067759946...|\n| 5.4| 3.4| 1.7| 0.2|[0.99756691006608...|\n| 4.8| 3.1| 1.6| 0.2|[0.99756691006608...|\n| 5.0| 3.5| 1.3| 0.3|[0.99756691006608...|\n| 6.9| 3.1| 4.9| 1.5|[0.00187573270111...|\n| 4.9| 2.4| 3.3| 1.0|[0.0, 0.897264437...|\n| 5.0| 2.0| 3.5| 1.0|[0.0, 0.952972973...|\n| 6.2| 2.2| 4.5| 1.5|[0.00165477715401...|\n| 5.6| 2.5| 3.9| 1.1|[0.00185250030391...|\n| 6.6| 3.0| 4.4| 1.4|[0.00185250030391...|\n| 6.8| 2.8| 4.8| 1.4|[0.00182066453578...|\n| 5.5| 2.6| 4.4| 1.2|[0.00185250030391...|\n| 5.6| 2.7| 4.2| 
1.3|[0.00185250030391...|\n| 6.5| 3.2| 5.1| 2.0|[0.00186320937864...|\n| 6.1| 3.0| 4.9| 1.8|[0.00177436049103...|\n| 6.7| 3.1| 5.6| 2.4|[0.00186320937864...|\n| 6.2| 3.4| 5.4| 2.3|[0.00172500539058...|\n+-----------+----------+----------+----------+--------------------+\n\n","output_type":"stream"},{"execution_count":19,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"code","source":"import spark.implicits._ \n\nvar domains = h2oFrame.vec(\"variety\").domain.toSeq\nprintln(domains)\n\nvar domainsDF = spark.sparkContext.parallelize(domains).zipWithIndex.toDF(\"variety\",\"index\")\ndomainsDF.show","metadata":{"trusted":true},"execution_count":20,"outputs":[{"name":"stdout","text":"WrappedArray(Setosa, Versicolor, Virginica)\n+----------+-----+\n| variety|index|\n+----------+-----+\n| Setosa| 0|\n|Versicolor| 1|\n| Virginica| 2|\n+----------+-----+\n\n","output_type":"stream"},{"execution_count":20,"output_type":"execute_result","data":{"text/plain":"org.apache.spark.sql.SparkSession$implicits$@3c57edc7"},"metadata":{}}]},{"cell_type":"code","source":"\ndef maxIndex: (collection.mutable.WrappedArray[Double] => Int) = { array => (array.indexOf(array.max)) }\nspark.udf.register(\"maxIndex\", maxIndex)\n\nvar predictionResultDF = df.select(df.col(\"*\"), callUDF(\"maxIndex\", df.col(\"probabilities\")).name(\"index\"))\n .join(domainsDF,Seq(\"index\"))\n\npredictionResultDF.show\n","metadata":{"trusted":true},"execution_count":21,"outputs":[{"name":"stdout","text":"+-----+-----------+----------+----------+----------+--------------------+----------+\n|index|sepalLength|sepalWidth|petalLenth|petalWidth| probabilities| variety|\n+-----+-----------+----------+----------+----------+--------------------+----------+\n| 0| 5.0| 3.4| 1.5| 0.2|[0.99756691006608...| Setosa|\n| 0| 4.9| 3.1| 1.5| 0.1|[0.99756691006608...| Setosa|\n| 0| 5.7| 4.4| 1.5| 0.4|[0.99747067759946...| Setosa|\n| 0| 5.4| 3.4| 1.7| 0.2|[0.99756691006608...| Setosa|\n| 
0| 4.8| 3.1| 1.6| 0.2|[0.99756691006608...| Setosa|\n| 0| 5.0| 3.5| 1.3| 0.3|[0.99756691006608...| Setosa|\n| 1| 6.9| 3.1| 4.9| 1.5|[0.00187573270111...|Versicolor|\n| 1| 4.9| 2.4| 3.3| 1.0|[0.0, 0.897264437...|Versicolor|\n| 1| 5.0| 2.0| 3.5| 1.0|[0.0, 0.952972973...|Versicolor|\n| 1| 6.2| 2.2| 4.5| 1.5|[0.00165477715401...|Versicolor|\n| 1| 5.6| 2.5| 3.9| 1.1|[0.00185250030391...|Versicolor|\n| 1| 6.6| 3.0| 4.4| 1.4|[0.00185250030391...|Versicolor|\n| 1| 6.8| 2.8| 4.8| 1.4|[0.00182066453578...|Versicolor|\n| 1| 5.5| 2.6| 4.4| 1.2|[0.00185250030391...|Versicolor|\n| 1| 5.6| 2.7| 4.2| 1.3|[0.00185250030391...|Versicolor|\n| 2| 6.5| 3.2| 5.1| 2.0|[0.00186320937864...| Virginica|\n| 2| 6.1| 3.0| 4.9| 1.8|[0.00177436049103...| Virginica|\n| 2| 6.7| 3.1| 5.6| 2.4|[0.00186320937864...| Virginica|\n| 2| 6.2| 3.4| 5.4| 2.3|[0.00172500539058...| Virginica|\n+-----+-----------+----------+----------+----------+--------------------+----------+\n\n","output_type":"stream"},{"execution_count":21,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"markdown","source":"## Deployment of Standalone Functionality To Production\n\nThe MOJO model can be deployed to production without needing access to a running H2O or Spark instance.\nAll you need to do is add the following dependency:\n\n<dependency>\n <groupId>ai.h2o</groupId>\n <artifactId>h2o-genmodel</artifactId>\n <version>3.10.4.2</version>\n</dependency>\n\nFurther details can be found at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html\n\n","metadata":{}},{"cell_type":"code","source":"val csv = spark.read.format(\"csv\")\n .option(\"inferSchema\", \"true\")\n .option(\"header\", \"true\")\n .load(\"prediction.csv/*.csv\")\n\ncsv.show","metadata":{"trusted":true},"execution_count":22,"outputs":[{"name":"stdout","text":"+-----------+----------+----------+----------+\n|sepalLength|sepalWidth|petalLenth|petalWidth|\n+-----------+----------+----------+----------+\n| 
5.0| 3.4| 1.5| 0.2|\n| 4.9| 3.1| 1.5| 0.1|\n| 5.7| 4.4| 1.5| 0.4|\n| 5.4| 3.4| 1.7| 0.2|\n| 4.8| 3.1| 1.6| 0.2|\n| 5.0| 3.5| 1.3| 0.3|\n| 6.9| 3.1| 4.9| 1.5|\n| 4.9| 2.4| 3.3| 1.0|\n| 5.0| 2.0| 3.5| 1.0|\n| 6.2| 2.2| 4.5| 1.5|\n| 5.6| 2.5| 3.9| 1.1|\n| 6.6| 3.0| 4.4| 1.4|\n| 6.8| 2.8| 4.8| 1.4|\n| 5.5| 2.6| 4.4| 1.2|\n| 5.6| 2.7| 4.2| 1.3|\n| 6.5| 3.2| 5.1| 2.0|\n| 6.1| 3.0| 4.9| 1.8|\n| 6.7| 3.1| 5.6| 2.4|\n| 6.2| 3.4| 5.4| 2.3|\n+-----------+----------+----------+----------+\n\n","output_type":"stream"},{"execution_count":22,"output_type":"execute_result","data":{"text/plain":"null"},"metadata":{}}]},{"cell_type":"code","source":"import _root_.hex.genmodel.GenModel\nimport _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}\nimport _root_.hex.genmodel.easy.prediction\nimport _root_.hex.genmodel.MojoModel\nimport _root_.hex.genmodel.easy.RowData\n","metadata":{"trusted":true},"execution_count":23,"outputs":[{"execution_count":23,"output_type":"execute_result","data":{"text/plain":"import _root_.hex.genmodel.GenModel\nimport _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}\nimport _root_.hex.genmodel.easy.prediction\nimport _root_.hex.genmodel.MojoModel\nimport _root_.hex.genmodel.easy.RowData\n"},"metadata":{}}]},{"cell_type":"code","source":"def predict(easyModel: EasyPredictModelWrapper, values:Array[Double]):String = {\n var row = new RowData()\n row.put(\"sepalLength\", values(0).toString)\n row.put(\"sepalWidth\", values(1).toString)\n row.put(\"petalLength\", values(2).toString)\n row.put(\"petalWidth\", values(3).toString)\n\n var p = easyModel.predictMultinomial(row);\n p.label\n}\n\nvar easyModel = new EasyPredictModelWrapper( MojoModel.load(\"model.mojo\"));\npredict(easyModel, Array(5.0,3.4,1.4,0.2))\n","metadata":{"trusted":true},"execution_count":24,"outputs":[{"execution_count":24,"output_type":"execute_result","data":{"text/plain":"Setosa"},"metadata":{}}]},{"cell_type":"code","source":"csv.collect\n .map(line => 
Array(line.getDouble(0),line.getDouble(1),line.getDouble(2),line.getDouble(3)))\n .foreach(a => println(s\"${a(0)},${a(1)},${a(2)},${a(3)} => ${predict(easyModel, a)}\"))\n\n","metadata":{"trusted":true},"execution_count":25,"outputs":[{"name":"stdout","text":"5.0,3.4,1.5,0.2 => Setosa\n4.9,3.1,1.5,0.1 => Setosa\n5.7,4.4,1.5,0.4 => Setosa\n5.4,3.4,1.7,0.2 => Setosa\n4.8,3.1,1.6,0.2 => Setosa\n5.0,3.5,1.3,0.3 => Setosa\n6.9,3.1,4.9,1.5 => Versicolor\n4.9,2.4,3.3,1.0 => Versicolor\n5.0,2.0,3.5,1.0 => Versicolor\n6.2,2.2,4.5,1.5 => Versicolor\n5.6,2.5,3.9,1.1 => Versicolor\n6.6,3.0,4.4,1.4 => Versicolor\n6.8,2.8,4.8,1.4 => Versicolor\n5.5,2.6,4.4,1.2 => Versicolor\n5.6,2.7,4.2,1.3 => Versicolor\n6.5,3.2,5.1,2.0 => Virginica\n6.1,3.0,4.9,1.8 => Virginica\n6.7,3.1,5.6,2.4 => Virginica\n6.2,3.4,5.4,2.3 => Virginica\n","output_type":"stream"}]},{"cell_type":"code","source":"","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]} |
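The notebook's maxIndex UDF maps a probability vector to the index of its largest entry, then looks that index up in the variety domain. A minimal standalone sketch of the same mapping in plain Scala (no Spark or H2O required; the domain array is copied from the `h2oFrame.vec("variety").domain` output in the notebook):

```scala
// Standalone sketch of the probability-to-label mapping used in the notebook.
object LabelMapping {
  // Domain as reported by h2oFrame.vec("variety").domain above.
  val domain = Array("Setosa", "Versicolor", "Virginica")

  // Index of the largest probability (same logic as the maxIndex UDF).
  def maxIndex(probabilities: Array[Double]): Int =
    probabilities.indexOf(probabilities.max)

  // Map a probability vector back to its variety label.
  def label(probabilities: Array[Double]): String =
    domain(maxIndex(probabilities))

  def main(args: Array[String]): Unit = {
    // First prediction row from the notebook's output.
    println(label(Array(0.9975669100660882, 0.0, 0.002433089933911768))) // Setosa
  }
}
```

The Spark version in the notebook reaches the same result by registering maxIndex as a UDF and joining against the domainsDF lookup table.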