@pschatzmann
Created November 5, 2018 17:43
# H2O Sparkling Water - Distributed Random Forest

In this document we demonstrate how [H2O](https://www.h2o.ai/) can be used with Scala & Spark to run a Distributed Random Forest classification.

H2O and the related documentation for Sparkling Water can be found at http://h2o-release.s3.amazonaws.com/h2o/rel-xia/1/index.html

We are using Jupyter with the BeakerX Scala kernel.

## Setup

We install the full Sparkling Water package, which also includes Spark, with the help of Maven:

```scala
%%classpath add mvn
ai.h2o:sparkling-water-package_2.11:2.3.17
```

In order to prevent subsequent runtime errors we must take care of the following:

- We did not install the Scala REPL, so we disable the H2O REPL as well.
- There are currently some class conflicts with Jetty, so we deactivate the REST API.

```scala
import water.H2O

H2O.ARGS.disable_web = true
System.setProperty("spark.ext.h2o.repl.enabled", "false")

System.getProperty("spark.ext.h2o.repl.enabled")
// false
```

Now we are ready to start Spark ...

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Iris NaiveBayes")
  .master("local")
  .config("spark.ui.enabled", "false")
  .getOrCreate()
```

... and we can create the related H2O context:

```scala
import org.apache.spark.h2o._

val h2oContext = H2OContext.getOrCreate(spark)
```

```
Sparkling Water Context:
 * H2O name: sparkling-water-beakerx_local-1541437783196
 * cluster size: 1
 * list of used nodes:
   (executorId, host, port)
   ------------------------
   (driver,928b6a866c15,54323)
   ------------------------
```

## Data Preparation

We start by loading iris.csv directly into an H2OFrame and double-check the data types: the variety column has been converted to an Enum!

```scala
import water.fvec.H2OFrame
import java.net.URL

val h2oFrame = new H2OFrame(new URL("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv").toURI)

h2oFrame.names
// [sepal.length, sepal.width, petal.length, petal.width, variety]
```

We rename the columns to camel case:

```scala
h2oFrame.rename(0, "sepalLength")
h2oFrame.rename(1, "sepalWidth")
h2oFrame.rename(2, "petalLength")
h2oFrame.rename(3, "petalWidth")

h2oFrame.names
// [sepalLength, sepalWidth, petalLength, petalWidth, variety]
```

```scala
h2oFrame.typesStr
// [Numeric, Numeric, Numeric, Numeric, Enum]
```

```scala
var domains = h2oFrame.vec("variety").domain
// [Setosa, Versicolor, Virginica]
```

We can display the data by converting it to a DataFrame:

```scala
val sqlContext = new org.apache.spark.sql.SQLContext(spark.sparkContext)
val ds = h2oContext.asDataFrame(h2oFrame)(sqlContext).toDF

ds.show()
```

```
+-----------+----------+-----------+----------+-------+
|sepalLength|sepalWidth|petalLength|petalWidth|variety|
+-----------+----------+-----------+----------+-------+
|        5.1|       3.5|        1.4|       0.2| Setosa|
|        4.9|       3.0|        1.4|       0.2| Setosa|
|        4.7|       3.2|        1.3|       0.2| Setosa|
|        4.6|       3.1|        1.5|       0.2| Setosa|
|        5.0|       3.6|        1.4|       0.2| Setosa|
+-----------+----------+-----------+----------+-------+
only showing top 20 rows
```

Then we split the frame into a training and a test frame. The call to splitFrame also shuffles the data!

```scala
import water.support.H2OFrameSupport

val keys = Array("train.hex", "test.hex")
val ratios = Array(0.9, 0.1)

val Array(train, test) = H2OFrameSupport.splitFrame(h2oFrame, keys, ratios)
val h2oTraining = h2oContext.asH2OFrame(train)
val h2oTest = h2oContext.asH2OFrame(test)
```

```
Frame key: train.hex  cols: 5  rows: 131  chunks: 1  size: 3177
Frame key: test.hex   cols: 5  rows: 19   chunks: 1  size: 1625
```

## Training of the Classifier

We define the response column and provide both the training and the test frame, so that the system reports metrics for both datasets. We obtain the trained model by calling trainModel.get:

```scala
import hex.tree.drf.DRF
import hex.tree.drf.DRFModel
import h2oContext.implicits._

var parameters = new DRFModel.DRFParameters()
parameters._train = h2oTraining
parameters._valid = h2oTest
parameters._response_column = "variety"

var model = new DRF(parameters).trainModel.get
```

## Evaluation

The toString method of the model provides the evaluation of the training and test datasets:

```scala
model.toString
```

Key excerpts of the output (the per-tree scoring history is omitted here):

```
Model Metrics Type: Multinomial
 Description: Metrics reported on Out-Of-Bag training samples
 frame id: train.hex
 MSE: 0.04673963  RMSE: 0.2161935  logloss: 0.39659226
 mean_per_class_error: 0.06150583
 CM: Confusion Matrix (Row labels: Actual class; Column labels: Predicted class):
            Setosa Versicolor Virginica  Error     Rate
    Setosa      44          0         0 0.0000   0 / 44
Versicolor       0         37         4 0.0976   4 / 41
 Virginica       0          4        42 0.0870   4 / 46
    Totals      44         41        46 0.0611  8 / 131

Model Metrics Type: Multinomial
 frame id: test.hex
 MSE: 0.009293742  RMSE: 0.09640405  logloss: 0.059128653
 mean_per_class_error: 0.0
 CM: Confusion Matrix (Row labels: Actual class; Column labels: Predicted class):
            Setosa Versicolor Virginica  Error    Rate
    Setosa       6          0         0 0.0000   0 / 6
Versicolor       0          9         0 0.0000   0 / 9
 Virginica       0          0         4 0.0000   0 / 4
    Totals       6          9         4 0.0000  0 / 19

Variable Importances:
    Variable Relative Importance Scaled Importance Percentage
 petalLength         1795.945801          1.000000   0.474394
  petalWidth         1578.936157          0.879167   0.417072
 sepalLength          327.999115          0.182633   0.086640
  sepalWidth           82.884659          0.046151   0.021894

Model Summary:
 Number of Trees: 50  Number of Internal Trees: 150  Model Size in Bytes: 19906
 Min. Depth: 1  Max. Depth: 8  Mean Depth: 3.62667
 Min. Leaves: 2  Max. Leaves: 12  Mean Leaves: 5.91333
```

## Prediction

We display the input data. We remove the variety column to make sure that the prediction works with only the 4 input columns:

```scala
import org.apache.spark.sql.SaveMode

h2oTest.remove("variety")

var predictionDF = h2oContext.asDataFrame(h2oTest)(sqlContext).toDF

predictionDF.write.format("csv").option("header", "true").mode(SaveMode.Overwrite).save("prediction.csv")
predictionDF.show()
```

```
+-----------+----------+-----------+----------+
|sepalLength|sepalWidth|petalLength|petalWidth|
+-----------+----------+-----------+----------+
|        5.0|       3.4|        1.5|       0.2|
|        4.9|       3.1|        1.5|       0.1|
|        5.7|       4.4|        1.5|       0.4|
...
|        6.2|       3.4|        5.4|       2.3|
+-----------+----------+-----------+----------+
```

We can execute a prediction by calling the score method on the model:

```scala
val predictionResult = model.score(h2oTest)
```

Finally we convert the result to a Spark DataFrame so that we can display it:

```scala
val df1 = h2oContext.asDataFrame(predictionResult)(sqlContext).toDF

df1.show()
```

```
+----------+--------------------+-------------------+--------------------+
|   predict|              Setosa|         Versicolor|           Virginica|
+----------+--------------------+-------------------+--------------------+
|    Setosa|  0.9975669100660882|                0.0|0.002433089933911768|
|    Setosa|  0.9975669100660882|                0.0|0.002433089933911768|
|Versicolor|0.001875732701114...| 0.7561547479537011| 0.24196951934518435|
|Versicolor|                 0.0| 0.8972644377915675| 0.10273556220843251|
| Virginica|0.001863209378640...|                0.0|  0.9981367906213595|
| Virginica|0.001774360491032...|0.09537187674828396|  0.9028537627606835|
+----------+--------------------+-------------------+--------------------+
```

## Saving and Loading Binary Models

```scala
import water.support.ModelSerializationSupport

ModelSerializationSupport.exportH2OModel(model, "model.bin", true)
var modelLoaded: DRFModel = ModelSerializationSupport.loadH2OModel("model.bin")

modelLoaded.getClass
// class hex.tree.drf.DRFModel
```

Scoring with the reloaded model yields the same predictions as above:

```scala
val predictionResult = modelLoaded.score(h2oTest)
val df1 = h2oContext.asDataFrame(predictionResult)(sqlContext).toDF

df1.show()
```

## Saving and Loading MOJO Models in Spark

```scala
import org.apache.spark.ml.h2o.models._

ModelSerializationSupport.exportMOJOModel(model, "model.mojo", true)
val mojoModelLoaded = H2OMOJOModel.createFromMojo("model.mojo")

mojoModelLoaded.getClass
// class org.apache.spark.ml.h2o.models.H2OMOJOModel
```

```scala
import org.apache.spark.sql.functions._

val df = mojoModelLoaded.transform(predictionDF)
  .withColumn("probabilities", expr("prediction_output.probabilities"))
  .drop("prediction_output")

df.show
```

```
+-----------+----------+-----------+----------+--------------------+
|sepalLength|sepalWidth|petalLength|petalWidth|       probabilities|
+-----------+----------+-----------+----------+--------------------+
|        5.0|       3.4|        1.5|       0.2|[0.99756691006608...|
|        6.9|       3.1|        4.9|       1.5|[0.00187573270111...|
|        6.5|       3.2|        5.1|       2.0|[0.00186320937864...|
+-----------+----------+-----------+----------+--------------------+
```

To translate the probability vector back into a class label, we build a lookup table from the variety domain:

```scala
import spark.implicits._

var domains = h2oFrame.vec("variety").domain.toSeq
println(domains)

var domainsDF = spark.sparkContext.parallelize(domains).zipWithIndex.toDF("variety", "index")
domainsDF.show
```

```
WrappedArray(Setosa, Versicolor, Virginica)
+----------+-----+
|   variety|index|
+----------+-----+
|    Setosa|    0|
|Versicolor|    1|
| Virginica|    2|
+----------+-----+
```

```scala
def maxIndex: (collection.mutable.WrappedArray[Double] => Int) = { array => array.indexOf(array.max) }
spark.udf.register("maxIndex", maxIndex)

var predictionResultDF = df.select(df.col("*"), callUDF("maxIndex", df.col("probabilities")).name("index"))
  .join(domainsDF, Seq("index"))

predictionResultDF.show
```

```
+-----+-----------+----------+-----------+----------+--------------------+----------+
|index|sepalLength|sepalWidth|petalLength|petalWidth|       probabilities|   variety|
+-----+-----------+----------+-----------+----------+--------------------+----------+
|    0|        5.0|       3.4|        1.5|       0.2|[0.99756691006608...|    Setosa|
|    1|        6.9|       3.1|        4.9|       1.5|[0.00187573270111...|Versicolor|
|    2|        6.5|       3.2|        5.1|       2.0|[0.00186320937864...| Virginica|
+-----+-----------+----------+-----------+----------+--------------------+----------+
```

## Deployment of Standalone Functionality To Production

The MOJO model can be deployed to production without access to a running H2O or Spark instance. All you need to do is add the following dependency:

```xml
<dependency>
  <groupId>ai.h2o</groupId>
  <artifactId>h2o-genmodel</artifactId>
  <version>3.10.4.2</version>
</dependency>
```

Further details can be found at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html

We read back the test data that we saved earlier:

```scala
val csv = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("prediction.csv/*.csv")

csv.show
```

```scala
import _root_.hex.genmodel.GenModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}
import _root_.hex.genmodel.easy.prediction
import _root_.hex.genmodel.MojoModel
```

```scala
def predict(easyModel: EasyPredictModelWrapper, values: Array[Double]): String = {
  var row = new RowData()
  row.put("sepalLength", values(0).toString)
  row.put("sepalWidth", values(1).toString)
  row.put("petalLength", values(2).toString)
  row.put("petalWidth", values(3).toString)

  var p = easyModel.predictMultinomial(row)
  p.label
}

var easyModel = new EasyPredictModelWrapper(MojoModel.load("model.mojo"))
predict(easyModel, Array(5.0, 3.4, 1.4, 0.2))
// Setosa
```

```scala
csv.collect
  .map(line => Array(line.getDouble(0), line.getDouble(1), line.getDouble(2), line.getDouble(3)))
  .foreach(a => println(s"${a(0)},${a(1)},${a(2)},${a(3)} => ${predict(easyModel, a)}"))
```

```
5.0,3.4,1.5,0.2 => Setosa
4.9,3.1,1.5,0.1 => Setosa
5.7,4.4,1.5,0.4 => Setosa
5.4,3.4,1.7,0.2 => Setosa
4.8,3.1,1.6,0.2 => Setosa
5.0,3.5,1.3,0.3 => Setosa
6.9,3.1,4.9,1.5 => Versicolor
4.9,2.4,3.3,1.0 => Versicolor
5.0,2.0,3.5,1.0 => Versicolor
6.2,2.2,4.5,1.5 => Versicolor
5.6,2.5,3.9,1.1 => Versicolor
6.6,3.0,4.4,1.4 => Versicolor
6.8,2.8,4.8,1.4 => Versicolor
5.5,2.6,4.4,1.2 => Versicolor
5.6,2.7,4.2,1.3 => Versicolor
6.5,3.2,5.1,2.0 => Virginica
6.1,3.0,4.9,1.8 => Virginica
6.7,3.1,5.6,2.4 => Virginica
6.2,3.4,5.4,2.3 => Virginica
```
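The standalone prediction returns only the winning label, but the easy-predict API also exposes the per-class probabilities: `MultinomialModelPrediction.classProbabilities` is aligned with the response domain returned by `EasyPredictModelWrapper.getResponseDomainValues`. The following sketch (the helper name `predictWithProbabilities` is our own; it assumes the same `model.mojo` file and the h2o-genmodel dependency shown above) pairs the two up:

```scala
import _root_.hex.genmodel.MojoModel
import _root_.hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Hypothetical helper: returns the predicted label together with a
// probability per class instead of just the label.
def predictWithProbabilities(easyModel: EasyPredictModelWrapper,
                             values: Array[Double]): (String, Map[String, Double]) = {
  val row = new RowData()
  row.put("sepalLength", values(0).toString)
  row.put("sepalWidth", values(1).toString)
  row.put("petalLength", values(2).toString)
  row.put("petalWidth", values(3).toString)

  val p = easyModel.predictMultinomial(row)
  // classProbabilities has one entry per level of the response domain,
  // in the same order (Setosa, Versicolor, Virginica).
  val probs = easyModel.getResponseDomainValues.zip(p.classProbabilities).toMap
  (p.label, probs)
}

val easyModel = new EasyPredictModelWrapper(MojoModel.load("model.mojo"))
val (label, probs) = predictWithProbabilities(easyModel, Array(5.0, 3.4, 1.4, 0.2))
println(s"$label: $probs")
```

This is handy when the deployed service should report a confidence alongside the label, e.g. to route low-confidence predictions to a manual review step.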