{"cells":[{"metadata":{},"cell_type":"markdown","source":"# Let's Talk About Model Deployment\n<blockquote>\n Model Deployment is extrememly important and overly complex. We think that it should be as easy and straightforward as possible for Data Scientists to hand off their work to the Application Developers. We see that existing in the database itself. Straightforward APIs for the Data Scientists, and simple SQL statements for the Application Developers. <footer>Splice Machine</footer>\n</blockquote>\n\n#### Let's take a look at deploying some simple models to the database"},{"metadata":{"trusted":true},"cell_type":"code","source":"# Setup\nfrom pyspark.sql import SparkSession\nfrom splicemachine.spark import PySpliceContext\nfrom splicemachine.mlflow_support import *\n\nspark = SparkSession.builder.getOrCreate()\nsplice = PySpliceContext(spark)\nmlflow.register_splice_context(splice)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"help(mlflow.deploy_db)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Model Choice\n<blockquote>\n With our MLManager API, we've abstracted away the model itself, and made our functions model agnostic. Functions like <code>log_model</code> and <code>load_model</code> take any supported model type and handle the rest under the hood<footer>Splice Machine</footer>\n</blockquote>\n\n## We'll try it out with SKLearn, Spark and H2O\n<blockquote>\n Because we're focusing on model deployment specifically, we will skip the logging of parameters and metrics etc. For more information on that, see some of our other <a href='./7.0 MLManager Index.ipynb'>MLManager tutorials</a><footer>Splice Machine</footer>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"# Set our Experiment\nmlflow.set_experiment('simple model deployment')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## SKLearn\n#### Build our SKLearn Model"},{"metadata":{"trusted":true,"scrolled":true},"cell_type":"code","source":"from sklearn.datasets import load_iris\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.ensemble import GradientBoostingClassifier\nfrom sklearn.metrics import mean_squared_error\nfrom splicemachine.mlflow_support.utilities import get_user\nimport pandas as pd\nimport numpy as np\n\n# Load our Data\niris = load_iris()\ndf = pd.DataFrame(data= np.c_[iris['data'], iris['target']],\n columns= ['sepal_length', 'sepal_width','petal_length','petal_width','label'])\n\n# Split into train/test\ntrain, test = train_test_split(df, test_size=0.2)\n\n# Train, save and deploy\nwith mlflow.start_run(run_name='SKlearn'):\n model = GradientBoostingClassifier(n_estimators=10, learning_rate=1.0)\n X_train,y_train = train[train.columns[:-1]], train[train.columns[-1]]\n y_train = y_train.map(lambda x: int(x)) # So the model outputs int format\n X_test,y_test = test[test.columns[:-1]], test[test.columns[-1]]\n \n model.fit(X_train,y_train)\n print('MSE:', mean_squared_error(y_test, model.predict(X_test)))\n run_id = mlflow.current_run_id()\n # Save the model for deployment or later use\n mlflow.log_model(model, 'sklearn_model')\n \n# Deploy the model\nschema = get_user()\nsplice._dropTableIfExists(f'{schema}.sklearn_model')\nmlflow.deploy_db(schema, 'sklearn_model', run_id, primary_key=[('MOMENT_KEY', 'INT')], df=df, model_cols=list(X_train.columns), \n create_model_table=True, 
verbose=True)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### That's it! \n<blockquote>\n It really was that easy. You may be thinking \"well, now what? How do I see my model in action?\" That's a great question, and that's easy too. If you look at the output above, you can see a table called <code>SKLEARN_TABLE</code> was created for you with the columns of your model as well as the primary key provided and an extra column for prediction.<br>\n To invoke your model, simply insert a row of data<footer>Splice Machine</footer>\n</blockquote>\n\n#### Let's use the model"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%sql\ninsert into sklearn_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (1.5, 2.2, 2.5, 4.4, 1);\ninsert into sklearn_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (2.7, 4.0, 3.1, 1.9, 2);","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### View results"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%sql\nselect * from sklearn_model;","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Pretty Cool!\n<blockquote>As you can see, the <code>deploy_db</code> function created the table and injected your ML model directly inside. It also added some automatic columns to track which model is making the predictions, who is using your model and when. If you deploy more complex models with probabilities, more columns will be created to handle that. We can tell MLManager which SKlearn function call to use by passing in the <code>sklearn_args</code> parameter. Let's try that next.<footer>Splice Machine</footer>\n</blockquote>\n \n#### Deploy model with complex output"},{"metadata":{"trusted":true},"cell_type":"code","source":"# This SKLearn model can also output the probability of each column. Let's deploy out model so it contains those probabilities\nprint(f'Model prediction of {X_test.iloc[0].values} with probabilities:', model.predict_proba(X_test)[0], '\\n')\n\nsplice._dropTableIfExists(f'{schema}.sklearn_model_probs')\n# Deploy our model\nmlflow.deploy_db(schema, 'sklearn_model_probs', run_id, primary_key=[('MOMENT_KEY', 'INT')], df=df, model_cols=list(X_train.columns), \n create_model_table=True, classes=list(iris.target_names), verbose=True, sklearn_args={'predict_call': 'predict_proba'}) # Added sklearn_args\n","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Let's use the model"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%sql\ninsert into sklearn_model_probs (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (6.3, 2.9, 5.6, 1.8, 1);\nselect * from sklearn_model_probs;","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Great!\n<blockquote>You can see that the probabilities of each class match that of the local model prediction two cells above. The prediction column contains the index of prediction class. 
<br>So, prediction a of 2 means that the model is predicting the 3rd column, virginica (remember that indexes start at 0!), just like the model.<footer>Splice Machine</footer>\n</blockquote>\n\n#### Show the Prediction from the model"},{"metadata":{"trusted":true},"cell_type":"code","source":"print(model.predict([[6.3, 2.9, 5.6, 1.8]])[0])\nprint(model.predict_proba([[6.3, 2.9, 5.6, 1.8]])[0])","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## SparkML\n<blockquote>Let's try the same thing with Spark. Although Spark is typically used for big data processesing, their <a href='https://spark.apache.org/docs/2.4.0/ml-classification-regression.html'>ML Libraries</a> come with some pretty powerful models as well. And they can scale for massive datasets too.<footer>Splice Machine</footer>\n</blockquote>\n\n#### Build and deploy a SparkML model"},{"metadata":{"trusted":true},"cell_type":"code","source":"from pyspark.ml.feature import VectorAssembler\nfrom pyspark.ml.classification import RandomForestClassifier\nfrom pyspark.ml import Pipeline\nfrom splicemachine.stats import SpliceMultiClassificationEvaluator\n\n# Create our dataset\nspark_df = spark.createDataFrame(df)\nspark_df.show(5)\ntrain, test = spark_df.randomSplit([0.8,0.2])\n\nwith mlflow.start_run(run_name='spark'):\n # Set our feature vector to be all column except the label\n va = VectorAssembler(inputCols = train.columns[:-1], outputCol='features')\n rf = RandomForestClassifier(labelCol='label', featuresCol='features')\n pipeline = Pipeline(stages=[va,rf])\n \n trainedModel = pipeline.fit(train)\n predictions = trainedModel.transform(test)\n # Log our model for deployment or future use\n mlflow.log_model(trainedModel, 'spark_model')\n \n ev = SpliceMultiClassificationEvaluator(spark)\n ev.input(predictions)\n run_id = mlflow.current_run_id()\n \nsplice._dropTableIfExists(f'{schema}.spark_model')\n# Deploy our model\nmlflow.deploy_db(schema, 'spark_model', run_id, primary_key=[('MOMENT_KEY', 'INT')], df=spark_df, model_cols=spark_df.columns[:-1], \n create_model_table=True, classes=list(iris.target_names), verbose=True)\n \n","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Try out our model"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%sql\ninsert into spark_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (5.1, 3.5, 1.4, 0.2, 1);\ninsert into spark_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (4.9, 3.0, 1.4, 0.2, 2);\ninsert into spark_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (2.7, 4.0, 3.1, 1.9, 3);\nselect * from spark_model","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Last but not least, H2O\n<blockquote>Lastly, we'll build an H2O model for the same prediction task. <a href='http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2ocoxproportionalhazardsestimator'>H2O AI</a> is an extrememly powerful, distributed ML framework with a plethora of Machine Learning models. 
## Last but not least, H2O
<blockquote>Lastly, we'll build an H2O model for the same prediction task. <a href='http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2ocoxproportionalhazardsestimator'>H2O AI</a> is an extremely powerful, distributed ML framework with a plethora of Machine Learning models. These models can handle massive data, just like Spark, and they offer very sophisticated algorithms. We've pre-configured <a href='http://docs.h2o.ai/sparkling-water/2.1/latest-stable/doc/pysparkling.html'>H2O PySparkling Water</a> in our system so you can use it immediately.<footer>Splice Machine</footer>
</blockquote>

#### Build an H2O model

```python
# First, we start our PySparkling Cluster
from pysparkling import *
import h2o

# Create H2O Cluster
conf = H2OConf().setInternalClusterMode()
hc = H2OContext.getOrCreate(conf)
```

```python
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Get data
hdf = hc.asH2OFrame(spark_df)
hdf['label'] = hdf['label'].asfactor()
train, test = hdf.split_frame(ratios=[0.8])

with mlflow.start_run(run_name='h2o'):
    model = H2ODeepLearningEstimator()
    model.train(x=train.columns[:-1],
                y=train.columns[-1],
                training_frame=train)
    print('logloss', model.logloss())

    mlflow.log_model(model, 'h2o_model')
    run_id = mlflow.current_run_id()

splice._dropTableIfExists(f'{schema}.h2o_model')
# Deploy our model
mlflow.deploy_db(schema, 'h2o_model', run_id, primary_key=[('MOMENT_KEY', 'INT')],
                 df=spark_df, model_cols=spark_df.columns[:-1],
                 create_model_table=True, classes=list(iris.target_names), verbose=True)
```

#### Invoke our model

```sql
%%sql
insert into h2o_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (6.7, 3.1, 5.6, 2.4, 1);
insert into h2o_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (4.9, 3.0, 1.4, 0.2, 2);
insert into h2o_model (sepal_length, sepal_width, petal_length, petal_width, moment_key) values (5.6, 3.0, 4.5, 1.5, 3);
select * from h2o_model;
```
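Here too you can do an optional local check with plain H2O. For a multinomial model, H2O's `predict` returns a frame with a `predict` column plus one probability column per class level, which should mirror the prediction output you see in the deployed table. The `row` frame below is illustrative only.

```python
import h2o

# Illustrative local check: score the first row inserted above directly
# with the trained H2O estimator (one row, same four feature columns)
row = h2o.H2OFrame([[6.7, 3.1, 5.6, 2.4]],
                   column_names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
model.predict(row).show()
```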
### Amazing!
<blockquote>Just like that, you've deployed 3 models to the database, one of them in two different ways! If you'd like to see everything you've learned put together in an end-to-end example, check out our <a href='./Examples'>Example</a> notebooks.<footer>Splice Machine</footer>
</blockquote>