https://gist.github.com/1895a6f8b6b89a17df8a82cad4a93d25
I consistently feel (warranted) pressure to improve my model with more elaborate feature engineering, only to facepalm later when I need to put that model into production. The path forward for deploying a complex model is clear: make an API. This post represents the other end of the spectrum: do as much in the database as possible.
You can still leverage all the power of Python/R to build and validate a generalized linear model. These functions will allow you to quickly deploy your model in a database as raw SQL code.
I have written this article especially for:
- Data Science practitioners
- who are asked to deliver interpretable linear models quickly
- whose data starts in SQL
- whose data ends in SQL
- who want to keep it simple
In this post we'll use the R version.
https://gist.github.com/4aecc0fa499e779941dfbdaaa5a41405
But you could just as easily use the equivalent Python.
https://gist.github.com/b8a34b6c9f310a7a3a299ae9b4b9d226
In this post I'll build a purposefully simple model of all-cause readmission, based on a demo sample of the MIMIC-III critical care database. This is a relational database that tracks all of a patient's medical information (labs, diagnoses, vitals, etc.) during a single hospital stay.
I purposefully did all the feature engineering / denormalization in SQL so that my deployment script would be this simple.
https://gist.github.com/7b7f62d7c36cf3b112f02af7e9497a02
In another post I'll describe how I built this view from the MIMIC source tables. The important thing to know here is that the rows represent a cross-section of individuals across time. This is a really simple model, so each of the columns represents a binary indicator (encoded as 1 = present, 0 = absent).
Now that we've inspected the data, let's bring it into a dataframe.
https://gist.github.com/0dc8113b511aea7c24826846d5d34e0b
With our data in hand, we can build a simple logistic regression model with H2O.
https://gist.github.com/f2b2ad86cfe5b6d176d69a846ec1c6db
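In case the gist doesn't render, here is a minimal sketch of the same step with scikit-learn standing in for H2O. The feature matrix and outcome are randomly generated stand-ins for the view's binary indicator columns; nothing here comes from the actual MIMIC data.

```python
# Sketch only: scikit-learn stands in for H2O's GLM, and the data is
# synthetic -- two binary indicator features and a binary outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2))   # binary indicator features
y = rng.integers(0, 2, size=200)        # binary readmission outcome

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)    # the numbers we'll push into SQL
```

The only artifacts we need from the fit are the intercept and the coefficient vector; everything downstream is arithmetic the database can do itself.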
Then translate it into SQL.
https://gist.github.com/732849a7a7f4e484788decc6cebd183c
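If the gist above doesn't load, a translation function of this kind can be sketched in a few lines. The function name, column names, and table name below are hypothetical, but the core idea is exactly this: render the linear predictor as a SQL expression and wrap it in the inverse logit.

```python
def glm_to_sql(intercept, coefs, table):
    """Render a fitted logistic regression as a SQL SELECT.

    `coefs` maps column name -> coefficient. The expression applies the
    inverse logit 1 / (1 + exp(-eta)) to the linear predictor eta, which
    is what the model's predict-probability step computes.
    """
    terms = " + ".join(f"{b:.10f} * {col}" for col, b in coefs.items())
    linpred = f"{intercept:.10f} + {terms}"
    return (
        f"SELECT 1 / (1 + EXP(-({linpred}))) AS readmission_prob\n"
        f"FROM {table}"
    )

# Hypothetical coefficients and names, for illustration only.
sql = glm_to_sql(-1.2, {"age_over_65": 0.8, "prior_admission": 1.5}, "features")
print(sql)
```

Printing many decimal places on each coefficient matters: truncated coefficients are the easiest way to make the SQL scores drift from the model's.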
The power here is that we can cut R (or Python) out of the loop entirely. All that we need to do is put this into an appropriate SQL structure, like a view.
https://gist.github.com/128336e2c1924fd946dbc8bc9bf6bd0a
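In case the gist doesn't render: the wrapper can be as simple as a view over the generated expression. The names below are hypothetical stand-ins, not the post's actual schema.

```sql
-- Hypothetical names; the real view wraps the generated expression verbatim.
CREATE VIEW readmission_scores AS
SELECT patient_id,
       1 / (1 + EXP(-(-1.2 + 0.8 * age_over_65 + 1.5 * prior_admission)))
         AS readmission_prob
FROM features;
```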
This translation is only useful if the SQL faithfully returns the same predictions as the original model, so we should verify that before trusting it.
https://gist.github.com/6fee816eea9fd89f44ab12ec0ca3b197
https://gist.github.com/b9da2361a2b375929dbcd0776d6f178a
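The verification pattern in the gists above can be sketched without a database at all: score the same rows once through the model's inverse-logit formula and once through the exact arithmetic the SQL expression performs, then check they agree. The coefficients and data below are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical fitted coefficients (stand-ins for the model's output).
intercept = -1.2
coefs = np.array([0.8, 1.5])

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(100, 2)).astype(float)

# "Model-side" predictions: inverse logit of the linear predictor.
model_p = 1 / (1 + np.exp(-(intercept + X @ coefs)))

# "Database-side" predictions: the term-by-term arithmetic the
# generated SQL expression would perform.
sql_p = 1 / (1 + np.exp(-(intercept + 0.8 * X[:, 0] + 1.5 * X[:, 1])))

assert np.allclose(model_p, sql_p)
print("max abs diff:", np.abs(model_p - sql_p).max())
```

In practice you would run the real view against a held-out table and compare its scores to the model's predictions the same way.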
Some of my most successful deployments have followed this pattern. With no R or Python runtime in the loop to break, these views quietly did their job and never failed.
Feel free to [reach out](https://www.linkedin.com/in/nasutton) and connect if this is helpful to you!