Skip to content

Instantly share code, notes, and snippets.

@nathansutton
Created March 6, 2019 21:10
Show Gist options
  • Save nathansutton/56c9c040f6183c0ba6bba133798b3ae7 to your computer and use it in GitHub Desktop.
Save nathansutton/56c9c040f6183c0ba6bba133798b3ae7 to your computer and use it in GitHub Desktop.
title output
html_document

https://gist.github.com/1895a6f8b6b89a17df8a82cad4a93d25

Deploy your H2O model without hassle in plain SQL

I consistently feel (warranted) pressure to improve my model with more elaborate feature engineering only to later facepalm when I need to put that model into production. The path forwards to deploy a complex model is clear -> make an API. This post represents the other end of the spectrum -> do as much in the database as possible.

tl;dr

You can still leverage all the power of Python/R to build and validate a generalized linear model. These functions will allow you to quickly deploy your model in a database as raw SQL code.

Who this is for

I have written this article especially for

  • Data Science practitioners
  • who are asked to deliver interpretable linear models quickly
  • whose data starts in SQL
  • whose data ends in SQL
  • who want to keep is simple

Functions

In this post we'll use the R version.

https://gist.github.com/4aecc0fa499e779941dfbdaaa5a41405

But you could just as easily use the equivalent Python.

https://gist.github.com/b8a34b6c9f310a7a3a299ae9b4b9d226

Walkthrough

In this post I'll build a purposefully simple model for all-cause readmissions based on a demo sample of the MIMIC-III critical care database. This is a relational database that keeps track of all of a patient's medical information (labs, diagnoses, vitals, etc) during a single hospital stay.

I purposefully did all the feature engineering / denormalization in SQL so that my deployment script would be this simple.

https://gist.github.com/7b7f62d7c36cf3b112f02af7e9497a02

In another post I'll speak about how I got to this view from the MIMIC source tables. The important thing to know here is that the rows represent a cross-section of individuals across time. This is a really simple model, and so each of the columns represent a binary indicator (encoded as 1=present, 0=absent).

Now that we've inspected the data, let's bring it into a dataframe.

https://gist.github.com/0dc8113b511aea7c24826846d5d34e0b

With our data in hand we can build a simple logistic regression model with H2O.

https://gist.github.com/f2b2ad86cfe5b6d176d69a846ec1c6db

Then translate it into SQL.

https://gist.github.com/732849a7a7f4e484788decc6cebd183c

The power here is that we can cut R (or Python) out of the loop entirely. All that we need to do is put this into an appropriate SQL structure, like a view.

https://gist.github.com/128336e2c1924fd946dbc8bc9bf6bd0a

This function is only useful if it faithfully returns the same predictions.

https://gist.github.com/6fee816eea9fd89f44ab12ec0ca3b197

https://gist.github.com/b9da2361a2b375929dbcd0776d6f178a

Walk away

Some of my most successful deployments have followed this pattern. Without any complications from R or Python code to mess anything up, these views silently did their job and never failed.

Feel free to (reach out)[https://www.linkedin.com/in/nasutton] and connect if this is helpful to you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment