https://gist.github.com/1895a6f8b6b89a17df8a82cad4a93d25
I consistently feel (warranted) pressure to improve my model with more elaborate feature engineering, only to facepalm later when I need to put that model into production. The path forward for deploying a complex model is clear: make an API. This post represents the other end of the spectrum: do as much in the database as possible.
You can still leverage all the power of Python/R to build and validate a generalized linear model. These functions will allow you to quickly deploy your model in a database as raw SQL code.
I have written this article especially for:
- Data Science practitioners
- who are asked to deliver interpretable linear models quickly
- whose data starts in SQL
- whose data ends in SQL
- who want to keep it simple
In this post we'll use the R version.
https://gist.github.com/4aecc0fa499e779941dfbdaaa5a41405
But you could just as easily use the equivalent Python.
https://gist.github.com/b8a34b6c9f310a7a3a299ae9b4b9d226
In this post I'll build a purposefully simple model of all-cause readmission, based on a demo sample of the MIMIC-III critical care database. This is a relational database that tracks all of a patient's medical information (labs, diagnoses, vitals, etc.) during a single hospital stay.
I purposefully did all the feature engineering / denormalization in SQL so that my deployment script would be this simple.
https://gist.github.com/7b7f62d7c36cf3b112f02af7e9497a02
In another post I'll describe how I built this view from the MIMIC source tables. The important thing to know here is that the rows represent a cross-section of individuals across time. This is a really simple model, so each of the columns represents a binary indicator (encoded as 1 = present, 0 = absent).
Now that we've inspected the data, let's bring it into a dataframe.
https://gist.github.com/0dc8113b511aea7c24826846d5d34e0b
With our data in hand, we can build a simple logistic regression model with H2O.
https://gist.github.com/f2b2ad86cfe5b6d176d69a846ec1c6db
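In case the gist doesn't render, here is a minimal sketch of the same step with scikit-learn standing in for H2O. The feature matrix and outcome are randomly generated stand-ins for the view's binary indicator columns; nothing here comes from the actual MIMIC data.

```python
# Sketch only: scikit-learn stands in for H2O's GLM, and the data is
# synthetic -- two binary indicator features and a binary outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2))   # binary indicator features
y = rng.integers(0, 2, size=200)        # binary readmission outcome

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)    # the numbers we'll push into SQL
```

The only artifacts we need from the fit are the intercept and the coefficient vector; everything downstream is arithmetic the database can do itself.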
Then translate it into SQL.
https://gist.github.com/732849a7a7f4e484788decc6cebd183c
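If the gist above doesn't load, a translation function of this kind can be sketched in a few lines. The function name, column names, and table name below are hypothetical, but the core idea is exactly this: render the linear predictor as a SQL expression and wrap it in the inverse logit.

```python
def glm_to_sql(intercept, coefs, table):
    """Render a fitted logistic regression as a SQL SELECT.

    `coefs` maps column name -> coefficient. The expression applies the
    inverse logit 1 / (1 + exp(-eta)) to the linear predictor eta, which
    is what the model's predict-probability step computes.
    """
    terms = " + ".join(f"{b:.10f} * {col}" for col, b in coefs.items())
    linpred = f"{intercept:.10f} + {terms}"
    return (
        f"SELECT 1 / (1 + EXP(-({linpred}))) AS readmission_prob\n"
        f"FROM {table}"
    )

# Hypothetical coefficients and names, for illustration only.
sql = glm_to_sql(-1.2, {"age_over_65": 0.8, "prior_admission": 1.5}, "features")
print(sql)
```

Printing many decimal places on each coefficient matters: truncated coefficients are the easiest way to make the SQL scores drift from the model's.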
The power here is that we can cut R (or Python) out of the loop entirely. All that we need to do is put this into an appropriate SQL structure, like a view.
https://gist.github.com/128336e2c1924fd946dbc8bc9bf6bd0a
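In case the gist doesn't render: the wrapper can be as simple as a view over the generated expression. The names below are hypothetical stand-ins, not the post's actual schema.

```sql
-- Hypothetical names; the real view wraps the generated expression verbatim.
CREATE VIEW readmission_scores AS
SELECT patient_id,
       1 / (1 + EXP(-(-1.2 + 0.8 * age_over_65 + 1.5 * prior_admission)))
         AS readmission_prob
FROM features;
```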
This translation is only useful if the SQL faithfully returns the same predictions as the original model, so we should verify that before trusting it.
https://gist.github.com/6fee816eea9fd89f44ab12ec0ca3b197
https://gist.github.com/b9da2361a2b375929dbcd0776d6f178a
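The verification pattern in the gists above can be sketched without a database at all: score the same rows once through the model's inverse-logit formula and once through the exact arithmetic the SQL expression performs, then check they agree. The coefficients and data below are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical fitted coefficients (stand-ins for the model's output).
intercept = -1.2
coefs = np.array([0.8, 1.5])

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(100, 2)).astype(float)

# "Model-side" predictions: inverse logit of the linear predictor.
model_p = 1 / (1 + np.exp(-(intercept + X @ coefs)))

# "Database-side" predictions: the term-by-term arithmetic the
# generated SQL expression would perform.
sql_p = 1 / (1 + np.exp(-(intercept + 0.8 * X[:, 0] + 1.5 * X[:, 1])))

assert np.allclose(model_p, sql_p)
print("max abs diff:", np.abs(model_p - sql_p).max())
```

In practice you would run the real view against a held-out table and compare its scores to the model's predictions the same way.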
Some of my most successful deployments have followed this pattern. With no R or Python runtime in the loop to break, these views quietly did their job and never failed.
Feel free to [reach out](https://www.linkedin.com/in/nasutton) and connect if this is helpful to you!