Data Science Pipeline

When working on a project, we probably go through the pipeline multiple times.

First pass: our MVP

  1. Acquire: whatever SQL query gives us workable data
  2. Prepare: drop nulls, split the data
  3. Explore: visualize the target against the independent variables
  4. Model: baseline, LinearRegression, LassoLars; compare performance with RMSE on validate

NB. We're not worried about scaling or automated feature engineering yet. A rough sketch of this first pass is below.
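A minimal sketch of what this first pass might look like. The connection string and column names (e.g. `tax_value`, `sqft`) are hypothetical placeholders, not part of any particular project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.metrics import mean_squared_error

# Acquire: whatever query gives us workable rows (connection string is a placeholder)
df = pd.read_sql('SELECT * FROM properties', 'mysql+pymysql://user:password@host/db')

# Prepare: drop nulls, then split into train / validate / test (60 / 20 / 20)
df = df.dropna()
train, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train, test_size=.25, random_state=123)

# Explore: the target against each independent variable, e.g.
# train.plot.scatter(x='sqft', y='tax_value')

# Model: baseline vs. two simple regressors, judged by RMSE on validate
X_train, y_train = train.drop(columns='tax_value'), train.tax_value
X_val, y_val = validate.drop(columns='tax_value'), validate.tax_value

baseline_rmse = mean_squared_error(y_val, [y_train.mean()] * len(y_val)) ** .5
print('baseline', baseline_rmse)

for model in [LinearRegression(), LassoLars(alpha=1)]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** .5
    print(type(model).__name__, rmse)
```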

Second pass:

  • focus on modeling: let's scale our data in prep and then try out the models on the scaled data (sketch below)
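Building on the variables from the first-pass sketch above, scaling might slot into prepare like this. MinMaxScaler is just one reasonable choice, not necessarily the one intended here:

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on train only, then transform train and validate the same way
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Same models as before, now on the scaled features
for model in [LinearRegression(), LassoLars(alpha=1)]:
    model.fit(X_train_scaled, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val_scaled)) ** .5
    print(type(model).__name__, 'scaled rmse:', rmse)
```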

Third pass:

  • let's look at null values more closely and impute instead of dropping
  • re-run the models to see if performance changes (sketch below)
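One way this could look, again building on the earlier sketch: replace the dropna in prepare with an imputer, fit it on train only so nothing leaks into validate, and re-fit the same models. The median strategy is an assumption; any sensible fill works:

```python
from sklearn.impute import SimpleImputer

# Prepare (revised): keep rows with nulls, split first, then impute
# using only statistics computed from train
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

# Re-run the models and compare validate RMSE against the earlier passes
for model in [LinearRegression(), LassoLars(alpha=1)]:
    model.fit(X_train_imputed, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val_imputed)) ** .5
    print(type(model).__name__, 'imputed rmse:', rmse)
```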

Fourth pass:

  • let's do more exploration: let's visualize more variable interactions (sketch below)
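A couple of quick ways to look at variable interactions on the train split. Seaborn is an assumption here; any plotting library does the job:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatterplots of every numeric variable against every other
sns.pairplot(train, corner=True)
plt.show()

# A correlation heatmap to spot interactions worth a closer look
sns.heatmap(train.select_dtypes('number').corr(), cmap='coolwarm', annot=True)
plt.show()
```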

....

Eventually

  1. Acquire: a fancier SQL query with joins and the like
  2. Prepare: handle nulls, handle outliers, scale data, split data
  3. Explore: multiple visualizations of independent variable interactions and of drivers of the target, plus statistical tests
  4. Model: try out multiple model types with different hyperparameters (see the sketch below)
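Putting a few of those pieces together, a hedged sketch of the explore and model steps. The column names, the choice of test, and the model list are all illustrative, not prescriptive:

```python
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import TweedieRegressor

# Explore: back up a visual with a statistical test, e.g. is square
# footage linearly related to the target? (hypothetical column names)
r, p = stats.pearsonr(train.sqft, train.tax_value)
print(f'r = {r:.3f}, p = {p:.3f}')

# Model: several model types and hyperparameters, still judged by RMSE on validate
models = [
    LinearRegression(),
    LassoLars(alpha=.1),
    LassoLars(alpha=1),
    TweedieRegressor(power=1, alpha=0),
    RandomForestRegressor(max_depth=4, random_state=123),
]
for model in models:
    model.fit(X_train_scaled, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val_scaled)) ** .5
    print(type(model).__name__, rmse)
```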