Data Science Pipeline

When working on a project, we probably go through the pipeline multiple times.

First pass: our MVP

  1. Acquire: whatever SQL query gives us workable data
  2. Prepare: drop nulls, split the data
  3. Explore: visualize the target against the independent variables
  4. Model: baseline, LinearRegression, LassoLars; compare performance with RMSE on validate

NB. We're not worried about scaling or automated feature engineering yet. A rough sketch of this first pass is below.
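A minimal sketch of what this first pass might look like. The connection string and column names (e.g. `tax_value`, `sqft`) are hypothetical placeholders, not part of any particular project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.metrics import mean_squared_error

# Acquire: whatever query gives us workable rows (connection string is a placeholder)
df = pd.read_sql('SELECT * FROM properties', 'mysql+pymysql://user:password@host/db')

# Prepare: drop nulls, then split into train / validate / test (60 / 20 / 20)
df = df.dropna()
train, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train, test_size=.25, random_state=123)

# Explore: the target against each independent variable, e.g.
# train.plot.scatter(x='sqft', y='tax_value')

# Model: baseline vs. two simple regressors, judged by RMSE on validate
X_train, y_train = train.drop(columns='tax_value'), train.tax_value
X_val, y_val = validate.drop(columns='tax_value'), validate.tax_value

baseline_rmse = mean_squared_error(y_val, [y_train.mean()] * len(y_val)) ** .5
print('baseline', baseline_rmse)

for model in [LinearRegression(), LassoLars(alpha=1)]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** .5
    print(type(model).__name__, rmse)
```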

Second pass:

  • focus on modeling: let's scale our data in prep and then try out the models on the scaled data (sketch below)
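Building on the variables from the first-pass sketch above, scaling might slot into prepare like this. MinMaxScaler is just one reasonable choice, not necessarily the one intended here:

```python
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on train only, then transform train and validate the same way
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Same models as before, now on the scaled features
for model in [LinearRegression(), LassoLars(alpha=1)]:
    model.fit(X_train_scaled, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val_scaled)) ** .5
    print(type(model).__name__, 'scaled rmse:', rmse)
```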

Third pass:

  • let's look at null values more closely and impute instead of dropping
  • re-run the models to see if performance changes (sketch below)
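One way this could look, again building on the earlier sketch: replace the dropna in prepare with an imputer, fit it on train only so nothing leaks into validate, and re-fit the same models. The median strategy is an assumption; any sensible fill works:

```python
from sklearn.impute import SimpleImputer

# Prepare (revised): keep rows with nulls, split first, then impute
# using only statistics computed from train
imputer = SimpleImputer(strategy='median')
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

# Re-run the models and compare validate RMSE against the earlier passes
for model in [LinearRegression(), LassoLars(alpha=1)]:
    model.fit(X_train_imputed, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val_imputed)) ** .5
    print(type(model).__name__, 'imputed rmse:', rmse)
```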

Fourth pass:

  • let's do more exploration: let's visualize more variable interactions (sketch below)
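A couple of quick ways to look at variable interactions on the train split. Seaborn is an assumption here; any plotting library does the job:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatterplots of every numeric variable against every other
sns.pairplot(train, corner=True)
plt.show()

# A correlation heatmap to spot interactions worth a closer look
sns.heatmap(train.select_dtypes('number').corr(), cmap='coolwarm', annot=True)
plt.show()
```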

....

Eventually

  1. Acquire: a fancier SQL query with joins and the like
  2. Prepare: handle nulls, handle outliers, scale data, split data
  3. Explore: multiple visualizations of independent variable interactions and of drivers of the target, plus statistical tests
  4. Model: try out multiple model types with different hyperparameters (see the sketch below)
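Putting a few of those pieces together, a hedged sketch of the explore and model steps. The column names, the choice of test, and the model list are all illustrative, not prescriptive:

```python
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import TweedieRegressor

# Explore: back up a visual with a statistical test, e.g. is square
# footage linearly related to the target? (hypothetical column names)
r, p = stats.pearsonr(train.sqft, train.tax_value)
print(f'r = {r:.3f}, p = {p:.3f}')

# Model: several model types and hyperparameters, still judged by RMSE on validate
models = [
    LinearRegression(),
    LassoLars(alpha=.1),
    LassoLars(alpha=1),
    TweedieRegressor(power=1, alpha=0),
    RandomForestRegressor(max_depth=4, random_state=123),
]
for model in models:
    model.fit(X_train_scaled, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val_scaled)) ** .5
    print(type(model).__name__, rmse)
```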