When working on a project, we will probably go through the pipeline multiple times, refining one stage on each pass.
First pass (our MVP):
- Acquire: whatever SQL query gives us workable data
- Prepare: drop nulls, data split
- Explore: visualize the target against independent variables
- Model: baseline, LinearRegression, LassoLars; compare performance with RMSE on validate
NB: we're not worried about scaling or automated feature engineering yet
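
A minimal sketch of what that first pass might look like in code. The data here is synthetic and the column names (sqft, bedrooms, price) are made up for illustration; in a real project the DataFrame would come from the SQL query. The baseline is assumed to be "always predict the mean of train".

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Acquire: synthetic stand-in for the query result (hypothetical columns).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "sqft": rng.uniform(500, 3500, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
})
df["price"] = 150 * df["sqft"] + 10_000 * df["bedrooms"] + rng.normal(0, 20_000, n)
df.loc[rng.choice(n, 25, replace=False), "bedrooms"] = np.nan  # inject some nulls

# Prepare: drop nulls, then split into train / validate / test.
df = df.dropna()
train_validate, test = train_test_split(df, test_size=0.2, random_state=42)
train, validate = train_test_split(train_validate, test_size=0.25, random_state=42)

features, target = ["sqft", "bedrooms"], "price"

def rmse_on_validate(model):
    """Root mean squared error of a fitted model on the validate split."""
    preds = model.predict(validate[features])
    return float(np.sqrt(mean_squared_error(validate[target], preds)))

# Model: baseline plus two linear models, all fit on train only.
models = {
    "baseline": DummyRegressor(strategy="mean"),
    "linear_regression": LinearRegression(),
    "lasso_lars": LassoLars(alpha=1.0),
}
results = {}
for name, model in models.items():
    model.fit(train[features], train[target])
    results[name] = rmse_on_validate(model)

print(results)
```

The test split is set aside and left untouched here; only validate is used to compare models.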
Second pass:
- focus on modeling: let's scale our data in prep, then try out models on the scaled data
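
One way the scaling step could be sketched, assuming a min-max scaler (the notes do not name a specific scaler) and synthetic arrays in place of the real splits. The scaler is fit on train only and then applied to validate, so nothing about validate leaks into prep.

```python
import numpy as np
from sklearn.linear_model import LassoLars
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the train and validate splits (two features on very different scales).
rng = np.random.default_rng(0)
X_train = rng.uniform([500, 1], [3500, 5], size=(200, 2))
X_validate = rng.uniform([500, 1], [3500, 5], size=(50, 2))
y_train = 150 * X_train[:, 0] + 10_000 * X_train[:, 1] + rng.normal(0, 20_000, 200)
y_validate = 150 * X_validate[:, 0] + 10_000 * X_validate[:, 1] + rng.normal(0, 20_000, 50)

# Prepare: learn the min and max from train only, then transform both splits.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)

# Model: fit on the scaled train data, evaluate on the scaled validate data.
model = LassoLars(alpha=1.0).fit(X_train_scaled, y_train)
preds = model.predict(X_validate_scaled)
rmse = float(np.sqrt(mean_squared_error(y_validate, preds)))
print(rmse)
```

Because the scaler only saw train, scaled validate values can fall slightly outside the 0 to 1 range; that is expected.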
Third pass:
- let's look at null values more, impute instead of drop
- rerun models to see if performance changes
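
A small sketch of imputing instead of dropping, assuming median imputation (the notes do not say which strategy to use) and toy DataFrames with made-up column names. As with scaling, the imputer is fit on train and reused on validate.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy splits with missing values (hypothetical columns).
train = pd.DataFrame({
    "sqft": [1000.0, 1500.0, np.nan, 2500.0],
    "bedrooms": [2.0, np.nan, 3.0, 4.0],
})
validate = pd.DataFrame({
    "sqft": [np.nan, 1800.0],
    "bedrooms": [3.0, np.nan],
})

# Dropping nulls would throw away half of this train set.
rows_after_drop = len(train.dropna())

# Imputing keeps every row: learn the medians on train, fill both splits with them.
imputer = SimpleImputer(strategy="median").fit(train)
train_imputed = pd.DataFrame(
    imputer.transform(train), columns=train.columns, index=train.index
)
validate_imputed = pd.DataFrame(
    imputer.transform(validate), columns=validate.columns, index=validate.index
)

print(rows_after_drop, len(train_imputed))
print(validate_imputed)
```

After this step the same models can be refit on the imputed data and their validate RMSE compared with the drop-nulls version.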
Fourth pass:
- let's do more exploration and visualize more variable interactions
....
Eventually:
- Acquire: a more elaborate SQL query (joins, etc.)
- Prepare: handle nulls, handle outliers, scale data, split data
- Explore: multiple visualizations of independent variable interactions and drivers of the target, plus statistical tests
- Model: try out multiple model types with different hyperparameters
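
As a rough illustration of the final modeling stage, the sketch below loops over a couple of model types and a few hyperparameter values and ranks them by RMSE on validate. The data is synthetic, and the alpha values are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the first two of three features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 300)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Candidate models: one plain linear regression, plus LassoLars at several alphas.
candidates = {"linear_regression": LinearRegression()}
for alpha in (0.01, 0.1, 1.0):
    candidates[f"lasso_lars_alpha_{alpha}"] = LassoLars(alpha=alpha)

# Fit each candidate on train and score it on validate.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_validate)
    results[name] = float(np.sqrt(mean_squared_error(y_validate, preds)))

best_name = min(results, key=results.get)
print(best_name, results[best_name])
```

Only the winning model would then be evaluated once on the held-out test split.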