- Explore and clean the data.
  - Clean continuous features.
  - Clean categorical features.
- Split the data three ways.
  - Training - 60%
  - Validation - 20%
  - Testing - 20%
- Fit an initial model and evaluate it using five-fold cross-validation on the training data.
  - A Random Forest classifier, an ensemble of decision trees, is used here.
- Tune hyperparameters.
  - For a Random Forest, tune the number of estimators (decision trees) and the maximum depth.
  - A grid search can be used.
- Evaluate some models using the validation set.
  - Look at performance metrics beyond just accuracy.
  - Take the three best models from cross-validation on the training set.
  - Predict on the validation set and look for the best accuracy, precision, and recall based on the context of the data.
- Finally, select the best model and evaluate it on the test set.
  - Does performance on the test set match what the validation set showed for the model we picked?
- Deploy the best model.
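
The steps above could be sketched with scikit-learn roughly as follows. This is a minimal illustration, not a definitive implementation: the synthetic dataset from `make_classification` stands in for data that has already been cleaned, the hyperparameter grid values are arbitrary, and the helper `score` and the choice of validation accuracy as the tie-breaking metric are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Assumption: synthetic binary-classification data stands in for the cleaned dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 60/20/20 split: hold out 40%, then halve the holdout into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Initial model, evaluated with five-fold cross-validation on the training data only.
baseline = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5)
print(f"baseline CV accuracy: {cv_scores.mean():.3f}")

# Grid search over the number of trees and the maximum depth (grid values are illustrative).
param_grid = {"n_estimators": [5, 50, 100], "max_depth": [2, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# Pick the three best parameter settings by cross-validation rank.
ranks = search.cv_results_["rank_test_score"]
params = search.cv_results_["params"]
top3 = sorted(range(len(params)), key=lambda i: ranks[i])[:3]


def score(model, X_eval, y_eval):
    """Hypothetical helper: accuracy, precision, and recall on one dataset."""
    pred = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, pred),
        "precision": precision_score(y_eval, pred),
        "recall": recall_score(y_eval, pred),
    }


# Refit each candidate on the training set and compare on the validation set.
candidates = []
for i in top3:
    model = RandomForestClassifier(random_state=42, **params[i]).fit(X_train, y_train)
    metrics = score(model, X_val, y_val)
    candidates.append((model, metrics))
    print(params[i], metrics)

# Assumption: accuracy decides here; in practice the metric depends on the problem context.
best_model, best_val = max(candidates, key=lambda c: c[1]["accuracy"])

# Touch the test set once, at the very end, and compare against validation.
test_metrics = score(best_model, X_test, y_test)
print("validation:", best_val)
print("test:      ", test_metrics)
```

If the test metrics fall well below the validation metrics, that suggests the selected model was over-tuned to the validation set and the choice is worth revisiting before deployment.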
Last active
January 18, 2020 16:44
End-to-End Machine Learning Pipeline