@evanmcclure
Last active January 18, 2020 16:44
End-to-End Machine Learning Pipeline

Outline of the process for developing an ML project.

  • Explore and clean the data.
    • Clean continuous features.
    • Clean categorical features.
  • Split the data three ways.
    • Training - 60%
    • Validation - 20%
    • Testing - 20%
  • Fit an initial model and evaluate it using fivefold cross-validation on the training data.
    • A Random Forest classifier, an ensemble of decision trees, is used here.
  • Tune hyperparameters.
    • For a Random Forest, tune the number of estimators (decision trees) and the max depth.
    • A grid search can be used.
  • Evaluate some models using the validation set.
    • Look at performance metrics beyond just accuracy.
    • Carry forward the three best models from cross-validation on the training set.
    • Predict on the validation set and look for the best accuracy, precision, and recall based on the context of the data.
  • Finally, select the best model and evaluate it on the test set.
    • Does performance on the test set match what we saw on the validation set for the model we picked?
    • Deploy the best model.
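
The outline above can be sketched end to end in a few lines. This is a minimal illustration, not the author's implementation: it assumes scikit-learn (the outline names no library), uses a synthetic dataset from `make_classification` as a stand-in for cleaned data, and picks an arbitrary small hyperparameter grid. The helper `evaluate` and the choice to rank candidates by validation accuracy are also assumptions; in practice the deciding metric depends on the context of the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Assumption: synthetic binary-classification data stands in for the cleaned dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Split 60/20/20: hold out 40%, then halve the holdout into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

# Initial model: fivefold cross-validation on the training data only.
baseline_scores = cross_val_score(
    RandomForestClassifier(random_state=42), X_train, y_train, cv=5
)
print("baseline CV accuracy:", baseline_scores.mean())

# Grid search over the number of estimators and the max depth.
# Assumption: these grid values are illustrative, not recommendations.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 8, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

# Keep the three best parameter settings by cross-validation rank.
ranks = search.cv_results_["rank_test_score"]
top_indices = sorted(range(len(ranks)), key=lambda i: ranks[i])[:3]
top_params = [search.cv_results_["params"][i] for i in top_indices]


def evaluate(model, features, labels):
    """Hypothetical helper: accuracy, precision, and recall for a fitted model."""
    predictions = model.predict(features)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision_score(labels, predictions),
        "recall": recall_score(labels, predictions),
    }


# Refit each candidate on the training set and score it on the validation set.
candidates = []
for params in top_params:
    model = RandomForestClassifier(random_state=42, **params).fit(X_train, y_train)
    candidates.append((model, evaluate(model, X_val, y_val)))

# Assumption: choose by validation accuracy; swap in precision or recall as needed.
best_model, val_metrics = max(candidates, key=lambda c: c[1]["accuracy"])

# Touch the test set once, only for the selected model.
test_metrics = evaluate(best_model, X_test, y_test)
print("validation:", val_metrics)
print("test:", test_metrics)
```

If the test metrics fall well below the validation metrics, that suggests the selection step overfit to the validation set, which is the question the last step of the outline asks.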