evanmcclure/ML_outline.md

## ML_outline.md

      
End-to-End Machine Learning Pipeline

Outline of the process for developing an ML project.


Explore and clean the data.

Clean continuous features.
Clean categorical features.
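
The cleaning step above can be sketched with pandas. This is a minimal illustration, not a prescribed recipe: the toy data, the column names, the `clean_features` helper, median imputation, and one-hot encoding are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "embarked": ["S", "C", None, "S"],
})

def clean_features(frame, continuous_cols, categorical_cols):
    """Impute missing values, then one-hot encode the categorical columns."""
    out = frame.copy()
    # Continuous features: fill gaps with the column median (one common choice).
    for col in continuous_cols:
        out[col] = out[col].fillna(out[col].median())
    # Categorical features: give missing values their own explicit level.
    for col in categorical_cols:
        out[col] = out[col].fillna("missing")
    # Turn each categorical level into its own indicator column.
    return pd.get_dummies(out, columns=categorical_cols)

cleaned = clean_features(df, ["age", "fare"], ["embarked"])
print(cleaned)
```

Which imputation and encoding to use depends on the data; the median and a separate "missing" level are only reasonable defaults.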


Split the data three ways.

Training - 60%
Validation - 20%
Testing - 20%
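
One way to get the 60/20/20 split is to call scikit-learn's `train_test_split` twice: first hold out 40% of the data, then cut that holdout in half. The synthetic `X` and `y` below are placeholders, and the random seeds are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 4 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Step 1: keep 60% for training and hold out the remaining 40%.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42
)
# Step 2: split the holdout evenly into validation (20%) and test (20%).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

For imbalanced classes, passing `stratify=y` (and `stratify=y_rest` in the second call) keeps the class proportions similar across the three sets.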


Fit an initial model and evaluate it using fivefold cross-validation on the training data.

A Random Forest classifier, an ensemble of decision trees, is used here.
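
A sketch of the baseline step, assuming scikit-learn: `cross_val_score` with `cv=5` fits the model five times, each time scoring on a different held-out fifth of the training data. The data comes from `make_classification` purely as a stand-in for a real training set, and `n_estimators=50` is an arbitrary starting value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the real training set.
X_train, y_train = make_classification(
    n_samples=200, n_features=8, random_state=0
)

# Baseline Random Forest; every other hyperparameter is left at its default.
rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Fivefold cross-validation: one accuracy score per fold.
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="accuracy")
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```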


Tune hyperparameters.

For a Random Forest, tune the number of estimators (decision trees) and the max depth.
A grid search can be used.
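
A grid search over these two hyperparameters could look like the following with scikit-learn's `GridSearchCV`. The grid values and the synthetic data are assumptions for illustration; a real grid would be chosen to suit the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the real training set.
X_train, y_train = make_classification(
    n_samples=200, n_features=8, random_state=0
)

# Illustrative grid: 3 x 3 = 9 combinations, each scored with fivefold CV.
param_grid = {
    "n_estimators": [5, 25, 50],
    "max_depth": [2, 8, None],  # None lets each tree grow until its leaves are pure
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```

`search.cv_results_` holds the mean score for every combination, which is useful for picking several strong candidates rather than a single winner.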


Evaluate some models using the validation set.

Look at performance metrics beyond just accuracy.
Carry forward the three best models from cross-validation on the training set.
Predict on the validation set and look for the best accuracy, precision, and recall based on the context of the data.
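
The validation comparison might be sketched as below. The three candidate configurations and their names are hypothetical stand-ins for whatever the tuning step actually selected, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data, split into training and validation sets.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical top-three configurations from the tuning step.
candidates = {
    "rf_5_trees_depth_2": RandomForestClassifier(
        n_estimators=5, max_depth=2, random_state=0
    ),
    "rf_25_trees_depth_8": RandomForestClassifier(
        n_estimators=25, max_depth=8, random_state=0
    ),
    "rf_50_trees_no_limit": RandomForestClassifier(
        n_estimators=50, max_depth=None, random_state=0
    ),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)     # fit on the training set only
    pred = model.predict(X_val)     # predict on the unseen validation set
    results[name] = {
        "accuracy": accuracy_score(y_val, pred),
        "precision": precision_score(y_val, pred, zero_division=0),
        "recall": recall_score(y_val, pred, zero_division=0),
    }

for name, metrics in results.items():
    print(name, metrics)
```

Which metric matters most depends on the context: precision when false positives are costly, recall when false negatives are.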


Finally, select the best model and evaluate it on the test set.

Does performance on the test set match what the validation set showed for the model we picked?
Deploy the best model.
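
A minimal sketch of the final check, assuming the chosen model is a Random Forest with the hypothetical settings shown: score it on both the validation and test sets and compare the two. A large gap suggests the model was tuned too closely to the validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data with a 60/20/20 split.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Hypothetical "best" configuration from the earlier steps.
best_model = RandomForestClassifier(n_estimators=50, max_depth=8, random_state=0)
best_model.fit(X_train, y_train)

def report(model, features, labels):
    """Return accuracy, precision, and recall for one data split."""
    pred = model.predict(features)
    return {
        "accuracy": accuracy_score(labels, pred),
        "precision": precision_score(labels, pred, zero_division=0),
        "recall": recall_score(labels, pred, zero_division=0),
    }

val_report = report(best_model, X_val, y_val)
test_report = report(best_model, X_test, y_test)

# The test set is touched exactly once, after the model has been chosen.
for metric in val_report:
    gap = val_report[metric] - test_report[metric]
    print(f"{metric}: val={val_report[metric]:.3f} "
          f"test={test_report[metric]:.3f} gap={gap:+.3f}")
```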