5 things you need to know before starting your machine learning project

and why Jupyter Notebook is not a good tool to do it.

At Sicara, we build machine learning-based products for our customers:

  • we build products: we need to develop with a production-ready mindset. Algorithms are deployed in the cloud, served and updated through APIs, etc.
  • machine learning products: the customer comes with a business need and we have to deliver a satisfying solution as fast as possible.

From this experience we have derived a few standards your projects should meet to be successful.

Define your experiment

When you think about your problem, what are the parameters you want to play with? Do you have several data sources? Do you intend to add pre-processing (resizing, color transformation, etc.)? What will your model be? What will your training process be? Is there any post-processing to apply to the output of the model? And so on.

Think about your problem as a flow: my raw data in the file system > data organized into train set, validation set, test set > preprocessing applied to each image > data generator for the train and predict methods > model train > model predict > post-processing > metrics.
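A minimal sketch of what this flow can look like in code, assuming hypothetical helper blocks (load_split, preprocess, make_generator, build_model, postprocess, compute_metrics) that you would write yourself:

```python
# Minimal sketch of the flow. Every helper used here (load_split, preprocess,
# make_generator, build_model, postprocess, compute_metrics) is a hypothetical
# placeholder for one of your own blocks, not an existing library call.
def run_experiment(params):
    # raw data in the file system -> train / validation / test split
    train_dir, val_dir, test_dir = load_split(params["data_dir"])

    # preprocessing + data generators for the train and predict methods
    train_gen = make_generator(train_dir, preprocess, params["batch_size"])
    val_gen = make_generator(val_dir, preprocess, params["batch_size"])
    test_gen = make_generator(test_dir, preprocess, params["batch_size"])

    # model train -> model predict -> post-processing -> metrics
    model = build_model(params["model"])
    model.fit(train_gen, validation_data=val_gen, epochs=params["epochs"])
    predictions = postprocess(model.predict(test_gen))
    return compute_metrics(predictions, test_dir)
```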

Never refactor

When you are satisfied with your flow, build code that reflects (and respects) it. All these parts are separate bricks, don't mix them!

Be aware that you are likely to change any one of them at any time (why not add data augmentation?). When you do, you don't want to have to rebuild all your scripts or change the signature of tons of methods. This means that all your blocks should have well-defined interfaces. You should be able to build your flow like a playbook, using any of the blocks you have already developed.

This will help you focus on data science instead of code: looking at your results, you will want to change some parameters and will be able to develop the required code in an isolated, easily tested way.
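For example, a block interface can be as small as a callable that takes an image and returns an image; the classes below are a hypothetical sketch of that idea, not code from the article:

```python
# Sketch of one block family behind a small, stable interface, so a new
# preprocessing or augmentation step can be swapped in without touching the
# rest of the flow. All class names here are hypothetical.
from abc import ABC, abstractmethod

import numpy as np


class Preprocessor(ABC):
    @abstractmethod
    def __call__(self, image: np.ndarray) -> np.ndarray:
        ...


class Rescale(Preprocessor):
    def __call__(self, image: np.ndarray) -> np.ndarray:
        return image / 255.0


class HorizontalFlip(Preprocessor):
    """A data-augmentation block added later, behind the same interface."""

    def __call__(self, image: np.ndarray) -> np.ndarray:
        return image[:, ::-1]


def compose(*steps: Preprocessor):
    """Chain any blocks implementing the interface into one callable."""
    def apply(image: np.ndarray) -> np.ndarray:
        for step in steps:
            image = step(image)
        return image

    return apply
```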

Stick with one single library

As much as you can. Using several overlapping libraries is like rolling out the red carpet for bugs. They most likely do not share the same default parameters or conventions. Either you know them all by heart at expert level, or you end up swapping color channels between Pillow and cv2 without noticing it.

Most of all, your machine learning library probably already ships with image processing tools of its own.
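For instance, Keras already handles rescaling, resizing, and basic augmentation by itself; a minimal sketch, where the directory layout and parameter values are placeholders:

```python
# Minimal sketch: using the ML library's own image tools (here Keras'
# ImageDataGenerator) instead of mixing in Pillow or cv2. The folder path and
# parameters are placeholders.
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # pixel scaling handled by the library
    horizontal_flip=True,    # data augmentation handled by the library
)
train_generator = train_datagen.flow_from_directory(
    "data/train",            # hypothetical folder with one subfolder per class
    target_size=(224, 224),  # resizing handled by the library
    batch_size=32,
)
```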

Every experiment should be reproducible

Reproducibility is the key to improvement. Without fully guaranteed reproducibility of your experiments, you will never know whether your tiny little change made a difference or not. You will not be able to compare, and you will end up redoing things just to be sure.

Reproducibility means that any single experiment you have run so far can be run again in the exact same conditions and give the exact same outputs. It is like versioning your experiments.

A good way to achieve that is to containerize your code. Using docker tags, for instance, you will be able to re-run exactly what happened.
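One lightweight way to tie results back to a docker tag, sketched below as an assumption rather than a prescribed convention, is to write the image tag and the experiment parameters next to the outputs of each run:

```python
# Hypothetical sketch: persist the docker image tag and the parameters next
# to each run's outputs so any result can be traced back to an exact image.
import json
import os
from datetime import datetime


def save_run_metadata(params: dict, output_dir: str) -> None:
    metadata = {
        # assumed to be injected at `docker run` time, e.g. -e DOCKER_IMAGE_TAG=...
        "docker_image": os.environ.get("DOCKER_IMAGE_TAG", "unknown"),
        "started_at": datetime.utcnow().isoformat(),
        "params": params,
    }
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "run_metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2)
```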

Another advantage of containerizing your code is that you cannot make inline changes to it. We all know it is bad practice, yet we do it when it is possible: run, hit a bug, debug, re-run from the failure point, end up with a non-reproducible experiment. Evil.

Run experiments in one click

As a researcher you want to get the results of your ideas right away. Ideally you would try tons of experiments to figure out what is going on with your problem. Unfortunately, each one of these experiments has a development cost. You can minimize it.

Currently the situation is as if you were a chemist without a laboratory technician: you spend your days preparing mixtures instead of analyzing their reactions.

With containerized code you can already do docker run. Because you have defined independent blocks, you should be able to mix and match them with a single parameter file (JSON, YAML, etc.). This means that the single entry point for your code is a parameters file that is as expressive as possible. It is just a plain description of your experiment. From that single file you should be able to use any part of the code you have built so far.
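A sketch of such an entry point, assuming a hypothetical my_project.registry module that exposes your blocks and an illustrative JSON schema:

```python
# Sketch of a parameters file as the single entry point. The JSON keys and the
# registry module are assumptions for illustration, not a fixed schema.
import json
import sys

from my_project.registry import MODELS, PREPROCESSORS  # hypothetical module exposing your blocks


def main(params_path: str) -> None:
    with open(params_path) as f:
        # e.g. {"model": {"name": "resnet50"}, "preprocessing": "rescale", "epochs": 10}
        params = json.load(f)

    preprocess = PREPROCESSORS[params["preprocessing"]]
    model = MODELS[params["model"]["name"]]()
    # The rest of the flow (generators, training, post-processing, metrics)
    # reads everything else it needs from the same `params` dictionary.


if __name__ == "__main__":
    main(sys.argv[1])
```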

Furthermore, you will be able to run several experiments in parallel without any possible confusion. Using cloud tools like Amazon ECR and Amazon SageMaker, you will be able to spin up an instance for each of your experiments and retrieve the results in a dedicated folder with a single API call.
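As a hedged illustration with the SageMaker Python SDK (v2 parameter names; the image URI, IAM role, S3 paths, and instance type below are placeholders):

```python
# Hypothetical sketch using the SageMaker Python SDK (v2 parameter names).
# The ECR image URI, IAM role, S3 paths and instance type are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-experiment:exp-42",  # the ECR tag of this experiment
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/experiments/exp-42/",  # results land in a dedicated folder
)
estimator.fit("s3://my-bucket/datasets/my-dataset/")
```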

TODO: repo example?
