5 things you need to know before starting your machine learning project

and why Jupyter Notebook is not a good tool to do it.

At Sicara, we build machine learning-based products for our customers:

  • we build products: we need to develop with a production-ready mindset. Algorithms are deployed in the cloud, served and updated through APIs, etc.
  • machine learning products: the customer comes with a business need and we have to deliver a satisfying solution as fast as possible.

From this experience we have derived a few standards your projects should meet to be successful.

Define your experiment

When you think about your problem, what are the parameters you want to play with? Do you have several data sources? Do you intend to add pre-processing (resizing, color transformation, etc.)? What will your model be? What will your training process be? Is there any post-processing to apply to the output of the model? And so on.

Think about your problem as a flow: my raw data in the file system > data organized into train set, validation set, test set > preprocessing applied to each image > data generator for the train and predict methods > model train > model predict > post-processing > metrics.
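A minimal sketch of what this flow can look like in code, assuming hypothetical helper blocks (load_split, preprocess, make_generator, build_model, postprocess, compute_metrics) that you would write yourself:

```python
# Minimal sketch of the flow. Every helper used here (load_split, preprocess,
# make_generator, build_model, postprocess, compute_metrics) is a hypothetical
# placeholder for one of your own blocks, not an existing library call.
def run_experiment(params):
    # raw data in the file system -> train / validation / test split
    train_dir, val_dir, test_dir = load_split(params["data_dir"])

    # preprocessing + data generators for the train and predict methods
    train_gen = make_generator(train_dir, preprocess, params["batch_size"])
    val_gen = make_generator(val_dir, preprocess, params["batch_size"])
    test_gen = make_generator(test_dir, preprocess, params["batch_size"])

    # model train -> model predict -> post-processing -> metrics
    model = build_model(params["model"])
    model.fit(train_gen, validation_data=val_gen, epochs=params["epochs"])
    predictions = postprocess(model.predict(test_gen))
    return compute_metrics(predictions, test_dir)
```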

Never refactor

When you are satisfied with your flow, build code that reflects (and respects) it. All these parts are separate bricks, don't mix them!

Be aware that you are likely to change any one of them at any time (why not add data augmentation?). When you do, you don't want to have to rebuild all your scripts or change the signature of tons of methods. This means that all your blocks should have well-defined interfaces. You should be able to build your flow like a playbook, using any of the blocks you have already developed.

This will help you focus on data science instead of code: looking at your results, you will want to change some parameters and will be able to develop the required code in an isolated, easily tested way.
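For example, a block interface can be as small as a callable that takes an image and returns an image; the classes below are a hypothetical sketch of that idea, not code from the article:

```python
# Sketch of one block family behind a small, stable interface, so a new
# preprocessing or augmentation step can be swapped in without touching the
# rest of the flow. All class names here are hypothetical.
from abc import ABC, abstractmethod

import numpy as np


class Preprocessor(ABC):
    @abstractmethod
    def __call__(self, image: np.ndarray) -> np.ndarray:
        ...


class Rescale(Preprocessor):
    def __call__(self, image: np.ndarray) -> np.ndarray:
        return image / 255.0


class HorizontalFlip(Preprocessor):
    """A data-augmentation block added later, behind the same interface."""

    def __call__(self, image: np.ndarray) -> np.ndarray:
        return image[:, ::-1]


def compose(*steps: Preprocessor):
    """Chain any blocks implementing the interface into one callable."""
    def apply(image: np.ndarray) -> np.ndarray:
        for step in steps:
            image = step(image)
        return image

    return apply
```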

Stick with one single library

As much as you can. Using several overlapping libraries is like rolling out the red carpet for bugs. They most likely do not share the same default parameters or conventions. Either you know them all by heart at expert level, or you end up swapping color channels between Pillow and cv2 without noticing it.

Most of all, your machine learning library probably already ships with image processing tools of its own.
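For instance, Keras already handles rescaling, resizing, and basic augmentation by itself; a minimal sketch, where the directory layout and parameter values are placeholders:

```python
# Minimal sketch: using the ML library's own image tools (here Keras'
# ImageDataGenerator) instead of mixing in Pillow or cv2. The folder path and
# parameters are placeholders.
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # pixel scaling handled by the library
    horizontal_flip=True,    # data augmentation handled by the library
)
train_generator = train_datagen.flow_from_directory(
    "data/train",            # hypothetical folder with one subfolder per class
    target_size=(224, 224),  # resizing handled by the library
    batch_size=32,
)
```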

Every experiment should be reproducible

Reproducibility is the key to improvement. Without fully guaranteed reproducibility of your experiments, you will never know whether your tiny little change made a difference or not. You will not be able to compare, and you will end up redoing things just to be sure.

Reproducibility means that any single experiment you have run so far can be run again in the exact same conditions and give the exact same outputs. It is like versioning your experiments.

A good way to achieve that is to containerize your code. Using docker tags, for instance, you will be able to re-run exactly what happened.
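One lightweight way to tie results back to a docker tag, sketched below as an assumption rather than a prescribed convention, is to write the image tag and the experiment parameters next to the outputs of each run:

```python
# Hypothetical sketch: persist the docker image tag and the parameters next
# to each run's outputs so any result can be traced back to an exact image.
import json
import os
from datetime import datetime


def save_run_metadata(params: dict, output_dir: str) -> None:
    metadata = {
        # assumed to be injected at `docker run` time, e.g. -e DOCKER_IMAGE_TAG=...
        "docker_image": os.environ.get("DOCKER_IMAGE_TAG", "unknown"),
        "started_at": datetime.utcnow().isoformat(),
        "params": params,
    }
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "run_metadata.json"), "w") as f:
        json.dump(metadata, f, indent=2)
```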

Another advantage of containerizing your code is that you cannot make inline changes to it. We all know it is bad practice, yet we do it when it is possible: run, hit a bug, debug, re-run from the failure point, end up with a non-reproducible experiment. Evil.

Run experiments in one click

As a researcher you want to get the results of your ideas right away. Ideally you would try tons of experiments to figure out what is going on with your problem. Unfortunately, each one of these experiments has a development cost. You can minimize it.

Currently the situation is as if you were a chemist without a laboratory technician: you spend your days preparing mixtures instead of analyzing their reactions.

With containerized code you can already do docker run. Because you have defined independent blocks, you should be able to mix and match them with a single parameter file (JSON, YAML, etc.). This means that the single entry point for your code is a parameters file that is as expressive as possible. It is just a plain description of your experiment. From that single file you should be able to use any part of the code you have built so far.
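A sketch of such an entry point, assuming a hypothetical my_project.registry module that exposes your blocks and an illustrative JSON schema:

```python
# Sketch of a parameters file as the single entry point. The JSON keys and the
# registry module are assumptions for illustration, not a fixed schema.
import json
import sys

from my_project.registry import MODELS, PREPROCESSORS  # hypothetical module exposing your blocks


def main(params_path: str) -> None:
    with open(params_path) as f:
        # e.g. {"model": {"name": "resnet50"}, "preprocessing": "rescale", "epochs": 10}
        params = json.load(f)

    preprocess = PREPROCESSORS[params["preprocessing"]]
    model = MODELS[params["model"]["name"]]()
    # The rest of the flow (generators, training, post-processing, metrics)
    # reads everything else it needs from the same `params` dictionary.


if __name__ == "__main__":
    main(sys.argv[1])
```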

Furthermore, you will be able to run several experiments in parallel without any possible confusion. Using cloud tools like Amazon ECR and Amazon SageMaker, you will be able to spin up an instance for each of your experiments and retrieve the results in a dedicated folder with a single API call.
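As a hedged illustration with the SageMaker Python SDK (v2 parameter names; the image URI, IAM role, S3 paths, and instance type below are placeholders):

```python
# Hypothetical sketch using the SageMaker Python SDK (v2 parameter names).
# The ECR image URI, IAM role, S3 paths and instance type are placeholders.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-experiment:exp-42",  # the ECR tag of this experiment
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/experiments/exp-42/",  # results land in a dedicated folder
)
estimator.fit("s3://my-bucket/datasets/my-dataset/")
```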

TODO: repo example?
