Perfect notebook with PyCharm and Pweave

The perfect notebook is not a Jupyter one

The Jupyter notebook is widely presented as the perfect prototyping tool for data scientists. Its main features are:

  • inline code execution
  • easy idea structuring
  • nice display of pictures and dataframes

This overall flexibility has made it a preferred tool over the more rustic IPython command line. However, it should not be forgotten that it is no more than a REPL in which you can navigate efficiently through the history.

Tons of machine learning developers have experienced the deep pain of refactoring a deep learning notebook into a real production algorithm.

Applying the lean framework, we should strive to reduce waste as much as possible, and refactoring means waste. Despite attempts to deploy Jupyter notebooks in production, the tool was never designed for that purpose.

At Sicara, we build machine learning-based products for our customers:

  • machine learning: the customer comes with a business need and we have to deliver a satisfying algorithm as fast as possible;
  • we build products: we need to develop in a production-ready mindset. Algorithms are deployed in the cloud, served and updated with APIs, etc.

In this post I want to share my best practices for going from EDA to API as fast as possible.

Notebooks, data science and reporting

First of all, you definitely need a versioning tool, which is a pain with Jupyter. Not only for your code, but also for your experiments: you need to be able to re-run any result obtained so far with 100% confidence. How often do data scientists come up with results they cannot reproduce?

Furthermore, when using notebooks, people often tend to mix three kinds of usage:

  1. development: defining methods and tools to actually do something
  2. debugging: running the piece of code with real data to see what is going on
  3. visualization: presenting the results in a clean and reproducible output.

In order to reduce waste, these steps should be clearly defined and separated, so that you can change any one of them without touching the others:

  • to produce high-quality code, I have come to the conclusion that nothing is better than a first-class IDE
  • to debug code, nothing is better than visual debugging tools
  • to write down reports, nothing is better than an expressive markup language (markdown, reST, LaTeX)

Fortunately, a well-configured IDE can do all of these things. If you come from the R community, you certainly use RStudio, which offers:

  • code completion, auto-fix, etc.
  • direct visual debugging
  • Rmarkdown/knitr/Sweave to generate dynamic and beautiful reports.

Develop production-ready code

As soon as you want to test something, i.e. write a method that does something to your data, think about its usage, edge cases, etc. Do it in a separate file; document and unit-test it. Doing so, you make sure that:

  • your method actually does what you expect
  • your code can be safely used somewhere else in your project

Because you will have to organize your tools, it makes you think about the structure of your pipeline: the things you need, what you are likely to change, etc. A sketch of this workflow is given below.
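As a minimal sketch (the file names and the drop_constant_columns helper are illustrative, not from the original post), a reusable, documented and unit-tested method could look like:

# preprocessing.py
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove columns holding a single unique value.

    Such columns carry no information for a model and can safely be
    discarded before training.
    """
    return df.loc[:, df.nunique() > 1]


# test_preprocessing.py, run with pytest
import pandas as pd
from preprocessing import drop_constant_columns


def test_drop_constant_columns_removes_single_valued_column():
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 0, 0]})
    assert list(drop_constant_columns(df).columns) == ['a']

With the method living in its own module, your notebook-like script only has to import and apply it.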

Debug and display

This is where you import your code and try it on your data. It is also where notebooks can feel very convenient because of their cell mechanism. However, it is a tool switch: why would you leave your IDE, with all your shortcuts and comfort, to run code in your web browser? What you need is inline execution of your code directly in your IDE.

A tool like PyCharm has native support for this feature: execute the selected code or script with a single keyboard shortcut. Furthermore, its console runs IPython, with a very nice Variables tool window. In scientific mode you can also display and inspect plots and dataframes/arrays within the IDE.
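As a sketch (the data file and column names are hypothetical), PyCharm splits a plain script into executable cells on #%% markers, so each block can be sent to the IPython console with the same shortcut:

#%% Load the data
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical input file

#%% Inspect it
df.describe()

#%% Plot a column
df['target'].hist()  # rendered in the Plots tool window in scientific mode

Each cell runs independently, exactly like a notebook cell, but the file stays a plain Python script.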

Other tools like VSCode or Atom have plugins (Hydrogen, for instance) providing such features as well.

Report and share

At this point you should have your tested code in some directory of your project and a plain Python file running it on your data. You have run it inline in your IDE and checked the results: they are great! Your job is almost done: you now need to report them to the team to justify migrating the algorithm to your new version.

You need to explain your logic and give a clear step-by-step demonstration to back your results. Of course you don't want to retype everything in another file; reporting is boring.

This is why tools for dynamic report generation exist. Documentation tools like Sphinx are built in that spirit: write your code and its documentation in the same file and generate a readable version from it. For your Python notebook, I recommend using Pweave and especially its pypublish command. Just add comments to your scripts and run pypublish my_script.py to generate a clear, shareable HTML file from it. Every commented line is interpreted as markdown, every cell (or code block) can be displayed or hidden, etc.
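Concretely, publication is a one-liner. Assuming your annotated script is saved as my_script.py, pypublish writes the HTML file next to it:

pypublish my_script.py    # generates my_script.html in the same directory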

For instance, this notebook:

#' # This is the title of the notebook

#+ setup, echo=False
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': list(range(4))})

#' Let us see what a plot looks like
#+ plot_df, echo=False
df.plot.bar()

#' Let us make now some visible computation
#+ echo=True
a = 1
print(a)

#' It is also possible to use variables inline: a is <%= a %>

#+ echo=True
a = 2

#' a is now <%= a %>

renders as a standalone HTML page (the scratch_1.html generated by pypublish).