How to setup a Data Science workflow with Kaggle Docker Image locally (on a laptop)

Introduction

Kaggle has a very nice online feature called Kernels, where we can create and share Python scripts and Jupyter Notebooks online, using datasets directly on the site (accessible via ../input) without needing to download them to a local machine. This is a great feature: others can simply fork your notebook, run it online, and see the results directly. It greatly speeds up learning and collaboration, while minimising excessive "fiddling" with infrastructure. Having a centralised Docker image also ensures everybody uses (pretty much) the same environment, which can be very important for remote collaboration.

Motivation

So why not just create all Jupyter Notebooks via the Kaggle Kernels online? Why bother setting up the Docker image locally on our own machine? The reasons will vary from person to person. For me, it was a necessary workaround for some existing Kaggle Notebook "features". Let me give you a few examples:

  • the double question mark (??) notebook feature for looking up documentation doesn't work in the online Kaggle Notebook. It appears to work only in a vanilla Jupyter notebook running locally. See this Stackoverflow post for more detail.
  • the Bokeh interactive JavaScript framework doesn't seem to load in the online Kaggle Notebook. It is possible that the Kaggle implementation has disabled it. See this Kaggle forum thread for details. Some personal background: I recently came across the very cool Kaggle Notebook Interactive Data Visualization - NYC Taxi Trip, which uses bokeh for some really amazing visualisations. After forking it and running the step that calls bokeh.plotting.output_notebook(), it gets stuck on "Loading BokehJS ...". Run locally on my machine, it would instead report something like "BokehJS 0.12.5 successfully loaded.". So to still run the notebook, learn how bokeh works, and contribute, the only option I could think of was to create the notebook locally, run it, generate the graphics, and then upload it to a Kaggle kernel. (A minimal bokeh check is sketched right after this list.)
  • the Kaggle Docker image may be reused for other, non-Kaggle data science projects. Once you've set up a repeatable workflow and directory structure for storing datasets, accessing them, and generating outputs, it can be quite an empowering experience. (I have yet to experiment with and validate this more, though!)
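To illustrate the bokeh point above, here is a minimal sketch (not taken from the NYC Taxi notebook itself, and the plot data is made up purely for illustration) of the kind of snippet that loads BokehJS fine in a local notebook but stalled for me online:

from bokeh.plotting import figure, output_notebook, show

# Load BokehJS into the notebook. Locally this reports something like
# "BokehJS 0.12.5 successfully loaded"; on the online Kaggle kernel this
# is the call that hung at "Loading BokehJS ..." for me.
output_notebook()

# A trivial, made-up plot just to confirm rendering works.
p = figure(title="bokeh smoke test", plot_width=400, plot_height=300)
p.line([1, 2, 3, 4], [3, 1, 4, 2])
show(p)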

To give you a visual, currently I can only create the following output via a Jupyter notebook locally, and not online:

kjupyter5.png

kjupyter6.png

The purpose of this post is to document the steps needed to reproduce a data science workflow that may be used for both Kaggle (primary) and non-Kaggle (secondary) projects. It is mainly for my own benefit should I forget how to do it again, though I really hope it may help you out too! This post is primarily inspired by the blog posts listed under the References section.

As of writing this post I have only tested this on a MacBook (OS X El Capitan 10.11.x). The instructions for Windows and Linux might differ slightly, but I would guess not by much. (If I happen to reproduce this on Windows and/or Linux I will try to write a similar post.)

Install Docker on Laptop

First of all, download the latest Docker software onto the Mac. Docker instructions here. I simply downloaded and installed the CE (Community Edition) version, since it is free. This should take around 5 minutes.
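To confirm the installation worked, you can check the versions from a terminal (the exact version numbers will of course differ):

docker --version
docker-machine --version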

Create Docker Machine (One-off)

Create a new virtual docker machine that is big enough to cope with data science projects / Kaggle Docker images. Change the parameters to suit your needs.

docker-machine create -d virtualbox --virtualbox-disk-size "50000" --virtualbox-cpu-count "4" --virtualbox-memory "8092" docker2
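To verify the machine was created (and later to check its state and IP address), you can run:

docker-machine ls
docker-machine ip docker2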

Setup bash_profile (run as often as you need)

Start the newly created docker2 virtual docker machine. (Don't worry if it's already running; the command is clever enough to detect that and let you know.)

docker-machine start docker2

Set some docker environment variables that will be used later. There is no harm in running this again and again; it simply overwrites the old environment variables with the new ones (e.g. the Docker IP address), which will very likely be the same anyway.

eval $(docker-machine env docker2)

Just to double check: if you run printenv | grep DOCKER you will see all the newly created DOCKER-related environment variables. For example:

johnny@Chuns-MBP ~ $ printenv | grep DOCKER
DOCKER_HOST=tcp://192.168.99.100:2376
DOCKER_MACHINE_NAME=docker2
DOCKER_TLS_VERIFY=1
DOCKER_CERT_PATH=/Users/johnny/.docker/machine/machines/docker2

Pull the Kaggle Docker Image you want to use (one-off)

For the purpose of this demo, we will pull the popular Kaggle Python docker image:

docker pull kaggle/python
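Once the pull completes (it is a fairly large image, so this may take a while), you can confirm it is available locally with:

docker images kaggle/python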

Other available Kaggle Docker images may be found on Kaggle GitHub Page.

Setup bash_profile (One-off)

Add these to your .bash_profile.

kpython(){
  docker run -v $PWD:/tmp/working -w=/tmp/working --rm -it kaggle/python python "$@"  
}
ikpython() {
  docker run -v $PWD:/tmp/working -w=/tmp/working --rm -it kaggle/python ipython
}
kjupyter() {
  (sleep 3 && open "http://$(docker-machine ip docker2):8888")&
  docker run -v $PWD:/tmp/working -w=/tmp/working -p 8888:8888 --rm -it kaggle/python jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/working --allow-root
}

We've essentially created three handy shorthand functions (think of them as "aliases") to invoke a Jupyter Notebook, an IPython console, and a Python console, pre-bundled with all the data science Python packages. Notice also that we've specified the Kaggle image kaggle/python in the script. Replace the bits and bobs as needed.

Note also that the change only takes effect in new terminals (i.e. just exit the existing terminal and open a new one, knowing that the change does not affect the old terminal).
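Alternatively, if you'd rather not open a new terminal, sourcing the profile in the current one has the same effect:

source ~/.bash_profile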

A Note on Directory Structure (Important)

Say you have a directory structure like this (this is how mine is set up):

|- /Users/Johnny
  |- kaggle
    |- project1
      |- nbs
        |- notebook1.ipynb
        |- notebook2.ipynb
      |- input
        |- train.csv
        |- test.csv
    |- project2
      |- nbs
        |- notebook1.ipynb
        |- notebook2.ipynb
      |- input
        |- train.csv
        |- test.csv

Say you are working on project1. I would suggest you invoke kjupyter / ikpython / kpython at /Users/Johnny/kaggle/project1. This way, your notebook / console will have visibility of all subdirectories within project1, while not interfering with other projects (e.g. accidentally deleting something in project2 - we want to avoid that!).

To give you a concrete example on how I would run my notebooks:

  1. Navigate to /Users/Johnny/kaggle/project1.
  2. Run kjupyter.
  3. Copy the Jupyter Notebook URL (along with the token) from the terminal, and paste it into a browser (e.g. Chrome). The string should look like this:
http://0.0.0.0:8888/?token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

If you are asked to enter a token in a box instead, just copy the token string xxxx and paste it into the box. For example:

kjupyter1.png

kjupyter2.png

  4. Navigate to the nbs directory (where we store notebooks). Open up any notebook (or just create one).

kjupyter3.png

  5. Issue some commands to ensure you can access the ../input directory where you store your datasets. The benefit of this setup is that the way we access our datasets is consistent with how we would do it in an online Kaggle kernel, i.e. via the ../input directory. Sample commands:
%pwd

The above should return /tmp/working/nbs. (This is the directory on the docker virtual machine; it is mapped to your current working directory, i.e. /Users/Johnny/kaggle/project1/nbs.)

%ls "../input"

The above should list the datasets under ../input; in this example it should list out train.csv and test.csv.
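As a further optional check (assuming train.csv is an ordinary CSV file with a header row, as a typical Kaggle dataset would be), you could load it with pandas exactly the same way you would in an online kernel:

import pandas as pd

# Read the training set via the shared ../input convention;
# change the filename to whatever your project actually uses.
train = pd.read_csv("../input/train.csv")
print(train.shape)
train.head()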

(Optional) Run the following default Kaggle Kernels Jupyter notebook code; it should work without error if you have followed the instructions above correctly. The motivation is to check that we can import the Python libraries shipped with the docker image, and that we can access ../input (where we will grab datasets later on). That way, after we've uploaded our notebook back to Kaggle Kernels, others can access the data the same way we do locally on a laptop.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

This is what we see if we run code via local Dockerized Kaggle (k)jupyter notebook:

kjupyter4.png

This is what we see if we run code via the online Kaggle Notebook kernel:

kjupyter7.png

Note that both notebooks are able to "see" the datasets stored under ../input.

Use Kaggle Jupyter Notebook

Issue kjupyter in an appropriate directory. See the section A Note on Directory Structure.

Use Kaggle iPython Console

Issue ikpython in an appropriate directory. See the section A Note on Directory Structure.

Use Kaggle Python Console

Issue kpython in an appropriate directory. See the section A Note on Directory Structure.
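For example (the script name below is just a hypothetical placeholder; kpython with no arguments drops you into an interactive Python console, and with a filename it runs that script inside the container):

cd /Users/Johnny/kaggle/project1
kpython                 # interactive Python console
kpython my_script.py    # run a script located under the project directory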

Conclusion

In this post we have covered the motivation for setting up a docker image locally on a Mac (the process should be similar for Windows and Linux - see the blog posts in the references), and provided instructions for running the Kaggle "dockered" Jupyter Notebook (via kjupyter), IPython console (via ikpython) and Python console (via kpython). This post may be used as a quick reference guide to reproduce a data science workflow for Kaggle and non-Kaggle projects.

References
