@mayankskb
Last active November 14, 2018 11:06
Dask API - Introduction

We have NumPy and Pandas as a strong analytics ecosystem, but both are designed to run on a single core.

Dask uses NumPy and Pandas as its backends and sets up a logical connection between 'n' NumPy arrays or DataFrames. It is often called the parallel computing library of Python, designed to run across multiple systems. Typically, Dask keeps all of its data on disk with the intent of saving memory, and reads only the fraction of the data into memory that is required at the moment. Moreover, it tries to discard intermediate results as soon as possible.

Dask exposes its APIs in three formats:

  1. Dask Array
  2. Dask DataFrames
  3. Dask ML

Dask Array:

A Dask array forms a logical connection between NumPy arrays under the hood and executes operations on them in a distributed fashion across the cores of a machine or a group of machines, often called a cluster. These many small NumPy arrays together form one large array that is referred to as a Dask array. Dask arrays support a large number of NumPy functions.

Dask performs lazy computation. It builds a DAG (Directed Acyclic Graph) of tasks when an operation is defined over the data, and executes it only when compute() is invoked.
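A minimal sketch of this laziness: every operation below only extends the task graph until compute() is called.

```python
import dask.array as da

# Build a 10000 x 10000 array of ones, split into 1000 x 1000 NumPy chunks.
x = da.ones((10000, 10000), chunks=(1000, 1000))

# No computation happens here -- sum() only adds nodes to the DAG.
total = x.sum()

# compute() walks the DAG across the available cores and returns
# a concrete value (here a NumPy scalar).
result = total.compute()
print(result)  # 100000000.0
```

Until compute() runs, `x` and `total` hold only metadata (shape, dtype, chunk layout) plus the task graph, so very large arrays can be described without fitting in memory.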


Dask DataFrames:

Similar to Dask arrays, a Dask DataFrame forms a logical connection between multiple smaller pandas DataFrames. These DataFrames are row-oriented partitions of the larger DataFrame, which is called the Dask DataFrame. Whenever an action is called on the Dask DataFrame, it is executed under the hood on the pandas DataFrames in a distributed manner across the cores of the connected machines.

The Dask DataFrame API also supports most pandas-based functionality.


Dask ML:

Dask ML provides scalable machine learning algorithms in Python, compatible with scikit-learn. Through scikit-learn one can do parallel machine learning on a single machine, but through Dask one can execute parallel machine learning tasks spanning a cluster. Dask provides parallel support for ML tasks on one node as well as on a distributed set of nodes.

Scikit-learn makes use of a Python library named joblib, which is responsible for the parallelization. Dask ML accomplishes this with a centralized scheduler and a group of worker nodes: the scheduler assigns tasks to the worker nodes, and each worker node processes the jobs assigned by the scheduler and returns the results on demand.

There are two ways of doing this:

  1. Using Dask arrays to implement the algorithm:
    Dask provides implementations of some algorithms in the dask_ml library, from where they can be used directly. These algorithms originally use NumPy arrays, which are replaced with Dask arrays to make them scalable. They include:
    --- Linear models (linear regression, logistic regression, Poisson regression)
    --- Pre-processing (scalers, transforms)
    --- Clustering (k-means, spectral clustering)

  2. Parallelizing scikit-learn algorithms directly:
    Though scikit-learn already supports parallelizing its tasks using joblib, this is limited to a single-node system. Scaling it out to a cluster needs only a few more lines of code:

--- Import Client from dask.distributed and connect to the cluster:

from dask.distributed import Client
client = Client()

--- To instantiate the Dask joblib backend, import parallel_backend from the sklearn library:

import dask_ml.joblib
from sklearn.externals.joblib import parallel_backend

--- Normal sklearn code will then work inside the parallel_backend block:

with parallel_backend('dask'):
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier()

Dask CV:

The sklearn package provides GridSearchCV for the purpose of hyperparameter tuning. But it comes with its own cons: it forms the full cross-product of the hyperparameters to be optimized and the candidate values from which the most suitable one needs to be chosen. E.g., if there are 3 hyperparameters with 3 values each, it ends up building all possible combinations among them, 27 in total. Each combination must be processed, and the same intermediate task (such as an early pipeline step shared by many combinations) sometimes ends up being processed multiple times.

Dask provides a package named dask_searchcv, which eliminates processing the same intermediate task multiple times. It works much like the one given by sklearn.
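A sketch of the grid-search pattern, shown here with scikit-learn's own GridSearchCV since the interface is what dask_searchcv mirrors; per the source, swapping the import for dask_searchcv's GridSearchCV keeps the same usage while scheduling the candidate fits on a Dask cluster and avoiding repeated work:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic classification problem for illustration.
X, y = make_classification(n_samples=200, random_state=0)

# 3 candidate values for C; with cv=3 this means 9 fits in total,
# and larger grids grow multiplicatively (3 params x 3 values = 27
# combinations, as described above).
grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(solver="lbfgs"), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```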
