Google Summer of Code 2021 Final Work Submission Report
- Name: Ch. M. Hashim
- Organisation: Python Software Foundation
- Sub-Organisation: DFFML - DataFlow Facilitator for Machine Learning
- Project: Enhancing User Experience with Notebook Examples for Machine Learning Use Cases.
- Proposal: https://blogs.python-gsoc.org/media/proposals/GSoC_2021_DFFML_Project_Proposal1.pdf
The project highlights use-cases that demonstrate the machine learning workflow of the DFFML API through Jupyter notebook examples. The project also adds multi-output model support to DFFML.
The project is divided into two phases:
Phase 1: Adding general Machine Learning use-case examples:
- Evaluating model performance.
- Tuning models.
- Saving and loading models.
- Ensemble by stacking.
- Transfer Learning.
Phase 2: Adding Multi-Output model support and use-case examples:
- Add support for Multi-Output models.
- Use-case example for Multi-Output models.
DFFML curates several models and gives its users a simple way to use machine learning. However, it can be time-consuming, especially for new users, to discover all the use-cases of an API. For users to fully understand how they can integrate DFFML into their routine machine learning tasks, or even state-of-the-art projects, they need to see the potential DFFML has and how it can make their jobs a lot easier! Understanding another person's code is hard enough, let alone figuring out what that code could help you achieve. This project captures DFFML's use-cases and functionality as runnable notebook examples, which has noticeably improved the user experience.
Jupyter notebooks allow documentation to live right alongside the code in a very presentable way. They also allow selective, cell-level code execution, which helps users understand the flow better. These notebooks were designed to be beginner-friendly; the code was written with new users, and their likely use cases, in mind to maximize the usefulness of each example.
The project also involved testing all of the use-case examples. The notebooks were tested using nteract/testbook, a unit testing framework for code in Jupyter notebooks. Notebooks are quite presentable on their own, but simply linking to a notebook from project documentation is not. To overcome this, the notebooks were integrated into the DFFML documentation, which also improved their discoverability. To carry out the task, nbsphinx was utilized, a Sphinx extension that provides a source parser for `*.ipynb` files.
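For reference, enabling nbsphinx amounts to a few lines in the Sphinx `conf.py`. The option values below are illustrative, not necessarily what DFFML's documentation build uses:

```python
# conf.py -- register nbsphinx so Sphinx can parse *.ipynb sources
extensions = [
    "nbsphinx",
]

# Use the outputs already stored in the notebooks instead of
# re-executing them at docs build time ("always" would execute them).
nbsphinx_execute = "never"

# Keep checkpoint copies out of the documentation build
exclude_patterns = ["**.ipynb_checkpoints"]
```

With this in place, any notebook under the Sphinx source directory can be referenced from a toctree like any other page.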
The biggest challenge in this project was adding support for multi-output models. Since DFFML had already wrapped scikit-learn models, my job was to integrate the scikit-learn multi-output wrappers into the existing code. However, I made things much simpler for users: instead of having them call wrappers to make models multi-output, I enabled users to use the multi-output version of a model simply by sending multi-output data to any of the scikit models that DFFML has. Everything is handled on the backend, so users don't have to look up multi-output wrappers; they can train their multi-output models like any other. The task became more challenging when one of my peers added scikit accuracy scorers to DFFML. Not only did I manage to rebase those large changes, I also added support for multi-output scikit scorers to assess all the multi-output models in the new workflow. And, of course, I added a use-case example for multi-output models like the rest.
Project Tasks Completed
The project went on as planned! All of the tasks were completed and are listed below:
Use-case example 'Evaluating model performance' and notebook testing
This is the first use-case example notebook that I worked on as part of GSoC. It uses the DFFML API to show how to create basic models, train them, and evaluate their performance. Model performance was measured using scorers, and the resulting scores were plotted on a bar chart. This pull request also added test_notebook, a script that tests all notebooks in CI using testbook and makes sure they execute without errors. This means future contributors don't have to worry about test setup when adding new notebooks to DFFML.
Use-case example 'Tuning models'
This use-case example notebook shows how to build, train, and tune models with the DFFML API. Tuning involves mutating the config values of the hyperparameters saved as part of the model. However, that didn't seem sufficient for the tutorial, so I decided to add something more. As part of the PR, and beyond what the GSoC project required, I added a new plugin, optimizer, which will host different algorithms for tuning a model's hyperparameters. I implemented ParameterGrid, which takes a grid of hyperparameter values and finds the best among them, i.e. the set with the highest score on a specified dataset.
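The idea behind ParameterGrid can be sketched in plain Python. The toy grid and scoring function below are illustrative only, not DFFML's actual optimizer API:

```python
import itertools

def parameter_grid(grid):
    """Yield every combination of hyperparameter values in the grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

def best_config(grid, score):
    """Return the hyperparameter set with the highest score."""
    return max(parameter_grid(grid), key=score)

# Toy example: pretend the score peaks at lr=0.1, depth=3
grid = {"lr": [0.01, 0.1, 1.0], "depth": [1, 3, 5]}
score = lambda cfg: -abs(cfg["lr"] - 0.1) - abs(cfg["depth"] - 3)
print(best_config(grid, score))  # {'depth': 3, 'lr': 0.1}
```

In the real plugin, the score callback would train the model with each candidate config and evaluate it with a scorer on a validation dataset.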
Use-case example 'Saving and loading models'
This use-case example notebook highlights how to create, train, save, and load models with the DFFML API. Once the model has been saved, the notebook restarts the kernel to enter a new session and shows how to load the saved model.
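The save/load round trip in the notebook follows the same general pattern as this stdlib-only sketch. The stand-in model and pickle persistence are illustrative; DFFML itself persists models through the model's location config rather than manual pickling:

```python
import os
import pickle
import tempfile

class MeanModel:
    """Stand-in 'model' that predicts the mean of its training targets."""
    def fit(self, ys):
        self.mean = sum(ys) / len(ys)
        return self
    def predict(self):
        return self.mean

model = MeanModel().fit([1.0, 2.0, 3.0])

# Save the trained model to disk ...
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ... then load it into a fresh object, as the notebook does
# after restarting the kernel
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored.predict())  # 2.0
```

The key point the notebook demonstrates is the same: a model trained in one session can be restored and used in another without retraining.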
Use-case example 'Ensemble by stacking'
This use-case example notebook highlights how to build different models, train them and perform stacking, an ensemble learning technique, using the DFFML API.
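The stacking idea can be shown with a minimal sketch: base models each make a prediction, and a meta-model combines those predictions. The toy constant learners and averaging meta-model below are illustrative, not DFFML's API:

```python
class ConstantModel:
    """Toy base learner that always predicts a fixed value."""
    def __init__(self, value):
        self.value = value
    def predict(self, x):
        return self.value

def stack_predict(base_models, meta_model, x):
    """Feed each base model's prediction into the meta-model."""
    base_preds = [m.predict(x) for m in base_models]
    return meta_model(base_preds)

# Two weak learners, combined by a simple averaging meta-model
base = [ConstantModel(1.0), ConstantModel(3.0)]
average = lambda preds: sum(preds) / len(preds)
print(stack_predict(base, average, x=None))  # 2.0
```

In practice the meta-model is itself trained on the base models' predictions, which is what the ensemble-by-stacking notebook walks through with real models.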
Use-case example 'Transfer Learning'
This use-case example notebook shows how to perform transfer learning with PyTorch pre-trained models through the DFFML Python API. However, DFFML's PyTorch support didn't allow adding additional layers to pre-trained models, so I ended up adding the required support before creating the use-case example.
- Support and add use-case examples for Multi-output models
This part of the project focused on adding support for multi-output models. It enabled users to use the multi-output version of a model simply by sending multi-output data to any of the scikit models we have. It also added support for multi-output scikit scorers to assess all the multi-output models in the machine learning workflow. Once multi-output models and scorers had complete support and tests in the codebase, a use-case example notebook was added to show how easy it is to perform multi-output tasks with the DFFML API.
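The backend approach can be illustrated with a minimal wrapper in the style of scikit-learn's MultiOutputRegressor: fit one independent single-output estimator per target column. The toy estimator and data below are illustrative, not DFFML's actual implementation:

```python
class MeanRegressor:
    """Toy single-output estimator: predicts the mean of its targets."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self
    def predict(self, X):
        return [self.mean for _ in X]

class MultiOutput:
    """Wrap a single-output estimator class to handle multiple targets."""
    def __init__(self, estimator_cls):
        self.estimator_cls = estimator_cls
    def fit(self, X, Y):
        # One independent estimator per target column
        self.estimators = [
            self.estimator_cls().fit(X, [row[i] for row in Y])
            for i in range(len(Y[0]))
        ]
        return self
    def predict(self, X):
        # Re-assemble per-target predictions into rows
        columns = [est.predict(X) for est in self.estimators]
        return [list(row) for row in zip(*columns)]

X = [[0], [1], [2]]
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
model = MultiOutput(MeanRegressor).fit(X, Y)
print(model.predict([[5]]))  # [[2.0, 20.0]]
```

The point of doing this on the backend is that the user's call site doesn't change: the shape of the training data alone decides whether the multi-output path is used.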
Detailed weekly description of tasks and work done can be found in:
- Weekly Blogs: https://blogs.python-gsoc.org/en/mhash1ms-blog/
- Weekly Sync: https://docs.google.com/document/d/16u9Tev3O0CcUDe2nfikHmrO3Xnd4ASJ45myFgQLpvzM/edit
- DFFML YouTube Channel: https://www.youtube.com/channel/UCorEDRWGikwBH3dsJdDK1qA
DFFML has become quite polished over the years, especially after this summer, as all the projects were meant to do exactly that! However, there is always something to contribute to.
Following could be possible future contributions to this project and DFFML:
- Currently, we use a number of small tools to integrate notebooks; we could potentially come up with a more unified way of getting this done. A potential start could be looking into existing tooling that combines notebook testing and documentation builds.
- Adding more optimizer plugins and more detailed tutorial utilizing more plugins for tuning models.
- Adding pre-processing plugins and creating pre-processing tutorials.
- Implementing a native multi-output model; currently we use scikit models only.
- Implementing a native multi-output scorer; currently we use the scikit scorers only.
As for me, I'll be focusing on completing my pending pre-GSoC tasks, such as the separate confidence from prediction PR to make sure it's done before the next release.
This summer has been the most fun and productive so far!
I would like to thank all my mentors, especially John Andersen, who has been the most supportive and enabling mentor. This wouldn't have been possible without his guidance and support. Also, thanks to Saksham Arora and Himanshu Tripathi for their help and some great advice.
Thank you to Google and Python Software Foundation for providing this opportunity!