- Name: Saahil Ali
- Organisation: Python Software Foundation
- Sub-Organisation: DFFML - DataFlow Facilitator for Machine Learning
- Project: Support Archive Storage for Models
- Proposal: https://blogs.python-gsoc.org/media/proposals/DFFML__Support_Archive_Storage_for__Models_2.pdf
Add support for archive storage for all existing models in DFFML. The project roughly consists of three parts:
- Implementing archiving and compression related operations.
- Updating the `Model` base class to support archive storage.
- Updating all models to support this change.
DFFML provides a unified interface for training, testing, and deploying various machine learning models. It allows users to write their own plugins for fine-grained control over various aspects of a pipeline, like reading data or training a model, while providing a framework-agnostic interface that helps put the various parts of a pipeline together without affecting or breaking other pre-existing parts. This makes DFFML-based projects easy to maintain, update, and extend.
Before this project, issue #662 was open, which stated that DFFML had no support for saving Models as a single archived file holding all the necessary information to restore the working state of the Model, which would enable a user to resume/reproduce work at a different point in time or on another machine/environment.
My project successfully added support for archive storage (i.e. in the form of `.zip` or `.tar.*` files) for all pre-existing models in DFFML. It also updated tests and documentation, and fixed all bugs in the various models that surfaced while implementing this feature. This now enables users to save not only the model state but also the full configuration of a model.
There are two major benefits of this implementation to the user:
- Increased Reproducibility of DFFML models.
- Better Portability of DFFML models.
I would like to break down this project into three sub-parts:
- Implementing a suite of compression and archiving operations. This suite supports the `zip` and `tar` archive formats, which can be paired with `gzip`, `lzma` & `bzip2` compression to produce compressed archives.
- Updating the `Model` base class to use archive creation/extraction dataflows on entering and exiting the context, to seamlessly load and save models as archives when needed, while keeping all the existing folder-based functionality as is.
- Updating all the models listed below to support this change, effectively refactoring considerably large parts of these models.
| Sno. | Model Name |
| ---- | ---------- |
| 01. | SLR model |
| 02. | Auto Sklearn model |
| 03. | Daal4py model |
| 04. | Pytorch model |
| 05. | Scikit model |
| 06. | Scratch model |
| 07. | Spacy model |
| 08. | Tensorflow model |
| 09. | Vowpal Wabbit model |
| 10. | Xgboost model |
| 11. | Tensorflow hub model |
Added DFFML operations to enable `zip` and `tar` file creation/extraction, along with compression operations to enable compression/decompression using the `gzip`, `lzma` & `bzip2` algorithms, all implemented using only utilities from the Python Standard Library. A couple of additions and changes were made to these operations to better fit the needs of the project.
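The actual operations live in the DFFML PRs linked below; as a rough illustration of the same idea, the standard library alone can pack a model directory into a `.zip` or compressed `.tar.*` archive and restore it, preserving the directory structure. The function names here are hypothetical, not DFFML's API:

```python
import tarfile
import zipfile
from pathlib import Path

def create_archive(src_dir: str, archive_path: str) -> None:
    """Pack a directory into a .zip or .tar(.gz/.xz/.bz2) archive,
    preserving the directory structure via relative paths."""
    src = Path(src_dir)
    if archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path, "w") as zf:
            for path in sorted(src.rglob("*")):
                zf.write(path, path.relative_to(src))
    else:
        # The extension picks the compression: gzip, lzma, or bzip2.
        suffix = archive_path.rsplit(".", 1)[-1]
        mode = {"gz": "w:gz", "xz": "w:xz", "bz2": "w:bz2"}.get(suffix, "w")
        with tarfile.open(archive_path, mode) as tf:
            tf.add(src, arcname=".")

def extract_archive(archive_path: str, dest_dir: str) -> None:
    """Unpack an archive created by create_archive into dest_dir."""
    if archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest_dir)
    else:
        # "r:*" lets tarfile auto-detect the compression used.
        with tarfile.open(archive_path, "r:*") as tf:
            tf.extractall(dest_dir)
```

Note that `zip` carries its compression internally, while the `tar` formats layer `gzip`/`lzma`/`bzip2` on top of a plain archive, which is why the two need slightly different handling.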
Related Links:
- intel/dffml#1128
- Added initial implementation of all the operations discussed above.
- intel/dffml#1161
- Added output to all the operations so that they could be chained in a single dataflow.
- intel/dffml#1199
- Fixed an issue where the directory structure was not preserved in archive.
Renamed all instances of the `directory` property to `location` in order to pave the way for updating the `Model` base class to support archive storage. This was the biggest change I made in my project, affecting more than 100 files.
Related Links:
- intel/dffml#1155
- The PR where I made this change.
Added support for archive storage by implementing a helper function in `dffml.df.archive` that seamlessly creates dataflows from just the input and output paths. Using this helper function, the `Model` base class creates the appropriate dataflows for saving and loading models into/from archives in its `__aexit__` & `__aenter__` methods respectively. This enables all models inheriting directly from `Model` to use archive storage; after some changes, all models inheriting from `SimpleModel` were able to use it as well. However, small tweaks and fixes were required in almost all models, and a major refactor was required for a few of them.
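The context-manager pattern described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not DFFML's actual implementation: the class name and archive detection are hypothetical, and only `tar`-family archives are handled for brevity:

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

class ArchiveModelContext:
    """Sketch of the Model base class behaviour: if the configured
    location is an archive, extract it on enter and re-pack it on
    exit, so the rest of the model only ever sees a directory."""

    def __init__(self, location: str):
        self.location = Path(location)
        # Hypothetical detection by extension; DFFML's helper differs.
        self.is_archive = self.location.suffix in (".tar", ".gz", ".xz", ".bz2")
        self.workdir = None

    async def __aenter__(self):
        if self.is_archive:
            self.workdir = Path(tempfile.mkdtemp())
            if self.location.exists():
                # Loading: restore the saved model state from the archive.
                with tarfile.open(self.location, "r:*") as tf:
                    tf.extractall(self.workdir)
        else:
            # Existing folder-based behaviour is untouched.
            self.workdir = self.location
        return self

    async def __aexit__(self, exc_type, exc, tb):
        if self.is_archive:
            # Saving: pack the working directory back into the archive.
            with tarfile.open(self.location, "w:gz") as tf:
                tf.add(self.workdir, arcname=".")
            shutil.rmtree(self.workdir)
```

Because the pack/unpack happens in `__aenter__`/`__aexit__`, subclasses need no knowledge of archives, which is why models inheriting from the base class gained the capability with only small tweaks.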
Related Links:
- intel/dffml#1172
- This was done to solve a circular dependency issue.
- intel/dffml#1174
- Updated the `Model` & `SimpleModel` classes, as well as all other models.
Detailed weekly description of tasks and work done can be found in:
- Weekly Blogs: https://blogs.python-gsoc.org/en/programmer290399s-blog/
- Weekly Sync: https://docs.google.com/document/d/16u9Tev3O0CcUDe2nfikHmrO3Xnd4ASJ45myFgQLpvzM/edit
- DFFML YouTube Channel: https://www.youtube.com/channel/UCorEDRWGikwBH3dsJdDK1qA
As DFFML is a Machine Learning based project, I think it is a good place for me to learn new things related to Machine Learning, as well as software engineering in general. I would therefore love to be an active member of this community and plan to keep contributing to it.
Goals for future contributions:
- I plan to work on various enhancement issues I have opened while working on this project.
- I'm also looking forward to working on various issues of my interest, like this one.
- Beyond that, I plan to add various models, such as Fasttext, Flair, Dart, etc., to DFFML.
I would like to thank my mentors John Andersen and Saksham Arora for helping me out on every stage of my project and patiently guiding me whenever necessary. Undoubtedly, I have learned a lot from both of them and I wouldn't have been able to make it without their valuable feedback, guidance and help.
I'd also like to thank my fellow GSoC students Sudhanshu Kumar and Hashim Chaudry for helping me during the summer.
Thank you to Google and Python Software Foundation for providing this opportunity!