
@programmer290399
Last active August 30, 2021 05:14

GSoC 2021

Google Summer of Code 2021 Final Work Submission Report

Aim

Add support for archive storage for all existing models in DFFML. The project roughly consists of three parts:

  1. Implementing archiving and compression related operations.
  2. Updating the Model base class to support archive storage.
  3. Update all models to support this change.

Executive Summary

DFFML provides a unified interface for training, testing, and deploying various machine learning models. It allows users to write their own plugins for fine-grained control over various aspects of a pipeline, such as reading data or training a model, while providing a framework-agnostic interface that helps put the various parts of a pipeline together without affecting or breaking other pre-existing parts. This makes DFFML-based projects easy to maintain, update, and extend.

Before this project, issue #662 was open, stating that DFFML had no support for saving a Model as a single archived file holding all the information necessary to restore the Model's working state, which would enable a user to resume or reproduce work at a different point in time or on another machine/environment.

My project added support for archive storage of models (i.e., in the form of .zip or .tar.* files) for all pre-existing models in DFFML, updated tests and documentation, and fixed all the model bugs that surfaced while implementing this feature. Users can now save not only the model state but also the full configuration of a model.

This implementation has two major benefits for the user:

  1. Increased Reproducibility of DFFML models.
  2. Better Portability of DFFML models.

I would like to break down this project into three sub-parts:

  1. Implementing a suite of compression and archiving operations. The suite covers the zip and tar archive formats, which can be paired with gzip, lzma, and bzip2 compression to produce compressed archives.

  2. Updating the Model base class to run archive creation/extraction dataflows on entering and exiting its context, so that models are seamlessly loaded from and saved to archives when needed, while keeping all the existing folder-based functionality intact.

  3. Updating all the models listed below to support this change, which involved refactoring considerable parts of these models.

    Sno. Model Name
    01. SLR model
    02. Auto Sklearn model
    03. Daal4py model
    04. Pytorch model
    05. Scikit model
    06. Scratch model
    07. Spacy model
    08. Tensorflow model
    09. Vowpal Wabbit model
    10. Xgboost model
    11. Tensorflow hub model
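The format and compression pairings in part 1 map directly onto Python standard library modules. A minimal sketch of how such compressed archives can be produced (the file names and the make_archives function are illustrative, not the actual DFFML operations):

```python
import tarfile
import zipfile
from pathlib import Path


def make_archives(src_dir: str) -> list:
    """Create one archive of src_dir per supported format/compression pairing."""
    created = []
    # tar pairs with gzip, lzma (xz), or bzip2 via its mode string
    for mode, suffix in [("w:gz", ".tar.gz"), ("w:xz", ".tar.xz"), ("w:bz2", ".tar.bz2")]:
        name = f"model{suffix}"
        with tarfile.open(name, mode) as tar:
            tar.add(src_dir, arcname=".")
        created.append(name)
    # zip applies compression per member (deflate here)
    with zipfile.ZipFile("model.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in Path(src_dir).rglob("*"):
            zf.write(path, path.relative_to(src_dir))
    created.append("model.zip")
    return created
```

Note that tar handles compression through its mode string while zip compresses each member individually, which is why the two formats need separate handling.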

Project Tasks Completed

Adding Support for Archives and Compression

Added DFFML operations for zip and tar archive creation/extraction, along with compression operations for the gzip, lzma, and bzip2 algorithms, all implemented using only the Python standard library. A couple of additions and changes were later made to these operations to better fit the needs of the project.

Related Links:

  • intel/dffml#1128
    • Added initial implementation of all the operations discussed above.
  • intel/dffml#1161
    • Added output to all the operations so that they could be chained in a single dataflow.
  • intel/dffml#1199
    • Fixed an issue where the directory structure was not preserved in the archive.
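As a rough illustration of the kind of round trip these operations perform, here is a simplified sketch using only the standard library (tar_create and tar_extract are illustrative names, not the actual DFFML operations), including the directory-structure preservation that the fix above addressed:

```python
import tarfile


def tar_create(archive: str, src_dir: str) -> str:
    """Pack src_dir into a tar archive; compress with gzip if the name ends in .gz."""
    mode = "w:gz" if archive.endswith(".gz") else "w"
    with tarfile.open(archive, mode) as tar:
        tar.add(src_dir, arcname=".")
    return archive


def tar_extract(archive: str, dest: str) -> str:
    """Unpack an archive, preserving the relative directory structure of its members."""
    with tarfile.open(archive, "r:*") as tar:  # "r:*" auto-detects compression
        tar.extractall(dest)
    return dest
```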

Renaming directory Property to location

Renamed all instances of the directory property to location in order to pave the way for updating the Model base class to support archive storage. This was the biggest change I made during my project, affecting more than 100 files.
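Illustratively, the change amounts to a field rename in each model's config; a minimal sketch, assuming a hypothetical MyModelConfig (not a real DFFML class):

```python
from dataclasses import dataclass
from pathlib import Path


# Before: the field name implied the value could only be a folder
@dataclass
class MyModelConfigOld:
    directory: Path


# After: ``location`` may point at a folder *or* an archive file
# such as model.tar.gz, hence the more general name
@dataclass
class MyModelConfig:
    location: Path
```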

Related Links:

Implementing Archive Support in Model base class and Updating all the Models

Added support for archive storage by implementing a helper function in dffml.df.archive that seamlessly creates dataflows from just the input and output paths. Using this helper function, the Model base class builds the appropriate dataflows for saving and loading models to/from archives in its __aexit__ and __aenter__ methods respectively. This gives every model inheriting directly from Model the capability to use archive storage, and after some changes all models inheriting from SimpleModel were able to use it as well. However, small tweaks and fixes were required in almost all models, and a major refactor was required for a few.
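A simplified sketch of the idea, using plain tarfile calls in place of the actual dataflows (the ArchiveModelContext class and its attributes are hypothetical; DFFML's real implementation builds dataflows via the dffml.df.archive helper):

```python
import tarfile
import tempfile
from pathlib import Path


class ArchiveModelContext:
    """Sketch: unpack the archive into a working directory on entry,
    repack the working directory into the archive on exit."""

    def __init__(self, location: str):
        self.location = Path(location)
        self.workdir = None

    async def __aenter__(self):
        self.workdir = Path(tempfile.mkdtemp())
        if self.location.exists():
            # Restore the saved model state into the working directory
            with tarfile.open(self.location, "r:*") as tar:
                tar.extractall(self.workdir)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Persist whatever the model wrote during the context
        with tarfile.open(self.location, "w:gz") as tar:
            tar.add(self.workdir, arcname=".")
```

Code inside the context works with an ordinary folder (self.workdir), which is how the existing folder-based models can keep functioning unchanged.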

Related Links:

A detailed weekly description of tasks and work done can be found in:

Future Work

As DFFML is a machine-learning project, I think it is a good place for me to learn new things related to machine learning as well as software engineering in general. I would thus love to remain an active member of this community and plan to keep contributing to it.

Goals for future contributions:

  1. I plan to work on the various enhancement issues I opened while working on this project.
  2. I'm also looking forward to working on various issues of interest to me, like this one.
  3. Beyond that, I plan to add various models, such as Fasttext, Flair, Dart, etc., to DFFML.

I would like to thank my mentors John Andersen and Saksham Arora for helping me out at every stage of my project and patiently guiding me whenever necessary. I have undoubtedly learned a lot from both of them, and I wouldn't have been able to make it without their valuable feedback, guidance, and help.

I'd also like to thank my fellow GSoC students Sudhanshu Kumar and Hashim Chaudry for helping me during the summer.

Thank you to Google and Python Software Foundation for providing this opportunity!
