- Name: Saahil Ali
- Organisation: Python Software Foundation
- Sub-Organisation: DFFML - DataFlow Facilitator for Machine Learning
- Project: Support Archive Storage for Models
- Proposal: https://blogs.python-gsoc.org/media/proposals/DFFML__Support_Archive_Storage_for__Models_2.pdf
Add support for archive storage for all existing models in DFFML. The project roughly consists of three parts:
- Implementing archiving and compression related operations.
- Updating the `Model` base class to support archive storage.
- Updating all models to support this change.
DFFML provides a unified interface for training, testing, and deploying various machine learning models. It allows users to write their own plugins for fine-grained control over various aspects of a pipeline, like reading data or training a model, while providing a framework-agnostic interface that helps put the various parts of a pipeline together without affecting or breaking other pre-existing parts. This makes DFFML-based projects easy to maintain, update, and extend.
Before this project, issue #662 was open, which stated that DFFML had no support for saving Models as a single archived file holding all the necessary information to restore the working state of the Model, which would enable a user to resume/reproduce work at a different point in time or on another machine/environment.
My project successfully added support for archive storage (i.e. in the form of `.zip` or `.tar.*` files) for all pre-existing models in DFFML. It also updated tests and documentation, and fixed all bugs in the various models that surfaced while implementing this feature. This now enables users to save not only the model state but also the full configuration of a model.
There are two major benefits of this implementation to the user:
- Increased Reproducibility of DFFML models.
- Better Portability of DFFML models.
I would like to break down this project into three sub-parts:
- Implementing a suite of compression and archiving operations. This suite supports the `zip` and `tar` archive formats, which can be paired with `gzip`, `lzma` & `bzip2` compression to produce compressed archives.
- Updating the `Model` base class to use archive creation/extraction dataflows on entering and exiting the context, to seamlessly load and save models as archives when needed, while keeping all the existing folder-based functionality as is.
- Updating all the models listed below to support this change, effectively refactoring considerably large parts of these models.
| Sno. | Model Name |
| ---- | ---------- |
| 01. | SLR model |
| 02. | Auto Sklearn model |
| 03. | Daal4py model |
| 04. | Pytorch model |
| 05. | Scikit model |
| 06. | Scratch model |
| 07. | Spacy model |
| 08. | Tensorflow model |
| 09. | Vowpal Wabbit model |
| 10. | Xgboost model |
| 11. | Tensorflow hub model |
Added DFFML operations to enable `zip` and `tar` file creation/extraction, along with compression operations to enable compression/decompression using the `gzip`, `lzma` & `bzip2` algorithms, all implemented using only utilities from the Python Standard Library. A couple of additions and changes were made to these operations to better fit the needs of the project.
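The actual operations live in the DFFML PRs linked below; as a rough illustration of the same idea, the standard library alone can pack a model directory into a `.zip` or compressed `.tar.*` archive and restore it, preserving the directory structure. The function names here are hypothetical, not DFFML's API:

```python
import tarfile
import zipfile
from pathlib import Path

def create_archive(src_dir: str, archive_path: str) -> None:
    """Pack a directory into a .zip or .tar(.gz/.xz/.bz2) archive,
    preserving the directory structure via relative paths."""
    src = Path(src_dir)
    if archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path, "w") as zf:
            for path in sorted(src.rglob("*")):
                zf.write(path, path.relative_to(src))
    else:
        # The extension picks the compression: gzip, lzma, or bzip2.
        suffix = archive_path.rsplit(".", 1)[-1]
        mode = {"gz": "w:gz", "xz": "w:xz", "bz2": "w:bz2"}.get(suffix, "w")
        with tarfile.open(archive_path, mode) as tf:
            tf.add(src, arcname=".")

def extract_archive(archive_path: str, dest_dir: str) -> None:
    """Unpack an archive created by create_archive into dest_dir."""
    if archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest_dir)
    else:
        # "r:*" lets tarfile auto-detect the compression used.
        with tarfile.open(archive_path, "r:*") as tf:
            tf.extractall(dest_dir)
```

Note that `zip` carries its compression internally, while the `tar` formats layer `gzip`/`lzma`/`bzip2` on top of a plain archive, which is why the two need slightly different handling.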
Related Links:
- intel/dffml#1128
- Added initial implementation of all the operations discussed above.
- intel/dffml#1161
- Added output to all the operations so that they could be chained in a single dataflow.
- intel/dffml#1199
- Fixed an issue where the directory structure was not preserved in archive.
Renamed all instances of the `directory` property to `location` in order to pave the way for updating the `Model` base class to support archive storage. This was the biggest change I made in my project, affecting more than 100 files.
Related Links:
- intel/dffml#1155
- The PR where I made this change.
Added support for archive storage by implementing a helper function in `dffml.df.archive` that seamlessly creates dataflows from just the input and output paths. Using this helper function, the `Model` base class creates the appropriate dataflows for saving and loading models into/from archives in its `__aexit__` & `__aenter__` methods respectively. This enables all models inheriting directly from `Model` to use archive storage; after some changes, all models inheriting from `SimpleModel` were able to use it as well. However, small tweaks and fixes were required in almost all models, and a major refactor was required for a few of them.
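The context-manager pattern described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not DFFML's actual implementation: the class name and archive detection are hypothetical, and only `tar`-family archives are handled for brevity:

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

class ArchiveModelContext:
    """Sketch of the Model base class behaviour: if the configured
    location is an archive, extract it on enter and re-pack it on
    exit, so the rest of the model only ever sees a directory."""

    def __init__(self, location: str):
        self.location = Path(location)
        # Hypothetical detection by extension; DFFML's helper differs.
        self.is_archive = self.location.suffix in (".tar", ".gz", ".xz", ".bz2")
        self.workdir = None

    async def __aenter__(self):
        if self.is_archive:
            self.workdir = Path(tempfile.mkdtemp())
            if self.location.exists():
                # Loading: restore the saved model state from the archive.
                with tarfile.open(self.location, "r:*") as tf:
                    tf.extractall(self.workdir)
        else:
            # Existing folder-based behaviour is untouched.
            self.workdir = self.location
        return self

    async def __aexit__(self, exc_type, exc, tb):
        if self.is_archive:
            # Saving: pack the working directory back into the archive.
            with tarfile.open(self.location, "w:gz") as tf:
                tf.add(self.workdir, arcname=".")
            shutil.rmtree(self.workdir)
```

Because the pack/unpack happens in `__aenter__`/`__aexit__`, subclasses need no knowledge of archives, which is why models inheriting from the base class gained the capability with only small tweaks.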
Related Links:
- intel/dffml#1172
- This was done to solve a circular dependency issue.
- intel/dffml#1174
- Updated the `Model` & `SimpleModel` classes, as well as all other models.
Detailed weekly description of tasks and work done can be found in:
- Weekly Blogs: https://blogs.python-gsoc.org/en/programmer290399s-blog/
- Weekly Sync: https://docs.google.com/document/d/16u9Tev3O0CcUDe2nfikHmrO3Xnd4ASJ45myFgQLpvzM/edit
- DFFML YouTube Channel: https://www.youtube.com/channel/UCorEDRWGikwBH3dsJdDK1qA
As DFFML is a Machine Learning based project, I think it is a good place for me to learn new things related to Machine Learning, as well as software engineering in general. I would therefore love to be an active member of this community and plan to keep contributing to it.
Goals for future contributions:
- I plan to work on various enhancement issues I have opened while working on this project.
- I'm also looking forward to working on various issues of my interest, like this one.
- Beyond that, I plan to add various models, such as Fasttext, Flair, Dart, etc., to DFFML.
I would like to thank my mentors John Andersen and Saksham Arora for helping me out on every stage of my project and patiently guiding me whenever necessary. Undoubtedly, I have learned a lot from both of them and I wouldn't have been able to make it without their valuable feedback, guidance and help.
I'd also like to thank my fellow GSoC students Sudhanshu Kumar and Hashim Chaudry for helping me during the summer.
Thank you to Google and Python Software Foundation for providing this opportunity!