Param-29/GSoC 2021: Integrating sklearn into PySyft.md

## GSoC 2021: Integrating sklearn into PySyft.md

      
GSoC 21: Integrating sklearn into PySyft

OpenMined & PySyft

As per their website, “OpenMined is an open-source community whose goal is to make the world more privacy-preserving by lowering the barrier-to-entry to private AI technologies”. Pretty cool, right? 😍 To be honest, it was this mission that drew me to OpenMined, along with their fantastic courses and the community. My experience since January 2021 has been nothing short of thrilling. From asking basic questions to answering them for new members, from creating new issues to getting assigned one, the experience has been surreal, and I am sure this is just the start of the journey.
Their Python library, Syft, allows users to compute on and manipulate data they cannot see. This allows data owners to work with data scientists without compromising the privacy of the data. Syft’s 0.5.0 release allows its users to train machine learning models using PyTorch. While PyTorch is a great library for training deep learning models, it does not implement many classical machine learning models such as SVMs and Random Forests. Sklearn implements many such algorithms, along with many pre-processing utilities. So, for GSoC 2021, I was selected to “Integrate scikit-learn into syft”, giving its users access to the features that sklearn provides.
About Project Redback

Integration of a library into syft required the following:

Protocol Buffer (protobuf) definitions, along with their serialization and deserialization functions, for the primitive data types that the library works on (for example, DataFrames for pandas).
Serialization and deserialization (serde) tests for these data types.
A list of the modules, classes, and class methods that need to be supported, along with their expected return types.
Unit tests for the supported methods.

Doing this by hand for each library is very time-consuming, and many parts of it can be automated with the help of inspect and the mypy-style type hints that are generally present in the codebases of Python libraries. However, many libraries (like sklearn) do not have type hints indicating the return types of their functions and methods. Hence, after some brainstorming sessions, we decided to work on a new feature called RedBack: a tool to semi-automate the process of adding libraries to syft.
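As a rough illustration of why missing annotations get in the way of full automation, the standard-library `inspect` module can read a return annotation when one exists, and reports `inspect.Signature.empty` when it does not. The two functions and the `return_annotation` helper below are made up for this sketch and are not part of sklearn, syft, or RedBack.

```python
import inspect


def annotated(x: int) -> int:
    """Hypothetical function that declares its return type."""
    return x + 1


def unannotated(x):
    """Hypothetical function with no return annotation."""
    return x + 1


def return_annotation(func):
    """Return the declared return type of func, or None if it has none."""
    annotation = inspect.signature(func).return_annotation
    if annotation is inspect.Signature.empty:
        return None
    return annotation


print(return_annotation(annotated))    # <class 'int'>
print(return_annotation(unannotated))  # None
```

For the second kind of function, the return type has to come from somewhere else, for example by running the code and inspecting the result.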
Goals:

The following were the high-level goals of the project:

Provide tooling which generates sensible defaults automatically
Allow definition or auto-detection of Signatures and Return Types
Create a tutorial on using this new system, with a concrete example of adding a real library
Have the AST loader support package dependencies at load time so that they can be loaded first (for example, many packages rely on NumPy)
Autocomplete for all AST methods should be available in IDEs like Jupyter Notebooks
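One way to think about the dependency goal is as a topological sort: a package is loaded only after everything it relies on. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The dependency table is purely illustrative, and this is not how syft's AST loader is actually implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency table: package -> packages that must be loaded first
dependencies = {
    "numpy": set(),
    "pandas": {"numpy"},
    "sklearn": {"numpy"},
    "xgboost": {"numpy", "sklearn"},
}

# static_order() yields every package after all of its dependencies
load_order = list(TopologicalSorter(dependencies).static_order())
print(load_order)  # numpy comes first, xgboost comes after sklearn
```

A cycle in the table would raise `graphlib.CycleError`, which is a reasonable point to report a misconfigured dependency list.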

My Contribution

My contribution focused mainly on building an automated script that generates the list of modules, classes, and methods of those classes, and extracts their return types from the code-base.
Example:
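class A:

    # The annotation is quoted because A is not yet defined at this point
    def __init__(self, name, age) -> "A":
        self.name = name
        self.age = age

    def ret_age(self) -> int:
        return self.age

    # No return annotation, so the type cannot be read from the code
    def ret_name(self):
        return self.name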
For the above class, the following entries would be generated:
{
    "A.__init__": "A",
    "A.ret_age": "int",
    "A.ret_name": "_syft_missing"
}
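A mapping like the one above could be produced along the following lines. This is only a sketch built on the standard `inspect` module, not the actual RedBack script. The helper name `build_allowlist` is hypothetical, and the example class is written out again as `A` so that the block runs on its own.

```python
import inspect

MISSING = "_syft_missing"  # placeholder used in the entries above


class A:
    def __init__(self, name, age) -> "A":
        self.name = name
        self.age = age

    def ret_age(self) -> int:
        return self.age

    def ret_name(self):
        return self.name


def build_allowlist(cls):
    """Map 'Class.method' to the name of its annotated return type."""
    entries = {}
    # vars(cls) holds only the attributes defined on the class itself
    for name, member in vars(cls).items():
        if not inspect.isfunction(member):
            continue
        annotation = inspect.signature(member).return_annotation
        if annotation is inspect.Signature.empty:
            type_name = MISSING  # nothing to read, needs manual follow-up
        elif isinstance(annotation, str):
            type_name = annotation  # quoted annotation such as "A"
        else:
            type_name = getattr(annotation, "__name__", str(annotation))
        entries[f"{cls.__name__}.{name}"] = type_name
    return entries


print(build_allowlist(A))
```

Methods that end up with the placeholder are the ones that still need a human to work out the return type.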
Along with this, a new folder named "_missing_return" would be generated. It would contain one Jupyter notebook for each class that has methods without return type annotations. This lets open-source contributors come together and experiment with code blocks in those notebooks to determine the return types of the methods.
Then, simply running update.py would extract the return types of these methods and update them in a configurable JSON file.
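The update step can be pictured roughly as follows. This is a simplified stand-in, not the real update.py: the function name `fill_missing` and the shape of the `observed` dictionary are assumptions made for illustration. The idea is that a value obtained by actually calling a method tells us its return type.

```python
import json

MISSING = "_syft_missing"


def fill_missing(allowlist, observed):
    """Replace placeholder entries with the type names of observed values."""
    updated = dict(allowlist)
    for key, value in observed.items():
        # Only touch entries that are still marked as missing
        if updated.get(key) == MISSING:
            updated[key] = type(value).__name__
    return updated


allowlist = {"A.ret_age": "int", "A.ret_name": MISSING}
# Hypothetical value a contributor got by calling the method in a notebook
observed = {"A.ret_name": "Alice"}

updated = fill_missing(allowlist, observed)
print(json.dumps(updated, indent=4))
```

Entries that already have a return type are left untouched, so the step can be rerun as more notebooks get filled in.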
Pull Requests

The following is the list of PRs I contributed to during GSoC 21:

RedBack: xgboost
[Redback]: Script Update
[Redback] Script updated for creating jupyter-notebooks for missing return types
[Redback] Moving out sklearn #5904

Future Plans

To integrate all the features present in sklearn, the next step is to work through the notebooks, adding code snippets and running them to obtain the missing return types. Along with this, a medium-term goal is to move all the third-party libraries that syft currently supports out of syft and into syft-libs. This would include libraries like PyDP, statsmodels, torch, torchvision, etc.
Acknowledgment

I'd like to express my gratitude to my mentors, Madhava Jay and Tudor Cebere, for always being there for me when I needed them. I am grateful to them for patiently reviewing my code, holding pair-programming sessions, and guiding me through the program. Their constructive feedback helped me improve throughout this period. I'd also like to express my appreciation to Arpitsinh Vaghela; it was a pleasure working with you!
Finally, I'd like to thank Andrew Trask, Patrick Cason, and the rest of the team, as well as Google, for this wonderful opportunity and for making the summer so enjoyable! 🙌