GSoC'21 Work Product Report


Name: Arpitsinh Vaghela
Project: Integrating Pandas into Syft
Organisation: OpenMined
Mentors: Madhava Jay, Tudor Cebere

About the project

Integrating a library into syft requires:

  • Protobufs for the types that are communicated between nodes.
  • A wrapper class for each of these types, with methods to serialize (object2proto) and deserialize (proto2object).
  • A list of the modules, classes, and functions/methods to support.
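
The wrapper requirement can be illustrated with a minimal sketch. Everything here is hypothetical: `Point` is an invented type, and `PointProto` is a plain dataclass standing in for a message class that would normally be generated from a .proto file. Only the method names object2proto and proto2object come from the description above.

```python
import json
from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical type that has to travel between nodes."""
    x: float
    y: float


@dataclass
class PointProto:
    """Stand-in for a generated protobuf message (assumption, not a real proto)."""
    payload: bytes


class PointWrapper:
    """Pairs the type with its serialize and deserialize methods."""

    @staticmethod
    def object2proto(obj: Point) -> PointProto:
        # Encode the object's fields into the wire message.
        data = {"x": obj.x, "y": obj.y}
        return PointProto(payload=json.dumps(data).encode("utf-8"))

    @staticmethod
    def proto2object(proto: PointProto) -> Point:
        # Rebuild the original object from the wire message.
        fields = json.loads(proto.payload.decode("utf-8"))
        return Point(x=fields["x"], y=fields["y"])


original = Point(1.5, -2.0)
restored = PointWrapper.proto2object(PointWrapper.object2proto(original))
```

A round trip through both methods should give back an object equal to the original, which is the property each wrapper has to preserve.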

Goals

The process described above is tedious and time-consuming, so parts of it were automated. The goals of the project were:

  • Allow library support to be defined outside the syft codebase wherever possible
  • Provide tooling that generates defaults automatically
  • Allow opt-in importing of a target library
  • Package support should be definable in a JSON-like configuration file
  • Build a separate denylist of all known libraries/methods that are potentially insecure, and prevent their use by default

This would make adding support for any library to syft easier.
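
As a rough sketch of what a JSON-defined package could look like, the snippet below parses a small config into the pieces listed earlier (modules, classes, and methods with their return types). The schema, the key names, and the library `examplelib` are assumptions made up for illustration; the actual config format used by the project may differ.

```python
import json

# Hypothetical config for an invented library called "examplelib".
CONFIG_TEXT = """
{
  "lib": "examplelib",
  "modules": ["examplelib", "examplelib.frame"],
  "classes": ["examplelib.frame.Frame"],
  "methods": [
    ["examplelib.frame.Frame.sum", "syft.lib.python.Float"],
    ["examplelib.frame.Frame.to_dict", "syft.lib.python.Dict"]
  ]
}
"""


def load_support(config_text):
    """Parse the config and return (lib, modules, classes, methods)."""
    config = json.loads(config_text)
    # Each method entry becomes a (method_path, return_type) tuple.
    methods = [(path, return_type) for path, return_type in config["methods"]]
    return config["lib"], config["modules"], config["classes"], methods


lib, modules, classes, methods = load_support(CONFIG_TEXT)
```

Keeping the definition in data rather than code is what lets a support package live outside the core codebase.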

Contributions

  1. Move statsmodels support out of Syft core

    • Move library support for statsmodels from syft.lib into its own new package packages/syft-libs/syft-statsmodels.
    • Add library support to the syft AST from a JSON config file.
    • Add a PyScaffold extension that generates library support packages (syft-XYZ) with a custom directory structure.
  2. Add Denylist Utilities

    • Allow syft to internally deny methods and classes that may give rise to security issues.
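
One simple way to apply a denylist is to drop every entry whose path matches a denied path or sits underneath it. The sketch below shows that idea only; the function names, the denylist entries, and `examplelib` are hypothetical and are not the utilities that were added to syft.

```python
# Hypothetical denylist: exact paths, or prefixes covering everything below them.
DENYLIST = {
    "examplelib.io.read_pickle",  # assumed unsafe: unpickling can run arbitrary code
    "examplelib.unsafe",          # assumed unsafe: a whole module is off limits
}


def is_denied(path, denylist=DENYLIST):
    """True if path equals a denied entry or lives below one."""
    return any(
        path == entry or path.startswith(entry + ".")
        for entry in denylist
    )


def filter_allowed(entries, denylist=DENYLIST):
    """Keep only the (method_path, return_type) tuples that are not denied."""
    return [(path, rt) for path, rt in entries if not is_denied(path, denylist)]


entries = [
    ("examplelib.frame.Frame.sum", "syft.lib.python.Float"),
    ("examplelib.io.read_pickle", "examplelib.frame.Frame"),
    ("examplelib.unsafe.eval_expr", "syft.lib.python.Dict"),
    ("examplelib.unsafely_named", "syft.lib.python.Int"),
]
allowed = filter_allowed(entries)
```

Matching on `entry + "."` rather than a bare prefix keeps `examplelib.unsafely_named` from being caught by the `examplelib.unsafe` entry.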
  3. CI to test external lib support packages and Meta Package

    • CI to test library support packages.
    • A meta package that, when installed, installs all support packages in packages/syft-libs.
    $ pip install syft-lib
    # installs syft-pandas, syft-xgboost ...
  4. Union and Primitive Type Support

    To support a method/function, the syft AST requires a (method_path, return_type) tuple.

    • If a method returns a Python primitive, the return_type path is expected to point to the syft equivalent (for example, syft.lib.python.Dict rather than dict); this conversion was automated.
    • If the return type is a Union, it is expected to be an instance of UnionGenerator, e.g., Union[int, float] => UnionGenerator[syft.lib.python.Int, syft.lib.python.Float]; this conversion was automated.
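
The two conversions above can be sketched with the standard `typing` helpers. This is an illustration, not the project's script: the function name `return_type_path` is made up, nothing from syft is imported, and the syft paths appear only as strings copied from the examples in the text.

```python
from typing import Union, get_args, get_origin

# Only the mappings mentioned in the text; a real table would cover more primitives.
PRIMITIVE_PATHS = {
    int: "syft.lib.python.Int",
    float: "syft.lib.python.Float",
    dict: "syft.lib.python.Dict",
}


def return_type_path(annotation):
    """Turn a return annotation into the string path used in a config entry."""
    if get_origin(annotation) is Union:
        # Convert each member of the Union, then wrap them in UnionGenerator[...].
        members = ", ".join(return_type_path(arg) for arg in get_args(annotation))
        return f"UnionGenerator[{members}]"
    if annotation in PRIMITIVE_PATHS:
        return PRIMITIVE_PATHS[annotation]
    # Anything else keeps its own fully qualified path.
    return f"{annotation.__module__}.{annotation.__qualname__}"


union_path = return_type_path(Union[int, float])
dict_path = return_type_path(dict)
```

Because the function recurses on the Union members, primitives nested inside a Union are converted by the same table.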
  5. Generate exploration Notebooks within the package

    To automate the process of generating the lib AST, the paths to classes, modules, and methods, along with their return_type, are autogenerated by a script. If the script fails to get a return_type for a function, it creates notebooks to help retrieve the return_type dynamically. Running an update script then updates the config JSON based on these notebooks.

    • Updated the extension to add all these exploration notebooks to the _missing_return directory.
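
A minimal sketch of the discovery step, assuming return types are read from type annotations: methods with a return annotation go straight into the config, and the rest are collected as candidates for an exploration notebook. The class `Series` and the function `split_by_return_type` are hypothetical, and the real script may use a different strategy.

```python
import inspect
from typing import get_type_hints


class Series:
    """Hypothetical target class: one annotated method, one unannotated."""

    def sum(self) -> float:
        return 0.0

    def head(self, n=5):
        return self


def split_by_return_type(cls):
    """Return ({path: return_type}, [paths with no return annotation])."""
    found, missing = {}, []
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        path = f"{cls.__name__}.{name}"
        hints = get_type_hints(func)
        if "return" in hints:
            found[path] = hints["return"]
        else:
            missing.append(path)  # would need a notebook to resolve
    return found, missing


found, missing = split_by_return_type(Series)
```

Here `Series.sum` resolves statically, while `Series.head` ends up in the missing list and would have to be resolved by running it.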
  6. Long Short Path

    A class/function can often be accessed through multiple paths. If a method of a class is missing its return_type, one would otherwise have to update the return_type under every path through which the class/function can be accessed.

    • Updated the JSON generation script and the update script to propagate return_type changes made on the original path to all of these paths.
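
To illustrate the multiple-path problem, the sketch below resolves each path to the object it names and copies a known return_type to every other path that resolves to the same object. It uses the standard library's `json.JSONDecoder`, which is also reachable as `json.decoder.JSONDecoder`. The helper names and the flat `{path: return_type}` config shape are assumptions, not the project's scripts.

```python
import importlib


def resolve(path):
    """Import the longest module prefix of path, then walk the remaining attributes."""
    parts = path.split(".")
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue  # prefix is not a module; try a shorter one
        for attr in parts[i:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {path}")


def propagate_return_types(config):
    """Fill missing return types from any alias path that has one."""
    known = {}
    for path, return_type in config.items():
        if return_type is not None:
            known[id(resolve(path))] = return_type
    return {
        path: rt if rt is not None else known.get(id(resolve(path)))
        for path, rt in config.items()
    }


config = {
    "json.decoder.JSONDecoder.decode": "typing.Any",  # long (original) path
    "json.JSONDecoder.decode": None,                  # short alias, still missing
}
updated = propagate_return_types(config)
```

Comparing resolved objects rather than path strings is what lets the short and long paths be treated as one entry.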
  7. Recursive Wrapper (Open)

    • Generate a wrapper class from the attributes of a type/class, without needing to create a proto for the type.
  8. Add Support for Pandas (Open)

  9. Add Support for Petlib and Opacus (Open)

Future Improvements/Work

  • Add support for all pandas Indexes and Indexers (e.g., _LocIndexer).
  • Add support for Window, Groupby, and Resampling.