Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
GSoC-2021

High Performance Data Channel for Duet

Info

GSoC Project Page

GSoC Project Proposal

Organization GitHub Repository

Personal (forked) GitHub Repository

Project Abstract

This project was aimed at investigating the performance of the possible alternatives for improving the data channel performance for duet, and further integrating those improvements into the PySyft stack.

Work Summary

Apache Arrow & Flight RPC

PR: OpenMined/PySyft#5877

This part of the work was to integrate Apache Arrow and its corresponding backend RPC-based network communication Apache Flight.

Currently, for all communication between the DS & DO, a WebRTC-backend is used with protobuf serialization-deserialization. Here, client classes were created which would establish a duel connection, a WebRTC connection which is used for all usual communication (messages, requests etc.) and a Flight RPC connection which is used for high-bandwidth requirement transfers of NumPy Arrays and PyTorch Tensors. This allows for an improvement in performance due to the zero-copy nature of Flight RPC.

Notes / Challenges

Special care was taken to ensure that the use of Flight RPC within the code-base is also zero-copy.

Flight RPC, as of now, only supports transfer of one-dimensional arrays, to circumvent this, the tensor/array is flattened and an augmented context is also sent accross.

Benchmarks / Performance

~ 3-4x faster transport i.e. 65-75% transfer speed improvement

                                               |   Arrow    |    WebRTC  
------------------------------------------------------------------------
      dtype: float16, shape: (100, 100)        |      88.5  |       62.5
      dtype: float16, shape: (200, 200)        |      68.1  |       96.3
      dtype: float16, shape: (500, 500)        |     311.4  |     1375.0
      dtype: float16, shape: (1000, 1000)      |     332.7  |     1674.7
      dtype: float32, shape: (100, 100)        |      56.5  |       57.1
      dtype: float32, shape: (200, 200)        |      68.8  |      132.2
      dtype: float32, shape: (500, 500)        |     184.8  |      681.9
      dtype: float32, shape: (1000, 1000)      |     537.0  |     2807.7
      dtype: float64, shape: (100, 100)        |      74.4  |       90.3
      dtype: float64, shape: (200, 200)        |      92.1  |      245.4
      dtype: float64, shape: (500, 500)        |     272.4  |     1302.5
      dtype: float64, shape: (1000, 1000)      |     926.8  |     5338.6
      dtype: float64, shape: (3000, 3000)      |    5447.7  |    21595.1

Compression

PR: OpenMined/PySyft#5917

This part of the work was to introduce various data-compression algorithms at possible stages of transfer.

compression module was developed which is initialized with a duet instance and automatically syncs with the remote-client everytime any configuration parameter is changed. Support was also added to enable automatic choice of configuration parameters based on the connection speed (i.e. if the connection speed is high enough, compression is not necessarily required, however, if the connection speed is very slow, the time-spent on compression would result in a much-better performace).

The module is extensible so as to ease the addition of new compression algorithms and in-line with standard python-compression structures.

Bytes Compression (lossless)

lzma and blosc compression libs were integrated with the serialization and de-serialization steps in PySyft.

This compression step is applied to all data (with varying intensity) sent over the WebRTC interface (after confirmation that the client has support for de-compression).

With the automatic configuration paramteres, an average compression of ~20-25% is observed (raw bytes), overall an improvement of ~15-18% is observed in terms of the transfer speed.

Tensor Compression (lossless & lossy)

For lossless tensor compression, a sparse-compressor is added, which first checks for the sparsity of the tensor before applying the compression step.

For lossy tensor compression algorithms: DGCCompressor [1], TopKCompressor, etc.

Gradient Compression (lossy)

An implementation from the deep-reduce paper [2] is integrated for gradient compression. For a tensor, this attaches a compressor object so as to store the past gradient information.

Flight Data (bytes) Compression (lossless)

The bytes compression module was tested to be extended within Flight RPC, however with the tensor compression no performance improvement was observed. This is plausible since in-case of any sparsity within the tensor, the tensor compression would automatically handle it. Hence, this part was discarded for now.

Chunking

A major part of the chunking algorithms research is focused on data-deduplication. An constant-time algorithm for data-deduplication [3] was tested.

With the compression module disabled, a transfer speed improvement of ~10-15% was observed, however with the compression module enabled, the transfer speed improvement was negligible (< 5%), possibly due to the de-deuplication step that takes place within the bytes compression step.

Additionally, integration of a chunking algorithm within the codebase is not very extensible due to its effect only within the network layer 6-7. For instance, addition of a custom-chunking algorithm within the Flight RPC would require changes on the gRPC-bytes interaction level.

Hence, due to only a minor performance improvement and integration complexities, the module was discarded.

Work Remaining

As a module, the Apache Flight RPC implementation is in working order, however, a blocker prevents its integration to enable usage within the codebase.

Currently, for Duet, a WebRTC backend is used which has STUN/TURN implementations & integrations for establishing the connection. However, Flight RPC utilizes a gRPC backend, and due to the dynamic nature of Duet, it requires a similar implementation to establish the connection.

A few options are being explored involving utilizing the ICE interface of WebRTC to establish the connection, each with some caveats.

Solving this blocker would enable a lot more developments. PySyft primarily has a server-client architecture, hence, an entire gRPC-network-backend is a welcoming choice. This would also allow PySyft to move the network-stack into Rust utilizing tonic, enabling a much performant communication.

Other Work

Jupyter Widget

(basic implementation extended from the work by avinsit123: halted due to unsurety about its usability after the current codebase changes)

Branch: https://github.com/vsquareg/PySyft/tree/duet-widget

References

[1] Hang Xu, Chen-Yu Ho, Ahmed M Abdelmoniem, Aritra Dutta, El Houcine Bergou, Konstantinos Karatsenidis, Marco Canini, and Panos Kalnis. GRACE: A Compressed Communication Framework for Distributed Machine Learning. In Proc. of ICDCS, 2021.

[2] Kostopoulou K, Xu H, Dutta A, et al. DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning[J]. arXiv preprint arXiv:2102.03112, 2021.

[3] MyungKeun Yoon, A constant-time chunking algorithm for packet-level deduplication, https://doi.org/10.1016/j.icte.2018.05.00

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment