
High Performance Data Channel for Duet


GSoC Project Page

GSoC Project Proposal

Organization GitHub Repository

Personal (forked) GitHub Repository

Project Abstract

This project investigated possible alternatives for improving the data-channel performance of Duet, and further integrated those improvements into the PySyft stack.

Work Summary

Apache Arrow & Flight RPC

PR: OpenMined/PySyft#5877

This part of the work integrates Apache Arrow and its RPC-based network transport, Apache Flight.

Currently, all communication between the DS (data scientist) & DO (data owner) uses a WebRTC backend with protobuf serialization/deserialization. Here, client classes were created which establish a dual connection: a WebRTC connection used for all ordinary communication (messages, requests, etc.) and a Flight RPC connection used for high-bandwidth transfers of NumPy arrays and PyTorch tensors. The zero-copy nature of Flight RPC is what yields the performance improvement.
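The routing described above can be sketched as follows. This is an illustrative stand-in, not the PySyft API: the class name `DualChannelClient`, the `send` method, and the log lists are all assumptions; in the real code the two paths are an actual WebRTC data channel and a Flight RPC client.

```python
# Hypothetical sketch of the dual-connection routing: ordinary Syft
# messages stay on the WebRTC channel, while bulk NumPy/tensor payloads
# are diverted to the Flight RPC channel.

class DualChannelClient:
    BULK_TYPES = ("ndarray", "tensor")  # payload kinds routed via Flight

    def __init__(self):
        self.webrtc_log = []   # stands in for the WebRTC data channel
        self.flight_log = []   # stands in for the Flight RPC do_put path

    def send(self, kind: str, payload) -> str:
        if kind in self.BULK_TYPES:
            self.flight_log.append(payload)   # high-bandwidth path
            return "flight"
        self.webrtc_log.append(payload)       # messages, requests, etc.
        return "webrtc"

client = DualChannelClient()
assert client.send("request", {"op": "get"}) == "webrtc"
assert client.send("ndarray", [1.0, 2.0]) == "flight"
```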

Notes / Challenges

Special care was taken to ensure that the use of Flight RPC within the codebase is also zero-copy.

Flight RPC, as of now, only supports the transfer of one-dimensional arrays. To circumvent this, the tensor/array is flattened and an augmented context is sent across alongside it.
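The flatten-plus-context scheme can be illustrated with a pure-Python stand-in. The `pack`/`unpack` names and the exact contents of the context dict are assumptions for illustration; the real code operates on NumPy arrays and PyTorch tensors rather than nested lists.

```python
# A 2-D "tensor" is flattened into the 1-D payload Flight can carry,
# with shape/dtype sent alongside as context so the receiver can
# reconstruct the original layout.

def pack(matrix):
    rows, cols = len(matrix), len(matrix[0])
    flat = [x for row in matrix for x in row]              # 1-D payload
    context = {"shape": (rows, cols), "dtype": "float64"}  # sent alongside
    return flat, context

def unpack(flat, context):
    rows, cols = context["shape"]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

m = [[1.0, 2.0], [3.0, 4.0]]
flat, ctx = pack(m)
assert flat == [1.0, 2.0, 3.0, 4.0]
assert unpack(flat, ctx) == m
```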

Benchmarks / Performance

~3-4x faster transport, i.e. a 65-75% reduction in transfer time

      Transfer time (lower is better)          |   Arrow    |    WebRTC
      dtype: float16, shape: (100, 100)        |      88.5  |       62.5
      dtype: float16, shape: (200, 200)        |      68.1  |       96.3
      dtype: float16, shape: (500, 500)        |     311.4  |     1375.0
      dtype: float16, shape: (1000, 1000)      |     332.7  |     1674.7
      dtype: float32, shape: (100, 100)        |      56.5  |       57.1
      dtype: float32, shape: (200, 200)        |      68.8  |      132.2
      dtype: float32, shape: (500, 500)        |     184.8  |      681.9
      dtype: float32, shape: (1000, 1000)      |     537.0  |     2807.7
      dtype: float64, shape: (100, 100)        |      74.4  |       90.3
      dtype: float64, shape: (200, 200)        |      92.1  |      245.4
      dtype: float64, shape: (500, 500)        |     272.4  |     1302.5
      dtype: float64, shape: (1000, 1000)      |     926.8  |     5338.6
      dtype: float64, shape: (3000, 3000)      |    5447.7  |    21595.1


Compression

PR: OpenMined/PySyft#5917

This part of the work introduces various data-compression algorithms at the possible stages of transfer.

A compression module was developed which is initialized with a Duet instance and automatically syncs with the remote client every time any configuration parameter is changed. Support was also added for an automatic choice of configuration parameters based on the connection speed: if the connection is fast enough, compression is not necessarily worthwhile, whereas on a very slow connection the time spent on compression yields a much better overall performance.
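The speed-based auto-configuration can be sketched as a simple policy function. The function name `choose_compression`, the thresholds, and the returned dict keys are all illustrative assumptions, not the module's actual API; the codec names match the libraries the module integrates.

```python
# Illustrative speed-based auto-configuration: on a fast link
# compression is skipped, on a slow link a heavier codec is chosen
# because the time spent compressing pays for itself on the wire.

def choose_compression(mbps: float) -> dict:
    if mbps >= 100:          # fast link: wire time is cheap, skip it
        return {"enabled": False}
    if mbps >= 10:           # medium link: fast, light codec
        return {"enabled": True, "codec": "blosc", "level": 3}
    return {"enabled": True, "codec": "lzma", "level": 6}  # slow link

assert choose_compression(500)["enabled"] is False
assert choose_compression(50)["codec"] == "blosc"
assert choose_compression(1)["codec"] == "lzma"
```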

The module is extensible, so new compression algorithms are easy to add, and it is in line with the standard Python compression interfaces.

Bytes Compression (lossless)

The lzma and blosc compression libraries were integrated with the serialization and deserialization steps in PySyft.

This compression step is applied, with varying intensity, to all data sent over the WebRTC interface (after confirming that the client supports decompression).

With the automatic configuration parameters, an average compression of ~20-25% is observed on the raw bytes, yielding an overall improvement of ~15-18% in transfer speed.
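A minimal stand-in for the lossless bytes step, using only the standard-library lzma codec (blosc is a third-party library, so it is omitted here). The one-byte flag prefix and the function names are assumptions for illustration; as described above, the real module only compresses after the peer has advertised decompression support.

```python
import lzma

# Compress the serialized payload and prefix a one-byte flag so the
# peer knows whether decompression is needed; fall back to raw bytes
# when compression does not actually shrink the payload.

def compress_bytes(payload: bytes) -> bytes:
    packed = lzma.compress(payload)
    if len(packed) < len(payload):
        return b"\x01" + packed      # flag: compressed
    return b"\x00" + payload         # flag: sent raw

def decompress_bytes(blob: bytes) -> bytes:
    if blob[0:1] == b"\x01":
        return lzma.decompress(blob[1:])
    return blob[1:]

data = b"syft-tensor-bytes " * 64
assert decompress_bytes(compress_bytes(data)) == data
assert len(compress_bytes(data)) < len(data)   # repetitive data shrinks
```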

Tensor Compression (lossless & lossy)

For lossless tensor compression, a sparse compressor was added, which first checks the sparsity of the tensor before applying the compression step.
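The check-then-compress idea can be sketched in pure Python. The 50% sparsity threshold, the `(indices, values)` encoding, and the function names are illustrative assumptions; the real compressor works on tensors, not lists.

```python
# Measure sparsity first and only switch to an (index, value)
# representation when enough entries are zero; otherwise the dense
# form is kept, since sparse encoding would inflate it.

def sparse_compress(values, threshold=0.5):
    zeros = sum(1 for v in values if v == 0)
    if zeros / len(values) < threshold:
        return ("dense", values)                 # not sparse enough
    nz = [(i, v) for i, v in enumerate(values) if v != 0]
    return ("sparse", (len(values), nz))

def sparse_decompress(packed):
    kind, data = packed
    if kind == "dense":
        return data
    length, nz = data
    out = [0] * length
    for i, v in nz:
        out[i] = v
    return out

t = [0, 0, 3.5, 0, 0, 0, 1.0, 0]
assert sparse_compress(t)[0] == "sparse"
assert sparse_decompress(sparse_compress(t)) == t
```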

For lossy tensor compression, several algorithms were added, e.g. DGCCompressor [1] and TopKCompressor.
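In the spirit of TopKCompressor, the lossy step can be sketched as keeping only the k largest-magnitude entries along with their indices. The function names and encoding are assumptions for illustration, not the actual compressor interface.

```python
# Lossy top-k sketch: keep the k largest-magnitude entries and their
# indices; everything else is dropped and reconstructed as zero.

def topk_compress(values, k):
    idx = sorted(range(len(values)), key=lambda i: abs(values[i]),
                 reverse=True)[:k]
    return len(values), [(i, values[i]) for i in sorted(idx)]

def topk_decompress(length, kept):
    out = [0.0] * length
    for i, v in kept:
        out[i] = v
    return out

length, kept = topk_compress([0.1, -4.0, 0.2, 3.0], k=2)
assert kept == [(1, -4.0), (3, 3.0)]
assert topk_decompress(length, kept) == [0.0, -4.0, 0.0, 3.0]
```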

Gradient Compression (lossy)

An implementation from the DeepReduce paper [2] is integrated for gradient compression. It attaches a compressor object to each tensor in order to store past gradient information.
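The reason a compressor object needs to remember past gradient information is the error-feedback pattern common to sparse gradient methods: the information dropped by the lossy step is accumulated and folded into the next gradient. The sketch below illustrates that pattern; the class name and `step` method are assumptions, not the PySyft or DeepReduce API.

```python
# Error-feedback sketch: the compressor attached to a tensor keeps the
# residual (what the lossy top-k step dropped) and adds it back into
# the next gradient, so no information is lost permanently.

class ResidualCompressor:
    def __init__(self, size):
        self.residual = [0.0] * size   # past (uncompressed) information

    def step(self, grad, k=1):
        corrected = [g + r for g, r in zip(grad, self.residual)]
        # lossy step: keep only the k largest-magnitude entries
        keep = sorted(range(len(corrected)),
                      key=lambda i: abs(corrected[i]), reverse=True)[:k]
        sent = [v if i in keep else 0.0 for i, v in enumerate(corrected)]
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return sent

c = ResidualCompressor(3)
assert c.step([0.5, 2.0, 0.1]) == [0.0, 2.0, 0.0]
# the dropped 0.5 and 0.1 are remembered and folded into the next step
assert c.step([0.5, 0.0, 0.0]) == [1.0, 0.0, 0.0]
```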

Flight Data (bytes) Compression (lossless)

Extending the bytes-compression module into Flight RPC was also tested; however, combined with tensor compression no performance improvement was observed. This is plausible, since any sparsity within the tensor is already handled by the tensor-compression step. Hence, this part was discarded for now.


Chunking

A major part of the chunking research focused on data deduplication. A constant-time algorithm for data deduplication [3] was tested.

With the compression module disabled, a transfer-speed improvement of ~10-15% was observed; with the compression module enabled, however, the improvement was negligible (< 5%), possibly because a deduplication-like step already takes place within the bytes-compression stage.
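Chunk-level deduplication can be illustrated as follows. The cited algorithm [3] is about finding chunk boundaries in constant time; for brevity this sketch uses fixed-size chunks, and the `dedup_send` name and wire format are assumptions for illustration.

```python
import hashlib

# Split the byte stream into chunks and send only chunks whose hash
# the receiver has not seen before; duplicates become short references.

def dedup_send(payload: bytes, seen: set, chunk=8):
    wire = []
    for off in range(0, len(payload), chunk):
        piece = payload[off:off + chunk]
        digest = hashlib.sha256(piece).digest()
        if digest in seen:
            wire.append(("ref", digest))     # duplicate: reference only
        else:
            seen.add(digest)
            wire.append(("raw", piece))      # first occurrence: full data
    return wire

seen = set()
msg = b"ABCDEFGH" * 3                # three identical 8-byte chunks
wire = dedup_send(msg, seen, chunk=8)
kinds = [k for k, _ in wire]
assert kinds == ["raw", "ref", "ref"]  # only the first copy is sent raw
```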

Additionally, integrating a chunking algorithm into the codebase is not very extensible, since its effect is confined to OSI layers 6-7. For instance, adding a custom chunking algorithm within Flight RPC would require changes at the gRPC byte-handling level.

Hence, owing to the minor performance improvement and the integration complexity, the module was discarded.

Work Remaining

As a module, the Apache Flight RPC implementation is in working order; however, a blocker prevents its integration into the codebase.

Currently, Duet uses a WebRTC backend which has STUN/TURN implementations and integrations for establishing the connection. Flight RPC, however, uses a gRPC backend, and due to the dynamic nature of Duet it requires a similar mechanism to establish the connection.

A few options are being explored involving utilizing the ICE interface of WebRTC to establish the connection, each with some caveats.

Solving this blocker would enable many further developments. PySyft primarily has a server-client architecture, so a full gRPC network backend is a welcome choice. This would also allow PySyft to move its network stack into Rust via tonic, enabling much more performant communication.

Other Work

Jupyter Widget

(A basic implementation extended from the work by avinsit123; halted due to uncertainty about its usability after the current codebase changes.)



[1] Hang Xu, Chen-Yu Ho, Ahmed M Abdelmoniem, Aritra Dutta, El Houcine Bergou, Konstantinos Karatsenidis, Marco Canini, and Panos Kalnis. GRACE: A Compressed Communication Framework for Distributed Machine Learning. In Proc. of ICDCS, 2021.

[2] Kostopoulou K, Xu H, Dutta A, et al. DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning[J]. arXiv preprint arXiv:2102.03112, 2021.

[3] MyungKeun Yoon. A constant-time chunking algorithm for packet-level deduplication.
