- Progress has been made on porting tests from the MLPerf Inference Benchmark suite to C++.
- I have not started on the gem5 model yet; I plan to do so over the remainder of the winter break.
Research:
- The MLPerf Inference reference implementation supports multiple Python ML frameworks, such as TensorFlow, PyTorch, and ONNX.
- Using tools like Cython/Nuitka to port the reference implementation 'as-is' led to many issues (e.g., missing dependencies).
- I opted instead to look into the Python ML frameworks mentioned above and see whether they offer a C/C++ API.
- TensorFlow and PyTorch both offer a C++ API, which makes it possible to write pure C++ code that imports ML models and runs inference on them.
Implementation:
- I am currently using PyTorch's C++ API (LibTorch) to write the inference tests in pure C++.
- Code Demo
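A minimal sketch of what these tests look like, assuming a ResNet50 model already exported to TorchScript from Python (via `torch.jit.trace`); the file name and CLI usage are illustrative, not the actual test code:

```cpp
// classify.cc -- minimal LibTorch inference sketch (illustrative).
#include <torch/script.h>

#include <iostream>
#include <vector>

int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: classify <path-to-torchscript-model>\n";
    return 1;
  }

  // Load a model previously exported from Python with torch.jit.trace.
  torch::jit::script::Module module;
  try {
    module = torch::jit::load(argv[1]);
  } catch (const c10::Error& e) {
    std::cerr << "error loading the model: " << e.what() << '\n';
    return 1;
  }
  module.eval();

  // Disable autograd bookkeeping; we only run inference.
  torch::NoGradGuard no_grad;

  // Dummy input with ResNet50's expected shape: 1x3x224x224 (NCHW).
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::randn({1, 3, 224, 224}));

  // Forward pass; for ResNet50 the output is a 1x1000 score tensor.
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << "top-1 class index: " << output.argmax(1).item<int64_t>()
            << '\n';
  return 0;
}
```

The standard `find_package(Torch REQUIRED)` CMake flow shipped with the libtorch distribution is enough to build this.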
Research:
- The MLPerf Inference submission system contains a system under test (SUT), the Load Generator (LoadGen), a dataset, and an accuracy script. The dataset, LoadGen, and accuracy script are fixed for all submissions and are provided by MLPerf. Submitters implement the SUT according to their architecture's requirements and engineering judgment (see the sketch after the benchmark table below).
- For the purposes of our research, we will focus on porting the Edge-device suite offered by MLPerf:
Task | Model | Dataset | QSL Size | Quality | Required Scenarios | Reference App | Framework |
---|---|---|---|---|---|---|---|
Image Classification | ResNet50-v1.5 | ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | Single Stream, Offline | Link | tensorflow, pytorch, onnx |
Object Detection (large) | SSD-ResNet34 | COCO (1200x1200) | 64 | 99% of FP32 (0.20 mAP) | Single Stream, Offline | Link | tensorflow, pytorch, onnx |
Object Detection (small) | SSD-MobileNets-v1 | COCO (300x300) | 256 | 99% of FP32 (0.22 mAP) | Single Stream, Offline | Link | tensorflow, pytorch, onnx |
Medical Image Segmentation | 3D UNET | BraTS 2019 (224x224x160) | 16 | 99% of FP32 and 99.9% of FP32 (0.85300 mean DICE score) | Single Stream, Offline | Link | tensorflow(?), pytorch, onnx (?) |
Speech-to-Text | RNNT | Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | Single Stream, Offline | Link | tensorflow (?), pytorch, onnx (?) |
Language Processing | BERT | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 (f1_score=90.874%) | Single Stream, Offline | Link | pytorch |
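To make the SUT/LoadGen split concrete, here is a rough C++ sketch of the two interfaces a submitter implements, based on LoadGen's `system_under_test.h` and `query_sample_library.h` headers. The class names, sample counts, and empty responses are placeholders, and the exact virtual-method signatures vary somewhat between LoadGen versions (some versions also require overriding e.g. `ReportLatencyResults`):

```cpp
// sut_sketch.cc -- skeleton of a LoadGen-driven SUT (illustrative).
#include <string>
#include <vector>

#include "loadgen.h"
#include "query_sample_library.h"
#include "system_under_test.h"

// Minimal SUT: answers every sample immediately with an empty response.
// A real implementation would run the LibTorch model here.
class DummySut : public mlperf::SystemUnderTest {
 public:
  const std::string& Name() override { return name_; }

  void IssueQuery(const std::vector<mlperf::QuerySample>& samples) override {
    std::vector<mlperf::QuerySampleResponse> responses;
    responses.reserve(samples.size());
    for (const auto& s : samples) {
      // data/size would normally point at the inference output buffer.
      responses.push_back({s.id, 0, 0});
    }
    // Hands the results (and their completion timestamps) back to LoadGen.
    mlperf::QuerySamplesComplete(responses.data(), responses.size());
  }

  void FlushQueries() override {}

 private:
  std::string name_{"DummySut"};
};

// Minimal QSL: pretends the whole dataset already sits in memory.
class DummyQsl : public mlperf::QuerySampleLibrary {
 public:
  const std::string& Name() override { return name_; }
  // 1024 matches the ImageNet QSL size from the table above.
  size_t TotalSampleCount() override { return 1024; }
  size_t PerformanceSampleCount() override { return 1024; }
  void LoadSamplesToRam(
      const std::vector<mlperf::QuerySampleIndex>&) override {}
  void UnloadSamplesFromRam(
      const std::vector<mlperf::QuerySampleIndex>&) override {}

 private:
  std::string name_{"DummyQsl"};
};
```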
- There are four evaluation scenarios in MLPerf Inference, selected to represent critical real-world inference applications: (1) Single Stream, (2) Multi-Stream, (3) Server, and (4) Offline. The two we need are summarized below (a configuration sketch follows the table):
Scenario | Query Generation | Duration | Samples/query | Latency Constraint | Tail Latency | Performance Metric |
---|---|---|---|---|---|---|
Single stream | LoadGen sends the next query as soon as the SUT completes the previous one | 1024 queries and 60 seconds | 1 | None | 90% | 90th-percentile measured latency
Offline | LoadGen sends all queries to the SUT at the start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput
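These two scenarios map directly onto LoadGen's `TestSettings`. A minimal configuration sketch, assuming the field names from LoadGen's `test_settings.h` (they may differ slightly across versions) and the dummy SUT/QSL classes sketched earlier:

```cpp
// run_sketch.cc -- configuring the two scenarios we target (illustrative).
#include "loadgen.h"
#include "test_settings.h"

void run_single_stream(mlperf::SystemUnderTest* sut,
                       mlperf::QuerySampleLibrary* qsl) {
  mlperf::TestSettings settings;
  settings.scenario = mlperf::TestScenario::SingleStream;
  settings.mode = mlperf::TestMode::PerformanceOnly;
  // Single stream: at least 1024 queries and 60 s, per the table above.
  settings.min_query_count = 1024;
  settings.min_duration_ms = 60000;

  mlperf::LogSettings log_settings;  // defaults write the mlperf_log_* files
  mlperf::StartTest(sut, qsl, settings, log_settings);
}

void run_offline(mlperf::SystemUnderTest* sut,
                 mlperf::QuerySampleLibrary* qsl) {
  mlperf::TestSettings settings;
  settings.scenario = mlperf::TestScenario::Offline;
  settings.mode = mlperf::TestMode::PerformanceOnly;
  // Offline: LoadGen issues one large query up front; the expected QPS
  // is used to size it (it must cover at least 24,576 samples).
  settings.offline_expected_qps = 1000;
  settings.min_duration_ms = 60000;

  mlperf::LogSettings log_settings;
  mlperf::StartTest(sut, qsl, settings, log_settings);
}
```

Note that `offline_expected_qps` only tells LoadGen how many samples to pack into the single Offline query; the reported metric is the throughput actually measured.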