- Progress has been made on porting tests from the MLPerf Inference Benchmark suite to C++.
- I have not started on the gem5 model yet; I plan to do so over the remainder of the winter break.
Research:
- The MLPerf Inference reference implementation supports multiple Python ML frameworks, such as TensorFlow, PyTorch, and ONNX.
- Using tools like Cython/Nuitka to port the reference implementation 'as-is' led to many issues (e.g., missing dependencies).
- I opted instead to look into the Python ML frameworks mentioned above and see whether they offer a C/C++ API.
- TensorFlow and PyTorch both offer a C++ API, which makes it possible to write pure C++ code that imports ML models and runs inference on them.
Implementation:
- I am currently using PyTorch's C++ API (LibTorch) to write the inference tests in pure C++.
- Code Demo
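A minimal sketch of what these tests look like, assuming a ResNet50 model already exported to TorchScript from Python (via `torch.jit.trace`); the file name and CLI usage are illustrative, not the actual test code:

```cpp
// classify.cc -- minimal LibTorch inference sketch (illustrative).
#include <torch/script.h>

#include <iostream>
#include <vector>

int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: classify <path-to-torchscript-model>\n";
    return 1;
  }

  // Load a model previously exported from Python with torch.jit.trace.
  torch::jit::script::Module module;
  try {
    module = torch::jit::load(argv[1]);
  } catch (const c10::Error& e) {
    std::cerr << "error loading the model: " << e.what() << '\n';
    return 1;
  }
  module.eval();

  // Disable autograd bookkeeping; we only run inference.
  torch::NoGradGuard no_grad;

  // Dummy input with ResNet50's expected shape: 1x3x224x224 (NCHW).
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::randn({1, 3, 224, 224}));

  // Forward pass; for ResNet50 the output is a 1x1000 score tensor.
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << "top-1 class index: " << output.argmax(1).item<int64_t>()
            << '\n';
  return 0;
}
```

The standard `find_package(Torch REQUIRED)` CMake flow shipped with the libtorch distribution is enough to build this.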
Research:
- The MLPerf Inference submission system contains a system under test (SUT), the Load Generator (LoadGen), a dataset, and an accuracy script. The dataset, LoadGen, and accuracy script are fixed for all submissions and are provided by MLPerf. Submitters implement the SUT according to their architecture's requirements and engineering judgment (see the sketch after the benchmark table below).
- For the purposes of our research, we will focus on porting the Edge-device suite offered by MLPerf:
Task | Model | Dataset | QSL Size | Quality | Required Scenarios | Reference App | Framework |
---|---|---|---|---|---|---|---|
Image Classification | ResNet50-v1.5 | ImageNet (224x224) | 1024 | 99% of FP32 (76.46%) | Single Stream, Offline | Link | tensorflow, pytorch, onnx |
Object Detection (large) | SSD-ResNet34 | COCO (1200x1200) | 64 | 99% of FP32 (0.20 mAP) | Single Stream, Offline | Link | tensorflow, pytorch, onnx |
Object Detection (small) | SSD-MobileNets-v1 | COCO (300x300) | 256 | 99% of FP32 (0.22 mAP) | Single Stream, Offline | Link | tensorflow, pytorch, onnx |
Medical Image Segmentation | 3D UNET | BraTS 2019 (224x224x160) | 16 | 99% of FP32 and 99.9% of FP32 (0.85300 mean DICE score) | Single Stream, Offline | Link | tensorflow(?), pytorch, onnx (?) |
Speech-to-Text | RNNT | Librispeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 - WER, where WER=7.452253714852645%) | Single Stream, Offline | Link | tensorflow (?), pytorch, onnx (?) |
Language Processing | BERT | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 (f1_score=90.874%) | Single Stream, Offline | Link | pytorch |
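To make the SUT/LoadGen split concrete, here is a rough C++ sketch of the two interfaces a submitter implements, based on LoadGen's `system_under_test.h` and `query_sample_library.h` headers. The class names, sample counts, and empty responses are placeholders, and the exact virtual-method signatures vary somewhat between LoadGen versions (some versions also require overriding e.g. `ReportLatencyResults`):

```cpp
// sut_sketch.cc -- skeleton of a LoadGen-driven SUT (illustrative).
#include <string>
#include <vector>

#include "loadgen.h"
#include "query_sample_library.h"
#include "system_under_test.h"

// Minimal SUT: answers every sample immediately with an empty response.
// A real implementation would run the LibTorch model here.
class DummySut : public mlperf::SystemUnderTest {
 public:
  const std::string& Name() override { return name_; }

  void IssueQuery(const std::vector<mlperf::QuerySample>& samples) override {
    std::vector<mlperf::QuerySampleResponse> responses;
    responses.reserve(samples.size());
    for (const auto& s : samples) {
      // data/size would normally point at the inference output buffer.
      responses.push_back({s.id, 0, 0});
    }
    // Hands the results (and their completion timestamps) back to LoadGen.
    mlperf::QuerySamplesComplete(responses.data(), responses.size());
  }

  void FlushQueries() override {}

 private:
  std::string name_{"DummySut"};
};

// Minimal QSL: pretends the whole dataset already sits in memory.
class DummyQsl : public mlperf::QuerySampleLibrary {
 public:
  const std::string& Name() override { return name_; }
  // 1024 matches the ImageNet QSL size from the table above.
  size_t TotalSampleCount() override { return 1024; }
  size_t PerformanceSampleCount() override { return 1024; }
  void LoadSamplesToRam(
      const std::vector<mlperf::QuerySampleIndex>&) override {}
  void UnloadSamplesFromRam(
      const std::vector<mlperf::QuerySampleIndex>&) override {}

 private:
  std::string name_{"DummyQsl"};
};
```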
- There are four evaluation scenarios in MLPerf Inference, selected to represent critical real-world inference applications: (1) Single Stream, (2) Multi-Stream, (3) Server, and (4) Offline. The two we need are summarized below (a configuration sketch follows the table):
Scenario | Query Generation | Duration | Samples/query | Latency Constraint | Tail Latency | Performance Metric |
---|---|---|---|---|---|---|
Single stream | LoadGen sends the next query as soon as the SUT completes the previous one | 1024 queries and 60 seconds | 1 | None | 90% | 90th-percentile measured latency
Offline | LoadGen sends all queries to the SUT at the start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput
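These two scenarios map directly onto LoadGen's `TestSettings`. A minimal configuration sketch, assuming the field names from LoadGen's `test_settings.h` (they may differ slightly across versions) and the dummy SUT/QSL classes sketched earlier:

```cpp
// run_sketch.cc -- configuring the two scenarios we target (illustrative).
#include "loadgen.h"
#include "test_settings.h"

void run_single_stream(mlperf::SystemUnderTest* sut,
                       mlperf::QuerySampleLibrary* qsl) {
  mlperf::TestSettings settings;
  settings.scenario = mlperf::TestScenario::SingleStream;
  settings.mode = mlperf::TestMode::PerformanceOnly;
  // Single stream: at least 1024 queries and 60 s, per the table above.
  settings.min_query_count = 1024;
  settings.min_duration_ms = 60000;

  mlperf::LogSettings log_settings;  // defaults write the mlperf_log_* files
  mlperf::StartTest(sut, qsl, settings, log_settings);
}

void run_offline(mlperf::SystemUnderTest* sut,
                 mlperf::QuerySampleLibrary* qsl) {
  mlperf::TestSettings settings;
  settings.scenario = mlperf::TestScenario::Offline;
  settings.mode = mlperf::TestMode::PerformanceOnly;
  // Offline: LoadGen issues one large query up front; the expected QPS
  // is used to size it (it must cover at least 24,576 samples).
  settings.offline_expected_qps = 1000;
  settings.min_duration_ms = 60000;

  mlperf::LogSettings log_settings;
  mlperf::StartTest(sut, qsl, settings, log_settings);
}
```

Note that `offline_expected_qps` only tells LoadGen how many samples to pack into the single Offline query; the reported metric is the throughput actually measured.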