Machine Learning with Bela & IREE - GSoC 2022

This project took place over the summer of 2022 as part of the Google Summer of Code, with support from the Intelligent Instruments Lab, the Beagleboard Foundation, and Bela. The project's objective was to improve the tooling available for those looking to use machine learning models in their Bela projects. For some background, Bela is a maker platform built on top of the BeagleBone Black with a focus on real-time audio and sensor processing, used in interactive art projects including digital instrument design. Making machine learning tools readily usable with Bela would open up new design practices that incorporate machine learning models.

The original goal of this project arose from two constraints: Bela's (in machine learning terms) low-powered processor, and the real-time demands of interactive projects. Developing on this platform calls for performance analysis tools that allow quick evaluation of different models on Bela, and this project built some tools for that purpose, including benchmarking and profiling utilities.

Although not originally a focus, this project also worked on supporting the Intermediate Representation Execution Environment (IREE) on Bela. IREE is part of the larger MLIR compiler infrastructure project and includes both a machine learning compiler and a runtime. In short, MLIR is organized around dialects, which are themselves collections of operations; the passes within MLIR-based compilers translate between dialects. The MLIR infrastructure allows different compilers to reuse optimizations and compiler passes on (and between) new dialects without starting from scratch. IREE uses MLIR to build a compiler that goes all the way down to scheduling workloads, paired with a lightweight Hardware Abstraction Layer for the runtime to run on. I didn't know a thing about machine learning compilers 4 months ago, so I am not the best person to explain them; I would highly recommend checking out the original MLIR paper as well as the TinyIREE paper (both linked below) for more in-depth background.

IREE supports multiple code generation targets from the compiler, including portable IREE virtual machine bytecode, LLVM IR, and C source code. The main advantages I see with IREE are its portability across hardware platforms (including bare-metal targets), parallelization on platforms that support it, and the multiple frontends available for importing models (although some are in very early stages - see Torch-MLIR). Beyond the Bela/BeagleBone Black, I do think that if embedded in a more portable audio application (Pure Data, VST, CLAP, etc.), the IREE runtime could be a lightweight way of running machine learning models in different kinds of audio (or other multimedia) projects while still being able to take advantage of larger multiprocessor systems.
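
As a concrete sketch of these compile paths, here is roughly what the three output formats look like. This is not taken from the project itself: the model and file names are hypothetical, and flag spellings are from recent iree-compile builds (older builds used slightly different names, e.g. --iree-llvm-target-triple):

```sh
# Compile a TOSA-dialect model to portable VM bytecode for a 32-bit ARM CPU.
# The target triple below matches the BeagleBone Black's Cortex-A8 (assumption).
iree-compile model.mlir \
  --iree-input-type=tosa \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-triple=armv7a-pc-linux-gnueabihf \
  --output-format=vm-bytecode \
  -o model.vmfb

# Same model via the interpreted VMVX backend (no native code generation).
iree-compile model.mlir \
  --iree-input-type=tosa \
  --iree-hal-target-backends=vmvx \
  -o model_vmvx.vmfb

# EmitC path: emit C source to be compiled into the binary ahead of time.
iree-compile model.mlir \
  --iree-input-type=tosa \
  --iree-hal-target-backends=llvm-cpu \
  --output-format=vm-c \
  -o module.c
```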

[IREE architecture diagram from https://iree-org.github.io/iree/]

My work this summer produced two related projects for running and measuring IREE on Bela. First, I created a Docker image containing a toolchain for compiling IREE projects for Bela, along with utilities to compile, benchmark, and profile programs. You can find this project at https://github.com/ezrapierce000/bela-iree-container. It also includes a model zoo and runtime as submodules, so you get a full end-to-end development environment for using IREE in Bela projects. The second project is an IREE runtime for Bela, which can be found at https://github.com/ezrapierce000/bela-iree-runtime. It contains a Bela project with the IREE runtime set up so that a model can be loaded into a Bela project. The runtime has two branches with different project structures. The main branch requires the IREE compiler to export a VMFB file to the Bela, which is then loaded at runtime. Alternatively, the emit-c branch uses IREE's EmitC path: the compiler outputs C source code in a module.c file, which is compiled into the binary ahead of time. The runtime also has the option to enable Xenomai diagnostics at runtime to inspect how the IREE thread is behaving. It is functional but still in the early stages; I plan on improving it further so it is easier to use.
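
For the main branch, deployment amounts to copying the compiled module onto the board and running the project. A minimal sketch, assuming the usual Bela USB network address (192.168.7.2, as used in the workflow below); the project path is an assumption, so check the bela-iree-runtime README for specifics:

```sh
# Copy a compiled VMFB module into the runtime project on the board.
scp single_mm.vmfb root@192.168.7.2:~/Bela/projects/bela-iree-runtime/

# Build and run the project using Bela's standard Makefile workflow.
ssh root@192.168.7.2 "make -C ~/Bela PROJECT=bela-iree-runtime run"
```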

Benchmarking

The benchmark utility can be found in the bela-iree-container Docker container; you can follow the README in that repository to set it up. Below are the most recent benchmarks from the BeagleBone Black (BBB) and the BeagleBone AI-64 (BBAI64, CPU only). As you can see, the Cortex-A8 on the Bela is quite a bit slower at inference than the AArch64 Cortex-A72 on the BBAI64. More errors are also encountered on the 32-bit ARM platform when using the LLVM-CPU codegen backend. The interpreted VMVX runtime is still quite new and is expected to improve as the IREE developers add new microkernels to VMVX, which will hopefully translate into performance gains on Bela. All models in this table other than the MDRNN were processing blocks of 1024 samples.

Current benchmarks using IREE:

| Model | IREE input type | Bela (BBB) IREE benchmark | BBAI64 (CPU only) IREE benchmark, LLVM-CPU |
|---|---|---|---|
| basic_mlp_1024 | TOSA from TFLite | 222ms (VMVX) | 24.0ms |
| resnet_1d_1024 | TOSA from TFLite | segfault | NA |
| simple_conv_1d_1024 | TOSA from TFLite | 2549ms (LLVM-CPU) | 137ms |
| simple_rnn_1024 | NA | NA (unable to export to TOSA) | NA |
| single_mm_1024 | TOSA from Torch-MLIR | 19.7ms (LLVM-CPU) | 7.72ms |
| siren_mlp_1024 | TOSA from TFLite | 778ms (LLVM-CPU) | 50.4ms |
| transformer_block_1024 | TOSA from TFLite | segfault | 142ms |
| variational_encoder_1024 | NA | NA (unable to export to TOSA) | NA |
| mdrnn (64 hidden units) | MHLO from JAX | 37.6ms (VMVX) | 0.176ms |
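
Under the hood, numbers like these come from IREE's iree-benchmark-module tool, which the container's benchmark wrapper presumably drives on the board. A hedged sketch of a direct invocation, using flag spellings from recent IREE releases (older builds used e.g. --module_file and --entry_function):

```sh
# Benchmark the exported forward() function with a dummy 1x1024 float input,
# repeated 10 times, on the CPU via the local-task device.
iree-benchmark-module \
  --module=single_mm.vmfb \
  --device=local-task \
  --function=forward \
  --input=1x1024xf32=4 \
  --benchmark_repetitions=10
```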

Profiling

Another benefit of using IREE is its built-in instrumentation for the Tracy profiler. This instrumentation can be enabled throughout the IREE runtime with a compiler flag, yielding fine-grained profiling data from running IREE programs. The data can be sent over TCP to a capture tool, which allows for visualization of traces, memory allocations, and so on. Unfortunately, the instrumented binaries were unstable on the Bela, making profile recording unreliable, although some profiles could be captured. It is somewhat functional and could be a very useful tool, so more work debugging the cause of the instability would be worthwhile.
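
For reference, a sketch of how this is typically wired up. The CMake option name is from IREE's build system; the capture binary's name varies between Tracy versions (it is built from the capture directory of the tracy repository), and the Bela address matches the workflow below:

```sh
# Build the IREE runtime with Tracy instrumentation compiled in.
cmake -B build/ -DIREE_ENABLE_RUNTIME_TRACING=ON .
cmake --build build/

# On the host, capture a trace from the instrumented program running on the
# Bela. 8086 is Tracy's default broadcast/listen port.
./tracy-capture -o bela_trace.tracy -a 192.168.7.2 -p 8086
```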

[Tracy example trace captured from Bela]

As an alternative, the perf profiler was used to record profiles and performance-monitor events on the Bela while models run. The profiling utility in the Docker container records profiles as the model runs; the resulting profiles and events can then be viewed in various formats. The viewer I have been using is Trace Compass. Additional work could be done in the IREE runtime on Bela to provide instrumentation similar to the Tracy profiler, possibly by using LTTng to insert tracepoints.
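
For orientation, the underlying perf usage looks roughly like this (standard perf subcommands; the event list matches the one used in the workflow below, and the binary name is hypothetical):

```sh
# Count cache events while the model runs on the Bela.
perf stat -e cache-misses,cache-references ./bela-iree-runtime

# Record a sampled call-graph profile for later viewing (e.g. in Trace Compass).
perf record -g -o perf.data ./bela-iree-runtime
perf report -i perf.data
```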

Workflow example

As an example, these are the steps needed to benchmark a single matrix multiply using IREE on Bela and to print out some profiling information. These steps assume you are inside the bela-iree-container.

```sh
# 1. Export the models from the model zoo (produces the .tosa files used below).
cd /workspaces/bela-iree-container/models/embedded-model-zoo/ && conda activate zoo && python -m zoo

# 2. Compile the single matrix multiply to VM bytecode for the BeagleBone Black.
cd tosa/ && compile -i single_mm.tosa -t bbb -d tosa -f vm-bytecode -h llvm-cpu -o single_mm.vmfb

# 3. Benchmark the forward() entry point on the Bela, 10 repetitions.
benchmark -f single_mm.vmfb -t bbb -r 10 -e forward -i 1x1024xf32=4 -d 192.168.7.2

# 4. Record cache-related performance counters while the model runs.
profile -f single_mm.vmfb -m stat -e forward -i 1x1024xf32=4 -l cache-misses,cache-references -d 192.168.7.2
```

Future work

Although I did not complete all of the goals set out at the start of this project, what I've accomplished gives a good starting point for future experimentation with IREE on Bela and other embedded devices (RPi, Teensy, etc.). I plan on continuing this work in some form, focusing especially on making the IREE runtime on Bela stable and easier to use. In the short term I plan to get a full demo of a control-rate MDRNN running in a Bela project using IREE. Beyond these practical improvements, I would also like to automate some of the process of building the IREE runtime components, the profiling/instrumentation utilities, and the MLIR components, so that the upstream projects can be tracked easily as they move quite fast. Another interesting avenue to investigate would be the GPU on the Bela: PowerVR (the GPU's manufacturer) finally released drivers for it in 2020, so it may be worth a look. I don't think it would be very useful for audio synthesis, as it would still require large block sizes, but it might work for control-rate processing or image processing.

Links

https://arxiv.org/abs/2205.14479 - TinyIREE: An ML Execution Environment for Embedded Systems from Compilation to Deployment

https://arxiv.org/abs/2002.11054 - MLIR: A Compiler Infrastructure for the End of Moore's Law

https://archive.eclipse.org/tracecompass.incubator/doc/org.eclipse.tracecompass.incubator.perf.profiling.doc.user/User-Guide.html - Perf User Guide for TraceCompass

https://github.com/rodrigodzf/DeepLearningForBela - Great runtime and examples for deep learning with Bela

https://www.brendangregg.com/linuxperf.html - Linux performance info

https://groups.google.com/g/iree-discuss/c/qyTy88KLq2c - Post from iree-discuss talking about the VMVX backend, plans for implementing more microkernels

https://www.lei.chat/ - Blog of one of the IREE developers who has some very helpful posts regarding MLIR + GPUs

https://github.com/wolfpld/tracy - Tracy profiler

https://github.com/iree-org/iree-jax - Start of an IREE-JAX integration
