After checking the status quo of aarch64 vector code generation by recent-enough compilers, particularly when it comes to intrinsics codegen, a few points surfaced, which I'll try to reason about here. The originally participating compilers were gcc-11 through 13 and clang-11 through 13. Here we will focus on just one "lineage" of the sampled compilers -- clang, as its vector codegen is as good as or better than that of its peers. In particular, the focus will be on how perceived optimisations may affect the codegen, and thus performance, in the context of a simple function -- a vec3-by-matrix3x3 transformation, i.e. multiplication.
Performance was surveyed on the following uarches:
- cortex-a76 -- most of the aarch64 cloud install base, Raspberry Pi 5, et al; clang-13
- cortex-a78 -- a fair midrange performer these days; NV Orin et al; clang-11 & 12
- apple M1 -- the uarch which set the bar for aarch64 high-end performance a few years ago; apple clang-12