{{ message }}

Instantly share code, notes, and snippets.

# bitonic/vectorized-atan2f.cpp

Last active Aug 25, 2021
Vectorized & branchless atan2f

### aconz2 commented Aug 17, 2021

 I'm not sure if I'm missing something with the `-static` flag (are you statically linking `libm` somehow?), but manually unrolling the baseline loop by 8 gives me an 8x speedup. Unrolling to 16 gives a 15x speedup and puts it between `auto_3` and `auto_4` on my machine. Again this is all assuming the `atan2f` call isn't getting inlined somehow and then the loop getting unrolled. ```NOINLINE static void atan2_baseline8(size_t cases, const float* ys, const float* xs, float* out) { assert(cases % 8 == 0); for (size_t i = 0; i < cases / 8; i+=8) { out[i] = atan2f(ys[i], xs[i]); out[i + 1] = atan2f(ys[i+1], xs[i+1]); out[i + 2] = atan2f(ys[i+2], xs[i+2]); out[i + 3] = atan2f(ys[i+3], xs[i+3]); out[i + 4] = atan2f(ys[i+4], xs[i+4]); out[i + 5] = atan2f(ys[i+5], xs[i+5]); out[i + 6] = atan2f(ys[i+6], xs[i+6]); out[i + 7] = atan2f(ys[i+7], xs[i+7]); } }```

### bitonic commented Aug 17, 2021 • edited

 @aconz2 you're processing an eight of the elements, given the `cases / 8` condition. You can ask the compiler to unroll the loop, and get exactly the same results as `atan2_baseline`: ```NOINLINE static void atan2_baseline8(size_t cases, const float* ys, const float* xs, float* out) { #pragma unroll 8 for (size_t i = 0; i < cases; i++) { out[i] = atan2f(ys[i], xs[i]); } }``` This is since `atan2f` can't be inlined anyway (at least not in `glibc`), so unrolling doesn't buy you much. By the way, yes, I am compiling the binary statically (since I develop on my laptop and try it on a server). I do it using this `default.nix` besides the file: ```with import {}; stdenvNoCC.mkDerivation { name = "vectorized-atan2-shell"; buildInputs = [ glibc.static clang_12 ]; }``` And then working from inside `nix-shell`.

### aconz2 commented Aug 17, 2021

 Haha oops! 🤦 I'm still curious what the performance would be like if we could inline `atan2f`. Would be neat to have an LTO libc for cases like this. Great article btw and thanks for putting up with my silly mistake. I learned a lot from this and the article. Cheers! (PS I also couldn't get the perf events code to work correctly on my machine; no calls errored but was getting all zeros in `perf_read_value->value`. Tried debugging a bit but wondering if you had to do anything special to get that to work)

### bitonic commented Aug 17, 2021

 @aconz2 no shame in making these sort of mistakes! The vast majority of surprising results I get are due to some silly mistake :). I have not tried to just inline the stock `atan2f`. Re perf, I had to install `linuxPackages.perf` on NixOS. On Debian it'd be `linux-perf`, iirc.

### aconz2 commented Aug 18, 2021

 I figured out why I was getting no perf events. From the manual on perf_event_open(2) ``````An event group is scheduled onto the CPU as a unit: it will be put onto the CPU only if all of the events in the group can be put onto the CPU. `````` Turns out there was one too many hardware counters for my CPU so none of them returned values, but you also don't get any errors. (Aside: though I was getting a bogus value out of the software task-clock counter, which is weird) Got a similar result from `perf` but it correctly counts as many as it can and then says `` ``````→ perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses,branches echo 'hi' hi Performance counter stats for 'echo hi': 349,593 cycles:u 274,441 instructions:u # 0.79 insn per cycle 3,018 cache-misses:u # 20.166 % of all cache refs 14,966 cache-references:u 4,240 branch-misses:u # 0.00% of all branches branches:u (0.00%) 0.000626665 seconds time elapsed 0.000647000 seconds user 0.000000000 seconds sys Some events weren't counted. Try disabling the NMI watchdog: echo 0 > /proc/sys/kernel/nmi_watchdog perf stat ... echo 1 > /proc/sys/kernel/nmi_watchdog `````` And after the bit about `nmi_watchdog`, I did let me get results from all the counters. I'll have to look later why/how `perf` detects that there are too many counters and/or detecting the number of available ones beforehand.