@mtsamis
Created November 21, 2023 16:37

An optimized representation for 2D AABBs and SIMD intrinsics

(Note: This writeup assumes some knowledge of SIMD programming, assembly and AABBs).

Introduction

There are a couple of ways to represent an axis aligned bounding box in 2D, but I think the most common is to store the minimum and maximum coordinates. With this approach, most basic operations (overlap tests, union/intersection, etc.) have simple implementations and offer decent performance.

Some time ago, I came up with a modification to this AABB representation that has the potential for better performance. I haven't seen this technique mentioned or used anywhere, but if you have seen it somehow I would appreciate it if you could let me know.

So, let's say we have a 2D AABB struct in a C-like language (as seen below) together with two basic operations, TestOverlap (overlap check for two AABBs) and Union. I believe these are the most basic operations to optimize, at least when working with broad-phase collision detection algorithms. An AABB-vs-point overlap test can be reduced to AABB vs AABB, and an Intersect function is symmetric to Union, so while I won't be talking about these, the same optimizations apply (a sketch follows the code below).

struct AABB {
  float minx, miny;
  float maxx, maxy;
};

bool TestOverlap(AABB a, AABB b) {
  return a.minx <= b.maxx
      && a.miny <= b.maxy
      && a.maxx >= b.minx
      && a.maxy >= b.miny;
}

AABB Union(AABB a, AABB b) {
  return {
    min(a.minx, b.minx),
    min(a.miny, b.miny),
    max(a.maxx, b.maxx),
    max(a.maxy, b.maxy)
  };
}

AABB Union(AABB a, float x, float y) { ... }
AABB Intersect(AABB a, AABB b) { ... }
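
For completeness, here is a minimal sketch of what these two elided bodies could look like in the standard representation (my own filling-in, following the symmetry described above):

// Grow an AABB to contain a point: a Union with a degenerate (point) AABB.
AABB Union(AABB a, float x, float y) {
  return {
    min(a.minx, x),
    min(a.miny, y),
    max(a.maxx, x),
    max(a.maxy, y)
  };
}

// Intersection is symmetric to Union: max of the mins, min of the maxes.
// (The result is 'inverted' if the two AABBs don't overlap.)
AABB Intersect(AABB a, AABB b) {
  return {
    max(a.minx, b.minx),
    max(a.miny, b.miny),
    min(a.maxx, b.maxx),
    min(a.maxy, b.maxy)
  };
}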

Basic idea

Since the optimizations are SIMD-oriented, a first observation is that the four float coordinates fit nicely in a 128-bit SIMD vector (or similarly 256-bit vector if doubles are used). But while the data fits nicely, the code has an awkward asymmetry: TestOverlap has a mixture of >= and <= and Union has min and max. Otherwise, it would be straightforward to use a packed vector min/max or a vector comparison.

So following this observation, my idea was to store the max coordinate of the AABB negated instead. Negating only one of the two coordinates makes the implementations symmetric and SIMD-friendly. If we substitute maxx/maxy with -maxx/-maxy in the above operations we get:

bool TestOverlap(AABB a, AABB b) {
  return a.minx <= -b.maxx
      && a.miny <= -b.maxy
      && a.maxx <= -b.minx  // -a.maxx >= b.minx
      && a.maxy <= -b.miny; // -a.maxy >= b.miny
}

AABB Union(AABB a, AABB b) {
  return {
    min(a.minx, b.minx),
    min(a.miny, b.miny),
    min(a.maxx, b.maxx), // -max(-a.maxx, -b.maxx)
    min(a.maxy, b.maxy)  // -max(-a.maxy, -b.maxy)
  };
}

This looks better! Union could now be a single packed minimum operation, and while TestOverlap needs more instructions, it has more potential than before. What's also great is that this trick can be nicely abstracted away inside the AABB implementation; we can have something that looks like:

struct AABB {
  AABB(float x1, float y1, float x2, float y2) : minx(x1), miny(y1), maxx(-x2), maxy(-y2) {}
  float getMinX() { return minx; }
  float getMaxX() { return -maxx; }
  ...  
};

The user of this AABB API doesn't need to know about the 'unusual' representation used.
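
For example, here is a slightly fuller sketch of such a wrapper (my own completion of the elided accessors; the remaining operations follow the same pattern):

struct AABB {
  AABB(float x1, float y1, float x2, float y2)
      : minx(x1), miny(y1), maxx(-x2), maxy(-y2) {}
  float getMinX() const { return minx; }
  float getMinY() const { return miny; }
  float getMaxX() const { return -maxx; }  // undo the negation on the way out
  float getMaxY() const { return -maxy; }
private:
  float minx, miny, maxx, maxy;  // maxx/maxy store the negated maximums
};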

Compilers and analysis of the generated code

Now, this looks nice, but since we haven't used any explicit SIMD programming we're basically leaving the rest up to the compiler... which may or may not be a great idea. My first reaction was to see what assembly is produced by the latest (and greatest) GCC and clang. The results can be seen in this godbolt link (you can play around with this, e.g. by adding -ffast-math or other flags).

I won't analyze this very deeply, but there are a lot of interesting things going on here:

  • There are some instructions in these functions that only exist due to the ABI. Please note that all these functions would be inlined and the compiler would produce better code, but this acts as a decent baseline anyway. You can see that if you make changes to the function signatures (e.g. pass by reference instead of value). As a good example, all instructions except for minps in Union_2 are just packing/unpacking/moves that wouldn't exist otherwise.
  • GCC produces almost twice as much code and emits more jumps (as a GCC developer myself, sigh...)
  • clang produces branchless and very concise (and quite beautiful) code for TestOverlap_1. One nice trick is that it uses a packed comparison for half the comparisons to reduce the number of operations needed.
  • There is no vector negation instruction, so -x is implemented with two instructions that also involve a memory read: movaps xmm4, xmmword ptr [rip + .LCPI2_0] followed by xorps xmm4, xmm3. This is unfortunate to say the least; one alternative is to use 0.0f - x to get xorps xmm5, xmm5 and subss xmm5, xmm3, but this shouldn't be very important.
  • TestOverlap_2 does very badly for both compilers. I have included TestOverlap_2_2, which is the same as TestOverlap_2 but with bitwise & used instead of logical && (sketched after this list), and for clang this one almost does the job: branchless, short, and we can spot the expected cmpnleps. It is funny and sad to me that such a difference can be observed just from using & instead of && in code without side effects... but anyway.
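
For reference, here is a sketch of TestOverlap_2_2 as described above (my reconstruction; the exact source is in the godbolt link):

// Same as the negated-representation TestOverlap, but with bitwise & so the
// compiler doesn't have to preserve the short-circuit semantics of &&.
bool TestOverlap_2_2(AABB a, AABB b) {
  return (a.minx <= -b.maxx)
       & (a.miny <= -b.maxy)
       & (a.maxx <= -b.minx)
       & (a.maxy <= -b.miny);
}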

My takeaway is: clang tries and does decently but code generation is still sensitive to small things, while GCC is quite far from optimal (Note to self: open a GCC ticket with these testcases). In any case, since we're dealing with potentially very hot functions and the compilers aren't consistent enough, the real answer is that we probably have to use SIMD intrinsics or something equivalent.

Explicit SIMD with intrinsics

My current SIMD implementations for these, at the time of writing, are:

bool TestOverlap(__m128 a, __m128 b) {
  // Negate b by flipping the sign bit of every lane (alternatively 0.0f - b)
  b = _mm_xor_ps(b, _mm_set1_ps(-0.0f));
  // Shuffle a to align a.max with -b.min and a.min with -b.max
  a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2));
  // Overlap iff a <= b in all four lanes
  __m128 cmp = _mm_cmple_ps(a, b);
  // _mm_test_all_ones requires SSE4.1
  return _mm_test_all_ones(_mm_castps_si128(cmp));
}

__m128 Union(__m128 a, __m128 b) {
  return _mm_min_ps(a, b);
}
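
As a side note, here is a hypothetical helper (my own addition, not part of the code above) showing how a plain min/max AABB can be packed into this representation, paying the negation once at construction time:

// Pack (min, max) coordinates into a __m128, storing the max half negated.
// Lane order, low to high: minx, miny, -maxx, -maxy.
__m128 MakeAABB(float minx, float miny, float maxx, float maxy) {
  return _mm_set_ps(-maxy, -maxx, miny, minx);
}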

The code generation comparison of these SIMD functions with the rest can be seen in this godbolt link. The Union version is shorter only because of the ABI, so there's no real difference, and the handwritten TestOverlap_3 is only one instruction shorter than the previously best function, TestOverlap_2_2. More importantly, we have won consistent and predictable code generation. GCC's TestOverlap_3 is unfortunately still a bit worse, even with the intrinsics, and results in two more instructions than clang's.

Performance?

At this point, a good question would be "Are the SIMD versions any faster?". I don't want to create microbenchmarks for these functions, because I believe they would be mostly misleading about the real-world performance implications of these implementations. I prefer to keep the comparison at just looking at the generated code:

  • Union_2/Union_3 consist of measurably fewer instructions so they could be faster in principle.
  • TestOverlap_3 is marginally better (-2 instructions) than the best TestOverlap_1. Although this could be measurable depending on the use case, it's still less than what I'd like.
  • The intrinsics versions are much better than the worst options, so depending on your compiler and flags a quite measurable improvement could be observed.

A final important optimization/observation

These were originally part of an effort to optimize a broad-phase algorithm. While the faster union operation is nice, it is mostly used when building the broad-phase tree. For the actual collision detection, the hottest part of my algorithm looked like this:

void CollisionDetection(AABB* aabbs, AABB ref) {
  ...
  for (int i = start; i < end; i++) {
    if (TestOverlap(aabbs[i], ref)) {
      // do something
    }
  }
  ...
}

where the 'do something' part is just a few instructions. I believe this pattern is common in many broad-phase algorithms, which is why I'll explain one last, important-ish optimization. If you'll excuse some more assembly: when the SIMD TestOverlap_3 is used as the condition of a branch, e.g. in a function like this

void CollisionDetection(__m128 a, __m128 b) {
  if (TestOverlap_3(a, b)) {
    do_something();
  }
}

Then the generated code is something like

CollisionDetection(float __vector(4), float __vector(4)):          # @CollisionDetection(float __vector(4), float __vector(4))
  xorps   xmm1, xmmword ptr [rip + .LCPI7_0]
  shufps  xmm0, xmm0, 78                  # xmm0 = xmm0[2,3,0,1]
  cmpleps xmm0, xmm1
  pcmpeqd xmm1, xmm1
  ptest   xmm0, xmm1
  jb      do_something()@PLT           # TAILCALL
  ret

Now, if we discount the jb and ret, which are there due to the function calls, the actual test is 5 instructions, which is quite good. If we used TestOverlap_1 instead, things would get worse, so we can already expect some benefit here. But if you recall the actual loop that we want to optimize, an important insight is that one of the two AABBs is loop-invariant: we're testing overlap between a specific AABB and a collection of AABBs. Of the 5 instructions in the hot loop, one is the negation and another is the shuffle. Because we can shuffle either of the two vectors and get the same result, we can 'preprocess' the ref AABB outside of the loop with a shuffle and a negation, and then the overlap test in the hot loop consists of just three instructions (cmpleps + pcmpeqd + ptest)! We don't even need to do this by hand, as the compiler is (probably) smart enough to do it on its own: if we swap a and b in the _mm_shuffle_ps in TestOverlap_3 and use it within a loop like that, where one of the two AABBs is loop-invariant, the compiler will move the negation and shuffle outside of the loop, as sketched below.
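
To make the hoisted form concrete, here is a minimal sketch under the same representation (PrepareRef and TestOverlapPrepared are hypothetical names of my own; SSE4.1 is assumed as before):

// Hoist the loop-invariant work: negate and shuffle the reference AABB once.
__m128 PrepareRef(__m128 ref) {
  ref = _mm_xor_ps(ref, _mm_set1_ps(-0.0f));                // negate all lanes
  return _mm_shuffle_ps(ref, ref, _MM_SHUFFLE(1, 0, 3, 2)); // swap min/max halves
}

// Per-iteration test: just cmpleps + pcmpeqd + ptest.
bool TestOverlapPrepared(__m128 aabb, __m128 prepared) {
  __m128 cmp = _mm_cmple_ps(aabb, prepared);
  return _mm_test_all_ones(_mm_castps_si128(cmp));
}

void CollisionDetection(__m128* aabbs, int start, int end, __m128 ref) {
  __m128 prepared = PrepareRef(ref);
  for (int i = start; i < end; i++) {
    if (TestOverlapPrepared(aabbs[i], prepared)) {
      // do something
    }
  }
}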

We have single-instruction Union and an overlap test in three instructions (for a particular type of hot loop). I doubt we can do better for these, so let me say: mission complete!

Closing

Although I believe I have provided enough details, you should take care if you implement this in your own code. Remember to check the generated assembly, because things can go wrong for several reasons and compilers can be fragile. I intentionally didn't provide performance numbers, but my feeling is that a proper implementation of the SIMD-optimized AABB can provide a performance benefit in the right circumstances. Also, while these ideas could be used for 3D AABBs, they don't apply as cleanly (six coordinates don't fill a vector evenly) and I haven't thought much about that case.

Feel free to use these ideas and code; I would appreciate a mention of my work if it helped you. I would also like to know if anyone finds these techniques useful in practice.

I should note that I've been meaning to write this up for a long time and didn't. I only now found a good excuse, which is Erin Catto's amazing Box2D v3 (https://github.com/erincatto/box2c) and all the impressive optimizations it contains. If these ideas provide enough benefit, my hope is for this implementation to find its way into Box2D (and other physics engines).
