@mhroth
Created October 21, 2015 17:48
A basic NEON implementation of SSE _mm_movemask_ps
#include <arm_neon.h>
#include <stdint.h>

// Assumes every lane of x is a comparison result: all zeros or all ones.
uint32_t _mm_movemask_ps(float32x4_t x) {
  // Keep one distinct bit per lane: bit i for lane i.
  uint32x4_t mmA = vandq_u32(
      vreinterpretq_u32_f32(x), (uint32x4_t) {0x1, 0x2, 0x4, 0x8}); // [0 1 2 3]
  uint32x4_t mmB = vextq_u32(mmA, mmA, 2); // [2 3 0 1]
  uint32x4_t mmC = vorrq_u32(mmA, mmB); // [0|2 1|3 0|2 1|3]
  uint32x4_t mmD = vextq_u32(mmC, mmC, 3); // [1|3 0|2 1|3 0|2]
  uint32x4_t mmE = vorrq_u32(mmC, mmD); // [0|1|2|3 ...]
  return vgetq_lane_u32(mmE, 0);
}
mhroth commented Oct 21, 2015

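Unfortunately, NEON has no direct analog of SSE's _mm_movemask_ps. There are some reimplementations (e.g. here); this one is mine. The input is assumed to be the result of a comparison, i.e. each lane is either 0x0 or ~0x0. Unlike the SSE instruction, which reads only the sign bit of each lane, this version reads bit i of lane i, so it is not a drop-in replacement for arbitrary float inputs. This implementation is "reasonably" fast.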
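For inputs that are not comparison masks, here is a rough sketch of a variant that follows the SSE semantics and reads only the sign bit of each lane. It is a different technique from the code above, not the author's method. The names movemask_ps_scalar and movemask_ps_signbit are hypothetical and chosen only for this illustration. The scalar version is a portable reference model that the NEON version can be checked against; the NEON version is compiled only when the compiler reports NEON support.

```c
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical reference model: bit i of the result is the sign bit of v[i]. */
static uint32_t movemask_ps_scalar(const float v[4]) {
  uint32_t mask = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t bits;
    memcpy(&bits, &v[i], sizeof bits); /* type-pun without undefined behaviour */
    mask |= (bits >> 31) << i;
  }
  return mask;
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
/* Hypothetical NEON variant: works on any float input, not only 0x0 / ~0x0. */
static uint32_t movemask_ps_signbit(float32x4_t x) {
  static const int32_t shifts[4] = {0, 1, 2, 3};
  /* Move each lane's sign bit down to bit 0. */
  uint32x4_t sign = vshrq_n_u32(vreinterpretq_u32_f32(x), 31);
  /* Shift lane i left by i so every lane owns a distinct bit. */
  uint32x4_t bits = vshlq_u32(sign, vld1q_s32(shifts));
  /* OR the upper half into the lower half, then combine the two lanes left. */
  uint32x2_t half = vorr_u32(vget_low_u32(bits), vget_high_u32(bits));
  return vget_lane_u32(half, 0) | vget_lane_u32(half, 1);
}
#endif
```

For comparison masks both approaches should agree, since a lane of ~0x0 has its sign bit set and a lane of 0x0 does not. Which one is faster on a given core is not something this sketch measures.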