@breakin
Created October 4, 2016 17:47
ISPC 8-bit/16-bit
export void simple(uniform unsigned int8 vin[], uniform unsigned int8 vout[], uniform int count) {
    foreach (index = 0 ... count) {
        unsigned int8 v = vin[index];
        // Do some calculations in 16-bit lanes. I expect v to be split in 2 registers
        unsigned int16 v2 = ((unsigned int16)v) * ((unsigned int16)v);
        // And somehow fold back into 8-bit lanes
        unsigned int8 m = v2 >> 8;
        // Write result
        vout[index] = m;
    }
}
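For checking the vectorized output, a scalar reference in plain C can be handy. This is only a sketch: the name simple_ref is made up for illustration and is not part of the gist. It also shows why 16-bit lanes are enough: the largest product is 255 * 255 = 65025, which fits in an unsigned 16-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the ISPC kernel above:
   square each byte in 16 bits and keep the high byte of the product. */
void simple_ref(const uint8_t *vin, uint8_t *vout, size_t count) {
    assert(count == 0 || (vin != NULL && vout != NULL));
    for (size_t i = 0; i < count; ++i) {
        /* 255 * 255 = 65025, so the product never overflows uint16_t. */
        uint16_t v2 = (uint16_t)((uint16_t)vin[i] * (uint16_t)vin[i]);
        vout[i] = (uint8_t)(v2 >> 8);
    }
}
```

Running the ISPC kernel and this function over all 256 input values and comparing the results is a cheap way to validate any target.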
The assembly for the full body of the loop, compiled with "--target=avx1-i32x8", is along these lines:
    pmovzxbw (%rcx,%rax), %xmm1   # xmm1 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero,mem[4],zero,mem[5],zero,mem[6],zero,mem[7],zero
    pmullw   %xmm1, %xmm1
    psrlw    $8, %xmm1
    pshufb   %xmm0, %xmm1
    movq     %xmm1, (%rdx,%rax)
    addl     $8, %eax
    cmpl     %r9d, %eax
This might be the faster option; I'm not sure. In this case, though, it is easier to prototype with intrinsics.
Note that other targets might produce better code, but I did not manage to get any of them to do so.
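For comparison, here is roughly what an intrinsics prototype of the same computation could look like. It is a sketch under assumptions, not the code the author had in mind: the function name square_hi_bytes is hypothetical, it uses only SSE2 (so it widens with an unpack against zero and narrows with a saturating pack, instead of the pmovzxbw/pshufb pair in the listing), and it falls back to a scalar loop when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical prototype: vout[i] = (vin[i] * vin[i]) >> 8, 8 bytes per step. */
void square_hi_bytes(const uint8_t *vin, uint8_t *vout, size_t count) {
    assert(count == 0 || (vin != NULL && vout != NULL));
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 8 <= count; i += 8) {
        /* Load 8 bytes into the low half of the register. */
        __m128i b = _mm_loadl_epi64((const __m128i *)(vin + i));
        /* Zero-extend the 8 bytes to eight 16-bit lanes. */
        __m128i w = _mm_unpacklo_epi8(b, zero);
        /* 16-bit multiply; the product is at most 65025, so nothing is lost. */
        w = _mm_mullo_epi16(w, w);
        /* Logical shift keeps the high byte of each product (0..254). */
        w = _mm_srli_epi16(w, 8);
        /* Narrow back to bytes; values are below 256, so saturation never triggers. */
        __m128i r = _mm_packus_epi16(w, zero);
        /* Store the low 8 bytes. */
        _mm_storel_epi64((__m128i *)(vout + i), r);
    }
#endif
    /* Scalar tail, and the whole loop when SSE2 is unavailable. */
    for (; i < count; ++i) {
        unsigned v = vin[i];
        vout[i] = (uint8_t)((v * v) >> 8);
    }
}
```

The scalar tail handles counts that are not a multiple of 8, which plays the role of the masked partial iteration that ISPC generates for foreach.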