Skip to content

Instantly share code, notes, and snippets.

@rygorous
Created October 25, 2015 09:58
Embed
What would you like to do?
Pixel format conversion skunkworks pt. 2 ;)
// ---- Explanation starts here
//
// I actually use a general strategy to position multiple sequential bitfields
// at once: for every bit, compute the distance it needs to move. Say you have
// 4 groups: one needs to move by 0 bits, one by 3, one by 6 and one by 9 bits.
// You can do that in two passes: first pass handles the two groups that need to
// move by >=6 bits. Second pass shifts the two groups that still need to travel
// an extra 3 bits. The actual "move groups of adjacent bits by a distance of N"
// steps are pretty straightforward. This gist has the general version (when there
// are collisions between the source and dest bit positions for the move), the 4444
// code had the slightly simpler case when there are no collisions (which saves one
// masking op).
//
// One result from this is that you can handle *any* "deposit"-style permutation in
// log2(reg_width_in_bits) passes: your move distances are simply powers of 2 (in
// descending order). The example above also shows that while powers of 2 provide a
// reasonable log2(N) lower bound, they are not always optimal:
// 0 = 0
// 3 = 2 + 1
// 6 = 4 + 2
// 9 = 8 + 1
// so the move distances above would result in 4 passes using powers of 2 (move by
// 8, 4, 2 bits then by 1 bit) where only 2 are necessary.
//
// The actual minimum number of passes is surprisingly tricky in certain cases; this
// is related to addition chains.
//
// -----
// expand R6G6B6A6 -> R8G8B8A8
// in: x = 00000000 RRRRRRGG GGGGBBBB BBAAAAAA
// everyone needs to move by >=2 bits; r and g move by >=6.
t0 = ((x << 6) & 0x3ffc0000) | ((x << 2) & 0x3ffc); // 00RRRRRR GGGGGG00 00BBBBBB AAAAAA00
// r and b need to move an extra 2 bits.
t0 = ((t0 << 2) & 0xfc00fc00) | (t0 & 0x00fc00fc); // RRRRRR00 GGGGGG00 BBBBBB00 AAAAAA00
out = t0 | ((t0 >> 6) & 0x03030303);
asm:
lea r1, [r0*4]
shl r0, 6
and r0, 0x3ffc0000
and r1, 0x00003ffc
or r0, r1
mov r1, r0
shl r0, 2
and r0, 0xfc00fc00
and r1, 0x00fc00fc
or r0, r1
mov r1, r0
; --- this part is shared with BMI2 path
shr r0, 6
and r0, 0x03030303
or r0, r1
gen:
LEA(32, R(scratch2), MScaled(scratch1, SCALE_4, 0));
SHL(32, R(scratch1), Imm8(6));
AND(32, R(scratch1), Imm32(0x3FFC0000));
AND(32, R(scratch2), Imm32(0x00003FFC));
OR(32, R(scratch1), R(scratch2));
MOV(32, R(scratch2), R(scratch1));
SHL(32, R(scratch1), Imm8(2));
AND(32, R(scratch1), Imm32(0xFC00FC00));
AND(32, R(scratch2), Imm32(0x00FC00FC));
OR(32, R(scratch1), R(scratch2));
MOV(32, R(scratch2), R(scratch1));
// shared tail is shared!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment