Pixel format conversion skunkworks pt. 2 ;)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| // ---- Explanation starts here | |
| // | |
| // I actually use a general strategy to position multiple sequential bitfields | |
| // at once: for every bit, compute the distance it needs to move. Say you have | |
| // 4 groups: one needs to move by 0 bits, one by 3, one by 6 and one by 9 bits. | |
| // You can do that in two passes: first pass handles the two groups that need to | |
| // move by >=6 bits. Second pass shifts the two groups that still need to travel | |
| // an extra 3 bits. The actual "move groups of adjacent bits by a distance of N" | |
| // steps are pretty straightforward. This gist has the general version (when there | |
| // are collisions between the source and dest bit positions for the move), the 4444 | |
| // code had the slightly simpler case when there are no collisions (which saves one | |
| // masking op). | |
| // | |
| // One result from this is that you can handle *any* "deposit"-style permutation in | |
| // log2(reg_width_in_bits) passes: your move distances are simply powers of 2 (in | |
| // descending order). The example above also shows that while powers of 2 provide a | |
| // reasonable log2(N) lower bound, they are not always optimal: | |
| // 0 = 0 | |
| // 3 = 2 + 1 | |
| // 6 = 4 + 2 | |
| // 9 = 8 + 1 | |
| // so the move distances above would result in 4 passes using powers of 2 (move by | |
| // 8, 4, 2 bits then by 1 bit) where only 2 are necessary. | |
| // | |
| // The actual minimum number of passes is surprisingly tricky in certain cases; this | |
| // is related to addition chains. | |
| // | |
| // ----- | |
| // expand R6G6B6A6 -> R8G8B8A8 | |
| // in: x = 00000000 RRRRRRGG GGGGBBBB BBAAAAAA | |
| // everyone needs to move by >=2 bits; r and g move by >=6. | |
| t0 = ((x << 6) & 0x3ffc0000) | ((x << 2) & 0x3ffc); // 00RRRRRR GGGGGG00 00BBBBBB AAAAAA00 | |
| // r and b need to move an extra 2 bits. | |
| t0 = ((t0 << 2) & 0xfc00fc00) | (t0 & 0x00fc00fc); // RRRRRR00 GGGGGG00 BBBBBB00 AAAAAA00 | |
| out = t0 | ((t0 >> 6) & 0x03030303); | |
| asm: | |
| lea r1, [r0*4] | |
| shl r0, 6 | |
| and r0, 0x3ffc0000 | |
| and r1, 0x00003ffc | |
| or r0, r1 | |
| mov r1, r0 | |
| shl r0, 2 | |
| and r0, 0xfc00fc00 | |
| and r1, 0x00fc00fc | |
| or r0, r1 | |
| mov r1, r0 | |
| ; --- this part is shared with BMI2 path | |
| shr r0, 6 | |
| and r0, 0x03030303 | |
| or r0, r1 | |
| gen: | |
| LEA(32, R(scratch2), MScaled(scratch1, SCALE_4, 0)); | |
| SHL(32, R(scratch1), Imm8(6)); | |
| AND(32, R(scratch1), Imm32(0x3FFC0000)); | |
| AND(32, R(scratch2), Imm32(0x00003FFC)); | |
| OR(32, R(scratch1), R(scratch2)); | |
| MOV(32, R(scratch2), R(scratch1)); | |
| SHL(32, R(scratch1), Imm8(2)); | |
| AND(32, R(scratch1), Imm32(0xFC00FC00)); | |
| AND(32, R(scratch2), Imm32(0x00FC00FC)); | |
| OR(32, R(scratch1), R(scratch2)); | |
| MOV(32, R(scratch2), R(scratch1)); | |
| // shared tail is shared! |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment