Pixel format conversion skunkworks
// So, looking at:
// https://github.com/dolphin-emu/dolphin/blob/3b75f45cf63e8455efb539109bebf6626bcb40e3/Source/Core/VideoCommon/VertexLoaderX64.cpp#L311
// in = RRRRGGGG,BBBBAAAA
x = LoadAndSwap16(in);            // x=0000 0000 0000 0000 rrrr gggg bbbb aaaa
x = (x ^ (x << 8)) & 0x00ff00ff;  // x=0000 0000 rrrr gggg 0000 0000 bbbb aaaa
x = (x ^ (x << 4)) & 0x0f0f0f0f;  // x=0000 rrrr 0000 gggg 0000 bbbb 0000 aaaa
x = x | (x << 4);                 // x=rrrr rrrr gggg gggg bbbb bbbb aaaa aaaa
// so that should be (non-PDEP variant):
scratch1 = LoadAndSwap16(data)
// first swizzle step: move rrrrgggg up into the third byte
MOV(32, R(scratch2), R(scratch1));
SHL(32, R(scratch1), Imm8(8));
XOR(32, R(scratch1), R(scratch2));
AND(32, R(scratch1), Imm32(0x00FF00FF));
// second swizzle step: give each 4-bit channel its own byte
MOV(32, R(scratch2), R(scratch1));
SHL(32, R(scratch1), Imm8(4));
XOR(32, R(scratch1), R(scratch2));
AND(32, R(scratch1), Imm32(0x0F0F0F0F));
// final step: duplicate each nibble into the high half of its byte
MOV(32, R(scratch2), R(scratch1));
SHL(32, R(scratch1), Imm8(4));
OR(32, R(scratch1), R(scratch2));
// and store!
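To sanity-check the bit patterns in the comments, the three steps can be written as plain C. This is only a sketch: `load_be16` and `expand_rgba4444` are hypothetical names, and `load_be16` stands in for `LoadAndSwap16` on the assumption that the input is two big-endian bytes.

```c
#include <stdint.h>

/* Hypothetical stand-in for LoadAndSwap16: read two bytes as big-endian. */
static uint32_t load_be16(const uint8_t *in)
{
    return ((uint32_t)in[0] << 8) | (uint32_t)in[1];
}

/* Expand a 16-bit rrrrggggbbbbaaaa value to 8 bits per channel. */
static uint32_t expand_rgba4444(uint32_t x)
{
    /* first swizzle step: 0000 0000 rrrr gggg 0000 0000 bbbb aaaa */
    x = (x ^ (x << 8)) & 0x00ff00ffu;
    /* second swizzle step: 0000 rrrr 0000 gggg 0000 bbbb 0000 aaaa */
    x = (x ^ (x << 4)) & 0x0f0f0f0fu;
    /* final step: copy each nibble into the high half of its byte */
    return x | (x << 4);
}
```

For example, `expand_rgba4444(0x1234)` works out to `0x11223344`: each 4-bit channel `n` becomes the byte `nn`.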
The fastest IMUL by an 8-bit constant in the x86 family has 3-cycle latency; that shift/or sequence is 2.
You could use an imul instead of the last shift/or. The imul is only 3 bytes and might be faster (probably depends on the machine?)
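The exact instruction suggested is not shown here. One reading, assuming the multiplier is 0x11, is that a multiply can replace the final shift/or: after the second swizzle step every byte holds only its low nibble, so `x * 0x11 = x + (x << 4)` never carries and equals `x | (x << 4)`. A small C sketch of that equivalence, with hypothetical function names:

```c
#include <stdint.h>

/* Final step as in the listing: shift, then OR. */
static uint32_t dup_nibbles_shift_or(uint32_t x)
{
    return x | (x << 4);
}

/* Assumed alternative: one multiply by 0x11 (17).
 * Only equivalent when (x & 0xf0f0f0f0) == 0, so x and x << 4
 * share no set bits and the addition inside the multiply cannot carry. */
static uint32_t dup_nibbles_mul(uint32_t x)
{
    return x * 0x11u;
}
```

Whether this is a win depends on the latency and code-size trade-off discussed in the comments. The sketch only shows that the two forms produce the same result for inputs already masked with `0x0f0f0f0f`.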