@rygorous
Created October 22, 2015 06:15
Pixel format conversion skunkworks
// So, looking at:
// https://github.com/dolphin-emu/dolphin/blob/3b75f45cf63e8455efb539109bebf6626bcb40e3/Source/Core/VideoCommon/VertexLoaderX64.cpp#L311
// in = RRRRGGGG,BBBBAAAA
x = LoadAndSwap16(in); // x=0000 0000 0000 0000 rrrr gggg bbbb aaaa
x = (x ^ (x << 8)) & 0x00ff00ff; // x=0000 0000 rrrr gggg 0000 0000 bbbb aaaa
x = (x ^ (x << 4)) & 0x0f0f0f0f; // x=0000 rrrr 0000 gggg 0000 bbbb 0000 aaaa
x = x | (x << 4);                // x=rrrr rrrr gggg gggg bbbb bbbb aaaa aaaa
// So that should be (non-PDEP variant):
scratch1 = LoadAndSwap16(data)
// first swizzle step
MOV(32, R(scratch2), R(scratch1));
SHL(32, R(scratch1), Imm8(8));
XOR(32, R(scratch1), R(scratch2));
AND(32, R(scratch1), Imm32(0x00FF00FF));
// second swizzle step
MOV(32, R(scratch2), R(scratch1));
SHL(32, R(scratch1), Imm8(4));
XOR(32, R(scratch1), R(scratch2));
AND(32, R(scratch1), Imm32(0x0F0F0F0F));
// final step
MOV(32, R(scratch2), R(scratch1));
SHL(32, R(scratch1), Imm8(4));
OR(32, R(scratch1), R(scratch2));
// and store!
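
For reference, the same three steps can be written as plain C and checked on a host machine. This is only a sketch: `load_and_swap16` is a hypothetical stand-in for `LoadAndSwap16` (assumed here to read the two input bytes as one big-endian 16-bit value), and `expand_rgba4444` is a name made up for illustration, not something from the Dolphin source.

```c
#include <stdint.h>

/* Hypothetical stand-in for LoadAndSwap16: assumed to read two bytes
 * as a big-endian 16-bit value, zero-extended to 32 bits. */
static uint32_t load_and_swap16(const uint8_t *in)
{
    return ((uint32_t)in[0] << 8) | (uint32_t)in[1];
}

/* Expand RRRRGGGG,BBBBAAAA to one byte per channel, each nibble
 * duplicated (0xN -> 0xNN). Illustrative name, not from the source. */
static uint32_t expand_rgba4444(const uint8_t *in)
{
    uint32_t x = load_and_swap16(in);  /* 0000 0000 0000 0000 rrrr gggg bbbb aaaa */
    x = (x ^ (x << 8)) & 0x00ff00ffu;  /* 0000 0000 rrrr gggg 0000 0000 bbbb aaaa */
    x = (x ^ (x << 4)) & 0x0f0f0f0fu;  /* 0000 rrrr 0000 gggg 0000 bbbb 0000 aaaa */
    x = x | (x << 4);                  /* rrrr rrrr gggg gggg bbbb bbbb aaaa aaaa */
    return x;
}
```

With input bytes `0x12, 0x34` the intermediate values are `0x00120034`, then `0x01020304`, and the result is `0x11223344`. The XOR works as a move here because the mask discards every bit position where the shifted and unshifted copies overlap.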
@dougallj

You could use

x = x * 0x11;

instead of the last shift/or - the imul is only 3 bytes and might be faster (probably depends on the machine?)
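
A quick sketch of why the two forms agree, assuming the input already has the shape `0000 rrrr 0000 gggg 0000 bbbb 0000 aaaa` (the function names are made up for illustration): `x * 0x11` is `x + (x << 4)`, and since `x` and `x << 4` have no set bits in common, the add produces no carries and equals the OR.

```c
#include <stdint.h>

/* Final step as written in the gist: duplicate each low nibble
 * into the high nibble of its byte. */
static uint32_t replicate_shift_or(uint32_t x)
{
    return x | (x << 4);
}

/* Multiply variant: x * 0x11 == x + (x << 4). This matches the OR form
 * only when the high nibble of every byte of x is zero. */
static uint32_t replicate_mul(uint32_t x)
{
    return x * 0x11u;
}
```

Both return `0x11223344` for `0x01020304`. If any high nibble is set, the two can differ, so the multiply is only a drop-in replacement after the `0x0F0F0F0F` mask.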

@rygorous

The fastest IMUL by an 8-bit constant in the x86 family has 3-cycle latency; that MOV/SHL/OR sequence has 2, since the MOV and SHL don't depend on each other.
