Whitecube is a reversing challenge from Google CTF 2022. The challenge hands out an encryption tool whitecube.exe
(AMD64 PE), along with an encrypted flag file.
It is implied that the players should figure out what the algorithm is by looking at the binary, then break/reverse the encryption.
The challenge description strongly hints at something GPU-related, and indeed, the main algorithm is implemented as a SPIR-V shader that potentially runs on a GPU.
The challenge took us about 6 hours to solve.
The challenge binary prints a nice usage message when invoked without arguments:
Z:\>whitecube.exe
Usage: whitecube.exe <input_file> <output_file>
Running the binary on an input file containing 1024 'A's, it outputs a 1044-byte file with visible repeating patterns starting at offset 0x14. We can make an educated guess that the encrypted file starts with a 20-byte header. Moreover, the first 8 bytes are a 64-bit little-endian integer 0x400, our original file size. This leaves 12 unknown and seemingly random bytes in the header.
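Under that assumption, parsing the header looks like this (a sketch; parse_header and the sample bytes are hypothetical, only the 8-byte little-endian size field is confirmed so far):

```python
import struct

# Hypothetical parser for the guessed 20-byte header layout:
# 8-byte little-endian original size, then 12 unknown (random-looking) bytes.
def parse_header(data: bytes):
    (size,) = struct.unpack_from("<Q", data, 0)
    unknown = data[8:20]
    return size, unknown, data[20:]

# Sample header mimicking the observed file: size 0x400 plus 12 filler bytes.
sample = (0x400).to_bytes(8, "little") + bytes(range(12)) + b"ciphertext..."
size, unknown, body = parse_header(sample)
assert size == 0x400 and len(unknown) == 12
```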
Loading the binary into IDA Pro, we spotted a few interesting strings:
Failed to init GLFW
GLSL.std.450
f5(mf44[16]);l1;
This suggests that the challenge uses the GLFW library and contains an embedded GLSL shader. Since the challenge is statically linked against GLFW, it helps to make FLAIR signatures for the GLFW library so that we can avoid reversing library functions. After some investigation we realized that GLFW releases precompiled Windows binaries. Using pcf+sigmake to prepare a signature, we get 378 matches in IDA Pro, which makes the analysis a lot easier:
$ ~/lib/flair77/bin/linux/pcf glfw-3.3.7.bin.WIN64/lib-vc2022/glfw3_mt.lib
/home/riatre/ctf/googlectf22/re/whitecube/glfw-3.3.7.bin.WIN64/lib-vc2022/glfw3_mt.lib: skipped 6, total 333
$ ~/lib/flair77/bin/linux/sigmake glfw3_mt.pat glfw3_mt.sig
glfw3_mt.sig: modules/leaves: 241/326, COLLISIONS: 4
See the documentation to learn how to resolve collisions.
# Edit glfw3_mt.exc, just remove the first few lines.
$ ~/lib/flair77/bin/linux/sigmake glfw3_mt.pat glfw3_mt.sig
$
The main() logic is quite simple:
- Read the input file into a buffer. Check that the input file size is less than $800 \times 640 \times 4 = 2048000$ bytes. Write the input size as a 64-bit little-endian integer to the output file.
- Initialize GLFW and create a pair of vertex and fragment shaders.
- Generate 12 random bytes with std::mt19937 and std::random_device. The 12 bytes go directly into the output file. It then appends [1, 1, 1, 1] to the 12 items and builds a 16x16 diagonal matrix out of them, then binds the matrix to the shader variable u_nonce.
- Transpose the input in 4x4 blocks, bind the input file to a shader buffer and render.
- Copy the rendered pixels back to the shader buffer and render again. Repeat 7 times.
- Write the final rendered pixels to the output file.
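The nonce setup in the third step can be sketched in numpy (a sketch; the variable names are ours, and numpy's generator stands in for std::mt19937):

```python
import numpy as np

# 12 random bytes, extended with [1, 1, 1, 1] and placed on the diagonal
# of a 16x16 matrix, mirroring what main() binds to u_nonce.
rng = np.random.default_rng()
nonce_bytes = rng.integers(0, 256, size=12)
diag = np.concatenate([nonce_bytes, [1, 1, 1, 1]])
u_nonce = np.diag(diag).astype(np.float32)

assert u_nonce.shape == (16, 16)
# the last four diagonal entries are the fixed ones
assert np.all(u_nonce[np.arange(12, 16), np.arange(12, 16)] == 1)
```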
Note that the 16x16 matrix has a weird layout in memory: it is saved as 16 row-major 4x4 matrices, where the 16 matrices themselves are also arranged in 4x4 row-major order. We'll see why later.
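One way to convert between this block layout and a plain 16x16 matrix in numpy (a sketch under the row-major assumption just described; the helper names match the convert_bmat_to_mat/convert_mat_to_bmat used in the solver code, but the implementation here is ours):

```python
import numpy as np

def convert_bmat_to_mat(bmat: np.ndarray) -> np.ndarray:
    # 16 row-major 4x4 blocks, themselves in 4x4 row-major block order,
    # assembled into a single 16x16 matrix.
    assert bmat.shape == (16, 4, 4)
    return bmat.reshape(4, 4, 4, 4).transpose(0, 2, 1, 3).reshape(16, 16)

def convert_mat_to_bmat(mat: np.ndarray) -> np.ndarray:
    # Inverse of the above: split a 16x16 matrix back into the block layout.
    assert mat.shape == (16, 16)
    return mat.reshape(4, 4, 4, 4).transpose(0, 2, 1, 3).reshape(16, 4, 4)

# Round-trip sanity check
m = np.arange(256).reshape(16, 16)
assert np.array_equal(convert_bmat_to_mat(convert_mat_to_bmat(m)), m)
```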
It is clear that the actual transformation logic is in the shader code, so it's time to dive in.
Dumping the shader binaries, file tells us that they are SPIR-V bytecode:
$ file fragment_shader.bin
fragment_shader.bin: Khronos SPIR-V binary, little-endian, version 0x010000, generator 0x08000a
Searching the Internet, we found a nice SPIR-V decompiler, which gives very readable decompilation output. With it, it is not hard to understand what happens in the shader. For the data types, see the GLSL wiki. Here is a summary:
- mat4: 4x4 float32 matrix in column-major order
- vec4: 4-component float32 vector
- uvec2: 2-component unsigned integer vector
The vertex shader is trivial: it simply copies the input to the output. We can ignore it from now on.
The fragment shader is a bit more complex: it contains a reasonably long main() and 8 helper functions. Let's start with the helper functions. In general, they work on 16x16 matrices in the 16-mat4 block-matrix layout. In familiar numpy terms, they are:
- f6(dst, src) simply copies 16 mat4 from mat4 src[16] to mat4 dst[16]. In other words it is just dst = src.
- f4(data, val) fills the matrix data with val, i.e. data = np.full((16, 16), val).
- f5(data) mods each element by 256.
- f3(res, a, b) is quite long, but after some cleanup it looks like:
void f3(inout mat4 res[16], mat4 a[16], mat4 b[16]) {
res[ 0] = a[ 0] * b[ 0] + a[ 1] * b[ 4] + a[ 2] * b[ 8] + a[ 3] * b[12];
res[ 1] = a[ 0] * b[ 1] + a[ 1] * b[ 5] + a[ 2] * b[ 9] + a[ 3] * b[13];
res[ 2] = a[ 0] * b[ 2] + a[ 1] * b[ 6] + a[ 2] * b[10] + a[ 3] * b[14];
// ...
it is a matrix multiplication on block matrices, i.e. res = a @ b.
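We can sanity-check that against a plain 16x16 multiplication in numpy (a quick sketch; the assemble helper and the index convention res[i*4+j] are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((16, 4, 4))  # 16 blocks in 4x4 row-major block order
b = rng.random((16, 4, 4))

# f3: res[i*4+j] = sum_k a[i*4+k] @ b[k*4+j], matching the unrolled shader code
res = np.zeros((16, 4, 4))
for i in range(4):
    for j in range(4):
        for k in range(4):
            res[i * 4 + j] += a[i * 4 + k] @ b[k * 4 + j]

def assemble(blocks):
    # lay the 16 blocks out as one 16x16 matrix
    return blocks.reshape(4, 4, 4, 4).transpose(0, 2, 1, 3).reshape(16, 16)

# block-wise multiplication matches the full 16x16 matrix product
assert np.allclose(assemble(res), assemble(a) @ assemble(b))
```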
- f0(dst, m, k): dst = (intclip(126.0 * (1.0 + np.sin(m))) % 256) @ k
- f1(dst, m, k): dst = (intclip(126.0 * (1.0 + np.cos(m))) % 256) @ k
- f2(dst, m, k): dst = (intclip(126.0 * (1.0 + np.tan((m - 127.0) / 256.0))) % 256) @ k
- f7(dst, src): dst += src
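The three nonlinear helpers can be sketched in numpy as follows (a sketch: we write them as returning dst instead of using an output parameter, and we assume intclip truncates to an integer value):

```python
import numpy as np

def intclip(x):
    # Assumed behaviour: truncate the float result to an integer value
    return np.trunc(x)

def f0(m, k):
    return (intclip(126.0 * (1.0 + np.sin(m))) % 256) @ k

def f1(m, k):
    return (intclip(126.0 * (1.0 + np.cos(m))) % 256) @ k

def f2(m, k):
    return (intclip(126.0 * (1.0 + np.tan((m - 127.0) / 256.0))) % 256) @ k

# With m = 0 and k = identity: sin(0) = 0, so f0 yields a matrix of 126s
out = f0(np.zeros((16, 16)), np.eye(16))
assert np.all(out == 126.0)
```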
With these in mind, it isn't hard to figure out what main() does. The "GPU computation model" applied here is simple: each main() call generates one single output pixel, written to FragColor, at coordinate (gl_FragCoord.x, gl_FragCoord.y). It actually does a lot of wasteful work, computing the entire 16x16 matrix the output pixel lies in, only to keep the single 4-component group for the current pixel. With that said, the logic reimplemented in Python looks like:
# Assume that block is in flattened bytestream order
def process(block: np.ndarray, block_idx: int, u_nonce):
    assert block.dtype == np.uint8
    assert block.shape == (1024,)
    # whitecube.exe passes the transposed matrix to the shader, so we are good
    # to just reshape in row-major order
    block = block.reshape(4, 16, 4, 4).astype(np.float32)
    nonce = u_nonce.copy()
    nonce[15, 0, 0] = 2 * (block_idx + 1) - 1
    nonce = convert_bmat_to_mat(nonce)
    FUNC = [
        lambda x: f0(x, PARAM_6),
        lambda x: f1(x, PARAM_9),
        lambda x: f2(x, PARAM_12),
        lambda x: x,
    ]
    result = []
    for subblock_idx in range(4):
        if subblock_idx != 3:
            bias = block[subblock_idx]
        else:
            bias = np.zeros_like(block[0])
        bias = convert_bmat_to_mat(bias)
        bias = FUNC[subblock_idx](bias)
        x = convert_bmat_to_mat(block[(subblock_idx + 1) % 4])
        x = (x + bias % 256) % 256
        x = (x @ nonce) % 256
        x = (x @ PARAM_23) % 256
        x = convert_mat_to_bmat(x)
        result.append(x.reshape(-1))
    return np.concatenate(result).astype(np.uint8)
With u_nonce known and fixed, the process() function is clearly reversible. Simply running its inverse 8 times recovers the flag.
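Undoing the multiplications by nonce and PARAM_23 requires matrix inverses modulo 256. One way to compute them (a sketch; matinv_mod256 is our helper, and it assumes the matrix has odd determinant, which it must be for decryption to work at all) is to invert over GF(2) and Newton-lift the result up to mod 2^8:

```python
import numpy as np

def inv_mod2(a):
    # Gaussian elimination over GF(2) on the augmented matrix [a | I]
    n = a.shape[0]
    m = np.concatenate([a % 2, np.eye(n, dtype=np.int64)], axis=1)
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r, col] == 1)
        m[[col, pivot]] = m[[pivot, col]]
        for r in range(n):
            if r != col and m[r, col] == 1:
                m[r] = (m[r] + m[col]) % 2
    return m[:, n:]

def matinv_mod256(a):
    # Newton iteration x <- x(2I - ax) doubles the precision each round,
    # lifting the GF(2) inverse to mod 4, 16, then 256.
    a = a.astype(np.int64) % 256
    n = a.shape[0]
    x = inv_mod2(a)
    for _ in range(3):
        x = (x @ (2 * np.eye(n, dtype=np.int64) - a @ x)) % 256
    return x

# Usage: a = I + 2*junk is congruent to I mod 2, hence has odd determinant
rng = np.random.default_rng(1)
a = (np.eye(16, dtype=np.int64) + 2 * rng.integers(0, 128, (16, 16))) % 256
inv = matinv_mod256(a)
assert np.array_equal((a @ inv) % 256, np.eye(16, dtype=np.int64))
```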
Note that the encrypted flag file contains more than 800*640*4 bytes, a size which whitecube.exe itself refuses to encrypt. We can assume that the remaining blocks work the same way.
See solve.ipynb for the full solution code.