This is the write-up for the task "AVX2 Encoder" from TCTF (0CTF) Finals 2017.
We are given the following files:
avx2_encoder.exe: PE32+ executable (console) x86-64, for MS Windows
flag.jpg.enc: data
key.dat: data
Executable takes input file and produces two files: encrypted input and some data used in encryption.
Pseudocode of encryption process:
key_f = open('key.dat')
SBox = list(range(256))
for _ in range(32):
random_shuffle(SBox)
key_f.write(SBox)
for i in range(256):
block[i] = encryptors[SBox[i]](block[i])
Actually, encryptors
are not separate functions but a big switch. It uses AVX2 instructions to cipher data, so we do not get clean code in decompiler. Example of decompiled encryption code:
case 0:
__asm
{
vmovdqu ymm0, cs:ymmword_14000C1E0+10E0h; jumptable 00000001400016A9 case 0
vmovdqu ymm1, ymmword ptr [rbx]
}
i = 67i64;
do
{
__asm
{
vpshufhw ymm0, ymm0, 2Dh
vpsrlq ymm2, ymm1, 1Dh
vpshuflw ymm5, ymm0, 93h
vpsllq ymm0, ymm1, 23h
vpermq ymm1, ymm0, 39h
vpxor ymm2, ymm1, ymm2
vpxor ymm3, ymm2, ymm5
vpsllq ymm0, ymm3, 3Bh
vpsrlq ymm4, ymm3, 5
vpermq ymm1, ymm0, 39h
vpxor ymm2, ymm1, ymm4
vpshufhw ymm3, ymm2, 93h
vpshuflw ymm0, ymm3, 2Dh
vpsllq ymm2, ymm0, 3Eh
vpsrlq ymm0, ymm0, 2
vpermq ymm1, ymm0, 93h
vpxor ymm2, ymm1, ymm2
vpsubq ymm3, ymm2, ymm5
vpsrlq ymm0, ymm3, 16h
vpsllq ymm4, ymm3, 2Ah
vpermq ymm1, ymm0, 93h
vpshufhw ymm0, ymm5, 6Ch
vpshuflw ymm0, ymm0, 0D2h
vpxor ymm1, ymm1, ymm4
}
--i;
}
while ( i );
__asm { vmovdqu ymmword ptr [rbx], ymm1 }
break;
rbx
points to input data blocks.
We are given all the code as well as used "secret key" (order of used encryptors for each block). With all this info we have several ways to reverse the encryption. The first one that comes to mind is to literally reverse all the AVX2 code somehow. The problem is, the data flow is not linear, in some places data is splitted in two registers to be combined a bit later. To easily overcome this problem (i.e., to avoid thinking) we can use some tool that will solve everything for us, given the right formulation of problem and input data. That tool is Z3 Theorem Prover.
The solution now goes as following:
- Load all data (program, SBoxes, encrypted file, magic constants for encryptors)
- Make solver for given encryptor
- Feed it with magic and encrypted result
- Get decrypted result
To achieve this flow I've rewritten all used AVX2 instructions in corresponding Z3 expressions. First of all, I created some essentials like definitions of registers and easy access to its parts:
def make_ymm(name):
return BitVec(name, 32*8)
def vec_slice(vec, from_, to):
return Extract(vec.size() - from_ - 1, vec.size() - to, vec)
def i64(ymm, index):
index = ymm.size() // 64 - index - 1
return vec_slice(ymm, index * 64, (index + 1) * 64)
Next go AVX2 operations. For example, vpshuflw
(_mm256_shufflelo_epi16
(shuffle words in low 64 bits in each 128-bit part of YMM register in our case)), given two registers and imm value returns corresponding result in the form of big And
:
def vpshuflw(reg1, reg2, imm):
cond = True
cond = And(cond, i64(reg1, 1) == i64(reg2, 1))
cond = And(cond, i64(reg1, 3) == i64(reg2, 3))
for i in range(4):
idx, imm = imm & 3, imm >> 2
cond = And(cond, i16(reg1, i + 0) == i16(reg2, idx + 0))
cond = And(cond, i16(reg1, i + 8) == i16(reg2, idx + 8))
return cond
Now we can parse the encryptor ASM code and stack operations one by one using these AVX2 operations-conditions. We use SSA form to correctly follow the flow, i.e., ymm0 = 1; ymm0 ^= 2
becomes YMM0_0 = 1; YMM0_1 = YMM0_0 ^ 2
.
vpshuflw ymm0, ymm0, 0D2h
First, create a solver:
s = Solver()
Next, take the first instruction, vpshufhw ymm0, ymm0, 0D2h
. We parse it in tuple op = ("vpshuflw", "ymm0", "ymm0", 0xD2)
. Obtain all source registers, add new register version for destination register and obtain it as well:
reg2 = regs[op[2]][-1]
regs[op[1]].append(make_ymm('{}_{}'.format(op[1], len(regs[op[1]]))))
reg1 = regs[op[1]][-1]
Finally, add constraint to solver:
s.add(AVX2[op[0]](reg1, reg2, op[3]))
where AVX2[op[0]]
points to function vpshuflw
.
Solution can be optimized in several ways.
First of all, it is trivially parallelized. Each block is encrypted with one of 256 encryptors -> we can decrypt them in up to 256 threads.
Second, building a solver takes time. It is faster to build it one time and then use cached version (use push
-pop
contexts on cached solver).
Now comes the sad part. We need to note that each encryptor is a block of instructions repeated in a loop no more that ~250 times. So if typical length of code in loop is 25 lines then we get 6000 lines for the whole loop. Solver built for this code takes up to 3 gigabyte of RAM. If we want 16 threads we need 16*3=48 Gb
of RAM to be sure that CPU will work on 100%. That all means that our parallelization efforts are severely limited by installed hardware.
Execution speed in combination with previous limitations gives terrible performance: in single-thread mode the solution takes about 130 hours to finish. I used 8 threads and computed the answer in ~16 hours. Needless to say, it was too slow and we didn't get the flag in time. On the other side, it was a nice example of using Z3 in some real situation, even if it performed not so fast.
Edit:
As you can see in the process of creating a solver I'm not adding constraints one by one but instead combine them in one big And
. It occurs that inserting a line cond = simplify(cond)
before line 175 reduces execution time by 40% (i.e. it executes for ~80 hours on one core instead of 130 hours).