ocr
was an easy-ish machine learning misc challenge from the Google CTF 2022 Qualifier.
Train our neural network to read handwritten letters! I'm sure with the newest technological advances, you'll be able to do it with a tiny network and just a handful of training images.
Attachment
ocr.2022.ctfcompetition.com 1337
Feel free to skip this section if you're already familiar with basic neural networks.
Modern machine learning includes a lot of advanced techniques, but the simple Feedforward Neural Network (FNN) is pretty easy to understand. They are often drawn like this:
You can imagine the data "moving" from top to bottom. Each circle holds a single number, and there is a "weight" associated with each black line. When a number "moves" through a black line, it gets multiplied by that weight. All the numbers going into a circle are then summed to produce a single number. At this point, it is also common to add an additional constant called a "bias" to this sum. Finally, the value of that circle is determined by applying some kind of nonlinear "activation function" to that sum1.
Note that, although the diagram above shows three layers of equal size, this does not have to be the case. A network can have any number of layers and any number of circles in each layer.
The idea above is pretty simple to understand, but in its current form it is a bit difficult to work with. Instead of thinking of a layer as "a bunch of circles holding numbers," we can just think of it as a single vector. Each layer, then, is a function that takes a vector as an input and outputs another vector. Finally, the entire network is just a composition of several of these functions (i.e.
Each layer function can be broken down into two steps: multiplying by weights, and applying the activation function. The latter is trivial—simply apply some function to each element of the vector. But how do we do the weight multiplication with vectors? Let's use a small pair of layers as an example:
The weight going from index
Using similar logic, we can find the contribution to the final
We have just represented multiplication by all the weights as a linear combination of three vectors with weights, where the coefficients are all elements from our original vector. This is exactly the same thing as a matrix-vector multiplication, like this:
Now the "backward" indices should make sense—they are exactly the matrix indices. For convenience, we usually don't write out each element, instead just referring to the entire matrix of weights and vectors of numbers with a single symbol. Putting this all together, an entire layer looks like this (
Linear Algebra and computers are pretty close together already. The main diference is that, while we generally use column vectors in Linear Algebra, we'll use row vectors here.
This also means we need to transpose the weight matrix and switch the order of multiplication, like this:
This is a minor detail, and the computations are identical.
We have really only touched the surface of neural networks. Most notably, we've entirely skipped over how these networks "learn" anything, partly because it is irrelevant to the CTF challenge but also because it is much more complicated. If you're really interested, there are many resources available on the internet, and 3Blue1Brown's Neural Network series is highly recommended.
We're given a zip file with server.py
and a folder of four images. Let's break down server.py
section by section.
First, the server iterates through all the image files and loads them as normalized (values 0-1) arrays. Then, all the image arrays are combined into one big array image_data
.
image_data = []
for f in sorted(glob.glob("images/*.png")):
im = tf.keras.preprocessing.image.load_img(
f, grayscale=False, color_mode="rgb", target_size=None, interpolation="nearest"
)
im = tf.keras.preprocessing.image.img_to_array(im)
im = im.astype("float32") / 255
image_data.append(im)
image_data = np.array(image_data, "float32")
Next, the server constructs the actual network. The first Flatten
layer takes each image as a 16x16x3 array and flattens (crazy name, right?) it into a 768-size vector. Next, everything is fed into a layer of size 4 using ReLU for the activation function. Finally, the output layer is size 128 using Softmax for the activation function. We'll look at these two activation functions in more detail shortly.
# The network is pretty tiny, as it has to run on a potato.
model = Sequential()
model.add(Flatten(input_shape=(16,16,3)))
# I'm sure we can compress it all down to four numbers.
model.add(Dense(4, activation='relu'))
model.add(Dense(128, activation='softmax'))
Now, we can pick three operations to perform, in a loop:
- "Clear weights" just sets all the weights to 0
- "Set weight" lets us set a single weight to any value
- "Read flag" tests the network on the input images
"Read flag" is the most interesting operation. It applies the network with whatever weights are currently set to the image data, then tells us what the network thinks each character is by checking which of the 128 outputs is above 0.5. If none of them are above 0.5, then that character is outputted as "?" 2.
results = model.predict(image_data)
s = ""
for row in results:
k = "?"
for i, v in enumerate(row):
if v > 0.5:
k = chr(i)
s += k
print("The neural network sees:", repr(s))
if s == flag:
print("And that is the correct flag!")
Let's also take a look at the provided images.
Each image is 16x16, with a background of two random colors. The character is written in a third, distinct color.
The neural network in the challenge only uses two types of activation functions—ReLU and Softmax. Let's see what each of these do.
ReLU stands for "Rectified Linear Unit." It is defined as:
If the input is negative, then the output is 0. If the input is nonnegative, then the output is equal to the input. Simple enough.
The Softmax function converts a vector to a probability distribution. First, it applies the exponential function (
It is impossible to train the network because we do not have enough test images. It is clear that the color, position, and size of each character is different, and we can not guess how all the characters look. Even if we could, there is simply no way that a single 4-size hidden layer is enough to classify these characters[citation needed]. Thus, we will have to get the flag some other way.
Instead, we can try to manipulate the weights in such a way that we can deduce information about the original input by intepreting the output. Remember, the output from the server tells us which of the 128 outputs is greater than 0.5, so we can use this a signal for something else.
We can essentially "disconnect" all the layer connections by simply setting their associated weights to 0. Then, because the hidden layer uses ReLU as the activation function, we can grab a single value from the input image by setting the weight from that element to any of the four hidden layer elements to 1. Since all the other weights are 0, the sum of all the inputs is just that one element (multiplied by 1), and applying ReLU to a positive value does nothing.
Here's the important part. The output from the server tells us if that output value was greater than 0.5, after applying Softmax. So by carefully adjusting a single weight between the number we are trying to leak and the output layer, we can determine if the number we are trying to leak is greater than an arbitrary threshold value.
We have pretty much reduced the entire network to this:
Basically what we want to do is set
Imagine we are trying to determine if the mystery value is greater than
Setting this equal to 0.5 and rearranging, we get:
Therefore, we can compare a single value from the test image to any value
Query | Response | Range |
---|---|---|
|
no | |
|
yes | |
|
no | |
|
no | |
|
no | |
|
yes | |
|
no | |
|
yes |
In this scenario, we needed 8 queries to leak the value 69.
Using the strategy outlined above, we would need 8 queries to leak a single value. In testing, we were able to do about 400 queries in the 2 minute window that the server gives us. With some napkin math,
It is certainly feasible within a reasonable amount of time3, but it is still annoyingly long. We can be a little smarter with how we leak information. For example, we can convert the images to grayscale by selecting all three channels at once and adding them together4. However, we did not bother with any of this during the CTF because we came up with a far easier method which is explained in the next section.
We revisited this strategy after the CTF. Actually, by limiting the number of comparisons that can be made per pixel (effectively reducing the resolution), the runtime becomes significantly lower because most of the queries can be cached.
Most characters do not need more than a few comparisons, but some do. To avoid wasteful queries, we can first try leaking every image with only two comparisons per pixel. Then, reject the characters that do not return three distinct colors, and increment the number of comparisons per pixel before trying again. Because all comparisons are cached, no redundant queries are performed.
Comparisons | Leaked Characters | Flag |
---|---|---|
2 | 10 | C?????????ky??e???????5?i????e4k?} |
3 | 20 | C??{??????ky_?L?_y3t_?t?l?l_?e4ky} |
4 | 24 | CT?{??_???ky_Re?U_y3t?5?i?l_le4ky} |
5 | 27 | CT?{??_?3?ky_Re?U_y3t?5till_le4ky} |
6 | 31 | CT?{?0_?3aky_ReLU_y3t_5till_le4ky} |
7 | 33 | CTF{n0_?3aky_ReLU_y3t_5till_le4ky} |
8 | 34 | CTF{n0_l3aky_ReLU_y3t_5till_le4ky} |
In total, this takes around 40 minutes to finish. It is still quite a long time, but definitely more reasonable than the initial strategy without any adjustments.
If you're familiar with assembly, you might know that the cmp
(compare) instruction is actually just a sub
(subtract) instruction, except the result is discarded. We can use a similar idea here.
The key insight is that we don't actually care what colors are in the image—we just want to read characters! Instead of trying to get an actual exact value, we can compare values. We can set the weight from one element to 1 like we did before, but we can then set the weight from another element to -1. When these are added together, we will have subtracted the two values.
With ReLU, this is extra nice—if the result of the subtraction is negative (so the second element is greater), then the output of ReLU is zero; if the result of the subtraction is positive (so the first element is greater), then the output of ReLU is nonzero. Then, all we need to do is set the weight coming from this value to something really big, so anything positive will definitely be outputted as a probability greater than 0.5.
Using this strategy, we can compare two of the channels for a single pixel at a time. For example, the image below shows 00.png
with a white pixel where red > green and a black pixel everywhere else:
This clearly shows a letter, but, for some images, this filter shows nothing. However, we can repeat this a total of three times to compare all the channels to each other5! We figured that at least one of these results would be legible for each character.
The best part about this is how fast it is:
- We only need to query three times per pixel, for a total of just
$16 * 16 * 3 = 768$ queries - Each query does not depend on any particular values (binary search would return different responses per image and require different queries for each one), so we can recover every image simultaneously, eliminating a factor of 346 from the runtime
- Each query does not depend on the response from the server, so we can send all the queries without waiting for a response at all, then collect all the results at the end, eliminating nearly all network latency (which is the primary source of delay)
def compare(r, c, c1, c2):
idx = r * 16 + c
io.sendline(b'0')
# output first of 4-vector
io.sendline(b'1')
io.sendline(b'2 0 0 69420')
# channel 1
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + c1} 0 1'.encode())
# channel 2
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + c2} 0 -1'.encode())
# read output
io.sendline(b'2')
for r, c in product(range(16), range(16)):
cmp = partial(compare, r, c)
cmp(0, 1)
cmp(1, 2)
cmp(2, 0)
def get_result():
io.recvuntil(b'The neural network sees: ')
return 1 - (np.array(list(parse_output(io.recvline()).encode())) // 0x3f)
out = np.zeros((16 * 3, 16 * length))
for r, c, h in tqdm(product(range(16), range(16), range(3)), desc='collecting results', total=768):
out[16 * h + r, c::16] = get_result() * 255
out[16 * h + r, c::16] = get_result() * 255
out[16 * h + r, c::16] = get_result() * 255
This code takes around 10 seconds to generate the following output:
This gives us CTF{n0_l3ak?_R?LU_?3t_5t?ll_?e?ky}
. During the CTF, this was enough to guess the entire flag (thank you Strellic).
We can combine the result of all three comparisons into a single image, then show only the least frequent value for each character.
out = np.zeros((16, 16 * length, 3))
for r, c, h in tqdm(product(range(16), range(16), range(3)), desc='collecting results', total=768):
out[r, c::16, h] = get_result()
def fix_img(img):
fix = np.zeros((16, 16))
values, counts = np.unique(img.reshape(-1, img.shape[2]), axis=0, return_counts=True)
least = values[np.argmin(counts)]
for r, c in zip(*np.where(np.all(img == least, axis=-1))):
fix[r, c] = 255
return fix
img = np.zeros((16, 16 * length))
for c in range(length):
idx = (slice(None), slice(c * 16, (c+1) * 16))
img[idx] = fix_img(out[idx])
This gives us almost all of the characters: CTF{n0_l3aky_R?LU_y3t_5till_le4ky}
. It is fairly easy to deduce that the single missing character is supposed to be e
, giving the final flag.
CTF{n0_l3aky_ReLU_y3t_5till_le4ky}
We can still do better, by utilizing something we did not consider during the CTF very much—biases! We'll use the same technique as before, but, since the bias is directly added, we'll be comparing unknown pixel values to known values instead of comparing two unknown pixel values.
We'll make use of all four hidden elements to divide the 0-255 range into five roughly equal pieces. Each of the four hidden elements will be nonzero after ReLU if the input is greater than 52, 104, 156, and 208.
Note that an element greater than 208 must also be greater than 156, 104, and 52 (similar logic applies to the other thresholds). Therefore, the only possible values of
input range |
|
---|---|
$\begin{bmatrix}0&0&0&0\end{bmatrix}$ | |
$\begin{bmatrix}+&0&0&0\end{bmatrix}$ | |
$\begin{bmatrix}+&+&0&0\end{bmatrix}$ | |
$\begin{bmatrix}+&+&+&0\end{bmatrix}$ | |
$\begin{bmatrix}+&+&+&+\end{bmatrix}$ |
To separate these five intervals into different observable outputs at the the Softmax layer, we'll connect each element in the hidden layer to a different output element, but also connect each element to the output element corresponding to the previous hidden layer element with a large negative weight. Essentially, this "turns off" each output when the next one "turns on." Using example values of 38, 69, 114, 200, and 247, we get (for the sake of brevity, five column vectors are stacked into a matrix):
Applying Softmax and classifying,
This is the desired result.
Overall, this strategy requires an identical amount of queries (and they are still predetermined so they can be sent without waiting for a response) as the one we used during the CTF7, so the runtime is similar (around 10 seconds). The benefit, however, is that all characters are successfully leaked without any guessing required.
I have certainly never used neural networks in this way before 🙃. Thanks to the Google CTF authors for another interesting challenge!
Here are full scripts that perform each of the solves outlined above.
import ast
from functools import cache
from itertools import count, product
from pathlib import Path
import cv2
import numpy as np
from pwn import args, context, log, process, remote
from tqdm import tqdm
def parse_output(output):
return ast.literal_eval(output.decode())
def connect():
if args.LOCAL:
io = process(['python3.10', 'server.py'])
else:
io = remote('ocr.2022.ctfcompetition.com', 1337)
io.recvuntil(b'\n ')
cmd = io.recvline().strip().decode()
with log.progress('solving pow'), context.local(log_level='error'):
pow_solve = process(['/bin/bash', '-c', cmd])
pow_solve.recvuntil(b'Solution: \n')
solution = pow_solve.recvline()
pow_solve.close()
io.send(solution)
return io
io = connect()
io.sendline(b'2')
io.recvuntil(b'The neural network sees: ')
length = len(parse_output(io.recvline()))
counter = 0
@cache
def compare(r, c, t):
global counter, io
counter += 1
if not args.LOCAL and counter > 400:
io.close()
io = connect()
counter = 0
idx = r * 16 + c
io.sendline(b'0')
# threshold
io.sendline(b'1')
io.sendline(f'2 0 0 {np.log(127) / t}'.encode())
# pixel
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + 0} 0 76.5'.encode())
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + 1} 0 150.45'.encode())
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + 2} 0 28.05'.encode())
# read output
io.sendline(b'2')
io.recvuntil(b'The neural network sees: ')
return 1 - (np.array(list(parse_output(io.recvline()).encode())) // 0x3f)
def get_pixel(img, r, c, maxcmp=10):
lo, hi = 0, 255
cmps = 0
while lo < hi:
mid = (lo + hi) // 2
if cmps == maxcmp:
return mid
if compare(r, c, mid)[img] == 1:
lo = mid + 1
else:
hi = mid
cmps += 1
return lo
def fix_img(img):
values, counts = np.unique(img, return_counts=True)
if len(values) != 3:
return None
least = values[np.argmin(counts)]
return np.array(img == least, dtype=np.uint8) * 255
done = set()
for cmps in count(2):
for img in tqdm(range(length), desc=f'using {cmps} comparisons'):
if img in done:
continue
fn = f'out_img/{img:02d}.png'
if Path(fn).is_file():
done.add(img)
continue
out = np.zeros((16, 16))
for r, c in tqdm(product(range(16), range(16)), desc=f'leaking img {img}', total=256):
out[r, c] = get_pixel(img, r, c, cmps)
out = fix_img(out)
if out is None:
continue
cv2.imwrite(fn, out)
done.add(img)
if len(done) == length:
break
import ast
from functools import partial
from itertools import product
import cv2
import numpy as np
from pwn import args, context, log, process, remote
from tqdm import tqdm
def parse_output(output):
return ast.literal_eval(output.decode())
def connect():
if args.LOCAL:
io = process(['python3.10', 'server.py'])
else:
io = remote('ocr.2022.ctfcompetition.com', 1337)
io.recvuntil(b'\n ')
cmd = io.recvline().strip().decode()
with log.progress('solving pow'), context.local(log_level='error'):
pow_solve = process(['/bin/bash', '-c', cmd])
pow_solve.recvuntil(b'Solution: \n')
solution = pow_solve.recvline()
pow_solve.close()
io.send(solution)
return io
io = connect()
io.sendline(b'2')
io.recvuntil(b'The neural network sees: ')
length = len(parse_output(io.recvline()))
def compare(r, c, c1, c2):
idx = r * 16 + c
io.sendline(b'0')
# output first of 4-vector
io.sendline(b'1')
io.sendline(b'2 0 0 69420')
# channel 1
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + c1} 0 1'.encode())
# channel 2
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + c2} 0 -1'.encode())
# read output
io.sendline(b'2')
for r, c in product(range(16), range(16)):
cmp = partial(compare, r, c)
cmp(0, 1)
cmp(1, 2)
cmp(2, 0)
def get_result():
io.recvuntil(b'The neural network sees: ')
return 1 - (np.array(list(parse_output(io.recvline()).encode())) // 0x3f)
out = np.zeros((16, 16 * length, 3))
for r, c, h in tqdm(product(range(16), range(16), range(3)), desc='collecting results', total=768):
out[r, c::16, h] = get_result()
def fix_img(img):
fix = np.zeros((16, 16))
values, counts = np.unique(img.reshape(-1, img.shape[2]), axis=0, return_counts=True)
least = values[np.argmin(counts)]
for r, c in zip(*np.where(np.all(img == least, axis=-1))):
fix[r, c] = 255
return fix
img = np.zeros((16, 16 * length))
for c in range(length):
idx = (slice(None), slice(c * 16, (c+1) * 16))
img[idx] = fix_img(out[idx])
cv2.imwrite('solve.png', img)
import ast
from itertools import product
import cv2
import numpy as np
from pwn import args, context, log, process, remote
from tqdm import tqdm
def parse_output(output):
return ast.literal_eval(output.decode())
def connect():
if args.LOCAL:
io = process(['python3.10', 'server.py'])
else:
io = remote('ocr.2022.ctfcompetition.com', 1337)
io.recvuntil(b'\n ')
cmd = io.recvline().strip().decode()
with log.progress('solving pow'), context.local(log_level='error'):
pow_solve = process(['/bin/bash', '-c', cmd])
pow_solve.recvuntil(b'Solution: \n')
solution = pow_solve.recvline()
pow_solve.close()
io.send(solution)
return io
io = connect()
io.sendline(b'2')
io.recvuntil(b'The neural network sees: ')
length = len(parse_output(io.recvline()))
# setup constant weights
io.sendline(b'0')
io.sendline(b'1')
io.sendline(b'1 0 -52')
io.sendline(b'1')
io.sendline(b'1 1 -104')
io.sendline(b'1')
io.sendline(b'1 2 -156')
io.sendline(b'1')
io.sendline(b'1 3 -208')
io.sendline(b'1')
io.sendline(b'2 0 0 69')
io.sendline(b'1')
io.sendline(b'2 1 0 -420')
io.sendline(b'1')
io.sendline(b'2 1 1 69')
io.sendline(b'1')
io.sendline(b'2 2 1 -420')
io.sendline(b'1')
io.sendline(b'2 2 2 69')
io.sendline(b'1')
io.sendline(b'2 3 2 -420')
io.sendline(b'1')
io.sendline(b'2 3 3 69')
def query(r, c, h):
idx = r * 16 + c
# set
for i in range(4):
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + h} {i} 255'.encode())
# read output
io.sendline(b'2')
# unset
for i in range(4):
io.sendline(b'1')
io.sendline(f'0 {idx * 3 + h} {i} 0'.encode())
for r, c, h in product(range(16), range(16), range(3)):
query(r, c, h)
def get_result():
io.recvuntil(b'The neural network sees: ')
return np.array(list(parse_output(io.recvline()).encode()))
out = np.zeros((16, 16 * length, 3))
for r, c, h in tqdm(product(range(16), range(16), range(3)), desc='collecting results', total=768):
out[r, c::16, h] = get_result()
def fix_img(img):
fix = np.zeros((16, 16))
values, counts = np.unique(img.reshape(-1, img.shape[2]), axis=0, return_counts=True)
least = values[np.argmin(counts)]
for r, c in zip(*np.where(np.all(img == least, axis=-1))):
fix[r, c] = 255
return fix
img = np.zeros((16, 16 * length))
for c in range(length):
idx = (slice(None), slice(c * 16, (c+1) * 16))
img[idx] = fix_img(out[idx])
cv2.imwrite('solve.png', img)
Footnotes
-
Without a nonlinear activation function, the entire network (with all the layers composed) just becomes a linear transformation, which is not enough to learn anything that is not linearly separable. There are also other reasons to use nonlinear activation functions, but this is an important one. ↩
-
How would we tell the difference between a real
?
and an "unknown"?
? Literally unusable. ↩ -
Additionally, this is easily parallelizable—just find as many people as there are characters in the flag and have each person run the same script for a different image. At DiceGang, this is colloquially referred to as DiceGrid. ↩
-
Simply averaging the three channels would work, but a common formula is $0.30r + 0.59g + 0.11b$. ↩
-
There are three ways to pick two channels, and comparing it one direction is the same as comparing it the other direction—the result is simply inverted. Of course, this does not account for the case where both channels are equal, but we thought that was pretty unlikely anyway. ↩
-
Okay, it's less than 34 because of the caching and adjustments explained above. ↩
-
This is purely coincidental, and it is because the images have three channels and $\binom{3}{2} = 3$. If the images had four channels, then the subtraction method would require $\binom{4}{2} = 6$ queries per pixel, while the bias method would only require 4. ↩