Futuredisk 1 - 22 solves
I'm from the year 2123. Here's what I did:
- Mounted my 10 exabyte flash drive
- fallocate -l 8E haystack.bin
- dd if=flag.txt bs=1 seek=[REDACTED] conv=notrunc of=haystack.bin
- gzip haystack.bin
- Put haystack.bin.gz on my web server for you to download
HTTP over Time Travel is a bit slow, so I hope gzipping it made it a little faster to download :)
Author: kuilin
Category: Web
Overview
We are given a server serving a massive gzip compressed file, which(when decompressed) contains null bytes and the flag stored at some unknown offset.
Exploration
For starters it's worth looking at the headers returned by the server, especially since the challenge is tagged as "web".
Let's try curling the file:
~ curl -v 'https://futuredisk-web.chal.uiuc.tf/haystack.bin.gz' -o /dev/null
[...]
< HTTP/2 200
< accept-ranges: bytes
< content-type: application/octet-stream
< date: Sat, 08 Jul 2023 18:02:49 GMT
< etag: "1209f04b4-7fffffffffffffff"
< last-modified: Sat, 12 Jun 2123 16:07:16 GMT
< server: nginx/1.23.1
< content-length: 9223372036854775807
<
there is some headers here I find interesting, namely:
-
last-modified: Sat, 12 Jun 2123 16:07:16 GMT
- not really helpful, but I appreciate that the year checks out -
accept-ranges: bytes
this header is used by the web-server to indicate to the client that it supports "partial" requests, meaning we can use the
range
header to specify a part of the file that we want to download.For example:
curl -r "9223372036854775707-9223372036854775807" 'https://futuredisk-web.chal.uiuc.tf/haystack.bin.gz'
will download the last 100 bytes of the file. -
etag: "1209f04b4-7fffffffffffffff"
The etag header is used to identify if a file has been changed, and the e-tag generation algorithm is left up to the server. Maybe the way nginx generates it is in some way reveals useful information to us? Unfortunately the nginx etag is made up of the last-modified time and the content-length of the file, both of which are already known, and aren't really helpful.
Gzip
After trying a few dumb approaches it became obvious to me that we can't solve this without at least taking a look at the file itself. Quickly skimming the wikipedia page for "gzip" we find that it's mainly a thin wrapper over DEFLATE, with a 10 byte header, containing a magic number (0x1f 0x8b), the compression method (always 0x08 for DEFLATE), 1-byte of header flags(null in our case, indicating that the gzip wrapper contains not extra information), a 4-byte timestamp, compression flags(indicating the level of compression applied, either 2(slowest) or 4(fastest)), the operating system ID, and an an 8-byte trailer, containing a CRC-32 checksum and the length of the original uncompressed data(modulo 232).
For a second I've considered if we could use the CRC-32 checksum to find the flag or maybe deduce it's position, but after further investigation I've rejected that idea as implausible.
DEFLATE
The body of a gzip file contains the real star of the show: DEFLATE. Deflate is a lossless compression algorithm using a bunch of smart math that I don't fully understand.
What's important for the purpouses of this challenge is that deflate compresses the data in two steps:
-
First it replaces duplicate strings with references to data at most 32KiB behind. In our case the previous data will just be null bytes(or the flag, but we can ignore that)
-
It then divides the data into blocks, either storing the uncompressed data, or compressing it using huffman coding.
- Huffman coding uses something called huffman trees, which can be either static(defined by the deflate standard), or dynamic(defined as a part of the header. We don't need to understand this as a part of this challenge
Interestingly since the decompressed contents of previous blocks are(for our purpouses) static that means the value of any one block doesn't depend on the value of a previous block. That means that if we can figure out where a block starts we can decompress it
Code!
Now that we more or less understand what's going on in the file we can try to play around with it a bit more.
Let's search for a "simple python deflate implementation", and download the first github link: https://github.com/nayuki/Simple-DEFLATE-decompressor/blob/master/python/deflatedecompress.py
** Note **: This section features some modifications being made to the decompressor. They are for completeness, should you want to follow along and can be skipped.
Now we can download a sample from the file and see what happens when we try to decompress it.
curl -r 0-131072 'https://futuredisk-web.chal.uiuc.tf/haystack.bin.gz' -o haystack.part.bin
128 KiB ought to be enough for anyone
import deflatedecompress
with open("./haystack.part.bin", "rb") as f:
# Skip the gzip header
f.seek(10)
inp = deflatedecompress.BitInputStream(f)
decompressed = deflatedecompress.Decompressor.decompress_to_bytes(inp)
print(decompressed)
Traceback (most recent call last):
[...]
EOFError: Unexpected end of stream
...right, we don't have the complete file, and decompressor tries to decompress the whole thing at once.
Let's modify the decompressor to allow us to process the stream one block at a time:
diff --git a/deflatedecompress.py b/deflatedecompress.py
index feda42e..10fb7e0 100644
--- a/deflatedecompress.py
+++ b/deflatedecompress.py
@@ -293,26 +293,25 @@ class Decompressor:
self._output = out
self._dictionary = ByteHistory(32 * 1024)
- # Process the stream of blocks
- while True:
- # Read the block header
- isfinal: bool = self._input.read_uint(1) != 0 # bfinal
- type: int = self._input.read_uint(2) # btype
-
- # Decompress rest of block based on the type
- if type == 0:
- self._decompress_uncompressed_block()
- elif type == 1:
- self._decompress_huffman_block(Decompressor._FIXED_LITERAL_LENGTH_CODE, Decompressor._FIXED_DISTANCE_CODE)
- elif type == 2:
- litlencode, distcode = self._decode_huffman_codes()
- self._decompress_huffman_block(litlencode, distcode)
- elif type == 3:
- raise ValueError("Reserved block type")
- else:
- assert False, "Unreachable value"
- if isfinal:
- break
+ def process_block(self):
+ # Read the block header
+ isfinal: bool = self._input.read_uint(1) != 0 # bfinal
+ type: int = self._input.read_uint(2) # btype
+
+ # Decompress rest of block based on the type
+ if type == 0:
+ self._decompress_uncompressed_block()
+ elif type == 1:
+ self._decompress_huffman_block(Decompressor._FIXED_LITERAL_LENGTH_CODE, Decompressor._FIXED_DISTANCE_CODE)
+ elif type == 2:
+ litlencode, distcode = self._decode_huffman_codes()
+ self._decompress_huffman_block(litlencode, distcode)
+ elif type == 3:
+ raise ValueError("Reserved block type")
+ else:
+ assert False, "Unreachable value"
+ if isfinal:
+ raise EOFError()
Let's see now:
import deflatedecompress
import io
out = io.BytesIO()
with open("./haystack.part.bin", "rb") as f:
# Skip the gzip header
f.read(10)
inp = deflatedecompress.BitInputStream(f)
dec = deflatedecompress.Decompressor(inp, out)
dec.process_block()
output = out.getvalue()
assert b"\x00"*len(output) == output
~ python solve.py
[no output]
~
yay!
Let's try to see if we can notice something interesting about the blocks:
import deflatedecompress
import io
out = io.BytesIO()
f = open("./haystack.part.bin", "rb")
# Skip the gzip header
f.read(10)
inp = deflatedecompress.BitInputStream(f)
dec = deflatedecompress.Decompressor(inp, out)
while True:
pos_pre = f.tell()
out_pre = out.tell()
dec.process_block()
pos_after = f.tell()
out_post = out.tell()
print(f"block: {hex(pos_pre)}, length: {hex(pos_after-pos_pre)}, decompressed_len: {hex(out_post-out_pre)}")
output = out.getvalue()
assert b"\x00"*len(output) == output
~ python solve.py
block: 0xa, length: 0x200e, decompressed_len: 0x80fcfc
block: 0x2018, length: 0x200c, decompressed_len: 0x80fefe
block: 0x4024, length: 0x200c, decompressed_len: 0x80fefe
block: 0x6030, length: 0x200c, decompressed_len: 0x80fefe
block: 0x803c, length: 0x200d, decompressed_len: 0x80fefe
block: 0xa049, length: 0x200c, decompressed_len: 0x80fefe
block: 0xc055, length: 0x200c, decompressed_len: 0x80fefe
block: 0xe061, length: 0x200c, decompressed_len: 0x80fefe
block: 0x1006d, length: 0x200d, decompressed_len: 0x80fefe
block: 0x1207a, length: 0x200c, decompressed_len: 0x80fefe
block: 0x14086, length: 0x200c, decompressed_len: 0x80fefe
block: 0x16092, length: 0x200c, decompressed_len: 0x80fefe
block: 0x1809e, length: 0x200d, decompressed_len: 0x80fefe
block: 0x1a0ab, length: 0x200c, decompressed_len: 0x80fefe
block: 0x1c0b7, length: 0x200c, decompressed_len: 0x80fefe
Traceback (most recent call last):
[...]
EOFError: Unexpected end of stream
...interesting... very very interesting
So the first block has a length of 0x200e, presumably because of the afformentioned DEFLATE string deduplication not being applied,
but after that the blocks seems to follow a pattern of: 0x200c,0x200c,0x200c,0x200d
...
where does the 0x200d come from? I guess it could just be a quirk of the deflate implementation? Something feels off though
After reading some more I noticed that deflate headers don't have to be byte aligned!
Let's modify the BitInputStream class to add a function for checking our bit position.
diff --git a/deflatedecompress.py b/deflatedecompress.py
index 10fb7e0..c4a1e73 100644
--- a/deflatedecompress.py
+++ b/deflatedecompress.py
@@ -44,6 +44,9 @@ class BitInputStream:
"""Returns the current bit position, which ascends from 0 to 7 as bits are read."""
assert 0 <= self._num_bits_remaining <= 7, "Unreachable state"
return -self._num_bits_remaining % 8
+
+ def tell_bits(self):
+ return self._input.tell()*8+self.get_bit_position()
def read_bit_maybe(self) -> int:
import deflatedecompress
import io
out = io.BytesIO()
f = open("./haystack.part.bin", "rb")
# Skip the gzip header
f.read(10)
inp = deflatedecompress.BitInputStream(f)
dec = deflatedecompress.Decompressor(inp, out)
while True:
pos_pre = f.tell_bits() # tell -> tell_bits()
out_pre = out.tell()
dec.process_block()
pos_after = f.tell_bits() # tell -> tell_bits()
out_post = out.tell()
print(f"block: {hex(pos_pre)}, length: {hex(pos_after-pos_pre)}, decompressed_len: {hex(out_post-out_pre)}")
output = out.getvalue()
assert b"\x00"*len(output) == output
~ python solve.py
block: 0x50, length: 0x10071, decompressed_len: 0x80fcfc
block: 0x100c1, length: 0x10062, decompressed_len: 0x80fefe
block: 0x20123, length: 0x10062, decompressed_len: 0x80fefe
block: 0x30185, length: 0x10062, decompressed_len: 0x80fefe
block: 0x401e7, length: 0x10062, decompressed_len: 0x80fefe
block: 0x50249, length: 0x10062, decompressed_len: 0x80fefe
block: 0x602ab, length: 0x10062, decompressed_len: 0x80fefe
block: 0x7030d, length: 0x10062, decompressed_len: 0x80fefe
block: 0x8036f, length: 0x10062, decompressed_len: 0x80fefe
block: 0x903d1, length: 0x10062, decompressed_len: 0x80fefe
block: 0xa0433, length: 0x10062, decompressed_len: 0x80fefe
block: 0xb0495, length: 0x10062, decompressed_len: 0x80fefe
block: 0xc04f7, length: 0x10062, decompressed_len: 0x80fefe
block: 0xd0559, length: 0x10062, decompressed_len: 0x80fefe
block: 0xe05bb, length: 0x10062, decompressed_len: 0x80fefe
now this makes more sense.
Finding the flag
At this point an idea started to emerge. Since all of the non-flag blocks are going to be the same when uncompressed, it would make the sense for them to be the same when compressed, since one compressed block doesn't depend on any other blocks.
So neatly arranged! That's all cool and good but what happens when the flag shows up? Presumably it can't be compressed as well as the null bytes have been and the block is gonna take up more space
After the block containing the flag appears all the following blocks will be slightly offset, from where we expect them to be.
This means we can do a binary search to find out the position of the flag, by simply checking whether a block is at at the offset we expect it to be.
Solve
Let's write a simple function that tells us what block an address belongs to.
def addr_to_block(bit_addr):
LENGTH = 0x10062
# First block of length 0x10062
first_block=0x100c1
bit_addr = bit_addr-first_block
block=bit_addr - (bit_addr%LENGTH)
return block + first_block
>>> hex(addr_to_block(0x7030d))
'0x7030d'
>>> hex(addr_to_block(0x7030d+5))
'0x7030d'
>>> hex(addr_to_block(0x7090d+5))
'0x7030d'
>>> hex(addr_to_block(0x9090d+5))
'0x903d1'
That checks out!
Now we need some way for the python code to easily access an arbitrary offset into the haystack file.
I tried using httpdirfs, but after that didn't work I wrote my own implementation of binaryIO.
import pickle
from urllib.parse import quote_plus
import requests
from typing import BinaryIO, Optional
class HTTPBinaryIO(BinaryIO):
url: str
sess: requests.Session
pos: int
_block_size: int
_cache: dict[int, bytes]
def __init__(self, url):
self.url = url
self.sess = requests.Session()
self.pos = 0
self._block_size = 1024*10
self._cache = {}
self._load_cache_from_file()
def _save_cache_to_file(self):
with open(f"./_cache.{quote_plus(self.url)}.pick", "wb") as f:
pickle.dump(self._cache, f)
def _load_cache_from_file(self):
try:
with open(f"./_cache.{quote_plus(self.url)}.pick", "rb") as f:
self._cache = pickle.load(f)
except:
pass
def _fetch_range(self, start: int, end: int):
range_header = f"{start}-{end-1}"
headers={"range": f"bytes={range_header}"}
res = self.sess.get(self.url, headers=headers)
assert res.status_code == 206, f"status code is {res.status_code}, not 206"
return res.content
def _fetch_block(self, block_start: int):
if data := self._cache.get(block_start):
return data
block_end = block_start+self._block_size
res = self._fetch_range(block_start,block_end)
assert len(res) == self._block_size, f"{len(res)} != {self._block_size}"
self._cache[block_start] = res
self._save_cache_to_file()
return res
def _get_range(self, start: int, end: int) -> Optional[bytes]:
block_start = start - (start % self._block_size)
block_end = (end - (end % self._block_size)) + self._block_size
res = b""
for cur_block_start in range(block_start, block_end+1, self._block_size):
res += self._fetch_block(cur_block_start)
return res[start-block_start:end-block_start]
def read(self, size):
start = self.pos
end = self.pos + size
res = self._get_range(start, end)
self.pos = end
return res
def seek(self, pos):
self.pos = pos
def tell(self):
return self.pos
We also need a way to seek into a bit offset. We can once again modify the BitInputStream to include a seek_bits function.
diff --git a/deflatedecompress.py b/deflatedecompress.py
index 09c7f6a..2e6e7c5 100644
--- a/deflatedecompress.py
+++ b/deflatedecompress.py
@@ -47,6 +47,14 @@ class BitInputStream:
def tell_bits(self):
return self._input.tell()*8+self.get_bit_position()
+
+ def seek_bits(self, loc):
+ byte = loc >> 3
+ bit = loc & 0b111
+
+ self._input.seek(byte-1)
+ self._current_byte = int.from_bytes(self._input.read(1), byteorder='big')
+ self._num_bits_remaining = 8-bit
def read_bit_maybe(self) -> int:
Let's try it!
import deflatedecompress
import io
import http_binary_io
def addr_to_block(bit_addr):
LENGTH = 0x10062
# First block of length 0x10062
first_block=0x100c1
bit_addr = bit_addr-first_block
block=bit_addr - (bit_addr%LENGTH)
return block + first_block
out = io.BytesIO()
f = http_binary_io.HTTPBinaryIO("https://futuredisk-web.chal.uiuc.tf/haystack.bin.gz")
# Skip the gzip header
f.seek(10)
inp = deflatedecompress.BitInputStream(f)
dec = deflatedecompress.Decompressor(inp, out)
# Process the first block to fill out the backtrack dictionary
dec.process_block()
inp.seek_bits(addr_to_block(0x133713371337))
dec.process_block()
output = out.getvalue()
assert b"\x00"*len(output) == output
~ python solve.py
[no output]
it works 🎉
Time to put it all together with a binary search
import deflatedecompress
import io
import http_binary_io
def addr_to_block(bit_addr):
LENGTH = 0x10062
# First block of length 0x10062
first_block=0x100c1
bit_addr = bit_addr-first_block
block=bit_addr - (bit_addr%LENGTH)
return block + first_block
def decompress_block_at(pos):
f = http_binary_io.HTTPBinaryIO("https://futuredisk-web.chal.uiuc.tf/haystack.bin.gz")
# Skip the gzip header
f.seek(10)
inp = deflatedecompress.BitInputStream(f)
out = io.BytesIO()
dec = deflatedecompress.Decompressor(inp, out)
# Process the first block to fill out the backtrack dictionary
dec.process_block()
inp.seek_bits(pos)
dec.process_block()
output = out.getvalue()
return output
def check_block(pos):
try:
output = decompress_block_at(pos)
assert output == len(output)*b"\x00"
return True
except Exception:
return False
def find_flag_block():
# Perform a binary search
low = 0
high = 9223372036854775807*8
while True:
checked_addr = addr_to_block((high - low) // 2 + low)
if checked_addr == low or checked_addr == high:
return high
print(f"Checking: {checked_addr}, range: {low} - {high} ( {high-low} bits )")
if check_block(checked_addr):
low = checked_addr
else:
high = checked_addr
flag_block = find_flag_block()
print(f"Flag is at {flag_block}")
print(decompress_block_at(flag_block).strip(b"\x00").decode("utf8"))
~ python solve.py
Checking: 18446744073709465031, range: 0 - 36893488147419061235 ( 36893488147419061235 bits ) [1/57]
Checking: 27670116110564263133, range: 18446744073709465031 - 36893488147419061235 ( 18446744073709596204 bits )
Checking: 23058430092136831265, range: 18446744073709465031 - 27670116110564263133 ( 9223372036854798102 bits )
Checking: 25364273101350547199, range: 23058430092136831265 - 27670116110564263133 ( 4611686018427431868 bits )
[snip]
Checking: 23819729546392263779, range: 23819729546392132511 - 23819729546392395047 ( 262536 bits )
Checking: 23819729546392329413, range: 23819729546392263779 - 23819729546392395047 ( 131268 bits )
Flag is at 23819729546392329413
uiuctf{binary search means searching a binary stream, right :D}