Template for a simple, fast searcher. It just reads from stdin (optionally starting from a specified offset) and searches for a hardcoded string (or more complex criteria) and prints out the offsets of the occurences as a line to stdout. The program will stop searching if an I/O error is found, though that's easy to change.
This can work with huge files correctly. It's zero-dependency
and should build and run in any UNIX. Make sure to build with -O3
!
Things you can adjust:
-
Search match function: this function should check for an occurency at the beginning of
data
, and print the offset if there is. -
Lookahead: guarantee that at least N bytes are available after the byte pointed by
data
. For example, if lookahead is set to 2, thendata[0]
,data[1]
anddata[2]
point to valid data. If it's set to 0, then onlydata[0]
should be accessed by the search match function. Don't set this higher than what you need to read, or you'll miss occurences near to the end of the file. -
Start offset: Start searching from that offset in file. This won't work if stdin is a pipe or any other non-seekable file; in that case, it should be left as zero.
-
Block size: Size of each call to
read
. It should be a power of two, such as 512, 1024, 2048, 4096. Only affects performance and memory usage. The default should be fine.
For better performance, the search match function should start by checking for uncommon patterns, and return as soon as it knows there is no occurency in that offset.
Example: Git commits stored in repos typically start with
0x78 0x01 0x4B 0xCE 0xCF 0xCD 0xCD 0x2C 0x51
Our function could start by checking that data[0]
is 0x78, but
that's an 'x' so it's a very common character to find in an HDD dump.
Instead, it's better to check that data[3]
is 0xCE first, then
check data[4]
is 0xCF, and the same with data[5]
. If these tests
all succeed, the probability of having found a Git commit is high,
and we can now check the rest of the bytes to be sure.
Instead of constantly checking start < end
, one could just append
an occurency to the end of the data. This alone would make the search
up to 67% faster.
The code could be modified to have the buffer aligned, and compare shorts, or quads, instead of single bytes. This would, if done correctly, give much better performance, since a lot less offsets would now pass the first check.