Problems with pehash implementations
I've started to add a pehash implementation to YARA. I decided to base my implementation on the description in the paper and only use the totalhash and viper implementations for comparing results. In doing so I've noticed some problems, and it is unclear who is right.
For starters let's take a look at running the pehash.py implementation from totalhash against a binary.
wxs@psh Desktop % shasum 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36 a0dd89c2bba615dab45da0d036d7ddbd9ae7cca0 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36 wxs@psh Desktop % python ./pehash.py 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36 32f9476ee27bdff3bf11d6f9765d9b3e721bc784 wxs@psh Desktop %
Compare the output of that against the same sample in the totalhash database and you can see a difference in the pehash value.
Here's another example of it:
wxs@psh Desktop % shasum 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576 781596de17a08fb466c22dbffd0e5b1195f5f863 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576 wxs@psh Desktop % python ./pehash.py 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576 ffa9ec5c2df4621397e1a0151560cf021e58feb7 wxs@psh Desktop %
And here's the same file on totalhash. The only thing I can conclude so far is that one of the two implementations is wrong. Either the published code is wrong or the code being used by their backend is wrong.
On more problem with the published implementation on totalhash: it doesn't support 64bit binaries (PE32+). I sent them patches to fix it a long time ago and they were ignored. I got the sense that the implementation on totalhash is abandoned, so I went looking for an alternative implementation.
I came across this implementation in the viper project. I downloaded their code and made it a stand-alone python script (removed the few viper specific things) and ran it against the same two samples. Here's the output:
wxs@psh Desktop % python ./viper_pehash.py 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36 c8e405e2d686d79a0eae5d14f513ee30b06c1213 wxs@psh Desktop % python ./viper_pehash.py 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576 75e0fcab7a4de7c4891f91d729ba9aaeca442e2f wxs@psh Desktop %
These two binaries match the output on the totalhash website, which is good. Also, based upon this commit it appears the viper implementation is able to handle PE32+ files. Looks like I may have found a good reference implementation to test against.
Almost but not quite right
When I started to compare my implementation against the viper and totalhash website values I was unable to get it to work. To figure out what is going on I looked closer at both implementations. I noticed that the viper implementation does this:
address = section.VirtualAddress size = section.SizeOfRawData raw = exe.write()[address+size:] if size == 0: kolmog = bitstring.BitArray(float=1, length=32) pehash_bin.append(kolmog[0:7]) continue bz2_raw = bz2.compress(raw) bz2_size = len(bz2_raw) #k = round(bz2_size / size, 5) k = bz2_size / size kolmog = bitstring.BitArray(float=k, length=32) pehash_bin.append(kolmog[0:7])
I instrumented that code to print the section number, name and also the length of
raw. This is the result when run against
section 0 .text len(raw): 8192 section 1 .data len(raw): 4096 section 2 .rsrc len(raw): 0
The first section actually has a size (not virtual size) of 36864. This means it is not compressing the entire section. I think it should be using
section.get_data() instead of
exe.write(). When I made the corresponding change I got sizes that match the size of the raw data for the section:
wxs@psh Desktop % python ./viper_pehash.py ~/Desktop/4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576 section 0 .text len(raw): 36864 section 1 .data len(raw): 4096 section 2 .rsrc len(raw): 4096 4439048cb913db7c193e0be2d9da763f64719a0d wxs@psh Desktop %
However, upon another reading of the paper I noticed that the ratio is scaled from 0-1 up to 0-7. The python implementations just take the low order byte from the ratio and don't scale at all. I can see why this was done because I made the same mistake upon first reading the paper as the notation used is similar.
Given that both python implementations do not scale the result I believe both implementations are wrong, with respect to the paper, and that the viper implementation is at least correct with respect to the one used by totalhash (the one they published is still wrong though).
I'm currently trying to obtain a result from the reference implementation discussed in the paper or the implementation itself so I have a way to generate correct results for testing. If anyone knows a way I can get my hands on either of those I would love to talk to you!