Skip to content

Instantly share code, notes, and snippets.

What would you like to do?

Problems with pehash implementations

I've started to add a pehash implementation to YARA. I decided to base my implementation on the description in the paper and only use the totalhash and viper implementations for comparing results. In doing so I've noticed some problems, and it is unclear who is right.

Totalhash implementation

For starters let's take a look at running the implementation from totalhash against a binary.

wxs@psh Desktop % shasum 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
a0dd89c2bba615dab45da0d036d7ddbd9ae7cca0  4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
wxs@psh Desktop % python ./ 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
wxs@psh Desktop % 

Compare the output of that against the same sample in the totalhash database and you can see a difference in the pehash value.

Here's another example of it:

wxs@psh Desktop % shasum 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
781596de17a08fb466c22dbffd0e5b1195f5f863  4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
wxs@psh Desktop % python ./ 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
wxs@psh Desktop % 

And here's the same file on totalhash. The only thing I can conclude so far is that one of the two implementations is wrong. Either the published code is wrong or the code being used by their backend is wrong.

On more problem with the published implementation on totalhash: it doesn't support 64bit binaries (PE32+). I sent them patches to fix it a long time ago and they were ignored. I got the sense that the implementation on totalhash is abandoned, so I went looking for an alternative implementation.

viper implementation

I came across this implementation in the viper project. I downloaded their code and made it a stand-alone python script (removed the few viper specific things) and ran it against the same two samples. Here's the output:

wxs@psh Desktop % python ./ 4180ee367740c271e05b3637ee64619fb9fe7b1d2b28866e590e731b9f81de36
wxs@psh Desktop % python ./ 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
wxs@psh Desktop %

These two binaries match the output on the totalhash website, which is good. Also, based upon this commit it appears the viper implementation is able to handle PE32+ files. Looks like I may have found a good reference implementation to test against.

Almost but not quite right

When I started to compare my implementation against the viper and totalhash website values I was unable to get it to work. To figure out what is going on I looked closer at both implementations. I noticed that the viper implementation does this:

            address = section.VirtualAddress
            size = section.SizeOfRawData
            raw = exe.write()[address+size:]
            if size == 0: 
                kolmog = bitstring.BitArray(float=1, length=32)
            bz2_raw = bz2.compress(raw)
            bz2_size = len(bz2_raw)
            #k = round(bz2_size / size, 5)
            k = bz2_size / size
            kolmog = bitstring.BitArray(float=k, length=32)

I instrumented that code to print the section number, name and also the length of raw. This is the result when run against 4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576:

section 0 .text
len(raw): 8192
section 1 .data
len(raw): 4096
section 2 .rsrc
len(raw): 0

The first section actually has a size (not virtual size) of 36864. This means it is not compressing the entire section. I think it should be using section.get_data() instead of exe.write(). When I made the corresponding change I got sizes that match the size of the raw data for the section:

wxs@psh Desktop % python ./ ~/Desktop/4eb4f00222e8930cab2dfa5fbaa9ad03a6bda06817b41e07f0796586af20c576
section 0 .text
len(raw): 36864
section 1 .data
len(raw): 4096
section 2 .rsrc
len(raw): 4096
wxs@psh Desktop % 

However, upon another reading of the paper I noticed that the ratio is scaled from 0-1 up to 0-7. The python implementations just take the low order byte from the ratio and don't scale at all. I can see why this was done because I made the same mistake upon first reading the paper as the notation used is similar.

Given that both python implementations do not scale the result I believe both implementations are wrong, with respect to the paper, and that the viper implementation is at least correct with respect to the one used by totalhash (the one they published is still wrong though).

I'm currently trying to obtain a result from the reference implementation discussed in the paper or the implementation itself so I have a way to generate correct results for testing. If anyone knows a way I can get my hands on either of those I would love to talk to you!

Copy link

AnyMaster commented May 9, 2015

Hello, you are right, code from totalhash is BAD, also as from viper, based on it...

  • it skip high order bits instead of low order
  • It add up to 7 zero bits at the end in case if hash component is not in byte boundary
  • Kolmogorov Complexity have to be converted # (0..1} ∈ R -> [0..7] ⊂ N , i.e. it have be 0,1,2,3,4,5,6,7 but not some part of float type bits structure (sign and exponent)
  • etc....

Check this one:

Good luck :)

Copy link

knowmalware commented Jun 4, 2016

I've started tracking different implementations (4 so far) into a single project and normalizing calling them (same args across all):
I also noticed that AnyMaster has attempted to improve upon the original (have not yet included it into my project; want to sanity test first):

Copy link

wxsBSD commented Jun 6, 2016

Sorry for the late reply, but I didn't get a notification that there were comments here. ;)

I just took a quick look at your implementation and noticed you've addressed the findings I have documented here. I'm going to take a closer look at it and also see how well it clusters some packed samples I have laying around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment