Last active March 29, 2021 15:46
YARA, now with more Math(TM)! (Thanks @alexcpsec)


I'd like to explain some of the new things I've added to YARA which will be in the next release. This is in addition to the stuff I've written about here, which is already in 3.2.0. If you have not read that, I suggest you start there, as it ties in nicely with some of the things I'm going to mention here. Lastly, some of these things are not yet merged into master, but I expect them to be very soon.

Math Module

There is a new module in YARA called math. The intention of this module is to expose some functions which you can use in your rules to calculate specific properties.


In particular it provides these functions for calculating different values:

  • entropy
    • entropy(offset, length)
    • entropy(string)
  • monte_carlo_pi
    • monte_carlo_pi(offset, length)
    • monte_carlo_pi(string)
  • serial_correlation
    • serial_correlation(offset, length)
    • serial_correlation(string)
  • mean
    • mean(offset, length)
    • mean(string)

Each of the above functions can be called in two ways. The first is with an offset and length pair. The second is by passing a string. A string here can be a normal string literal like "Hello World", but also any string that is exposed via a YARA module. Each function serves a different purpose, and came directly from here. Please read that to get an understanding of what each function calculates, but note that the monte_carlo_pi function returns the percent error from Pi, not the exact value that was calculated.
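
To make the two calling conventions concrete, here is a minimal sketch that uses both forms of entropy in one rule. The rule name, the 256 byte window, and the thresholds are arbitrary values picked for illustration, not recommendations.

```yara
import "math"

// Hypothetical rule: name, window size, and thresholds are illustrative only.
rule entropy_call_forms {
  condition:
    // Offset/length form: entropy of the first 256 bytes of the file.
    filesize >= 256 and
    math.entropy(0, 256) > 6.5 and
    // String form: entropy of a literal string. A short piece of English
    // text like this has low entropy, so this comparison holds.
    math.entropy("Hello World") < 4.0
}
```

The other three functions (monte_carlo_pi, serial_correlation, and mean) accept the same two argument shapes.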

This leaves the two remaining functions:

  • deviation
    • deviation(offset, length, mean)
    • deviation(string, mean)
  • in_range
    • in_range(test, lower, upper)

The deviation function calculates the average absolute deviation of the provided data from the mean you pass in. To make things easier we have provided a constant called MEAN_BYTES which has the value 127.5 and can be used in the calculation if necessary.

The in_range function is an inclusive range check. It will return true if the test value is between lower and upper inclusive, and false otherwise.
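
As a small sketch of in_range on its own, this rule checks whether the mean byte value of a file sits close to the middle of the byte range. The rule name and the bounds 120.0 and 135.0 are assumptions chosen for illustration. The bounds are written as floating point literals to match the floating point value returned by mean.

```yara
import "math"

// Hypothetical rule: the bounds are illustrative, not tuned values.
rule mean_near_center {
  condition:
    // True when 120.0 <= mean <= 135.0 (both ends inclusive).
    math.in_range(math.mean(0, filesize), 120.0, 135.0)
}
```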


Let's put this together to see some of these in action:

import "math"

// The entropy of a file is > 7
// The monte_carlo_pi percentage error is < 0.07
// The serial correlation is less than 0.2
// The deviation from the mean of 127.5 is between 63.9 and 64.1 inclusive
rule random_test {
  condition:
    math.entropy(0, filesize) > 7 and
    math.monte_carlo_pi(0, filesize) < 0.07 and
    math.serial_correlation(0, filesize) < 0.2 and
    math.in_range(math.deviation(0, filesize, math.MEAN_BYTES), 63.9, 64.1)
}

Create a file that contains the output of your random device and run it through this rule and it should trigger the rule. If not, you better check just how random your random device is. ;)

One note about the deviation. Because the possible range of values for our input data is 0 to 255, the mean of an equally distributed random sample would be 127.5 (math.MEAN_BYTES), and the deviation from that would be 64.0, which explains the lower and upper values used. The mean argument to the deviation function is user controlled because there may be situations where math.MEAN_BYTES is not an acceptable value. An example of this is where only 7-bit ASCII values are allowed, which would make the acceptable range 0 to 127, and make the mean of an equally distributed random sample in that range 63.5. By allowing this to be user controlled we have a flexible deviation function.
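
Here is a sketch of the 7-bit case. By the same arithmetic that gives 64.0 for the full byte range, an equally distributed sample over 0 to 127 should deviate from 63.5 by about 32.0, so the bounds below are my assumption based on that reasoning. The rule name is made up for the example.

```yara
import "math"

// Hypothetical rule for data limited to 7-bit values (0 to 127).
// The mean 63.5 comes from the text above; 31.9 and 32.1 are assumed
// bounds around the expected deviation of 32.0.
rule ascii_7bit_random_test {
  condition:
    math.in_range(math.deviation(0, filesize, 63.5), 31.9, 32.1)
}
```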

The nice thing about taking an offset and length pair is that you can look at specific chunks of the file. This rule calculates the entropy of the last 1024 bytes of a file (note: feeding this a file that is less than 1024 bytes long will cause an error):

import "math"

rule last_1k_random_test {
  condition:
    math.entropy(filesize - 1024, 1024) > 7
}
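
If you want to avoid the short file problem, one option is to check the file size first. This is only a sketch: it assumes the size check is evaluated before the entropy call, and the rule name is made up for the example.

```yara
import "math"

// Hypothetical guarded variant: files shorter than 1024 bytes never match.
rule last_1k_random_test_guarded {
  condition:
    filesize >= 1024 and
    math.entropy(filesize - 1024, 1024) > 7
}
```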

Resource Improvements

PE resources now have a bunch of new features. The goal of all these changes is to be able to create more intricate and detailed signatures for specific attributes of PE files.

Resource Table Properties

When parsing resources the PE module will now expose the following attributes:

  • resource_timestamp
  • resource_version.major
  • resource_version.minor
  • number_of_resources

The resource_timestamp attribute is an integer, so any comparisons will need to be done accordingly.
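
As a quick sketch of how the table-level attributes read in a condition, this rule combines all four. The version numbers 4 and 0 are placeholder values for illustration, not something to expect in any particular binary.

```yara
import "pe"

// Hypothetical rule: the version values are placeholders.
rule resource_table_properties {
  condition:
    pe.number_of_resources > 0 and
    pe.resource_timestamp != 0 and
    pe.resource_version.major == 4 and
    pe.resource_version.minor == 0
}
```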

Resource Properties

Each individual resource parsed by the module is stored in an array of structures called resources. Each structure has these elements:

  • offset
  • length
  • type
  • id
  • language
  • type_string
  • name_string
  • language_string

type, id, and language are all integers. If a resource has a name at any of those levels, it will be stored in the corresponding _string attribute. Please note that these strings are, according to the specification, Unicode (two bytes per character), and comparisons need to be done accordingly.


Here are some examples of it in action:

import "pe"

// Check for a specific resource timestamp value (Mon Jun 19 07:07:15 UTC 2006).
rule rsrc_timestamp {
  condition:
    pe.resource_timestamp == 1150700835
}

/*
 * Binaries where the resource timestamp is before the PE timestamp,
 * suggesting the binary was rebuilt without resources being touched.
 * wxs@psh wxs % date -ur 1373882334 # PE timestamp
 * Mon Jul 15 09:58:54 UTC 2013
 * wxs@psh wxs % date -ur 1122985819 # Resource timestamp
 * Tue Aug  2 12:30:19 UTC 2005
 * wxs@psh wxs %
 */
rule resource_timestamp_before {
  condition:
    pe.resource_timestamp != 0 and
    pe.resource_timestamp < pe.timestamp
}

I'm not sure how useful the above is, or if it is even accurate, but it is a theory worth exploring. Here are some more examples of using the resources array.

import "pe"

// Exactly 4 resources and one of them has a type of "BINARY" in UTF-16LE.
rule type_string_test {
  condition:
    pe.number_of_resources == 4 and
    for any i in (0..pe.number_of_resources - 1):
      (pe.resources[i].type_string == "B\x00I\x00N\x00A\x00R\x00Y\x00")
}

This one uses type_string, but if your binary does not use custom strings for these fields you can use the resource type definitions from MSDN. Rather than using the raw numbers, you can use the names. The only difference is that, for increased clarity, YARA uses RESOURCE_TYPE_ as a prefix instead of RT_ (so RT_CURSOR becomes pe.RESOURCE_TYPE_CURSOR).

Here's an example of it in action:

import "pe"

rule resource_type {
  condition:
    for any i in (0..pe.number_of_resources - 1):
      (pe.resources[i].type == pe.RESOURCE_TYPE_CURSOR)
}
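
The integer fields can also be combined inside the same loop. This sketch looks for a version resource with a specific language. It assumes the language field holds the Windows language identifier as stored in the resource directory, where 0x0409 is US English. The rule name is made up for the example.

```yara
import "pe"

// Hypothetical rule: assumes language is the raw Windows language
// identifier (0x0409 is en-US).
rule english_version_resource {
  condition:
    for any i in (0..pe.number_of_resources - 1):
      (pe.resources[i].type == pe.RESOURCE_TYPE_VERSION and
       pe.resources[i].language == 0x0409)
}
```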

Lastly, you can combine these things to make new and interesting rules.

import "math"
import "hash"
import "pe"

// Look for a resource with a specific hash
rule resource_hash {
  condition:
    for any i in (0..pe.number_of_resources - 1):
      (hash.md5(pe.resources[i].offset, pe.resources[i].length) == "49f68a5c8493ec2c0bf489821c21fc3b")
}

// Look for a resource with an entropy greater than 7
rule resource_entropy {
  condition:
    for any i in (0..pe.number_of_resources - 1):
      (math.entropy(pe.resources[i].offset, pe.resources[i].length) > 7)
}
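
One more sketch along the same lines, narrowing the entropy check to a single resource type and a minimum size so that tiny resources do not trigger it. The rule name and the 1024 byte cutoff are arbitrary choices for illustration.

```yara
import "math"
import "pe"

// Hypothetical rule: the size cutoff is illustrative only.
rule large_high_entropy_rcdata {
  condition:
    for any i in (0..pe.number_of_resources - 1):
      (pe.resources[i].type == pe.RESOURCE_TYPE_RCDATA and
       pe.resources[i].length > 1024 and
       math.entropy(pe.resources[i].offset, pe.resources[i].length) > 7)
}
```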

Also, if you ever come across a PE that YARA fails to parse I'd love to get my hands on it. I can always be reached at!

-- WXS
