
## yara_performance_guidelines.md

      
YARA Performance Guidelines

When creating rules for YARA, keep the following guidelines in mind to get the best performance from them.
This guide is based on ideas and recommendations by Victor M. Alvarez and WXS.

Revision 1.1, February 2016, applies to all YARA versions 3.3+

Global Rules

Global rules are evaluated first. Non-global rules are evaluated only if the global rules are satisfied. This is useful if all samples share the same characteristics. Combine global rules with the "private" keyword to suppress match notifications for the global rules themselves.
Examples:
All rules of the ruleset should match on Windows executables
global private rule EXE {
	meta:
		description = "Executable File"
	condition:
		uint16(0) == 0x5A4D and uint32(uint32(0x3C)) == 0x00004550
}

All rules of the ruleset should match on JAR/ZIP files of a certain size
global private rule Small_JARZIP {
	meta:
		description = "Small JARZIP File"
	condition:
		uint16(0) == 0x4B50 and filesize < 800KB
}

Consider using the "filesize" variable in global rules. If all the files that you want to analyze are smaller than 3MB, then set this limit in a global rule so that bigger files are skipped before the strings of every rule in the rule set get evaluated.
global private rule malware_size {
	meta:
		description = "Size of all samples is lower than 1MB - setting limit to 3MB"
	condition:
		uint16(0) == 0x4B50 and filesize < 3MB
}

Faster Statements


Slow: Regular Expressions
Fast: Strings
Fastest: Bytes at offset or virtual address

Atoms

YARA extracts short substrings of up to 4 bytes, called "atoms", from each string. Atoms can be taken from any place within the string. YARA searches for these atoms while scanning the file; when it finds one, it verifies that the full string actually matches at that position.
For example, consider these strings:
/abc.*cde/

=> possible atoms are abc and cde; either one or the other can be used
/(one|two)three/

=> possible atoms are one, two and three; we can search for three alone, or for both one and two
YARA makes its best effort to select the best atoms from each string, for example:
{ 00 00 00 00 [1-4] 01 02 03 04 }

=> here YARA uses the atom 01 02 03 04, because 00 00 00 00 is too common
{ 01 02 [1-4] 01 02 03 04 }

=> 01 02 03 04 is preferred over 01 02 because it's longer
So, the important point is that strings should contain good atoms.
These are bad strings because their atoms are either too short or too common:
{00 00 00 00 [1-2] FF FF [1-2] 00 00 00 00}
{AB  [1-2] 03 21 [1-2] 01 02}
/a.*b/
/a(c|d)/

The worst strings are those that don't contain any atoms at all, like:
/\w.*\d/
/[0-9]+\n/

These regular expressions don't contain any fixed substring that can be used as an atom, so they must be evaluated at every offset of the file to see if they match there.
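
As an illustration of the point above, here is a minimal sketch of two strings that do carry usable atoms. The byte values and the "session_id=" prefix are hypothetical and chosen only for the example; they do not come from any real sample.

```yara
rule good_atoms_sketch {
	strings:
		// Four distinctive fixed bytes before the first jump give YARA a usable atom.
		$hex = { 8B 45 FC 33 [1-2] FF FF [1-2] 00 00 00 00 }
		// A fixed literal prefix supplies the atoms that /[0-9]+\n/ alone lacks.
		$re = /session_id=[0-9]+\n/
	condition:
		any of them
}
```

In both cases the fix is the same: include at least one run of about four fixed, uncommon bytes somewhere in the string.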

Too Many Loop Iterations

Another important recommendation is to avoid for loops with too many iterations, especially if the statement within the loop is complex, for example:
strings:
	$a = {00 00}
condition:
	for all i in (1..#a) : (@a[i] < 10000)

This rule has two problems. The first is that the string $a is too common. The second is that, because $a is too common, #a can be very high and the loop body can be evaluated thousands of times.
This other condition is also inefficient because the number of iterations depends on filesize, which can also be very high:
for all i in (1..filesize) : ($a at i)
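
One possible way to keep such a loop under control is sketched below. It assumes a longer, hypothetical byte pattern in place of { 00 00 } and an arbitrary cap of 100 occurrences; both values are illustrative only.

```yara
rule bounded_loop_sketch {
	strings:
		// Hypothetical 6-byte pattern, far less common than { 00 00 }
		$a = { 8B 45 FC 33 C0 C3 }
	condition:
		// The cheap count checks run first and cap the loop at 100 iterations.
		#a > 0 and #a <= 100 and
		for all i in (1..#a) : ( @a[i] < 10000 )
}
```

The cap only helps on YARA versions that short-circuit the condition, because otherwise the loop is evaluated regardless of the count checks.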

Magic Header Definitions

It is good practice to select files by their first bytes at offset 0 (the "magic" bytes).
strings:
	$mz = { 4d 5a }
condition:
	( $mz at 0 )

The best way to do this is with the uint16(0) or uint32(0) functions (uint16be(0) or uint32be(0) for big-endian format). Using "MZ" as a string causes a search for "MZ" in the whole file, which can produce a lot of matches before the location of those matches is compared with position 0.
condition:
	uint16(0) == 0x5A4D

Also consider using the "magic" module, which is not available on the Windows platform. Using the "magic" module makes it much easier to apply complex magic header checks, and the resulting rules are easier to read.
Custom GIF magic header definition:
rule gif_1 {
  condition:
    (uint32be(0) == 0x47494638 and uint16be(4) == 0x3961) or
    (uint32be(0) == 0x47494638 and uint16be(4) == 0x3761)
}

Using the "magic" module:
import "magic"
rule gif_2 {
  condition:
    magic.mime_type() == "image/gif"
}

Too Short Strings

Avoid defining strings that are too short. Any string with fewer than 5 or 6 bytes will probably appear in a lot of files.
Example:
Looking for a string that has a .pdb extension.

BAD: $s = ".pdb"
OK: $s = /[^\.\\]{1,40}\.pdb/
BEST: just avoid them
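
When a PDB reference is the only indicator available, one option is to anchor on a longer, more specific part of the path instead of the bare extension. The sketch below assumes a hypothetical build path; the directory and file names are made up for illustration.

```yara
rule pdb_path_sketch {
	strings:
		// Hypothetical build path; backslashes are escaped inside text strings.
		$pdb = "\\loader\\Release\\loader.pdb"
	condition:
		uint16(0) == 0x5A4D and $pdb
}
```

The longer literal gives YARA distinctive atoms to search for, and the header check restricts the rule to MZ files.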

String Advice

Make string definitions as narrow as possible. Avoid the "nocase" modifier if possible, because many atoms will be generated and searched for. Remember that in the absence of modifiers, "ascii" is assumed by default. The possible combinations are:
FASTEST - only one atom is generated
$s1 = "cmd.exe"		        (ascii only)
$s2 = "cmd.exe" ascii       (ascii only, same as $s1)
$s3 = "cmd.exe" wide        (UTF-16 only)

FAST - two atoms will be generated
$s4 = "cmd.exe" ascii wide  (both ascii and UTF-16)

SLOW - many atoms will be generated
$s5 = "cmd.exe" nocase      (all different cases, e.g. "Cmd.exe", "cMd.exe", "cmD.exe" ..)
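
If only a few casings are expected in practice, listing them explicitly may be cheaper than "nocase". The sketch below rests on the assumption that only the all-lowercase and all-uppercase forms occur in the samples of interest; any other casing would be missed.

```yara
rule cmd_exe_variants_sketch {
	strings:
		// Assumption: only these two casings occur in the samples of interest.
		$c1 = "cmd.exe" ascii wide
		$c2 = "CMD.EXE" ascii wide
	condition:
		any of ($c*)
}
```

This trades coverage for speed, so it is only appropriate when the set of casings is known in advance.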

Regular Expressions

Use regular expressions only when necessary. Regular expression evaluation is inherently slow; don't use them if hex strings with jumps and wildcards can solve the problem.
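
The following sketch shows the same byte pattern written both ways. The pattern itself (a push, a call and a stack cleanup with two 4-byte operands left open) is hypothetical and serves only to compare the two notations.

```yara
// Slower: regular expression with wildcard bytes ("s" lets the dot match any byte).
rule push_call_regex {
	strings:
		$re = /\x68.{4}\xE8.{4}\x83\xC4\x04\x85\xC0/s
	condition:
		$re
}

// Usually faster: the same byte pattern as a hex string with jumps.
rule push_call_hex {
	strings:
		$hex = { 68 [4] E8 [4] 83 C4 04 85 C0 }
	condition:
		$hex
}
```

Both rules describe the same 15 bytes; the hex string version avoids the regular expression engine entirely.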

Conditions and Short-Circuit Evaluation

Try to write condition statements in which the elements that are most likely to be "False" are placed first. The condition is evaluated from left to right. The sooner the engine identifies that a rule is not satisfied, the sooner it can skip the current rule and evaluate the next one. The speed improvement gained by ordering the condition statements this way depends on the difference in CPU cycles needed to process each of the statements. If all statements are more or less equally expensive, reordering them causes no noticeable improvement. If one of the statements can be processed very fast, it is recommended to place it first in order to skip the expensive statement evaluation in cases in which the first statement is FALSE.
Changing the order in the following statement does not cause a significant improvement:
$string1 and $string2 and uint16(0) == 0x5A4D

However, if the execution time of the statements is very different, reordering in order to trigger the short-circuit will improve the scan speed significantly:
SLOW
EXPENSIVE and CHEAP
math.entropy(0, filesize) > 7.0 and uint16(0) == 0x5A4D

FAST
CHEAP and EXPENSIVE
uint16(0) == 0x5A4D and math.entropy(0, filesize) > 7.0
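
Written out as a complete rule, the fast ordering could look like the sketch below. The "math" module has to be imported before math.entropy can be used; the 5MB size limit is an arbitrary value added for illustration.

```yara
import "math"

rule high_entropy_pe_sketch {
	condition:
		// Cheap header and size checks first; entropy is computed only if both pass.
		uint16(0) == 0x5A4D and
		filesize < 5MB and
		math.entropy(0, filesize) > 7.0
}
```

On files that are not MZ executables, or that exceed the size limit, the entropy calculation over the whole file is skipped.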

Short-circuit evaluation was introduced to help optimize expensive statements, particularly "for" statements. Some people were using conditions like the one in the following example:
strings:
	$mz = "MZ"
	...
condition:
	$mz at 0 and for all i in (1..filesize) : ( whatever )

Because filesize can be a very big number, "whatever" can be executed a lot of times, slowing down the execution. Now, with short-circuit evaluation, the "for" statement is executed only if the first part of the condition is met, so this rule is slow only for MZ files. An additional improvement could be:
$mz at 0 and filesize < 100KB and for all i in (1..filesize) : ( whatever )

This way an upper bound on the number of iterations is set.
(This "short-circuit" feature was not included in the 3.3 release but is included in the current release, 3.4.)