@theKidOfArcrania

theKidOfArcrania/fuzzing-1.md

Last active Jun 22, 2018

AFL Fuzzing 1

Date: 12.6.2018

Overall implementation

At a high algorithmic level, AFL is implemented as follows. Later on I will touch on each aspect of the code in more detail:

repeat until user calls to stop
  select next seed from the queue of seeds
  for each deterministic mutation from seed
    run program with mutated input
    if it crashes, add to crash_set
    using execution trace of program, determine if it is interesting
    if it is interesting add to queue
  next

  compute performance score of seed
  for each nondeterministic mutation (count based on performance score)
    run program with mutated input
    if it crashes, add to crash_set
    using execution trace of program, determine if it is interesting
    if it is interesting add to queue
  next
loop
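The loop above can be sketched in Python. Here `run_target` and `is_interesting` are hypothetical stand-ins for executing the instrumented binary and for the trace-novelty check described later; the mutators are heavily simplified:

```python
import random

def fuzz_loop(initial_seeds, run_target, is_interesting, max_cycles=1):
    """Simplified sketch of AFL's main loop (not the real implementation).

    run_target(data) -> (crashed, trace) stands in for executing the
    instrumented target; is_interesting(trace) checks for new coverage.
    """
    queue = list(initial_seeds)
    crash_set = []
    for _ in range(max_cycles):
        for seed in list(queue):  # snapshot; new finds are fuzzed next cycle
            # Deterministic stage: walk every single-bit flip of the seed.
            for mutated in deterministic_mutations(seed):
                crashed, trace = run_target(mutated)
                if crashed:
                    crash_set.append(mutated)
                elif is_interesting(trace):
                    queue.append(mutated)
            # Havoc stage: random mutations; the count would be scaled by
            # the seed's performance score (fixed here for simplicity).
            score = 16
            for _ in range(score):
                mutated = random_mutation(seed)
                crashed, trace = run_target(mutated)
                if crashed:
                    crash_set.append(mutated)
                elif is_interesting(trace):
                    queue.append(mutated)
    return queue, crash_set

def deterministic_mutations(seed):
    # Single-bit flips only, standing in for AFL's full deterministic stage.
    for i in range(len(seed) * 8):
        out = bytearray(seed)
        out[i // 8] ^= 0x80 >> (i % 8)
        yield bytes(out)

def random_mutation(seed, rng=random.Random(0)):
    # Overwrite one random byte (a tiny subset of AFL's havoc operators).
    out = bytearray(seed)
    out[rng.randrange(len(out))] = rng.randrange(256)
    return bytes(out)
```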

The overall algorithm of the fuzzer is straightforward: it runs the program with a set of mutations of a seed input, adding any interesting mutations to the queue of seed inputs.

Deterministic mutations

AFL ships with a few built-in deterministic mutations: flipping bits (1, 2, or 4 at a time), flipping bytes (1, 2, or 4 bytes at a time), arithmetic operations (8-bit, 16-bit, 32-bit), overwriting with interesting integer values, and splicing in dictionary tokens.
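Two of these stages can be sketched in Python. The 8-bit value table below mirrors AFL's default interesting-values table; everything else is a simplified illustration:

```python
INTERESTING_8 = [-128, -1, 0, 1, 16, 32, 64, 100, 127]  # AFL's default 8-bit table

def bitflips(data, width):
    """Walk the buffer, flipping `width` consecutive bits at each position
    (the 1/1, 2/1, and 4/1 walking bit-flip stages)."""
    nbits = len(data) * 8
    for start in range(nbits - width + 1):
        out = bytearray(data)
        for b in range(start, start + width):
            out[b // 8] ^= 0x80 >> (b % 8)
        yield bytes(out)

def interesting_values(data):
    """Overwrite each byte with each 'interesting' 8-bit constant."""
    for i in range(len(data)):
        for v in INTERESTING_8:
            out = bytearray(data)
            out[i] = v & 0xFF
            yield bytes(out)
```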

Here are a few notes of optimization:

  • During the 1-bit flip stage, AFL also constructs a dictionary of possible tokens by testing for the presence or absence of a change in the execution path.
  • During the byte-flipping step, AFL builds an effector map that flags which bytes actually cause changes in the execution path, so that later, more expensive operations skip bytes that have little or no effect.
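The effector-map idea can be sketched as follows; `run_trace` is a hypothetical stand-in for running the target and collecting its trace:

```python
def build_effector_map(data, run_trace):
    """Sketch of an effector map: a byte is 'effective' if replacing it
    changes the observed execution trace. AFL derives this from the walking
    byte flips (XOR with 0xFF), which is mimicked here."""
    baseline = run_trace(data)
    eff = []
    for i in range(len(data)):
        probe = bytearray(data)
        probe[i] ^= 0xFF  # flip every bit of this one byte
        eff.append(run_trace(bytes(probe)) != baseline)
    return eff
```

Later stages can then consult `eff[i]` and skip positions flagged `False`.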

Tracing execution path

Tracing the execution path of the instrumented binary is important so that we can determine whether a particular input file causes a different part of the binary to execute. We want the path recording to be both memory- and time-efficient and to reduce aliasing/collisions as much as possible (i.e. minimize the number of distinct execution paths that get mapped to the same path information).

Also, recording the full control-flow path verbatim would be infeasible, because the resulting structure can grow without bound depending on the complexity of the program.

Coverage vs Execution path

Coverage only records the number of hits per code location, while the execution path also captures the order in which those locations are executed. Coverage is weaker than an execution path because multiple execution paths hitting the same code locations produce identical coverage information.
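A tiny example makes the distinction concrete: two runs that visit the same blocks in a different order have identical block coverage, but their edge sequences differ:

```python
# Two hypothetical runs over basic blocks A..D, in different orders.
path_a = ["A", "B", "C", "B", "D"]
path_b = ["A", "B", "D", "B", "C"]

coverage_a = set(path_a)               # which blocks were hit
coverage_b = set(path_b)
edges_a = set(zip(path_a, path_a[1:])) # which transitions were taken
edges_b = set(zip(path_b, path_b[1:]))
```

Here `coverage_a == coverage_b`, yet the edge sets differ, so edge-based tracing can tell the two runs apart while plain block coverage cannot.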

AFL Implementation

AFL creates an execution path trace by mixing the previous code location with the current code location (which denotes an edge rather than a point in the control-flow graph) and incrementing a hit count for this edge. A right bit-shift of the previous location makes the edge asymmetric (so A→B and B→A map to different entries) and reduces the influence of the previous location relative to the current location.
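Following the scheme described in AFL's technical whitepaper (`shared_mem[cur ^ prev]++; prev = cur >> 1`), the edge-hashing trace can be sketched as:

```python
MAP_SIZE = 1 << 16  # AFL's default coverage map size

def trace_run(block_ids):
    """Sketch of AFL's instrumentation: each basic block gets a (normally
    compile-time random) ID, and the shared map counts hits per hashed edge."""
    shared_mem = [0] * MAP_SIZE
    prev_location = 0
    for cur_location in block_ids:
        shared_mem[(cur_location ^ prev_location) % MAP_SIZE] += 1
        # The shift makes the edge hash asymmetric: A->B != B->A.
        prev_location = cur_location >> 1
    return shared_mem
```

Without the shift, `cur ^ prev` would be symmetric and the two directions of an edge would collide.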

Determining interesting cases

An interesting case is one that has a different execution path compared to the ones already seen.

AFL implements this by keeping a compressed record of the hit-counts already seen for each execution edge. The record is compressed by clearing, from the accumulated bits of previous hit-counts, the bits of the current hit-count:

bits = bits & ~hit_count

Example: the bits of a particular code location start as binary 11111111₂ (0xFF, or −1 as a signed byte). If execution path A hits this location 3 times, the number 3 is first inverted to 11111100₂, and the new bits value becomes 11111111₂ & 11111100₂ = 11111100₂. If execution path B then hits this location 14 times, we first invert 14 into 11110001₂, and the new bits become 11111100₂ & 11110001₂ = 11110000₂.

Successive hit counts will then turn off certain bits, and if any bit changes, we know that we have not seen this particular hit count before, and therefore this execution path is a new path: an interesting case by definition.
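The check described above can be sketched as follows (the virgin map starts all-ones; any hit pattern with a still-set bit is new):

```python
def has_new_bits(virgin, trace):
    """Sketch of the interesting-case check: virgin[i] holds the bits not
    yet cleared for edge i; trace[i] is the current hit pattern for edge i.
    Mutates virgin in place and reports whether anything new was seen."""
    interesting = False
    for i, hits in enumerate(trace):
        if virgin[i] & hits:          # some bit of this pattern is still virgin
            virgin[i] &= ~hits & 0xFF # bits = bits & ~hit_count
            interesting = True
    return interesting
```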

This scheme is used because it minimizes the memory overhead per code location, and multiple bits values can be updated at the same time, allowing for fast performance as well. The scheme also assumes sparse data, so the map should not be full or close to full for optimal performance/efficiency.

Limitations

Because of the nature of (lossy) compression, there are some interesting cases (where we have different hit counts of a code location) that get overlooked (false negatives), but these limitations are acceptable if they help to improve overall performance and efficiency. Furthermore, there are certain hit-counts (those with a large number of 1's in binary form) that could poison the bit state for a particular code location. For example, a hit-count of 127 would effectively disable any further useful comparisons for that particular edge. Perhaps an efficient method to "round up" a hit-count to the nearest power of 2 could mitigate this limitation; however, it could negatively affect the performance of a critical code section.
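The "round up" idea can be sketched as a bucketing table: each hit-count range maps to a single bit, so a count like 127 occupies one bit instead of wiping out the whole virgin byte. (AFL's actual implementation uses a similar precomputed lookup table with ranges 1, 2, 3, 4–7, 8–15, 16–31, 32–127, 128+; the function below is an illustrative sketch, not its exact code.)

```python
# Upper bound of each bucket; bucket index doubles as the bit position.
BUCKET_BOUNDS = (1, 2, 3, 7, 15, 31, 127)

def bucket(hits):
    """Map a raw hit count to a one-bit bucket value."""
    if hits == 0:
        return 0
    for bit, upper in enumerate(BUCKET_BOUNDS):
        if hits <= upper:
            return 1 << bit
    return 1 << 7  # 128 or more hits
```

Feeding `bucket(hits)` instead of raw counts into the bit-clearing check keeps the virgin map sparse: each edge can clear at most 8 bits ever, one per bucket.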

Performance Score

The biggest difference between AFL and AFLGo is how AFLGo computes the resulting performance score, or "energy," of a seed, which then determines the number of random mutations to perform on that seed. Note that the performance score is not utilized anywhere else other than determining the number of random mutations, as per the newest AFL/AFLGo implementations.

AFL uses three main factors that determine a performance score:

  • AFL gives a higher score to seeds that take less execution time (more such inputs can be run without taking up too much time).
  • AFL gives a higher score to seeds that provide larger coverage (larger coverage gives us a larger area of the program to work with).
  • AFL gives a higher score to seeds that have larger depth (i.e. came from a later generation of seeds), because those seeds are likely harder for traditional fuzzers to discover.
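The three factors can be combined into a score roughly like the following; the multipliers here are illustrative placeholders, not AFL's actual lookup tables:

```python
def performance_score(exec_us, avg_exec_us, bitmap_size, avg_bitmap_size, depth):
    """Hedged sketch of AFL-style seed scoring (illustrative constants)."""
    score = 100.0
    score *= avg_exec_us / exec_us           # faster than average -> higher score
    score *= bitmap_size / avg_bitmap_size   # more coverage -> higher score
    score *= 1.0 + min(depth, 16) * 0.25     # deeper generation -> higher score
    return score
```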

AFLGo, in addition to using the above three factors, utilizes an additional distance factor, defined as the average distance between the locations along the execution trace of the seed and the target block(s). This distance is then mixed with a global temperature via an annealing function so that AFLGo will not get stuck in a local minimum.
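The annealing idea can be sketched as follows; this is a hedged approximation of the shape of AFLGo's annealing-based weighting (exponential cooling, distance blended with temperature), not its exact formula or constants:

```python
def annealed_weight(norm_distance, time_s, cooldown_s):
    """Sketch of AFLGo-style annealing.

    norm_distance in [0, 1]: 0 = on target, 1 = far away.
    Early on (high temperature) all seeds get similar weight (exploration);
    as the temperature cools, close-to-target seeds dominate (exploitation).
    """
    temperature = 20.0 ** (-time_s / cooldown_s)  # exponential cooling schedule
    return (1.0 - norm_distance) * (1.0 - temperature) + 0.5 * temperature
```

At time 0 every seed gets the same weight regardless of distance, which is what keeps the search from committing to a local minimum too early.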

Possible Improvements

  • Adding deterministic/nondeterministic mutations that may have been left out. A larger set of mutations (especially if a large class of mutations is currently being systematically ignored) could reach all the possible code paths faster.
  • Improve aliasing issues in the hit-counts discussed in determining interesting cases.
  • Mixing more previous locations into the execution trace in order to further reduce aliasing of execution paths. (We should not mix in too many previous locations, though; otherwise unrelated branches/code locations could become associated.)