AFL Fuzzing 1
At a high algorithmic level, AFL is implemented as follows; later on I will touch on each aspect of the code in more detail:
```
repeat until user calls to stop
    select next seed from the queue of seeds
    for each deterministic mutation from seed
        run program with mutated input
        if it crashes, add to crash_set
        using execution trace of program, determine if it is interesting
        if it is interesting, add to queue
    next
    compute performance score of seed
    for each nondeterministic mutation (count chosen using performance score)
        run program with mutated input
        if it crashes, add to crash_set
        using execution trace of program, determine if it is interesting
        if it is interesting, add to queue
    next
loop
```
The overall algorithm of the fuzzer is straightforward: it runs the program on a set of mutations of a seed input, adding any interesting mutated inputs to the queue of seeds.
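The loop above can be sketched as a toy Python model. Everything here (`run_target`, the xor-based byte flip, the crash condition) is an illustrative stand-in for AFL's real instrumented execution, not its actual code:

```python
# Toy model of the fuzzing loop described above. All names and the
# "target" behaviour are hypothetical stand-ins, not AFL's API.

def run_target(data):
    """Pretend target: 'crashes' on inputs starting with b'FUZ';
    returns (crashed, execution trace as a frozenset of branch IDs)."""
    trace = {len(data) % 4}
    if data.startswith(b"F"):
        trace.add(10)
    return data[:3] == b"FUZ", frozenset(trace)

def fuzz(seeds, rounds=3):
    queue, crash_set, seen_traces = list(seeds), set(), set()
    for _ in range(rounds):
        for seed in list(queue):
            # Deterministic stage, heavily simplified: one xor-flip per byte.
            for i in range(len(seed)):
                mutated = seed[:i] + bytes([seed[i] ^ 0xFF]) + seed[i + 1:]
                crashed, trace = run_target(mutated)
                if crashed:
                    crash_set.add(mutated)
                if trace not in seen_traces:   # "interesting" = new trace
                    seen_traces.add(trace)
                    queue.append(mutated)
    return queue, crash_set
```

Any mutation whose trace has not been seen before is queued as a new seed, which is the essence of the algorithm.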
AFL ships with a few built-in mutations: flipping bits (1, 2, or 4 at a time), flipping bytes (1, 2, or 4 bytes at a time), arithmetic operations (8-bit, 16-bit, 32-bit), overwriting with interesting integer values, and splicing in dictionary tokens.
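A few of those deterministic mutators, sketched in Python (the offsets and strides are illustrative; real AFL walks every bit/byte position in turn):

```python
# Simplified versions of some of AFL's deterministic mutators.

# AFL's set of "interesting" 8-bit values (boundary cases).
INTERESTING_8 = [-128, -1, 0, 1, 16, 32, 64, 100, 127]

def flip_bits(data, pos, n):
    """Flip n consecutive bits starting at bit position pos."""
    out = bytearray(data)
    for bit in range(pos, pos + n):
        out[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(out)

def arith8(data, pos, delta):
    """Add delta to the byte at pos, wrapping like 8-bit arithmetic."""
    out = bytearray(data)
    out[pos] = (out[pos] + delta) & 0xFF
    return bytes(out)

def set_interesting8(data, pos, value):
    """Overwrite the byte at pos with an 'interesting' 8-bit value."""
    out = bytearray(data)
    out[pos] = value & 0xFF
    return bytes(out)
```

The 16- and 32-bit variants work the same way on wider words, in both endiannesses.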
Here are a few notes of optimization:
- During the 1/1 bit-flipping stage, AFL also constructs a dictionary of possible tokens by testing for the presence or absence of a change in the execution path.
- AFL creates an effector map during the byte-flipping step, flagging which bytes actually cause changes in the execution path, so that later, more expensive operations are skipped for bytes that have little or no effect.
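The effector-map idea can be sketched as follows; `trace_of` is a hypothetical stand-in for running the instrumented target and summarizing its path:

```python
# Sketch of an effector map: mark which bytes affect control flow at all.

def trace_of(data):
    # Toy target: only the first two bytes influence control flow.
    return (data[0] > 0x40, data[1] > 0x40) if len(data) >= 2 else ()

def build_effector_map(data):
    base = trace_of(data)
    eff = []
    for i in range(len(data)):
        probe = bytearray(data)
        probe[i] ^= 0xFF                       # flip every bit of byte i
        eff.append(trace_of(bytes(probe)) != base)
    return eff                                 # True = byte affects the path
```

Later stages can then consult the map and skip positions flagged `False`.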
Tracing execution path
Tracing the execution path of the instrumented binary is important so that we can determine whether a particular input file results in a different part of the binary being executed. We want the trace to be memory- and time-efficient while keeping aliasing/collisions low (i.e., minimizing the number of distinct execution paths that map to the same trace information).
Also, recording the full resulting control-flow graph would be infeasible, because the graph could grow arbitrarily large in size and memory, depending on the complexity of the program.
Coverage vs Execution path
Coverage only counts the number of hits per code location, while an execution path also captures the order in which those locations are executed. Coverage is weaker than an execution path, because multiple execution paths hitting the same code locations produce identical coverage information.
AFL approximates an execution-path trace by mixing the previous code location with the current one (denoting an edge rather than a point in the control-flow graph) and incrementing a hit count for that edge. The previous location is shifted right by one bit to make the edge asymmetric (so A→B and B→A differ) and to reduce the influence of the previous location relative to the current one.
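The per-block update described above, sketched in Python (AFL does this in injected assembly against a shared-memory map; the 64 KiB map size is AFL's default):

```python
MAP_SIZE = 1 << 16          # AFL's default shared-memory map size

trace_bits = bytearray(MAP_SIZE)
prev_loc = 0                # shifted copy of the previous block's ID

def on_basic_block(cur_loc):
    """Called at every instrumented block; cur_loc is its compile-time
    random ID. The xor of cur_loc with the shifted previous location
    identifies the (prev, cur) edge."""
    global prev_loc
    edge = (cur_loc ^ prev_loc) % MAP_SIZE
    trace_bits[edge] = (trace_bits[edge] + 1) & 0xFF   # saturating-ish count
    prev_loc = cur_loc >> 1          # the shift makes A->B differ from B->A
```

Without the shift, `A ^ B == B ^ A`, so the two directions of an edge (and any tight self-loop) would collide.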
Determining interesting cases
An interesting case is one that has a different execution path compared to the ones already seen.
AFL implements this by keeping, for each execution edge, a compressed record of the hit-counts already seen. The record is compressed by ANDing the previously set bits with the complement of the current hit-count:
bits = bits & ~hit_count
Example: the bits value of a particular code location starts as binary 11111111₂, or -1. If execution path A hits this location 3 times, the number 3 is first inverted to 11111100₂, and the new bits value is 11111111₂ & 11111100₂ = 11111100₂. If execution path B then hits this location 14 times, we first invert 14 into 11110001₂, and the new bits value is 11111100₂ & 11110001₂ = 11110000₂.
Successive hit counts turn off bits; if any bit changes, we know we have not seen this particular hit count before, so the execution path is a new path, and this is an interesting case by definition.
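A tiny sketch of this check (the map here is shrunk to four entries for illustration; AFL's real map is 64 KiB of bytes, and the comparison is done word-at-a-time):

```python
# Each entry starts as 0xFF ("virgin"); a hit-count is interesting iff
# ANDing with its complement clears at least one not-yet-cleared bit.

virgin = [0xFF] * 4

def is_interesting(edge, hit_count):
    before = virgin[edge]
    after = before & ~hit_count & 0xFF
    virgin[edge] = after
    return after != before      # a bit was newly cleared -> new behaviour
```

This reproduces the worked example above: counts of 3 and then 14 are both interesting, after which the entry holds 11110000₂ and a repeat count of 3 is not.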
This scheme is used because it minimizes the memory overhead per code location, and multiple bits values can be computed at the same time, allowing for fast performance as well. The algorithm also optimizes for sparse data, so the map should not be full or close to full for optimal performance.
Because the compression is lossy, some interesting cases (different hit counts at a code location) get overlooked (false negatives), but these limitations are acceptable if they improve overall performance and efficiency. Furthermore, certain hit-counts (those with many 1's in their binary form) can poison the bit state for a particular code location. For example, a hit-count of 127 would effectively disable any further useful comparisons for that edge. An efficient method to round a hit-count up to the nearest power of 2 could mitigate this limitation; however, it would very likely hurt the performance of a critical code section.
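One cheap, branch-light way to do such rounding is to keep only the highest set bit, so every count contributes exactly one bit to the comparison (rounding down rather than up, which conveniently stays within 8 bits; this is an illustrative sketch, not AFL's code, though AFL does something similar in spirit by classifying raw counts into coarse buckets before the virgin-bits comparison):

```python
def bucket(hit_count):
    """Map an 8-bit hit count to its highest set bit, so each count
    clears exactly one bit of the virgin-bits entry (e.g. 127 -> 64
    instead of clearing seven bits at once)."""
    hit_count &= 0xFF
    return 0 if hit_count == 0 else 1 << (hit_count.bit_length() - 1)
```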
The biggest difference between AFL and AFLGo is how AFLGo computes the resulting performance score, or "energy," of the seed, which then determines the number of random mutations to perform on this seed. Again, the performance score is not used anywhere else other than to determine the number of random mutations, as of the newest AFL/AFLGo implementations.
AFL uses three main factors that determine a performance score:
- AFL gives a higher score to those seeds that take less execution time (more of those inputs can be done without taking up too much time)
- AFL gives a higher score to those seeds that provide a larger coverage (a coverage gives us a larger area of the program to work with)
- AFL gives a higher score to those seeds that have a larger depth (came from a later generation of seeds), because those seeds are likely harder for traditional fuzzers to discover.
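A hedged sketch of a score along these three axes (AFL's real `calculate_score()` uses stepped multipliers and different constants; this only shows the shape of the computation):

```python
# Illustrative performance score: faster-than-average and
# larger-than-average-coverage seeds score higher, as do deeper ones.

def perf_score(exec_us, avg_exec_us, cov_bits, avg_cov_bits, depth):
    score = 100.0                          # arbitrary baseline
    score *= avg_exec_us / exec_us         # faster than average -> higher
    score *= cov_bits / avg_cov_bits       # more coverage -> higher
    score *= 1.0 + 0.25 * min(depth, 8)    # later generations -> higher
    return score
```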
AFLGo, in addition to the three factors above, uses an extra distance factor, defined as the average distance from the locations along the seed's execution trace to the target block(s). This distance is then mixed with a global temperature via an annealing function so that AFLGo does not get stuck in a local minimum.
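Schematically, the annealing can look like this (constants and the exact blend are an approximation of AFLGo's exponential schedule, not a faithful reimplementation): early on, the temperature is high and every seed gets similar energy (exploration); as it cools, seeds with a small normalized distance to the targets get more energy (exploitation).

```python
def annealed_weight(norm_distance, t_elapsed, t_exploit):
    """norm_distance in [0, 1] (0 = on target); t_exploit is the time
    at which the schedule shifts to exploitation."""
    temp = 20.0 ** (-t_elapsed / t_exploit)   # cools from 1 toward 0
    # High temp: flat 0.5 for everyone. Low temp: driven by distance.
    return (1.0 - norm_distance) * (1.0 - temp) + 0.5 * temp
```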
Possible improvements

- Adding deterministic/nondeterministic mutations that are currently left out. A larger set of mutations (especially if a large class of mutations is being systematically ignored) could help reach all the possible code paths faster.
- Improving the aliasing issues in the hit-counts discussed under determining interesting cases.
- Mixing more previous locations into the execution trace in order to further reduce aliasing between execution paths. (We should not mix in too many previous locations; otherwise, we risk associating unrelated branches/code locations.)
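The last idea can be sketched as an "n-gram" variant of the edge hash: instead of one shifted previous location, xor in the last N block IDs, each shifted further the older it is. This is an illustrative sketch of the proposal, not AFL's implementation:

```python
from collections import deque

MAP_SIZE = 1 << 16

def make_tracer(history=2):
    """Return an edge_id function that hashes the current block ID with
    the last `history` block IDs (older ones shifted further right)."""
    prev = deque([0] * history, maxlen=history)

    def edge_id(cur_loc):
        h = cur_loc
        for i, p in enumerate(prev):       # prev[0] is the most recent
            h ^= p >> (i + 1)
        prev.appendleft(cur_loc)           # oldest entry falls off the end
        return h % MAP_SIZE

    return edge_id
```

With `history=1` this degenerates to the standard AFL scheme; larger values distinguish more paths at the cost of a fuller map and more spurious "new" edges.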