# Roadmap to learn fuzzing #


## Index ##

1. Sanitizers
2. Intro-to-fuzzing
3. libFuzzerTutorial repo
4. ClusterFuzz
5. Oss-Fuzz
6. Fuzzing-survey
7. Libfuzzer-workshop
8. AFL++
9. antonio-morales workshops and tutorials on fuzzing binaries in real case scenarios
10. Fuzzing Python stuff
11. Fuzzing native C/C++
12. Fuzzing network protocols
13. Angora fuzzer
14. Fuzzer software dev
15. Extra

  1. Sanitizers
  • paper: "AddressSanitizer: A Fast Address Sanity Checker" link
  • summary link
  • usage examples link

  2. Read the documents listed here: intro-to-fuzzing

 - Fuzz testing is a process of testing APIs with generated data. The most common forms are:
    * Mutation based fuzzing which mutates existing data samples (aka the test corpus) to create test data;
    * Generation based fuzzing which produces new test data based on models of the input.

 - Guided fuzzing is an important extension to mutation based fuzzing. Guided fuzzers employ a feedback loop when testing newly mutated inputs. 
   If an input results in a new signal (such as increased code coverage), it is permanently added to the test corpus. 
   The corpus grows over time, therefore increasing the test coverage of the target program.
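
 - As a miniature illustration of this loop, here is a self-contained toy sketch (not any real engine's code): the `RunTarget` "edge IDs" stand in for the coverage signal that real engines such as libFuzzer or AFL obtain via compiler instrumentation.

```
// Toy sketch of a coverage-guided mutation loop. Everything here is
// illustrative: real engines get edge coverage from instrumentation.
#include <cstdint>
#include <cstdlib>
#include <set>
#include <vector>

using Input = std::vector<uint8_t>;

// Toy target: reports which "edges" the input exercised.
std::set<uint64_t> RunTarget(const Input& in) {
  std::set<uint64_t> edges = {0};
  if (in.size() > 0 && in[0] == 'F') edges.insert(1);
  if (in.size() > 1 && in[1] == 'U') edges.insert(2);
  if (in.size() > 2 && in[2] == 'Z') edges.insert(3);
  return edges;
}

// Mutation step: flip one random bit of an existing corpus element.
Input Mutate(Input in) {
  if (in.empty()) return {0};
  in[rand() % in.size()] ^= 1 << (rand() % 8);
  return in;
}

int main() {
  std::vector<Input> corpus = {{'A', 'A', 'A'}};  // seed corpus
  std::set<uint64_t> seen;
  for (int i = 0; i < 100000; ++i) {
    Input candidate = Mutate(corpus[rand() % corpus.size()]);
    bool interesting = false;
    for (uint64_t e : RunTarget(candidate))
      if (seen.insert(e).second) interesting = true;
    if (interesting) corpus.push_back(candidate);  // new signal => keep it
  }
  return 0;
}
```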

 - Fuzzing is typically used to find the following kinds of bugs:
    * Bugs specific to C/C++ that require the sanitizers to catch:
      > Use-after-free, buffer overflows
      > Uses of uninitialized memory
      > Memory leaks

    * Arithmetic bugs:
      > Div-by-zero, int/float overflows, invalid bitwise shifts

    * Plain crashes:
      > NULL dereferences, Uncaught exceptions

    * Concurrency bugs:
      > Data races, Deadlocks

    * Resource usage bugs:
      > Memory exhaustion, hangs or infinite loops, infinite recursion (stack overflows)

    * Logical bugs:
      > Discrepancies between two implementations of the same protocol (example)
      > Round-trip consistency bugs (e.g. compress the input, decompress it back, and compare with the original; see the sketch after this list)
      > Assertion failures
 - Most of these are exactly the kinds of bugs that attackers use to produce exploits, from denial-of-service through to full remote code execution.
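
 - For example, a round-trip consistency check can be expressed directly as a fuzz target. The sketch below assumes a hypothetical `Compress`/`Decompress` pair; the assertion failure is exactly the kind of signal a fuzzing engine reports as a crash.

```
// Sketch of a round-trip consistency fuzz target. Compress/Decompress
// are hypothetical stand-ins for the API under test.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint8_t> Compress(const uint8_t *data, size_t size);    // assumed
std::vector<uint8_t> Decompress(const std::vector<uint8_t> &blob);  // assumed

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  std::vector<uint8_t> restored = Decompress(Compress(Data, Size));
  // Round-trip consistency: decompress(compress(x)) must equal x.
  assert(restored == std::vector<uint8_t>(Data, Data + Size));
  return 0;
}
```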

### Potential Fuzzing Targets ###
 - Types of projects where fuzzing has been useful:

   * Anything that consumes untrusted or complicated inputs:
    - Parsers of any kind (xml, pdf, truetype, ...)
    - Media codecs (audio, video, raster and vector images, etc)
    - Network protocols, RPC libraries (gRPC)
    - Network scanners (pmon)
    - Crypto (boringssl, openssl)
    - Compression (zip, gzip, bzip2, brotli, …)
    - Compilers and interpreters (PHP, Perl, Python, Go, Clang, …)
    - Services/libraries that consume protobuffers
    - Regular expression matchers (PCRE, RE2, libc)
    - Text/UTF processing (icu)
    - Databases (SQLite)
    - Browsers (all)
    - Text editors/processors (vim, OpenOffice)
   * OS Kernels (Linux), drivers, supervisors and VMs
   * UI (Chrome UI)

### Fuzzing Successes ###
 - Historically, fuzzing has been an extremely effective technique for finding long-standing bugs in code bases that fall into the target categories above. 
   Some trophy list examples (with a total of tens of thousands of bugs found inside and outside of Google):
   * [AFL bugs](http://lcamtuf.coredump.cx/afl/#bugs)
   * [libFuzzer bugs](http://llvm.org/docs/LibFuzzer.html#trophies)
   * [syzkaller bugs](https://github.com/google/syzkaller/blob/master/docs/found_bugs.md)
   * [go-fuzz bugs](https://github.com/dvyukov/go-fuzz#trophies)
   * [Honggfuzz bugs](https://github.com/google/honggfuzz#trophies)
   * [ClusterFuzz bugs in Chrome](https://bugs.chromium.org/p/chromium/issues/list?can=1&q=label%3AClusterFuzz+-status%3AWontFix%2CDuplicate&sort=-id&colspec=ID+Pri+M+Stars+ReleaseBlock+Cr+Status+Owner+Summary+OS+Modified&x=m&y=releaseblock&cells=tiles)
   * [OSS-Fuzz bugs](https://bugs.chromium.org/p/oss-fuzz/issues/list?q=label%3AClusterFuzz+-status%3AWontFix%2CDuplicate&can=1)
   * [Facebook’s Sapienz (UI fuzzing)](https://engineering.fb.com/developer-tools/sapienz-intelligent-automated-software-testing-at-scale/)

---

 - The basic things to remember about a fuzz target:
    * The fuzzing engine will execute it many times with different inputs in the same process.
    * It must tolerate any kind of input (empty, huge, malformed, etc).
    * It must not exit() or abort() on any input (if it does, it's a bug).
    * It may use threads but ideally all threads should be joined at the end of the function.
    * It must be as deterministic as possible. Non-determinism (e.g. random decisions not based on the input bytes) will make fuzzing inefficient.
    * It must be fast. Try avoiding cubic or greater complexity, logging, or excessive memory consumption.
    * Ideally, it should not modify any global state (although that's not a strict requirement).
    * Usually, the narrower the target the better. E.g. if your target can parse several data formats, split it into several targets, one per format.
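
 - A minimal sketch of a target that follows the rules above (the `InitParser`/`ParseFoo` names are hypothetical stand-ins for the API under test):

```
// Sketch of a well-behaved fuzz target following the rules above.
#include <cstddef>
#include <cstdint>

bool InitParser();                                // assumed one-time setup
void ParseFoo(const uint8_t *data, size_t size);  // assumed API under test

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  // One-time initialization, kept out of the per-input hot path.
  static bool initialized = InitParser();
  (void)initialized;
  // Must tolerate any input and never exit() or abort() on malformed data.
  ParseFoo(Data, Size);
  return 0;  // must return 0; other values are reserved
}
```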

### Determinism ###
 - A fuzz target needs to be deterministic, i.e. given the same input it should have the same behavior. 
   This means, for example, that a fuzz target should not use rand() or any other source of randomness.  
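
 - If the code under test genuinely needs "random" choices, one option (a sketch, not a universal rule) is to derive them deterministically from the input bytes:

```
// Sketch: derive a deterministic seed from the input instead of rand().
#include <cstddef>
#include <cstdint>
#include <cstring>

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  if (Size < sizeof(uint32_t)) return 0;
  uint32_t seed;
  memcpy(&seed, Data, sizeof(seed));  // same input => same seed => same run
  // ... pass `seed` to the code under test instead of calling rand() ...
  (void)seed;
  return 0;
}
```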

### Speed ###
 - Fuzzing is a search algorithm that requires many iterations, so a good fuzz target should be very fast.
    * A typical good fuzz target runs on the order of 1000 executions per second per CPU core (exec/s) or more.
    * For lightweight targets, 10000 exec/s or more.
 - If your fuzz target runs at less than 10 exec/s you are probably doing something wrong.
    * We recommend profiling fuzz targets and eliminating any obvious hot spots.

### Memory consumption ###
 - For CPU-efficient fuzzing, a good fuzz target should consume less RAM than is available on a given (virtual) machine per CPU core.
    * There is no one-size-fits-all RAM threshold, but as of 2019 a typical good fuzz target would consume less than 1.5 GB.

### Timeouts, OOMs, shallow bugs ###
 - A good fuzz target should not have any
    * timeouts (inputs that take too long to process),
    * OOMs (inputs that cause the fuzz target to consume too much RAM),
    * shallow (easily discoverable) bugs; otherwise fuzzing will stall quickly.

### Seed corpus ###
 - In most cases a good fuzz target should be accompanied with a seed corpus, which is a set of representative inputs for the fuzz target. 
 - These inputs combined should cover large portions of the API under test, ideally achieving 100% coverage (different coverage metrics can be applied, 
   e.g. block coverage or edge coverage, depending on a specific case).
    * Avoid large seed inputs when smaller inputs are sufficient for providing the same coverage.

 - A seed corpus is stored as a directory where every individual file represents one input; subdirectories are allowed.
 - When fixing a bug or adding new functionality to the API, don't forget to extend the seed corpus. 
    * Monitor the code coverage achieved by the corpus and try to keep it close to 100%.

### Coverage discoverability ###
 - It is often insufficient to have a seed corpus with good code coverage to claim good fuzzability, 
    * i.e. the ability of a fuzzing engine to discover many code paths in the API under test.

 - For example, imagine we are fuzzing an API that consumes an encrypted input, and we have a comprehensive seed corpus with such encrypted inputs. 
   This seed corpus will provide good code coverage, but any mutation of the inputs will be rejected early as broken.
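
 - One common fix, sketched below with a hypothetical `ParsePlaintext` name, is to split the target and fuzz the post-decryption parser directly, so mutations are not rejected by the decryption layer:

```
// Sketch: bypass the Decrypt() layer and fuzz the inner parser directly.
// ParsePlaintext is a hypothetical name for the post-decryption API.
#include <cstddef>
#include <cstdint>

void ParsePlaintext(const uint8_t *data, size_t size);  // assumed inner API

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  ParsePlaintext(Data, Size);  // mutations now reach the parsing logic
  return 0;
}
```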

 - So, it is important to ensure that the fuzz target can discover a large subset of reachable control flow edges without using the seed corpus. 
    * Tools such as Clang's source-based code coverage can assist with this process.

 - If fuzzing a given target without a seed corpus for, say, a billion iterations does not provide coverage comparable to a good seed corpus, consider:
    * Splitting the target (see Large APIs)
    * Using dictionaries (see the example after this list)
    * Using Structure-Aware Fuzzing
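
 - A dictionary is a plain-text file (passed to libFuzzer via `-dict=`) listing interesting tokens in the AFL-compatible format; this small example follows the format shown in the libFuzzer documentation:

```
# Lines starting with '#' and empty lines are ignored.
kw1="blah"
# Escapes work as in C string literals.
kw2="\xF7\xF8"
# The name before '=' is optional.
"foo\x0Abar"
```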

 - If your API consumes inputs only of specific sizes, the best way is to express that in the fuzz target, like this:
```
// fuzz_target.cc
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  // Reject inputs outside the size range the API accepts.
  if (Size > kMaxSize || Size < kMinSize) return 0;
  // ... feed Data/Size to the API under test ...
  return 0;
}
```
 - A good fuzz target does not use I/O:
    * Avoid debug output to stderr or stdout as it slows down fuzzing.
    * Avoid reading from disk other than during one-time initialization.
    * Avoid writing to disk

### Structure-Aware Fuzzing with libFuzzer ###
 - Generation-based fuzzers usually target a single input type, generating inputs according to a pre-defined grammar. Good examples of such fuzzers are:
    * csmith (generates valid C programs) 
    * Peach (generates inputs of any type, but requires such a type to be expressed as a grammar definition)

 - Coverage-guided mutation-based fuzzers, such as libFuzzer or AFL, are not restricted to a single input type and do not require grammar definitions. 
   Thus, mutation-based fuzzers are generally easier to set up and use than their generation-based counterparts. 
    * But the lack of an input grammar can also result in inefficient fuzzing for complicated input types, where any traditional mutation 
      (e.g. bit flipping) leads to an invalid input rejected by the target API in the early stage of parsing.

 - With some additional effort, however, libFuzzer can be turned into a grammar-aware (i.e. structure-aware) fuzzing engine for a specific input type.
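
 - Concretely, libFuzzer exposes a custom-mutator hook for this. The sketch below (with a hypothetical 4-byte magic header) keeps the header valid and lets libFuzzer mutate only the payload, so mutants survive the format's first parsing check:

```
// Sketch of libFuzzer's LLVMFuzzerCustomMutator hook. The kMagic header
// is a hypothetical example of structure that plain bit flips would break.
#include <cstddef>
#include <cstdint>
#include <cstring>

// Provided by libFuzzer: applies its built-in mutations in place.
extern "C" size_t LLVMFuzzerMutate(uint8_t *Data, size_t Size, size_t MaxSize);

static const uint8_t kMagic[4] = {'F', 'O', 'O', '1'};  // assumed format magic

extern "C" size_t LLVMFuzzerCustomMutator(uint8_t *Data, size_t Size,
                                          size_t MaxSize, unsigned int Seed) {
  (void)Seed;
  if (MaxSize < sizeof(kMagic)) return 0;
  if (Size < sizeof(kMagic)) Size = sizeof(kMagic);
  // Mutate only the payload after the header.
  size_t payload = LLVMFuzzerMutate(Data + sizeof(kMagic),
                                    Size - sizeof(kMagic),
                                    MaxSize - sizeof(kMagic));
  // Restore the magic so the mutant passes the first parsing check.
  memcpy(Data, kMagic, sizeof(kMagic));
  return sizeof(kMagic) + payload;
}
```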

---

## Glossary ##
 - Naming things is hard, so this page tries to reduce confusion around fuzzing-related terminology.

### Corpus (Or test corpus, or fuzzing corpus.) ###
 - A set of test inputs. In most contexts, it refers to a set of minimal test inputs that generate maximal code coverage.

### Cross-pollination ###
 - The term is taken from botany, where one plant pollinates a plant of another variety. 
 - In fuzzing, cross-pollination means using a corpus for one fuzz target to expand a corpus for another fuzz target. 
    * For example, if there are two libraries that process the same common data format, it is often beneficial to cross-pollinate their respective corpora.

### Dictionary ###
 - A file which specifies interesting tokens for a fuzz target.
 - Most fuzzing engines support dictionaries, and will adjust their mutation strategies to process these tokens together.

### Fuzz Target (Or Target Function, or Fuzzing Target Function, or Fuzzing Entry Point) ###
 - A function to which we apply fuzzing. A specific signature is required for OSS-Fuzz. Examples: openssl, re2, SQLite.

### Fuzzer ###
 - The most overloaded term, used in a variety of contexts, which makes it ambiguous. 
 - Sometimes "fuzzer" refers to a fuzz target, a fuzzing engine, a mutation engine, a test generator, or a fuzzer build.

### Fuzzer Build ###
 - A build that contains all the fuzz targets for a given project, which is run with a specific fuzzing engine, in a specific build mode 
   (e.g. with enabled/disabled assertions), and optionally combined with a sanitizer. 
    * In OSS-Fuzz, it is also known as a job type.

### Fuzzing Engine ###
 - A tool that tries to find interesting inputs for a fuzz target by executing it. Examples: libFuzzer, AFL, honggfuzz, etc.
 - See related terms Mutation Engine and Test Generator.

### Mutation Engine ###
 - A tool that takes a set of testcases as input and creates their mutated versions. 
 - It is just a generator and does not feed the mutations to a fuzz target. 
    * Example: radamsa (a generic test mutator).

### Reproducer (Or Test Case.) ###
 - A test input that can be used to reproduce a bug when processed by a fuzz target.

### Sanitizer ###
 - A dynamic testing tool that can detect bugs during program execution. Examples: ASan, DFSan, LSan, MSan, TSan, UBSan.

### Seed Corpus ###
 - A small initial corpus prepared with the intent of providing initial coverage for fuzzing. 
 - Rather than being created by the fuzzers themselves, seed corpora are often prepared from existing test inputs or may be hand-crafted 
   to provide interesting coverage. 
    * They are often checked into source alongside fuzz targets.

### Test Generator ###
 - A tool that generates testcases from scratch according to some rules or grammar. 
    * Examples: csmith (a test generator for C language), cross_fuzz (a cross-document DOM binding test generator).

### Test Input ###
 - A sequence of bytes that is used as input to a fuzz target. Typically, a test input is stored in a separate file.


---


  3. libFuzzerTutorial

  4. ClusterFuzz pdf, talk, repo

  5. Oss-Fuzz link

  1. read "The Art, Science, and Engineering of Fuzzing:A Survey" paper link

  7. Now follow the libFuzzer-workshop
  • You may skip the first 3-4 challenges, because they are the same as the ones in libFuzzerTutorial

  8. From here, move on to AFL++

  9. (part 2) antonio-morales workshops and tutorials

  • (part 3) Fuzzing in a real case scenario

  10. Fuzzing Python stuff

  11. Fuzzing native C/C++

  12. Fuzzing network protocols

  13. Angora fuzzer
  • Read its paper here

  • Source code here

  • Repeat the workshop, but this time using angora

    • Do at least quickstart example and libxml2 challenge
    • Angora is really bad at finding vulnerabilities
  • If you have issues with LLVM's DFSan, use Pin mode instead


  14. Fuzzer software dev

  15. Extra
