@BrianHicks
Created November 20, 2019 15:24
Non-Deterministic Testing Overview

Background/motivation: After working through the Unison tour, in particular the section on testing, I feel a little weird about caching property test results. I raised this on the Unison Slack, and we decided that it would be helpful for me to write up an overview of non-deterministic testing as a way to contribute. So, here goes!

Non-deterministic tests generally fall into two categories:

  1. property tests, where random values are fed into a function to check that a property holds (e.g. asserting that a function is idempotent by checking fn a == fn (fn a)). Examples: QuickCheck (Haskell/Erlang), Hedgehog (Haskell/F#/C#/R/Scala), Hypothesis (Python), elm-test (Elm). A minimal sketch follows this list.
  2. fuzz tests, where random bytes are fed to a program to make sure it behaves properly in the face of malformed input. These tests usually start with a corpus of inputs and mutate them using a genetic algorithm guided by coverage metrics. Examples: go-fuzz (Go), AFL (many languages), libFuzzer (LLVM languages)
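
To make the first flavor concrete, here's a minimal property test in Haskell QuickCheck (one of the libraries listed above). The normalize function is a made-up stand-in for whatever is under test:

```haskell
import Data.Char (toUpper)
import Test.QuickCheck (quickCheck)

-- A made-up function under test, standing in for anything idempotent.
normalize :: String -> String
normalize = map toUpper

-- The idempotence property from category 1: fn a == fn (fn a).
prop_normalizeIdempotent :: String -> Bool
prop_normalizeIdempotent a = normalize a == normalize (normalize a)

-- quickCheck generates 100 random inputs by default, with a fresh
-- seed each run; that per-run randomness is the tension discussed below.
main :: IO ()
main = quickCheck prop_normalizeIdempotent
```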

In general, property tests pursue correctness while fuzz tests pursue stability and security. This informs both their target (functions vs. programs) and their invocation (discrete/independent vs. continuous/stateful).

There are other flavors of non-deterministic testing (chaos engineering, for example) but for the purposes of this document I'm going to stop at the program boundary. It may be useful for Unison to extend non-deterministic testing to the infrastructure level in the future, though!

(aside: I am most familiar with property testing in elm-test, whose property testing module is named Fuzz. These concepts get mixed up easily!)

Property Testing with Unison

When writing property tests, you tend to run them along with your unit tests, in small batches and with a new random seed every time. Contrast this with Unison's fuzz testing, which currently runs with a single random seed and batch size and then caches the result, the same as example-based unit tests.

This is the main area where I feel tension between the concepts in play here: I expect to get different results for different runs of a fuzz-testing framework, even if the property under test only calls pure functions. Caching this feels weird to me, especially after a low number of runs. I can think of a few ways around this:

  • don't cache property test results. This seems at odds with Unison's core goals, so I don't imagine this is a good way forward!
  • allow selective cache invalidation for test results. This would allow a CI system (for example) to repeatedly re-run all the property tests.
  • parametrize property tests. The two parameters here are the random seed and the batch size. This would allow us to create project-specific testing regimes (a sketch of these parameters in QuickCheck follows this list).
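
For a sketch of that last option, here's what pinning those two parameters looks like in Haskell QuickCheck; the seed 42 and batch size 500 are arbitrary placeholder values:

```haskell
import Test.QuickCheck (Args (..), quickCheckWith, stdArgs)
import Test.QuickCheck.Random (mkQCGen)

prop_reverseInvolutive :: [Int] -> Bool
prop_reverseInvolutive xs = reverse (reverse xs) == xs

main :: IO ()
main =
  quickCheckWith
    stdArgs { replay     = Just (mkQCGen 42, 0) -- fixed random seed
            , maxSuccess = 500                  -- batch size
            }
    prop_reverseInvolutive
```

With the seed and batch size fixed, the run is fully deterministic, so caching its result is sound; a CI system could then invalidate the cache simply by supplying a new seed.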

The choice here partly depends on what we want for Unison's relationship with CI tools. Can we trust individual machines to produce bug-free code such that we can consume their test result caches? What about in low-trust environments like open-source projects?

Removing Non-Determinism From Property Testing

These are not the only options, of course! There may be ways to remove some of the non-determinism around property testing.

For example, at Lambda Days 2019, John Hughes presented improvements to Haskell QuickCheck that allow the caller to specify required coverage levels over their input domain. One of his examples involves various flavors of insertion into a key/value map: using QuickCheck, he can require separate coverage levels for keys being inserted vs. updated, at the beginning vs. the end of the data structure. In other words, using QuickCheck he can express his intent that the state space be covered to a certain level, up to a certain confidence level.
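
The QuickCheck pieces involved are cover (declare a required percentage for a labeled class of inputs) and checkCoverage (keep generating tests until that claim is statistically confirmed or refuted). Here's a rough sketch, with a toy association-list insert standing in for the map from his talk; the 10% thresholds are illustrative, not his:

```haskell
import Test.QuickCheck

-- A toy association-list insert standing in for the key/value map.
insertKV :: Int -> Int -> [(Int, Int)] -> [(Int, Int)]
insertKV k v kvs = (k, v) : filter ((/= k) . fst) kvs

prop_insert :: Property
prop_insert =
  forAll (choose (0, 4)) $ \k ->
  forAll (listOf ((,) <$> choose (0, 4) <*> arbitrary)) $ \kvs ->
    checkCoverage $
      cover 10 (k `elem` map fst kvs) "updates an existing key" $
      cover 10 (k `notElem` map fst kvs) "inserts a new key" $
        lookup k (insertKV k 0 kvs) === Just 0
```

Crucially, if the declared coverage can't be met, the test fails rather than silently passing on an unrepresentative sample of the state space.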

We can find other examples in logic programming and formal verification. Systems like Prolog have ways to exhaustively explore a state space, and SAT/SMT solvers can intelligently prune unproductive branches. One point I'm particularly interested in is how Alloy allows you to specify that assertions are valid up to a given state size. (n.b. Alloy makes a lot of optimizations based on the fact that it only deals with discrete domains! This is much more difficult in continuous domains.)
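
An existing analogue in the Haskell ecosystem is SmallCheck, which checks a property exhaustively for every input up to a depth bound, in the same small-scope spirit as Alloy. For a fixed bound the result is deterministic, which makes it cache-friendly:

```haskell
import Test.SmallCheck (smallCheck)

prop_reverseInvolutive :: [Bool] -> Bool
prop_reverseInvolutive xs = reverse (reverse xs) == xs

-- Enumerates every list of Bools up to depth 5 rather than sampling
-- randomly, so repeated runs always produce the same verdict.
main :: IO ()
main = smallCheck 5 prop_reverseInvolutive
```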

One last idea: once failing examples are simplified, they become good regression-style unit tests, since they are records of how the function under test has failed in the past. A nice first step for a failing fuzz test may be to allow ucm to extract the example into a new failing definition to be re-run/cached in the future.
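
As a sketch of what that extracted definition might look like (hypothetical, not an existing ucm feature, and reusing normalize from the earlier sketch), it's just the shrunk counterexample frozen into an example-based test:

```haskell
-- Hypothetical: suppose prop_normalizeIdempotent had failed and the
-- framework shrank the counterexample to the string "\181". Freezing
-- that input as a plain unit test is deterministic and safe to cache.
test_normalizeRegression :: Bool
test_normalizeRegression =
  normalize "\181" == normalize (normalize "\181")
```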

Fuzz Testing with Unison

This brings us to fuzz testing. We may get some of this for cheap if/when Unison can compile to an LLVM backend (we could then use libFuzzer to do this style of testing).

The basic idea here is to test a whole program to make sure it does not crash for security or reliability purposes. In my experience, this technique is most useful for anything involving untrusted input. In particular, it really shines with parsers.
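
Stripped of the coverage-guided mutation that makes tools like AFL and libFuzzer effective, the core loop is just random bytes plus crash detection. A naive sketch, where the parse argument stands for any function over raw input that should never throw:

```haskell
import Control.Exception (SomeException, evaluate, try)
import Control.Monad (replicateM)
import qualified Data.ByteString as BS
import System.Random (randomRIO)

-- Run the function under test on one random input, returning the
-- input if it crashed. Real fuzzers replace the blind generation
-- here with corpus mutation guided by coverage feedback.
fuzzOnce :: (BS.ByteString -> Bool) -> IO (Maybe BS.ByteString)
fuzzOnce parse = do
  len    <- randomRIO (0, 256)
  input  <- BS.pack <$> replicateM len (fromIntegral <$> randomRIO (0, 255 :: Int))
  result <- try (evaluate (parse input)) :: IO (Either SomeException Bool)
  pure $ case result of
    Left _  -> Just input  -- crasher found: save it to the corpus
    Right _ -> Nothing
```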

Where the Tools Are Moving

One big difference between fuzz and property testing is that fuzz testing works best when you run it continuously. It's examining the whole input space and tends to evolve a corpus of interesting inputs, so swapping in new versions of binaries and continuing the test is a totally reasonable thing to do. In fact, keeping the evolved corpus around is a nice way to check that a crasher hasn't reappeared! This differs from property tests in that those are typically run in small discrete batches with independent seeds.

Because you get better results when running fuzz tests continuously, a number of projects have sprung up to make this easier. Google wrote ClusterFuzz for AFL/libFuzzer and hosts instances for open-source projects with OSS-Fuzz. There are also several startups that offer this as a service, such as Fuzzit and Fuzzbuzz.

That Time I Found Fuzz Testing Valuable

Just to share a personal story about this… in 2016, I was building some DevOps tooling on top of HCL, the HashiCorp configuration language. I wanted to extend it with some constructs of my own for looping, and I wanted to make sure that I had not created issues in the parser.

To check this, I grabbed the upstream test fixtures, instrumented my code with go-fuzz, and left it running for about 20 minutes. When I got back, I had found 4 or 5 crashing bugs in the upstream parser! They were all strange examples like {:{ that a human would never write normally but which could potentially be used to trigger a crasher/security bug. I reported them to HashiCorp, where they were fixed upstream. These were described as…

There are some legitimate edge cases found here, and others that are jus[t] very unlikely to be ever hit. :)
