Notes for a project to combine property-based tests with SMT solving, based on the Hypothesis and Crosshair libraries.

Hybrid concrete/symbolic testing

You’ve probably heard that "testing can be used to show the presence of bugs, but never to show their absence" - usually as an argument for formal methods like symbolic execution. On the other hand, testing is often easier and finds bugs in formally verified systems too. But why not use both?

Crosshair is an SMT-based symbolic execution tool for Python. Providing an interface or extension to check Hypothesis tests, like propverify in Rust or DeepState in C and C++, would give users the best of both worlds.

Zac Hatfield-Dodds (Hypothesis core dev) and Phillip Schanely (Crosshair core dev) are planning to collaborate on various experiments or interoperability features; Zac is also supervising undergraduate projects in this area, in addition to our usual willingness to support open-source contributions.

Background

"Write tests once, run them many times (with many backends)"

The three main options for backends are:

  1. Random "generative" testing - Hypothesis already does this well
  2. Feedback-guided mutation / fuzzing - you can plug in a fuzzer for single tests already. HypoFuzz can fuzz all your tests at once, with dynamic prioritization.
  3. Solver-based exhaustive path testing - this project!

The Fuzzing Book is a great introduction to these topics, and has some simple examples (in Python!) of symbolic and concolic testing.

Crosshair basically works by creating proxy objects which record the conditions they're used in, and then using the Z3 solver to choose different paths each time. When stuck, or when passed into a non-symbolic function call (etc.), they choose a random possible value and execution continues as normal. Ongoing development is mostly based on improving the engine/frontend architecture (currently mixed) and patching more of the builtins and stdlib to support symbolic operations.
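As a rough mental model (not CrossHair's actual implementation), a proxy that records the conditions it's used in might look like this:

    # A toy mental model of a symbolic proxy - not CrossHair's real machinery.
    # The proxy records every comparison it participates in as a path
    # condition, which a solver like Z3 could later negate to explore the
    # other branch.
    class SymbolicInt:
        def __init__(self, name):
            self.name = name
            self.path_conditions = []

        def __lt__(self, other):
            # Record the condition, then return a concrete answer so that
            # execution can continue down one branch.
            self.path_conditions.append(f"{self.name} < {other}")
            return True  # a real engine asks the solver which branch to take

    x = SymbolicInt("x")
    if x < 10:                    # records "x < 10" and takes the true branch
        print(x.path_conditions)  # ['x < 10']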

Write for Hypothesis, run with Crosshair

Integration UX

Zac imagines adding an optional "phase", so we try [explicit, +verify, replay, generate, shrink]. The verify phase would have four possible outcomes: {not supported/installed, verified, no counterexamples, counterexample}.

In principle we could skip later phases if verify confirms that no bugs are possible; in practice we'd want to try generating some examples anyway to check for bugs in the libraries or test harness - and request a bug report if the test somehow fails! It would make sense to cap the number of generated examples though, and have HypoFuzz skip or de-prioritize verified tests.
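A hedged sketch of what the user-facing side could look like; Phase.verify is hypothetical and does not exist in Hypothesis today (the "replay" phase above corresponds to the existing Phase.reuse):

    # Hypothetical sketch of the proposed UX - Phase.verify does not exist yet.
    from hypothesis import Phase, given, settings, strategies as st

    @settings(phases=[Phase.explicit, Phase.reuse, Phase.generate, Phase.shrink])
    @given(st.integers())
    def test_abs_is_idempotent(x):
        assert abs(abs(x)) == abs(x)

    # With the proposed phase, one might instead write something like:
    #   @settings(phases=[Phase.explicit, Phase.verify, Phase.reuse,
    #                     Phase.generate, Phase.shrink])
    # and the verify phase would report one of: not supported/installed,
    # verified, no counterexamples found, or a concrete counterexample.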

Strategies -> Crosshair

The main challenge for this subproject is to extract generated types and constraints from Hypothesis, since SearchStrategy objects can't currently be inspected to determine what type of object they generate or any constraints on it.

That could probably be fixed, or we could just enumerate all the low-level strategies; but this approach will only ever work for a subset of tests regardless. That's because strategies can be composed with arbitrary user code, and it matters whether the string "hi" was generated from just("hi") or just("hi").map(ensure_in_database), so let's just consider the cases where it can work.

  • preconditions with assume() are relatively easy (see this issue)
  • the .filter() method can likewise just add constraints
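A hedged illustration of how those two bullets could map onto solver constraints, written directly against Z3; the encoding is our guess at the obvious translation, not existing integration code:

    # Sketch: how assume() and .filter() predicates could become solver
    # constraints. This encoding is illustrative, not existing integration code.
    import z3

    # Hypothesis-side:  @given(st.integers().filter(lambda x: x % 7 == 0))
    #                   ... with assume(x > 100) inside the test body ...
    x = z3.Int("x")
    solver = z3.Solver()
    solver.add(x % 7 == 0)   # from .filter(...)
    solver.add(x > 100)      # from assume(...)

    if solver.check() == z3.sat:
        print(solver.model()[x])  # some integer satisfying both, e.g. 105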

A trickier-to-implement but useful application of this logic is that if we can prove that the test always passes for some subset of inputs (e.g. numeric ranges, small collection sizes, etc), it would be possible to redefine the strategy to exclude those inputs and thus focus random testing on only the unproven part of the input space.
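For instance (with made-up numbers), if the verify phase proved the property for all integers in [0, 1000], the strategy could be rebuilt to cover only the unproven remainder:

    # Hypothetical sketch: the solver proved the property for 0 <= x <= 1000,
    # so we rebuild the strategy to spend random testing only on the rest.
    from hypothesis import strategies as st

    full = st.integers()
    unproven = st.one_of(
        st.integers(max_value=-1),     # everything below the proven range
        st.integers(min_value=1001),   # everything above it
    )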

Alternatively, we might mark the corresponding branches of the choice sequence tree as exhausted.

Crosshair -> choice sequence -> whole test

Hypothesis is implemented in terms of a "choice sequence", which you can think of as the sequence of choices we make while traversing a decision tree of what to generate. David's paper on this explains more details. The current representation is a bytestring - hence the fuzzer integration - but we've proposed using a sequence of integers instead for less redundancy. That would require substantial work upstream, but adding a "mid-level IR" is probably worth it even before counting the large gain for symbolic execution - and it's trivially serialisable to bytes.
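A toy model of the idea - Hypothesis' real machinery is far richer, but the principle is that replaying the same choices rebuilds the same value, and shrinking operates on the sequence itself:

    # Toy model of a choice sequence: each call to draw() consumes one integer
    # choice, so replaying the same sequence rebuilds exactly the same value.
    def build_list(choices):
        it = iter(choices)
        draw = lambda bound: next(it) % bound
        length = draw(5)                            # first choice: list length
        return [draw(100) for _ in range(length)]   # one choice per element

    print(build_list([3, 10, 20, 30]))  # [10, 20, 30]
    print(build_list([3, 10, 20, 30]))  # same choices -> same value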

If we can drive Hypothesis tests using CrossHair from the choice sequence level, this would dodge essentially all the problems of converting strategies. Because that's our standard intermediate representation, it would also work beautifully with our database logic, target(), fuzzing, and so on.

An initial attempt to reverse Hypothesis strategies worked in some cases, but was unimpressive overall. bytes pretty much didn't work at all, so Phillip also added some basic support (you'll need HEAD if you want to try these or other strategies yourself). We're sure that this can be improved, but it's unclear how much improvement is practical without actually making the attempt.

The draw_bits function causes a lot of problems for CrossHair, because converting bytes into other types generally concretises the symbolic values. We think that even with the sequence-of-ints representation it will be useful to special-case some low-level strategies, but that we should do that second and only where needed. See above for how that might work.
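A rough Z3 illustration of the cost (not CrossHair code): a range constraint stated directly over an integer is trivial, while the same constraint stated over raw drawn bytes becomes a bit-vector problem:

    # Illustration (not CrossHair code): the same range constraint, stated
    # directly vs. over raw drawn bytes.
    import z3

    # Direct form: easy for the solver.
    x = z3.Int("x")
    s1 = z3.Solver()
    s1.add(0 <= x, x < 100)
    print(s1.check())  # sat

    # draw_bits form: the integer is assembled from bytes, so every high-level
    # predicate becomes a constraint over the concatenated bit-vector, and any
    # non-symbolic conversion along the way concretises the value entirely.
    b0, b1 = z3.BitVec("b0", 8), z3.BitVec("b1", 8)
    y = z3.Concat(b0, b1)          # 16-bit value built from two drawn bytes
    s2 = z3.Solver()
    s2.add(z3.ULT(y, 100))
    print(s2.check())  # sat, but real strategies stack many such conversions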

Generating a high-coverage seed corpus

One major advantage of fuzzing a Hypothesis test is that you can build up a "seed" corpus with good coverage just by running many random tests. Using Crosshair to discover new paths could plausibly work better - in practice you'd run both and take the union of the two corpora - so long as Crosshair can report the choice sequence which would lead Hypothesis to execute the same path. This is particularly promising as a way to discover inputs which cover very rare branches.
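A hypothetical sketch of that union, assuming each corpus maps a coverage fingerprint to the choice sequence that produced it:

    # Hypothetical sketch: union of a random-testing corpus and a CrossHair
    # corpus, keyed by which branches each choice sequence covers.
    def merge_corpora(*corpora: dict[frozenset, bytes]) -> dict[frozenset, bytes]:
        merged: dict[frozenset, bytes] = {}
        for corpus in corpora:
            for covered, choice_seq in corpus.items():
                # Keep the shortest known reproducer per coverage fingerprint.
                if covered not in merged or len(choice_seq) < len(merged[covered]):
                    merged[covered] = choice_seq
        return merged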

"Assimilating" examples: output -> Crosshair -> Hypothesis

"Assimilating" external examples into Hypothesis (i.e. reversing the parse operation done by strategies) would allow Hypothesis to shrink examples from e.g. user bug reports, making debugging a lot easier.

Implemented as a new method on Hypothesis' SearchStrategy objects, this would be impossible in the general case, though perhaps practical in enough cases to be interesting.

However, using Crosshair to solve for a choice sequence which produces some example would mean running 'forwards' instead of 'backwards', and that's much more practical - the limit becomes "what does Crosshair support" rather than "can we invert arbitrary functions". It wouldn't support everything, but would work much better on non-trivial code, e.g. any strategy with .map(some_function).
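A toy sketch of that 'forwards' direction; build_from_choices stands in for the real strategy machinery, and brute-force search stands in for the solver:

    # Toy sketch of assimilation run forwards: search for a choice sequence
    # whose *forward* execution reproduces the external example. In the real
    # design, CrossHair would solve for the choices symbolically.
    import itertools

    def build_from_choices(choices):
        # Stand-in strategy: an integer, then a mapped ("doubled") integer -
        # the .map() step is exactly what makes inversion hard in general.
        a = choices[0] % 100
        b = (choices[1] % 100) * 2
        return (a, b)

    def assimilate(target):
        for c in itertools.product(range(100), repeat=2):
            if build_from_choices(c) == target:
                return list(c)  # a choice sequence Hypothesis can now shrink
        return None

    print(assimilate((7, 42)))  # [7, 21]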

Write for Crosshair, run with Hypothesis

While interesting to explore, we think this is unlikely to be a fruitful project.

Phillip has spent a little time getting Hypothesis to check CrossHair conditions. As it turned out, this was pretty valuable for finding CrossHair defects, and one such defect led him to overhaul the way floats are simulated (an ongoing project).

This might also be easier, because you don't need to guess at type hints that correspond to arbitrary Hypothesis strategies, nor introspect those strategies to recover type hints (let alone constraints) in the first place.

The main challenge is to make strategies efficient, e.g. integers(0, 10) instead of integers().filter(lambda x: 0 <= x <= 10). The latter form would be fine in a prototype, but too inefficient on non-toy examples to work in practice.
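A hedged sketch of the difference; the docstring contract convention is CrossHair-style, but the translation step shown here is hypothetical:

    # Sketch: translating a contract-style precondition into a strategy.
    # The precondition parsing is hypothetical; the strategies are real.
    from hypothesis import given, strategies as st

    def percent_of(value: int, pct: int) -> float:
        """pre: 0 <= pct <= 100"""
        return value * pct / 100

    # Naive translation - correct, but rejects almost every generated integer:
    slow_pct = st.integers().filter(lambda p: 0 <= p <= 100)

    # Efficient translation - the bounds are baked into generation itself:
    fast_pct = st.integers(min_value=0, max_value=100)

    @given(value=st.integers(-10**6, 10**6), pct=fast_pct)
    def test_percent_of(value, pct):
        assert abs(percent_of(value, pct)) <= abs(value)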

Here's a first attempt. There's a lot of noise in the commit from refactoring, but the interesting stuff is in core.py's analyze_with_hypothesis() function.

Conclusions

It seems unlikely that this will outperform Hypothesis' existing backend on many realistic workloads in the near future, but it is likely to be complementary in the medium term. We are particularly optimistic about hybrid approaches, where it is possible to choose the most suitable technique at runtime.

Overall, we're having a lot of fun working on this and tossing ideas around :-)
