Hybrid concrete/symbolic testing
You’ve probably heard that "testing can be used to show the presence of bugs, but never to show their absence" - usually as an argument for formal methods like symbolic execution. On the other hand, testing is often easier, and it finds bugs in formally verified systems too. So why not use both?
Crosshair is an SMT-based symbolic execution tool for Python. Providing an interface or extension to check Hypothesis tests, like propverify in Rust or DeepState in C and C++, would give users the best of both worlds.
Zac Hatfield-Dodds (Hypothesis core dev) and Phillip Schanely (Crosshair core dev) are planning to collaborate on various experiments or interoperability features; Zac is also supervising undergraduate projects in this area, in addition to our usual willingness to support open-source contributions.
"Write tests once, run them many times (with many backends)"
The three main options for backends are:
- Random "generative" testing - Hypothesis already does this well
- Feedback-guided mutation / fuzzing - you can plug in a fuzzer for single tests already. HypoFuzz can fuzz all your tests at once, with dynamic prioritization.
- Solver-based exhaustive path testing - this project!
The Fuzzing Book is a great introduction to these topics, and has some simple examples (in Python!) of symbolic and concolic testing.
Crosshair basically works by creating proxy objects which record the conditions they're used in, and then using the Z3 solver to choose different paths each time. When stuck, or used in a non-symbolic function call (etc.) they choose a random possible value and execution continues as normal. Ongoing development is mostly based on improving the engine/frontend architecture (currently mixed) and patching more of the builtins and stdlib to support symbolic operations.
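To make that mechanism concrete, here is a toy sketch of a proxy object that records path conditions, with a brute-force search standing in for Z3. All names here are hypothetical illustrations, not Crosshair's actual API:

```python
import random

class SymbolicInt:
    """Toy stand-in for Crosshair's proxy objects: it records the
    comparisons ("path conditions") applied to it, instead of
    committing to a concrete value up front."""

    def __init__(self):
        self.conditions = []  # recorded (operator, operand) pairs

    def __lt__(self, other):
        self.conditions.append(("<", other))
        return True  # explore the "true" branch of this comparison

    def concretize(self):
        """When stuck (e.g. passed to a non-symbolic function), fall
        back to a random concrete value satisfying the recorded
        conditions - a stand-in for asking the solver."""
        candidates = [
            n for n in range(-100, 101)
            if all(n < bound for op, bound in self.conditions if op == "<")
        ]
        return random.choice(candidates)

x = SymbolicInt()
if x < 10:                # records the path condition "x < 10"
    value = x.concretize()  # any concrete value on this path
```

A real engine would then ask the solver to negate recorded conditions one by one, steering later runs down the other branches.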
Write for Hypothesis, run with Crosshair
- run Crosshair on Hypothesis tests just like any other kind of contract, or
- integrate it into HypoFuzz for an automatic hybrid fuzzing workflow
Running this as part of Hypothesis itself is tempting, but too prone to unpredictable performance issues or otherwise misbehaving on reasonable test code to be worth it.
Crosshair -> choice sequence -> whole test
Basic support for this approach was added in
Hypothesis is implemented in terms of a "choice sequence", which you can think of as the sequence of choices made while traversing a decision tree of what to generate. David's paper on this explains the details.
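As a toy illustration of the idea (a made-up generator, not Hypothesis' real internals), a generator driven by a recorded sequence of choices can be replayed - and shrunk - at the sequence level:

```python
def draw_list(choices):
    """Generate a list of small ints by consuming a 'choice sequence':
    every decision (keep going? which element?) is one recorded choice."""
    out, i = [], 0
    while choices[i] == 1:          # choice: add another element?
        out.append(choices[i + 1])  # choice: the element itself
        i += 2
    return out

# A random run just records fresh choices as it makes them...
seq = [1, 7, 1, 3, 0]
value = draw_list(seq)              # -> [7, 3]

# ...and replaying the same sequence regenerates the same value,
# which is what makes the example database and shrinking possible.
replayed = draw_list(seq)

# Shrinking operates on the sequence, not on the value: deleting
# the (1, 7) pair yields a structurally smaller example.
smaller = draw_list([1, 3, 0])      # -> [3]
```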
If we can drive Hypothesis tests using CrossHair from the choice-sequence level, this would dodge essentially all the problems of converting strategies. Because that's our standard intermediate representation, it would also work beautifully with our database logic, target(), fuzzing, and so on.
The current representation is a bytestring - hence the fuzzer integration - but asking Crosshair to analyse all of the Hypothesis-internal transformations between bits and other types leads to terrible performance. We therefore plan to try adding a "mid-level IR" layer consisting of a sequence of (bool, int, float, bytes, str) values that can be losslessly 'lowered' back to our standard choice sequence, but is more tractable for symbolic analysis.
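A minimal sketch of what such a lossless lowering might look like, assuming a simple tag-length-value encoding (the real representation would certainly differ):

```python
import struct

def lower(choices):
    """Losslessly lower typed choices to a bytestring, the form the
    rest of the engine (fuzzers, the database) already understands."""
    out = bytearray()
    for c in choices:
        if isinstance(c, bool):          # check bool before int!
            out += b"b" + bytes([c])
        elif isinstance(c, int):
            out += b"i" + struct.pack(">q", c)
        elif isinstance(c, float):
            out += b"f" + struct.pack(">d", c)
        elif isinstance(c, bytes):
            out += b"y" + struct.pack(">I", len(c)) + c
        elif isinstance(c, str):
            enc = c.encode("utf-8")
            out += b"s" + struct.pack(">I", len(enc)) + enc
    return bytes(out)

def lift(data):
    """Parse the bytestring back into the typed choices."""
    out, i = [], 0
    while i < len(data):
        tag, i = data[i:i + 1], i + 1
        if tag == b"b":
            out.append(bool(data[i])); i += 1
        elif tag == b"i":
            out.append(struct.unpack(">q", data[i:i + 8])[0]); i += 8
        elif tag == b"f":
            out.append(struct.unpack(">d", data[i:i + 8])[0]); i += 8
        elif tag in (b"y", b"s"):
            n = struct.unpack(">I", data[i:i + 4])[0]; i += 4
            raw, i = data[i:i + n], i + n
            out.append(raw if tag == b"y" else raw.decode("utf-8"))
    return out

mir = [True, 42, 3.5, b"\x01\x02", "hi"]
assert lift(lower(mir)) == mir  # round-trips losslessly
```

The point is that the solver only ever sees the typed layer, while shrinking and storage keep operating on bytes.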
MIR design notes
Zac's initial plan is to expand the role of the ConjectureData object by adding methods corresponding to the existing conjecture.utils helpers for e.g. biased_coin. If we arrange this right, it should be possible to drive ConjectureData in such a way that we get the actual values from Crosshair and then "write back" the lower-level bits that would have generated them - so that shrinking and our database logic "just works". We'll see how it goes!
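To illustrate the write-back idea with toy stand-ins for ConjectureData and biased_coin (deliberately simplified, not the real implementations):

```python
class Data:
    """Toy stand-in for ConjectureData: draws come from a buffer, and
    a solver-chosen value can be written back as the byte that would
    have produced it, so replay and shrinking see an ordinary buffer."""

    def __init__(self, buffer=b""):
        self.buffer = bytearray(buffer)
        self.index = 0

    def draw_bits(self, forced=None):
        if forced is not None:                       # solver-chosen:
            self.buffer.insert(self.index, forced)   # write it back
        byte = self.buffer[self.index]
        self.index += 1
        return byte

def biased_coin(data, p=0.5, forced=None):
    """Heads iff the drawn byte is below p * 256 (a gross
    simplification of the real conjecture.utils helper)."""
    threshold = int(p * 256)
    if forced is not None:
        byte = data.draw_bits(forced=0 if forced else threshold)
    else:
        byte = data.draw_bits()
    return byte < threshold

# A solver decides the coin should be heads; we write back a byte...
d = Data()
first = biased_coin(d, p=0.25, forced=True)
# ...and replaying the recorded buffer reproduces the same result.
replayed = biased_coin(Data(bytes(d.buffer)), p=0.25)
```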
Generating a high-coverage seed corpus
One major advantage of fuzzing a Hypothesis test is that you can build up a "seed" corpus with good coverage just by running many random tests. Using Crosshair to discover new paths could plausibly work better - in practice you'd run both and take the union of the two corpora - so long as Crosshair can report the choice sequence which would lead Hypothesis to execute the same path. This is particularly promising as a way to discover inputs which cover very rare branches.
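A toy sketch of the corpus-union idea, with a fake coverage signal standing in for real branch instrumentation:

```python
def branches_covered(n):
    """Fake coverage signal: which branches of this function ran."""
    covered = set()
    if n % 2 == 0:
        covered.add("even")
    else:
        covered.add("odd")
    if n > 1000:
        covered.add("big")  # a rare branch random testing may miss
    return covered

def minimal_corpus(inputs):
    """Keep one seed per newly-covered branch, as fuzzers do."""
    corpus, seen = {}, set()
    for n in inputs:
        if branches_covered(n) - seen:
            corpus[n] = branches_covered(n)
            seen |= branches_covered(n)
    return corpus

random_seeds = minimal_corpus([2, 3, 4, 5])  # from random testing
solver_seeds = minimal_corpus([1234])        # a solver-found rare path
combined = {**random_seeds, **solver_seeds}  # union of the corpora
```

Here random testing alone never reaches the "big" branch, but the combined corpus covers everything.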
The crosshair cover command demonstrates that this approach is viable.
"Assimilating" examples: output -> Crosshair -> Hypothesis
"Assimilating" external examples into Hypothesis (i.e. reversing the parse operation done by strategies) would allow Hypothesis to shrink examples from e.g. user bug reports, making debugging a lot easier.
This is impossible in the general case if we wanted to implement it as a new method on Hypothesis' SearchStrategy objects, though maybe practical in enough cases to be interesting.
However, using Crosshair to solve for a choice sequence which produces some example would mean running 'forwards' instead of 'backwards', and that's much more practical - the limit becomes "what does Crosshair support" rather than "can we invert arbitrary functions". It wouldn't support everything, but works much better on non-trivial code, such as any strategy with
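A brute-force sketch of "running forwards": search for a choice sequence whose generated value matches the target, rather than inverting the strategy. The exhaustive search here is a stand-in for Crosshair plus Z3, and the tiny strategy is made up:

```python
from itertools import product

def generate(choices):
    """A tiny made-up 'strategy': two choices -> a formatted string."""
    a, b = choices
    return f"point({a}, {b})"

def assimilate(target, max_choice=10):
    """Solve for a choice sequence that regenerates `target` by
    running the strategy *forwards* over candidate sequences.
    No inverse of `generate` is ever needed."""
    for choices in product(range(max_choice), repeat=2):
        if generate(choices) == target:
            return choices
    return None

# A value from a user's bug report, assimilated into a choice
# sequence that could then be replayed and shrunk as usual:
solved = assimilate("point(3, 7)")
```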
Strategies -> Crosshair
Matthew Law tried this approach, confirming that it works for some tests but that the choice-sequence approach is more general.
The main challenge for this subproject is to extract generated types and constraints from SearchStrategy objects, which can't currently be inspected to determine what type of object they generate or any constraints on it.
That could probably be fixed, or we could just enumerate all the low-level strategies; but this approach will only ever work for a subset of tests regardless. That's because strategies can be composed with arbitrary user code, and it matters which strategy the string "hi" was generated from - so let's just consider the cases where it can work.
- preconditions with assume() are relatively easy (see this issue)
- the .filter() method can likewise just add constraints
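For instance, both sources could feed constraints into a single solver; in this sketch a brute-force search stands in for the SMT solver, and the strategy names in the comments are just context:

```python
def solve(constraints, lo=-100, hi=100):
    """Brute-force stand-in for an SMT solver: find an int
    satisfying every recorded constraint, or None."""
    for n in range(lo, hi + 1):
        if all(c(n) for c in constraints):
            return n
    return None

# Something like st.integers().filter(lambda n: n % 7 == 0) combined
# with assume(n > 50) just contributes two constraints, rather than
# rejection-sampling:
constraints = [
    lambda n: n % 7 == 0,  # from .filter(...)
    lambda n: n > 50,      # from assume(...)
]
witness = solve(constraints)  # -> 56, the first satisfying value
```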
A trickier-to-implement but useful application of this logic is that if we can prove that the
test always passes for some subset of inputs (e.g. numeric ranges, small collection sizes, etc),
it would be possible to redefine the strategy to exclude those inputs and thus focus random
testing on only the unproven part of the input space.
Equivalently, we might mark the corresponding branches of the choice sequence tree
as exhausted so the
conjecture backend ignores them.
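A sketch of focusing random testing on the unproven inputs, assuming we already have a predicate describing the proven-safe region (both functions below are hypothetical):

```python
import random

def proven_safe(n):
    """Pretend a solver has proved the test passes for all n < 1000."""
    return n < 1000

def draw_unproven(rng, lo=0, hi=10_000):
    """Redefine the strategy to exclude the proven-safe subset, so
    random testing spends its whole budget on unproven inputs."""
    while True:
        n = rng.randrange(lo, hi)
        if not proven_safe(n):
            return n

rng = random.Random(0)
samples = [draw_unproven(rng) for _ in range(100)]
# every generated input now lies in the unproven part of the space
```

In the real engine the equivalent move is to mark the proven branches of the choice-sequence tree as exhausted, so generation never revisits them.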
The exception here is for tests which use type annotations and
In that case, we can just work directly on types - albeit at the expense of the choice-sequence integration points.
It seems unlikely that this will outperform Hypothesis' existing backend on many realistic workloads in the near future, but likely to be complementary in the medium term. We are particularly optimistic about hybrid approaches, where it is possible to choose the most suitable technique at runtime.
Overall, we're having a lot of fun working on this and tossing ideas around :-)