Notes for a project to combine property-based tests with SMT solving, based on the Hypothesis and Crosshair libraries.

Hybrid concrete/symbolic testing

You’ve probably heard that "testing can be used to show the presence of bugs, but never to show their absence" - usually as an argument for formal methods like symbolic execution. On the other hand, testing is often easier and finds bugs in formally verified systems too. But why not use both?

Crosshair is an SMT-based symbolic execution tool for Python. Providing an interface or extension to check Hypothesis tests, like propverify in Rust or DeepState in C and C++, would give users the best of both worlds.

Zac Hatfield-Dodds (Hypothesis core dev) and Phillip Schanely (Crosshair core dev) are planning to collaborate on various experiments or interoperability features; Zac is also supervising undergraduate projects in this area, in addition to our usual willingness to support open-source contributions.

Background

"Write tests once, run them many times (with many backends)"

The three main options for backends are:

  1. Random "generative" testing - Hypothesis already does this well
  2. Feedback-guided mutation / fuzzing - you can plug in a fuzzer for single tests already. HypoFuzz can fuzz all your tests at once, with dynamic prioritization.
  3. Solver-based exhaustive path testing - this project!

The Fuzzing Book is a great introduction to these topics, and has some simple examples (in Python!) of symbolic and concolic testing.

Crosshair basically works by creating proxy objects which record the conditions they're used in, and then using the Z3 solver to choose different paths each time. When stuck, or when passed into a non-symbolic function call (etc.), they choose a random possible value and execution continues as normal. Ongoing development is mostly based on improving the engine/frontend architecture (currently mixed) and patching more of the builtins and stdlib to support symbolic operations.
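As a rough mental model (not CrossHair's actual implementation), a proxy that records the conditions it's used in might look like this:

    # A toy mental model of a symbolic proxy - not CrossHair's real machinery.
    # The proxy records every comparison it participates in as a path
    # condition, which a solver like Z3 could later negate to explore the
    # other branch.
    class SymbolicInt:
        def __init__(self, name):
            self.name = name
            self.path_conditions = []

        def __lt__(self, other):
            # Record the condition, then return a concrete answer so that
            # execution can continue down one branch.
            self.path_conditions.append(f"{self.name} < {other}")
            return True  # a real engine asks the solver which branch to take

    x = SymbolicInt("x")
    if x < 10:                    # records "x < 10" and takes the true branch
        print(x.path_conditions)  # ['x < 10']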

Write for Hypothesis, run with Crosshair

Integration UX

Zac imagines adding an optional "phase", so we try [explicit, +verify, replay, generate, shrink]. The verify phase would have four possible outcomes: {not supported/installed, verified, no counterexamples, counterexample}.

In principle we could skip later phases if verify confirms that no bugs are possible; in practice we'd want to try generating some examples anyway to check for bugs in the libraries or test harness - and request a bug report if the test somehow fails! It would make sense to cap the number of generated examples though, and have HypoFuzz skip or de-prioritize verified tests.
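A hedged sketch of what the user-facing side could look like; Phase.verify is hypothetical and does not exist in Hypothesis today (the "replay" phase above corresponds to the existing Phase.reuse):

    # Hypothetical sketch of the proposed UX - Phase.verify does not exist yet.
    from hypothesis import Phase, given, settings, strategies as st

    @settings(phases=[Phase.explicit, Phase.reuse, Phase.generate, Phase.shrink])
    @given(st.integers())
    def test_abs_is_idempotent(x):
        assert abs(abs(x)) == abs(x)

    # With the proposed phase, one might instead write something like:
    #   @settings(phases=[Phase.explicit, Phase.verify, Phase.reuse,
    #                     Phase.generate, Phase.shrink])
    # and the verify phase would report one of: not supported/installed,
    # verified, no counterexamples found, or a concrete counterexample.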

Strategies -> Crosshair

The main challenge for this subproject is to extract generated types and constraints from Hypothesis, since SearchStrategy objects can't currently be inspected to determine what type of object they generate or any constraints on it.

That could probably be fixed, or we could just enumerate all the low-level strategies; but this approach will only ever work for a subset of tests regardless. That's because strategies can be composed with arbitrary user code, and it matters whether the string "hi" was generated from just("hi") or just("hi").map(ensure_in_database), so let's just consider the cases where it can work.

  • preconditions with assume() are relatively easy (see this issue)
  • the .filter() method can likewise just add constraints
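A hedged illustration of how those two bullets could map onto solver constraints, written directly against Z3; the encoding is our guess at the obvious translation, not existing integration code:

    # Sketch: how assume() and .filter() predicates could become solver
    # constraints. This encoding is illustrative, not existing integration code.
    import z3

    # Hypothesis-side:  @given(st.integers().filter(lambda x: x % 7 == 0))
    #                   ... with assume(x > 100) inside the test body ...
    x = z3.Int("x")
    solver = z3.Solver()
    solver.add(x % 7 == 0)   # from .filter(...)
    solver.add(x > 100)      # from assume(...)

    if solver.check() == z3.sat:
        print(solver.model()[x])  # some integer satisfying both, e.g. 105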

A trickier-to-implement but useful application of this logic is that if we can prove that the test always passes for some subset of inputs (e.g. numeric ranges, small collection sizes, etc), it would be possible to redefine the strategy to exclude those inputs and thus focus random testing on only the unproven part of the input space.
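For instance (with made-up numbers), if the verify phase proved the property for all integers in [0, 1000], the strategy could be rebuilt to cover only the unproven remainder:

    # Hypothetical sketch: the solver proved the property for 0 <= x <= 1000,
    # so we rebuild the strategy to spend random testing only on the rest.
    from hypothesis import strategies as st

    full = st.integers()
    unproven = st.one_of(
        st.integers(max_value=-1),     # everything below the proven range
        st.integers(min_value=1001),   # everything above it
    )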

Alternatively, we might mark the corresponding branches of the choice sequence tree as exhausted.

Crosshair -> choice sequence -> whole test

Hypothesis is implemented in terms of a "choice sequence", which you can think of as the sequence of choices we make while traversing a decision tree of what to generate. David's paper on this explains more details. The current representation is a bytestring - hence the fuzzer integration - but we've proposed using a sequence of integers instead for less redundancy. That would require substantial work upstream, but adding a "mid-level IR" is probably worth it even before counting the large gain for symbolic execution - and it's trivially serialisable to bytes.
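A toy model of the idea - Hypothesis' real machinery is far richer, but the principle is that replaying the same choices rebuilds the same value, and shrinking operates on the sequence itself:

    # Toy model of a choice sequence: each call to draw() consumes one integer
    # choice, so replaying the same sequence rebuilds exactly the same value.
    def build_list(choices):
        it = iter(choices)
        draw = lambda bound: next(it) % bound
        length = draw(5)                            # first choice: list length
        return [draw(100) for _ in range(length)]   # one choice per element

    print(build_list([3, 10, 20, 30]))  # [10, 20, 30]
    print(build_list([3, 10, 20, 30]))  # same choices -> same value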

If we can drive Hypothesis tests using CrossHair from the choice sequence level, this would dodge essentially all the problems of converting strategies. Because that's our standard intermediate representation, it would also work beautifully with our database logic, target(), fuzzing, and so on.

An initial attempt to reverse Hypothesis strategies worked in some cases, but was unimpressive overall. bytes pretty much didn't work at all, so Phillip also added some basic support (you'll need HEAD if you want to try these or other strategies yourself). We're sure that this can be improved, but it's unclear how much improvement is practical without actually making the attempt.

The draw_bits function causes a lot of problems for CrossHair, because converting bytes into other types generally concretises the symbolic values. We think that even with the sequence-of-ints representation it will be useful to special-case some low-level strategies, but that we should do that second and only where needed. See above for how that might work.
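A rough Z3 illustration of the cost (not CrossHair code): a range constraint stated directly over an integer is trivial, while the same constraint stated over raw drawn bytes becomes a bit-vector problem:

    # Illustration (not CrossHair code): the same range constraint, stated
    # directly vs. over raw drawn bytes.
    import z3

    # Direct form: easy for the solver.
    x = z3.Int("x")
    s1 = z3.Solver()
    s1.add(0 <= x, x < 100)
    print(s1.check())  # sat

    # draw_bits form: the integer is assembled from bytes, so every high-level
    # predicate becomes a constraint over the concatenated bit-vector, and any
    # non-symbolic conversion along the way concretises the value entirely.
    b0, b1 = z3.BitVec("b0", 8), z3.BitVec("b1", 8)
    y = z3.Concat(b0, b1)          # 16-bit value built from two drawn bytes
    s2 = z3.Solver()
    s2.add(z3.ULT(y, 100))
    print(s2.check())  # sat, but real strategies stack many such conversions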

Generating a high-coverage seed corpus

One major advantage of fuzzing a Hypothesis test is that you can build up a "seed" corpus with good coverage just by running many random tests. Using Crosshair to discover new paths could plausibly work better - in practice you'd run both and take the union of the two corpora - so long as Crosshair can report the choice sequence which would lead Hypothesis to execute the same path. This is particularly promising as a way to discover inputs which cover very rare branches.
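A hypothetical sketch of that union, assuming each corpus maps a coverage fingerprint to the choice sequence that produced it:

    # Hypothetical sketch: union of a random-testing corpus and a CrossHair
    # corpus, keyed by which branches each choice sequence covers.
    def merge_corpora(*corpora: dict[frozenset, bytes]) -> dict[frozenset, bytes]:
        merged: dict[frozenset, bytes] = {}
        for corpus in corpora:
            for covered, choice_seq in corpus.items():
                # Keep the shortest known reproducer per coverage fingerprint.
                if covered not in merged or len(choice_seq) < len(merged[covered]):
                    merged[covered] = choice_seq
        return merged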

"Assimilating" examples: output -> Crosshair -> Hypothesis

"Assimilating" external examples into Hypothesis (i.e. reversing the parse operation done by strategies) would allow Hypothesis to shrink examples from e.g. user bug reports, making debugging a lot easier.

Implemented as a new method on Hypothesis' SearchStrategy objects, this would be impossible in the general case, though perhaps practical in enough cases to be interesting.

However, using Crosshair to solve for a choice sequence which produces some example would mean running 'forwards' instead of 'backwards', and that's much more practical - the limit becomes "what does Crosshair support" rather than "can we invert arbitrary functions". It wouldn't support everything, but would work much better on non-trivial code, e.g. any strategy with .map(some_function).
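A toy sketch of that 'forwards' direction; build_from_choices stands in for the real strategy machinery, and brute-force search stands in for the solver:

    # Toy sketch of assimilation run forwards: search for a choice sequence
    # whose *forward* execution reproduces the external example. In the real
    # design, CrossHair would solve for the choices symbolically.
    import itertools

    def build_from_choices(choices):
        # Stand-in strategy: an integer, then a mapped ("doubled") integer -
        # the .map() step is exactly what makes inversion hard in general.
        a = choices[0] % 100
        b = (choices[1] % 100) * 2
        return (a, b)

    def assimilate(target):
        for c in itertools.product(range(100), repeat=2):
            if build_from_choices(c) == target:
                return list(c)  # a choice sequence Hypothesis can now shrink
        return None

    print(assimilate((7, 42)))  # [7, 21]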

Write for Crosshair, run with Hypothesis

While interesting to explore, we think this is unlikely to be a fruitful project.

Phillip has spent a little time getting Hypothesis to check CrossHair conditions. As it turned out, this was pretty valuable for finding CrossHair defects, and one such defect led him to overhaul the way floats are simulated (an ongoing project).

This might also be easier, because you don't need to guess at type hints that correspond to arbitrary Hypothesis strategies, nor introspect those strategies to recover type hints (let alone constraints) in the first place.

The main challenge is to make strategies efficient, e.g. integers(0, 10) instead of integers().filter(lambda x: 0 <= x <= 10). The latter form would be fine in a prototype, but too inefficient on non-toy examples to work in practice.
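A hedged sketch of the difference; the docstring contract convention is CrossHair-style, but the translation step shown here is hypothetical:

    # Sketch: translating a contract-style precondition into a strategy.
    # The precondition parsing is hypothetical; the strategies are real.
    from hypothesis import given, strategies as st

    def percent_of(value: int, pct: int) -> float:
        """pre: 0 <= pct <= 100"""
        return value * pct / 100

    # Naive translation - correct, but rejects almost every generated integer:
    slow_pct = st.integers().filter(lambda p: 0 <= p <= 100)

    # Efficient translation - the bounds are baked into generation itself:
    fast_pct = st.integers(min_value=0, max_value=100)

    @given(value=st.integers(-10**6, 10**6), pct=fast_pct)
    def test_percent_of(value, pct):
        assert abs(percent_of(value, pct)) <= abs(value)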

Here's a first attempt. There's a lot of noise in the commit from refactoring, but the interesting stuff is in core.py's analyze_with_hypothesis() function.

Conclusions

It seems unlikely that this will outperform Hypothesis' existing backend on many realistic workloads in the near future, but it is likely to be complementary in the medium term. We are particularly optimistic about hybrid approaches, where it is possible to choose the most suitable technique at runtime.

Overall, we're having a lot of fun working on this and tossing ideas around :-)
