Zac-HD/student-project-ideas.md

## student-project-ideas.md

      
    Raw
  

              student-project-ideas.md
            
          
    Property-Based Testing Projects for Undergraduate Researchers

Hypothesis is an advanced and
widely-used property-based testing
framework, used by thousands of companies and open source projects.
Until moving to Anthropic in 2021,
Zac Hatfield-Dodds supervised related
research projects for students
interested in advanced testing and verification.  All projects require
strong Python skills; experience with open source workflows (git, CI, etc) would be
helpful but not required.
Fuzzing with Protobuf or ASN.1 schemas

Many programs communicate via binary formats such as ASN.1
or protocol buffers, and so
automatically generating valid messages is an attractive testing technique.
In this project, you would write a function to take a schema, optionally with
customised 'strategies' for sub-structures, and return a Hypothesis strategy
which generates such messages.
To take this from a six-unit to a twelve-unit project, you could search for bugs,
characterise how it combines with e.g. targeted PBT (like IJON or FuzzFactory) or
the benefits of integrated test-case reduction, etc. - it's well-defined to start
with but there's plenty of extensions if you finish the core quickly.
hypothesis-010 and
hypothesis-jsonschema
are good examples of related projects.
Extending the Hypothesis Ghostwriter

The Hypothesis Ghostwriter
can automatically generate property-based tests for six different properties
based on source-code introspection - without needing to execute the test like
previous work (Randoop, Evosuite, QuickSpec, FuzzGen, etc).
Enhancements could add more properties, or use more sophisticated introspection
logic to determine the required argument types for unannotated parameters.
For a larger, probably Honours-scale, project you could implement an alternative
mode which does include execution of the speculative tests - allowing users to
trade away runtime and safety for improved test code.
Generating graph data-structures

There is significant interest in generating graph and graph-like data structures
with Hypothesis (see e.g.
hypothesis-geojson,
hypothesis-geometry, and
hypothesis-networkx).
There is also a substantial literature on random generation of these structures,
but this prior art generally focusses on generating uniform distributions
(vs heuristic 'bug-finding' distributions).  Designing the generation strategy
for locality of mutation and efficient shrinking is also a novel-for-graphs
problem, though there is plenty of prior art in Hypothesis for other objects.
This is likely to be a twelve-unit project to produce and ship useful open
source code; though proof-of-concept and deeper projects are both possible.
Did you implement the model you verified?

TLA+ formally verifies properties of your system design,
but not the implementation.  We can gain confidence that we have implemented the
spec by checking execution traces from
testing or even production traffic, or with a
custom simulation harness.
A demonstration system connecting model-based testing
to TLA+ verification would be a significant
advance in the state of the art, suitable for an Honours student or as a
twelve-unit project if you are already comfortable with TLA+.
Generalising or abstracting failing inputs

See this issue on Hypothesis.
Reporting generalised examples is as much a user-interface design problem as it
is an implementation challenge: can we show more information about the cause of
an error, without showing too much (or worse - wrong) information?
I would recommend this as a twelve-unit project.
An adversarial fair scheduler

Async programming frameworks like Trio
or the standard-library asyncio
offer substantial speedups for IO-bound code, but may introduce new bugs that
only manifest under unusual scheduler orders of the subtasks.
rr chaos mode
is interesting prior art, albeit for threads rather than an async task model.
So what if we could use a custom scheduler for tests which attempts to induce bugs?
I think this would uncover many bugs, but raises some further research questions.
For example: should we generate distinct inputs, or only task schedules?  The
former covers more states; the latter admits the property "same result regardless
of schedule".  Are adversarial schedules useful on real-world programs, or do they
just reveal uninteresting failures?  And what about fault injection (adding delays
or possible IO errors)?  This could reveal many bugs, but requires substantial
work to give an acceptable user experience.
This project could be a good twelve-unit or Honours project, or you could use a
semester project to prototype and scope your Honours research.
Repurposing PyPy's tracing JIT to measure code coverage

Fuzzing in Python
usually sees a 2x-70x slowdown due to the overhead of coverage measurement; so reducing
this would obviously be very valuable.
PyPy is a Python implementation based on a just-in-time
compiler (JIT) - which offers a large speedup on repetitive workloads like fuzzing,
but considerably worse relative performance with coverage.  Having the JIT tracer
output coverage measurements directly could eliminate this overhead entirely,
improving fuzzing performance by two orders of magnitude.
This project would be good for an ambitious Honours student with an interest in
programming language implementation, especially just-in-time compilation, and
would likely lead to a published paper and widespready use if successfully
implemented.