For readers coming from LinkedIn: that post is only a teaser. This note gives the fuller framing and links to the runnable notebook.
This is a working demo, not a finished benchmark. It explores whether underspecified research methods can be detected before an AI reproduction agent writes code.
The first Colab section should run on the free Colab tier with just a Google account and no API key. The larger-model sections require an A100 runtime.