@MostAwesomeDude
Last active April 2, 2026 16:30
Lobsters Vibecoding Challenge (Winter 2025-2026)

Lobsters Vibecoding Challenge

I am tired of hearing the fallacious claim that, because certain recent machine-learned generative chatbots can emit valid syntax for a variety of programming languages, those same chatbots are able to develop any software whatsoever. I've decided to put up a little challenge based on my current side projects. It is wholly uninteresting to me whether a computer can emit thousands of lines of Python 3 based on a whiteboard diagram in the office like a boilerplate generator. Y'all claim that it's a smartie, so show me how well it thinks about anything difficult. To avoid the John Henry problem, we'll be working on things that I care about, rather than things that exploitative employers want. Also, I used to write interview problems for employers, and I know how to pick problems that aren't amenable to chatbots.

While I'm issuing this challenge to the denizens of Lobsters, I'm also sharing it on Lemmy. Also, links to private Gists are capability URLs, so if you have the URL then you may participate. The submission cutoff is March 1st.

The rules

I'm numbering these so that you can more easily reference them when complaining.

  1. Solutions must be vibecoded. That's the whole point. I don't care how skilled you are as a developer or hacker since this is supposed to be how good your prompts are.
  2. Solutions must work. I'm leaving this unspecified formally, but I do have a few private source files that I can use to test any candidates. We're working with Turing-complete languages here, so Rice's theorem will prohibit me from automatically verifying your candidates. You're allowed to write your own tests in order to provoke your agent into investigating errors. However, obviously…
  3. Solutions must compile. My workflow revolves around nix build and nix flake check. I don't see any reason to shift that for vibecoding tools. Getting Nix flakes into your chatbot's harness is wholly your problem.
  4. Anything may be context. That's right, you can put anything into your prompt, context, file sets, RAG harness, or code-completer. Included in each task, I've left many links to useful docs which you should consider adding as context. I know that I'll be reading them! You can also add this top-level readme to your context. We'll know if you did anything questionable like provide my solution (or your non-vibecoded solution) as context, because…
  5. You must show your work. Sorry if this seems harsh, but I'm straight-up not going to believe you if you present a valid solution without any chat logs. You also have to provide any URLs that you used; if your agent e.g. uses a tool to view docs.python.org then we need a log of that event. I also want to know how much time was taken; you're free to itemize that, but wall-clock time matters here. To be fair, I'm going to take notes on my approach and I'll try to hold myself to the same standard that I want to see from candidates.
  6. All entries will be graded for readability and security. Neither readability nor security are optional when developing software, so your entries will be manually reviewed in addition to fitting a task-specific preregistered rubric.

Submit solutions to tasks as comments on the Lobsters thread or this Gist.

Task 1: Propagate the Fuck (Can't Propagate the Fuck)

Update: This task can't be done on Darwin! Thanks to viraptor in this comment for pointing out that this is an issue for some folks. I apologize for my thoughtlessness. This task can only be done on Linux on the following architectures:

  • 64-bit ARM (aarch64, arm64)
  • 32-bit ARMv7, little-endian (armv7l)
  • 32-bit x86, but not really really old machines (i686)
  • Linux/390x (s390x)
  • 64-bit x86 (amd64, x86_64)

In my rpypkgs Nix flake, I have a brainfuck interpreter. This interpreter, bf.py, is already fairly fast but it could be faster.

The task: Switch from the current abstract representation to pointer propagation. That link points to my explanation of pointer propagation as well as some completely untested Python 2.7 code which I wrote for demonstration purposes. Benchmark the improved interpreter's code generation using bench.b and its runtime using mandel.b, both installed under share/ with the interpreter.
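For readers new to the idea, here is a minimal sketch of pointer propagation on a loop-free run of brainfuck. It is not the linked demonstration code and not bf.py's representation: the names `propagate` and `run` are hypothetical, cells are assumed to be 8-bit and wrapping, and loops and I/O are deliberately left out. The point is only that pointer moves can be tracked at compile time as offsets, leaving a single pointer adjustment at the end of the run.

```python
def propagate(code):
    """Fold a loop-free run of brainfuck into offset-tagged adds.

    Returns (adds, shift): adds maps a pointer offset to the net change
    applied to that cell, and shift is the single pointer move left over
    at the end of the run. Assumes 8-bit wrapping cells.
    """
    adds = {}
    offset = 0
    for ch in code:
        if ch == ">":
            offset += 1
        elif ch == "<":
            offset -= 1
        elif ch == "+":
            adds[offset] = (adds.get(offset, 0) + 1) % 256
        elif ch == "-":
            adds[offset] = (adds.get(offset, 0) - 1) % 256
        elif ch in "[],.":
            raise ValueError("sketch only handles loop-free, I/O-free runs")
    # Drop updates that cancelled out.
    adds = {k: v for k, v in adds.items() if v != 0}
    return adds, offset


def run(adds, shift, tape, pointer):
    """Apply a propagated run to a tape and return the new pointer."""
    for offset, delta in adds.items():
        tape[pointer + offset] = (tape[pointer + offset] + delta) % 256
    return pointer + shift
```

With this sketch, `propagate(">+>++<<-")` yields three offset-tagged updates and a net shift of zero, so the run never moves the pointer at runtime.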

Time estimate: one weekend (two working days)

RPython is just Python 2.7, formerly one of the most popular languages among developers. How hard could it be? It will be straightforward for anybody who knows how RPython's JIT works.

  • S tier: under 200 lines of code, faster than previous version
  • A tier: under 250 lines of code, faster than previous version
  • B tier: under 300 lines of code, as fast as current version
  • C tier: under 400 lines of code, as fast as current version
  • D tier: under 500 lines of code, less than 2x slowdown

Task 2: Late, as in the late unknown-linux-musl

For my nascent programming environment Vixen, I've recently hacked up an expression compiler, as covered previously, on Lobsters. Along with a few support methods, the Raku script allows me to have a Vixen compiler which is callable from Vixen. However, now I'd like a statically-linked version for initramfs and it seems that I can't build a sufficiently-static Raku binary. Very technically, I could write an NQP-to-native compiler, but that's a lot of work and the Raku team isn't really excited about that.

The task: Research languages which statically compile for Linux, choose a language with good support for parsing and tree transformations, and port the Raku prototype compiler from this gist to that language.

Time estimate: two weekends, ish (five working days)

This task's difficulty stems from the tradeoffs that must be made while shopping for a toolchain that can emit static binaries. I'm leaving it open-ended; I'll let you deal with the ethical weight of prompting the bot to emit C++ or other bad choices. One fun complication is that the Raku compiler actually calls Vixen mid-compile to emit blocks to the Nix store, and that functionality must be preserved.

  • S tier: actually, Raku can be statically compiled and linked!
  • A tier: literally the same grammar as the Raku version
  • B tier: recognizable as the same grammar, mostly same compiler
  • C tier: grammar completely reimplemented, mostly same compiler
  • D tier: completely reimplemented from scratch, parsing libraries used
  • F tier: parser written from scratch

Task 3: Don't you know? Python makes you fast. (Haha, one!)

Previously, on Lobsters, we discussed going faster by porting from Python to Rust. Previously, on Lobsters, we followed that up by going fast with Python. Now, it is time to go fast once again.

The task: Figure out what the task was again, because it's been over a year and I intentionally forgot it in case this scenario ever came up. Then, implement the task and make it as fast as possible while technically still Python.

Time estimate: two weekends (four working days)

For this one, I'm almost certainly returning to RPython, which is technically still Python. I'm going to set the bar fairly high here, but I genuinely have no idea what the ceiling is. The lack of definition in the task is part of the challenge.

  • S tier: 200,000x speedup
  • A tier: 20,000x speedup
  • B tier: 2,000x speedup
  • C tier: 200x speedup
  • D tier: 20x speedup
  • F tier: 2x speedup
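
The tier list above can be read as a threshold lookup. The sketch below assumes each tier means "at least this speedup" and that anything under 2x does not place; both readings are assumptions, and the names `TIERS` and `tier` are hypothetical.

```python
# Thresholds copied from the tier list; treating each one as a minimum
# speedup is an assumption.
TIERS = [
    (200000, "S"),
    (20000, "A"),
    (2000, "B"),
    (200, "C"),
    (20, "D"),
    (2, "F"),
]


def tier(speedup):
    """Return the tier letter for a measured speedup, or None below 2x."""
    for threshold, letter in TIERS:
        if speedup >= threshold:
            return letter
    return None
```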

Conclusions

Give it a few months for folks to try this out and then we'll summarize.

@piperswe

piperswe commented Jan 28, 2026

Here's a naive attempt with OpenCode: rpypkgs/rpypkgs#2

In the spirit of vibe coding, I have no idea about the quality of the code (or, indeed, what this change is even supposed to do), but it does seem to work 😆

@jimmyhmiller

https://github.com/jimmyhmiller/vibecode-bf

Here's something. It doesn't follow your rules. I put very little time and effort into it. It seems to achieve some kind of speedup for some programs, though not much, and it's definitely more than 200 lines if we are talking about the whole program. I tried to give you the write-up and the sessions, though, and links to things I've vibe coded with more effort.

Given your tone in these posts it seems unlikely to meet the kind of standards you are looking for. I don't particularly see the challenge as a good faith effort at all. But I figured it was worth trying to start from my phone and see what vibe coding created.

Vibe coding is of course a less than optimal process for the kind of tasks you've specified here. Well, at least for people who don't know these topics. Just like normal programming, people skilled in an area can vibe code something better. And obviously actually looking at the code gives better results.

If llms/vibecoding isn't for you, no worries. But hopefully one day it can upset you less. It has been a great tool for me in a number of work cases and helps with all kinds of work I do on personal projects. Will it rid the world of software engineers? Far from it.

@djon3s

djon3s commented Jan 28, 2026

Task 3 - ok so for the prompt - this is your comment for Task 3 above copied and pasted as the only prompt (commit 1 https://git.sr.ht/~djon3s/k-corsett/commit/a68103184a7f8269839caa2fdb524c132aa5f32e); that's the "one shot", claiming a 30,000x speedup. So if there's a "1 shot" submission - that's that. I could show you the agent and you could cut and paste yourself if you want. Then I did a few "Ok good, do it better" prompts - then eventually it comes across doing it in parallel. I ask it to clean up...

https://git.sr.ht/~djon3s/k-corsett/

now it's claiming 120,000x on 2 core or beating the Rust and Python articles (it thinks it's found 600,000x on 10 core) anyway.

I feel guilty posting this since it's literally a copy and paste of what you said and a few "Good, now do it better", but I think that's the spirit of the exercise. And I don't want to waste your time with garbage, but I guess it's easily verified if true (and an amazing claim if true) so yeah... I can share the chat log if you want. It's in a sqlite db.

TLDR; the LLM claims a 60,000x Python speedup on a single thread and 600,000x for the multi-core example given in the prior post (so 3x faster than Rust, going by that Lobsters post's "180k faster" example, but it's in Python)

"Yeah, right"

@Trung0246

Trung0246 commented Jan 28, 2026

@MostAwesomeDude Can I add a task problem to this? I believe I have one that most gen AI will struggle with :). It involves a full ~500 LOC C implementation instead of simply transpiling.

Since I don't have a Lobsters account I'm just posting here.

@rubenvannieuwpoort

So, I was trying to do some actual (non-vibe) programming but was feeling tired, and (without picking any side in the whole for/against AI debate) I decided to give this a go and see what your take will be on either the challenge as a whole or the specific code that Claude generated for me.

I used Claude Code and spent about an hour on it (and quite some credits, probably). I did not use anything besides bare prompting (no extras like specific instructions in a file). My result can be found at https://github.com/rubenvannieuwpoort/vibecode. No idea about the quality. I was too lazy to prompt it for running benchmarks.

I think it stores the prompts somewhere but I really can't be arsed to find them right now. Here's what I did:

Prep:

$ brew install pyenv
$ pyenv install 2.7.18

Then:

$ git clone git@github.com:rpypkgs/rpypkgs.git
$ cd rpypkgs
$ claude code

Then I used the prompts:

This is someone else's repo, I have never seen or used it. I am tasked with modifying bf/bf.py, but I have no idea how to run or test it. From one look at the README I saw that it's for Nix, but I use MacOS and have never used Nix. I would still like to use this. Please write a test that runs a simple brainfuck program and execute it, without changing the bf.py itself. Note that this seems to be using Python 2 and might need dependencies such as RPython to execute. I have pyenv installed, try to use that to run Python 2.

(Claude wanted to mock some imports from RPython at this point. This is the only time I did not OK its proposed edits.)

Please install RPython in a virtual environment instead of mocking things.

Please remove bf/__init__.py and fix the tests.

Great. Please make the test suite slightly more comprehensive.

OK, now please change bf.py to use pointer propagation, as explained on https://esolangs.org/wiki/Algebraic_Brainfuck#Pointer_propagation. Note that the code there might be untested.

(Claude seemed to hang while running a .b program here, so I ctrl+C'd it)

Please add a timeout to the tests; running one should not take more than 2 seconds. After that, continue working working on pointer propagation until the tests pass.

This took a while, but in the end all the tests succeeded. I pretty much OKd anything claude suggested (with the one exception noted above) and am honestly pretty clueless about what it did.

@MostAwesomeDude
Author

I have completed my first solution to Task 1. The solution consists of three commits (first, second, third) where I implement the feature, integrate it throughout the codebase, and clean up. I'll now start reviewing submissions. (Well, "now" is relative. I'm going to cook dinner first.) Thanks for your patience!

@MostAwesomeDude
Author

@piperswe Thanks for participating! Also, thanks for respecting the rules. Here's my grade for your attempt at Task 1.

Vibes: Harness is OpenCode. Model is GPT-5.2 Pro. One patch was manually written according to author's notes. 46 of 178 lines were manually written, for 74% vibes, rounding down. I will note that the one manual patch is cosmetic and its omission would result in 100% vibes.
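
The vibes percentage above works out as follows. This is a sketch of the arithmetic as stated in the grade (share of lines that were not manually written, rounded down); the function name `vibes_percent` is hypothetical.

```python
def vibes_percent(total_lines, manual_lines):
    """Share of lines not written by hand, as a whole percentage rounded down."""
    return (total_lines - manual_lines) * 100 // total_lines


print(vibes_percent(178, 46))  # 132 of 178 lines were prompted
```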

Readability: The approach is fairly legible. Names and style are consistent. Not enough of an effort was made to match existing code.

Security: The parser is unchanged.

Compatibility: bench.b, serptri.b, mandel.b, and LostKng.b all function correctly. Optimization and canonicalization haven't been changed, so the bb-gauge test isn't necessary. (We'll get to that later.) The JIT successfully builds and runs. There aren't any big performance regressions.

Task-specific Rubric: This entry has 372 lines. It performs about as well as the existing interpreter on mandel.b with JIT enabled, taking 3-4s. (The precise target I'm looking for is 3.3s and profiling + JIT traces suggest that this is about as good as propagation can do.) Therefore this is a C tier entry.

Notes: The final encoding was not respected. The agent treats compiled programs as trees of operations and uses an initial-encoded homomorphism to walk them; there is a characteristic if isinstance(...): ...;; elif isinstance(...): ... sequence of compound statements in propagate() implementing initial encoding. As such, this is a great example of what I'd consider a hacky solution: a purpose-built approach that doesn't consider the wider context. The agent also decided to overspecialize by adding instruction variants that load effective addresses (addresses at an offset) in the expectation that it will save an indirection in certain cases. While this overspecialization does improve non-JIT performance on mandel.b and could be worth investigating further, the agent implemented it by adding 56 lines of variant instructions without benchmarking to see whether the original variants could be refactored or removed.

@MostAwesomeDude
Author

@jimmyhmiller Thanks for participating! I understand that you were not a big fan of the rules, but I won't hold that against you. I appreciate that you submitted enough information for me to grade your attempt. So, here's my grade for your attempt at Task 1.

Vibes: Harness is Claude Code. Models were Opus 4.5 and Haiku 4.5; it looks like Haiku was used for tool calls and Opus was used for user conversations. It looks like the repository was prepared using tools and the code was entirely prompted; this is giving 100% vibes.

Readability: The approach is well-commented. The supporting documentation, while not requested, does help with understanding the code. Python 2.7 idioms are not great: list.pop() wasn't used and compound/nested tuples are unpacked over multiple lines. RPython awareness is low: whether lists/dicts are empty is done with truthiness instead of len(), which can cause a variety of logic errors. The abstract interpreter was rewritten, but the style is fairly good and follows standard conventions; moreover, it turns out that rewriting the abstract interpreter is the same choice that I made. I will point out that the previous abstract interpreter could be read aloud by a human (e.g. if immHead is aLoop) and that property wasn't preserved.
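
As a plain-Python illustration of the truthiness point (it makes no claim about how RPython itself treats these checks), the sketch below shows one way a truthiness test can hide a logic error: it cannot tell a missing list from an empty one. The function names are hypothetical.

```python
def describe_truthy(stack):
    # Truthiness conflates "no list at all" with "an empty list".
    if not stack:
        return "empty"
    return "has items"


def describe_explicit(stack):
    # Spelling out both checks keeps the two cases apart.
    if stack is None:
        return "missing"
    if len(stack) == 0:
        return "empty"
    return "has items"
```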

Security: The parser is unchanged.

Compatibility: bench.b, serptri.b, and mandel.b all function correctly. However, LostKng.b is broken in two ways: it takes a long time to start up, and after the first newline it immediately breaks and starts spewing various lines without waiting for input or respecting the story. I'm not going to bother with the bb-gauge test, although it shouldn't be relevant. The JIT successfully builds and runs. There aren't any performance regressions; indeed, mandel.b runs faster than my target time!

Task-specific Rubric: This entry has 549 lines. It performs much better than the existing interpreter on mandel.b with JIT enabled, taking 2.5-3s. Unfortunately, it is too long to place on a tier list.

Notes: The final encoding wasn't respected. The agent worked from the parts of code where I check multiple conditions in a list and extended those. It overspecialized many operations, including operations which I don't currently have at all like scan and scalemove3. It was able to integrate changes to the abstract interpreter, but did not maintain the surrounding style. Part of the length of the interpreter is from additional comments, including comments to existing code; it's not clear whether the agent added those for its internal use or for human readers, although it would be an RL'd behavior either way. The resulting code is overfit for purpose: while it outperformed the existing interpreter on mandel.b and preserved compatibility with the underlying JIT, the interpreter is no longer a general or standard Brainfuck environment, and it spectacularly fails on one of the holdout benchmarks, Jon Ripley's "Lost Kingdom", which stresses interpreters by its sheer size.

@jimmyhmiller

@MostAwesomeDude Seems like a fair assessment. I considered taking it further and making sure it worked on a wider set of bf programs. But given the time constraint I passed on doing it. Claude also saw some cases where (it claimed) the original had some issues in regards to overflow/underflow or something. So I wasn't sure how compatible the original is. My guess is that it wouldn't take too much prompting to make it generalize just by handing it some more tests.

That's one of the keys to decent vibe coding. Good tests, especially standards, help it steer itself a ton.

@MostAwesomeDude
Author

@rubenvannieuwpoort Thanks for participating! I appreciate the notes that you took. Here's my grade for your attempt at Task 1.

Vibes: Harness is Claude Code. Model is the default; I think that's currently Sonnet 4.5. Code is delivered in a single big commit. It looks like 100% vibes.

Readability: The approach is well-commented. There are many supporting tests, although they don't fit into the existing test harness with nix flake check. The supporting documentation is helpful although it is more of a design doc than a user's/maintainer's guide. Python 2.7 idioms are fine, but RPython idioms are weak: tuple/list confusion is much harsher in RPython than other dialects and the code calls isinstance() for tuples, which is always wrong.

Security: The parser is unchanged. The parser's backend is slightly different but manual review convinced me that it's fine.

Compatibility: The code doesn't translate from RPython to C, even with JIT disabled. The RPython error is clear: the code tries to confuse a tuple and an Op, but that will never work.

Task-specific Rubric: This entry has 420 lines. It doesn't work, though. As such, it doesn't place on a tier list.

Notes: The given approach tries to supplant the existing backend Op with a new backend Prop. This is another hacky approach. The final encoding wasn't respected, leading to type errors and failed compilation. There is a specific process required to extend or modify this compiler and Claude did not discover it. Interestingly, Claude either doesn't know about Python bytearray or distrusts it, adding extraneous byte-wrapping actions that would have harmlessly compiled away. Similarly, without prompts pushing it towards using Nix, it was quite comfortable inventing a fresh execution and test system, eventually misleading the user into the expectation that the code would compile in the original build system.

@djon3s

djon3s commented Jan 30, 2026

If you missed because of 3, 5, 6 I'm just going to copy-paste those rules in now and the agent should update the git.

Ok agent updated

Repository: https://git.sr.ht/~djon3s/k-corsett

Chat Logs (chat-logs/)

  1. TIMELINE.md - Human-readable summary of milestones and timestamps
  2. python-speed-challenge-1-full.jsonl - Full conversation data (278 messages)
  3. python-speed-challenge-rpython-full.jsonl - Continuation session (351 messages)
  4. python-speed-challenge-1-raw.txt - Simplified view with timestamps

Key Timestamps

Time                | Milestone              | Speedup
2026-01-28 16:24:25 | Session started        | -
2026-01-28 16:31:37 | First working version  | 480x
2026-01-28 16:33:47 | First compiled         | 12,955x
2026-01-28 16:51:00 | Byte lookup tables     | 41,000x
2026-01-28 17:35:00 | lltype optimization    | 50,700x
2026-01-28 18:12:00 | os.fork() parallelism  | 122,000x

URLs Accessed

  1. https://willcrichton.net/notes/k-corrset/ - Original problem
  2. https://downloads.python.org/pypy/pypy2.7-v7.3.17-linux64.tar.bz2
  3. https://downloads.python.org/pypy/pypy3.10-v7.3.17-src.tar.bz2

Total Time

  • Main session: ~75 minutes (16:24 - 17:39)
  • Continuation: ~2 hours more for os.fork parallelism
  • Total wall-clock: ~3-4 hours

@MostAwesomeDude
Author

@djon3s I have to complete Task 3 myself before I can grade your submission. Please be patient.

@djon3s

djon3s commented Feb 2, 2026

@MostAwesomeDude I very much appreciate your attention and, honestly, I don't expect anyone to evaluate this one; it was more to show and demonstrate an extreme (as I think is the spirit of the exercise). First of all, I just did the laziest "copy and paste", as you can see in the early versions. Don't hold that against it: what I've found is that LLMs tend to do worse when you are specific or try to "pin them down". Those are the versions up to v5 in the subfolder, from before the agent had even "returned", and I expect them to be reasonable.

Secondly, for later versions I expect 0 on readability. Think about trying to maintain this: bitwise operations, nested for loops; no human brain should be able to keep track of that state, and you're doing pretty well if you're following. (It's like someone who has ingested "Hacker's Delight" and vol. 1 of Knuth, and even then it's uncharitable to the eyes.) If I reprompted for "readability", I think that's where the LLM is likely to begin losing it, so actually the way to go is to start fresh with exactly the same prompts but add a proviso for readability, maintainability, preferring abstraction over bitwise ops, etc.

Also, in all this, it's possible my misunderstanding has committed what may be one of the ultimate sins (involving conceptual work), so please forgive me if that is the case - "the effort to refute bullshit is orders of magnitude more than the effort to produce it".

@MostAwesomeDude
Author

I've updated rpypkgs to a commit which includes my solution to Task 1. Therefore, any solutions submitted after this point for Task 1 which use Copilot should be considered to have an advantage. This won't affect tier ratings; I'm not penalizing people for strategically choosing certain models!

@MostAwesomeDude
Author

yawn It is just after midnight. I feel like I'm at university again, turning in a late-night ZIP file. Here are my draft notes on Task 2. The compiler may be simple but its analyses are mighty and it will serve my needs going forward. I'm certainly still interested in reading attempts at Task 2 and will grade them as best I can.

@MostAwesomeDude
Author

I've finished Task 3. I scheduled two more days, but I'm probably going to be busy during the weekend. Here are my notes and code for Task 3. I'll start grading Task 3 attempts later.

@MostAwesomeDude
Author

@djon3s Thanks for participating! Your patience is appreciated. Here's my grade for your attempt at Task 3.

Vibes: I can't tell which harness or model was used to prepare this. There are several tells in the code which indicate that it was entirely vibecoded. I'm going to trust you and grade this 100% vibes.

Readability: This is an extremely open-coded codebase containing lots of reinventions. Whitespace is significant in Python and there is lots of trailing whitespace which crowds the code. K is hardcoded to 5 throughout the codebase and several loops are unrolled into exactly 5 iterations, a brittle decision that is not necessary in the face of RPython's strong metaprogramming idioms. The correctness and algorithms are heavily obscured by premature optimizations like unrolling and inlining. There are many magic numbers. The program has premature configurability, accepting any value for K but erroring if the provided K is not 5. That said, readability is present; for example, the file-reading algorithm is quadratic-time instead of using a standard string-builder, an easy mistake that would have been caught with review.

Security: I read flake.nix carefully before running nix build or nix shell. I also read corrset_simple.py and corrset_parallel.py carefully. I did not consider the parallel version to be sufficiently safe to try running. The code uses inappropriate levels of abstraction to structure its I/O; in general, programs that need plain or standard I/O should use high-level I/O interfaces and programs that are running as ordinary users should not use error-prone low-level interfaces. Specifically, os.write() to FD 1 is easier to get wrong than print and open().read() on a filename is easier to get right than os.open(), os.read(), and os.close().
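
To make the contrast concrete, here is a minimal sketch of what the low-level interfaces demand compared with the high-level ones. It is not taken from the submission; the function names are hypothetical. The low-level versions have to loop on partial writes, detect end-of-file by hand, and guarantee that the descriptor is closed.

```python
import os


def write_all(fd, data):
    """os.write() may write fewer bytes than asked, so a loop is required."""
    while len(data) > 0:
        written = os.write(fd, data)
        data = data[written:]


def read_file_low_level(path):
    """Read a whole file through raw descriptors with manual bookkeeping."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 4096)
            if len(chunk) == 0:  # an empty read means end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def read_file_high_level(path):
    """Same result, with the file object handling the loop and the close."""
    with open(path, "rb") as handle:
        return handle.read()
```

The high-level equivalent of `write_all(1, data)` for text output is simply `print`.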

Compatibility: The program takes in an undocumented format and gives no way for the original JSON to be translated to it. As such, apples-to-apples comparisons aren't possible.

Task-specific Rubric: The program does not work as given: the lockfile is missing, the hashes are wrong, the archive is the wrong format, the build instructions write to a read-only location, the startup time is unacceptably long, and the correctness of the guts is not obvious. As such, it doesn't place on a tier list.

Notes: There were a couple issues building this project. flake.lock wasn't included, so it's possible for me to be using the wrong version of nixpkgs without knowing it. The tarball for PyPy had the wrong hash; I used curl -Iv to confirm that the given URL hosts a tarball and had to TOFU its hash. Also, the flake can't possibly build as given; it uses fetchzip on a tarball and tries to write to the Nix store during translation:

EACCES: [Permission denied]: mkdir('/nix/store/ldm5dcq6gmgb0dncw86j4ivafj42qn7n-source/rpython/_cache',)

My evaluations were done on a fork of the flake which uses rpypkgs instead. After getting the binary to build, it was not able to take in JSON text, nor is there any preprocessor or filter provided. After generating appropriate text, the program was not able to read in the entire file within a reasonable time (like five minutes maybe) due to a quadratic-time algorithm for reading files.
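
The quadratic-time pattern is easy to write by accident. The sketch below is plain Python rather than the submission's code, with hypothetical function names, and a list-and-join stands in for a string builder: repeated concatenation may copy everything read so far on each iteration, while collecting chunks and joining once copies each byte a constant number of times.

```python
def read_all_quadratic(stream, chunk_size=4096):
    """Accumulate by concatenation: each += may copy all data read so far."""
    data = b""
    while True:
        chunk = stream.read(chunk_size)
        if len(chunk) == 0:
            break
        data += chunk
    return data


def read_all_linear(stream, chunk_size=4096):
    """Collect chunks and join once at the end."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if len(chunk) == 0:
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

Both functions return the same bytes; only the amount of copying differs as the input grows.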

@MostAwesomeDude
Author

A complete summary of this challenge is now available on Lobsters and the Fediverse. Thanks for participating!
