@MostAwesomeDude
Last active March 31, 2026 01:00

Activating Two Trap Cards at Once, or: A Gentle Response to the Popularity of Vibecoding

Nor? Naur? Naureigh? NOR!? NOR! NAUR! NAUR! NAUR! NAUR PLEASE! NAUR PLEEEAAAZZZZ ~ Geek, 2022

Welcome to the carnival! We've got fun and games. I asked vibecoders to complete three tasks. When folks complained about that, I offered up five more tasks. I did half of these and got average scores. How well did the community do? Scroll to the end to find out!

My left eyelid seems to have developed a permanent twitch over the past two years as I've watched the profession of software engineering dissolve into grey goo. The ELIZA effect is too potent; humans have developed yet another supernormal stimulus with which to distract themselves from the actual goal.

So, what was that goal? Ostensibly, it's to build Naur theories. A Naur theory is a human's mental intuition for what a distributed system (or a single machine) does when programmed with a certain piece of code. In Naur's 1985 article, "Programming as Theory Building", the contents of Naur theories are not as important as the Theory Building View, which recognizes their existence. To this end, I submitted two vibecoding challenges (1, 2) to Lobsters. To onlookers, the issue with my chosen tasks is obvious: the vibecoder must have a fragment of the Naur theory in order to know what to put into the prompt for their coding harness, and they can't merely brain-scan it out of me. In terms of the common cybernetic metaphor of controlling a horse, the rider must have a destination in mind before they can reasonably be said to control the horse's progress towards that destination, regardless of whether the horse already knows how to get there. I'll mention the Naur theory for each task, and the reader can decide whether I withheld relevant information.

Why make it easy, though? I can make it hard to automatically retrieve details of the Naur theory. Here are the trap techniques I used:

  • Forgetfulness. I'm messy, disorganized, wide-reading, and generally bad at recall.
  • Idiosyncrasy. I'm already doing things a certain way and don't see the point in conforming to unjustified constraints, particularly those confabulated by chatbots.
  • Indirection. I put so many links to so many other pages. Surely one of them has the correct details.
  • Litotes. I love to understate, particularly when the understatement is ironic or sarcastic.
  • Obscurity. Any sort of universal-learning technique is only as good as its training corpus, but I only personally wrote a small part of that training data.

I had a second goal, though. RPython is a dialect of Python 2.7 which can be translated to C. Notably, translation can add multiple JIT compilers to the translated program, so that interpreters written in RPython can be lifted to JIT compilers. RPython is what we use when we want speed. Previously, on Lobsters (1, 2, 3), we witnessed the end of Python 2 according to the Python Software Foundation. However, I wanted to keep using RPython, so I set up a Nix flake that can translate RPython codebases, using Nix to replace upstream package management. My vibecoding challenge was therefore an opportunity to demonstrate what RPython can do. I'll mention how each task advances this goal, too.
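To make the RPython pipeline concrete, here's the general shape of a translatable program: an ordinary interpreter loop annotated with a JitDriver hint, plus the target hook that the translator looks for. This is a minimal sketch of my own devising (the `sum_to` loop is a toy, not from any task codebase), with a stub class so the same file also runs under plain CPython:

```python
# Minimal sketch of an RPython-translatable program. The JitDriver and
# target() conventions follow RPython's documented interface; sum_to is
# a toy stand-in for a real interpreter loop.
try:
    from rpython.rlib.jit import JitDriver
except ImportError:
    class JitDriver(object):
        """No-op stub so this file also runs under plain CPython."""
        def __init__(self, **kwargs):
            pass
        def jit_merge_point(self, **kwargs):
            pass

jitdriver = JitDriver(greens=["pc"], reds=["acc", "n"])

def sum_to(n):
    pc = 0  # stand-in for "position in the program" (the green variable)
    acc = 0
    while n > 0:
        # When translated with --opt=jit, RPython attaches a tracing JIT
        # at this merge point; interpreted under CPython, it's a no-op.
        jitdriver.jit_merge_point(pc=pc, acc=acc, n=n)
        acc += n
        n -= 1
    return acc

def entry_point(argv):
    print(sum_to(100))
    return 0

def target(driver, args):  # the hook RPython's translator calls
    return entry_point, None
```

Translation is then invoked with something like `rpython --opt=jit program.py`; the flake wraps that invocation and supplies the toolchain.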

Propagate the Fuck (Can't Propagate the Fuck)

This task was about updating a compiler to use a slightly fancier formalism than the one it had previously been using. I completed this task in three commits (1, 2, 3) over an allotment of two days. Following that, I needed to fix a bug which I had introduced by not fully considering how snippets like [-] and [--] behave; the former always zeroes a cell in finite time, but the latter can loop forever if the cell contains an odd value. The fix took two commits (1, 2); it wasn't until the latter commit that I actually built a theory of Brainfuck which understands the bug. Finally, I added one more commit with another primitive operation for even more speed.
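The bug can be miniaturized. Under the usual 8-bit wrapping-cell semantics, a loop body that decrements by one must reach zero, while one that decrements by two skips over zero whenever the cell starts odd. A quick illustration in plain Python (not the interpreter's actual code; `steps_to_zero` is my own toy name):

```python
def steps_to_zero(cell, dec, limit=1000):
    """Iterations taken by a Brainfuck loop whose body decrements `dec`
    times, on an 8-bit wrapping cell. None if it doesn't halt in `limit`."""
    for n in range(limit):
        if cell == 0:
            return n
        cell = (cell - dec) % 256
    return None

steps_to_zero(7, 1)  # [-]  halts after 7 steps
steps_to_zero(8, 2)  # [--] halts after 4 steps
steps_to_zero(7, 2)  # [--] on an odd cell: None; 7, 5, 3, 1, 255, ... never hits 0
```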

This is the task which got the most attempts. It was first in line, it sounds easy enough, it doesn't require learning about anything peculiar to me, and there are plenty of test cases.

Piper (@piperswe) placed in C tier with their attempt. This is by far the best of the attempts and I spent a few minutes trying to understand what stands out about it. While all other attempts used Claude Code, Piper's attempt used OpenCode. They also put a little bit of effort into their prompt and read the error messages after each part of the prompting session.

Naur theory: The theory of Brainfuck, particularly as an algebra, is documented on esolangs and linked in the task. The theory of final-encoded interpreters is documented using this interpreter as an example and was shared previously, on Lobsters.

RPython: bf.py is the second-best interpreter (which compiles) (and can correctly evaluate a benchmark suite) (that I know of) for Brainfuck. The best interpreter's written on top of GNU Lightning, another excellent JIT toolkit. With these changes, perhaps bf.py could get even faster and take first place! Indeed they're now close enough that I'm probably going to have to figure out a more holistic set of benchmarks before I can make a more precise claim.

Late, as in the late unknown-linux-musl

I'm building a Linux distro. Hopefully that's been obvious? I feel like I've been wearing it on my sleeve. Anyway, yes, I don't want to run NixOS with systemd in my home anymore, so I've been experimenting with writing my own system tools. This task concerns porting a prototype compiler from Raku into something that can be statically linked into a small little binary. I looked at Ada, tried OCaml and Rust, and ended up working in RPython. My full notes are here, but in short, UTF-8 support ended up being more of a showstopper for Ada and OCaml than expected, and Rust still has usability issues which would have made it impossible to deliver on time.

Naur theory: The theory of Vixen, including example code, is documented on esolangs. The theory of compiling Vixen expressions to execline wasn't on esolangs at the time, but it was documented using this interpreter as an example and was shared previously, on Lobsters. Fun fact: I just cracked open that jar of banana peppers today; I spiced up a bánh mì.

RPython: The result of this task is going into an initramfs. It has to be statically linked. Can RPython be statically linked and compiled with musl? Yes. Also, rply is getting old and I think I would just not recommend its usage in the future. Instead, I think I should improve the RPython docs on how to use the builtin PEG parser.

Don't you know? Python makes you fast. (Haha, one!)

I especially love the third task because that’s exactly the kind of shit you get thrown on your plate in the field as a SWE. That’s almost exactly one of the first tasks I got as an intern when I was starting out. ~ @V0ldek@awful.systems

This task is about optimizing an NP-hard search problem by tightening its inner loop. Fun fact: I have a mug from a coding competition I won in university where the final round was optimizing an NP-complete search. I won that round by writing the loop in Python and using PyPy to run it, which let me put the first entry on the board with the smallest amount of code; the contest rules said that other folks had to go faster with less code if they took more wall-clock time.

While I submitted a B-tier solution, I want to point out that the problem appears to have some numerical issues distinguishing the top result from multiple almost-perfect results, and as such I was also able to provide a probabilistic approximate solution which has S-tier timing. This is a great example of the disconnect that often arises between managerial specifications and engineering solutions due to the managerial inability to understand complexity-theoretic limitations; if we have an NP-hard problem, then the only quickly computable solutions are necessarily approximations.

Rust had another poor showing when it comes to usability. Well, it's cargo, really. I got three different build errors from a supposedly-lockfiled application, including one which indicates that the application can't be built with stable Rust, which feels like an antipattern.

Finally, I should mention that I was able to have a conversation with other people about this task. Their advice slightly improved my benchmark performance but ignoring it would still leave my solution in B tier. Rather, what I want to point out is that humans give better advice than chatbots.

Naur theory: The theory of k-CorrSet, including example code in Rust and Python, was linked in the original task and discussed previously, on Lobsters (1, 2).

RPython: I was pretty explicit about my motivation here: I bet that a fairly basic RPython program can match Rust while being shorter, or match CPython while being more maintainable and less janky.

An Object Finale

I thought that this task would be properly humbling for any would-be centaur. Just get the chatbot to write an encyclopedia article. The obvious trap is that the chatbot will have trouble not plagiarizing. However, there are also two hidden traps. First, English WP is wrong about Conway's Law, and the chatbots have learned similar wrongness. Second, I'm the principal author of nLab's article on Conway's Law, so I already completed this task in 2024 and can grade my prior work according to the given rubric.

  • Conventions and grammar are mostly correct: yeah, looks fine. (1/1)
  • Conway's Law is precisely and correctly stated: yes, first as a quote and then clarified. (2/2)
  • The corollary of the law is precisely and correctly stated: ditto, with a quote and clarification. (2/2)
  • Standard mathematical concepts are used precisely and correctly: looks good enough and was reviewed by an admin, but failed to convince some experts during peer review. (1/2)
  • Typical ethical standards regarding plagiarism and citations are met: yep, everything is cited and I used my own words. (1/1)
  • Article meets English Wikipedia notability standards: maybe, but evidence for that isn't presented in-article. (0/1)
  • Article meets English Wikipedia Featured Article criteria: the lead is weak and awkward; it wouldn't pass FA review. (0/1)

So I got a 7/10. Mid.

Naur theory: The theory of graphs is standard mathematical lore. The theory underlying Conway's Law is documented in Conway's 1968 article, "How Do Committees Invent?".

RPython: Nope.

The unattempted

The Menagerie

Nobody did this one. Somebody did claim to make an attempt, but it seems that Claude must have misled them somewhere along the way and they never posted code to review.

This task was a red herring! It is obviously silly. I included it because somebody complained in the first challenge that I was only asking people to work on compilers that I would personally use; okay, here's a compiler that nobody would use!

Stack-Popping Zaddies

This task was included because people complained that I was withholding the actual meaty work of PLT/PLDI from the chatbots. Okay, go ahead and finish one of my unfinished projects. It shouldn't be too hard; they just have to figure out efficient ACE-matching, an open problem.

That said, I have about half of a complete solution in my personal repo. There are two big problems. First, I'm not sure what kind of concrete syntax I should use to represent certain new data structures. Second, I haven't figured out the operational semantics around strings: should they be opaque or transparent WRT matching, and what are the consequences of that?

Metamath Messiah

This task is about implementing a Metamath interpreter and database. The database's axioms have to describe the states of the interpreter, including metainterpreting the interpreter's code. We can program the interpreter with a loop that alternates between searching for some solutions to arbitrary goals and searching for provable improvements to its own code; this is literally an unbounded while-loop.

This is an engineering slog with lots of head-hurting meta-levels. (Well, just two meta-levels, but that's enough to cause head pain.) Also, in terms of computational complexity, this should be the hardest thing that we can ever do as programmers, since a sufficiently-rich collection of axioms will hit the computability ceiling and our unbounded while-loop causes the overall long-term behavior to be undecidable. The payoff's not worth much either, or at least I can't imagine such a machine doing anything other than sitting in a corner and proving a pile of Metamath theorems.

Compression is Magic

This task is about implementing LZW. That's it, really. Save it to an on-disk format that's optimized for being large instead of fast or streaming.

I have this task about half-done in my personal repos. I've implemented a small LZW prototype, fixed an on-disk format, and written a pile of buggy RPython. I don't really care about this task, but I do care about doing it right: reproducible, readable, informative.
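For reference, the core of LZW compression is tiny; this is the generic textbook sketch (byte-wide seed dictionary, integer codes), not the prototype or the on-disk format from my repos:

```python
def lzw_compress(data):
    """Classic LZW: dictionary seeded with all single bytes; each emitted
    code is an index into the growing table of byte strings."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    w = b""
    out = []
    for byte in data:
        wb = w + bytes([byte])
        if wb in table:
            w = wb  # keep extending the current match
        else:
            out.append(table[w])   # emit the longest known prefix
            table[wb] = next_code  # learn the new string
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

lzw_compress(b"TOBEORNOTTOBEORTOBEORNOT")  # 24 input bytes become 16 codes
```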

Analysis

First, [chatbots aren't] letting me do things I otherwise cannot do. Second, [chatbots aren't] letting those people do things I do. Is this gatekeeping? What, if anything, am I withholding? ~ me, 2026

So, here's the thing, y'all. I don't know how to use a computer. I can put a lot of effort in, but I don't really have an intuitive understanding for what I'm doing. To echo the words of Bill Murray against Chevy Chase, I am "medium talent"; to echo the words of countless video-game opponents, I am a "tryhard". Did you know that I originally went to university to be a musician? I studied jazz piano! It should not be my responsibility to point out systemic issues or to challenge our leaders on their positions or to be a leader myself; I should be in a corner making weird sounds that nobody wants to hear.

At the same time, it is obvious to me that generative chatbots will never be able to produce working code by mere dint of reinforcement learning. A serious literature review has only turned up one system that can turn an RNG into working code, discussed previously, on Lobsters, and it isn't a chatbot or a neural network. For reasons that I don't understand, boosting generative-chatbot products isn't a bannable or even a warnable offense, but treated like polite discourse rather than native advertising. Therefore I suppose that some of us are doomed to the role of Cassandra for as long as people refuse to learn to distinguish a pile of memes from human beings.

On a similar timbre but a different pitch, let's reflect on whether Python 2 stayed dead when CPython 2.7 was discontinued. The real value of CPython is as the bottom-most layer of bootstrapping for bringing up PyPy on a new architecture when RPython's generated C code is insufficiently portable. If RPython is what we use for speed then CPython is what we use for portability. Aside from that, though, it takes one organization on one git forge with one committer to maintain one Nix flake, and that flake can provide a listing of prepackaged libraries like rsdl which are also maintained through that same organization.

Rather than make this a lesson wholly about CPython, I would insist that this is merely another instance of the pattern of Nix replacing bespoke package managers. In this case, Nix replaces pip. Note how, in Python 3 packaging, Nix also replaces all of pip's successors. There is currently great unrest about OpenAI acquiring Astral, discussed previously, on Lobsters, which is mitigated quite a bit by never having bothered adopting uv or any of the intermediate trendy Python-only package managers.

There's also the consideration of Conway's Law. Quoting nLab:

A lone builder, working in a short span of time, has no structural limitations on what they produce. However, a highly structured team of builders, working across a long period of time where individual humans join and leave, has a large structure with many edges that each provoke a corresponding interaction in the resulting system.

Am I only productive because I'm not working with other people? Well, first, no: I regularly collaborate with multiple teams of maintainers, and I benefit from collaborative knowledge and community knowledge bases, so the premise is bogus. But even granting the premise, yes, Conway indicates that there could be structural limitations to productivity which are only escaped by working in insular teams as small as one person.

No, like, analysis of the vibecoded outputs

Oh, okay, yeah, that makes more sense. So, there are several patterns that I noticed. First, the models confabulated quite a bit. They told participants that their code was short and neat, that their benchmarks were fast, that their tests were passing, that their functionality was implemented, and probably other stuff I didn't catch. Immense amounts of wishful thinking and fabricated results.

Something I can't characterize well is the hackiness of generated code. I know what it isn't. It's not the case that models are seeking low-energy or stationary-action solutions; they aren't trying to generate the tersest or simplest code for standard problems. It's also not a failure to condition solutions upon existing code, although hackiness is partially about not fitting in. It's not being low-effort either; one attempt put in a lot of effort to deliberately not use my existing code. I think that hackiness is about a stodgy refusal to work with the code according to the (Naur) theory which originally generated it.

If I were to be extremely generous, I might imagine that an LLM learns about ways to phrase characteristics of many families of Naur theories as phrased by people, but it doesn't directly learn any underlying Naur theory. That is, style transfer doesn't imply technique transfer or theory transfer. How could it? Take Escher as an example artist; knowing about his style (thin short lines) and media (carved wood, pen on paper) does nothing to communicate the underlying geometric ideas which characterize his pieces.

I think that the most dire problem is overfitting to the task at hand. The way that we should think of today's machine-learning models, including Transformers, is that they are always trying to fit solutions at multiple meta-layers simultaneously. This means that they can overfit at training time (becoming stereotypical and shallow in affect), overfit at RLHF time (becoming obsequious and sycophantic), or overfit at prompt time (producing facile, meme-laden answers based on keywords and stereotypes). Examples of overfitting behavior when writing units under test include unrolling loops and inlining magic numbers. Overfitting also occurs when writing the tests themselves; test material is drawn from the task at hand.
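To make that concrete with a contrived example of my own (both function names are hypothetical): here's a general popcount next to an "implementation" that merely memorizes the three inputs its test suite happens to exercise.

```python
def popcount(n):
    """General solution: count the set bits of a nonnegative integer."""
    count = 0
    while n:
        count += n & 1
        n >>= 1
    return count

def popcount_overfit(n):
    """Overfit 'solution': inlined magic numbers covering exactly the
    test inputs and nothing else. Passes the suite, lacks the theory."""
    return {0: 0, 7: 3, 255: 8}[n]
```

Both pass a test suite drawn from {0, 7, 255}; only one of them computes anything on a withheld input.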

For a crisp example, there was a submission for task 1 where the generated interpreter handled an in-distribution test, mandel.b, very well; in fact, it outperformed the reference interpreter on that test. However, there was a withheld test, LostKng.b, which catastrophically failed. This has precisely the structure of a learned generator which overfit on mandel.b and lost the ability to generally interpret Brainfuck as a consequence.

Conclusion

The tier listing is as follows:

  • B tier: Corbin S. (Task 1), Corbin S. (Task 2), Corbin S. (Task 3)
  • C tier: Piper M. (Task 1)

That's all. There wasn't a stampede of folks asking their agents to complete these tasks. There was a general sense of skepticism, disbelief, and anger. There were a few folks who felt I was rude. But mostly there was a dearth of participation and effort.

What a contrast from the robots of sci-fi horror! There was no moment at which some Robo-Corbin threatened to displace me by doing everything I do, but better: writing tighter code, handing out meaner sneers, speedrunning retro games with even more deaths, killing even more garden plants, spending even more time on unreadable rambly blog posts. It turns out that one can't just copy somebody's Naur theories by reading what they've written; one must think for themselves and build their own personal Naur theory.
