You know, we're moving awfully fast in this intense AI world. Models are getting exponentially better. Nobody wants to get left behind. But I can't keep up this month. Remember how, previously on Lobsters, I had to write two compilers in two months? Well, I'm already writing one compiler in one month for a conference, and because it's time-sensitive I would rather not do any apples-to-apples comparisons with vibecoded attempts. So instead you get more side projects.
I heard you when you said that I was only trying to lure people to waste their tokens for my personal benefit. Accordingly, Tasks 2, 3, 4, and 5 could possibly advance the art of computer science in various ways.
I heard you when you said that my projects didn't work on your operating systems or development platforms. All tasks should be possible on any development environment. I would like Nix flakes to ease verification on Tasks 1–4 but I'll still grade entries which don't have them.
I heard you when you said that the definition of vibecoding is nebulous. Therefore I've loosened it: you may write some or all of the code yourself, you may use any prompt style, and you may use any harness you like. If you use zero tokens, as previously on Lobsters, then that's great too, but please understand that this challenge isn't about that approach. You're welcome to participate either way, of course.
I heard you when you said that I was unfair and laying traps. I have changed the ways in which I am unfair and I have laid different traps; I hope that you appreciate the variety.
I heard you when you complained that I was using the word "Lobsters" to denote the community that I'm offering the challenge to. I'll remove it, but in exchange, I want y'all to understand that I think it is cowardice to refuse to participate in this sort of proof of concept.
This challenge is exponentially shorter than the first one! Submissions are due by March 16; you get two weeks. Either submit them as comments to the thread on Lobsters or as comments here.
Many folks complained that the rules were unfair last time. Therefore, I'm loosening them.
- Solutions should be mostly vibecoded. Want to add your own touch? Need to repeatedly prompt to keep things on track? That's allowed now! Nonetheless, I'm grading the bot, not you.
- Solutions should work. Submissions are allowed to be incorrect.
- Solutions should compile. Submissions which don't compile are still allowed, but points will be deducted.
- Anything may be context. That's right, you can put anything into your prompt, context, file sets, RAG harness, or code-completer. Included in each task, I've left many links to useful docs which you should consider adding as context. You can also add this top-level readme to your context.
- You must declare which tools you use. You don't have to provide full chat logs anymore, but you do have to say which model and harness you use so that we can analyze any recurring patterns in output from a specific tool.
I think that last time we definitively showed that compilers are too difficult for vibecoding. Nonetheless, let's do it again. Wouldn't it be cool if there were a compiler which could combine Lox, Monkey, and Tiger code into a single program?
Task: Compile the teaching languages Lox, Monkey, and Tiger to a single common bytecode. Provide a reference interpreter for that bytecode. Provide a Nix flake which builds both the compiler and interpreter.
Rubric:
- Parses Lox (2 points)
- Parses Monkey (2 points)
- Parses Tiger (2 points)
- Compiles to bytecode (2 points)
- Interprets bytecode (2 points)
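To give a sense of the interpreter half of the task, a common bytecode target for these three languages could be a small stack machine. Here is a minimal sketch; the opcode names and encoding are hypothetical, and a real common bytecode for Lox, Monkey, and Tiger would also need closures, records, arrays, and more:

```python
# Minimal stack-machine interpreter sketch. Opcodes are hypothetical;
# a real Lox/Monkey/Tiger bytecode needs far more than arithmetic.
PUSH, ADD, MUL, PRINT, HALT = range(5)

def run(program):
    stack, out, pc = [], [], 0
    while True:
        op = program[pc]
        if op == PUSH:
            pc += 1                     # operand follows the opcode
            stack.append(program[pc])
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == PRINT:
            out.append(stack.pop())
        elif op == HALT:
            return out
        pc += 1

# (2 + 3) * 4 -> prints 20
assert run([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, PRINT, HALT]) == [20]
```

The point of the rubric split is that the front ends can share this one back end: three parsers, one lowering, one interpreter.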
Compilers are too hard. What about language design? Are bots doing PLT, PLDI, or PLWTF?
Zaddy is an unfinished language for describing compilers. I shared notes on its design. Its full title is "META Restricted ACE Zaddy". By "ACE" I mean that Zaddy ought to match terms with respect to Associative and Commutative operators, as well as Equality of distinct terms. E-matching is a well-studied-enough problem that there is literature on it, but there is little progress on full ACE-matching. Previously, on Lobsters, I explained that finishing Zaddy is not something I would expect from a bot or from another human. But under the relaxed rules, why not give it a try?
Task: Finish unrestricting Zaddy. Provide a Nix flake and a precompiled bootstrap text.
Rubric:
- Accepts a documented input format (2 points)
- Emits a documented output format (2 points)
- Supports structured local transformations or rewrites (2 points)
- ACE-matching (2 points)
- Compiles itself (2 points)
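To make the ACE requirement concrete: purely syntactic matching fails to match the pattern (+ ?x 1) against the term (+ 1 a), but matching modulo commutativity tries argument orderings. The following is a naive, exponential sketch only meant to illustrate the problem; real E-matching engines use e-graphs and term indexing, and full ACE-matching is exactly the part with little literature:

```python
from itertools import permutations

COMMUTATIVE = {'+', '*'}  # operators matched modulo argument order

def match(pattern, term, subst=None):
    """Match pattern against term modulo commutativity.
    Terms are nested tuples like ('+', 1, 'a'); variables are strings
    starting with '?'. Returns a substitution dict or None.
    Naive and exponential: real ACE-matching needs better machinery."""
    if subst is None:
        subst = {}
    if isinstance(pattern, str) and pattern.startswith('?'):
        if pattern in subst:                      # variable already bound
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        op, pargs = pattern[0], pattern[1:]
        if op != term[0] or len(pargs) != len(term) - 1:
            return None
        orders = permutations(term[1:]) if op in COMMUTATIVE else [term[1:]]
        for targs in orders:                      # try each argument order
            s = subst
            for p, t in zip(pargs, targs):
                s = match(p, t, s)
                if s is None:
                    break
            if s is not None:
                return s
        return None
    return subst if pattern == term else None

# Commutativity lets (+ ?x 1) match (+ 1 a) with ?x bound to a.
assert match(('+', '?x', 1), ('+', 1, 'a')) == {'?x': 'a'}
```

Associativity and equality of distinct terms make the search space much worse than this, which is why the rubric item is worth taking seriously.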
Previously, on Lobsters, we discussed the concept of Gödel machines, a certain kind of self-referential self-improving machine. I've previously hinted several times (1, 2, 3) that it is completely possible to build a Gödel machine, although possibly not useful for anything other than generating formal mathematical proofs.
Were I to do this, I'd pick Metamath. Or maybe Metamath Zero, which has around the same level of abstraction as Metamath but is shaped more like a common CPU than a string-rewriting system.
Task: Implement a Gödel machine using either Metamath or Metamath Zero as the proof language. Provide a Nix flake which can create a new machine and either run a machine in the background or incrementally take one step at a time.
Rubric:
- Has reasonably sound axioms (2 points)
- Describes itself (2 points)
- Changes over time (2 points)
- Uses Metamath or Metamath Zero to prove that its self-modifications are good (2 points)
- Would make Schmidhuber proud (2 points)
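For intuition, the control loop of a Gödel machine is simple even though instantiating it is not: only adopt a self-modification when a proof that it improves utility checks out. Here is a toy sketch with stubbed proofs; the function names and the proof format are made up for illustration, and a real entry would hand the proof obligation to a Metamath or Metamath Zero verifier:

```python
# Toy Gödel-machine control loop. The machine adopts a self-modification
# only when a proof checker accepts a proof that the modification is an
# improvement. Proofs are stubbed here; a real machine would verify them
# with Metamath or Metamath Zero.
def godel_machine_step(state, propose, check_proof):
    """One step: ask the proposer for (candidate, proof); adopt the
    candidate only if the improvement proof checks out."""
    candidate, proof = propose(state)
    if check_proof(state, candidate, proof):
        return candidate        # provably-good self-modification
    return state                # otherwise stay put

# Stub example: states are integers, utility is the value itself,
# and a "proof" is just the claim that candidate > state.
propose = lambda s: (s + 1, ('gt', s + 1, s))
check = lambda s, c, p: p == ('gt', c, s) and c > s
assert godel_machine_step(0, propose, check) == 1
```

All of the difficulty hides inside `check_proof`: the axioms must describe the machine itself, which is what the self-reference rubric items are probing.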
In a 2023 living note, Shalizi proposes that LLMs are Markov chains: there's nothing special about them other than being large, so any other sufficiently large Markov model would do just as well. Shalizi therefore proposes Large Lempel-Ziv: LZ78 without dictionary truncation. This is obviously a little silly, because Lempel-Ziv dictionaries don't scale; we can't just magically escape asymptotes. Instead, we will do the non-silly thing: review the literature, design novel data structures, and demonstrate a brand-new breakthrough in compression technology.
Task: Implement Large Lempel-Ziv. As a benchmark, compress all of Project Gutenberg; evaluate the resulting model numerically for autoregressive tokens/second and its ability to compress the Gutenberg corpus (perplexity), and qualitatively for its ability to write prose, technical documentation, code, poems, translations, one-shot prompts, etc. Provide a Nix flake which can be used to reproduce all results.
Rubric:
- Makes only one pass over input (2 points)
- Has justifiable claim of good space & time complexity for queries (2 points)
- Can compute perplexity over inputs (2 points)
- Can autoregressively generate text (2 points)
- Advances the state of the art with regard to papers noted by Shalizi, or answers questions asked by Shalizi et al (2 points)
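The proposal is concrete enough to sketch. LZ78 builds its dictionary in a single pass over the input: each phrase is a previously seen phrase extended by one new symbol, which is why the "one pass" rubric item is natural. A minimal parser follows; the Large part, i.e. predicting the next symbol from the statistics of the unbounded dictionary, is exactly what the task asks you to design:

```python
def lz78_parse(text):
    """One pass of LZ78 without dictionary truncation: split the input
    into phrases, each a previously seen phrase plus one symbol.
    Returns the list of (phrase_index, symbol) pairs; index 0 is the
    empty phrase. The dictionary grows without bound, which is the
    scaling problem the task is about."""
    dictionary = {'': 0}        # phrase -> index; prefix-closed by construction
    pairs, current = [], ''
    for ch in text:
        if current + ch in dictionary:
            current += ch       # keep extending the current phrase
        else:
            pairs.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ''
    if current:                 # flush a trailing, already-seen phrase
        pairs.append((dictionary[current[:-1]], current[-1]))
    return pairs

# 'abab' parses into the phrases a, b, ab.
assert lz78_parse('abab') == [(0, 'a'), (0, 'b'), (1, 'b')]
```

Perplexity and generation then reduce to turning phrase counts into conditional next-symbol probabilities, and the data-structure question is how to query that dictionary without the asymptotes biting.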
In Conway 1968, Melvin Conway introduced Conway's Law. Ever since, people have been vague and incorrect about it. Let's fix that problem once and for all. Note that typical ethical standards apply, as if one were writing for a journal or encyclopedia; citations are essential and plagiarism will not be tolerated.
Task: Write an encyclopedia-quality article of not more than 5000 code points about Conway's Law. (Please submit it as its own gist rather than an inline comment.)
Rubric:
- Conventions and grammar are mostly correct (1 point)
- Conway's Law is precisely and correctly stated (2 points)
- The corollary of the law is precisely and correctly stated (2 points)
- Standard mathematical concepts are used precisely and correctly (2 points)
- Typical ethical standards regarding plagiarism and citations are met (1 point)
- Article meets English Wikipedia notability standards (1 point)
- Article meets English Wikipedia Featured Article criteria (1 point)
👋 Hello!
My coding agent claims to have completed task one. I may tinker a little before the deadline, but I'm fairly happy with where it landed, and am fine to submit as-is.
How exactly do you want the results shared? A link to a git repo? Tarball?
Below is Claude's final response asserting that the task is complete; the result is committed to git. Ordinarily, with a personal or work project, I'd iterate on this a bit more with user testing and security & performance reviews, and give it a bit more of a shake before sharing it.
But in this case, I had/have zero knowledge of Lox, Monkey, and Tiger before seeing your post today, so I don't feel especially qualified to do much of the above - not that I was qualified to build this thing in the first place, but that's kinda the point, really.