Paper: https://arxiv.org/pdf/2602.16928
ChatGPT: https://chatgpt.com/share/69a88862-e3d8-8009-a766-84745a7485d7
alphaevolve repo: https://github.com/ankitmaloo/alphaevolve (very basic code, no parallelism)
While everyone is obsessed with new Transformer architectures, this paper is about something else:
- LLMs doing algorithmic discovery: using an LLM to invent foundational AI algorithms that humans couldn't think of, or haven't yet gotten to.
- Algorithm design has always been bottlenecked by human intuition. We write mathematically clean, logical equations. But what if the optimal learning algorithm is actually a messy, non-intuitive block of code?
- One-line summary: Google used an LLM (via AlphaEvolve) to mutate the Python code of two classic game-theory algorithms, producing bizarre, highly reactive new code that crushed human-designed baselines.
-
Prompt-Driven Evolution: It’s an evolutionary algorithm where the LLM is the "mutation operator."
- Take standard Python code for an algorithm.
- Prompt the LLM: "Here is the code. Modify it to improve performance. Think step-by-step. Explore the solution space to give me five distinct mutations."
- Run the newly generated code in a sandbox simulator.
- If it wins, it becomes the new parent. If it fails, discard it. Repeat thousands of times (minimal sketch of the loop below).
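A minimal sketch of that loop, assuming hypothetical `mutate` (the LLM call) and `evaluate` (the sandbox simulator) callables; this is my reconstruction, not the paper's implementation:

```python
from typing import Callable

def evolve(seed_code: str,
           mutate: Callable[[str], list[str]],  # LLM: parent code -> ~5 candidate programs
           evaluate: Callable[[str], float],    # sandbox: run the code, return a fitness score
           generations: int = 1000) -> str:
    """Greedy evolution with the LLM as the mutation operator."""
    parent, parent_score = seed_code, evaluate(seed_code)
    for _ in range(generations):
        for child in mutate(parent):     # "give me five distinct mutations"
            score = evaluate(child)      # run the generated code in the simulator
            if score > parent_score:     # if it wins, it becomes the new parent
                parent, parent_score = child, score
            # if it fails, it is simply discarded
    return parent
```

(The real AlphaEvolve maintains a database of programs rather than a single greedy parent; this just shows the shape of the loop.)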
-
Why this is huge for ML: compare it to hyperparameter tuning (grid search / Bayesian optimization). We (well, the paper's authors) aren't tweaking numbers anymore (learning rate 0.01 → 0.005); we are asking the AI to invent entirely new mechanics, variables, and logic gates.
-
The Background:
- Analogy: "Regret." Imagine going to a restaurant, ordering fish, and seeing your friend's steak. Your "regret" for not ordering the steak is +5. Next time, you are more likely to order the steak. Or poker maybe. folding when you could have won.
- How CFR works: the AI plays against itself in hidden-information games (like Poker). It calculates regret for every single decision it didn't make. Over time, it adjusts its strategy to minimize total regret, which mathematically leads to a Nash Equilibrium (an unbeatable strategy in two-player zero-sum games); the regret-matching sketch below shows the core update.
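The core update at each decision point is regret matching: play each action in proportion to its accumulated positive regret. A standard textbook version (not code from the paper):

```python
import numpy as np

def regret_matching(cum_regret: np.ndarray) -> np.ndarray:
    """Turn accumulated regrets into a strategy (a distribution over actions)."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total  # favor the actions we regret not taking
    return np.full(len(cum_regret), 1 / len(cum_regret))  # no positive regret: play uniform

# Restaurant example: regret +5 for the steak, 0 for the fish, -2 for the salad.
print(regret_matching(np.array([5.0, 0.0, -2.0])))  # -> [1. 0. 0.]: order the steak
```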
-
The Human Limit: Humans noticed old regrets (from early, dumb training) were ruining the math. So humans invented "Discounted CFR" (Noam Brown & Sandholm), which fades old regret on a fixed schedule, e.g. multiplying accumulated regret at iteration $t$ by $\frac{t^{1.5}}{t^{1.5}+1}$. Clean, static, human math.
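A sketch of the discounting idea, simplified to the positive-regret case (the $t^{1.5}$ weighting follows Brown & Sandholm's Discounted CFR as I understand it; the exact constants and the treatment of negative regrets differ in the full algorithm):

```python
def dcfr_discount(cum_regret: float, t: int, alpha: float = 1.5) -> float:
    """Fade old regret: multiply the running total by t^alpha / (t^alpha + 1)."""
    w = t**alpha / (t**alpha + 1)
    return cum_regret * w

for t in (1, 10, 100):               # early iterations are discounted hard,
    print(t, dcfr_discount(1.0, t))  # later ones barely: 0.5, ~0.97, ~0.999
```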
-
The AlphaEvolve Discovery (VAD-CFR):
- The LLM invented Volatility-Adaptive Discounting. It wrote code to measure how chaotic (volatile) the game state currently is using an Exponentially Weighted Moving Average (EWMA); a sketch of the idea follows this list.
- The "Aha!" Moment: The algorithm now dynamically changes how it learns based on the real-time chaos of the game. Humans never thought to use financial-style volatility metrics to discount game-theory regret!
- The Background (PSRO, human level):
- Analogy: Rock-Paper-Scissors. If your opponent only plays Rock, you learn to play Paper. But then they learn Scissors. You end up chasing your tail forever.
- How PSRO works: Instead of one agent, you keep a "population" of past agents. A "meta-solver" looks at the population and says, "Train a new agent that can beat a mix of 30% Rock, 20% Paper, 50% Scissors."
- The Human Limit: Humans struggle to write meta-solvers. Should the solver force the AI to explore weird new strategies, or exploit the best known ones? Humans usually hard-code one or the other.
- The AlphaEvolve Discovery (SHOR-PSRO):
- It wrote code that blends two different algorithms together, controlled by a "temperature" parameter that changes over time.
- The "Aha!" Moment: It automated the explore/exploit transition. Early in training, it forces the AI to explore wildly. As training goes on, the code naturally "cools down" and locks into a rigorous, exploitative Nash Equilibrium.
- In short, it wrote an annealing schedule: high-entropy exploration of the policy space early in training, with the temperature smoothly decaying to lock into a strict, exploitative Nash Equilibrium at the end (hypothetical sketch below).
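A hypothetical sketch of that annealing idea (again, not the paper's SHOR-PSRO code; the cooling schedule and the blend are assumptions): the meta-strategy over the population interpolates between a uniform exploration distribution and the exploitative Nash distribution as the temperature decays.

```python
import numpy as np

def annealed_meta_strategy(nash_dist: np.ndarray, step: int, total_steps: int,
                           t_start: float = 1.0) -> np.ndarray:
    """Blend explore (uniform) and exploit (Nash) under a decaying temperature."""
    temperature = t_start * (1 - step / total_steps)  # linear cooling, for illustration
    uniform = np.full_like(nash_dist, 1 / len(nash_dist))
    return temperature * uniform + (1 - temperature) * nash_dist

nash = np.array([0.3, 0.2, 0.5])  # e.g. 30% Rock, 20% Paper, 50% Scissors
print(annealed_meta_strategy(nash, step=0,  total_steps=100))  # ~uniform: explore wildly
print(annealed_meta_strategy(nash, step=99, total_steps=100))  # ~Nash: locked in, exploitative
```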
-
Interpretability vs. Performance: AlphaEvolve outputs raw Python code that works, but it might be messy or hard to understand why it works. Are we okay with foundational AI algorithms becoming "black boxes" of uninterpretable code, as long as they beat human baselines?
-
The Death of the ML Engineer? If an LLM can search the space of algorithms better than a PhD researcher, what is the role of the human in ML research in 5 years? Is it just writing the objective function?
-
Generalization: The LLM discovered these algorithms by training on simple games (like Leduc Poker). Yet, the algorithms scaled up to beat state-of-the-art baselines on much harder games (like Stratego). Why do you think "bizarre" code generalizes so well?
-
Beyond Game Theory: Could we point it at the PyTorch source code for the Adam optimizer and ask it to invent a better neural-network optimizer? A toy starting point is sketched below.
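As an illustration of what the "parent" program would look like, here is a bare-bones Adam step (textbook Adam in NumPy, not PyTorch's actual source), where every line is a potential mutation target:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad      # first-moment EWMA of the gradient
    v = b2 * v + (1 - b2) * grad**2   # second-moment EWMA
    m_hat = m / (1 - b1**t)           # bias correction for early steps
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v                # any of these lines is fair game for mutation
```
-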
LLMs struggle to maintain a coherent "world model" against adversarial adaptation. This paper shows that you don't need the LLM to have the world model natively. By using AlphaEvolve, you constrain the LLM to what it does best (syntax generation and heuristic blending) and let the multi-agent simulator serve as the adversarial world model.
The code for VAD-CFR and SHOR-PSRO is highly performant but mathematically messy. It lacks the clean convergence proofs that game theorists demand (and that original CFR has). As a technical community, how do we evaluate fundamentally new algorithms discovered by LLMs if they empirically work but resist formal mathematical verification?