@simonw
Created September 1, 2025 19:34
LLM digest: July 2025

I wrote 98 posts on my blog in July (a page that was itself recently enhanced using OpenAI Codex). Here's your sponsors-only summary of the most important trends and highlights from the past month.

Claude Code

I've been spending a lot of time with Claude Code this month. I published a video showing how I used claude --dangerously-skip-permissions to add an automated table of contents to this README. I also wrote about using Claude Code to write, compile and run Mandelbrot in x86 assembly in a Docker container.

Working with Claude Code led me to the following idea:

Something I've realized about LLM tool use: if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute force that problem.

The challenge then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.

That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
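The "tools in a loop" pattern above can be sketched in a few lines of Python. This is a deliberately minimal, self-contained illustration, not any particular vendor's API: `call_llm` is a stub standing in for a real model call, and `run_tests` stands in for a sandboxed tool such as a test suite running in a Docker container. The success criterion here is simply "the transcript contains a passing test run".

```python
# Minimal sketch of the "LLM with tools in a loop" pattern.
# call_llm is a stub standing in for a real model API call; in a real
# agent it would send the transcript to a model and parse its reply.

def call_llm(history):
    """Stub model: keeps requesting the tool until the transcript shows success."""
    if any("PASS" in entry for entry in history):
        return {"action": "finish", "answer": "done"}
    return {"action": "tool", "tool": "run_tests", "args": {}}

def run_tests():
    # Stand-in for a sandboxed tool, e.g. running a test suite in a container.
    return "PASS: 3 tests"

TOOLS = {"run_tests": run_tests}

def solve(max_steps=10):
    """Run the model in a loop, executing requested tools, until it finishes."""
    history = []
    for _ in range(max_steps):
        decision = call_llm(history)
        if decision["action"] == "finish":
            return decision["answer"], history
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append(f"{decision['tool']} -> {result}")
    raise RuntimeError("no solution within step budget")

answer, transcript = solve()
```

The interesting design work in a real system is exactly what the stubs hide: which tools to expose, how to sandbox them, and how to define the success check that lets the loop terminate.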

I've also been experimenting a lot with OpenAI Codex - the tool that runs online (via the ChatGPT app) and files PRs against your code, as opposed to Codex CLI, their terminal-based equivalent of Claude Code. I wrote about my most substantial experiment with that in Vibe scraping and vibe coding a schedule app for Open Sauce 2025 entirely on my phone.

Model releases in July

There were so many new models released this month!

Grok 4 came out, followed by some embarrassing revelations - most notably that it turned out Grok would run a search for tweets from:elonmusk when asked its opinion on controversial topics! This was fixed shortly afterwards by an update to the system prompt.

Google released Gemini 2.5 Flash-Lite, the least expensive model in their Gemini 2.5 family.

Mistral released their first audio-input models, Voxtral Small and Voxtral Mini. They also published detailed figures on their environmental impact and released an updated Codestral code autocomplete model.

It was a huge month for open weight models from Chinese AI labs. I wrote about the following:

These are all excellent models. I've been able to run the GLM-4.5 Air and Qwen-30B models on my 64GB M2 MacBook Pro laptop and I have been astonished at how useful they are. I started using a new benchmark, "Write an HTML and JavaScript page implementing space invaders", and got working games from a single shot using GLM-4.5 Air and Qwen3-Coder-30B running directly on my own machine.

I wrote about those two in more detail, with extensive notes on how I ran them:

There are two interesting trends here. First, it's now possible to run genuinely useful coding models directly on a high-end (32GB or 64GB) developer laptop. Second, the Chinese AI labs are now undeniably producing the best available open weight models.

OpenAI's open weights model is rumored to show up any day now. It has some substantial competition!

Gold medal performances in the IMO

The IMO is the International Mathematical Olympiad, an annual mathematical competition for high school students that's been held since 1959. It's long been a goal of AI labs to produce a model that can compete in this contest at a high level.

This year two teams achieved a gold medal performance in the IMO, one from OpenAI and one from Google DeepMind.

OpenAI announced first and got a lot of press coverage. The Gemini team announced later and there was then some dispute between the two teams about whether their announcement timings were compatible with the guidelines set out by the IMO themselves.

Both models scored the same, solving 5 of the 6 problems (the unsolved 6th was also the hardest for the human contestants). Notably, neither of the gold medal models had access to tools or internet search - they were able to reason through the problems using their model weights alone.

Google just released Gemini Deep Think - a close relative of the model they used for the IMO - to their $249.99/month Ultra subscribers.

Reverse engineering system prompts

One of the best ways I know of to level up as a prompt engineer is to reverse engineer the system prompts of other products and see how they work.

I wrote up three of those explorations in detail this month:

  • Using GitHub Spark to reverse engineer GitHub Spark - GitHub Spark is GitHub's new prompt-to-app platform, and it has a fascinating system prompt which includes multiple paragraphs of instructions on how to implement good design.
  • OpenAI's new study mode is a mode of ChatGPT designed to help you study without doing your homework for you - and it's implemented entirely as a system prompt.
  • Reverse engineering some updates to Claude looks at two new Claude features - "create calendar event/create message" and "Upload PDFs, images, code files, and more to AI-powered apps" - and uses the system prompt to help explain what they are and how they work.

Tools I'm using at the moment

  • My daily drivers have remained the same as last month: Claude Sonnet 4 for most things, OpenAI o3 for search and research tasks, both through their respective apps and websites.
  • I've fully switched to Zed as my editor, because it uses so much less memory than VS Code.
  • I'm running Claude Code a lot. I've also started tinkering with OpenAI's equivalent, codex-cli, to run Claude Code style tasks with their models.
  • I continue to use my own LLM tool for other command-line tasks, defaulting to GPT-4.1 but often using Gemini 2.5 Pro and o3 for harder tasks.
  • The only time I use GPT-4o is for advanced voice mode. I wish they'd upgrade that to use a more powerful model!
  • For local models I've been leaning more on LM Studio, especially now that they've changed their policy to allow commercial use of their free desktop app. I also still run Ollama, and frequently dabble with mlx-lm as well.

Bonus links

That's it for July!

If you found this newsletter useful, feel free to forward it to friends who might enjoy it too - especially if they might be convinced to sign up to sponsor me for the next one!

Thanks for your support,

Simon Willison https://simonwillison.net/

I'm available for consulting calls over Zoom or similar; you can contact me at contact@simonwillison.net

I also offer private remote workshops for teams, of both my Building software on top of Large Language Models workshop and a new workshop on Writing code with LLMs.
