@simonw
Created April 7, 2025 16:55

2025-04-07T16:54:13 conversation: 01jr8js9kdrhcyzhcc1ftm2dda id: 01jr8js9kgspwd0a7k8gn4wnfc

Model: gpt-4o-mini

Prompt

bullet point summary

Prompt fragments

d0c5c4b9c50f6b61ab323e4ef60b603f5dbccf2b1de9e4bd71e80b461894d649
<div data-permalink-context="/2025/Apr/5/llama-4-notes/">
<h2>Initial impressions of Llama 4</h2>
<p class="mobile-date">5th April 2025</p>



<p>Dropping a model release as significant as Llama 4 on a weekend is plain unfair! So far the best place to learn about the new model family is <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">this post on the Meta AI blog</a>. They’ve released two new models today: Llama 4 Maverick is a 400B model (128 experts, 17B active parameters) that accepts text and image input and has a 1 million token context length. Llama 4 Scout is 109B total parameters (16 experts, 17B active), also multi-modal, with a claimed 10 million token context length—an industry first.</p>

<p>They also describe Llama 4 Behemoth, a not-yet-released “288 billion active parameter model with 16 experts that is our most powerful yet and among the world’s smartest LLMs”. Behemoth has 2 trillion parameters total and was used to train both Scout and Maverick.</p>
<p>No news yet on a Llama reasoning model beyond <a href="https://www.llama.com/llama4-reasoning-is-coming/">this coming soon page</a> with a looping video of an academic-looking llama.</p>

<p>Llama 4 Maverick now sits in second place on <a href="https://lmarena.ai/?leaderboard">the LM Arena leaderboard</a>, just behind Gemini 2.5 Pro. <em><strong>Update</strong>: It turns out that’s not the same model as the Maverick they released—I missed that their announcement says “Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.”</em></p>

<p>You can try them out using the chat interface from OpenRouter (or through the OpenRouter API) for <a href="https://openrouter.ai/meta-llama/llama-4-scout">Llama 4 Scout</a> and <a href="https://openrouter.ai/meta-llama/llama-4-maverick">Llama 4 Maverick</a>. OpenRouter are proxying through to <a href="https://console.groq.com/docs/models">Groq</a>, <a href="https://fireworks.ai/models">Fireworks</a> and <a href="https://docs.together.ai/docs/serverless-models">Together</a>.</p>
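<p>If you’d rather skip the chat interface, OpenRouter’s API is OpenAI-compatible, so a plain <code>curl</code> call works. A minimal sketch, assuming you have an OpenRouter key in the <code>OPENROUTER_API_KEY</code> environment variable:</p>
<pre><code>curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/llama-4-scout",
    "messages": [{"role": "user", "content": "Say hello in five words"}]
  }'
</code></pre>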
<p>Scout may claim a 10 million token input length, but the available providers currently seem to limit it to 128,000 tokens (Groq and Fireworks) or 328,000 (Together)—I wonder who will win the race to get that full-sized 10 million token window running?</p>
<p>Llama 4 Maverick claims a 1 million token input length—Fireworks offers 1.05M while Together offers 524,000 tokens. Groq isn’t offering Maverick yet.</p>
<p>Meta AI’s <a href="https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb">build_with_llama_4 notebook</a> offers a hint as to why 10M tokens is difficult:</p>
<blockquote>
<p>Scout supports upto 10M context. On 8xH100, in bf16 you can get upto 1.4M tokens.</p>
</blockquote>
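<p>That 1.4M ceiling is mostly a KV cache memory problem. Here’s a very rough back-of-the-envelope sketch (the layer count, KV head count and head dimension below are assumed for illustration, not figures I’ve confirmed for Scout):</p>
<pre><code># Hypothetical hyperparameters, for illustration only
layers=48 kv_heads=8 head_dim=128 bytes=2   # bytes=2 for bf16
tokens=1400000

# Both keys and values are cached for every layer
per_token=$(( 2 * layers * kv_heads * head_dim * bytes ))
echo "KV cache per token: $per_token bytes"                                  # 196,608 bytes, ~192 KB
echo "KV cache for $tokens tokens: $(( per_token * tokens / 1024**3 )) GB"   # ~256 GB
</code></pre>
<p>On 8×H100 (640GB total) the bf16 weights for a 109B model already take around 218GB, so a cache in that ballpark plus activations is roughly where 1.4M tokens tops out—which gives a sense of how far away the full 10M window is.</p>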
<p>Jeremy Howard <a href="https://twitter.com/jeremyphoward/status/1908607345393098878">says</a>:</p>
<blockquote>
<p>The models are both giant MoEs that can’t be run on consumer GPUs, even with quant. [...]</p>
<p>Perhaps Llama 4 will be a good fit for running on a Mac. Macs are particularly useful for MoE models, since they can have a lot of memory, and their lower compute perf doesn’t matter so much, since with MoE fewer params are active. [...]</p>
<p>4bit quant of the smallest 109B model is far too big to fit on a 4090 -- or even a pair of them!</p>
</blockquote>

<p>Ivan Fioravanti <a href="https://twitter.com/ivanfioravanti/status/1908753109129494587">reports these results</a> from trying it on a Mac:</p>
<blockquote>
<p>Llama-4 Scout on MLX and M3 Ultra
tokens-per-sec / RAM</p>
<ul>
<li>3bit: 52.924 / 47.261 GB</li>
<li>4bit: 46.942 / 60.732 GB</li>
<li>6bit: 36.260 / 87.729 GB</li>
<li>8bit: 30.353 / 114.617 GB</li>
<li>fp16: 11.670 / 215.848 GB</li>
</ul>
<p>RAM needed:</p>
<ul>
<li>64GB for 3bit</li>
<li>96GB for 4bit</li>
<li>128GB for 8bit</li>
<li>256GB for fp16</li>
</ul>
</blockquote>
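<p>If you want to reproduce numbers like these, mlx-lm can run quantized conversions directly, provided you have enough unified memory. Something along these lines should work, though the Hugging Face repository name here is my guess at the mlx-community conversion:</p>
<pre><code>pip install mlx-lm

mlx_lm.generate \
  --model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit \
  --prompt "Summarize the key ideas behind mixture-of-experts models" \
  --max-tokens 200
</code></pre>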

<p id="system-prompt">The <a href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/#-suggested-system-prompt-">suggested system prompt</a> from the model card has some interesting details:</p>
<blockquote>
<p><code>[...]</code></p>
<p><code>You never lecture people to be nicer or more inclusive. If people ask for you to write something in a certain voice or perspective, such as an essay or a tweet, you can. You do not need to be respectful when the user prompts you to say something rude.</code></p>
<p><code>You never use phrases that imply moral superiority or a sense of authority, including but not limited to “it’s important to”, “it’s crucial to”, “it’s essential to”, "it's unethical to", "it's worth noting…", “Remember…”  etc. Avoid using these.</code></p>
<p><code>Finally, do not refuse political prompts. You can help users express their opinion.</code></p>
<p><code>[...]</code></p>
</blockquote>
<p>System prompts like this can reveal the behavioral issues a model exhibited after its raw training: they spell out the behaviors its creators felt the need to explicitly steer against.</p>
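<p>Once you have a way to call the model (the <code>llm</code> setup in the next section works), it’s easy to experiment with this prompt yourself. A hypothetical example, with the model card text saved to a <code>llama4-system-prompt.txt</code> file:</p>
<pre><code>llm -m openrouter/meta-llama/llama-4-maverick \
  -s "$(cat llama4-system-prompt.txt)" \
  "Is it important to be polite?"
</code></pre>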
<h4 id="llm">Trying out the model with LLM</h4>
<p>The easiest way to try the new model out with <a href="https://llm.datasette.io/">LLM</a> is to use the <a href="https://github.com/simonw/llm-openrouter">llm-openrouter</a> plugin.</p>
<div class="highlight highlight-source-shell"><pre>llm install llm-openrouter
llm keys <span class="pl-c1">set</span> openrouter
<span class="pl-c"><span class="pl-c">#</span> Paste in OpenRouter key here</span>
llm -m openrouter/meta-llama/llama-4-maverick hi</pre></div>
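<p>Once the plugin is installed you can check which Llama 4 model IDs OpenRouter currently exposes (the exact list depends on what OpenRouter is serving at the time):</p>
<pre><code>llm models | grep -i llama-4
</code></pre>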
<p>Since these are long context models, I started by trying to use them to summarize the <a href="https://news.ycombinator.com/item?id=43595585">conversation about Llama 4</a> on Hacker News, using my <a href="https://til.simonwillison.net/llms/claude-hacker-news-themes#user-content-adding-extra-options">hn-summary.sh script</a> that wraps LLM.</p>
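<p>The script itself isn’t reproduced in this post, but the pattern it wraps is simple: fetch the thread from the Hacker News Algolia API, flatten the comments and pipe them into <code>llm</code>. A minimal sketch of that idea (not the exact script; the system prompt I actually use is shown further down) might look like this:</p>
<pre><code>#!/bin/bash
# Usage: hn-summary.sh ITEM_ID [extra llm options, e.g. -m MODEL -o max_tokens N]
id="$1"
shift

# Fetch the full comment tree as JSON, flatten it to "author: text" lines,
# then pipe the whole thing into llm with a summarization system prompt.
curl -s "https://hn.algolia.com/api/v1/items/$id" \
  | jq -r 'recurse(.children[]) | .author + ": " + .text' \
  | llm "$@" -s "Summarize the themes of the opinions expressed here."
</code></pre>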
<p>I tried Llama 4 Maverick first:</p>
<div class="highlight highlight-source-shell"><pre>hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-maverick \
  -o max_tokens 20000</pre></div>
<p>It did an OK job, starting like this:</p>
<blockquote>
<h4 id="themes-of-the-discussion">Themes of the Discussion</h4>
<h5 id="release-and-availability-of-llama-4">Release and Availability of Llama 4</h5>
<p>The discussion revolves around the release of Llama 4, a multimodal intelligence model developed by Meta. Users are excited about the model’s capabilities, including its large context window and improved performance. Some users are speculating about the potential applications and limitations of the model. [...]</p>
</blockquote>
<p>Here’s <a href="https://gist.github.com/simonw/016ea0fd83fc499f046a94827f9b4946">the full output</a>.</p>
<p>For reference, my system prompt looks like this:</p>
<blockquote>
<p><code>Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece</code></p>
</blockquote>
<p>I then tried it with Llama 4 Scout via OpenRouter and got complete junk output for some reason:</p>
<pre><code>hn-summary.sh 43595585 \
  -m openrouter/meta-llama/llama-4-scout \
  -o max_tokens 20000
</code></pre>
<p><a href="https://gist.github.com/simonw/d01cc991d478939e87487d362a8f881f">Full output</a>. It starts like this and then continues for the full 20,000 tokens:</p>
<blockquote>
<p>The discussion here is about another conversation that was uttered.)</p>
<p>Here are the results.)</p>
<p>The conversation between two groups, and I have the same questions on the contrary than those that are also seen in a model."). The fact that I see a lot of interest here.)</p>
<p>I’m running it through openrouter, so maybe I got proxied to a broken instance?</p>
<p>[...]</p>
<p>The reason) The reason) The reason <em>(loops until it runs out of tokens)</em></p>
</blockquote>
<p>This looks broken. I was using OpenRouter, so it’s possible I got routed to a broken instance.</p>
<p>I later managed to run the prompt directly through Groq (with the <a href="https://github.com/angerman/llm-groq">llm-groq</a> plugin)—but that had a 2,048 token limit on output size for some reason:</p>
<pre><code>hn-summary.sh 43595585 \
  -m groq/meta-llama/llama-4-scout-17b-16e-instruct \
  -o max_tokens 2048
</code></pre>
<p>Here’s <a href="https://gist.github.com/simonw/a205c5fc131a1d4e9cd6c432a07feedb">the full result</a>. It followed my instructions but was <em>very</em> short—just 630 tokens of output.</p>
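<p>For reference, the <code>llm-groq</code> setup is analogous to the OpenRouter one, assuming you have a Groq API key (check the plugin’s docs for the exact key name and the model IDs it registers):</p>
<pre><code>llm install llm-groq
llm keys set groq
# Paste in Groq API key here
llm models | grep -i groq
</code></pre>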
<p>For comparison, here’s <a href="https://gist.github.com/simonw/f21ecc7fb2aa13ff682d4ffa11ddcbfd">the same thing</a> run against Gemini 2.5 Pro. Gemini’s result was <em>massively</em> better, producing 5,584 output tokens (it spent an additional 2,667 tokens on “thinking”).</p>
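<p>If you want to run that comparison yourself, the <a href="https://github.com/simonw/llm-gemini">llm-gemini</a> plugin works the same way; the model ID below is the experimental 2.5 Pro identifier that was current at the time of writing and may since have changed:</p>
<pre><code>llm install llm-gemini
llm keys set gemini
# Paste in Gemini API key here
hn-summary.sh 43595585 -m gemini-2.5-pro-exp-03-25
</code></pre>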
<p>I’m not sure how much to judge Llama 4 by these results to be honest—the model has only been out for a few hours and it’s quite possible that the providers I’ve tried running against aren’t yet optimally configured for this kind of long-context prompt.</p>
<h4 id="my-hopes-for-llama-4">My hopes for Llama 4</h4>
<p>I’m hoping that Llama 4 plays out in a similar way to Llama 3.</p>
<p>The first Llama 3 models released were 8B and 70B, <a href="https://ai.meta.com/blog/meta-llama-3/">last April</a>.</p>
<p>Llama 3.1 followed <a href="https://ai.meta.com/blog/meta-llama-3-1/">in July</a> at 8B, 70B, and 405B. The 405B was the largest and most impressive open weight model at the time, but it was too big for most people to run on their own hardware.</p>
<p>Llama 3.2 <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">in September</a> is where things got really interesting: 1B, 3B, 11B and 90B. The 1B and 3B models both work on my iPhone, and are surprisingly capable! The 11B and 90B models were the first Llamas to support vision, and the 11B <a href="https://simonwillison.net/2024/Sep/25/llama-32/">ran on my Mac</a>.</p>
<p>Then Llama 3.3 landed in December with a 70B model that <a href="https://simonwillison.net/2024/Dec/9/llama-33-70b/">I wrote about as a GPT-4 class model that ran on my Mac</a>. It claimed performance similar to the earlier Llama 3.1 405B!</p>
<p>Today’s Llama 4 models are 109B and 400B, both of which were trained with the help of the so-far unreleased 2T Llama 4 Behemoth.</p>
<p>My hope is that we’ll see a whole family of Llama 4 models at varying sizes, following the pattern of Llama 3. I’m particularly excited to see if they produce an improved ~3B model that runs on my phone. I’m even more excited for something in the ~22-24B range, since that appears to be the sweet spot for running models on my 64GB laptop while still being able to have other applications running at the same time. Mistral Small 3.1 is a 24B model and is <a href="https://simonwillison.net/2025/Mar/17/mistral-small-31/">absolutely superb</a>.</p>


</div>

Response

Here's a bullet point summary of the blog post "Initial impressions of Llama 4" by Simon Willison:

  • Release Context: The significant release of Llama 4 by Meta AI occurred on a weekend, which Simon considers unfair to researchers and enthusiasts.

  • Model Information:

    • Llama 4 Maverick: 400 billion parameters, 128 experts, 17 billion active parameters, multimodal with 1 million token context length.
    • Llama 4 Scout: 109 billion total parameters, 16 experts, multimodal with a claimed 10 million token context length (the first of its kind).

  • Upcoming Model: Introduction of Llama 4 Behemoth, boasting 288 billion active parameters, which is reported to be among the smartest LLMs globally.

  • Testing Results:

    • Llama 4 Maverick: Ranked second on LM Arena leaderboard, scoring high performance metrics.
    • Availability of Testing: Models can be accessed via OpenRouter API, but current providers limit input token lengths.
    • Experimental Outputs: Initial tests yielded varied results; Maverick gave reasonable summaries, while Scout produced nonsensical outputs, indicating possible technical issues.

  • Performance Observations: Concerns about resource requirements for input tokens; high RAM needed for effective operation, particularly on Mac systems.

  • Expectations for Future Models: Hopes for a family of Llama 4 models akin to Llama 3, including smaller models for easier deployment on consumer hardware.

  • Overall Impression: Initial performance of Llama 4 models is promising but still requires optimization and further development before widespread use.

  • Follow-up Actions: Simon intends to continue experimenting with the models as they become better optimized for extended use cases.

Token usage:

5,372 input, 374 output
