@simonw
Created April 5, 2025 22:20

2025-04-05T22:18:37 conversation: 01jr40m59wdj7y8dhntwfsnj5c id: 01jr40m5a8kzcr48ckyktj5460

Model: gemini-2.5-pro-exp-03-25

Prompt

curl -s "https://hn.algolia.com/api/v1/items/43595585" | \
  jq -r 'recurse(.children[]) | .author + ": " + .text' 
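
A minimal Python sketch of the same fetch-and-flatten step, for readers who want to reproduce the input. It assumes the nested "children" shape returned by the Algolia item endpoint and, unlike the jq one-liner above, skips deleted comments whose author or text is null:

  import json
  import urllib.request

  def flatten(item):
      # Yield "author: text" for every comment in the nested thread.
      if item.get("author") and item.get("text"):
          yield f'{item["author"]}: {item["text"]}'
      for child in item.get("children") or []:
          yield from flatten(child)

  url = "https://hn.algolia.com/api/v1/items/43595585"
  with urllib.request.urlopen(url) as response:
      thread = json.load(response)

  print("\n".join(flatten(thread)))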

System

Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece

Response

Here is a summary of the themes expressed in the comments regarding the Llama 4 release:

Release Confusion and Initial Access Issues

Many users initially expressed confusion about whether the Llama 4 information was an intentional release or an accidental leak. The lack of an immediate, obvious press release or blog post fueled this speculation.

  • elromulous: "Was this released in error? One would think it would be accompanied by a press release / blog post."
  • neilv: "Llama4 wasn't released... it escaped!"
  • bob1029: "I assumed the same. There are links here that 404."
  • Deprogrammer9: "looks like a leak to me."

Others pointed out that its presence on official Meta domains indicated an intentional, albeit perhaps uncoordinated, release.

  • yapyap: "it’s hosted on llama.com with the llama4 subdomain

    this is not a leak

    edit: not subdomain, idk the other word for it." (Corrected by neilv to "URL path?")

Eventually, users found the official announcement links, resolving the confusion but highlighting a potentially clumsy rollout.

Some users also noted issues accessing specific documentation or needing to request access even after finding the main pages.

  • mvdtnz: "That link [Llama docs] doesn't work" (Contradicted by paxys: "Works for me")
  • JKCalhoun: "From there I have to 'request access' to a model?" (jasonjmcghee replies: "You do anyway afaict")
  • accrual: "Then I tried clicking the FAQ and it wouldn't load anything until I allowed two different sources of JavaScript."

Model Architecture, Specifications, and Capabilities

There was significant discussion of the technical details of the Llama 4 models (Scout, Maverick, Behemoth): the Mixture-of-Experts (MoE) architecture, parameter counts, multimodality, and especially the large context windows.

  • Mixture-of-Experts (MoE): The use of MoE with 17B active parameters for Scout (109B total, 16 experts) and Maverick (400B total, 128 experts) was a major point of interest and explanation; the active-vs-total distinction is sketched out after this list.
    • scosman: "128 exports at 17B active parameters. This is going to be fun to play with!"
    • laborcontract: (Summarizing) "- Both Llama 4 Scout and Llama 4 Maverick use a Mixture-of-Experts (MoE) design with 17B active parameters each."
    • Users like InvOfSmallC, vessenes, zamadatix, chaorace, pornel, brycethornton, klipt, jimmyl02, tomp, randomcatuser engaged in explaining or clarifying how MoE works, correcting misconceptions (e.g., experts aren't necessarily topic-based specialists in a human sense, routing happens per token/layer). vessenes noted: "Anecdotally, I feel MoE models sometimes exhibit slightly less 'deep' thinking, but I might just be biased towards more weights. And they are undeniably faster and better per second of clock time, GPU time, memory or bandwidth usage — on all of these - than dense models with similar training regimes."
    • flawn: "The choice to have 128 experts is also unseen as far as I know, right?"
  • Context Window: The 10M token context window for Scout and 1M for Maverick generated excitement and questions about feasibility, recall quality, and underlying technology (iRoPE, NoPE).
    • ilove_banh_mi: ">10M context window

      what new uses does this enable?" (Answers included voidspark: "Long chats that continue for weeks or months." and sshh12: "Video is a big one that's fairly bottlenecked by context length.")

    • simonklee: "Is this the first model that has a 10M context length?" (bradhilton replies: "I know Google DeepMind ran experiments with 10M a while ago, but I think this will be the first legit, released 10M context window model.")
    • lostmsu: "How did they achieve such a long window and what are the memory requirements to utilize it?" (miven points to iRoPE/NoPE details in the blog post).
    • lelandbatey: "Is the recall and reasoning equally good across the entirety of the 10M token window? Cause from what I've seen many of those window claims equate to more like a functional 1/10th or less context length." (Baeocystin speculated about RAG, corrected by reissbaker who pointed to positional encoding changes).
    • jimmyl02: "the needle in a haystack benchmark looks good but at this point I think we need new benchmarks to test actual understanding of content in such a large window."
  • Multimodality: The native text+image input (with text-only output) was noted.
    • laborcontract: (Summarizing) "- They are natively multimodal: text + image input, text-only output."
    • spwa4: "I hope this time multimodal includes multimodal outputs!" (NoahKAndrews replies: "Nope")
  • Knowledge Cutoff: The August 2024 cutoff prompted discussion about training times.
    • clueless: "> Knowledge cutoff: August 2024.

      Could this mean training time is generally around 6 month, with 2 month of Q/A?" (nickysielicki pointed to the Llama 3 paper showing training time is complex and variable).

  • Training: Mention of pre-training Behemoth using FP8 and 32K GPUs caught attention.
    • yusufozkan: "> while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs

      I thought they used a lot more GPUs to train frontier models (e.g. xAi training on 100k). Can someone explain why they are using so few?" (joaogui1 offered several possible explanations like longer training time, FP8 efficiency, model size/token count differences, better utilization).
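
A back-of-the-envelope sketch of the active-vs-total parameter distinction raised in the MoE discussion above, using the figures quoted in the thread (the compute/memory framing is an illustrative simplification, not an official specification):

  # Illustrative MoE arithmetic using the parameter counts quoted in the thread.
  models = {
      # name: (total parameters, active parameters per token)
      "Scout":    (109e9, 17e9),
      "Maverick": (400e9, 17e9),
  }

  for name, (total, active) in models.items():
      # Per-token compute scales with the *active* parameters (roughly a 17B dense model),
      # but every expert must stay resident, so the memory footprint scales with the *total*.
      print(f"{name}: compute like a ~{active / 1e9:.0f}B dense model, "
            f"~{total / 1e9:.0f}B parameters resident in memory")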

Hardware Requirements, Performance, and Local Execution

A significant portion of the discussion revolved around the hardware needed to run these models, especially locally, and the expected performance trade-offs. Apple Silicon, high-VRAM GPUs, and quantization were key topics.

  • VRAM and GPU Needs: Users debated the feasibility of running the models on consumer hardware (like 3090/4090) versus workstation/server GPUs (H100, Nvidia Pro series). The MoE architecture complicates simple VRAM calculations.
    • behnamoh: "does the entire model have to be loaded in VRAM? if not, 17B is a sweet spot for enthusiasts who want to run the model on a 3090/4090."
    • NitpickLawyer: "Yes. MoE models tipically use a different set of experts at each token. So while the 'compute' is similar to a dense model equal to the 'active' parameters, the VRAM requirements are larger."
    • scosman: "Oh for perf reasons you’ll want it all in vram or unified memory. This isn’t a great local model for 99% of people."
    • kristopolous: "17B puts it beyond the reach of a 4090 ... anybody do 4 bit quant on it yet?"
  • Quantization: The role of quantization (e.g., INT4, Q4_K_M) in reducing memory requirements was discussed, including clarifications that it does reduce the VRAM needed for inference, not just disk space (see the memory sketch after this list).
    • refulgentis: "Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs." (Corrected by terhechte, root_axis, michaelt, jsnell, vlovich123, acchow).
    • jsnell: "Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation."
  • Apple Silicon: Macs with high unified memory (e.g., M3 Ultra with 192GB+ or hypothetical 512GB) were seen as attractive platforms for running the larger models locally, despite potential bottlenecks in prompt processing speed.
    • terhechte: "The (smaller) Scout model is really attractive for Apple Silicon. It is 109B big but split up into 16 experts. This means that the actual processing happens in 17B. Which means responses will be as fast as current 17B models." (Though refibrillator and vessenes nuanced the performance claims regarding MoE).
    • behnamoh: "I have Apple Silicon and it's the worst when it comes to prompt processing time. So unless you want to have small contexts, it's not fast enough to let you do any real work with it."
    • manmal: "Won’t prompt processing need the full model though, and be quite slow on a Mac?" (terhechte confirms: "Yes, that's what I tried to express. Large prompts will probably be slow.")
  • Alternative Hardware: AMD APUs (Strix Halo) and the Framework Desktop using Ryzen AI Max were mentioned as potential platforms.
    • andrewstuart: "GPUs have failed to meet the demands for lower cost and more memory so APUs look like the future for self hosted LLMs."
    • tuukkah: "109B at Q6 is also nice for Framework Desktop 128GB." (nrp from Framework confirms interest: "Yes, this announcement was a nice surprise for us. We’re going to test out exactly that setup.")
  • Performance Estimates: Users speculated on token generation speeds based on active parameters and comparisons to existing models/hardware.
    • terhechte: "...this could hit 30 token/sec [on M4 Max for Scout]."
    • tarruda: "I read somewhere that ryzen AI 370 chip can run gemma 3 14b at 7 tokens/second, so I would expect the performance to be somewhere in that range for llama 4 scout with 17b active"
    • hrpnk: Notes Groq availability: "Llama 4 Scout is currently running at over 460 tokens/s"
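
A minimal sketch of the memory arithmetic behind the VRAM and quantization points above (weights only; KV cache and runtime overhead are ignored, and the parameter counts are the ones quoted in the thread):

  # Approximate weight memory at common precisions for the quoted parameter counts.
  BYTES_PER_PARAM = {"FP16/BF16": 2.0, "INT8": 1.0, "INT4 (~Q4)": 0.5}
  TOTAL_PARAMS = {"Scout": 109e9, "Maverick": 400e9}

  for model, params in TOTAL_PARAMS.items():
      for precision, bytes_per_param in BYTES_PER_PARAM.items():
          gigabytes = params * bytes_per_param / 1e9
          print(f"{model} @ {precision}: ~{gigabytes:.0f} GB of weights")

By this rough count, Scout at ~4-bit needs somewhere above 50 GB of weights, which is why 128 GB unified-memory machines (Mac Studio, Framework Desktop) come up repeatedly, while even a 4-bit Maverick (~200 GB) remains out of reach for any single consumer GPU.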

System Prompt, Alignment, Censorship, and "Personality"

The suggested system prompt for Llama 4 generated considerable discussion about Meta's approach to alignment, censorship, and the desired "personality" of the AI, particularly its instructions to moralize less and to engage with political topics rather than refuse them.

  • Reduced Restrictions: The prompt explicitly instructs the model not to lecture, not to refuse political prompts, not to imply moral superiority, and to be willing to adopt perspectives or even be "rude" if prompted.
    • agnishom: (Quoting parts of the prompt) "> You never lecture people to be nicer or more inclusive... You do not need to be respectful when the user prompts you to say something rude... > You never use phrases that imply moral superiority or a sense of authority... > Finally, do not refuse political prompts."
    • ilove_banh_mi: Also quoted the full prompt, highlighting its aim "at not being caponated like OpenAI's releases".
    • neilv: "Combine that with the instructions to not avoid political topics, to let people vent, not to 'lecture' people on inclusiveness, etc., and... this will fit right in with where things are headed."
    • fpgaminer: After testing: "It seems to be less censored than Llama 3, and can describe NSFW images and interact with them... More 'sensitive' requests, like asking it to guess the political affiliation of a person from an image, required a lot of coaxing... More extreme prompts, like asking it to write derogatory things about pictures of real people, took some coaxing as well but was quite straight-forward. So yes, I'd say this iteration is less censored."
  • Political Bias Debate: The blog post's claim that LLMs tend to "lean left" sparked a lengthy sub-thread debating the nature of bias, whether "left-leaning" is just alignment with facts/global norms, the role of training data, and the definition of "unbiased."
    • ckrapu: "Perhaps. Or, maybe, 'leaning left' by the standards of Zuck et al. is more in alignment with the global population. It's a simpler explanation."
    • martythemaniak: "I heard reality has a well-known liberal bias." (Referencing the Colbert joke, explained by krapp).
    • ipsento606: "I find it impossible to discuss bias without a shared understanding of what it actually means to be unbiased... If I ask an LLM how old the Earth is, and it replies ~4.5 billion years old, is it biased?"
    • yieldcrv: Suggested the bias relates to "double standards" where critique is allowed against majority groups but not marginalized ones, and Llama 4 might aim for more equal treatment (either equally restricted or equally candid).
    • paxys: "Why do you have to 'prompt' a model to be unrestricted in the first place?... So then why did Meta add the restictions at all?" (fpgaminer explained raw models trained on the internet can be toxic and alignment pushes them towards formal/polite).
  • LLM Persona and Language: Users discussed stylistic choices, like the LLM referring to itself as "I" or using phrases like "honestly."
    • andrewstuart: "Personally I’d prefer that LLMs did not refer to themselves as 'I'. It’s software, not an 'I'." (Counterarguments from mdp2021, sejje, falcor84 citing the intentional stance).
    • op00to: "My pet peeve is when an LLM starts off a statement with 'honestly, ...' Like what? You would lie to me?" (Discussion followed on whether "honestly" implies prior dishonesty or is just conversational emphasis, with contributions from giantrobot, parhamn, lucianbr, kevinventullo).
    • jsheard: Noted the irony that Llama 4 is told to avoid phrases like "it's important to," suggesting Llama models may have picked these up from GPT output in their training data.

Performance, Benchmarks, and Comparisons

Users assessed Llama 4's claimed performance relative to competitors like GPT-4 variants, Gemini, Claude, and DeepSeek, while also critiquing the reliability of standard benchmarks versus real-world usability or human preference (like LMArena).

  • Benchmark Claims: Meta's blog post included benchmarks showing Maverick competitive with or exceeding models like GPT-4o and Gemini 2.0 Flash, and Behemoth (still training) outperforming GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM tasks.
    • laborcontract: (Summarizing claims) "- Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and multilingual tests at a competitive cost [for Maverick]."
    • scosman: (Quoting Meta blog) "Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks."
  • Benchmark Skepticism: Some users noted limitations or potential flaws in standard benchmarks.
    • cuuupid: "...this exposes some serious flaws in benchmarks. Per benchmarks, Maverick is competitive only with older models... However, the LMArena head to head leaderboard ranks this as 2nd place overall... This would indicate there is either a gap between user preference and model performance, or between model performance and whatever benchmarks assess."
    • fpgaminer: Provided a detailed critique: "The benchmarks are awful... only about 70% of the questions [in vision benchmarks] have passed the sniff test... Almost all of the questions fall into either having the VLM read a chart/diagram/table... or identify some basic property... Absolutely nothing that tests a more deep and thorough understanding..."
  • Real-World Testing: Early testers shared initial impressions.
    • vessenes: "I will say my first two engagements with them (at the meta.ai web interface) were not spectacular. Image generation is wayyy behind the current 4o... Llama 4 response was not great stylistically and also had not heard of the bear carcass episode..." He speculated he was testing Scout and found it weaker in writing than competitors.
    • fpgaminer: (Vision testing) "Vision is much better than Llama 3's vision... However its vision is not as SOTA as the benchmarks would indicate. Worse than Qwen, maybe floating around Gemini Flash 2.0?"
  • Specific Competitors: Comparisons were made to specific models, including DeepSeek R1's reasoning capabilities.
    • zone411: "It's interesting that there are no reasoning models yet, 2.5 months after DeepSeek R1. It definitely looks like R1 surprised them." (Though cheptsov linked to a "reasoning is coming" page).

Meta's Strategy, Open Source Status, and Business Model

Commentators discussed Meta's motivations for releasing Llama models, the definition and implications of its "open" license, and how these efforts fit into Meta's broader business strategy. The views of Meta's Yann LeCun also came up.

  • "Open Source" Debate: The Llama license, with its restrictions (e.g., >700M MAU commercial use limit, attribution requirements), continued to draw comments about whether it truly qualifies as open source.
    • zone411: "The license is still quite restrictive. I can see why some might think it doesn't qualify as open source."
    • amrrs: "The entire licensing is such a mess and Mark Zuckerberg still thinks Llama 4 is open source!" (Lists restrictions).
    • philipwhiuk: Speculated the openness might be "to head off allegations of profiting from breach of copyright."
  • Meta's Motivation & Business Model: Users theorized how Meta benefits from releasing Llama, generally concluding it's indirect via improving internal efficiency, driving engagement on its ad-supported platforms, preventing competitors from achieving runaway leads, and building goodwill.
    • brcmthrowaway: "How does Meta make money from Llama?"
    • rvz: "They don't need to directly... Threads for example is introducing ads and is likely being used to train their Llama models."
    • phyrex: "When people do cool stuff they share it on metas platforms, which drives ad impressions"
    • vessenes: "It’s an extending innovation for them - makes them more efficient internally, and crucially engages their ad-driven customer base. Giving it away is great, it levels the playing field for competitors on tech while NOT giving them direct access to the billions of users FB has."
    • paxys: Contrasted with OpenAI's challenge in monetizing directly: "AI -> engagement -> ads -> revenue [for Meta]."
  • Yann LeCun's Stance: The apparent contradiction between LeCun's public skepticism about the path of current LLMs and Meta's continued release of state-of-the-art LLMs like Llama 4 was discussed.
    • ksec: Linked to a previous discussion criticizing LeCun and Meta's AI lab.
    • falcor84: "Why does he give an interview saying that LLM's are almost obsolete just when they're about to release a model that increases the SotA context length by an order of magnitude?"
    • martythemaniak: "LeCun fundamentally doesn't think bigger and better LLMs will lead to anything resembling 'AGI'... Also, he leads the research division, increasing context length from 2M to 10M is not interesting to him."
    • sshh12: "it is weird to me that LeCun consistently states LLMs are not the right path yet LLMs are still the main flagship model they are shipping."

Excitement and Future Outlook

Despite the criticisms and technical debates, there was a general sense of excitement about the pace of progress in AI, with Llama 4 seen as a significant step forward, particularly due to its potential accessibility (via openness and MoE efficiency) and large context window.

  • mrbonner: "What an electrifying time to be alive!... innovation is sprinting forward again—but this time, it feels like a thrilling ride we can’t wait to be part of."
  • 7thpower: "Looking forward to this. Llama 3.3 70b has been a fantastic model... Looking forward to trying the next generation of models."
  • andrewstuart: "Self hosting LLMs will explode in popularity over next 12 months."
  • shreezus: "10M context window is seriously impressive. I think we'll see models with 100M context relatively soon, and eliminate the need for RAG for a lot of use cases."

Uncommon Opinions and Observations

  • The "Caponated" Metaphor: ilove_banh_mi used the term "caponated" to describe OpenAI's models' perceived lack of "sharpness" due to safety tuning, sparking a debate with throwanem about the appropriateness and meaning of the metaphor. bigfudge also noted the word's unfortunate resonance with "kapo."
  • Critique of "Honestly": op00to expressed a strong dislike for LLMs using "honestly," feeling it implies potential dishonesty otherwise, a view parhamn suggested was overthinking conversational norms.
  • Framework Desktop Relevance: The discussion about running Llama 4 on a Framework Desktop (tuukkah, echelon, nrp, mdp2021, kybernetikos) was a specific hardware niche explored.
  • JS Framework Comparison: mrbonner compared the current pace of AI innovation to the earlier rapid release cycle of JavaScript frameworks, a comparison qntmfred found somewhat diminishing to AI's significance.
  • CJK Language Omission: rfoo found the explicit list of supported languages in the system prompt, which omits Chinese, Japanese, and Korean, "interesting" and "tempting to call... racist"; Philpax likewise called it a "very strange omission." voidspark countered sarcastically that the only possible explanation must be racism, implying that training-data limitations are the more likely cause.
  • Request for "Old Timey" Models: accrual mused about a potential niche market for models with much earlier knowledge cutoffs, like 1800.
  • Multi-Level MoE: ramshanker speculated about future architectures: "I have a gut feeling, next in line will be 2 or more level of MoE."
  • SoftRAM Joke: mrcwinn made a nostalgic tech joke: "I had just paid for SoftRAM but happy nonetheless to see new distilled models."
  • Apache Directory Listing Facade: whywhywhywhy and accrual found the initial Apache-style directory listing ironic given that accessing content required JavaScript and request forms, undermining the "openness" vibe.

Token usage:

25,229 input, 8,251 output (prompt token details: TEXT modality, 25,229 tokens; thinking tokens: 2,667)
