Okay, here is a summary of the themes expressed in the provided comments, following the specified format.

Marketing Hype and Announcement Fatigue

Many users expressed cynicism and fatigue at the repetitive, superlative claims in AI model announcements. There's a sense that every new model is predictably heralded as the "most" or "best" in some dimension, which diminishes the impact of such claims. Commenters compared this pattern to other tech industries, such as smartphone launches.

  • "Isn't every new AI model the 'most '?" (jasonpeacock)
  • "Nobody is going to say 'Announcing Foobar 7.1 - not our best!'" (jasonpeacock)
  • "Same with new phones. The new phone is always the fastest cpu, gpu, and best camera ever!" (SirMaster)
  • "These announcements have started to look like a template. - Our state-of-the-art model. - Benchmarks comparing to X,Y,Z. - 'Better' reasoning. It might be an excellent model, but reading the exact text repeatedly is taking the excitement away." (Oras)
  • "I’m sure the AI helps write the announcements." (Mistletoe), followed by speculation that the models write their own announcements based on previous ones (ototot, cpeterso, xlbuttplug2, belter).
  • "Reminds me of how nobody is too excited about flagship mobile launches anymore. Most flagships for sometime now are just incremental updates over previous gen and only marginally better... It's interesting how the recent AI announcements are following the same trend over a smaller timeframe." (devsda)
  • "hi, here is our new AI model, it performs task A x% better than our competitor 1, task B y% better than our competitor 2 seems to be the new hot AI template in town" (vivzkestrel)

However, some argue that the specific adjective used in the announcement still holds important information about the model's focus (e.g., cost, speed, intelligence).

  • "You're not wrong, but that just means the is where the bulk of information resides. The trade-off matters. Maybe it's a model with good enough quality but really cheap to serve." (smilekzs)
  • "Sure but that adjective matters. Could be cheapest, 'intelligent', fastest, etc... it's rarely all three of them." (Maxatar)

There was also a specific discussion comparing the marketing approach to OpenAI's GPT-4.5 release, which was perceived negatively for different reasons.

  • "GPT-4.5's announcement was the equivalent of that. 'It beats all the benchmarks...but you really really don't want to use it.'" (minimaxir)
  • "They even priced it so people would avoid using it. GPT-4.5's entire function was to be the anchor of keeping OpenAI in the news, to keep up the perception of releasing quickly." (forbiddenvoid)
  • "Well hey, OpenAI did the exact opposite, and nobody liked that either... To clarify, by 'doing the opposite' I mean OpenAI releasing GPT-4.5, a non-reasoning model that does worse on benchmarks (but supposed to be qualitatively better). People shit on OpenAI hard for doing that." (andai)

Benchmarking Practices and Skepticism

Significant discussion revolved around the use and validity of benchmarks in evaluating AI models. There's considerable skepticism about how benchmarks are chosen, reported, and whether they reflect real-world performance or simply optimized metrics.

  • "The benchmark numbers don't really mean anything -- Google says that Gemini 2.5 Pro has an AIME score of 86.7 which beats o3-mini's score of 86.5, but OpenAI's announcement post [1] said that o3-mini-high has a score of 87.3 which Gemini 2.5 would lose to. The chart says 'All numbers are sourced from providers' self-reported numbers'..." (kmod)
  • "You just have to use the models yourself and see." (kmod)
  • Concerns were raised about Google comparing Gemini 2.5 Pro to OpenAI's o3-mini rather than the potentially more capable o1: "Interesting that they use o3-mini as the comparison rather than o1... Are there benchmarks showing o3-mini performing better than o1?" (cj)
  • "The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out... but I think for anyone paying attention that decision will backfire." (FloorEgg)
  • Some defended the comparison choice based on expected similarity in size, parameters, or pricing: "Probably because It is more similar to o3 in terms of size/parameters as well as price" (jnd0); "It's a reasonable comparison given it'll likely be priced similarly to o3-mini." (logicchains)
  • Others noted standard practice: "Why would you compare against all the models from a competitor. You take their latest one that you can test. Openai or anthropoc don’t compare against the whole gemini family." (PunchTornado)
  • Questions arose about the nature of specific benchmarks: "Keen to hear more about this benchmark [MRCR]. Is it representative of chat-to-document style usecases with big docs?" (alexdzm), with sebzim4500 providing links and context, noting it's "less artificial than most long context benchmarks" but perhaps less representative than Fiction.LiveBench.
  • There's a desire for benchmarks to measure something meaningful: "Man, I hope those benchmarks actually measure something." (flir)
  • An acknowledgment of benchmark limitations: "I would say they are a fairly good measure of how well the model has integrated information from pretraining. They are not so good at measuring reasoning, out-of-domain performance, or creativity." (Legend2440)
  • The inclusion of the Aider Polyglot benchmark was seen as significant: "Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!" (afro88)
  • A need for benchmarks beyond raw capability scores: "More generally, it woud be nice for these kinds of releases to also add speed and context window as a separate benchmark. Or somehow include it in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful." (throwaway13337)

Model Naming Conventions and Versioning

The use of ".5" versioning (like Gemini 2.5) sparked debate about its meaning and appropriateness compared to semantic versioning (semver) or other schemes.

  • "I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?" (vineyardmike)
  • "I’ve always found the use of the *.5 naming kinda silly when it became a thing... It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme." (vineyardmike)
  • Preference for other systems: "I’m a much bigger fan of semver (not skipping to .5 though), date based ('Gemini Pro 2025'), or number + meaningful letter (eg 4o - 'Omni') for model names." (vineyardmike)
  • Justification for ".5": "I would consider this a case of 'expectation management'-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0." (forbiddenvoid); "I think it's reasonable. The development process is just not really comparable to other software engineering... So you do the training, and then you assign the increment to align the two." (jstummbillig); "At least for OpenAI, a .5 increment indicates a 10x increase in training compute. This so far seems to track for 3.5, 4, 4.5." (Workaccount2); "I think it's because of the big jump in coding benchmarks. 74% on aider is just much, much better than before and worthy of a .5 upgrade." (aoeusnth1)
  • Challenges with applying semver: "Regarding semantic versioning: what would constitute a breaking change?" (laurentlb); "As I see it, if it uses a similar training approach and is expected to be better in every regard, then it's a minor release. Whereas when they have a new approach and where there might be some tradeoffs (e.g. longer runtime), it should be a major change." (falcor84)
  • Humorous alternative suggestions: "Or drop the pretext of version numbers entirely since they're meaningless here and go back to classics like Gemini Experience, Gemini: Millennium Edition or Gemini New Technology" (morkalork)

Model Comparisons and Competition

Users frequently compared Gemini 2.5 Pro to models from OpenAI (GPT-4.5, o1, o3-mini, GPT-4o) and Anthropic (Claude 3.7 Sonnet), as well as other Gemini versions (2.0 Flash, 1.5 Pro). There's a strong sense of rapid, iterative competition.

  • Direct benchmark comparisons (AIME, Aider, Long Context, NYT Connections) appeared throughout the thread.
  • Anecdotal comparisons favouring o1 over o3-mini: "In my experience o3-mini is much worse than o1." (kmod); "I find o1 to be strictly better than o3-mini, but still use o3-mini for the majority of my agentic workflow because o1 is so much more expensive." (logicchains); "I have used both o1 and o3 mini extensively... o1 solves one of my hardest prompts quite reliably but o3 is very inconsistent. So from my anecdotal experience o1 is a superior model in terms of capability." (FloorEgg)
  • Specific competitive points: "The Long Context benchmark numbers seem super impressive. 91% vs 49% for GPT 4.5 at 128k context length." (M4v3R); "A model that is better on Aider than Sonnet 3.7?" (d3nj4l)
  • General sense of competition: "I take this as a good thing, because they're beating each other every few weeks and using benchmarks as evidence." (diego_sandoval)
  • Some see Google regaining ground: "Looks like they're gradually removing guardrails, it returns Nixon for me." (summerlight); "This looks like the first model where Google seriously comes back into the frontier competition?" (summerlight); "Google is back in the game big time." (freediver)
  • Others remain loyal to competitors: "Claude is still the king right now for me. Grok is 2nd in line, but sometimes it's better." (honeybadger1); "Most of us care only about coding performance, and Sonnet 3.5 has been such a giant winner that we don't get too excited about the latest model from Google." (bklyn11201)

Pricing, Availability, and Experimental Status

Confusion and discussion surrounded the model's availability, pricing structure, rate limits, and the implications of its "experimental" status.

  • Initial availability: "Available as experimental and for free right now in Google AI Studio + API, with pricing coming very soon!" (altbdoor quoting a Twitter post)
  • Free but limited: "Currently free, but only 50 requests/day." (xnx); "With a rate limit of 50 requests per day" (istjohn)
  • Anticipation and concern about future pricing: "I wish they’d mention pricing - it’s hard to seriously benchmark models when you have no idea what putting it in production would actually cost." (serjester); "I expect this might be pricier. Hoping not unusable level expensive." (KoolKat23)
  • Frustration with existing rate limits and release cadence: "The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out." (ekojs); "Weird, they released Gemini 2.5 but I still can't use 2.0 pro with a reasonable rate limit (5 RPM currently)." (daquisu)
  • Confusion about "experimental" status: "It's 'experimental', which means that it is not fully released. In particular, the 'experimental' tag means that it is subject to a different privacy policy and that they reserve the right to train on your prompts." (kmod); "2.0 Pro is also still 'experimental' so I agree with GP that it's pretty odd that they are 'releasing' the next version despite never having gotten to fully releasing the previous version." (kmod)
  • Difference between AI Studio/API and consumer product rollout: "I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp." (dcchambers) countered by "I see it in both, probably just some gradual rollout delays." (ehsankia)

Performance and Capabilities

This was a major focus, with users reporting experiences and speculating on capabilities across various domains like long context, coding, reasoning, math, multimodal tasks, and general intelligence.

  • Long Context: Highly praised by some. "The Long Context benchmark numbers seem super impressive." (M4v3R); "For me, the most exciting part is the improved long-context performance... Gemini was already one of my favorite models for this usecase." (tibbar); A user testing ~230k tokens of poetry reported: "Having just tried this model... I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems." (jorl17); Another user testing a Dart codebase bug: "It's about 360,000 tokens... This is the first model to identify it correctly." (mindwok). Skepticism remains: "But they score it on their own benchmark, on which coincidentally Gemini models always were the only good ones. In Nolima or Babilong we see that Gemini models still cant do long context." (zwaps)
  • Coding: Strong performance noted, especially on the Aider benchmark. "Gemini 2.5 Pro set the SOTA on the aider polyglot leaderboard [0] with a score of 73%... A huge jump from prior Gemini models." (anotherpaulg); "A model that is better on Aider than Sonnet 3.7? For free, right now?" (d3nj4l); "Generated 1000 lines of turn based combat with shop, skills, stats, elements, enemy types, etc. with this one" (mclau156). However, issues with tool use were reported: "I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it." (TheMagicHorsey)
  • Reasoning & Math: Seen as a potential strength. "Been playing around with it and it feels intelligent and up to date... A reasoning model by default when it needs to." (jnd0); A user tested a complex logic puzzle: "Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it... LLMs are now better than 95+% of the population at mathematical reasoning." (malisper). This claim sparked debate about training data contamination and the true nature of reasoning: "Unfortunately, I was easily able to find the exact question with a solution... online, thus it will have been in the training set." (sebzim4500); "It's a non-sequitur, you first have to show that the LLMs are reasoning in the same way humans do." (dkjaudyeqooe). Others found it failed similar problems (stabbles) or that other models also solved it (hmottestad, doener). Users discussed using reasoning models for "RCA infrastructure incidents" (bravura) and improving faithfulness in text tasks (liuliu), and SQL generation from free text (mgens).
  • Multimodal: Positive results reported. "Wow, was able to nail the pelican riding on a bicycle test" (nickandbro) with links provided; "I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those." (simonw)
  • General Intelligence/Quality: Mixed reports. Positive: "This model is a fucking beast." (Dowwie); "It nailed my two hard reasoning+linguistic+math questions in one shot" (asah); "Tops our benchmark in an unprecedented way... High quality, to the point." (freediver); Negative: "I tested out Gemini 2.5 and it failed miserably... got into an infinite loop" (TheMagicHorsey); Hallucinations reported: "hallucinated the ability to use a search tool" (slama); Confused self-identity: "I asked it a question and it said '... areas where I, as Bard, have...'" (noisy_boy); Knowledge cutoff confusion: "My info, the stuff I was trained on, cuts off around early 2023." (eenchev) vs. knowing recent news (staticman2, nikcub).
  • Specific Use Cases: Engineering questions (og_kalu), PDF understanding for capstone project (arjun_krishna1), long-form writing pacing (og_kalu), Magic: The Gathering rules puzzles (strstr - initially positive, then negative).

User Experience and Anecdotal Evidence

Beyond specific capabilities, users shared overall impressions and experiences, ranging from highly positive to deeply negative.

  • Positive: "Been playing around with it and it feels intelligent and up to date." (jnd0); "I'm impressed by this one." (simonw); "This model is a fucking beast. I am so excited about the opportunities this presents." (Dowwie); "This is a breakthrough moment. Truly a leap." (jorl17); "Very impressed" (r0fl, regarding solving a puzzle that stumped other models).
  • Negative/Critical: "Gemini models are like a McDonalds Croissant. You always give them an extra chance, but they always fall apart on your hands..." (belter); "I tested out Gemini 2.5 and it failed miserably... got into an infinite loop... I really don't know how others are getting these amazing results." (TheMagicHorsey); Reports of basic failures in conversational AI on phones (throitallaway, dcchambers); "I know I've been disappointed at the quality of Google's AI products. They are backup at best." (resource_waste); Overly cautious guardrails mentioned (andrewinardeer, slongfield).
  • Ambivalent/Specific Preference: "I've moved entire workflows to Gemini because it's just way better than what openai has to offer, especially for the money." (schainks); "They're not good models. They over fit to LMArena leaderboard, but perform worse in real life scenarios compared to their competitors." (ipsum2)

Privacy Concerns and Data Usage

The terms of service regarding data usage, particularly for experimental models, raised significant privacy concerns among users.

  • Quoting the terms: "'Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...'" (greatgib)
  • Highlighting human review: "To help with quality and improve our products... human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations." (greatgib, quoting terms)
  • Data retention policy: "Conversations that have been reviewed or annotated by human reviewers... are not deleted when you delete your Gemini Apps activity because they are kept separately... retained for up to three years." (greatgib, quoting terms and adding emphasis)
  • Comparison to others: "How does it compare to OpenAI and anthropic’s user data retention policy?" (mastodon_acc); "I mean this is pretty standard for online llms. What is Gemini doing here that openai or Anthropic aren’t already doing?" (mastodon_acc) vs "If i'm not wrong, Chatgpt states clearly that they don't use user data anymore by default... it is the first time I see recent LLM service saying that you can feed your data to human reviewers at their will." (greatgib)
  • Distinction between free/experimental and paid tiers: "I'm assuming this is true of all experimental models? That's not true with their models if you're on a paid tier though, correct?" (sauwan); "You can use a paid tier to avoid such issues." (summerlight)
  • Call for better guidelines: "More of a reason for new privacy guidelines specially for big tech and AI" (suyash)

Broader AI Landscape and Market Dynamics

Comments touched upon the overall state of AI development, commodification, hype cycles, and the search for competitive advantages ("moats").

  • Commoditization: "This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks. They are all interchangeable. This is great for users as it adds to price pressure." (bhouston)
  • Search for Moats: "Sooner or later someone is going to find 'secret sauce' that provides a step-up in capability, and it will be closely guarded by whoever finds it. As big players look to start monetizing, they are going to desperately be searching for moats." (Workaccount2)
  • Hype Cycle & Practicality: "Glaringly missing from the announcements: concrete use cases and products. The Achilles heel of LLMs is the distinct lack of practical real-world applications." (cratermoon), countered by "ChatGPT has like 500M weekly active users, what are you on about?" (sebzim4500). The "first step fallacy" was invoked regarding hype (cratermoon).
  • Productivity Impact: Questions about whether benchmark improvements translate to real-world productivity gains. "as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity... But then where is the productivity increases?" (WasimBhai)
  • Pace of Development: Seen as incredibly rapid by some: "The rate of announcements is a sign that models are increasing in ability at an amazing rate... leading to intense competition that benefits us all." (cadamsdotcom)
  • Resource Importance: Debate on whether hardware, talent, or data is most critical. "Google clearly is in the lead in terms of data access in my view." (lvl155); Hardware advantage also suggested due to TPUs (Workaccount2, danpalmer, semiinfinitely).

Google's Specific Challenges and Strengths

Users discussed factors unique to Google's position in the AI race, including its perceived weaknesses in marketing and product strategy, potential hardware advantages, and issues with guardrails and product perception.

  • Marketing/Hype Deficit: "Why do I have the feel that nobody is too much excited to google's models compared to other companies?" (batata_frita); "Google is worse at marketing and hyping people up." (Mond_)
  • Product Strategy/Bureaucracy: "The problem Goog has is its insane bureaucracy and lack of vision from Sundar" (CuriouslyC); Fear of products being discontinued: "Because it’s more likely to be sunsetted. https://killedbygoogle.com/" (SamuelAdams); "Normal Google rollout process: Bard is deprecated, Gemini is not ready yet." (guyzero); "Google has this habit of 'releasing' without releasing AI models." (throwaway13337)
  • Guardrails: Seen as overly cautious compared to competitors. "Google is overly cautious with their guardrails." (andrewinardeer); "For better or worse, Google gets more bad press when their models get things wrong compared to smaller AI labs." (slongfield)
  • Hardware Advantage (TPUs): Mentioned as a potential edge, especially for long context. "Google has the upperhand here because they are not dependent on nvidia for hardware. They make and uses their own AI accelerators." (Workaccount2); "TPUs have a network topology better suited for long context than gpus" (semiinfinitely).
  • Perception Issues: "I think google is digging a hole for themselves by making their lightweight models be the most used model. Regardless of what their heavy weight models can do, people will naturally associate them with their search model or assistant model." (Workaccount2)
  • Integration: Some see Google winning here: "Also, I think google's winning the race on actually integrating the AI to do useful things... A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO." (schainks)

Uncommon or Contrasting Opinions

While many themes had general consensus or clear points of debate, some individual comments stood out against the prevailing sentiment.

  • Enthusiasm for the Pace: Amidst general fatigue with announcements, one user found the rate itself exciting: "The rate of announcements is a sign that models are increasing in ability at an amazing rate... intense competition that benefits us all." (cadamsdotcom)
  • Strong Positive Endorsement: Contrasting with cautious optimism or negativity, some were extremely enthusiastic: "This model is a fucking beast. I am so excited about the opportunities this presents." (Dowwie)
  • Specific Positive Niche Use Case: While many debated general capabilities, one user highlighted a very specific, impressive success in long-form creative writing pacing: "This is the first model that after 19 pages generated so far resembles anything like normal pacing even with a TON of details." (og_kalu)
  • Defense of Benchmarks' Utility: While skepticism was common, one view offered a specific positive take on their value: "I would say they are a fairly good measure of how well the model has integrated information from pretraining." (Legend2440)
  • Wishing for Better Google Marketing (Due to Preference): Instead of just criticizing Google's marketing, one user expressed a desire for improvement because they actually prefer the product: "I wish I wish I wish Google put better marketing into these releases. I've moved entire workflows to Gemini because it's just way better..." (schainks)
  • Affirming Novelty Potential: Countering skepticism about LLM creativity, FloorEgg answered "Yes" to koakuma-chan's question: "Doesn't novel literally mean something new? Can we really expect an LLM to produce a novel?"
  • Asserting Superiority Despite Perception: Against the trend of Google skepticism: "I feel like Google intentionally don't want people to be as excited. This is a very good model. Definitely the best available model today." (Davidzheng)

Token usage:

24,968 input, 6,088 output, {"promptTokensDetails": [{"modality": "TEXT", "tokenCount": 24968}]}


simonw commented Mar 26, 2025

Created with this command:

curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
    jq -r 'recurse(.children[]) | .author + ": " + .text' | \
    llm -m "gemini-2.5-pro-exp-03-25" -s \
    'Summarize the themes of the opinions expressed here.
    For each theme, output a markdown header.
    Include direct "quotations" (with author attribution) where appropriate.
    You MUST quote directly from users when crediting them, with double quotes.
    Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
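
For readers less familiar with jq, the recurse(.children[]) step walks the Algolia items API's nested comment tree and emits one "author: text" line per comment; that flattened transcript is what gets piped into the llm CLI together with the summarization prompt. A rough Python equivalent of just the flattening step (a sketch only, relying on the same author / text / children fields the jq filter already uses, not part of the original workflow) might look like this:

import json
import urllib.request

# Fetch the full comment tree for the Hacker News story from the Algolia items API.
url = "https://hn.algolia.com/api/v1/items/43473489"
with urllib.request.urlopen(url) as resp:
    item = json.load(resp)

def flatten(node):
    # Yield "author: text" for this node and every descendant (depth-first,
    # pre-order), mirroring jq's recurse(.children[]). Missing fields become "".
    yield f"{node.get('author') or ''}: {node.get('text') or ''}"
    for child in node.get("children", []):
        yield from flatten(child)

# The joined transcript is what the shell pipeline hands to the llm command and prompt above.
transcript = "\n".join(flatten(item))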
