The announcement of GPT-4.5 sparked immediate discussion about OpenAI's naming conventions, with many users finding the product lineup increasingly confusing.
"At this point I think the ultimate benchmark for any new LLM is whether or not it can come up with a coherent naming scheme for itself. Call it 'self awareness,'" joked throwup238.
Lenerdenator added that "The people naming them really took the 'just give the variable any old name, it doesn't matter' advice from Programming 101 to heart."
Several users pointed out the confusing sequence of model versions:
- "3,3.5,4,4o,4.5... I had my money on 4oz" quipped nopelynopington
- "Still more coherent than the OpenAI lineup," noted smallmancontrov
The most significant criticism centered on GPT-4.5's extremely high pricing relative to its incremental improvements over existing models.
The pricing structure shocked many:
- Input: $75.00 / 1M tokens
- Cached input: $37.50 / 1M tokens
- Output: $150.00 / 1M tokens
This represents a 30x increase for input and a 15x increase for output compared to GPT-4o.
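For concreteness, the multipliers can be checked with a few lines of arithmetic. This minimal sketch assumes a GPT-4o baseline of $2.50 / 1M input tokens and $10.00 / 1M output tokens (contemporaneous published pricing; verify against OpenAI's pricing page):

```python
# Per-million-token prices in USD; GPT-4.5 figures are from the list above,
# GPT-4o figures are an assumed baseline from pricing at the time.
gpt45 = {"input": 75.00, "cached_input": 37.50, "output": 150.00}
gpt4o = {"input": 2.50, "output": 10.00}

input_ratio = gpt45["input"] / gpt4o["input"]    # how many times pricier per input token
output_ratio = gpt45["output"] / gpt4o["output"]  # how many times pricier per output token

print(f"Input: {input_ratio:.0f}x, Output: {output_ratio:.0f}x")  # Input: 30x, Output: 15x
```

Under these baseline prices, the ratios come out to exactly the 30x and 15x figures cited by commenters below.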
"GPT 4.5 pricing is insane," wrote zaptrem, noting that "It sounds like it's so expensive and the difference in usefulness is so lacking(?) they're not even gonna keep serving it in the API for long."
MattSayar calculated: "Input price difference: 4.5 is 30x more. Output price difference: 4.5 is 15x more. In their model evaluation scores in the appendix, 4.5 is, on average, 26% better. I don't understand the value here."
Bhouston highlighted benchmark results showing GPT-4.5 at 38.0% on SWE-bench Verified coding tests, compared to 61.0% for o3-mini and 62-70% for Claude 3.7: "This means that I'll stick with Claude 3.7 for the time being."
Many users noticed that OpenAI seems to be emphasizing the model's emotional intelligence and conversational abilities over reasoning capabilities.
Sebastiennight observed: "It is interesting that they are focusing a large part of this release on the model having a higher 'EQ' (Emotional Quotient). We're far from the days of 'this is not a person, we do not want to make it addictive' and getting a firm foot on the territory of 'here's your new AI friend'."
JohnMakin shared the contrasting responses between GPT-4.5 and GPT-4o to "I'm going through a tough time after failing a test," noting: "Is it just me or is the 4o response insanely better? I'm not the type of person to reach for a LLM for help about this kind of thing, but if I were, the 4o response seems vastly better."
Some users like mvdtnz criticized this approach: "OpenAI doubling down on the American-style therapy-speak instead of focusing on usefulness. No thanks."
Many users couldn't access the new model despite being Pro subscribers.
"Not available in my Pro plan," reported I_am_tiberius.
Sam Altman explained on Twitter (as shared by minimaxir): "bad news: it is a giant, expensive model. we really wanted to launch it to plus and pro at the same time, but we've been growing a lot and are out of GPUs. we will add tens of thousands of GPUs next week and roll it out to the plus tier then. (hundreds of thousands coming soon, and i'm pretty sure y'all will use every one we can rack up.)"
An interesting discussion emerged about OpenAI's approach compared to competitors like Anthropic, with some seeing this as a sign that pre-training scaling is hitting diminishing returns.
Eightysixfour provided thoughtful analysis: "Seeing OpenAI and Anthropic go different routes here is interesting... Anthropic appears to be making a bet that a single paradigm (reasoning) can create a model which is excellent for all use cases. OpenAI seems to be betting that you'll need an ensemble of models with different capabilities, working as a single system, to jump beyond what the reasoning models today can do."
Serjester was more critical: "I suppose this was their final hurrah after two failed attempts at training GPT-5 with the traditional pre-training paradigm. Just confirms reasoning models are the only way forward."
Some users did report positive experiences with GPT-4.5 for specific use cases:
Jampa shared: "The style it writes is way better: it keeps the tone you ask and makes better improvements on the flow. One of my biggest complaints with 4o is that you want for your content to be more casual and accessible but GPT / DeepSeek wants to write like Shakespeare did."
Ripped_britches noted: "for user facing applications like mine, this is an awesome step in the right direction for EQ / tone / voice. Obviously it will get distilled into cheaper open models very soon, so I'm not too worried about the price or even tokens per second."
Many users compared GPT-4.5 unfavorably to competing models like Claude 3.7 Sonnet.
"It seems clearly worse than Claude Sonnet 3.7, yet costs 30x as much?" questioned jasonjmcghee.
Anotherpaulg shared benchmark results: "GPT-4.5 Preview scored 45% on aider's polyglot coding benchmark. OpenAI describes it as 'good at creative tasks', so perhaps it is not primarily intended for coding."
The benchmark results showed:
- 65% Sonnet 3.7, 32k think tokens (SOTA)
- 60% Sonnet 3.7, no thinking
- 48% DeepSeek V3
- 45% GPT 4.5 Preview
- 27% ChatGPT-4o
- 23% GPT-4o
Some users offered contrarian views amid the largely negative reactions:
Antirez provided a more charitable take: "In many ways I'm not an OpenAI fan (but I need to recognize their many merits). At the same time, I believe people are missing what they tried to do with GPT 4.5: it was needed and important to explore the pre-training scaling law in that direction. A gift to science, however selfish it could be."
Crazygringo defended the exploratory approach: "With new disruptive technologies, companies aren't supposed to be able to look into a crystal ball and see the future. They're supposed to try new things and see what the market finds useful."
Wewewedxfgdf expressed nostalgia for earlier models: "GPT-2 was laugh out loud funny, rolling on the ground funny. I miss that - newer LLMs seem to have lost their sense of humor. On the other hand GPT-2's funny stories often veered into murdering everyone in the story and committing heinous crimes but that was part of the weird experience."
Zone411 reported a significant improvement in a specific benchmark: "It significantly improves upon GPT-4o on my Extended NYT Connections Benchmark. 22.4 -> 33.7."