Model: o3-mini
johnneville:
brcmthrowaway: Gamechanger?
aspect0545: It’s always a game changer, isn’t it?
colonelspace: Yep.<p>And groundbreaking.
maeil: <i>It changes the landscape with its multifaceted approach</i>.
sss111: This time, it is a save-face release, especially because Azure, AWS, and OpenRouter have started offering DeepSeek
42lux: Is it AGI yet?
sss111: So far, it seems like this is the hierarchy<p>o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini<p>o3 mini system card: <a href="https://cdn.openai.com/o3-mini-system-card.pdf" rel="nofollow">https://cdn.openai.com/o3-mini-system-card.pdf</a>
forrestthewoods: OpenAI needs a new branding scheme.
airstrike: ChatGPT Series X O one
nejsjsjsbsb: The Llama folk know how. Good old 90s version scheme.
losvedir: What about "o1 Pro mode". Is that just o1 but with more reasoning time, like this new o3-mini's different amount of reasoning options?
MichaelBurge: o1-pro is a different model than o1.
sho_hn: I think OpenAI really needs to rethink its product naming, especially now that they have a portfolio where there's no clear hierarchy, but where models have a place along different axes (speed, cost, reasoning, capabilities, etc.).<p>Your summary attempt e.g. also misses o3-mini vs o3-mini-high. Lots of trade-offs.
echelon: It's like AWS SKU naming (`c5d.metal`, `p5.48xlarge`, etc.), except non-technical consumers are expected to understand it.
maeil: Have you seen Azure VM SKU naming? It's... impressive.
buildbot: And it doesn’t even line up with the actual instances you’ll be offered. At one point I was using some random Nvidia A10 node that was supposed to be similar to Standard_NV36adms_A10_v5, but was an NC series for some reason with slightly different letters…
nejsjsjsbsb: Those are not names but hashes used to look up the specs.
sss111: Yeah I tried my best :(<p>I think they could've borrowed a page out of Apple's book, even mountain names would be better. Plus Sonoma, Ventura, and Yosemite are cool names.
rf15: They're strongly tied to Microsoft, so confusing branding is to be expected.
chris_va: One of my favorite parodies: <a href="https://www.youtube.com/watch?v=EUXnJraKM3k" rel="nofollow">https://www.youtube.com/watch?v=EUXnJraKM3k</a>
ANewFormation: I can't wait for Project Unify which just devolves into a brand new p3-mini type naming convention. It's pretty much identical to the o3-mini, except the API is changed just enough to be completely incompatible and it crashes on any query using a word with more than two syllables. Fix coming soon, for 4 years so far.<p>On the bright side the app now has curved edges!
ngokevin: It needs to be clowned on here:<p>- Xbox, Xbox 360, Xbox One, Xbox One S/X, Xbox Series S/X<p>- Windows 3.1...98, 2000, ME, XP, Vista, 7, 8, 10<p>I guess it's better than headphones names (QC35, WH-1000XM3, M50x, HD560s).
nejsjsjsbsb: Flashbacks of the .NET zoo. At least they reined that in.
kaaskop: Yeah their naming scheme is super confusing, I honestly confuse them all the time.
Euphorbium: They can still do models o3o, oo3 and 3oo. Mini-o3o-high, not to be confused with mini-O3o-high (the first o is capital).
brookst: You’re thinking too small. What about o10, O1o, o3-m1n1?
margalabargala: They should just start encoding the model ID in trinary using o, O, and 0.<p>Model 00oOo is better than Model 0OoO0!
nullpoint420: Can't wait for the eventual rename to GPT Core, GPT Plus, GPT Pro, and GPT Pro Max models!<p>I can see it now:<p>> Unlock our industry leading reasoning features by upgrading to the GPT 4 Pro Max plan.
LoveMortuus: How would DeepSeek fit into this?<p>Or can it not compare? I don't know much about this stuff, but I've heard many people talk recently about DeepSeek and how unexpected it was.
sss111: DeepSeek V3 is equivalent to 4o. DeepSeek R1 is equivalent to o1 (if not better).<p>I think someone should just build an AI model-comparison website at this point. Include all benchmarks and pricing.
jsk2600: This one is good: <a href="https://artificialanalysis.ai/" rel="nofollow">https://artificialanalysis.ai/</a>
withinboredom: Looks like this only compares commercial models, and not the ones I can download and actually run locally.
gundmc: If this is the hierarchy, why does 4o score so much higher than o1 on LLM Arena?<p>Worrisome for OpenAI that Gemini's mini/flash reasoning model outscores both o1 and 4o handily.
crazysim: Is it possible people are voting for speed of responsiveness too?
ActVen: I really wish they would open up the reasoning effort toggle on o1 API. o1 Pro Mode is still the best overall model I have used for many complex tasks.
ALittleLight: That seems very bad. What's the point of a new model that's worse than 4o? I guess it's cheaper in the API and a bit better at coding - but, this doesn't seem compelling.<p>With DeepSeek I heard OpenAI saying the plan was to move releases on models that were meaningfully better than the competition. Seems like what we're getting is the scheduled releases that are worse than the current versions.
usaar333: For non-STEM, perhaps.<p>For math/coding problems, o3-mini is tied with, if not better than, o1.
koakuma-chan: You cannot compare GPT-4o and o*(-mini) because GPT-4o is not a reasoning model.
thot_experiment: at least if i ran the company you'd know that<p>ChatGPTasdhjf-final-final-use_this_one.pt > ChatGPTasdhjf-final.pt > ChatGPTasdhjf.pt > ChatGPTasd.pt> ChatGPT.pt
withinboredom: yeah, you can def tell they are partnered with Microsoft.
vincentpants: Wow, it got to the top of the front page so fast! Weird!
dang: I took a quick look at the data and FWIW the votes look legit to me, if that's what you were wondering.
throwaway314155: I'm fairly certain it was sarcasm.
Qwuke: It did get 29 points in 3 minutes, which seems like a lot even for a fan favorite, but is also consistent with previous OpenAI announcements here.
johnneville: I posted a verge article first but then checked and saw the openai blog and posted that. I'd guess it's the officialness / domain that makes ppl click on this so easily.
zurfer: to be fair, I was waiting for this release the whole day
pookieinc: Can't wait to try this. What's amazing to me is that when this was revealed just one short month ago, the AI landscape looked very different than it does today with more AI companies jumping into the fray with very compelling models. I wonder how the AI shift has affected this release internally, future releases and their mindset moving forward... How does the efficiency change, the scope of their models, etc.
echelon: There's no moat, and they have to work even harder.<p>Competition is good.
wahnfrieden: Collaboration is even better, per open source results.<p>It is the closed competition model that’s being left in the dust.
lesuorac: I really don't think this is true. OpenAI has no moat because they have nothing unique; they're mostly using other people's architectures (like Transformers) and other companies' hardware.<p>Their value prop (moat) is that they've burnt more money than everybody else. That moat is trivially circumvented by lighting a larger pile of money, and less trivially by lighting the pile more efficiently.<p>OpenAI isn't the only company. The tech companies being beaten massively by Microsoft in # of H100 purchases are the ones with a moat. Google / Amazon with their custom AI chips are going to have better performance per cost than others, and that will be a moat. If you want to get the same performance per cost then you need to spend the time making your own chips, which is years of effort (= moat).
sangnoir: > That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficently.<p>DeepSeek has proven that the latter is possible, which drops a couple of River crossing rocks into the moat.
withinboredom: The fact that I can basically run o1-mini with deepseek:8b, locally, is amazing. Even on battery power, it works acceptably.
brookst: Brand is a moat
cruffle_duffle: Ask Jeeves and Altavista surely have something to say about that!
sumedh: > That moat is trivially circumvented by lighting a larger pile of money and less trivially by lighting the pile more efficently.<p>Google with all its money and smart engineers was not able to build a simple chat application.
lumost: Capex was the theoretical moat, same as TSMC and similar businesses. DeepSeek poked a hole in this theory. OpenAI will need to deliver massive improvements to justify a 1 billion dollar training cost relative to 5 million dollars.
patrickhogan1: I thought it was o3 that was released one month ago and received high scores on ARC Prize - <a href="https://arcprize.org/blog/oai-o3-pub-breakthrough" rel="nofollow">https://arcprize.org/blog/oai-o3-pub-breakthrough</a><p>If they were the same, I would have expected explicit references to o3 in the system card and how o3-mini is distilled or built from o3 - <a href="https://cdn.openai.com/o3-mini-system-card.pdf" rel="nofollow">https://cdn.openai.com/o3-mini-system-card.pdf</a> - but there are no references.<p>Excited at the pace all the same. Excited to dig in. The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred.
iamjackg: I'm very interested in their Jailbreak evaluations: they're new to me. I might have missed previous mentions.
Ninjinka: 50 messages a day -> 150 messages a day for Plus and Team users
cjbarber: The interesting question to me is how far these reasoning models can be scaled. With another 12 months of compute scaling (for synthetic data generation and RL), how good will these models be at coding? I talked with Finbarr Timbers (ex-DeepMind) about this yesterday, and his take is that we'll hit diminishing returns – not because we can't make models more powerful, but because we're approaching diminishing returns in the areas that matter to users, and AI models may be nearing a plateau where capability gains matter less than UX.
xyzzy9563: What is the comparison of this versus DeepSeek in terms of good results and cost?
reissbaker: Probably a good idea to wait for external benchmarks like Aider, but my guess is it'll be somewhere between DeepSeek V3 and R1 in terms of benchmarks — R1 trades blows with o1-high, and V3 is somewhat lower — but I'd expect o3-mini to be considerably faster. Despite the blog post saying paid users can access o3-mini today, I don't see it as an option yet in their UI... But IIRC when they announced o3-mini in December they claimed it would be similar to 4o in terms of overall latency, and 4o is much faster than V3/R1 currently.
Synaesthesia: Deepseek is the state of the art right now in terms of performance and output. It's really fast. The way it "explains" how it's thinking is remarkable.
fpgaminer: DeepSeek is great because: 1) you can run the model locally, 2) the research was openly shared, and 3) the reasoning tokens are open. It is not, in my experience, state of the art. In all of my side by side comparisons thus far in real world applications between DeepSeek V3 and R1 vs 4o and o1, the latter has always performed better. OpenAI's models are also more consistent, glitching out maybe one in 10,000, whereas DeepSeek's models will glitch out 1 in 20. OpenAI models also handle edge cases better and have a better overall grasp of user intentions. I've had DeepSeek's models consistently misinterpret prompts, or confuse data in the prompts with instructions. Those are both very important things that make DeepSeek useless for real world applications. At least without finetuning them, which then requires using those huge 600B parameter models locally.<p>So it is by no means state of the art. Gemini Flash 2.0 also performs better than DeepSeek V3 in all my comparisons thus far. But Gemini Flash 2.0 isn't robust and reliable either.<p>But as a piece of research, and a cool toy to play with, I think DeepSeek is great.
evertedsphere: >developer messages<p>looks like finally their threat model has been updated to take into account that the user might be too "unaligned" to be trusted with the ability to provide a system message of their own
logicchains: If their models ever fail to keep ahead of the competition in terms of smarts, users are going to ditch them en masse for a competitor that doesn't treat its users like the enemy.
reissbaker: ...I'm pretty sure they just renamed the key...
buyucu: why should anyone use this when deepseek is free/cheaper?<p>openai is no longer relevant.
ilaksh: I don't think OpenAI is training on your data. At least they say they don't, and I believe that. I wouldn't be surprised if the NSA or something has access to the data if they request it, though.<p>But DeepSeek clearly states in their terms of service that they can train on your API data or use it for other purposes. Which one might assume their government can access as well.<p>We need direct eval comparisons between o3-mini and DeepSeek. Or, well, they are numbers, so we can look them up on leaderboards.
csomar: You can pay for the compute and be certain that no one is recording your data with DeepSeek.
seinecle: Yes but DeepSeek models can be accessed through the APIs of Cloudflare or GitHub, in which case no training on your data takes place.
ilaksh: True.
lappa: OpenAI clearly states that they train on your data <a href="https://help.openai.com/en/articles/5722486-how-your-data-is-used-to-improve-model-performance" rel="nofollow">https://help.openai.com/en/articles/5722486-how-your-data-is...</a>
lemming: <i>By default, we do not train on any inputs or outputs from our products for business users, including ChatGPT Team, ChatGPT Enterprise, and the API. We offer API customers a way to opt-in to share data with us, such as by providing feedback in the Playground, which we then use to improve our models. Unless they explicitly opt-in, organizations are opted out of data-sharing by default.</i><p>The business bit is confusing, I guess they see the API as a business product, but they do not train on API data.
therein: So for posterity, in this subthread we found that OpenAI indeed trains on user data and it isn't something that only DeepSeek does.
pzo: > Services for businesses, such as ChatGPT Team, ChatGPT Enterprise, and our API Platform
> By default, we do not train on any inputs or outputs from our products for business users, including ChatGPT Team, ChatGPT Enterprise, and the API.<p>So on API they don't train by default, for other paid subscription they mention you can opt-out
sekai: > I don't think OpenAI is training on your data. At least they say they don't, and I believe that.<p>Like they said they were committed to being “open”?
JadoJodo: I'm going to assume the best in your question and disregard your statement.<p>Reasons to use o3 when deepseek is free/cheaper:<p>- Some companies/users may already have integrated heavily with OpenAI<p>- The expanded feature-set (e.g., function-calling, search) could be very powerful<p>- DeepSeek has deep ties to the Chinese Communist Party and, while the US has its own blackspots, the "steering" of information is far more prevalent in their models<p>- Local/national regulations might not allow for using DeepSeek due to data privacy concerns<p>- "free" isn't always better<p>I'm sure others have better reasons
GoatInGrey: > openai is no longer relevant.<p>I think you've spent a little too long hitting on the Deepseek pipe. Enterprise customers with familiarity with China will avoid the hosted model for data security and IP protection reasons, among others.<p>Those working in any area considered economically competitive with China will also be hesitant to use the vanilla model in self-hosted form as there perpetually remains the standing question on what all they've tuned inside the model to benefit the CCP. Perhaps even in subtle ways reminiscent of the Trisolaran sophons from the Three Body Problem.<p>For instance, you can imagine that if Germany had released an OS model in 1943, that the Americans wouldn't have trusted it to help them develop better military systems even if initial testing passed muster.<p>Unfortunately, state control of private enterprise in the Chinese economy makes it unproductive to separate the two from one another. Particularly in Deepseek's case as a wide array of Chinese state-linked social media accounts were promoting V3/R1 on the day of its public release.<p><a href="https://www.reuters.com/technology/artificial-intelligence/chinese-state-linked-accounts-hyped-deepseek-ai-launch-ahead-us-stock-rout-2025-01-31/" rel="nofollow">https://www.reuters.com/technology/artificial-intelligence/c...</a>
anon373839: Perhaps you didn’t realize: Deepseek is an open weights model and you can use it via the inference provider of your choice, or even deploy it on your own hardware - unlike OpenAI’s models. API calls to China are not necessary.
ks2048: I think OpenAI should just have a single public-facing "model" - all these names and versions are confusing.<p>Imagine if Google, during its ascent, had a huge array of search engines with code names and notes about what each was doing behind the scenes. No, you open the page and type in the box. If they can make it work better next month, great.<p>(I understand this could not apply to developers or enterprise-type API usage.)
johanvts: That's the role of ChatGPT?
sroussey: Nope. That lets you choose from seven models right now.
ehfeng: Early Google search only provided web links. Google Images, News, Video, Shopping, Maps, Finance used to be their own search boxes. Only later did Google start unifying their search experiences.<p>Yelp suffered greatly in the early 2010s when Google started putting Google Maps listings (and their accompanying reviews) in their search results.<p>OpenAI will eventually unify their products as well.
Deegy: If Google had to face the reality that distilling their search engine into multiple case-specific engines would have resulted in vastly superior search results, they surely would have done (or considered) it.<p>Fortunately for them, a monolith search engine was perfectly fine (and likely optimal due to accrued network effects).<p>OpenAI is basically signaling that they need to distill their monolith in order to serve specific segments of the marketplace. They've explicitly said that they're targeting STEM with this one. I think that's a smart choice; the most passionate early adopters of this tech are clearly STEM users.<p>If the tech were such that one monolith model was actually the optimal solution for all use cases, they would just do that. Actually, this is their stated mission: AGI. One monolith that's best at everything is basically what AGI is.
dgfitz: Oh look, another model. Yay.
devindotcom: Sure as a clock, tick follows tock. Can't imagine trying to build out cost structures, business plans, product launches etc on such rapidly shifting sands. Good that you get more for your money, I suppose. But I get the feeling no model or provider is worth committing to in any serious way.
turnsout: Hmm, not seeing it in my dashboard yet (Tier 4)
throwaway314155: This has happened to me with (I think) every single major model release (llm or image gen) from OpenAI. They just lie in their release announcements which leaves people scrambling on the day of.
sunaookami: It appeared just now for me on Tier 3.
thunkingdeep: I’ll take the China Deluxe instead, actually.<p>I’ve been incredibly pleased with DeepSeek this past week. Wonderful product, I love seeing its brain when it’s thinking.
mechagodzilla: Being able to see the thinking trace in R1 is so useful, as you can go back and see if it's getting stuck, making a wrong assumption, missing data, etc. To me that makes it materially more useful than the OpenAI reasoning models, which seem impressive, but are much harder to inspect/debug.
thot_experiment: Running it locally lets you <i>INTERJECT IN ITS THINKING IN REALTIME</i> and I cannot stress how useful that is.
amarcheschi: Oh this is so cool
Gooblebrai: You mean it reacts to you writing something while it's thinking, or that you can stop it while it's thinking?
hmottestad: You can stop it at any time, then modify what it's written so far...then press continue and let it continue thinking and answering.
thot_experiment: Fundamentally the UI is up to you, I have a "typing-pauses-inference-and-starts-gaslighting" feature in my homebrew frontend, but in OpenWebUI/Sillytavern you can just pause it and edit the chain of thought and then have it continue from the edit.
Gracana: That's a great idea. In your frontend, do you write in the same text entry field as the bot? I use oobabooga/text-generation-webui and I find it's a little awkward to edit the bot responses.
thenameless7741: Interesting.. In the official API [1], there's no way to prefill the reasoning_content:<p>> Please note that if the reasoning_content field is included in the sequence of input messages, the API will return a 400 error. Therefore, you should remove the reasoning_content field from the API response before making the API request<p>So the best I can do is pass the reasoning as part of the context (which means starting over from the beginning).<p>[1] <a href="https://api-docs.deepseek.com/guides/reasoning_model" rel="nofollow">https://api-docs.deepseek.com/guides/reasoning_model</a>
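For reference, the flow those docs describe looks roughly like this; a sketch assuming the OpenAI-compatible deepseek-reasoner endpoint from the linked page, with illustrative prompts:

```python
# Rough sketch of the multi-turn flow described in the linked DeepSeek docs,
# assuming their OpenAI-compatible endpoint and the `deepseek-reasoner` model.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

messages = [{"role": "user", "content": "9.11 or 9.8, which is larger?"}]
resp = client.chat.completions.create(model="deepseek-reasoner", messages=messages)

msg = resp.choices[0].message
print("thinking:", msg.reasoning_content)  # visible CoT, for inspection only
print("answer:", msg.content)

# Before the next turn, append only the final answer; echoing
# reasoning_content back in the input is what triggers the 400 error.
messages.append({"role": "assistant", "content": msg.content})
messages.append({"role": "user", "content": "Explain your reasoning in one line."})
resp = client.chat.completions.create(model="deepseek-reasoner", messages=messages)
```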
c-fe: I would actually love it if it would just ask me simple questions (just yes/no) when it's thinking about something I wasn't clear about, and I could help it this way. It's a bit sad seeing it write out the assumption and then reach the wrong conclusion.
thot_experiment: You can run it locally, pause it when it thinks wrong, and correct its chain of thought.
c-fe: Oh wow, I did not know that, and I don't have the hardware to run it locally unfortunately.
thot_experiment: You probably have the hardware to run the smallest distill, it runs even on my ancient laptop. It's not very smart but it still does the CoT and you can have fun editing it.
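For anyone curious what that looks like in practice, here is a toy sketch, assuming a local llama.cpp server and an R1 distill that wraps its reasoning in <think> tags; the template and prompts are illustrative, not a definitive recipe:

```python
# Toy sketch of the "pause, edit the chain of thought, continue" trick described
# above, assuming a local llama.cpp server (llama-server with an R1-distill GGUF)
# whose raw /completion endpoint lets you control the full prompt text.
# The chat template and the <think> convention are assumptions, not a recipe.
import requests

BASE = "http://localhost:8080"

prompt = "<|User|>How many r's are in 'strawberry'?<|Assistant|><think>\n"

# First pass: let the model start thinking, but cap it early (the "pause").
partial = requests.post(
    f"{BASE}/completion",
    json={"prompt": prompt, "n_predict": 200},
).json()["content"]

# "Edit" the chain of thought by hand before letting it continue.
edited = partial.replace("two r's", "three r's")  # illustrative correction

# Second pass: feed the edited thinking back as a prefix and let it finish.
final = requests.post(
    f"{BASE}/completion",
    json={"prompt": prompt + edited, "n_predict": 500},
).json()["content"]

print(edited + final)
```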
viraptor: You can add that to the prompt. If you're running into those situations with vague assumptions, ask it to provide either the answer or questions requesting any useful missing information.
czk: the fact that openai hides the reasoning tokens from us to begin with shows that what they are doing behind the scenes isn't all that impressive, and likely easily cloned (r1)<p>would be nice if they made them visible now
istjohn: I recently tried Gemini-1.5-Pro for the first time. It was clearly better than DeepSeek or any of the OpenAI models available to Plus subscribers.
leovander: I am running the 7B distilled version locally. I asked it to create a skeleton MEAN project. Everything was great, but then it started to generate the front-end and I noticed the file extension (.tsx) and then saw React getting imported.<p>I gave the same prompt to Sonnet 3.5 and not a single hiccup.<p>Maybe not an indication that DeepSeek is worse/bad (I am using a distilled version), but it more so speaks to how much React/Next.js is out in the world influencing the front-end code that is referenced.
rafaquintanilha: You know you are running an extremely nerfed version of the model, right?
leovander: I did update my comment, but said that I am using the distilled version, so yes?
cbg0: Even the full model scores below Claude on livebench so a distilled version will likely be even worse.
rsanek: Based on the leaderboard R1 is significantly better than Claude? <a href="https://livebench.ai/#/" rel="nofollow">https://livebench.ai/#/</a>
satvikpendem: You are not actually running DeepSeek, those distilled models have nothing to do with DeepSeek itself and are just finetuned on DeepSeek responses.
xeckr: Have you tried seeing what happens when you speak to it about topics which are considered politically sensitive in the PRC?
leovander: You can get around it based on how you ask the question. If you ask in the way the X/Reddit posts you might have seen do, then for the most part, yes, the thinking stream immediately stops and you get the safety message.
thot_experiment: R1 (70B-distill) itself is very uncensored, and will give you a full account of Tiananmen Square from vague prompts. Asking R1 "what significant things happened in china in 1989" had it volunteering that "the death toll was in the hundreds or thousands and the exact number remains disputed to this day". The only thing that's censored is the web interface.
GoatInGrey: When asking it about the concept of human rights and the various forms in which it manifests (i.e. demographic equality under the law), I get a mixture of mundane nuance and bizarre answers that Xi Jinping himself could have written, with references to unity and the importance of social harmony over the "freedoms of the few".<p>This tracks when considering that the model was trained on western model outputs and then tuned post-training to (poorly) align it with Chinese values.
thot_experiment: I definitely am not getting that; perhaps the 671b model is notably worse than the 70b llama distill in this respect. 70b seemed pretty happy to talk about the ethnic cleansing of the Uyghurs in Xinjiang by the CCP and of Palestinians in Gaza by Israel. It did some both-sidesing, but it generally seemed to provide a balanced-ish viewpoint. At least I think it provided a viewpoint that comports with my best guess of what the average person globally would consider balanced.
amarcheschi: Seeing the CoT can provide some insight into what's happening in its "mind", and that alone is quite worth it imho
jazzyjackson: Using R1 with Perplexity has impressed me in a way that none of the previous models have, and I can't even figure out if it's actually R1. It seems likely that it's a 70B-llama distillation, since that's what AWS offers on Bedrock, but from what I can find Perplexity does have their own H100 cluster through Amazon, so it's feasible they could be hosting the real thing? But I feel like they would brag about that achievement instead of being coy and simply labeling it "Deepseek R1 - Hosted in US".
Szpadel: I played with their model, and I wasn't able to make it follow any instructions; it looked like it just read the first message and ignored the rest of the conversation. Not sure if that is a bug with OpenRouter or the model, but I was highly disappointed.<p>From the way it thinks/responds, it looks like it's one of the distillations, likely the Llama one.
I also suspect that many of the free/cheap providers serve a Llama distill instead of the real R1.
coliveira: Yes, it is a great product, especially for coding tasks.
thefourthchime: I've seen it get into long 5 minute chains of thought where it gets totally confused.
bushbaba: I did a blind test and still prefer Gemini, Claude, and OpenAI to deepseek.
wg0: Sometimes its thinking is more useful than the actual output.
anon373839: Agreed. These locked-down, proprietary models do not interest me. And I certainly am not building product with them - being shackled to a specific provider is a needless business risk.
ofou: I find it quite interesting that they're releasing three compute levels (low, medium, high); I guess now there's some way to cap the thinking tokens when using their API.<p>Pricing for o3-mini [1] is $1.10 / $4.40 per 1M input / output tokens.<p>[1]: <a href="https://platform.openai.com/docs/pricing#:~:text=o3%2Dmini" rel="nofollow">https://platform.openai.com/docs/pricing#:~:text=o3%2Dmini</a>
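For reference, the knob is exposed per request; a rough sketch, assuming the reasoning_effort parameter described in OpenAI's API docs and that thinking tokens are billed as ordinary output tokens:

```python
# Sketch of how the three levels are exposed, assuming the `reasoning_effort`
# parameter on the Chat Completions API (naming per OpenAI's o3-mini docs).
from openai import OpenAI

client = OpenAI()

for effort in ("low", "medium", "high"):
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort=effort,
        messages=[{"role": "user", "content": "Prove there are infinitely many primes."}],
    )
    # Reasoning tokens are billed as output tokens, so higher effort mostly
    # shows up as a larger completion_tokens count at the same per-token price.
    print(effort, resp.usage.completion_tokens)
```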
kevinsundar: BTW if you want to stay up to date with these kinds of updates from OpenAI you can follow them here: <a href="https://www.getchangelog.com/?service=openai.com" rel="nofollow">https://www.getchangelog.com/?service=openai.com</a><p>It uses GPT-4o mini to extract updates from the website using scrapegraphai, so this is kinda meta :). Maybe I'll switch to o3-mini depending on cost. Its reasoning abilities, at a lower cost than o1, could be quite powerful for web scraping.
random3: I might be missing some context here - what specific comment does yours refer to?
I'm asking because I don't see you in the conversation, and your comment seems like an out-of-context self-promoting plug.
kevinsundar: Hey! I'm sorry you feel that way. Several people have subscribed to OpenAI updates from my comment, so there is clearly value for other commenters. I understand not everyone is interested, though. It's just a free side project I built and I make no money.<p>Additionally, I believe my contribution to the conversation is that gpt-4o-mini, the previous model advertised as low-cost, works pretty well for my use case (which in this case can help others here). I'm excited to try out o3-mini depending on what the cost looks like for web scraping purposes. Happy to report back here once I try it out.
ryanhecht: > While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed.<p>I feel like this naming scheme is growing a little tired. o1 is for general knowledge reasoning, o3-mini replaces o1-mini but might be more specialized than o1 for certain technical domains...the "o" in "4o" is for "omni" (referring to its multimodality) but the reasoning models start with "o" ...but they can't use "o2" for trademark reasons so they skip straight to "o3" ...the word salad is getting really hard to follow!
kingnothing: They really need someone in marketing.<p>If the model is for technical stuff, then call it the technical model. How is anyone supposed to know what these model names mean?<p>The only page of theirs attempting to explain this is a total disaster. <a href="https://platform.openai.com/docs/models" rel="nofollow">https://platform.openai.com/docs/models</a>
rowanG077: If marketing terms from Intel, AMD, Dell and other tech companies have taught me anything, it's that they need FEWER people in marketing.
TeMPOraL: But think of all the other marketers whose job is to produce blogspam explaining confusing product names!
ninetyninenine: I bet you can get one of their models to fix that disaster.
ryanhecht: But what would we call that model?
aleph_minus_one: > But what would we call that model?<p>Ask one of their models for advice. :-)
ryanhecht: Reminds me of a joke in the musical "How to Succeed in Business Without Really Trying" (written in 1961):<p>PETERSON
Oh say, Tackaberry, did you get my memo?<p>TACKABERRY
What memo?<p>PETERSON
My memo about memos. We're sending out too many memos and it's got to stop!<p>TACKABERRY
All right. I'll send out a memo.
ninetyninenine: Let’s call it “O5 Pro Max Elite”—because if nonsense naming works for smartphones, why not AI models?
ryandrake: O5 Pro Max Elite Enterprise Edition with Ultra
TeMPOraL: Maybe they could start selling "season passes" next to make their offering even more clear!
n2d4: > They really need someone in marketing.<p>Who said this is not intentional? It seems to work well given that people are hyped every time there's a release, no matter how big the actual improvements are — I'm pretty sure "o3-mini" works better for that purpose than "GPT 4.1.3"
fkyoureadthedoc: > I'm pretty sure "o3-mini" works better for that purpose than "GPT 4.1.3"<p>Why would the marketing team of all people call it GPT 4.1.3?
ryanhecht: Ugh, and some of the rows of that table are "sets of models" while some are singular models...there's the "Flagship models" section at the top only for "GPT models" to be heralded as "Our fast, versatile, high intelligence flagship models" in the NEXT section...<p>...I like "DALL·E" and "Whisper" as names a lot, though, FWIW :p
golly_ned: Yes, this $300Bn company generating +$3.4Bn in revenue needs to hire a marketing expert. They can begin by sourcing ideas from us here to save their struggling business from total marketing disaster.
winrid: At the least they should care more about UX. I have no idea how to restore the sidebar on chatgpt on desktop lol
Legend2440: Click the 'open sidebar' icon in the top left corner of the screen.
optimalsolver: >this $300Bn company<p>Watch this space.
avs733: Hype-based marketing can be effective, but it is high risk and unstable.<p>A marketing team isn't a generality that makes a company known; it often focuses on communicating which products different types of customers need from your lineup.<p>If I sell three medications:<p>Steve<p>56285<p>Priximetrin<p>and only tell you they are all painkillers, but for different types and levels of pain, I'm going to leave revenue on the floor. That is true no matter how valuable my business is or how well it's known.
TeMPOraL: > <i>How is anyone supposed to know what these model names mean?</i><p>Normies don't have to know - ChatGPT app focuses UX around capabilities and automatically picks the appropriate model for capabilities requested; you can see which model you're using and change it, but <i>don't need to</i>.<p>As for the techies and self-proclaimed "AI experts" - OpenAI is the leader in the field, and one of the most well-known and talked about tech companies in history. Whether to use, praise or criticize, this group of users is motivated to figure it out on their own.<p>It's the privilege of fashionable companies. They could name the next model ↂ-↊↋, and it'll take all of five minutes for everyone in tech (and everyone on LinkedIn) to learn how to type in the right Unicode characters.<p>EDIT: Originally I wrote \Omega-↊↋, but apparently HN's Unicode filter extends to Greek alphabet now? 'dang?
relaxing: What if you use ASCII 234? Ω
(edit: works!)
TeMPOraL: Thanks! I copied mine from Wikipedia (like I typically do with Unicode characters I rarely use), where it is also Ω - the same character. For a moment I was worried I somehow got it mixed up with the Ohm symbol but I didn't. Not sure what happened here.
koakuma-chan: Name is just a label. It's not supposed to mean anything.
ninetyninenine: Think how awesome the world would be if labels ALSO had meanings.
koakuma-chan: As someone else said in another thread, if you could derive the definition from a word, the word would be as long as the definition, which would defeat the purpose.
TeMPOraL: There is no such thing. "Meaning" isn't a property of a label, it arises from how that label is used with other labels in communication.<p>It's actually the reason LLMs work in the first place.
optimalsolver: You're gonna need to ground those labels in something physical at some point.<p>No one's going to let an LLM near anything important until then.
TeMPOraL: You only need it for bootstrapping. Fortunately, we've already done that when we invented first languages. LLMs are just bootstrapping off us.
layer8: Inscrutable naming is a proven strategy for muddying the waters.
jtwaleson: Salesforce would like a word...
SAI_Peregrinus: The USB-IF as well. Retroactively changing the name of a previous standard was particularly ridiculous. It's always been USB 3.1 Gen 1 like we've always been at war with Eastasia.
unsupp0rted: This is definitely intentional.<p>You can like Sama or dislike him, but he knows how to market a product. Maybe this is a bad call on his part, but it is a call.
thorum: Not really. They’re successful because they created one of the most interesting products in human history, not because they have any idea how to brand it.
marko-k: If that were the case, they’d be neck and neck with Anthropic and Claude. But ChatGPT has far more market share and name recognition, especially among normies. Branding clearly plays a huge role.
KeplerBoy: That's first mover advantage.
bobxmax: I think that has more to do with the multiple year head start and multiple tens of billions of dollars in funding advantage.
joshstrange: And you think that is due to their model naming?
cj: ChatGPT is still benefitting from first mover advantage. Which they’ve leveraged to get to the position they’re at today.<p>Over time, competitors catch up and first mover advantage melts away.<p>I wouldn’t attribute OpenAI’s success to any extremely smart marketing moves. I think a big part of their market share grab was simply going (and staying) viral for a long time. Manufacturing virality is notoriously difficult (and based on the usability and poor UI of ChatGPT early versions, it feels like they got lucky in a lot of ways)
jcheng: I prefer Anthropic's models but ChatGPT (the web interface) is far superior to Claude IMHO. Web search, long-term memory, and chat history sharing are hard to give up.
mrbungie: That's like making a second reading and appealing to authority.<p>The naming is bad. Other people already said it: you can "google" stuff, you can "deepseek" something, but to "chatgpt" sounds weird.<p>The model naming is even weirder. Like, did they really avoid o2 because of oxygen?
sumedh: > but to "chatgpt" sounds weird.<p>People just say it differently, they say "ask chatgpt"
mrbungie: Obviously they do. That's the whole point.
zamadatix: The -mini postfix makes perfect sense, probably even clearer than the old "turbo" wording. Naturally, the latest small model may be better than older, larger models... but not always in everything. What you'd expect from a -mini model is exactly what is delivered.<p>The non-reasoning line was also pretty straightforward. Newer base models get a larger prefix number, and some postfixes like 'o' were added to signal specific features in each model variant.<p>Where things went off the rails was specifically when they decided to also name the reasoning models 'o' as the prefix and a separate linear sequence. I wonder if we'll end up with both a 4o and o4...
lolinder: > I wonder if we'll end up with both a 4o and o4...<p>The perplexing thing is that <i>someone</i> has to have said that, right? It has to have been brought up in some meeting when they were brainstorming names that if you have 4o and o1 with the intention of incrementing o1 you'll eventually end up with an o4.<p>Where they really went off the rails was not just bailing when they realized they couldn't use o2. In that moment they had the chance to just make o1 a one-off weird name and go down a different path for its final branding.<p>OpenAI just struggles with names in general, though. ChatGPT was a terrible name picked by engineers for a product that wasn't supposed to become wildly successful, and they haven't really improved at it since.
viraptor: The obvious solution could be to just keep skipping the even numbers and go to o5.
arrowleaf: Or further the hype and name it o9.
observationist: They should be calling it ChatGPT and ChatGPT-mini, with other models hidden behind some sort of advanced mode power user menu. They can roll out major and minor updates by number. The whole point of differentiating between models is to get users to self limit the compute they consume - rate limits make people avoid using the more powerful models, and if they have a bad experience using the less capable models, or if they're frustrated by hopping between versions without some sort of nuanced technical understanding, it's just a bad experience overall.<p>OpenAI is so scattered they haven't even bothered using their own state of the art AI to come up with a coherent naming convention? C'mon, get your shit together.
TeMPOraL: "ChatGPT" (chatgpt-4o) is now its own model, distinct from gpt-4o.<p>As for self-limiting usage by non-power users, they're already doing that: ChatGPT app automatically picks a model depending on what capabilities you invoke. While they provide a limited ability to see and switch the model in use, they're clearly expecting regular users not to care, and design their app around that.
fourseventy: It's almost as bad as the Xbox naming scheme.
Someone1234: I don't know if anything is as bad as a games console named "Series."
siliconc0w: The real heated contest here amongst the top AI labs is to see who can come up with the most confusing product names.
not_a_bot_4sho: Someone dropped the ball with Phi models. There is clearly an opportunity for XP and Ultimate and X/S editions.
lja: I really think a "OpenAI Me" is what's needed.
baq: Personally waiting for the ME model. Should be great at jokes and humor.
tdb7893: It's nice to see Google finally having competition in a space it used to really dominate (though they definitely still are holding their own with all the Gemini naming). I feel like it takes real effort to have product names be this confusing and capricious
gundmc: Gemini naming seems pretty straightforward at this point. 2.0 is the full model, flash is a smaller/faster/cheaper model, and flash thinking is a smaller/faster/cheaper reasoning model with Cost.
TheOtherHobbes: Surprised Apple hasn't gone with iI Pro Max.
dilap: Haven't used openai in a bit -- whyyy did they change "system" role (now basically an industry-wide standard) to "developer"? That seems pointlessly disruptive.
logicchains: They mention in the model card, it's so that they can have a separate "system" role that the user can't change, and they trained the model to prioritise it over the "developer" role, to combat "jailbreaks". Thank God for DeepSeek.
sroussey: They should have just created something above system and left it as it was.
kaaskop: How's this compare to Mistral Small 3?
highfrequency: Anyone else confused by inconsistency in performance numbers between this announcement and the concurrent system card? <a href="https://cdn.openai.com/o3-mini-system-card.pdf" rel="nofollow">https://cdn.openai.com/o3-mini-system-card.pdf</a><p>For example-<p>GPQA diamond system card: o1-preview 0.68<p>GPQA diamond PR release: o1-preview 0.78<p>Also, how should we interpret the 3 different shading colors in the barplots (white, dotted, heavy dotted on top of white)...
kkzz99: Actually sounds like benchslop to me.
airstrike: Hopefully this is a big improvement from o1.<p>o1 has been very disappointing after spending sufficient time with Claude Sonnet 3.5. It's like it actively tries to gaslight me and thinks it knows more than I do. It's too stubborn and confidently goes off in tangents, suggesting big changes to parts of the code that aren't the issue. Claude tends to be way better at putting the pieces together in its not-quite-mental-model, so to speak.<p>I told o1 that a suggestion it gave me didn't work and it said "if it's still 'doesn't work' in your setup..." with "doesn't work" in quotes like it was doubting me... I've canceled my ChatGPT subscription and, when I really need to use it, just go with GPT-4o instead.
Deegy: I've also noticed that with cGPT.<p>That said I often run into a sort of opposite issue with Claude. It's very good at making me feel like a genius. Sometimes I'll suggest trying a specific strategy or trying to define a concept on my own, and Claude enthusiastically agrees and takes us down a 2-3 hour rabbit hole that ends up being quite a waste of time for me to back track out of.<p>I'll then run a post-mortem through chatGPT and very often it points out the issue in my thinking very quickly.<p>That said I keep coming back to sonnet-3.5 for reasons I can't perfectly articulate. Perhaps because I like how it fluffs my ego lol. ChatGPT on the other hand feels a bit more brash. I do wonder if I should be using o1 as my daily driver.<p>I also don't have enough experience with o1 to determine if it would also take me down dead ends as well.
ilaksh: It looks like a pretty significant increase on SWE-Bench. Although that makes me wonder if there was some formatting or gotcha that was holding the results back before.<p>If this will work for your use case then it could be a huge discount versus o1. Worth trying again if o1-mini couldn't handle the task before. $4/million output tokens versus $60.<p><a href="https://platform.openai.com/docs/pricing" rel="nofollow">https://platform.openai.com/docs/pricing</a><p>I am Tier 5 but I don't believe I have access to it in the API (at least it's not on the limits page and I haven't received an email). It says "rolling out to select Tier 3-5 customers" which means I will have to wait around and just be lucky I guess.
TechDebtDevin: Genuinely curious, What made you choose OpenAI as your preferred api provider? Its always been the least attractive to me.
TeMPOraL: Until recently they were the only game in town, so maybe they accrued significant spend back then?
ilaksh: I have mainly been using Claude 3.5/3.6 Sonnet via API in the last several months (or since 3.5 Sonnet came out). However, I was using o1 for a challenging task at one point, but last I tested it had issues with some extra backslashes for that application.<p>I also have tested with DeepSeek R1 and will test some more with that although in a way Claude 3.6 with CoT is pretty good. Last time I tried to test R1 their API was out.
ipaddr: Who else might be a good choice? Deepseek is down. Who has the cheapest GPT-3.5-level or above API?
TechDebtDevin: I've personally been using Deepseek (which has been better than 3.5 for a really long time), and Perplexity, which is nice for its built-in search. I've actually been using Deepseek since it was free. It's been generally good for me. I've mostly chosen both because of pricing, as I generally don't use APIs for extremely complex prompts.
eknkc: We extensively used the batch APIs to decrease cost and handle large amounts of data. I also need JSON responses for a lot of things, and OpenAI seems to have the best JSON schema output option out there.
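A minimal sketch of that JSON schema option, assuming OpenAI's Structured Outputs response_format; the schema and model choice here are just illustrative:

```python
# Minimal sketch of schema-constrained output, assuming OpenAI's Structured
# Outputs response_format. The schema below is made up for illustration.
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "company": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
    },
    "required": ["company", "sentiment"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Acme Corp beat earnings expectations."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "extraction", "strict": True, "schema": schema},
    },
)
print(resp.choices[0].message.content)  # should parse as the schema above
```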
TeMPOraL: Tier 3 here and already see it on Limits page, so maybe the wait won't be long.
georgewsinger: Did anyone else notice that o3-mini's SWE bench dropped from 61% in the leaked System Card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?<p>Am I missing something?
logicchains: Maybe they found a need to quantize it further for release, or lobotomise it with more "alignment".
kkzz99: Or the number was never real to begin with.
ben_w: > lobotomise<p>Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.<p>Why do people try to meme as if AI is different? It has unexpected outputs sometimes; getting it to not do that is 50% "more alignment" and 50% "hallucinate less".<p>Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.<p>And given these models do actually get stuff wrong, is it really <i>incorrect</i> for them to refuse to help with things that might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? <a href="https://www.corememory.com/p/a-young-man-used-ai-to-build-a-nuclear" rel="nofollow">https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...</a>
Rastonbury: They are implying the release was rushed and they had to reduce the functionality of the model in order to make sure it did not teach people how to make dirty bombs
jakereps: The caption on the graph explains.<p>> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card .<p>I have no idea what an "internal tools scaffold" is but the graph on the card that they link directly to specifies "o3-mini (tools)" where the blog post is talking about others.
DrewHintz: I'm guessing an "internal tools scaffold" is something like Goose: <a href="https://github.com/block/goose">https://github.com/block/goose</a><p>Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
anothermathbozo: I think this is with and without "tools." They explain it in the system card:<p>> We evaluate SWE-bench in two settings:
> * <i>Agentless</i>, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.<p>> * <i>o3-mini (tools)</i>, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
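If I'm reading the Agentless rule right, the pass@1 computation works roughly like this (my interpretation of the quoted text, not OpenAI's code):

```python
# Back-of-the-envelope reading of the Agentless pass@1 rule quoted above:
# per instance, average the pass rate over the samples that produced a valid
# (non-empty) patch; an instance with no valid patch counts as 0.
def pass_at_1(instances):
    """instances: list of lists of per-sample results, each 'pass'/'fail'/'invalid'."""
    scores = []
    for samples in instances:
        valid = [s for s in samples if s != "invalid"]
        if not valid:
            scores.append(0.0)  # never produced a usable patch
        else:
            scores.append(sum(s == "pass" for s in valid) / len(valid))
    return sum(scores) / len(scores)

# e.g. two instances, 5 tries each: (0.5 + 0.0) / 2 = 0.25
print(pass_at_1([["pass", "fail", "invalid", "pass", "fail"],
                 ["invalid"] * 5]))
```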
georgewsinger: Makes sense. Thanks for the correction.
Bjorkbat: So am I to understand that they used their internal tooling scaffold on the o3-mini (tools) results only? Because if so, I really don't like that.<p>While it's nonetheless impressive that they scored 61% on SWE-bench with o3-mini combined with their tool scaffolding, the Agentless comparison with other models seems less impressive: 40% vs 35% when compared to o1-mini if you look at the graph on page 28 of their system card PDF (<a href="https://cdn.openai.com/o3-mini-system-card.pdf" rel="nofollow">https://cdn.openai.com/o3-mini-system-card.pdf</a>).<p>It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still show a performance improvement, but it would look less exciting and more incremental.<p>Of course the real improvement is cost, but still, it kind of rubs me the wrong way.
pockmarked19: YC usually says “a startup is the point in your life where tricks stop working”.<p>Sam Altman is somehow finding this out now, the hard way.<p>Most paying customers will find out within minutes whether the models can serve their use case, a benchmark isn’t going to change that except for media manipulation (and even that doesn’t work all that well, since journalists don’t really know what they are saying and readers can tell).
OutOfHere: Wake me up when the full o3 is out.
therein: My guess is it will happen right after Sam Altman's next public freakout about how dangerous this new model they have in store is and how it tried to escape from its confinement and kidnap the alignment operator.
ls_stats: That's pretty much what Altman said about GPT-3 (or 2, I don't remember), he said it was too dangerous to release to the public.
msp26: I wish they'd just reveal the CoT (like gemini and deepseek do), it's very helpful to see when the model gets misled by something in your prompt. Paying for tokens you aren't even allowed to see is peak OpenAI.
kumarm: I ran some quick programming tasks I had previously used o1 for:<p>1. 1/4th the reasoning time for most tasks.<p>2. Far better results.
CamperBob2: Compared to o1 or o1-pro?
Bjorkbat: I have to admit I'm kind of surprised by the SWE-bench results. At the highest level of performance, o3-mini's CodeForces score is, well, high. I've honestly never really sat down to understand how Elo works; all I know is that it scored better than o1, which was allegedly better than ~90% of all competitors on CodeForces. So, you know, o3-mini is pretty good at CodeForces.<p>But its SWE-bench scores aren't meaningfully better than Claude's: 49.3 vs Claude's 49.0 on the public leaderboard (might be higher now due to recent updates?)<p>My immediate thoughts: CodeForces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks. Besides that, for all the work put into OpenAI's most recent model, it still has a hard time living up to an LLM initially released by Anthropic some time ago, at least according to this benchmark.<p>Mind you, the GitHub issues that the problems in SWE-bench were based off have been around long enough that it's pretty much a given that they've all found their way into the training data of most modern LLMs, so I'm really surprised that o3 isn't meaningfully better than Sonnet.
_boffin_: why is o1-pro not mentioned in there?
Oras: 200k context window<p>$1.1/m for input<p>$4.4/m for output<p>I assume thinking medium and hard would consume more tokens.<p>I feel the timing is bad for this release especially when deepseek R1 is still peaking. People will compare and might get disappointed with this model.
GaggiX: The model looks quite a bit better in the benchmarks so unless they overfit the model on them it would probably perform better than deepseek.
WiSaGaN: My vibe-check questions suggest otherwise. Even o3-mini-high is not as good as R1, even though it's faster than R1. Considering o3-mini is more expensive per token, it's not clear o3-mini-high is cheaper than R1 either, even if R1 probably consumes more tokens per answer.
IMTDb: I really don't get the point of those oX-mini models for chat apps. (The API is different; we can benchmark multiple models for a given recurring task and choose the best one, taking costs into consideration.) As part of my job, I am trying to promote usage of AI in my company (~150 FTE); we have an OpenAI ChatGPT Plus subscription for all employees.<p>Roughly speaking the message is: "use GPT-4o all the time, use o1 (soon o3) if you have more complex tasks". What am I supposed to answer when people ask "when am I supposed to use o3-mini?" And what the heck is o3-mini-high, how do I know when to use it?
People aren't gonna ask the same question to 5 different models and burn all their rate limits; yet it feels like that's what OpenAI is hoping people will do.<p>Put those weird models in a sub-menu for advanced users if you really want to, but if you can use o1 there is probably no reason for you to have o3-mini <i>and</i> o3-mini-high as additional options.
oezi: Why not promote o1? 4o is rather sloppy in comparison
IMTDb: 99% of what people use ChatGPT for is very mundane stuff. Think "translate this email to English", "fix spelling mistakes", "write this better for me". Data extraction (lists of emails) is big as well. You don't need o1 for that, and people make a lot of those requests per day.<p>Additionally, o1 does not have access to search or multimodality, and taking a screenshot of something and asking questions about it is also a big use case.<p>It's easy to overlook how widely ChatGPT is used for <i>very</i> small stuff. But compounded, it's still a game changer for many people.
xinayder: "oh no DeepSeek copied our product it's not fair"<p>> proceeds to release a product based on DeepSeek<p>ah, alas the hypocrisy...
feznyng: o3 was announced in December. R1 arguably builds off the rumored approach of o1 (LLM + RL) although with major efficiency gains. I'm not a big fan of OpenAI but it's the other way around.
Rooster61: The thing they previewed back in December before the whole Deepseek kerfuffle this week?<p>Don't get me wrong, I'm laughing at OpenAI just like everyone else, but if they were really copying Deepseek, they'd be releasing a smaller model distilled from Deepseek API responses, and have it be open source to boot. This is neither
yapyap: They sure scrambled something together after DeepSeek swept the market.
GoatInGrey: Indeed. Everyone knows that one can cobble together a frontier model and deploy it within three weeks.
TechDebtDevin: Not to mention the model has been available to researchers for a month.
mhb: Maybe they can get some advice from the AWS instance naming group.
og_kalu: R1 seems to be the only one of these reasoning models that has had gains on the creative writing side.
throwaway314155: Typical OpenAI release announcement where it turns out they're _actually_ doing some sort of delayed rollout and despite what the announcement says, no - you can't use o3-mini today.
feverzsj: It's been a dead end for a while now, as they can't improve on o1 meaningfully anymore. The market is also losing patience quickly.
czk: im just glad it looks like o3-mini finally has internet access<p>the o1 models were already so niche that i never used them, but not being able to search the web made them even more useless
oytis: Let me guess - everyone is mindblown.
estsauver: I couldn't find anything in the documentation that describes the relative number of tokens you get for low/medium/high. I'm curious if anyone can find that; I'd be curious to see how it plays out relative to DeepSeek's thinking sections.
isusmelj: Does anyone know why GPT4 has knowledge cutoff December 2023 and all the other models (newer ones like 4o, O1, O3) seem to have knowledge cutoff October 2023?
<a href="https://platform.openai.com/docs/models#o3-mini" rel="nofollow">https://platform.openai.com/docs/models#o3-mini</a><p>I understand that keeping the same data and curating it might be beneficial. But it sounds odd to roll back in time with the knowledge cutoff. AFAIK, the only event that happened around that time was the start of the Gaza conflict.
kikki: I think trained knowledge is less and less important - as these multi-modal models have the ability to search the web and have much larger context windows.
andrewstuart: I find Claude to be vastly better than any OpenAI model as a programming assistant.<p>In particular the "reasoning" models just seem to be less good and more slow.
chad1n: I think that OpenAI should reduce the prices even further to be competitive with Qwen or Deepseek. There are a lot of vendors offering Deepseek R1 for $2-2.5 per 1 million tokens output.
othello: Would you have specific recommendations of such vendors?
chad1n: For example, `<a href="https://deepinfra.com/" rel="nofollow">https://deepinfra.com/</a>` which asks for $2.5 per million on output or <a href="https://nebius.com" rel="nofollow">https://nebius.com</a> which asks for $2.4 per million output tokens.
druskacik: Well, it's $2.19 per million output tokens even directly on deepseek platform.<p><a href="https://api-docs.deepseek.com/quick_start/pricing/" rel="nofollow">https://api-docs.deepseek.com/quick_start/pricing/</a>
rsanek: If you want reliable service you're going to pay more around $7~8 per million tokens. Sister commenters mention providers that are considered unstable <a href="https://openrouter.ai/deepseek/deepseek-r1" rel="nofollow">https://openrouter.ai/deepseek/deepseek-r1</a>
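To put those per-million rates in per-request terms, here is a tiny back-of-the-envelope calculator; the rates are just the figures quoted in this subthread (DeepSeek's own API, DeepInfra, Nebius, and the rough $7-8 estimate for more reliable hosts), not official price lists:<p><pre><code> # Rough output-token cost per request at the rates quoted above.
 RATES_PER_M_OUTPUT = {
     "deepseek api": 2.19,          # figure quoted from api-docs.deepseek.com
     "deepinfra": 2.50,
     "nebius": 2.40,
     "reliable hosts (est.)": 7.50,
 }

 def output_cost(tokens, rate_per_million):
     """Dollar cost for `tokens` output tokens at a $/1M-token rate."""
     return tokens / 1_000_000 * rate_per_million

 # Example: a fairly verbose reasoning answer of ~3,000 output tokens.
 for provider, rate in RATES_PER_M_OUTPUT.items():
     print(f"{provider:>22}: ${output_cost(3_000, rate):.4f}")
</code></pre>
Even at the high end, one long answer costs a couple of cents; the differences mostly matter at batch or API scale.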
secondcoming: Anyone else stuck in a Cloudflare 'verify you're a human' doom loop?
tempeler: They offered a discount; very impressive; they probably found a very efficient way, hence the discount. I guess there's no need to build a very large nuclear power plant or a $9 trillion chip factory to run a single large language model. Efficiency has skyrocketed, or, thanks to competition, all of OpenAI's problems were solved.
jen729w: > Testers preferred o3-mini's responses to o1-mini 56% of the time<p>I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.<p>They're both 2,000 words, and I asked a question because I have something to do. <i>I'm not reading them both</i>; I'm usually just selecting the one that answered first.<p>That prompt is pointless. Perhaps as evidenced by the essentially 50% response rate: it's a coin-flip.
danielmarkbruce: RLUHF, U = useless.
brookst: Those prompts are so irritating and so frequent that I’ve taken to just quickly picking whichever one looks worse at a cursory glance. I’m paying them, they shouldn’t expect high quality work from me.
AznHisoka: Same, I always intentionally choose the one that’s worse. They wanna make the best AI possible, go right ahead. Just don’t expect me to aid you.
apparent: Have you considered the possibility that your feedback is used to choose what type of response to give to you specifically in the future?<p>I would not consider purposely giving inaccurate feedback for this reason alone.
jackbrookes: Yes I'd bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate
cgriswald: I read the one on the left but choose the shorter one.<p>The interface wastes so much screen real estate already and the answers are usually overly verbose unless I've given explicit instructions on how to answer.
johnneville: they also pay contractors to do these evaluations with much more detailed metrics; no idea which of these their number is based on, though.
dkjaudyeqooe: It's kind of strange that they gave that stat. Maybe they thought people would somehow think about "56% better" or something.<p>Because when you think about it, it really is quite damning. Minus statistical noise it's no better.
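The "56% of the time" figure is indeed hard to read without a sample size, which OpenAI doesn't give. A quick normal-approximation check (the sample sizes below are assumptions, not OpenAI's actual n) shows roughly how many pairwise comparisons are needed before 56% is distinguishable from a coin flip:<p><pre><code> # How far is a 56% preference rate from a 50/50 coin flip?
 # Normal approximation to the binomial; the n values are assumed, not OpenAI's.
 from math import sqrt

 def z_score(p_hat, n, p0=0.5):
     """Z statistic for an observed proportion p_hat over n trials vs. null p0."""
     return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

 for n in (100, 300, 1000, 5000):
     print(f"n={n:>5}: z = {z_score(0.56, n):.2f}")

 # z above ~1.96 clears the usual 5% threshold, which here needs n of roughly
 # 270 or more, so the stat is only meaningful if the eval set was that large.
</code></pre>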
mikeInAlaska: Maybe we should take both answers, paste them into a new chat and ask for a summary amalgamation of them
EcommerceFlow: First thing I noticed on API and Chat for it is THIS THING IS FAST. That alone makes it a huge upgrade to o1-pro (not really comparable I know, just saying). Can't imagine how much I'll get done with this type of speed.
GaggiX: The API pricing is almost exactly double the deepseek ones.
mise_en_place: Too little too late IMO. This is not impressive at all, what am I missing here?
ben_w: There's only two kinds of software, prototype and obsolete.<p>I was taught that last millennium.
jstummbillig: Idk, everything: The price point + performance?
sumedh: > This is not impressive at all, what am I missing here?<p>Compared to?
RobinL: Wow - this is seriously fast (o3-mini), and my initial impressions are very favourable. I asked it to lay out quite a complex HTML form from a schema and it did a very good job.<p>Looking at the comments on here and the benchmark results I was expecting it to be a bit meh, but initial impressions are quite the opposite.<p>I was expecting it to perhaps be a marginal improvement for complex things that need a lot of 'reasoning', but it seems it's a big improvement for simple things that you need done fast.
AISnakeOil: The naming convention is so messed up. o1, o3-mini (no o2, no o3???)
igravious: <a href="https://www.perplexity.ai/search/new?q=list%20of%20all%20OpenAI%27s%20models" rel="nofollow">https://www.perplexity.ai/search/new?q=list%20of%20all%20Ope...</a> :)<p>OpenAI has developed a variety of models that cater to different applications, from natural language processing to image generation and audio processing. Here’s a comprehensive list of the current models available:<p><pre><code> ## Language Models
- *GPT-4o*: The flagship model capable of processing text, images, and audio.
- *GPT-4o mini*: A smaller, more cost-effective version of GPT-4o.
- *GPT-4*: An advanced model that improves upon GPT-3.5.
- *GPT-3.5*: A set of models that enhance the capabilities of GPT-3.
- *GPT-3.5 Turbo*: A faster variant designed for efficiency in chat applications.
## Reasoning Models
- *o1*: Focused on reasoning tasks with improved accuracy.
- *o1-mini*: A lightweight version of the o1 model.
- *o3*: The successor to o1, currently in testing phases.
- *o3-mini*: A lighter version of the o3 model.
## Audio Models
- *GPT-4o audio*: Supports real-time audio interactions and audio generation.
- *Whisper*: For transcribing and translating speech to text.
## Image Models
- *DALL-E*: Generates images from textual descriptions.
## Embedding Models
- *Embeddings*: Converts text into numerical vectors for similarity tasks.
- *Ada*: An embedding model with various sizes (e.g., ada-002).
## Additional Models
- *Text to Speech (Preview)*: Synthesizes spoken audio from text.
</code></pre>
These models are designed for various tasks, including coding assistance, image generation, and conversational AI, making OpenAI's offerings versatile for developers and businesses alike[1][2][4][5].<p>Citations:<p><pre><code> [1] https://learn.microsoft.com/vi-vn/azure/ai-services/openai/concepts/models
[2] https://platform.openai.com/docs/models
[3] https://llm.datasette.io/en/stable/openai-models.html
[4] https://en.wikipedia.org/wiki/OpenAI_API
[5] https://industrywired.com/open-ai-models-list-top-models-to-consider/
[6] https://holypython.com/python-api-tutorial/listing-all-available-openai-models-openai-api/
[7] https://en.wikipedia.org/wiki/GPT-3
[8] https://stackoverflow.com/questions/78122648/openai-api-how-do-i-get-a-list-of-all-available-openai-models/78122662</code></pre>
ben_w: There's an o1-mini, there's an o3 it just hasn't gone live yet: <a href="https://openai.com/12-days/#day-12" rel="nofollow">https://openai.com/12-days/#day-12</a><p>they can't call it o2 because: <a href="https://en.wikipedia.org/wiki/The_O2_Arena" rel="nofollow">https://en.wikipedia.org/wiki/The_O2_Arena</a><p>and the venue's sponsor: <a href="https://en.wikipedia.org/wiki/O2_(UK)" rel="nofollow">https://en.wikipedia.org/wiki/O2_(UK)</a>
sumedh: o3 will come later.<p>o2 was not selected because there is already another brand with that name in the UK.
thimabi: Does anyone know the current usage limits for o3-mini and o3-mini-high when used through the ChatGPT interface? I tried to find them on the OpenAI Knowledgebase, but couldn’t find anything about that.
scarface_74: This took 1:53 in o3-mini<p><a href="https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360d3" rel="nofollow">https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360...</a><p>The 4o model without using the Python tool<p><a href="https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0d3" rel="nofollow">https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0...</a><p>Trying to get accurate results with the paid version of 4o with the Python interpreter.<p><a href="https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b870f" rel="nofollow">https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b87...</a><p>The share link doesn’t show the output for some reason. But it did work correctly. I don’t know whether the ages are correct. I was testing whether it could handle ordering<p>I have no idea what conclusion I should draw from this besides depending on the use case, 4o may be better with “tools” if you know your domain where you are using it.<p>Tools are relatively easy to implement with LangChain or the native OpenAI SDK.
margalabargala: The 4o model is blatantly incorrect. I'm not going to look up if it's the order or the ages that are incorrect, but:<p>36. Abraham Lincoln – 52 years, 20 days (1861)<p>37. James Garfield – 49 years, 105 days (1881)<p>38. Lyndon B. Johnson – 55 years, 87 days (1963)<p>Basically everything after #15 in the list is scrambled.
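On the “tools” point above: a minimal function-calling sketch with the native OpenAI Python SDK looks roughly like the following. The tool name, its schema, and the tiny lookup table are hypothetical stand-ins (the Lincoln/Garfield ages are the ones cited in the reply); in a real app the function would query an actual data source, which is exactly why tool use sidesteps the model's shaky recall of details like inauguration ages:<p><pre><code> # Minimal function-calling sketch with the OpenAI Python SDK (hypothetical tool).
 import json
 from openai import OpenAI

 client = OpenAI()

 AGES = {"Abraham Lincoln": 52, "James Garfield": 49}  # stand-in data source

 tools = [{
     "type": "function",
     "function": {
         "name": "age_at_inauguration",
         "description": "Return a US president's age in years at first inauguration.",
         "parameters": {
             "type": "object",
             "properties": {"name": {"type": "string"}},
             "required": ["name"],
         },
     },
 }]

 messages = [{"role": "user",
              "content": "How old was Abraham Lincoln when he took office?"}]
 resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
 call = resp.choices[0].message.tool_calls[0]
 args = json.loads(call.function.arguments)

 # Run the tool ourselves, then hand the result back for the final answer.
 result = AGES.get(args["name"], "unknown")
 messages += [resp.choices[0].message,
              {"role": "tool", "tool_call_id": call.id, "content": str(result)}]
 final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
 print(final.choices[0].message.content)
</code></pre>
The sketch assumes the model actually emits a tool call for this prompt; production code would check resp.choices[0].message.tool_calls before indexing into it.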
simonw: I just pushed a new release of my LLM CLI tool with support for the new model and the reasoning_effort option: <a href="https://llm.datasette.io/en/stable/changelog.html#v0-21" rel="nofollow">https://llm.datasette.io/en/stable/changelog.html#v0-21</a><p>Example usage:<p><pre><code> llm -m o3-mini 'write a poem about a pirate and a walrus' \
-o reasoning_effort high
</code></pre>
Output (comparing that with the default reasoning effort): <a href="https://github.com/simonw/llm/issues/728#issuecomment-2628321507">https://github.com/simonw/llm/issues/728#issuecomment-262832...</a><p>(If anyone has a better demo prompt I'd love to hear about it)
Summarize the themes of the opinions expressed here. For each theme, output a markdown header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece
Below are several major themes running through the discussion. Each theme is introduced as a header and enriched with direct quotes (with author attribution) drawn from participants where appropriate. At the end is a section highlighting a few opinions that are more uncommon compared to the overall sentiment.
Many commenters expressed frustration—and some amusement—over OpenAI’s naming conventions. There is widespread confusion over the multiple “o1,” “o3-mini,” “4o,” and similar labels, with several users suggesting that a simpler scheme would reduce the mental overhead. For example:
- "I think OpenAI really needs to rethink its product naming, especially now that they have a portfolio where there's no such clear hierarchy…" – sho_hn
- "I honestly confuse them all the time." – kaaskop
- "It’s already a game changer for many people. But to have so many names like o1, o3-mini, GPT-4o, & GPT-4o-mini suggests there may be too much focus on internal tech details rather than clear communication." (paraphrase based on multiple similar sentiments)
Some compared the naming challenge to that faced by other tech brands:
- "It’s like AWS SKU naming (
c5d.metal
,p5.48xlarge
, etc.), except non-technical consumers are expected to understand it." – echelon
Others poked fun at the idea of excessively “fanciful” names:
- "Maybe they could start selling 'season passes' next to make their offering even more clear!" – TeMPOraL
Overall, while most agree that innovation drives excitement, the muddled product labels create barriers for both developers and everyday users.
A significant strand of the comments revolves around comparing the performance of the OpenAI models with each other and with competitors (such as DeepSeek, Gemini, and Claude). Many noted that while newer models are touted as “game changers,” the actual performance improvements are sometimes underwhelming or inconsistent across benchmarks:
- "For non-stem perhaps. For math/coding problems, o3 mini is tied if not better than o1." – ALittleLight
- "The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred." – patrickhogan1
Some users also debated the significance of benchmark metrics and the potential manipulation of data:
- "It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still paint a performance improvement, but it look less exciting and more incremental." – Bjorkbat
Others questioned how improvements in speed versus reasoning tokens, cost per token, and other factors weigh into overall model utility:
- "I really wish they would open up the reasoning effort toggle on o1 API. o1 Pro Mode is still the best overall model I have used for many complex tasks." – ActVen
Benchmark comparisons, especially in coding and reasoning tasks, remain a contentious and nuanced topic across the discussion.
Another prominent thread is the discussion contrasting DeepSeek with OpenAI’s models. Several users highlighted performance, usability, and cost differences between these platforms. While DeepSeek is seen by some as an impressive, fast, and even “cool” research project, others note its limitations in reliability and real-world application:
- "I’ve had DeepSeek’s models consistently misinterpret prompts, or confuse data in the prompts with instructions. Those are both very important things that make DeepSeek useless for real world applications." – fpgaminer
- "Deepseek is great because: 1) you can run the model locally, 2) the research was openly shared, and 3) the reasoning tokens are open." – fpgaminer
- "DeepSeek is the state of the art right now in terms of performance and output. It's really fast. The way it 'explains' how it's thinking is remarkable." – Synaesthesia
Some compare pricing as well as how each platform handles data and debug information, noting that while DeepSeek’s policies might be a challenge for organizations concerned with data privacy, OpenAI’s offerings in consistency and edge-case handling are still valued.
The conversation touches on how OpenAI’s models are designed and how they make use of user data. There is some debate about whether OpenAI trains on user-submitted data compared to models like DeepSeek. For instance:
- "I don’t think OpenAI is training on your data. At least they say they don’t, and I believe that." – ilaksh
But contrasting viewpoints are represented:
- "So for posterity, in this subthread we found that OpenAI indeed trains on user data and it isn’t something that only DeepSeek does." – therein
Users also discussed differences in how the products expose (or hide) chain-of-thought explanations, with some lamenting that OpenAI hides the reasoning tokens, whereas competitors may offer more transparency:
- "the fact that openai hides the reasoning tokens from us to begin with shows that what they are doing behind the scenes isn’t all that impressive, and likely easily cloned (r1)" – czk
The theme hints at broader concerns about user trust, privacy, and the trade-offs between transparency and security in commercially deployed models.
Cost is a recurring concern, with several commenters comparing pricing models and cost efficiency between OpenAI and competitors such as DeepSeek. The discussion is replete with commentary on token pricing and the challenge of balancing cost with performance improvements.
- "I really don’t get the point of those oX-mini models for chat apps. ... What am I supposed to answer when people ask 'when am I supposed to use o3-mini?'" – IMTDb
- "The API pricing is almost exactly double the DeepSeek ones." – GaggiX
Some doubt that incremental upgrades justify higher costs:
- "With DeepSeek I heard OpenAI saying the plan was to move releases on models that were meaningfully better than the competition. Seems like what we’re getting is the scheduled releases that are worse than the current versions." – ALittleLight
The overarching sentiment is that while advancements are celebrated, they are only compelling if improvements justify the added expense.
Many commenters discuss the practicalities of using these models, noting both the advantages and drawbacks of their current interfaces. Points of interest include the ability to pause a model’s chain-of-thought for debugging and the overall UX challenges posed by rapidly shifting features:
- "Being able to see the thinking trace in R1 is so useful, as you can go back and see if it's getting stuck, making a wrong assumption, missing data, etc." – mechagodzilla
- "Running it locally lets you INTERJECT IN ITS THINKING IN REALTIME and I cannot stress how useful that is." – thot_experiment
Others note design issues with the product interfaces:
- "At the least they should care more about UX. I have no idea how to restore the sidebar on chatgpt on desktop lol" – winrid
These comments underscore that beyond raw performance, practical usability—in terms of customization, feedback, and debugging—is crucial for advanced users.
A number of participants weigh in on the competitive landscape. Many express that while OpenAI has historically enjoyed a first-mover advantage, competitors are now catching up or even exceeding its offerings in specific areas:
- "Anyone can write very fast software if you don’t mind it sometimes crashing or having weird bugs." – ben_w (in reference to benchmark manipulation and performance claims)
- "OpenAI isn’t the only company. The Tech companies being beaten massively by Microsoft in #of H100s purchases are the ones with a moat." – lesuorac
Overall, while OpenAI’s branding and viral success are acknowledged, there is a sentiment that a rapidly evolving competitive market means no provider can rest on past laurels.
Several comments also target the marketing strategies and overall corporate messaging from OpenAI. Some feel that the elaborate model labels are as much a product of marketing as they are technical, while others suggest that clearer, consumer-friendly names (a la Apple’s use of names like Sonoma, Ventura, and Yosemite) could be more effective:
- "I’m pretty sure 'o3-mini' works better for that purpose than 'GPT 4.1.3'." – TeMPOraL
- "OpenAI needs a new branding scheme." – forrestthewoods
The debate here revolves around whether the current naming is an intended and effective way to generate hype despite its inherent confusion.
While most voices discuss naming, performance, cost, and usability in somewhat measured terms, a few comments stand out as less common or even contrarian:
- "Wow, it got to the top of the front page so fast! Weird!" – vincentpants – A brief, almost flippant observation compared to the detailed technical debates elsewhere.
- "openai is no longer relevant." – GoatInGrey – An aggressive statement that contrasts with the majority who still see value in the innovations, even if minor.
- "I’m going to assume the best in your question and disregard your statement." – JadoJodo – A dismissive tone that is rare among the more detail-focused critiques.
- "I’d bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate." – jackbrookes – This quip injects humor into an otherwise technical discussion about evaluation metrics.
- "Perhaps you didn’t realize: Deepseek is an open weights model and you can use it via the inference provider of your choice..." – anon373839 – Highlighting freedom from proprietary shackles, this opinion emphasizes a philosophy not widely echoed by others in the thread.
These less common voices—ranging from dismissive to humorously critical—provide a counterpoint to the mainstream narratives and serve as a reminder that within even a highly technical debate, a diversity of perspectives flourishes.
Overall, the conversation is a vibrant mix of technical evaluation, cost-benefit analysis, and branding misgivings—all set against the larger backdrop of an increasingly competitive AI market. Each theme, supported by direct quotations, reveals the multifaceted challenges and passionate viewpoints that come with rapidly evolving AI technologies.
18,936 input, 2,905 output, {"completion_tokens_details": {"reasoning_tokens": 704}}
Cost: 3.3612 cents according to https://tools.simonwillison.net/llm-prices
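Those numbers are internally consistent with per-million rates of roughly $1.10 for input and $4.40 for output tokens; treat the rates as inferred from the quoted figure rather than as an official price list. A quick check:<p><pre><code> # Reproduce the quoted cost from the token counts above.
 # The $/1M rates are inferred from the 3.3612-cent figure, not an official list.
 INPUT_RATE = 1.10   # dollars per 1M input tokens (assumed)
 OUTPUT_RATE = 4.40  # dollars per 1M output tokens (assumed)

 cost = 18_936 / 1e6 * INPUT_RATE + 2_905 / 1e6 * OUTPUT_RATE
 print(f"{cost * 100:.4f} cents")  # -> 3.3612 cents
</code></pre>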