### Show Notes
- [Michelle Pokrass](https://x.com/michpokrass)
- [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs)
- [JSON mode](https://platform.openai.com/docs/guides/structured-outputs/json-mode)
- [Dev Day Recap](https://www.latent.space/p/devday)
- [Grammar-Constrained Decoding](https://arxiv.org/abs/2305.13971)
- [pgBouncer](https://www.pgbouncer.org/)
- [Zod](https://zod.dev/)
- [Instructor](https://github.com/jxnl/instructor)
- [Backus–Naur form](https://www.sciencedirect.com/topics/computer-science/backus-naur-form)
- [Lenny Bogdonoff](https://x.com/rememberlenny)
- [Gorilla BFCL](https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html)
- [AI's XKCD](https://xkcd.com/1425/)
- [The Making of the Prince of Persia](https://www.jordanmechner.com/en/books/journals/)
- [Richard Thaler's Misbehaving](https://www.amazon.com/Misbehaving-Behavioral-Economics-Richard-Thaler/dp/039335279X)
- [Hack the North](https://hackthenorth.com/)
### Timestamps
- [00:00:00] Introductions
- [00:06:37] Joining OpenAI pre-ChatGPT
- [00:08:21] ChatGPT release and scaling challenges
- [00:09:58] Structured Outputs and JSON mode
- [00:11:52] Structured Outputs vs JSON mode vs Prefills
- [00:17:08] OpenAI API / research teams structure
- [00:18:12] Refusal field and why the HTTP spec is limiting
- [00:21:23] ChatML & Function Calling
- [00:27:42] Building agents with structured outputs
- [00:30:52] Use cases for structured outputs
- [00:38:36] Roadmap for structured outputs
- [00:42:06] Fine-tuning and model selection strategies
- [00:48:13] OpenAI's mission and the role of the API
- [00:49:32] War stories from the trenches
- [00:51:29] Assistants API updates
- [00:55:48] Relationship with the developer ecosystem
- [00:58:08] Batch API and its use cases
- [01:00:12] Vision API
- [01:02:07] Whisper API
- [01:04:30] Advanced voice mode and how that changes DX
- [01:05:27] Enterprise features and offerings
- [01:06:09] Personal insights on Waterloo and reading recommendations
- [01:10:53] Hiring and qualities that succeed at OpenAI
### Transcript
**Alessio** [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO-in-Residence at [Decibel Partners](https://decibel.vc), and I'm joined by my co-host Swyx, founder of [Smol AI](https://smol.ai).
**Swyx** [00:00:13]: Hey, and today we're excited to be in the in-person studio with Michelle. Welcome.
**Michelle** [00:00:18]: Thanks for having me. Very excited to be here.
**Swyx** [00:00:20]: This has been a long time coming. I've been following your work on the API platform for a little bit, and I'm finally glad that we could make this happen after you shipped Structured Outputs. How does that feel?
**Michelle** [00:00:31]: Yeah, it feels great. We've been working on it for quite a while, so very excited to have it out there and have people using it.
**Swyx** [00:00:37]: We'll tell the story soon, but I want to give people a little intro to your backgrounds. So you've interned and worked at Google, Stripe, Coinbase, Clubhouse, and obviously OpenAI. What was that journey like? The one that has the most appeal to me is Clubhouse because that was a very, very hot company for a while. How do you seem to join companies when they're about to scale up really a lot? And obviously OpenAI has been the latest, but yeah, just what are your learnings and your history going into all these notable companies?
**Michelle** [00:01:06]: Yeah, totally. For a bit of my background, I'm Canadian. I went to the University of Waterloo, and there you do like six internships as part of your degree. So I started... Actually, my first job was really rough. I worked at a bank, and I learned Visual Basic, and I animated bond yield curves, and it was, you know... Me too.
**Swyx** [00:01:24]: Oh, really? Yeah. I was a derivative trader. Interest rate swaps, that kind of stuff.
**Michelle** [00:01:28]: Awesome. Yeah. Yeah. So I liked having a job, but I didn't love that job. And then my next internship was Google, and I learned so much there. It was tremendous. But I had a bunch of friends that were into startups more, and Waterloo is a big startup culture. And one of my friends interned at Stripe, and he said it was super cool. So that was kind of my... I also was a little bit into crypto at the time, and then I got into it on Hacker News, and so Coinbase was on my radar. And so that was my first real startup opportunity was Coinbase. I think I've never learned more in my life than in the four-month period when I was interning at Coinbase. They actually put me on call. I worked on the ACH rails there, and it was absolutely crazy. You know, crypto was a very formative experience. Yeah.
**Swyx** [00:02:08]: This is 2018 to 2020, kind of like the first big wave.
**Michelle** [00:02:11]: That was my full-time. Yeah. But I was there as an intern in 2016. Yeah. And so that was the period where I really learned to become an engineer, learned how to use Git, got on call right away, managed production databases and stuff, so that was super cool. After that, I went to Stripe and kind of got a different flavor of payments on the other side. Learned a lot. I was really inspired by the Colsons. And then my next internship after that, I actually started a company at Waterloo. So there's this thing you can do, it's an entrepreneurship co-op, and I did it with my roommate. The company's called Readwise, which still exists, but... Yeah, yeah.
**Alessio** [00:02:43]: Everyone uses Readwise.
**Swyx** [00:02:44]: What? You co-founded Readwise?
**Michelle** [00:02:46]: Yeah.
**Alessio** [00:02:47]: Awesome, I'm a premium user.
**Swyx** [00:02:48]: It's not even on your LinkedIn?
**Michelle** [00:02:51]: Yeah. I mean, I only worked on it for about a year. And so Tristan and Dan are the real founders, and I just had an interlude there. But yeah, really loved working on something very startup-focused, user-focused, and hacking with friends. It was super fun. Eventually, I decided to go back to Coinbase and really get a lot better as an engineer. I didn't feel equipped to be a CTO of anything at that point, and so just learned so much at Coinbase. And that was a really fun curve. But yeah, after that, I went to Clubhouse, which was a really interesting time. So I wouldn't say that I went there before it blew up. I would say I went there as it blew up. So not quite the startling track record that it might seem, but it was a super exciting place. I joined as the second or third backend engineer, and we were down every day, basically. One time Oprah came on and absolutely everything melted down, and so we would have a stand-up every morning and be like, how do we make everything stay up? Which is super exciting. Also, one of the first things I worked on there was making our notifications go out more quickly. Because when you join a Clubhouse room, you need everyone to come in right away so that it's exciting, and the person speaking thinks a lot of my audience is here. But when I first joined, I think it would take like 10 minutes for all the notifications to go out, which is insane. By the time you want to start talking to the time your audience is there, you can totally kill the room. So that's one of the first things I worked on, is making that a lot faster and keeping everything up.
**Swyx** [00:04:11]: I mean, so already we have an audience of engineers. Those two things are useful. Keeping things up and notifications out. Notifications like, is it a Kafka topic?
**Michelle** [00:04:19]: It was a Postgres shop, and you had all of the followers in Postgres, and you needed to iterate over the followers and figure out, is this a good notification to send? And so all of this logic, it wasn't well-batched and parallelized, and our job queuing infrastructure wasn't right. And so there was a lot of fixing all of these things. Eventually, there were a lot of database migrations because Postgres just wasn't scaling well for us.
**Alessio** [00:04:40]: Interesting.
**Swyx** [00:04:41]: So keeping things up, that was more of a, I don't know, reliability issue, SRE type?
**Michelle** [00:04:47]: A lot of it, yeah, it goes down to database stuff.
**Swyx** [00:04:51]: Everywhere I've worked- It's all databases. Yeah. Indexing.
**Michelle** [00:04:55]: Actually, at Coinbase, at Clubhouse, and at OpenAI, Postgres has been a perennial challenge. It's like the stuff you learn at one job carries over to all the others because you're always debugging a long-running Postgres query at 3 a.m. for some reason. So those skills have really carried me forward, for sure.
**Alessio** [00:05:12]: Why do you think that not as much of this is prioritized? Obviously, Postgres is an open-source project that's not aimed at gigascale, but you would think somebody would come around and say, hey, we're like the- Yeah.
**Michelle** [00:05:22]: I think that's what Planetscale is doing. It's not on Postgres, I think. It's on MySQL. But I think that's the vision. It's like they have zero downtime migrations, and that's a big pain point. I don't know why no one is doing this on Postgres, but I think it would be pretty cool.
**Swyx** [00:05:37]: Their connection puller, like pgBouncer, is good enough?
**Michelle** [00:05:40]: I don't know. I mean, I've run pgBouncer everywhere, and there's still a lot of problems.
**Swyx** [00:05:45]: Your scale is something that not many people see. Yeah.
**Michelle** [00:05:49]: I mean, at some point, every successful company gets to the scale where Postgres is not cutting it, and then you migrate to some sort of NoSQL database. And that process I've seen happen a bunch of times now.
**Swyx** [00:05:59]: MongoDB, Redis, something like that.
**Michelle** [00:06:01]: Yeah. I mean, we're on Azure now, and we use Cosmos DB.
**Swyx** [00:06:06]: Cosmos DB, hey!
**Michelle** [00:06:07]: At Clubhouse, I really love DynamoDB. It's probably my favorite database, which is like a very nerdy sentence, but that's the one I'm using if I need to scale something as far as it goes.
**Swyx** [00:06:16]: Yeah. DynamoDB, when I worked at AWS briefly, and it's kind of like the memory register for the web. Yes. If you treat it just as physical memory, you will use it well. If you treat it as a real database, you might run into problems. Right.
**Michelle** [00:06:31]: You have to totally change your mindset when you're going from Postgres to Dynamo. But I think it's a good mindset shift and kind of makes you design things in a more scalable way.
**Swyx** [00:06:37]: Yeah. I'll recommend the DynamoDB book for people who need to use DynamoDB. But we're not here to talk about AWS. We're here to talk about OpenAI. You joined OpenAI pre-ChatGPT. I also had the option to join, and I didn't. What was your insight?
**Michelle** [00:06:50]: Yeah. I think a lot of people who joined OpenAI joined because of a product that really gets them excited, and for most people, it's ChatGPT. But for me, I was a daily user of Copilot, GitHub Copilot, and I was so blown away at the quality of this thing. I actually remember the first time seeing it on Hacker News and being like, wow, this is absolutely crazy. This is going to change everything. And I started using it every day. It just really, even now when I don't have service and I'm coding without Copilot, it's just like a 10x difference. So I was really excited about that product. I thought now is maybe the time for AI. And I'd done some AI in college and thought some of those skills would transfer. And I got introduced to the team. I liked everyone I talked to, so I thought that would be cool. Why didn't you join?
**Swyx** [00:07:30]: It was like, I was like, is DALL-E it? We were there.
**Alessio** [00:07:35]: We were at the DALL-E launch thing, and I think you were talking with Lenny, and Lenny was at OpenAI at the time.
**Swyx** [00:07:41]: We don't have to go into too much detail, but this was one of my biggest regrets of my life.
**Alessio** [00:07:46]: No, no, no.
**Swyx** [00:07:47]: But I was like, okay, I mean, I can create images. I don't know if this is the thing to dedicate, but obviously you had a bigger vision than I did.
**Michelle** [00:07:55]: DALL-E was really cool, too. I remember first showing my family, I was like, I'm going to this company, and here's one of the things they do. And it really helped bridge the gap. I still haven't figured out how to explain to my parents what crypto is. My mom, for a while, thought I worked at Bitcoin. So it's pretty different to be able to tell your family what you actually do, and they can see it.
**Swyx** [00:08:15]: And they can use it, too, personally. So you were there. Were you immediately on API Platform? You were there for the ChatGPT moment.
**Michelle** [00:08:21]: Yeah. I mean, API Platform is a very grandiose term for what it was. There was just a handful of us working on the API.
**Swyx** [00:08:27]: Yeah, it was like a closed beta, right? Not even everyone had access to the GPT-3 model.
**Michelle** [00:08:31]: A very different access model then, a lot more like tiered rollouts. But yeah, I would say the Applied team was maybe like 30 or 40 people, and yeah, probably closer to 30. And there was maybe like five-ish total working on the API at most. So yeah, we've grown a lot since then.
**Swyx** [00:08:47]: It's like 60, 70 now, right?
**Michelle** [00:08:49]: No, Applied is much bigger than that. Applied now is bigger than the company when I joined.
**Swyx** [00:08:53]: OK.
**Michelle** [00:08:54]: Yeah, we've grown a lot. I mean, there's so much to build. So we need all the help we can.
**Swyx** [00:08:57]: I'm a little out of date, yeah.
**Alessio** [00:08:58]: So when the ChatGPT release happened, what was the all-hands-on-deck story like? I had lunch with Evan Morikawa a few months ago. It sounded like it was a fun time to build the APIs and have all these people trying to use the web thing. How were you prioritizing internally? What was it like helping scale things when you're scaling non-GPU workloads versus, like, Postgres, pgBouncer, and things like that?
**Michelle** [00:09:19]: Totally. Yeah, actually, surprisingly, there were a lot of Postgres issues when ChatGPT came out because the accounts for like ChatGPT were tied to the accounts in the API. And so you're basically creating a developer account to log into ChatGPT at the time because it's just what we had. It was low-key research preview. And so I remember there was just so much work scaling like our authorization system and that would be down a lot. Also, GPU, you know, I never had worked in a place where you couldn't just scale the thing up. It's like everywhere I've worked, compute is like free and you just like auto scale a thing and you like never think about it again. But here we're having like tough decisions every day. We're like discussing like, you know, should they go here or here and we have to be principled about it. So that's a real mindset shift.
**Swyx** [00:09:58]: So you just released structured outputs. Congrats. You also wrote the blog post for it, which was really well-written and I loved all the examples that you put out. Like it really gives the full story. Yeah. Tell us about the whole story from beginning to end.
**Michelle** [00:10:09]: Yeah. I guess the story we should rewind quite a bit to Dev Day last year. Dev Day last year, exactly. We shipped JSON mode, which is our first foray into this area of product. So for folks who don't know, JSON mode is this functionality you can enable in our chat completions and other APIs, where if you opt in, we'll kind of constrain the output of the model to match the JSON language. And so you basically will always get something in a curly brace. And this is good. This is nice for a lot of people. You can like describe your schema, what you want in prompt and then, you know, we'll constrain it to JSON, but it's not getting you exactly where you want because you don't want the model to kind of make up the keys or like match different values than what you want. Like if you want an enum or a number and you get a string instead, it's like pretty frustrating. So we've been ideating on this for a while and like people have been asking for basically this every time I talk to customers for maybe the last year. So it was really clear that there's a developer need and we started working on kind of making it happen. And this is a real collab between engineering and research, I would say. And so it's not enough to just kind of constrain the model. I think of that as the engineering side, whereas basically you mask the available tokens that are produced every time to only fit the schema. And so you can do this engineering thing and you can force the model to do what you want, but you might not get good outputs. And sometimes with JSON mode, developers have seen that our models output like white space for a really long time where they don't-
**Swyx** [00:11:27]: Because it's a legal character.
**Michelle** [00:11:29]: Right. It's legal for JSON, but it's not really what they want. And so that's what happens when you do kind of a very engineering-biased approach. But the modeling approach is to also train the model to do more of what you want. And so we did these together. We trained a model which is significantly better than our past models at following formats. And we did the work to serve this constrained decoding concept at scale. And so I think marrying these two is why this feature is pretty cool.
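A rough sketch of the "engineering side" described here, i.e. masking the token distribution so only schema-legal tokens can be sampled at each step. This is illustrative only; OpenAI's actual implementation is not public, and the grammar state machine is assumed to exist elsewhere:

```python
import math

def mask_logits(logits: dict[int, float], allowed_token_ids: set[int]) -> dict[int, float]:
    """Keep only tokens the schema allows at this step; everything else gets -inf."""
    return {
        token_id: (score if token_id in allowed_token_ids else -math.inf)
        for token_id, score in logits.items()
    }

# At every decoding step a schema/grammar state machine reports which tokens are
# currently legal; the masked logits are then softmaxed and sampled as usual, so
# the model can only ever emit schema-valid output.
```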
**Alessio** [00:11:52]: You just mentioned starts and ends with a curly brace and maybe people's minds go to prefills in the Claude API. How should people think about JSON mode structured output prefills? Because some of them are like roughly starts with a curly brace and asks you for JSON, you should do it. And then Instructor is like, hey, here's a rough data schema you should use. And how do you think about them?
**Michelle** [00:12:13]: So I think we kind of designed structured outputs to be the easiest to use. The way you use it in our SDK, I think is my favorite thing. So you just create like a Pydantic object or a Zod object and you pass it in and you get back an object. And so you don't have to deal with any of the serialization. With the parse helper. Yeah, you don't have to deal with any of the serialization on the way in or out. So I kind of think of this as the feature for the developer who is like, I need this to plug into my system. I need the function call to be exact. I don't want to deal with any parsing. So that's where structured outputs is tailored. Whereas if you want the model to be more creative and use it to come up with a JSON schema that you don't even know you want, then that's kind of where JSON mode fits in. But I expect most developers are probably going to want to upgrade to structured outputs.
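A minimal sketch of the parse-helper flow described above, using the Python SDK with a Pydantic model; the model name, schema fields, and prompt are only an example:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,  # the Pydantic model doubles as the JSON schema
)

event = completion.choices[0].message.parsed  # already a CalendarEvent instance
```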
**Swyx** [00:12:55]: The thing you just said, you just use interchangeable terms for the same thing, which is function calling and structured outputs. We've had disagreements or discussion before on the podcast about, are they the same thing? Semantically, they're slightly different.
**Michelle** [00:13:09]: They are.
**Swyx** [00:13:10]: Yes. So the API came out first, then JSON mode. And we used to abuse function calling for JSON mode. Do you think we should treat them as synonymous?
**Michelle** [00:13:20]: No.
**Swyx** [00:13:21]: OK. Yeah. Please clarify. And by the way, there's also tool calling. Yeah.
**Michelle** [00:13:26]: The history here is we started with function calling. And function calling came from the idea of like, let's give the model access to tools and let's see what it does. And we basically had these internal prototypes of what a code interpreter is now. And we were like, this is super cool. Let's make it an API. But we're not ready to host code interpreter for everybody. So we're just going to expose the raw capability and see what people do with it. But even now, I think there's a really big difference between function calling and structured outputs. So you should use function calling when you actually have functions that you want the model to call. Right. And so if you have a database that you want the model to be able to query from, or if you want the model to send an email, or generate arguments for an actual action. And that's what the model has been fine-tuned on: to treat function calling as actually calling these tools and getting their outputs. The new response format is a way of just getting the model to respond to the user, but in a structured way. And so this is very different. Like responding to a user versus like, you know, I'm going to go send an email. A lot of people were hacking function calling to get the response format they needed. And so this is why we shipped kind of this new response format. So you can get exactly what you want, and you get kind of more of the model's verbosity. It's like kind of responding in the way it would speak to a user. And so less kind of just programmatic tool calling, if that makes sense.
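Roughly, the distinction in API terms: a tool definition describes an action your code will execute, while a response format constrains the reply itself. Both payloads below are illustrative sketches following the public chat completions parameter shapes; the function and schema names are made up:

```python
# Function calling: the model emits arguments for a tool you will actually run.
tools = [{
    "type": "function",
    "function": {
        "name": "send_email",
        "description": "Send an email to a recipient.",
        "strict": True,  # structured-outputs-style exactness for the arguments
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "body"],
            "additionalProperties": False,
        },
    },
}]

# Structured outputs as a response format: the model replies *to the user*,
# but the reply itself must match your schema.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "support_reply",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
            },
            "required": ["answer", "sentiment"],
            "additionalProperties": False,
        },
    },
}
```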
**Alessio** [00:14:42]: Are you building something into the SDK to actually close the loop with the function calling? Because right now it returns the function, then you got to run it, then you got to like fake another message to then continue the conversation.
**Swyx** [00:14:53]: They have that in beta, the runs.
**Michelle** [00:14:55]: Yes. We have this in beta in the Node SDK. So you can basically define... Oh, not Python? Oh. It's coming to Python as well.
**Alessio** [00:15:02]: That's why I didn't know. Yeah, I'm a Node guy, so...
**Swyx** [00:15:04]: The JavaScript mind is too advanced. It's already existed. It's coming everywhere.
**Michelle** [00:15:07]: But basically what you do is you write a function, and then you add a decorator to it. And then you can... Basically, there's this run tools method, and it does the whole loop for you, which is pretty cool.
**Swyx** [00:15:19]: When I saw that in the Node SDK, I wasn't sure if that's... Because it basically runs it in the same machine, and maybe you don't want that to happen. Yeah.
**Michelle** [00:15:28]: I think of it as like, if you're prototyping and building something really quickly and just playing around, it's so cool to just create a function and give it this decorator. But you have the flexibility to do it however you like.
**Swyx** [00:15:38]: You don't want it in a critical path of a web request? I mean, some people definitely will.
**Michelle** [00:15:42]: Really? It's just kind of the easiest way to get started. But let's say you want to execute this function on a job queue async, then it wouldn't make sense to use that.
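For reference, the loop Alessio describes can also be done by hand rather than via the run-tools helper. A sketch with the chat completions API; `look_up_order` and its schema are invented for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()

def look_up_order(order_id: str) -> dict:
    # Placeholder for your real business logic.
    return {"order_id": order_id, "status": "shipped"}

messages = [{"role": "user", "content": "Where is order 1234?"}]
tools = [{
    "type": "function",
    "function": {
        "name": "look_up_order",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# Run the function yourself, then feed the result back as a "tool" message.
result = look_up_order(**json.loads(call.function.arguments))
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})

final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```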
**Swyx** [00:15:52]: Prior art: Instructor, Outlines, JSONformer. What did you study? What did you credit or learn from these things?
**Michelle** [00:15:59]: Yeah. There's a lot of different approaches to this. There's more fill-in-the-blank style sampling, where you basically preform the keys and then get the model to sample just the value. There's a lot of approaches here. We didn't use any of them wholesale, but we really loved what we saw from the community and the developer experiences we saw. So that's where we took a lot of inspiration.
**Swyx** [00:16:21]: There was a question also just about constrained grammar. This is something that I first saw in Llama CPP, which seems to be the most, let's just say, academically permissive form of constrained grammar.
**Michelle** [00:16:32]: It's on the lowest level.
**Swyx** [00:16:33]: Yeah. For those who don't know, maybe I don't know if you want to explain it, but they use Backus–Naur form, which you only learn in college when you're working on programming languages and compilers. I don't know if you use that under the hood or you explore that.
**Michelle** [00:16:44]: Yeah. We didn't use any kind of other stuff. We kind of built our solution from scratch to meet our specific needs. But I think there's a lot of cool stuff out there where you can supply your own grammar. Right now we only allow JSON schema and a dialect of that. But I think in the future it could be a really cool extension to let you supply a grammar more broadly. And maybe it's more token efficient than JSON. So a lot of opportunity there.
**Alessio** [00:17:08]: You mentioned before also training the model to be better at function calling. What's that discussion like internally for resources? It's like, hey, we need to get better JSON mode. And it's like, well, can't you figure it out on the API platform without touching the model? Is there a really tight collaboration between the two teams?
**Michelle** [00:17:25]: Yeah. So I actually work on the API models team. I guess we didn't quite get into what I do at API.
**Swyx** [00:17:31]: What do you say it is you do here? Yeah.
**Michelle** [00:17:34]: So yeah, I'm the tech lead for the API, but also I work on the API models team. And this team is really working on making the best models for the API. And a lot of common deployment patterns are research makes a model and then you kind of ship it in the API. But I think there's a lot you miss when you do that. You miss a lot of developer feedback and things that are not kind of immediately obvious. What we do is we get a lot of feedback from developers and we go and make the models better in certain ways. So our team does model training as well. We work very closely with our post-training team. And so for structured outputs, it was a collab between a bunch of teams, including safety systems to make a really great model that does structured outputs.
**Swyx** [00:18:12]: Mentioning safety systems, you have a refusal field.
**Michelle** [00:18:15]: Yes.
**Swyx** [00:18:16]: You want to talk about that? Yeah.
**Michelle** [00:18:18]: It's pretty interesting. So you can imagine, basically, if you constrain the model to follow a schema, you can imagine a schema being supplied where it would add some risk or be harmful for the model to follow that schema. And we wanted to preserve our model's ability to refuse when something doesn't match our policies or is harmful in some way. And so we needed to give the model an ability to refuse even when there is this schema. But also, if you are a developer and you have this schema and you get back something that doesn't match it, you're like, oh, the feature's broken. So we wanted a really clear way for developers to program against this. So if you get something back in the content, you know it's valid, it's JSON parsable. But if you get something back in the refusal field, it makes for a much better UI for you to kind of display this to your user in a different way, and it makes it easier to program against. So really, there was a few goals. But it was mainly to allow the model to continue to refuse, but also with a really good developer experience.
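Continuing the parse-helper sketch from earlier, programming against the refusal field might look roughly like this:

```python
message = completion.choices[0].message  # from a beta.chat.completions.parse call

if message.refusal:
    # The model declined to follow the schema; show that to the user as a refusal,
    # not as data.
    print("Refused:", message.refusal)
else:
    print("Parsed:", message.parsed)
```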
**Swyx** [00:19:11]: Yeah. Why not offer it as an error code? Because we have to display error codes anyway. Yeah.
**Michelle** [00:19:17]: We've waffled for a long time about API design, as we are wont to do. And there are a few reasons against an error code. You could imagine this being a 4xx error code or something. But the developer's paying for the tokens. And that's kind of atypical for a 4xx error code.
**Swyx** [00:19:33]: We pay with errors anyway, right?
**Michelle** [00:19:36]: So 4xxs don't.
**Swyx** [00:19:39]: That's a you error.
**Michelle** [00:19:40]: Right. It's a malformed request. And it doesn't make sense as a 5xx either, because it's not our fault. It's the way the model is designed. I think the HTTP spec is a little bit limiting for AI in a lot of ways. There are things that are in between your fault and my fault. There's kind of the model's fault. And there's no error code for that. So we really have to kind of invent a lot of the paradigm here. Make it 6xx.
**Swyx** [00:20:02]: Yeah.
**Michelle** [00:20:03]: That's one option. There's actually some esoteric error codes we've considered adopting. 328.
**Swyx** [00:20:08]: My favorite.
**Michelle** [00:20:09]: Yeah. Yeah. There's the TPOT one.
**Swyx** [00:20:12]: Hey!
**Michelle** [00:20:13]: We're still figuring that out. But I think there are some things, like, for example, sometimes our model will produce tokens that are invalid based on kind of our language. And when that happens, it's an error. But 500 is fine, which is what we return. But it's not as expressive as it could be. So yeah. Just areas where Web 2.0 doesn't quite fit with AI yet.
**Alessio** [00:20:37]: If you had to put in a spec to just change, what would be your number one proposal to rehaul?
**Swyx** [00:20:43]: The HTTP committee to reinvent the world. Yeah.
**Michelle** [00:20:47]: I mean, I think we just need an error of, like, a range of model error. And we can have many different kinds of model errors. Like a refusal is a model error.
**Alessio** [00:20:55]: 601. Model refusal. Yeah.
**Michelle** [00:20:58]: Again, like, so we've mentioned before that chat completions uses this ChatML format. So when the model doesn't follow ChatML, that's an error. And we're working on reducing those errors. But that's, like, I don't know, 602, I guess.
**Swyx** [00:21:10]: A lot of people actually no longer know what ChatML is. Yeah. Fair enough. Briefly introduced by OpenAI and then, like, kind of deprecated. Everyone who implements this under the hood knows it. But maybe the API users don't know it.
**Michelle** [00:21:23]: Basically, the API started with just one endpoint, the completions endpoint. And the completions endpoint, you just put text in and you get text out. And you can prompt in certain ways. Then we released ChatGPT. And we decided to put that in the API as well. And that became the chat completions API. And that API doesn't just take, like, a string input and produce an output. It actually takes in messages and produces messages. And so you can get a distinction between, like, an assistant message and a user message. And that allows all kinds of behavior. And so the format under the hood for that is called ChatML. Sometimes, you know, because the model is so out of distribution based on what you're doing, maybe the temperature is super high, then it can't follow ChatML. Yeah.
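For readers who never saw it: the messages you send to chat completions map onto a ChatML-style template under the hood. A rough sketch, based on the ChatML format OpenAI published originally; the exact serving-side rendering today is an assumption:

```python
# Chat Completions input: a list of role-tagged messages...
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# ...which is rendered into ChatML-style special tokens, roughly:
#   <|im_start|>system
#   You are a helpful assistant.<|im_end|>
#   <|im_start|>user
#   Hello!<|im_end|>
#   <|im_start|>assistant
```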
**Swyx** [00:22:02]: I didn't know that there could be errors generated there. Maybe I'm not asking challenging enough questions. It's pretty rare.
**Michelle** [00:22:07]: And we're working on driving it down. But actually, this is a side effect of structured outputs now, which is that we have removed a class of errors. We didn't really mention this in the blog, just because we ran out of space. But-
**Swyx** [00:22:20]: That's what we're here to do.
**Michelle** [00:22:21]: Yeah. The model used to occasionally pick a recipient that was invalid. And this would cause an error. But now we are able to constrain to chat ML in a more valid way. And this reduces a class of errors as well.
**Swyx** [00:22:34]: Recipient meaning? So there's this, like, a few number of defined roles, like user, assistant, system.
**Michelle** [00:22:39]: So like recipient as in, like, picking the right tool. So the model before was able to hallucinate a tool, but now it can't when you're using structured outputs.
**Alessio** [00:22:49]: Do you collaborate with other model developers to try and figure out this type of errors? Like, how do you display them? Because a lot of people try to work with different models. Yeah. Is there any?
**Michelle** [00:23:00]: Yeah. Not a ton. We're kind of just focused on making the best API for developers.
**Swyx** [00:23:04]: A lot of research and engineering, I guess, comes together with evals. You published some evals there. I think Gorilla is one of them. What is your assessment of the state of evals for function calling and structured output right now?
**Michelle** [00:23:17]: Yeah. We've actually collaborated with BFCL a little bit, which is, I think, the same thing as
**Swyx** [00:23:23]: Gorilla. Function calling leaderboard.
**Michelle** [00:23:25]: Kudos to the team. Those evals are great. And we use them internally. Yeah. We've also sent some feedback on some things that are misgraded. And so we're collaborating to make those better. In general, I feel evals are kind of the hardest part of AI. When we talk to developers, it's so hard to get started. It's really hard to make a robust pipeline. And you don't want evals that are 80% successful, because things are going to improve dramatically. And it's really hard to craft the right eval. You kind of want to hit everything on the difficulty curve. I find that a lot of these evals are mostly saturated, like for BFCL. All the models are near the top already. And the errors are more, I would say, just differences in default behaviors. I think most of the models on the leaderboard can kind of get 100% with different prompting. But it's more kind of you're just pulling apart different defaults at this point. So yeah, I would say in general, we're missing evals. We work on this a lot internally. But it's hard.
**Swyx** [00:24:18]: Did you, other than BFCL, would you call out any others just for people exploring the space?
**Michelle** [00:24:23]: So SWE-bench is actually a very interesting eval, if people don't know it. You basically give the model a GitHub issue and a repo, and just see how well it does at the issue. Which I think is super cool. It's kind of like an integration test, I would say, for models.
**Swyx** [00:24:36]: It's a little unfair, right?
**Michelle** [00:24:38]: What do you mean?
**Swyx** [00:24:39]: A little unfair, because usually, as a human, you have more opportunity to ask questions about what it's supposed to do, and you're giving the model way too little information
**Michelle** [00:24:47]: to do the job.
**Swyx** [00:24:48]: It's a hard job.
**Michelle** [00:24:49]: But yeah, so SWE-bench targets how well can you follow the diff format, and how well can you search across files, and how well can you write code. So I'm really excited about evals like that, because the pass rate is low, so there's a lot of room to improve. And it's just targeting a really cool capability.
**Swyx** [00:25:03]: I've seen other evals for function calling, where I think might be BFCL as well, where they evaluate different kinds of function calling. And I think the top one that people care about, for some reason, I don't know personally that this is so important to me, but it's parallel function calling. I think you confirmed that you don't support that yet. Why is that hard? Just more context about it.
**Michelle** [00:25:23]: So yeah, we put out parallel function calling Dev Day last year as well. And it's kind of the evolution of function calling. So function calling v1, you just get one function back. Function calling v2, you can get multiple back at the same time and save latency. We have this in our API, all of our newer models support it. But we don't support it with structured outputs right now. And there's actually a very interesting trade-off here. So when you basically call our API for structured outputs with a new schema, we have to build this artifact for fast sampling later on. But when you do parallel function calling, the kind of schema we follow is not just directly one of the function schemas. It's like this combined schema based on a lot of them. If we were going to do the same thing and build an index every time you pass in a list of functions, if you ever change the list, you would kind of incur more latency. And we thought it would be really unintuitive for developers and hard to reason about. So we decided to kind of wait until we can support a no-added-latency solution and not just kind of make it really confusing for developers.
**Swyx** [00:26:19]: Mentioning latency, that is something that people discovered, is that there is an increased cost in latency for the first token.
**Michelle** [00:26:25]: For the first request, yeah.
**Swyx** [00:26:26]: First request. Is that an issue? Is that going to go down over time? Is there some overhead to parsing JSON that is just insurmountable?
**Michelle** [00:26:33]: It's definitely not insurmountable. And I think it will definitely go down over time. We just kind of take the approach of ship early and often. And if there's nothing in there you don't want to fix, then you probably ship too late. So I think we will get that latency down over time. But yeah, I think for most developers, it's not a big concern. Because you're testing out your integration. You're sending some requests while you're developing it. And then it's fast and broad. So it kind of works for most people. The alternative design space that we explored was pre-registering your schema, so a totally different endpoint, and then passing in a schema ID. But we thought that was a lot of overhead, and another endpoint to maintain, and just kind of more complexity for the developer. And we think this latency is going to come down over time. So it made sense to keep it in chat completions.
**Swyx** [00:27:20]: I mean, hypothetically, if one were to ship caching at a future point, it would basically be the superset of that. Maybe.
**Michelle** [00:27:28]: I think the caching space is a little underexplored. We've seen kind of two versions of it. But I think, yeah, there's ways that maybe put less onus on the developer. But we haven't committed to anything yet, but we're definitely exploring opportunities for making things cheaper over time.
**Alessio** [00:27:42]: Are AGI and agents just going to be a bunch of structured outputs and function calls chained one after the other? Like, how do you see it, you know, where the model does everything? Where do you draw the line? Because you don't call these things an agent API. But, like, if I were a startup trying to raise a seed round, I would just do function calling and say, this is an agent API. So how do you think about the difference, and how people build on top of it for agentic systems? Yeah.
**Michelle** [00:28:04]: Love that question. One of the reasons we wanted to build structured outputs is to make agentic applications actually work. So right now it's really hard. Like, if something is 95% reliable, but you're chaining together a bunch of calls, if you magnify that error rate, it makes your application not work. So that's a really exciting thing here from going from like 95% to 100%. I'm very biased working on the API and working on function calling and structured outputs, but I think those are the building blocks that we'll be using kind of to distribute this technology very far. It's the way you connect like natural language and converting user intent into working with your application. And so I think like kind of there's no way to build without it, honestly, like you need your function calls to work like, yeah, we wanted to make that a lot easier.
**Alessio** [00:28:45]: And do you think the Assistants API kind of thing will be a bigger part as people build agents? I think maybe most people just use messages and completions.
**Michelle** [00:28:54]: So I would say the Assistants API was kind of a bet in a few areas. One bet is hosted tools. So we have the file search tool and code interpreter. Another bet was kind of statefulness. It's our first stateful API. It'll store, you know, threads and you can fetch them later. I would say the hosted tools aspect has been really successful. Like people love our file search tool and it's like, it saves a lot of time to not build your own RAG pipeline. I think we're still iterating on the shape for the stateful thing to make it as useful as possible. Right now, there's kind of a few endpoints you need to call before you can get a run going. And we want to work to make that, you know, much more intuitive and easier over time.
**Swyx** [00:29:31]: One thing I'm just kind of curious about, did you notice any trade-offs when you add more structured output, it gets worse at some other thing that was like kind of you didn't think was related at all?
**Michelle** [00:29:40]: Yeah, it's a good question. Yeah. I mean, models are very spiky and RL is hard to predict. And so every model kind of improves on some things and maybe is flat or neutral on other things.
**Swyx** [00:29:53]: Yeah. Like it's like very rare to just add a capability and have no trade-offs and everything else.
**Michelle** [00:29:58]: So yeah, I don't have something off the top of my head, but I would say, yeah, every model is a special kind of its own thing. This is why we put them in the API dated, so developers can choose for themselves which one works best for them. In general, we strive to continue improving on all evals, but it's stochastic.
**Swyx** [00:30:14]: Are you able to apply the structured output system to backdated models, like the May 4o, as well as Mini, as well as August?
**Michelle** [00:30:23]: Actually, the new response format is only available on two models. It's 4o mini and the new 4o. So the old 4o doesn't have the new response format. However, for function calling, we were able to enable it for all models that support function calling. And that's because those models were already trained to follow these schemas. We basically just didn't want to add the new response format to models that would do poorly at it because they would just kind of do infinite white space, which is the most likely token if you have no idea what's going on.
**Swyx** [00:30:52]: I just wanted to call out a little bit more of the stuff you've done in the blog post. So in the blog post, there are use cases, right? I just want people to be like, yeah, we're spelling it out for you. Use these for extracting structured data from unstructured data. By the way, it does vision too, right? So that's cool. Dynamic UI generation. Actually, let's talk about dynamic UI. Gen UI, I think, is something that people are very interested in. It's your first example. What did you find about it?
**Michelle** [00:31:16]: Yeah, I just thought it was a super cool capability we have now. So the schemas, we support recursive schemas. And this allows you to do really cool stuff. Every UI is a nested tree that has children. So I thought that was super cool. You can use one schema and generate tons of UIs. As a backend engineer who's always struggled with JavaScript and frontend, for me, that's super cool. We've now built a system where I can get any frontend that I want. So yeah, that's super cool. The extracting structured data, the reality of a lot of AI applications is you're plugging them into your enterprise business. And you have something that works, but you want to make it a little bit better. And so the reliability gains you get here is you'll never get a classification using the wrong enum. It's just exactly your types. So really excited about that.
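The recursive-schema trick mentioned here, sketched as a self-referential Pydantic model; the field names are invented for illustration:

```python
from pydantic import BaseModel

class UIComponent(BaseModel):
    type: str                       # e.g. "div", "button", "input" (illustrative)
    label: str
    children: list["UIComponent"]   # recursion: any node can contain more nodes

UIComponent.model_rebuild()  # resolve the self-reference before use
```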
**Swyx** [00:32:03]: It can maybe still hallucinate the actual values, right? So let's clearly state what the guarantees are. So the guarantee is that this fits the schema, but the schema itself may be too broad, because the JSON schema type system doesn't say, like, I only want a range from 1 to 11. You might give me 0. You might give me 12.
**Michelle** [00:32:21]: So yeah, JSON schema, so this is actually a good thing to talk about. So JSON schema is extremely vast, and we weren't able to support every corner of it. So we kind of support our own dialect, and it's described in the docs. And there are a few trade-offs we had to make there. So by default, if you don't pass in additional properties in a schema, by default that's true. And so that means you can get other keys, which you didn't spell out, which is kind of the opposite of what developers want. You basically want to supply the keys and values, and you want to get those keys and values. And so then we had to decision to make. It's like, do we redefine what additional properties means as the default? That felt really bad. It's like, there's a schema that's predated us. It wouldn't be good. It would be better to play nice with the community. And so we require that you pass it in as false. One of our design principles is to be very explicit, and so developers know what to expect. And so this is one where we decided it's a little harder to discover, but we think you should pass this thing in so that we can have a very clear definition of what you mean and what we mean. There's a similar one here with required. By default, every key in JSON schema is optional. But that's not what developers want, right? You'd be very surprised if you pass in a bunch of keys and you didn't get some of them back. And so that's the trade-off we made, is to make everything required and have the developers spell that out.
**Alessio** [00:33:34]: Is there a required false? Can people turn it off, or are they just getting all the keys?
**Michelle** [00:33:38]: So developers can... Basically, what we recommend for that is to make your actual key a union type. And so... Nullable. Yeah. Make it a union of int and null, and that gets you the same behavior.
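The union-with-null workaround, in Pydantic and raw-schema form; a sketch with invented field names:

```python
from typing import Optional
from pydantic import BaseModel

class Contact(BaseModel):
    name: str
    phone: Optional[str]  # the key is always present, but its value may be null

# Raw JSON-schema equivalent of a required-but-nullable field:
phone_schema = {"type": ["string", "null"]}
```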
**Swyx** [00:33:48]: Any other examples you want to dive into, math, chain of thought? Yeah.
**Michelle** [00:33:52]: You can now specify a chain of thought field before a final answer. This is just a more structured way of extracting the final answer. One example we have, I think we put up a demo app of this math tutoring example, or it's coming out soon. Did I miss it?
**Swyx** [00:34:06]: Oh, okay.
**Michelle** [00:34:07]: Well... Basically, it's this math tutoring thing, and you put in an equation, and you can go step-by-step and answer it. This is something you can do now with Structured Outputs. In the past, a developer would have to specify their format, and then write a parser and parse out the model's output, which would be pretty hard. But now you just specify steps, and it's an array of steps. And every step you can render, and then the user can try it, and you can see if it matches and go on that way. So I think it just opens up a lot of opportunities for any kind of UI where you want to treat different parts of the model's responses differently; Structured Outputs is great for that.
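The steps-then-answer shape maps to a schema along these lines; this is an approximation of the math tutoring example, not the exact demo code:

```python
from pydantic import BaseModel

class Step(BaseModel):
    explanation: str  # reasoning shown to the student for this step
    output: str       # the expression after applying the step

class MathResponse(BaseModel):
    steps: list[Step]
    final_answer: str
```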
**Swyx** [00:34:38]: I remembered my question from earlier. I'm basically just using this to ask you all the questions as a user, as a daily user of the stuff that you put out. So one is a tip that people don't know, and I confirmed it to you on Twitter, which is you respect descriptions of JSON schemas, and you can basically use that as a prompt for the field. Totally. I assume that's blessed, and people should do that.
**Michelle** [00:34:57]: Intentional, yeah.
**Swyx** [00:34:58]: One thing that I started to do, which could be a hallucination of me, is I changed the property name to prompt the model to what I wanted to do. So for example, instead of saying topics as a property name, I would say, like, brainstorm a list of topics, up to five, or something like that, as a property name. I could stick that in the description as well. But is that too much?
**Michelle** [00:35:22]: Yeah, I would say, I mean, we're so early in AI that people are figuring out the best way to do things, and I love when I learn from a developer a way they found to make something work. In general, I think there's three or four places to put instructions. You can put instructions in the system message, and I would say that's helpful for when to call a function. So it's like, let's say you're building a customer support thing, and you want the model to verify the user's phone number or something. You can tell the model in the system message, like, here's when you should call this function. Then when you're within a function, I would say the descriptions there should be more about how to call a function. So what's really common is someone will have, like, date as a string, but you don't tell the model, like, do you want YYYY-MM-DD, or do you want that backwards? And that's a really good spot for those kinds of descriptions. It's like, how do you call this thing? And then sometimes there's stuff like naming the key by what you want. So sometimes people put, like, "do not use" in a key name if they don't want this parameter to be used except in some circumstances. And really, I think that's the fun nature of this. Like, you're figuring out the best way to get something out of the model.
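For the "how to call it" layer, field descriptions are the natural home; with Pydantic they end up in the JSON schema automatically. The field names and formats below are invented for illustration:

```python
from pydantic import BaseModel, Field

class VerifyPhoneArgs(BaseModel):
    phone_number: str = Field(description="E.164 format, e.g. +14155550123")
    date_of_birth: str = Field(description="YYYY-MM-DD")
```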
**Swyx** [00:36:27]: Okay, so you don't have an official recommendation is what I'm hearing.
**Michelle** [00:36:30]: Well, the official recommendation is, you know, how to call a model, system instructions.
**Swyx** [00:36:34]: Exactly, exactly.
**Michelle** [00:36:35]: Or when to call a function.
**Alessio** [00:36:36]: Yeah. Do you benchmark these types of things? So, like, say, with date: in the description it's like, return it in ISO 8601, versus calling the key date_in_iso_8601. I feel like the benchmarks don't go that deep, but then in the AI engineering community, with all the work that people do, it's like, oh, actually, this performs better, but there's no way to verify. You know, even the "I'm going to tip you $100,000" or whatever, some people say it works, some people say it doesn't. Do you pay attention to this stuff as you build this, or are you just like, the model is just going to get better, so why waste my time running evals on these small, small things?
**Michelle** [00:37:14]: Yeah, I would say, to that, I would say we basically pick our battles. I mean, there's so much surface area of LLMs that we could dig into, and we're just mostly focused on kind of raising the capabilities for everyone. I think for customers, and we work with a lot of customers, really developing their own evals is super high leverage, because then you can upgrade really quickly when we have a new model, you can experiment with these things with confidence. So, yeah, we're hoping to make making evals easier. I think that's really generally very helpful for developers.
**Swyx** [00:37:42]: For people, just to kind of wrap up the discussion on structured outputs: I immediately implemented it. We use structured outputs for AI News. I was using Instructor, and I ripped it out, and I think I saved 20 lines of code, but more importantly, we cut API costs by about 55% based on what I measured, because we saved on the retries.
**Michelle** [00:38:02]: Nice. Yeah, love to hear that.
**Swyx** [00:38:04]: Yeah, which I think people don't understand. You can just add Instructor or add Outlines, you can do that, but it's actually going to cost you a lot of retries to get the output that you want, whereas here it's kind of just built into the model.
**Michelle** [00:38:17]: Yeah, I think this is the kind of feature that works really well when it's integrated with the LLM provider. Yeah, actually, I had folks, even my husband's company, he works at a small startup, they thought we were just retrying, and so I had to make the blog post, we are not retrying, we're doing it in one shot, and this is how you save on latency and cost.
**Swyx** [00:38:36]: Awesome. Any other behind-the-scenes stuff, just generally on structured outputs, we're going to move on to the other models. Yeah, I think that's it. Well, that's an excellent product, and I think everyone will be using it, and we have the full story now that people can try out. So roadmap would be parallel function calling, anything else that you've called out as coming soon?
**Michelle** [00:38:54]: Not quite soon, but we're thinking about does it make sense to expose custom grammars beyond JSON schema?
**Swyx** [00:39:00]: What would you want to hear from developers to give you information, whether it's custom grammars or anything else about structured output, what would you want to know more of?
**Michelle** [00:39:07]: Just always interested in feature requests, what's not working, but I'd be really curious like what specific grammars folks want. I know some folks want to match programming languages like Python. There's some challenges like with the expressivity of our implementation, and so yeah, just kind of the class of grammars folks want.
**Swyx** [00:39:25]: I have a very simple one, which is a lot of people try to use GPT as judge, right? Which means they end up doing a rating system, and then there's like 10 different kinds of rating systems, there's a Likert scale, there's whatever. If there was an officially blessed way to do a rating system with structured outputs, everyone would use it.
**Michelle** [00:39:42]: Yeah. Yeah, that makes sense. I would definitely recommend using log probs with classification tasks. So rather than sampling, let's say you have four options, like red, yellow, blue, green, rather than sampling two tokens for yellow, you can just do like A, B, C, D and get the log probs of those. That way the inherent randomness of each sample isn't a factor, and you can just actually look at what the most likely token is.
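A sketch of the log-probs approach to classification described here; the option labels and prompt are arbitrary:

```python
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer with exactly one letter: A=red, B=yellow, C=blue, D=green."},
        {"role": "user", "content": "What color is a ripe banana?"},
    ],
    max_tokens=1,
    logprobs=True,
    top_logprobs=4,  # inspect the distribution over the single answer token
)

top = completion.choices[0].logprobs.content[0].top_logprobs
for candidate in top:
    print(candidate.token, candidate.logprob)
```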
**Swyx** [00:40:07]: I think this is more of like a calibration question. If I ask you to rate things from one to 10, a non-calibrated model might always pick seven, just like a human would. So actually have a nice gradation from one to 10 would be the rough idea. And then even for structured outputs, I can't just say, have a field of rating from one to 10, because I have to then validate it, and it might give me 11. Yeah, absolutely.
**Alessio** [00:40:31]: So what about model selection? Now you have a lot of models. When you first started, you had one model endpoint? I guess you had the DaVinci, but most people were using one model endpoint. Today, you have a lot of competitive models, and I think we're nearing the end of the 3.5 run RIP. How do you advise people to experiment, select, both in terms of tasks and costs? What's your playbook?
**Michelle** [00:40:56]: In general, I think folks should start with 4o mini. That's our cheapest model, and it's a great workhorse, works for a lot of great use cases. If you're not finding the performance you need, maybe it's not smart enough, then I would suggest going to 4o. And if 4o works well for you, that's great. Finally, there's some really advanced frontier use cases, and maybe 4o is not quite cutting it. And there, I would recommend our fine-tuning API. Even just 100 examples is enough to get started there, and you can really get the performance you're looking for.
**Swyx** [00:41:26]: We're recording this ahead of it, but you're announcing some fine-tuning stuff that people should pay attention to.
**Michelle** [00:41:32]: Yeah. So for 4o, we're dropping our GA for GPT-4o fine-tuning. So 4o mini has been available for a few weeks now, and 4o is now going to be generally available. And we also have a free training offering for a bit. I think until September 23rd, you get one million free training tokens a day. This is already announced, right?
**Swyx** [00:41:50]: Or am I talking about a different thing?
**Michelle** [00:41:52]: So that was for 4o mini, and now it's also for 4o. So we're really excited to see what people do with it. And it's actually a lot easier to get started than a lot of people expect. I think they might need tens of thousands of examples, but even 100 really high-quality ones or 1,000 is enough to get going.
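Kicking off a fine-tune on the API is a short call once you have a chat-format JSONL training file. A sketch; the file name and base-model strings are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Upload ~100+ high-quality examples in chat-format JSONL.
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",   # or a 4o mini snapshot
    training_file=training_file.id,
)
print(job.id, job.status)
```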
**Swyx** [00:42:06]: Well, we might get a separate podcast just specifically on that, but we haven't confirmed that yet. It basically seems like every time, I think people's concerns about fine-tuning is that they're kind of locked into a model, and I think you're paving the path for migration of models. As long as they keep their original data set, they can at least migrate nicely.
**Michelle** [00:42:25]: Yeah, I'm not sure what we've said publicly there yet, but we definitely want to make it easier for folks to migrate.
**Swyx** [00:42:31]: It's the number one concern. I'm just... Yeah, it's obvious.
**Michelle** [00:42:34]: Absolutely.
**Swyx** [00:42:35]: I also want to point people to... You have official model selection docs, where it's in the guide. We'll put it in the show notes, where it says to optimize for accuracy first. So prompt engineering, RAG, evals, fine-tuning. This was done at Dev Day last year, so I'm just repeating things. And then optimize for cost and latency second. And there's a few sets of steps for optimizing latency. So people can read up on that stuff.
**Michelle** [00:42:57]: Yeah, totally.
**Alessio** [00:42:58]: We had one episode with Nicholas Carlini from DeepMind, and we actually talked about how some people don't actually get to the boundaries of the model's performance. They just try one model, and it's like, oh, LLMs cannot do this, and they stop. How should people get over that hurdle? How do you know if you've hit the limits of model performance versus a skill issue, like your prompt is not good, or you should try another model and whatnot? Is there an easy way to do that?
**Michelle** [00:43:22]: That's tough. Some people are really good at prompting, and they just kind of get it right away. And for others, it's more of a challenge. I think there's a lot we can do to make it easier to prompt our models. But for now, I think it requires a lot of creativity and not giving up right away. And a lot of people have experience now with ChatGPT. Before ChatGPT, the easiest way to play with our models was in the playground. But now everyone's played with it, with a model of some sort, and they have some sort of intuition. It's like, if I tell you my grandma is sick, then maybe I'll get the right output. And we're hoping to kind of remove the need for that. But playing around with ChatGPT is a really good way to get a feel for how to use the API as well.
**Alessio** [00:43:59]: Will prompt engineering be here forever? Or is it a dying art as the models get better?
**Michelle** [00:44:04]: I mean, it's like the perennial question of software engineering as well. As the models get better at coding, you know, if we hit 100% on SWE-bench, what does that mean? I think there will always be alpha in people who are able to clearly explain what they're trying to build. Most of engineering is figuring out the requirements and stating what you're trying to do. And I believe this will be the case with AI as well. You're going to have to very clearly explain what you need, and some people are better than others at it. And people will always be building; it's just that the tools are going to get far better.
**Swyx** [00:44:32]: In the last two weeks, you released two models. There's gpt-4o-2024-08-06, and then there's also chatgpt-4o-latest. I think people were a little bit confused by that, and then you issued a clarification that one is chat-tuned and the other is more function calling-tuned. Can you elaborate?
**Michelle** [00:44:47]: Yeah, totally. So part of the impetus here was to be very transparent about what's in ChatGPT and what's in the API. Basically, we're often training models, and there are different use cases. You don't really need function calling for user-defined functions in ChatGPT, and so this gives us the freedom to build the best model for each use case. So with chatgpt-4o-latest, we're releasing kind of this rolling model; the weights aren't pinned as we release new models.
**Swyx** [00:45:14]: This is literally what we use. Yeah.
**Michelle** [00:45:16]: So it's what's in ChatGPT, so it's very good for chat-style use cases. But for the API, broadly, you know, we really tune our models to be good at things that developers want, like function calling and structured outputs. And when a developer builds their application, they want to know that the weights are stable under them. And so we have this offering where, if you're tuning to a specific model and you know your function works, you know we will never change the weights out from under you. And so those are the models we commit to supporting for a long time, and we think those are the best for developers. But we want to, you know, leave the choice to developers. Do you want the ChatGPT model or the API model? You have the freedom to choose what's best for you.
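In code, the difference is just which model string you pass; a small sketch, with an illustrative prompt:

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Summarize this episode in one line."}]

# Pinned snapshot: the weights never change out from under you.
pinned = client.chat.completions.create(model="gpt-4o-2024-08-06", messages=messages)

# Rolling alias: tracks whatever is currently serving ChatGPT.
rolling = client.chat.completions.create(model="chatgpt-4o-latest", messages=messages)
```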
**Swyx** [00:45:54]: I think for most people, they do want to pin model versions, so I don't know when they would use the rolling ChatGPT one, unless they're really just kind of cloning ChatGPT, which is like, why would they?
**Michelle** [00:46:08]: I mean, I think there's a lot of interesting stuff that developers can do when unbounded, and so we don't want to limit them artificially. So it's kind of survival of the fittest. Whichever model is better, you know, that's the one that people should use.
**Swyx** [00:46:21]: Yeah, when I talked about it with my friends, it was like, this is the new thing here: basically, OpenAI never actually shared the actual ChatGPT model with you before, and now they do.
**Michelle** [00:46:31]: Well, that's not necessarily true. Actually, a lot of the models we have shipped have been the same, but, you know, sometimes they diverge, and we don't want that limitation to stick around.
**Swyx** [00:46:41]: Anything else we should know about the new model? I don't think there were any evals announced or anything, but people say it's better. I mean, obviously, it's way above everything on LMSYS, right? It's like number one in the world on...
**Michelle** [00:46:52]: Yeah, we published some release notes. They're not as in-depth as we want to be yet, because it's still kind of a science and we're learning what actually changes with each model, and how can we better understand the capabilities. But we are trying to do more release notes in the future and keep folks updated. But yeah, it's kind of an art and a science right now.
**Swyx** [00:47:13]: You need the best evals team in the world to help you figure this out. Yeah, evals are hard.
**Michelle** [00:47:17]: We're hiring if you want to come work on evals.
**Swyx** [00:47:19]: Hold that thought on hiring. We'll come back to the end on what you want, what you're looking for, because obviously people want to join you and they want to know what qualities you're looking for.
**Alessio** [00:47:27]: So we just talked about the API versus ChatGPT. What's, I guess, the vision for the interface? You know, the mission of OpenAI is to build AGI that is accessible. Where is it going to come from?
**Michelle** [00:47:40]: Totally, yeah. So I believe that the API is kind of our broadest vehicle for distributing AGI. You know, we're building some first-party products, but they'll never reach every niche in the world and every corner of the community. And so I really love working with developers and seeing the incredible things they come up with. I often find that developers see the future before anyone else, and we love working with them to make it happen. And so really the API is a bet on going really broad. We'll go very deep as well in our first-party products, but I think our impact is absolutely magnified by every developer that we uplift.
**Swyx** [00:48:13]: They can do the last mile where you cannot. ChatGPT is one type of product, but there are many other kinds. In fact, I observed, I think in February, that ChatGPT's user growth basically stopped when the API was launched, because everyone was going to be able to take that and build other things. That hasn't stayed true, because ChatGPT's growth has continued. But you're not confirming any of this; this is me quoting Similarweb numbers, which have very high variance.
**Michelle** [00:48:40]: Well, the API predates ChatGPT. The API was actually OpenAI's first product and the first idea for commercialization. That predates me as well.
**Swyx** [00:48:48]: Wide release, like GA, where everyone can sign up and use it immediately, is what I'm talking about. But yeah, I do believe that, and that means you also have to expose all of OpenAI's models, right? Like all the multimodal models. We'll ask you questions on that, but I think that API mission is important. It's interesting that the hottest new programming language is supposed to be English, but it's actually just software engineering, right? It's just, you know, we're talking about HTTP error codes.
**Michelle** [00:49:17]: Right. Yeah, I think, you know, engineering is still the way you access these models. And I think there are companies working on tools to make engineering more accessible for everyone. But there's still so much alpha in just writing code and deploying. Yeah.
**Swyx** [00:49:32]: One might even call it AI engineering. Exactly. Yeah. So there's lots of war stories from building this platform. We started at the start of your career, and then we jumped straight to structured outputs. There's a whole thing, like two years that we skipped in between. What have become your principles? What are your favorite stories that you like to tell?
**Michelle** [00:49:50]: We had so much fun working on the Assistants API and leading up to Dev Day. You know, things are always pretty chaotic when you have an external date that is hard, and there's a stage, and there are a thousand people coming.
**Swyx** [00:50:02]: You can always launch a wait list.
**Michelle** [00:50:06]: We're trying hard not to, because, you know, we love it when people can access the thing on day one. And so, yeah, the Assistants API, we had like this really small team and just working as hard as we could to make this come to life. But even actually the morning of, I don't know if you'll remember this, but Sam did this keynote. Yep. And Roman came up, and they gave free credits to everybody. So that was live, fully live, as were all of the demos that day. But actually, maybe like two hours before that, we had a little outage and everyone was like scrambling to make this thing work again. So yeah, things are early and scrappy here, and, you know, we were really glad. We were a bit on the edge of our seat watching it live.
**Alessio** [00:50:46]: What's the plan B in that situation? If you can share.
**Swyx** [00:50:49]: Play a video. It's just classic DevRel, right? I don't know.
**Michelle** [00:50:52]: I mean, I actually don't know what the plan B was.
**Swyx** [00:50:55]: No plan B, no failure.
**Michelle** [00:50:56]: But we just, you know, we fixed it. We got everything running again, and the demo went well.
**Swyx** [00:51:02]: Just hire cracked Waterloo grads. Exactly. Skill issues, as usual.
**Michelle** [00:51:05]: Sometimes you just got to make it happen.
**Swyx** [00:51:07]: I imagine it's actually very motivating, but I did hear that after Dev Day, like, the whole company got like a few weeks off just to relax a little bit.
**Michelle** [00:51:15]: Yeah, we sometimes get, like, we just had the week of July 4th off, and yeah. It's hard to take vacation because people are working on such exciting things, and it's like, you get a lot of FOMO on vacation. So it helps when the whole company's on vacation.
**Swyx** [00:51:29]: Speaking of Assistants API, you actually announced a roadmap there, and things have developed. I think people may not be up to date. What's the offering today versus, you know, one year ago?
**Michelle** [00:51:39]: Yeah. So we've made a bunch of key improvements. I would say the biggest one is in the file search product. Before we only supported, I think, like 20 files per assistant, and the way we used those files was, like, less effective. Basically, the model would decide based on the file name whether to search a file, and there's not a ton of information in there. So our new offering, which we shipped a few months ago, I think, now allows 10K files per assistant, which is, like, dramatically more. And also, it's a kind of different operation, so you can search semantically over all files at once rather than just kind of the model choosing one up front. So a lot of customers have seen really good performance. We also have exposed more, like, chunking and re-ranking options. I think the re-ranking one is coming, I think, next week or very soon. So this kind of gives developers more control and more flexibility there. So we're trying to make it the easiest way to kind of do RAG at scale.
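A minimal sketch of that file search flow against the Assistants API v2 beta surface; the file id and names here are hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# Create a vector store and add an already-uploaded file to it; a
# chunking_strategy can optionally be passed at this step.
vector_store = client.beta.vector_stores.create(name="Podcast transcripts")
client.beta.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id="file-abc123",  # hypothetical id from a prior client.files.create call
)

# Attach the store to an assistant with the file_search tool; retrieval then
# runs semantically over every file in the store, not one file chosen up front.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    tools=[{"type": "file_search"}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
print(assistant.id)
```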
**Swyx** [00:52:29]: Yeah. I think that visibility into the RAG system was the number one thing missing from DevDay, and then people got their first impressions, and then they never looked at it again. So that's important. The re-ranker is a core feature of, let's say, some other foundation model labs. Is OpenAI going to, like, offer a re-ranking service, a re-ranker model?
**Michelle** [00:52:49]: So we do re-ranking as part of it. I think we're soon going to ship more controls for that.
**Swyx** [00:52:54]: OK, got it. So if I'm an existing LangChain, LlamaIndex, whatever user, how do you compare? Do you make different choices? Where does that exist in the spectrum of choices?
**Michelle** [00:53:04]: I think we are just coming at it trying to be the easiest option. And so, ideally, like, you don't have to know what a re-ranker is, and you don't have to have a chunking strategy, and the thing just kind of works out of the box. So I would say that's where we're going. And then, you know, giving controls to the power users to make the changes they need.
**Swyx** [00:53:22]: Awesome. I wanted to ask about a couple of other things, just updates on stuff also announced at Dev Day. And we talked about this before: determinism, something that people really want. Dev Day announced the seed parameter as well as system fingerprint. And, objectively, I've heard issues. Yeah.
**Michelle** [00:53:36]: I don't know what's going on. Yeah, the seed parameter is not fully deterministic, and it's kind of a best effort thing. Yeah. So you'll notice there's more determinism in the first few tokens. That's kind of the current implementation. We've heard a lot of feedback. We're thinking about ways to make it better. But it's challenging. It's kind of trading off against, you know, reliability and uptime.
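For reference, a sketch of how that best-effort determinism surfaces in the API: pass a seed and compare the returned system fingerprint across calls; if the fingerprint changes, the serving configuration changed and outputs may differ even with the same seed. The prompt here is illustrative:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Name a color."}],
    seed=42,          # best-effort reproducibility, not a guarantee
    temperature=0,
)
print(resp.system_fingerprint, resp.choices[0].message.content)
```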
**Alessio** [00:53:55]: Another maybe underrated API-only thing: logit bias. That's another thing that seems very useful, but maybe most people look at it and think, it's a lot of work, I don't want to use it. Do you have any examples of use cases or products that are made a lot better by using it?
**Michelle** [00:54:11]: So yeah, classification is the big one. With logit bias, you bias toward your valid classification outputs, and, you know, you're more likely to get something that matches. We've also seen people logit bias punctuation tokens, maybe trying to get more succinct writing. It's generally very much a power user feature, and so not a ton of folks use it.
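A sketch of that classification use, assuming tiktoken for the token ids and single-token labels; the labels and prompt are illustrative:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.encoding_for_model("gpt-4o-mini")

# Strongly favor the single-token class labels (bias values range from -100 to 100).
labels = ["A", "B", "C", "D"]
bias = {str(enc.encode(label)[0]): 100 for label in labels}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify 'the sky': A=red, B=yellow, C=blue, D=green. One letter."}],
    logit_bias=bias,
    max_tokens=1,
)
print(resp.choices[0].message.content)
```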
**Swyx** [00:54:30]: I actually wanted to use it to reduce the incidence of the word delve.
**Michelle** [00:54:34]: Yeah.
**Swyx** [00:54:35]: Have people done that?
**Michelle** [00:54:36]: Probably. I don't know. Is delve one token? You're probably, you got to do a lot of permutations. It's got to be. It's used so much.
**Swyx** [00:54:42]: It's got to be.
**Michelle** [00:54:43]: Maybe it is.
**Alessio** [00:54:44]: Depends on the tokenizer. Are there non-public tokenizers? I guess you cannot answer or you would admit it. Are the 100K and 200K vocabs, like the ones that you use across all models or?
**Michelle** [00:54:52]: Yeah, I think we have docs that publish more information. I don't have it off the top, but I think we publish which tokenizers for which model.
**Alessio** [00:55:00]: Okay. So those are the only two.
**Swyx** [00:55:02]: Rate limits, the tiering and rate limiting system. I don't think there was an official blog post announcing this, but it was mentioned that you started tying fine-tuning and feature rollouts to tiers. From your point of view, how do you manage that, and what should people know about the tiering and rate limiting system? Yeah.
**Michelle** [00:55:20]: I think basically the main changes here were to be more transparent and easier to use. So before developers didn't know what tier they're in. And now you can see that in the dashboard. I think it's also, I think we publish like how you move from tier to tier. And so this just helps us do kind of gated rollouts for the fine tuning launch. I think everyone tier two and up has full access. That makes sense.
**Swyx** [00:55:40]: You know, I would just advise people to just get to tier five as quickly as possible. Like a gold star customer, you know, like, I don't know, it seems to make sense.
**Alessio** [00:55:48]: Do we want to maybe wrap with future things and kind of like how you think about designing and everything? So you just mentioned you want to be the easiest way to basically do everything. What's the relationship with other people building in the developer ecosystem? Like I think maybe in the early days it's like, okay, we only have these APIs and then everybody helps us. But now you're kind of building a whole platform. How do you make decisions? Yeah.
**Michelle** [00:56:11]: I think kind of the 80-20 principle applies here. We'll build things that kind of capture, you know, 80% of the value and maybe leave the long tail to other developers. So we really prioritize by like, how much feedback are we getting? How much easier will this make something, like an integration for a developer? So yeah, we want to do more in this space and not just be an LLM as a service, but kind of AI development platform as a service.
**Swyx** [00:56:34]: Ooh, okay. That ties into a thing that I put in the notes that we prepped. There are other companies trying to be AI development platforms. So will you compete with them, or do they just want to know what you won't build so that they can build it? Yeah.
**Michelle** [00:56:50]: It's a tough question. I think we haven't, you know, determined what exactly we will and won't build, but you can think of something, if it makes it a lot easier for developers to integrate, you know, it's probably on our radar and we'll, you know, stack rank by impact.
**Swyx** [00:57:03]: Yeah. So there's like cost tracking and model fallbacks. Model fallbacks is an interesting one because people do it. I don't think it adds a ton of value, but like, if you don't build it, I have to build it because if one API is down or something, I need to fall back to another one.
**Michelle** [00:57:18]: Yeah. I mean, the way we're targeting that user need is just by investing a lot in reliability. And so we- Oh yeah.
**Swyx** [00:57:24]: Just don't fail.
**Michelle** [00:57:25]: I mean, we have improved our uptime, like pretty dramatically over the last year and it's been, you know, the result of a lot of hard work from folks. So you'll see that on our status page and in our continued commitment going forward.
**Alessio** [00:57:37]: What's the important thing about owning the platform that gives you the flexibility to put all the kind of messy stuff behind the scenes or yeah, how do you draw the line between what you want to include? Yeah.
**Michelle** [00:57:48]: I just think of it as like, how can we onboard the next generation of AI engineers as you put it, right? Like, what's the easiest way to get them building really cool apps? And I think it's by building stuff to kind of hide this complexity or just make it really easy to integrate. So I think of it a lot as like, what is the value add we can provide beyond just the models that makes the models really useful?
**Swyx** [00:58:08]: Okay. We'll touch on four more features of the API platform that we prepped: Batch, Vision, Whisper, and then team and enterprise stuff. So you wanted to talk about Batch. The rough idea is, the contract between you and me is that I give you the batch job, you have 24 hours to run it, and it's kind of like spot instances for the API. What should people know about it?
**Michelle** [00:58:31]: So it's half off, which is a great savings. It also works with 4o mini, so the savings on top of 4o mini is pretty crazy. Like the stuff you can do-
**Swyx** [00:58:40]: Like 7.5 cents or something per million.
**Michelle** [00:58:42]: Yeah. I should really have that number top of mind, but it's like staggeringly cheap. So I think this opens up a lot more use cases. Like let's say you have a user activation flow and you want to send them an email like maybe every day or like at certain points in their user journey. So now you can do this with the Batch API and something that was maybe a lot more expensive and not feasible is now very easy to do. So right now we have this 24 hour turnaround time for half off and curious, would love to hear from your community, like what kind of turnaround time do they want?
**Swyx** [00:59:10]: I would be an ideal user of Batch and I cannot use Batch because it's 24 hours. I need two to four.
**Michelle** [00:59:15]: Two to four hours. Okay. Yeah. That's good to know. Yeah. Just a lot of folks haven't heard about it. It's also really great for like evals, running them offline. You generally don't need them to come back within two hours.
**Swyx** [00:59:25]: I think you could do a range, right? Two to four for me, like I need to produce a daily thing and then 24 for like the average use case. And then maybe like a week, a month, who cares? For people who just have a lot to do. Yeah, absolutely.
**Michelle** [00:59:37]: So yeah, that's Batch API. I think folks should use it more. It's pretty cool.
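A sketch of the Batch flow: each JSONL line is a standalone request, and the batch completes within the window at the discounted rate. File names here are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one request per line, e.g.
# {"custom_id": "email-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # the 24-hour contract discussed above
)
print(batch.id, batch.status)

# Later: poll until complete, then download the results file.
# done = client.batches.retrieve(batch.id)
# results = client.files.content(done.output_file_id)
```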
**Alessio** [00:59:41]: Is there a future in which like six months is like free, you know? Like is there like small, is there like super small like shards of like GPU runtime that like over a long enough timeline, you can just run all these things for free?
**Michelle** [00:59:55]: Yeah, it's certainly possible. I think we're getting to the point where a lot of these are like almost free. That's true.
**Swyx** [01:00:00]: Why would they work on something that's completely free? I don't know. Okay, so Vision. Vision got GA'd. Last year, people went so wild over the GPT-4 demo, and that was primarily Vision. What was it like building the Vision API?
**Michelle** [01:00:12]: Yeah, the Vision API is super cool. We have a great team working there. I think the cool thing about Vision is that it works across our APIs. You can use it in the Assistants API, you can use it in the Batch API, in Chat Completions, and it works with Structured Outputs. I think it just helps a lot of folks with data extraction, where, you know, the spatial relationships in the data are too complicated and you can't get that over text. But yeah, there are a lot of really cool use cases.
**Swyx** [01:00:37]: I think the tricky thing for me is understanding how to go from Vision on single images to effectively just always watching. And right now, I think people just send a frame every second. Will that model ever change? Will there just be, like, I stream you a video? And then...
**Michelle** [01:00:55]: Yeah, I think it's very possible that we'll have an API where you stream video in. And maybe, you know, to start, we'll do the frame sampling for you.
**Swyx** [01:01:03]: Because the frame sampling is the default, right? Right. But I feel like it's hacky.
**Michelle** [01:01:07]: Yeah, I think it's hard for developers to do. And so, you know, we should definitely work on making that easier.
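The frame-sampling approach being discussed looks roughly like this today, as a sketch; the frame files and prompt are illustrative:

```python
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Frames sampled (say) once per second from a video, sent as base64 data URLs.
frames = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]
content = [{"type": "text", "text": "Describe what happens across these frames."}]
for path in frames:
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_image(path)}"},
    })

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)
```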
**Alessio** [01:01:11]: In the Batch API, do you have time guarantees, like order guarantees? Like if I send you a batch request for a video analysis, I need every frame to be done in order?
**Michelle** [01:01:22]: For Batch, you send like a list of requests and each of them stand alone. So you'll get all of them finished, but they don't kind of chain off each other.
**Alessio** [01:01:29]: Well, if you're doing a video, you know, if you're doing like analyzing a video...
**Swyx** [01:01:33]: I wasn't linking video to Batch, but that's interesting.
**Alessio** [01:01:36]: Yeah, well, a video is like, you know, if you have a very long video, you can just do a Batch of all the images and let it process. That's a good idea.
**Swyx** [01:01:43]: You could offer like sequential through...
**Alessio** [01:01:46]: Yeah, yeah, exactly. But the whole point of Batch is you're just using spare time to run it. Let's talk about my favorite model, Whisper. Yeah, I built this thing called Smol Podcaster, which is an open source tool for podcasters. And why does the Whisper API not have diarization when everybody is transcribing people talking? That's my main question.
**Michelle** [01:02:07]: Yeah, it's a good question. And you've come to the right person. I actually worked on the Whisper API and shipped that. That was one of my first APIs I shipped. Long story short is that like Whisper v3, which we open sourced, has, I think, the diarization feature, but there's some performance trade-offs. So Whisper v2 is better at some things than Whisper v3. And so it didn't seem that worthwhile to ship Whisper v3 compared to the other things in our priorities. I think we still will at some point, but yeah, it's just, you know, there's always so many things we could work on. It's tough to do everything.
**Alessio** [01:02:38]: We have a Python notebook that does the diarization for the pod, but I would just like... You can translate like 50 languages, but you cannot tell me who's speaking. That was like the funniest thing.
**Michelle** [01:02:50]: There's like an XKCD thing about this, about hard problems in AI. Yeah, yeah, yeah.
**Swyx** [01:02:54]: Exactly.
**Michelle** [01:02:55]: It's like, tell me if this was taken in a park, and like, that's easy. And it's like, tell me if there's a bird in this picture, and it's like, give me 10 people on a research team. Yeah. It's like, you never know which things are challenging, and diarization is, I think, you know, more challenging than expected.
**Swyx** [01:03:08]: Yeah.
**Alessio** [01:03:09]: It still breaks a lot with overlaps, obviously. Sometimes it struggles with similar voices, and I need to double-read the thing. Totally. But yeah, great model. It used to take us so long to do transcriptions. And I don't know why, but Smol Podcaster has better transcription than mostly every commercial tool. It beats Descript. And I'm like, I'm just using the model. I'm literally not doing anything. You know, it's just a notebook. So it just speaks to how sometimes just using the simple OpenAI model is better than figuring out your own pipeline thing.
**Swyx** [01:03:40]: Totally. I think the top feature request there, again using you as a feature request dump, would be being able to bias the vocab. I think in raw Whisper you can do that. You can pass a prompt in the API as well. But you pass in the prompts? Okay. Yeah.
**Michelle** [01:03:57]: There's no more deterministic way to do it. So this is really helpful when you have like acronyms that aren't very familiar to the model. And so you can put them in the prompt and you'll basically get the transcription using those correctly.
**Alessio** [01:04:06]: We have the AI engineer solution, which is just a dictionary.
**Michelle** [01:04:09]: Nice.
**Alessio** [01:04:10]: We have all the ways it misspelled things in the past, and then we gsub and replace them.
**Michelle** [01:04:14]: If it works, it works. Like that's engineering. Yeah.
**Alessio** [01:04:17]: Okay. It's like, you know, llama with one L, or all these different things, or LangChain. It transcribes LangChain in like three or four different ways.
**Michelle** [01:04:28]: Yeah. You guys should try the prompt feature.
**Swyx** [01:04:30]: I love these like kind of pro tip. Okay. Fun question. I know we don't know yet, but I've been enjoying the advanced voice mode. It really streams back and forth and it handles interruptions. How would your audio endpoint change when that comes out?
**Michelle** [01:04:44]: We're exploring, you know, new shape of the API to see how it would work in this kind of speech to speech paradigm. I don't think we're ready to share quite yet, but we're definitely working on it. I think just the regular request response probably isn't going to be the right solution.
**Swyx** [01:04:57]: For those who are listening along, I think it's pretty public that OpenAI uses LiveKit for the ChatGPT app, which seems to be the socket-based approach that people should be at least up to speed on. I think a lot of developers only do request-response, and that doesn't work for streaming. Yeah.
**Michelle** [01:05:13]: When we do put out this API, I think we'll make it really easy for developers to figure out how to use it. Yeah. It's hard to do.
**Swyx** [01:05:19]: It'll be a paradigm change. Okay. And then I think the last one on our list was team enterprise stuff. Audit logs, service accounts, API keys. What should people know? Yeah.
**Michelle** [01:05:27]: What's in the enterprise offering? Yeah. We recently shipped our admin and audit log APIs. And so a lot of enterprise users have been asking for this for a while. The ability to kind of manage API keys programmatically, manage your projects, get the audit log. And so we've shipped this and for folks that need it, it's out there and happy for your feedback.
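A hedged sketch of pulling recent audit log events. The endpoint path `/v1/organization/audit_logs`, the `OPENAI_ADMIN_KEY` environment variable, and the printed fields are assumptions here; check the current API reference before relying on any of them:

```python
import os
import requests

resp = requests.get(
    "https://api.openai.com/v1/organization/audit_logs",
    # Assumes an admin API key, which is distinct from a regular project key.
    headers={"Authorization": f"Bearer {os.environ['OPENAI_ADMIN_KEY']}"},
    params={"limit": 20},
)
resp.raise_for_status()
for event in resp.json().get("data", []):
    print(event.get("type"), event.get("effective_at"))
```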
**Swyx** [01:05:43]: Yeah. Awesome. I don't use them. So I don't know. I imagine it's just like build your own internal gateway for your internal developers to manage your deployment of OpenAI.
**Michelle** [01:05:53]: Yeah. I mean, if you work at like a company that needs to keep track of all the API keys, it was pretty hard in the past to do this in the dashboard. We've also improved our SSO offering. So that's much easier to use now.
**Alessio** [01:06:04]: The most important feature of any enterprise company.
**Michelle** [01:06:07]: Yeah. So.
**Alessio** [01:06:09]: All right. Let's go outside of OpenAI. What about just you personally? So you mentioned Waterloo. Maybe let's just do why is everybody at Waterloo cracked and why are people so good and why have people not replicated it or any other commentary on your experience?
**Michelle** [01:06:24]: The first is the co-op program. It's obviously really good. You know, I did six internships and learned so much in those. I think another reason is that Waterloo is, you know, very cold in the winter. It's pretty miserable. There's not that much to do apart from study and hack on projects, and there's this big hacker mentality. Hack the North is a very popular hackathon, and there are a lot of startup incubators. It just kind of has this startup and hacker ethos. That, combined with the six internships, means you get people who graduate with two years of experience, and they're very entrepreneurial and, you know, down to grind.
**Swyx** [01:07:00]: I do notice a correlation between climate and the crackedness of engineers. So, you know, it's no coincidence that Seattle is the birthplace of Microsoft and Amazon. I had this compilation about Denmark: it's the birthplace of C++, PHP, Turbo Pascal, Standard ML, BNF, the thing that we just talked about, MD5Crypt, Ruby on Rails, Google Maps, and V8 for Chrome. And it's because, according to Bjarne Stroustrup, the creator of C++, there's nothing else to do.
**Alessio** [01:07:29]: Yeah. Well, yeah. Linus Torvalds in Finland.
**Michelle** [01:07:33]: I mean, you hear a lot about this, like in relation to SF, people say, you know, New York is way more fun. There's nothing to do in SF. And maybe it's a little by design that all tech is here.
**Alessio** [01:07:41]: The climate is too good. Yeah. If we also have fun things to do.
**Swyx** [01:07:44]: Nature is so nice. You can touch grass. Why are we not touching grass?
**Michelle** [01:07:47]: You know, restaurants close at like 8pm. Like that's what people are referring to. There's not a lot of like late night dining culture. Yeah. So you have time to wake up early and get to work.
**Swyx** [01:07:58]: You are a book recommender or book enjoyer. What underrated books do you recommend most to others?
**Michelle** [01:08:03]: Yeah, I think a book I read somewhat recently that was very formative was The Making of Prince of Persia. It's a Stripe Press book. That book just made me want to work hard like nothing I've ever read. It's just this journal of what it takes to build, you know, incredible things. So I'd recommend that.
**Alessio** [01:08:20]: Yeah. It's funny how video games are, for a lot of people, at least for me, kind of like some of the big moments in technology. Like when I played The Sands of Time on PS2, it was my first PlayStation 2 game, and I was like, man, this thing is so crazy compared to any PlayStation 1 game. It's like, wow, my expectations for the technology were raised. I think OpenAI does a lot of similar things, like advanced voice mode. You see that thing and then you're like, OK, what I can expect from everybody else is kind of raised now, you know?
**Michelle** [01:08:47]: Totally. Another book I like to plug is called Misbehaving by Richard Thaler. He's a behavioral economist and talks a lot about how people act irrationally in terms of decision making. And I actually think about that book like once a week, probably at least when I'm making a decision and I realize that, you know, I'm falling into a fallacy or, you know, it could be a better decision.
**Swyx** [01:09:06]: Yeah.
**Michelle** [01:09:07]: You did a minor in psych? I did. Yeah. I don't know if I learned that much there, but it was interesting.
**Swyx** [01:09:11]: Is there like an example of like a cognitive bias or misbehavior that you just love telling people about?
**Michelle** [01:09:18]: Yeah. People. So let's say you won tickets to like a Taylor Swift concert and I don't know how much they're going for, but it's probably like $10,000. Oh, OK.
**Swyx** [01:09:27]: Or whatever.
**Michelle** [01:09:28]: Sure. And a lot of people are like, oh, I have to keep these, like, I won them, it's $10,000. But really, it's the same decision you're making: if you had $10,000, would you buy these tickets? And so people don't really think about it rationally. Would they rather have $10,000, or the tickets? A lot of the time it's going to be the $10,000, but they're biased because they won them; the world organized itself this way, so they feel like they should keep them for some reason. Yeah.
**Swyx** [01:09:49]: Oh, OK. I'm pretty familiar with this stuff. There's also a loss version. Yes. There's also a loss version where it's like, if I take it away from you, you respond more strongly than if I give it to you.
**Michelle** [01:09:58]: Yes. People are really upset if they don't get a promotion, but if they do get a promotion, they're like, OK, phew. It's not even, you know, excitement. We react a lot worse to losing something.
**Swyx** [01:10:11]: Which is why, like when you join like a new platform, they often give you points and then they'll take it away if you like don't do some action in the first few days. Yeah, totally.
**Michelle** [01:10:20]: Yeah. So he references people who operate very rationally as Econs, as a separate group from humans. And I often think, you know, what would an Econ do here in this moment, and try to act that way.
**Swyx** [01:10:34]: OK, let's do this. Are LLMs econs?
**Michelle** [01:10:36]: I mean, they are maximizing probability distributions.
**Swyx** [01:10:42]: Minimizing loss.
**Michelle** [01:10:43]: Yeah. So I think way more than all of us, they are econs.
**Swyx** [01:10:46]: Whoa. OK. So they're more rational than us?
**Michelle** [01:10:49]: Hmm. Yeah. Their optimization functions are more clear than ours. Yeah.
**Alessio** [01:10:53]: Just to wrap, you mentioned you need help on a lot of things. Yeah. Any specific roles, call outs, and also people's backgrounds. Like is there anything that they need to have done before, like what people fit well at OpenAI?
**Michelle** [01:11:04]: Yeah. We've hired people from all kinds of backgrounds: people who have a PhD in ML, or folks who've just done engineering like me. And we're really hiring for a lot of teams. We're hiring across the applied org, which is where I sit, for engineering and for a lot of researchers. And there's a really cool model behavior role that we just dropped. So yeah, across the board, we'd recommend checking out our careers page, and you don't need a ton of experience in AI specifically to join.
**Swyx** [01:11:30]: I think one thing that I'm trying to get at is what kind of person does well at OpenAI? I think objectively, you have done well. And I've seen other people not do as well and basically be managed out. I know it's an intense environment.
**Michelle** [01:11:43]: I mean, the people I enjoy working with the most are low ego, do what it takes, ready to roll up their sleeves, do what needs to be done, and unpretentious about it. I also think folks that are very user-focused do well on the API and ChatGPT. The YC ethos of build something people want is very true at OpenAI as well. So I would say low ego, user-focused, driven.
**Alessio** [01:12:08]: Cool. Yeah, this was great. Thank you so much for coming on.
**Michelle** [01:12:11]: Thanks for having me.