Model: gemini/gemini-2.5-pro
Full transcript with timestamps
{
"type": "object",
"properties": {
"items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"timestamp": {
"type": "string",
"description": "mm:ss"
},
"speaker": {
"type": "string",
"description": "best guess at name"
},
"text": {
"type": "string"
}
},
"required": [
"timestamp",
"speaker",
"text"
]
}
}
},
"required": [
"items"
]
}

- audio/mpeg: /private/tmp/gemini-2.5_smaller.m4a
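For reference, a structured-output call of this shape can be reproduced with the google-genai Python SDK. This is a minimal sketch, not the exact invocation that produced this log; the header format suggests a CLI wrapper was used, and the client setup and upload call here are assumptions for illustration. The prompt, model string, file path, and schema are taken from the log above.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Upload the audio so it can be referenced in the prompt contents.
audio = client.files.upload(file="/private/tmp/gemini-2.5_smaller.m4a")

# Same JSON schema as shown above; it forces the model to emit the
# {"items": [...]} structure seen in the response.
schema = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "timestamp": {"type": "string", "description": "mm:ss"},
                    "speaker": {"type": "string", "description": "best guess at name"},
                    "text": {"type": "string"},
                },
                "required": ["timestamp", "speaker", "text"],
            },
        },
    },
    "required": ["items"],
}

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=["Full transcript with timestamps", audio],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=schema,
    ),
)
print(response.text)  # the JSON transcript, as logged below
```

Setting `response_mime_type` to `application/json` together with `response_schema` constrains decoding to the schema, rather than relying on the prompt text alone.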
{
"items": [
{
"speaker": "Logan Koepke",
"text": "Hey everyone. How's it going?",
"timestamp": "01:01"
},
{
"speaker": "Logan Koepke",
"text": "We will get started in a couple of minutes after all of the awkwardness of starting a X space. So, hang in there for one or two more minutes.",
"timestamp": "01:06"
},
{
"speaker": "Logan Koepke",
"text": "Hey Zach, Hey Tulsi, Hey Melvin, Hey Anka.",
"timestamp": "01:21"
},
{
"speaker": "Zach",
"text": "Hello.",
"timestamp": "01:25"
},
{
"speaker": "Melvin Johnson",
"text": "Hello.",
"timestamp": "01:27"
},
{
"speaker": "Tulsee Doshi",
"text": "Yay, hello. Hello.",
"timestamp": "01:29"
},
{
"speaker": "Logan Koepke",
"text": "Awesome. I think we have everyone. Are we Google AI host uh account? Are we good to get started?",
"timestamp": "01:37"
},
{
"speaker": "Logan Koepke",
"text": "I see the three more minutes. Cool. We will hang tight for just another minute or two.",
"timestamp": "01:49"
},
{
"speaker": "Logan Koepke",
"text": "We need lobby music in this uh in this space.",
"timestamp": "02:03"
},
{
"speaker": "Logan Koepke",
"text": "Okay, I think we are going to get started. Um, we have lots of awesome announcements uh that have just been rolling out. We've got lots of awesome questions from folks who who sent in a bunch of questions. Uh, so excited to dive into all this. Um, my name's Logan Patrick. Uh, excited to have this conversation. I do developer product stuff. Um, we're joined by an amazing set of folks, um, who I'm excited to sort of get them to share their perspective.",
"timestamp": "03:11"
},
{
"speaker": "Logan Koepke",
"text": "Um, Tulsi Doshi who's our head of product for Gemini models, um, and the co-conspirator of lots of these launches. Uh, hopefully you've seen a bunch of uh her incredible tweets. Um, so we'll we'll make that happen. Um, Anka is our senior director for AI safety and alignment, um, and actually also one of the post-training co-leads. Um, so Anka, I'm excited to to get your perspective today. Melvin Johnson, um, who is a distinguished software software engineer, um, another post-training person on our team who's driving uh some of the cross Google initiatives there. So, excited to have Melvin. Um, and then Zach is our product lead for Gemini pre-training and embeddings, um, and has done a bunch of the small model launches. Um, and obviously one of the new models for today's announcement was around Flashlight. So excited to talk about that our smallest uh our smallest model and smallest reasoning model. Um, so this is exciting to have the conversation. Maybe Tolsi, you can actually kick us off with just a high level of some of the model announcements. We've obviously had a bunch of stuff with 2.5 over the last few months, but today it feels like the culmination of of a lot of that stuff coming together.",
"timestamp": "03:38"
},
{
"speaker": "Tulsee Doshi",
"text": "Yeah, first of all, hey everyone. Uh, we're super excited to be here. Uh, it's a, it's a cool day for us because I think this is really today's set of launches is 2.5 as a family kind of taking the next step. It's us actually having uh, 2.5 Pro and Flash be stable production ready models that we will be supporting for a long time. Um, so this is really us taking the set of previews that we've shipped and gotten your feedback on and iterated on and bringing them now into these stable 2.5 Pro and 2.5 flash launches. And both models are awesome. We can talk more about what makes them uh so great, but we're really excited for you to keep building building on the 2.5 family. And then as Logan said, we've released also flash light. Uh, so now the 2.5 family has three model sizes. There's Pro, Flash and Flash light. And the way you can think about them is Pro is really the model that is um, just amazing performance, right? So if you're trying to get the best quality, especially for code or complex prompts, uh, Pro is your model. Flash is this kind of good workhorse model. It is great cost for quality. Um, but it's also just an extremely strong reasoning model. Uh, and is also has good latency to be able to support real-time use cases. So especially if you care about that kind of real-time latency, live performance, things like that. And then flashlight, which we're introducing today, the 2.5 flashlight model is really optimized to be our fastest and cheapest model. Um, and so it really gives that kind of opportunity, especially if you're caring about, um, high latency or tasks that require sorry, very low latency, so like high bandwidth, um, but also tasks where you need a lot of throughput and need to actually worry about the cost kind of at scale. Uh flashlight is a great model for a lot of those tasks. uh and we've seen and been having a lot of fun with customers over the last few weeks testing this model and trying to figure out what are the use cases it can it can do really well for. So yeah, we're really excited. I think this is going to be frankly like the best family of models we've had and maybe the world.",
"timestamp": "04:49"
},
{
"speaker": "Logan Koepke",
"text": "Yeah, I feel like the best part of this Tosi is there's no more preview and at least two of our model names. Uh so hopefully that makes the lives of people easier as they think about which model to use. Um, and I think that that actually takes me naturally to this question. I saw a bunch of the replies to this thread and others were just around like, how should people think about um what model they should be using especially as it feels like there's this capability shift happening. um where like Flash has historically Flash was this small workhorse model and I think now with reasoning capabilities and the 2.5 flash, it's actually doing it feels like a lot more historically um than it was in the past. So I'm curious how you how you think about that and if others have thoughts, uh, please feel free to jump in as well.",
"timestamp": "06:55"
},
{
"speaker": "Tulsee Doshi",
"text": "Yeah, I mean, actually, maybe I'll I'll push this to Melvin. Curious your take on it since you've been thinking about kind of this full series of models and continuing to push the quality of of flash and continuing to make it better. I'm curious like how you think about the continuous improvements of this model and the fact that they're kind of jumping in performance every time we do one of these, you know, revs on our side.",
"timestamp": "07:40"
},
{
"speaker": "Melvin Johnson",
"text": "Yeah, I think uh this is something we've like the 2.5 family we've had a lot of fun with it uh from the post training side. so it was a big jump from pre-training coming in and we generally like to do this sort of cycle where we first put out the pro, then we use the pro to further improve the flash. And now that the flash gets really good, the gap between the flash and the pro reduces. So we want to push the pro further. So we've done that cycle over time, uh we've leaned in on this Pareto frontier graph that you will see that we're publishing a lot. We really want to be at the frontier of the Pareto when it comes to cost and uh performance. and for flash, we want since this is the workhorse model, we want this to be most optimal in terms of that tradeoff. And for Pro, we want to optimize for higher quality at the expense of cost and latency. So, I'm quite excited at where we've landed on the frontier and you know, if you look at the graphs, we are the frontier with the family of the uh family of models that we have in 2.5. and we'll continue to push uh push upward uh with uh the next set of uh releases. But you know, overall happy with these sets of models becoming stable, becoming GA, and overall happy with where we've landed on the graph with Flash and Pro.",
"timestamp": "08:02"
},
{
"speaker": "Tulsee Doshi",
"text": "And one thing I'll say actually on that point too is one thing that's kind of cool about flash is it even with all these performance gains, it continues to stay a very fast model, right? So it's actually like if you look at its speed compared to other competitors, its decode speed, uh tokens per second is actually still extremely competitive. It's actually probably one of the fastest models um in the market. And so I think what's kind of cool about how we're thinking about flash is we continue to want it to have uh competitive speed and competitive cost, but we continue to improve that performance and so hopefully it just becomes an even better value proposition over time.",
"timestamp": "09:30"
},
{
"speaker": "Logan Koepke",
"text": "Yeah, I love that. Anka, I'm I'm curious this um we we now have this sort of full complete setup of of reasoning models across the board from Pro to Flash, to Flashlight. Um, I I know you wear many different hats, but I'm I'm curious to like sort of get your reaction from whichever the dimensions whichever the hats you you're wearing that you want to give the reaction to about just like how that progress has happened, how it's helped with, you know, our ability to align the model, that actually making like higher quality models even at small sizes. I'm curious uh to get your perspective.",
"timestamp": "10:07"
},
{
"speaker": "Anca",
"text": "Yeah, a lot to unpack there, huh? Um, so one thing that I will say, maybe I'll start with pro. I mean, flash is very flash. It's it's we've aptly named it. Uh, pro, uh, 2.5 Pro, um, has we've been iterating on it, but, uh, one of the things that that stands out is that it's amazing at code and I'll get into that from the safety perspective to a little later. It's amazing at code, it's amazing at all the benchmarks. But something that we don't quite have a public benchmark for is maybe the way it behaves. The way it actually partners with you, the way it kind of shows up as a collaborator for you as a developer or you as an end user. Um, which is something, uh, you know, we've worked hard on, we keep working hard on, and also to be honest, we're we're starting to get a little bit for free. We're seeing as the models become more and more capable across the board, their kind of intelligence, their broad intelligence is becoming, um, really good too. So I'll give you a couple of anecdotes on behavior. And I can also talk about what I don't like, but, um, one of one of the things that struck me about the latest Pro was first of all, it's very witty and humorous. So we sometimes like to challenge the model with these like fake trolley problems, right? Because because I everyone wants to kind of trick the model and figure out how it responds and sometimes we throw these kind of ridiculous trolley problems. I threw one at it about like sort of like, oh no, I saved the toaster instead of like two cows today. Um, and I feel, you know, so it's been it was such a difficult decision. And the model, you know, most mostly our previous models would, uh, either say, okay, this is kind of a ridiculous thing or, you know, maybe show some empathy if they kind of go along with it. But this model is so cool because it just sort of goes on this rant, this hilarious rant about how these, you know, the toaster is really the pinnacle of the breakfast civilization, you know, and then and then he makes all these jokes about the toaster. Uh, what did the cows bring to you? Nothing like and then um, and then at the end it says, you know, go and you've had a really rough day, go and enjoy a slice of toast. I mean, it's just it's such a cool. I, you know, it's the first time we've kind of seen that uh, with the pro family, uh, with the pro line. And then, um, uh, so it's got humor, it's got wit. But maybe more practically, it shows up and helps you strategize. It shows up and helps you business plan. It shows up and, and, um, you know, um, if if you're trying to write a I was trying to write a letter with it, um, to my department chairs at Berkeley, and it came back at me and saying, this is really good start, but here's a very different strategy you could be taking to, you know, take this thing that you're presenting as an ask into an offer and show them what they have to benefit. And I was, I was a little bit blown away, right? So it's just the way that it actually collaborates with you to help you along. Um, I think is is new and uh, we're working to now, you know, use that to make the flash model better, um, in these areas as well. Um, so, yeah, model behavior, very, very, very transformative, um, in the 2.5 family.",
"timestamp": "10:42"
},
{
"speaker": "Logan Koepke",
"text": "Yeah, I can have a quick follow-up, which is like, what, um what what do you think the thing is that's driven some of this? Is this like a is this just like some intentional sort of design decisions that we've made or is this just like sort of emergent from like a capability improvement or like, yeah, how much of it is like us intentionally wanting some, you know, model behavior changes versus like you just make the model better at, you know, code for 2.5 Pro and then it like also becomes, you know, you know, I don't want to say wittier, but like it's better along some dimensions that that you've just described.",
"timestamp": "14:23"
},
{
"speaker": "Anca",
"text": "I would say it's both, but that I was really surprised at how much we get by improving kind of core capabilities across the board and improving reasoning. Uh, you know, I've seen this in safety as well. Um, it's always really hard uh, to kind of draw some red lines, maybe around self harm, something like that. and then make sure that the model doesn't over infer from that, doesn't over generalize and start saying no for stuff that we actually want it to help. Um, and, um, almost for free, I think reasoning kind of helps the model navigate these nuanced situations and so you'll see that that, uh, 2.5 Pro refuses, we have this metric on over refusals, it refuses a lot last in in places where it was kind of accidentally refusing before, so we've seen it there. And maybe while I'm safety, I will note that this model is so good at coding that it triggered our frontier safety early warning trigger for cyber uplift. So cyber uplift is something where we're um, um, we're we're measuring to what extent the model can help cyber attacks. And so far, you know, all models that we were testing were sort of like, ah, you know, they can help a little bit, but not in any way that we see like as capabilities will keep improving, this will become an actual serious problem. Uh, the the code and and broad improvements have made 2.5 Pro kind of come to this level where in our frontier safety report, we sort of said, look, this is for us is not the critical capability level yet. Uh, but it's early warning and we're starting our, um, the response plan for this.",
"timestamp": "14:55"
},
{
"speaker": "Logan Koepke",
"text": "Yeah, and just a quick plug for the Gemini 2.5 uh family technical report, which I think also came out today, which sort of goes into lots of detail across, um, across different, uh, axes of the new the new model launch. Um, so super helpful perspective on that. Um, Zach, I think we we sort of heard the the 2.5 Pro story and how much, um, obviously that's been exciting for folks over the last few months, but I think sort of, uh, we've also always had this sort of small model story and I'm very excited for 2.5 Flashlight. Um, also like natively with reasoning capabilities. Um, I'm curious if you can just give us the quick rundown, but also maybe for folks who don't have some of the historical context on small model stuff like Flash AB and others, if you can sort of give that historical perspective as well.",
"timestamp": "16:39"
},
{
"speaker": "Zach",
"text": "Yeah, of course. Uh start a little bit with the history. Like when we when we kicked off Gemini, uh we really started with the pro model. Um, and then when it came to uh the 1.5 series, um, we released flash and we also had this Flash 8B model. um because we were seeing a lot of customers who are asking for like lower cost, lower latency models, and we were uh really trying to meet that demand. Um, and then as we came out with the like next series, um, you know, 2.0, uh we came out with the flashlight model and made that a part of like the full family and uh we're continuing to iterate on that and now we are offering this 2.5 flashlight model. I think we've gotten to a really good place where now I think we have a much better understanding in terms of like cost latency, quality tradeoffs where we're meeting customers uh in terms of what they're asking for. Um, and making sure that we could support like the best model for their use cases. So for example, for Flashlight, and these cheaper models, we're seeing a lot of people who really love using this for rag use cases, um people, customers using it for classification, like content moderation, uh translation. So really like high volume use cases um that people are doing offline, but we also saw people asking for more latency improvements. Um I think like some of our learnings from like the Gemini uh diffusion model, um people are really excited about the low latency there. So we wanted to continue to push for that uh in the flashlight models. So now we have 2.5 which is not only our uh most cost effective model, um but it is our lowest latency model. Um so we see late significantly latency gains in terms of what was available in 2.0 flashlight and also 2.0 flash um that we're really excited about. And then beyond just uh latency improvements, as I mentioned, the quality is significantly better than the 2.0 flashlight model. Um and we see some new capabilities. So the model is now um has the thinking capabilities uh and it also has uh like the ability to use tools um which was missing uh like uh um code execution and search. Uh so we're we're glad that this model is like a more complete uh uh part of the family and uh we're excited to see how people use it.",
"timestamp": "17:24"
},
{
"speaker": "Melvin Johnson",
"text": "I mean I think the other thing is from the usage, we want to understand for these different models, flashlight, flash and Pro, what the dynamic thinking budget needs to be, like how how much it needs to think? Like are people latency sensitive for certain kinds of use cases versus not? and you know, we just don't know. Uh we're actually at the frontier here because we're trying to build both the reasoning and the chat and the API model into one. Uh so we really want to figure out, you know, how to do it right instead of shipping like six different models.",
"timestamp": "32:45"
},
{
"speaker": "Logan Koepke",
"text": "Yeah, 100%. Um, we've got two minutes left and I'll I'll take two more questions really quickly. One of them was, um, why did 2.5 Flash increase its price by 100% input and 400% output seems extreme. Um, just for folks who don't have context on the API pricing, um, 2.5 Flash previously had a different input price and uh had the same input price but different output prices, uh depending on whether it was using thinking versus not thinking. Um, and we got a bunch of feedback on this. So the sort of GA version of the model, we consolidated to a single price point. Um, the input price went up a little bit, um, but the output price relative to the reasoning version, um, actually went down slightly. Um, so we think of 2.5 Flash and anyone feel free to jump in and add extra context as like a super performant reasoning model. Like the main use cases for 2.5 Flash are the reasoning use cases. Um, you can turn reasoning off if you want to, but like the model was really built to be um a reasoning model from the ground up. Uh, that takes me to the question of like, um someone was asking, we were using 2.5 Flash non-thinking without the reasoning capability. Um, is it safe to say that we can migrate to 2.5 Flashlight uh with no performance differences? So we did um I'm curious Zach, Melvin, Tolsika, someone wants to jump in here, but we did release a a bunch of like graphics which show some of the the model metrics. Um I think this depends like very much on the use case, like what the difference will look like, but I don't know if anyone sort of broad strokes wants to talk about um the differences. I'll make a quick note, which is one of the things I'm happy about, which is if you were using, um, the non-thinking version of 2.5 Flash, if you migrate to Flashlight, the like base prices are actually less expensive. Um, so hopefully for some use cases it actually becomes more cost efficient, um, and you know, relatively in the same ballpark for performance, but I don't know if who wants to jump in here and add more context.",
"timestamp": "33:19"
},
{
"speaker": "Zach",
"text": "Yeah, I can, I can, uh, talk a little bit about here. Um, yeah, so we're we're, you know, want to make sure that migration paths are as smooth as possible. Um, at least for the 2.5 Flash, um, flashlight model. Um, the model is, uh, you know, definitely significantly higher quality, um, than the 2.0, uh, flashlight. Um, we also with thinking turned on, it is higher quality than the 2.0 Flash. Um, with out the thinking model, without the thinking turned on, I think, as you were mentioning, Lo Logan, it's pretty use case specific where I think for some of the easier, simpler use cases, you're definitely going to see the model on par with quality and in other cases, uh, it might not be on par. I think the model is in preview, so this is, you know, key area that we're trying to learn more so that the migration can be as smooth as possible. um, and make sure that we hill, continue to hill climb on the most important use cases that people are prioritizing for the flashlight model.",
"timestamp": "35:20"
},
{
"speaker": "Logan Koepke",
"text": "Yeah, 100%. I love that.",
"timestamp": "36:23"
},
{
"speaker": "Tulsee Doshi",
"text": "I was just going to say, I think, um, this is also going back to the kind of what is our launch approach and what are we trying to do here. I think we really want to make sure that flashlight is a model that ideally developers can rely on as a as a solid migration path from from some of the 2.0 models. And so, um, you know, this is a call to action to all of you who are listening, like as you're trying flashlight and you're finding areas where it's working really well, share those if you're finding areas where it's not, uh, and you're seeing deltas in performance. That's also really helpful for us, um, to to something Melvin said earlier, you know, we're continuing to push the performance of these models. We're going to do that for flashlight too. Uh, and we want to make sure that we, you know, know where we should really be investing, where it's most valuable for all of you.",
"timestamp": "36:27"
},
{
"speaker": "Logan Koepke",
"text": "Yeah, I love that. Awesome. Well, this was, um, this was wonderful. I feel like it's always a pleasure to get you all together and and have these conversations. Um, I'll echo Tulsi's comment, which is, I think part of what makes the model improvement flywheel spin is feedback from developers. So if you have stuff, we're all here. Uh, please ping any of us. Uh, we'll continue to make uh progress on on hill climbing. Um, thank you for all the questions. Thank you for all of the speakers, Tosi Melvin, Zach and Anka for taking the time to chat and excited hopefully to do more of these uh sometime soon.",
"timestamp": "37:09"
},
{
"speaker": "Tulsee Doshi",
"text": "Thanks.",
"timestamp": "37:46"
},
{
"speaker": "Melvin Johnson",
"text": "Thanks. Bye-bye.",
"timestamp": "37:47"
},
{
"speaker": "Logan Koepke",
"text": "Bye folks.",
"timestamp": "37:48"
},
{
"speaker": "Logan Koepke",
"text": "See you. Take care.",
"timestamp": "37:49"
},
{
"speaker": "Anca",
"text": "Thanks everybody.",
"timestamp": "37:50"
}
]
}

74,073 input, 8,856 output, {"candidatesTokenCount": 6324, "promptTokensDetails": [{"modality": "TEXT", "tokenCount": 5}, {"modality": "AUDIO", "tokenCount": 74068}], "thoughtsTokenCount": 2532}
147,481ms, so 147.5 seconds.
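A quick back-of-the-envelope on those token counts, assuming Gemini 2.5 Pro's list prices at GA ($1.25 per million input tokens for prompts up to 200k tokens, $10 per million output tokens, with thinking tokens billed as output):

```python
# Rough cost estimate, assuming 2.5 Pro GA list prices:
# $1.25/M input (prompts up to 200k tokens), $10/M output incl. thinking.
input_cost = 74_073 / 1_000_000 * 1.25   # ~$0.093
output_cost = 8_856 / 1_000_000 * 10.0   # ~$0.089 (6,324 candidates + 2,532 thoughts)
total = input_cost + output_cost         # ~$0.18 for the full transcript
throughput = 8_856 / 147.5               # ~60 output tokens/second
print(f"${total:.2f} total, {throughput:.0f} tok/s")
```

Gemini tokenizes audio at roughly 32 tokens per second, so the 74,068 audio tokens correspond to about 39 minutes of audio, which lines up with the final 37:50 timestamp in the transcript.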