Tesla AI Day 2022 Transcription with Whisper
[00:27.280 --> 00:31.880] All right, welcome everybody, give everyone a moment to
[00:31.880 --> 00:36.280] get back in the audience and
[00:38.440 --> 00:44.640] All right, great. Welcome to Tesla AI Day 2022
[00:51.680 --> 00:57.140] We've got some really exciting things to show you I think you'll be pretty impressed
[00:57.140 --> 01:02.380] I do want to set some expectations with respect to our
[01:03.380 --> 01:05.380] Optimus robot as
[01:05.660 --> 01:09.300] As you know, last year it was just a person in a robot suit
[01:11.100 --> 01:17.620] But we've come a long way, and I think, you know, compared to that it's gonna be very impressive
[01:18.420 --> 01:20.420] and
[01:20.980 --> 01:22.740] We're gonna talk about
[01:22.740 --> 01:30.340] the advancements in AI for full self-driving, as well as how they apply more generally to real-world AI problems
[01:30.340 --> 01:33.300] like a humanoid robot, and even going beyond that
[01:33.940 --> 01:38.300] I think there's some potential that what we're doing here at Tesla could
[01:39.140 --> 01:42.020] make a meaningful contribution to AGI and
[01:43.940 --> 01:46.540] And I think actually Tesla is a good
[01:46.540 --> 01:54.460] entity to do it from a governance standpoint, because we're a publicly traded company with one class of stock and
[01:54.980 --> 01:59.420] that means that the public controls Tesla, and I think that's actually a good thing
[02:00.580 --> 02:03.420] So if I go crazy you can fire me. This is important
[02:04.860 --> 02:06.860] Maybe I'm not crazy. I don't know
[02:07.340 --> 02:09.340] so
[02:09.340 --> 02:17.620] Yeah, so we're gonna talk a lot about our progress in AI and Autopilot, as well as progress with Dojo, and
[02:18.060 --> 02:23.540] then we're gonna bring the team out to do a long Q&A so you can ask tough questions
[02:25.300 --> 02:30.100] or whatever you'd like, existential questions, technical questions, but we want to have
[02:30.860 --> 02:36.300] as much time for Q&A as possible. So let's see, with that
[02:36.300 --> 02:38.300] That's because
[02:39.820 --> 02:44.180] Hey guys, I'm Milan, I work on Autopilot and the bot, and I'm Lizzie
[02:45.260 --> 02:48.580] Mechanical engineer on the project as well. Okay
[02:49.660 --> 02:54.100] So should we bring up the bot? Before we do that, we have one
[02:55.660 --> 03:02.220] little bonus tip for the day. This is actually the first time we try this robot without any backup support
[03:02.220 --> 03:06.980] cranes, mechanical mechanisms, no cables, nothing. Yeah
[03:06.980 --> 03:30.340] We wanted to do it with you guys tonight. That is the first time. Let's see. You ready? Let's go
[03:36.980 --> 04:06.900] I think the bot got some moves
[04:10.980 --> 04:24.340] This is essentially the same self-driving computer that runs in your Tesla cars, by the way
[04:24.340 --> 04:39.300] This is literally the first time the robot has operated without a tether, on stage tonight
[04:54.340 --> 05:13.780] So the robot can actually do a lot more than we just showed you, we just didn't want it to fall on its face
[05:14.900 --> 05:20.500] So we'll show you some videos now of the robot doing a bunch of other things
[05:20.500 --> 05:29.380] Yeah, which are less risky. Yeah, we should close the screen, guys. Yeah
[05:35.300 --> 05:41.300] Yeah, we wanted to show a little bit more of what we've done over the past few months with the bot, just walking around and dancing on stage
[05:41.300 --> 05:53.460] Just humble beginnings, but you can see the Autopilot neural networks running, just retrained for the bot directly on that new platform
[05:54.340 --> 06:00.180] That's my watering can. Yeah, when you see a rendered view, that's the world the robot sees
[06:00.580 --> 06:06.500] So it's very clearly identifying objects, like this is the object it should pick up, picking it up
[06:06.500 --> 06:10.500] Yeah
[06:12.980 --> 06:19.220] We use the same process as we did for Autopilot to collect data and train neural networks that we then deploy on the robot
[06:19.860 --> 06:23.220] That's an example that illustrates the upper body a little bit more
[06:26.260 --> 06:30.500] Something that we'll try to nail down over the next few months, I would say
[06:30.500 --> 06:36.980] to perfection. But this is a real actual station in the Fremont factory that it's working at
[06:48.900 --> 06:52.340] And that's not the only thing we have to show today, right? Yeah, absolutely. So
[06:52.340 --> 07:03.460] What you saw was what we call Bumble C. That's our sort of rough development robot, using semi off-the-shelf actuators
[07:04.660 --> 07:09.540] But we actually have gone a step further than that already the team's done an incredible job
[07:10.260 --> 07:15.700] And we actually have an Optimus bot with fully Tesla-designed and built actuators
[07:15.700 --> 07:22.740] battery pack, control system, everything. It wasn't quite ready to walk
[07:23.380 --> 07:25.380] But I think it will walk in a few weeks
[07:25.940 --> 07:27.940] But we wanted to show you the robot
[07:28.500 --> 07:32.420] something that's actually fairly close to what will go into production
[07:32.420 --> 07:48.340] and show you all the things it can do. So let's bring it out
[08:02.820 --> 08:04.820] All right
[08:16.260 --> 08:18.260] Yeah
[08:18.260 --> 08:30.660] So here you're seeing Optimus with
[08:31.940 --> 08:33.460] the
[08:33.460 --> 08:38.500] degrees of freedom that we expect to have in Optimus production unit one
[08:39.060 --> 08:42.580] which is the ability to move all the fingers independently
[08:42.580 --> 08:47.620] to have the thumb with two degrees of freedom, so it has opposable thumbs
[08:48.340 --> 08:54.100] and both left and right hands, so it's able to operate tools and do useful things. Our goal is to make
[08:55.300 --> 08:58.420] a useful humanoid robot as quickly as possible
[08:59.060 --> 08:59.940] and
[08:59.940 --> 09:07.300] We've also designed it using the same discipline that we use in designing the car, which is to say, to design it for
[09:07.300 --> 09:15.300] manufacturing, such that it's possible to make the robot in high volume, at low cost, with high reliability
[09:16.020 --> 09:22.020] So that's incredibly important. I mean, you've all seen very impressive humanoid robot demonstrations
[09:22.740 --> 09:24.820] and that's great, but what are they missing?
[09:25.620 --> 09:32.020] They're missing a brain; they don't have the intelligence to navigate the world by themselves
[09:32.420 --> 09:34.420] And they're also very expensive
[09:34.420 --> 09:41.700] and made in low volume. Whereas this is Optimus, designed to be an extremely capable robot
[09:42.100 --> 09:44.580] but made in very high volume, probably
[09:45.140 --> 09:47.140] ultimately millions of units
[09:47.300 --> 09:50.660] And it is expected to cost much less than a car
[09:51.860 --> 09:56.740] So, uh, I would say probably less than 20,000 dollars would be my guess
[09:56.740 --> 09:58.740] Okay
[10:03.700 --> 10:09.220] The potential for Optimus is, I think, appreciated by very few people
[10:13.780 --> 10:15.860] As usual Tesla demos are coming in hot
[10:17.540 --> 10:19.540] So
[10:19.540 --> 10:26.820] Yeah, the team has put in an incredible amount of work
[10:27.300 --> 10:30.100] working days, you know, seven days a week
[10:30.980 --> 10:32.980] burning the 3am oil
[10:33.220 --> 10:38.500] to get to the demonstration today. Super proud of what they've done, they've really done a great job
[10:38.500 --> 10:50.500] I'd just like to give a hand to the whole Optimus team
[10:51.620 --> 10:57.620] So, you know, there's still a lot of work to be done to refine Optimus and
[10:58.340 --> 11:01.300] improve it. Obviously, this is just Optimus version one
[11:02.500 --> 11:05.060] And that's really why we're holding this event
[11:05.060 --> 11:09.300] Which is to convince some of the most talented people in the world like you guys
[11:09.940 --> 11:10.980] um
[11:10.980 --> 11:12.020] to
[11:12.020 --> 11:17.540] Join tesla and help make it a reality and bring it to fruition at scale
[11:18.260 --> 11:20.740] Such that it can help millions of people
[11:22.020 --> 11:24.980] And the potential, like, it really
[11:25.620 --> 11:30.580] boggles the mind, because you have to say, like, what is an economy? An economy is
[11:31.220 --> 11:32.420] uh
[11:32.420 --> 11:37.460] sort of productive entities times their productivity, capita times
[11:39.060 --> 11:43.140] productivity per capita. At the point at which there is not a limitation on capita
[11:43.860 --> 11:48.740] it's not clear what an economy even means at that point. An economy becomes quasi-infinite
[11:49.620 --> 11:50.740] um
[11:50.740 --> 11:52.740] so
[11:53.060 --> 11:57.460] What, you know, taken to fruition in the hopefully benign scenario
[11:59.140 --> 12:00.580] this
[12:00.580 --> 12:04.980] means a future of abundance, a future where
[12:05.860 --> 12:06.980] um
[12:06.980 --> 12:11.220] there is no poverty, where you can have whatever you want
[12:12.100 --> 12:14.100] in terms of products and services
[12:16.740 --> 12:22.100] It really is a fundamental transformation of civilization as we know it
[12:24.740 --> 12:28.500] Obviously we want to make sure that transformation is a positive one and um
[12:28.500 --> 12:30.500] safe
[12:31.300 --> 12:33.300] But that's also why I think
[12:34.020 --> 12:40.100] Tesla as an entity doing this, being a single class of stock, publicly traded, owned by the public
[12:40.900 --> 12:42.900] is very important
[12:43.300 --> 12:49.940] and should not be overlooked. I think this is essential, because then if the public doesn't like what Tesla's doing
[12:50.420 --> 12:53.700] the public can buy shares in Tesla and vote differently
[12:54.900 --> 12:56.900] This is a big deal
[12:56.900 --> 12:57.780] Um
[12:57.780 --> 13:00.420] Like, it's very important that I can't just do what I want
[13:01.140 --> 13:05.620] You know, sometimes people think that, but it's not true. So
[13:08.420 --> 13:14.020] You know, it's very important that the corporate entity that makes this happen
[13:14.580 --> 13:17.940] is something that the public can properly influence
[13:19.940 --> 13:22.980] And so I think the Tesla structure is ideal for that
[13:22.980 --> 13:26.100] Um
[13:27.460 --> 13:31.300] And like I said, you know, self-driving cars will certainly have a
[13:32.260 --> 13:36.500] tremendous impact on the world. I think they will improve
[13:37.300 --> 13:39.620] the productivity of transport by at least
[13:41.060 --> 13:44.580] a half order of magnitude, perhaps an order of magnitude, perhaps more
[13:45.700 --> 13:47.300] um
[13:47.300 --> 13:49.860] Optimus, I think, has
[13:49.860 --> 13:53.140] maybe a two order of magnitude
[13:53.940 --> 13:55.940] potential improvement
[13:56.660 --> 13:58.660] in economic output
[13:59.860 --> 14:03.940] Like, it's not clear. It's not clear what the limit actually even is
[14:05.140 --> 14:06.500] um
[14:06.500 --> 14:08.500] so
[14:08.740 --> 14:12.020] But we need to do this in the right way, we need to do it carefully and safely
[14:12.660 --> 14:16.500] and ensure that the outcome is one that is beneficial to
[14:16.500 --> 14:20.740] civilization, and one that humanity wants
[14:21.940 --> 14:24.740] This is also extremely important, obviously
[14:25.540 --> 14:27.540] so, um
[14:29.140 --> 14:36.500] And I hope you will consider joining Tesla to achieve those goals
[14:37.780 --> 14:38.740] Um
[14:38.740 --> 14:44.580] At Tesla, we really care about doing the right thing here, or aspire to do the right thing, and really not
[14:44.580 --> 14:47.540] pave the road to hell with good intentions
[14:47.860 --> 14:51.220] And I think the road to hell is mostly paved with bad intentions, but every now and again
[14:51.220 --> 14:58.100] there's a good intention in there. So we want to do the right thing. So, you know, consider joining us and helping make it happen
[14:58.660 --> 14:59.460] um
[14:59.460 --> 15:02.180] With that, let's move on to the next phase
[15:02.180 --> 15:14.420] All right, so you've seen a couple robots today. Let's do a quick timeline recap
[15:14.900 --> 15:19.380] So last year we unveiled the Tesla Bot concept, but a concept doesn't get us very far
[15:19.860 --> 15:25.380] We knew we needed a real development and integration platform to get real-life learnings as quickly as possible
[15:25.940 --> 15:28.820] So that robot that came out and did the little routine for you guys
[15:28.820 --> 15:35.940] we had that built within six months, and we've been working on software integration and hardware upgrades over the months since then
[15:36.500 --> 15:40.500] But in parallel we've also been designing the next generation this one over here
[15:42.020 --> 15:47.540] So this guy is rooted in the foundation of sort of the vehicle design process, you know
[15:47.540 --> 15:50.340] We're leveraging all of those learnings that we already have
[15:51.540 --> 15:55.860] Obviously there's a lot that's changed since last year, but there's a few things that are still the same you'll notice
[15:55.860 --> 15:59.460] We still have this really detailed focus on the true human form
[15:59.860 --> 16:05.700] We think that matters for a few reasons, but it's fun. We spend a lot of time thinking about how amazing the human body is
[16:06.420 --> 16:08.420] We have this incredible range of motion
[16:08.980 --> 16:10.980] typically really amazing strength
[16:11.380 --> 16:17.380] Um a fun exercise is if you put your fingertip on the chair in front of you you'll notice that there's a huge
[16:18.180 --> 16:24.740] Range of motion that you have in your shoulder and your elbow for example without moving your fingertip you can move those joints all over the place
[16:24.740 --> 16:30.260] Um, but the robot, you know, its main function is to do real useful work
[16:30.500 --> 16:33.860] And it maybe doesn't necessarily need all of those degrees of freedom right away
[16:34.500 --> 16:40.980] So we've stripped it down to a minimum sort of 28 fundamental degrees of freedom and then of course our hands in addition to that
[16:42.100 --> 16:46.500] Humans are also pretty efficient at some things and not so efficient at others
[16:46.980 --> 16:51.860] So for example, we can eat a small amount of food to sustain ourselves for several hours. That's great
[16:51.860 --> 16:58.660] Uh, but when we're just kind of sitting around no offense, but we're kind of inefficient. We're just sort of burning energy
[16:59.460 --> 17:05.140] So on the robot platform, what we're going to do is we're going to minimize that idle power consumption drop it as low as possible
[17:05.460 --> 17:10.580] And that way we can just flip a switch and immediately the robot turns into something that does useful work
[17:12.660 --> 17:15.620] So let's talk about this latest generation in some detail, shall we?
[17:15.620 --> 17:22.740] So on the screen here, you'll see in orange our actuators, which we'll get to in a little bit, and in blue our electrical system
[17:23.780 --> 17:25.780] So now that we have our sort of
[17:26.100 --> 17:29.540] human based research and we have our first development platform
[17:29.620 --> 17:32.900] We have both research and execution to draw from for this design
[17:33.780 --> 17:40.100] Again, we're using that vehicle design foundation. So we're taking it from concept through design and analysis
[17:40.500 --> 17:42.500] and then build and validation
[17:42.500 --> 17:49.780] Along the way, we're going to optimize for things like cost and efficiency because those are critical metrics to take this product to scale
[17:49.780 --> 17:51.300] eventually
[17:51.300 --> 17:57.300] How are we going to do that? Well, we're going to reduce our part count and our power consumption of every element possible
[17:57.860 --> 18:01.620] We're going to do things like reduce the sensing and the wiring at our extremities
[18:01.780 --> 18:07.700] You can imagine a lot of mass in your hands and feet is going to be quite difficult and power consumptive to move around
[18:07.700 --> 18:14.260] And we're going to centralize both our power distribution and our compute to the physical center of the platform
[18:15.700 --> 18:20.020] So in the middle of our torso, actually it is the torso. We have our battery pack
[18:20.500 --> 18:25.060] This is sized at 2.3 kilowatt hours, which is perfect for about a full day's worth of work
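A quick back-of-the-envelope check of that sizing (my own arithmetic, with an assumed 8-hour shift; the talk only says "a full day's worth of work"):

    # Rough sanity check on the 2.3 kWh pack; the shift length is an assumption.
    PACK_ENERGY_WH = 2300      # 2.3 kWh battery pack
    SHIFT_HOURS = 8            # assumed "full day's worth of work"
    print(f"Average power budget: {PACK_ENERGY_WH / SHIFT_HOURS:.0f} W")  # ~288 W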
[18:26.260 --> 18:32.980] What's really unique about this battery pack is it has all of the battery electronics integrated into a single pcb within the pack
[18:33.540 --> 18:35.540] So that means everything from sensing
[18:35.540 --> 18:37.540] to fusing
[18:37.620 --> 18:41.940] charge management and power distribution, is all in one place
[18:43.220 --> 18:45.220] We're also leveraging both
[18:45.220 --> 18:48.420] our vehicle products and our energy products
[18:48.980 --> 18:53.860] To roll all of those key features into this battery. So that's streamlined manufacturing
[18:54.340 --> 18:59.140] Really efficient and simple cooling methods battery management and also safety
[18:59.140 --> 19:04.900] And of course we can leverage tesla's existing infrastructure and supply chain to make it
[19:06.340 --> 19:10.100] So going on to sort of our brain it's not in the head, but it's pretty close
[19:11.060 --> 19:13.700] Also in our torso, we have our central computer
[19:14.180 --> 19:19.300] So as you know tesla already ships full self-driving computers in every vehicle we produce
[19:19.940 --> 19:24.980] We want to leverage both the autopilot hardware and the software for the humanoid platform
[19:24.980 --> 19:30.020] But because it's different in requirements and in form factor, we're going to change a few things first
[19:30.900 --> 19:34.500] So it's still going to do everything that a human brain does
[19:35.220 --> 19:42.100] processing vision data, making split-second decisions based on multiple sensory inputs, and also communications
[19:42.580 --> 19:47.780] So to support communications, it's equipped with wireless connectivity as well as audio support
[19:47.780 --> 19:54.740] And then it also has hardware level security features which are important to protect both the robot and the people around the robot
[19:56.820 --> 20:01.140] So now that we have our sort of core, we're going to need some limbs on this guy
[20:01.700 --> 20:06.340] Um, and we'd love to show you a little bit about our actuators and our fully functional hands as well
[20:06.340 --> 20:20.340] But first, before we do that, I'd like to introduce Malcolm, who's going to speak a little bit about our structural foundation for the robot
[20:24.340 --> 20:27.540] Tesla have the capabilities to analyze highly complex systems
[20:28.100 --> 20:30.100] Don't get much more complex than a crash
[20:30.580 --> 20:33.220] You can see here a simulated crash of a Model 3
[20:33.220 --> 20:35.780] superimposed on top of the actual physical crash
[20:36.500 --> 20:38.900] It's actually incredible how um, how accurate it is
[20:39.460 --> 20:41.780] Just to give you an idea of the complexity of this model
[20:42.500 --> 20:47.940] It includes every nut, bolt and washer, every spot weld, and it has 35 million degrees of freedom
[20:48.420 --> 20:50.020] quite amazing
[20:50.020 --> 20:54.660] And it's true to say that if we didn't have models like this, we wouldn't be able to make the safest cars in the world
[20:54.660 --> 21:02.020] So can we utilize our capabilities and our methods from the automotive side to influence a robot?
[21:04.500 --> 21:08.900] Well, we can make a model and since we had crash software we're using the same software here
[21:09.220 --> 21:10.740] We can make it fall down
[21:10.740 --> 21:16.820] The purpose of this is to make sure that if it falls down, and ideally it doesn't, it's only superficial damage
[21:17.620 --> 21:23.620] We don't want it to, for example, break its gearbox in its arms; that's the equivalent of a dislocated shoulder for a robot
[21:23.620 --> 21:25.620] difficult and expensive to fix
[21:26.500 --> 21:29.620] So we want it to dust itself off and get on with the job it's been given
[21:32.660 --> 21:36.020] We could also take the same model and we can drive the actuators
[21:36.500 --> 21:39.140] using the inputs from a previously solved model
[21:40.180 --> 21:42.020] Bringing it to life
[21:42.020 --> 21:49.620] So this is producing the motions for the tasks we want the robot to do; these tasks are picking up boxes, turning, squatting, walking upstairs
[21:49.620 --> 21:55.140] Whatever the set of tasks are, we can play them through the model. This is showing just simple walking
[21:55.460 --> 22:00.260] We can create the stresses in all the components that helps us optimize the components
[22:03.380 --> 22:08.340] These are not dancing robots; this is actually the modal behavior, the first five modes of the robot
[22:08.900 --> 22:16.020] And typically when people make robots, they make sure the first mode is up around the top single figures, up towards 10 hertz
[22:16.020 --> 22:23.700] The reason to do this is to make the controls of walking easier. It's very difficult to walk if you can't guarantee where your foot is, wobbling around
[22:24.900 --> 22:28.420] That's okay if you make one robot. We want to make thousands maybe millions
[22:29.060 --> 22:35.060] We haven't got the luxury of making them from carbon fiber and titanium. We want to make them from plastic; things are not quite as stiff
[22:36.420 --> 22:39.860] So we can't have these high targets. I call them dumb targets
[22:41.380 --> 22:43.380] We've got to make them work at lower targets
[22:43.380 --> 22:49.540] So is that going to work? Well, if you think about it, sorry about this, but we're just bags of soggy
[22:49.940 --> 22:51.940] jelly with bones thrown in
[22:52.260 --> 22:56.100] We're not high frequency. If I stand on my leg, I don't vibrate at 10 hertz
[22:57.620 --> 23:02.740] We people operate at a lower frequency. So we know the robot actually can; it just makes controls harder
[23:03.060 --> 23:05.060] So we take the information from this, the modal
[23:06.020 --> 23:10.100] data and the stiffness, and feed it into the control system that allows it to walk
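As a minimal illustration of the trade-off being described (stiffness and mass numbers are invented, not Tesla's): the first-mode frequency that the walking controller has to live with follows directly from structural stiffness and mass, so softer plastic structures push it well below the usual ~10 Hz target.

    import math

    def natural_frequency_hz(stiffness_n_per_m: float, mass_kg: float) -> float:
        """First-mode natural frequency of a simple mass-spring approximation of a limb."""
        return math.sqrt(stiffness_n_per_m / mass_kg) / (2 * math.pi)

    # Illustrative numbers only: a stiff metal structure vs. a softer plastic one.
    print(natural_frequency_hz(stiffness_n_per_m=4.0e4, mass_kg=10.0))  # ~10 Hz
    print(natural_frequency_hz(stiffness_n_per_m=6.0e3, mass_kg=10.0))  # ~4 Hz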
[23:10.100 --> 23:12.100] And
[23:12.740 --> 23:15.300] Just changing tack slightly, looking at the knee
[23:16.100 --> 23:21.620] We could take some inspiration from biology and we can look to see what the mechanical advantage of the knee is
[23:22.180 --> 23:26.900] It turns out it actually behaves quite similarly to a four-bar link, and that's quite non-linear
[23:27.860 --> 23:30.820] That's not surprising really because if you think when you bend your leg down
[23:31.300 --> 23:34.980] The torque on your knee is much more when it's bent than it is when it's straight
[23:34.980 --> 23:42.420] So you'd expect a non-linear function and in fact the biology is non-linear. This matches it quite accurately
[23:44.740 --> 23:48.020] So that's a representation; the knee is obviously not physically a four-bar link
[23:48.260 --> 23:50.660] as I said, the characteristics are similar, but
[23:51.460 --> 23:54.740] me bending down, that's not very scientific. Let's be a bit more scientific
[23:55.380 --> 23:59.060] We've played all the tasks through this graph
[23:59.060 --> 24:03.700] and this is showing picking things up, walking, squatting, the tasks I said we did on the stress analysis
[24:03.700 --> 24:05.700] And that's the
[24:05.940 --> 24:07.540] torque
[24:07.540 --> 24:11.060] seen at the knee, against the knee bend on the horizontal axis
[24:11.540 --> 24:14.340] This is showing the requirement for the knee to do all these tasks
[24:15.060 --> 24:21.540] And then we put a curve through it, surfing over the top of the peaks, and that's saying this is what's required to make the robot
[24:21.540 --> 24:23.540] do these tasks
[24:26.740 --> 24:30.260] So if we look at the four-bar link, that's actually the green curve
[24:30.260 --> 24:34.660] And it's saying that the non-linearity of the four-bar link has actually linearized
[24:35.060 --> 24:38.820] the characteristic of the force. What that really says is that it lowers the force
[24:38.820 --> 24:44.180] That's what makes the actuator have the lowest possible force, which is the most efficient. We want to burn energy up slowly
[24:44.820 --> 24:48.740] What's the blue curve? Well, the blue curve is actually if we didn't have a four-bar link
[24:48.900 --> 24:54.980] we just had an arm sticking out of my leg here with an actuator on it, a simple two-bar link
[24:54.980 --> 25:00.740] That's the best we could do with a simple two-bar link, and it shows that that would create much more force in the actuator
[25:01.140 --> 25:03.140] which would not be efficient
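A minimal numeric sketch of that idea (all numbers invented for illustration, not Tesla's actual linkage geometry): the required knee torque grows as the knee bends, and a four-bar-style transmission whose effective moment arm also grows with bend keeps the worst-case actuator force much lower than a simple two-bar lever would.

    import math

    # Illustrative knee torque demand: higher when the knee is bent (theta in radians,
    # 0 = straight leg). All constants below are made up for illustration.
    def knee_torque_demand(theta):
        return 50.0 + 250.0 * math.sin(theta / 2) ** 2   # N*m

    # Effective moment arm (m) seen by the linear actuator.
    def moment_arm_two_bar(theta):
        return 0.04                                       # roughly constant lever

    def moment_arm_four_bar(theta):
        return 0.03 + 0.05 * math.sin(theta / 2) ** 2     # grows as the knee bends

    for deg in (10, 45, 90, 135):
        theta = math.radians(deg)
        tau = knee_torque_demand(theta)
        f_two = tau / moment_arm_two_bar(theta)
        f_four = tau / moment_arm_four_bar(theta)
        print(f"{deg:3d} deg: torque {tau:6.1f} N*m  "
              f"two-bar force {f_two:7.0f} N  four-bar force {f_four:7.0f} N")

The point of the toy numbers is that the peak actuator force over the whole bend range ends up far lower with the bend-dependent moment arm, which is what lets the actuator be sized smaller and run more efficiently.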
[25:04.500 --> 25:06.500] So what does it look like in practice?
[25:06.580 --> 25:08.260] well
[25:08.260 --> 25:12.820] As you'll see, it's very tightly packaged in the knee; you'll see it go transparent in a second
[25:12.820 --> 25:19.780] You'll see the four-bar link there is operating on the actuator. This determines the force and the displacements on the actuator
[25:19.780 --> 25:26.980] And now I'll pass you over to Konstantinos to tell you in a lot more detail how these actuators are designed and optimized. Thank you
[25:35.940 --> 25:38.580] So, I would like to talk to you about
[25:39.940 --> 25:42.660] The design process and the actuator portfolio
[25:44.020 --> 25:45.540] In our robot
[25:45.540 --> 25:49.620] So there are many similarities between a car and the robot when it comes to powertrain design
[25:50.340 --> 25:55.220] The most important things that matter here are energy, mass and cost
[25:56.020 --> 26:00.180] We are carrying over most of our designing experience from the car to the robot
[26:03.540 --> 26:07.700] So in this particular case you see a car with two drive units
[26:08.420 --> 26:12.660] And the drive units are used in order to accelerate the car, the zero to 60
[26:12.660 --> 26:15.220] miles per hour time, or drive a city
[26:16.180 --> 26:17.620] drive cycle
[26:17.620 --> 26:18.660] while
[26:18.660 --> 26:21.140] the robot has 28 actuators
[26:21.540 --> 26:22.900] and
[26:22.900 --> 26:26.340] it's not obvious what the tasks are at the actuator level
[26:26.980 --> 26:34.020] So we have tasks that are higher level like walking or climbing stairs or carrying a heavy object
[26:34.260 --> 26:36.500] which need to be translated
[26:37.780 --> 26:40.660] into joint specs. Therefore, we use our model
[26:40.660 --> 26:42.660] that generates
[26:43.300 --> 26:45.300] the torque-speed
[26:45.380 --> 26:50.180] trajectories for our joints, which subsequently are going to be fed into our optimization model
[26:50.660 --> 26:53.460] and run through the optimization process
[26:56.260 --> 27:01.060] This is one of the scenarios that the robot is capable of doing which is turning and walking
[27:02.660 --> 27:08.340] So when we have this torque-speed trajectory, we lay it over an efficiency map of an actuator
[27:08.340 --> 27:12.820] And we are able, along the trajectory, to generate
[27:13.460 --> 27:19.860] the power consumption and the cumulative energy for the task versus time
[27:20.900 --> 27:27.140] So this allows us to define the system cost for the particular actuator and put a single point into the cloud
[27:27.700 --> 27:32.660] Then we do this for hundreds of thousands of actuators by solving in our cluster
[27:32.660 --> 27:38.660] And the red line denotes the Pareto front, which is the preferred area where we will look for the optimum
[27:39.380 --> 27:44.740] So the X denotes the preferred actuator design we have picked for this particular joint
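A toy version of the per-joint loop just described, with invented candidate actuators and an invented torque-speed trajectory (the real pipeline evaluates full efficiency maps on a cluster; this only shows the energy-versus-cost scoring and the Pareto-front selection):

    import random

    # Hypothetical actuator candidates: mass, cost, and a single efficiency number
    # standing in for a full efficiency map. All values are invented.
    random.seed(0)
    candidates = [
        {"mass": random.uniform(1.0, 4.0),
         "cost": random.uniform(100, 600),
         "eff": random.uniform(0.6, 0.95)}
        for _ in range(200)
    ]

    # A torque-speed trajectory for one joint task: (torque N*m, speed rad/s, dt s).
    trajectory = [(30, 2.0, 0.02), (45, 1.5, 0.02), (60, 1.0, 0.02), (20, 3.0, 0.02)]

    def task_energy_j(actuator, traj):
        """Electrical energy to play the trajectory through this actuator."""
        return sum(torque * speed / actuator["eff"] * dt for torque, speed, dt in traj)

    # Score every candidate: x = a crude system-cost proxy, y = energy for the task.
    scored = [(c["cost"] + 50 * c["mass"], task_energy_j(c, trajectory), c)
              for c in candidates]

    # Pareto front: keep designs not dominated in both cost and energy.
    front = [p for p in scored
             if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in scored)]
    best = min(front, key=lambda p: p[0] + p[1])   # one simple way to pick the "X"
    print(f"{len(front)} Pareto-optimal designs; picked cost~{best[0]:.0f}, energy~{best[1]:.1f} J")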
[27:45.300 --> 27:51.780] So now we need to do this for every joint. We have 28 joints to optimize, and we parse our cloud
[27:52.820 --> 28:01.460] We parse our cloud again for every joint spec, and the red Xs this time denote the bespoke actuator designs for every joint
[28:01.460 --> 28:03.460] The problem here
[28:03.620 --> 28:08.980] is that we have too many unique actuator designs, and even if we take advantage of the symmetry
[28:09.220 --> 28:11.940] still there are too many. In order to make something
[28:12.260 --> 28:17.780] mass manufacturable, we need to be able to reduce the number of unique actuator designs
[28:17.860 --> 28:23.460] Therefore we run something called a commonality study, in which we parse our cloud again
[28:23.460 --> 28:31.380] looking this time for actuators that simultaneously meet the joint performance requirements for more than one joint at the same time
[28:31.780 --> 28:36.740] So the resulting portfolio is six actuators, and they are shown in a color map in the middle figure
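The commonality step can be thought of as a covering problem: pick as few distinct designs as possible such that every joint's requirement is met by at least one of them. A toy sketch with invented requirements and capabilities (the real study searches the full simulated design cloud, not a hand-written table):

    # Toy commonality study: greedily choose a small set of actuator designs so that
    # every joint's (torque, speed) requirement is met by at least one chosen design.
    joint_reqs = {                    # joint -> (peak torque N*m, peak speed rad/s)
        "hip": (180, 5), "knee": (150, 6), "ankle": (120, 7),
        "shoulder": (80, 8), "elbow": (60, 10), "wrist": (20, 15),
    }
    designs = {                       # design -> (torque capability, speed capability)
        "A": (200, 6), "B": (160, 8), "C": (90, 10), "D": (25, 16), "E": (130, 7),
    }

    def covers(design, joint):
        (t_cap, s_cap), (t_req, s_req) = designs[design], joint_reqs[joint]
        return t_cap >= t_req and s_cap >= s_req

    chosen, uncovered = [], set(joint_reqs)
    while uncovered:                  # greedy set cover over the joints
        best = max(designs, key=lambda d: sum(covers(d, j) for j in uncovered))
        chosen.append(best)
        uncovered -= {j for j in uncovered if covers(best, j)}
    print("Shared actuator portfolio:", chosen)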
[28:37.620 --> 28:38.900] um
[28:38.900 --> 28:41.860] And the actuators can also be viewed in this
[28:42.660 --> 28:48.740] slide. We have three rotary and three linear actuators, all of which have a great
[28:48.740 --> 28:52.740] output force or torque per mass
[28:54.100 --> 28:57.700] The rotary actuator in particular has a mechanical clutch integrated
[28:58.420 --> 29:02.260] on the high-speed side, an angular contact ball bearing on the high-speed side
[29:03.460 --> 29:10.020] and on the low-speed side a cross roller bearing, and the gear train is a strain wave gear
[29:10.020 --> 29:17.380] There are three integrated sensors here, and a bespoke permanent magnet machine
[29:19.780 --> 29:21.780] The linear actuator
[29:26.340 --> 29:28.340] I'm sorry
[29:28.420 --> 29:32.820] The linear actuator has planetary rollers and an inverted planetary
[29:33.300 --> 29:38.260] screw as a gear train, which allows efficiency, compactness and durability
[29:38.260 --> 29:45.860] So in order to demonstrate the force capability of our linear actuators, we have set up an
[29:46.660 --> 29:50.100] experiment in order to test it under its limits
[29:53.380 --> 29:56.500] And I will let you enjoy the video
[29:56.500 --> 30:05.140] So our actuator is able to lift
[30:14.020 --> 30:18.340] A half ton nine foot concert grand piano
[30:20.660 --> 30:22.660] And
[30:22.660 --> 30:29.220] This is a requirement; it's not something nice to have
[30:29.780 --> 30:36.820] because our muscles can do the same when they are directly driven. When they are directly driven, our quadriceps muscles
[30:37.300 --> 30:41.300] can do the same thing; it's just that the knee is an up-gearing
[30:41.540 --> 30:47.940] linkage system that converts the force into velocity at the end effector of our heels, for the purpose of giving
[30:48.500 --> 30:50.500] the human body agility
[30:50.500 --> 30:54.500] So this is one of the main things that are amazing about the human body
[30:54.980 --> 31:00.260] And I'm concluding my part at this point and I would like to welcome my colleague Mike who's going to talk to you about
[31:00.740 --> 31:02.740] Hand design. Thank you very much
[31:04.820 --> 31:06.820] Thanks, Konstantinos
[31:08.020 --> 31:11.620] So we just saw how powerful a human and a humanoid actuator can be
[31:12.900 --> 31:16.020] However, humans are also incredibly dexterous
[31:16.020 --> 31:21.620] The human hand has the ability to move at 300 degrees per second
[31:22.340 --> 31:24.900] There's tens of thousands of tactile sensors
[31:25.780 --> 31:29.940] And it has the ability to grasp and manipulate almost every object in our daily lives
[31:31.540 --> 31:34.500] For our robotic hand design, we were inspired by biology
[31:35.060 --> 31:37.060] We have five fingers an opposable thumb
[31:38.020 --> 31:41.940] Our fingers are driven by metallic tendons that are both flexible and strong
[31:41.940 --> 31:50.740] We have the ability to complete wide aperture power grasps while also being optimized for precision gripping of small thin and delicate objects
[31:52.180 --> 31:54.180] So why a human like robotic hand?
[31:54.900 --> 31:59.140] Well, the main reason is that our factories and the world around us are designed to be ergonomic
[32:00.180 --> 32:03.460] So what that means is that it ensures that objects in our factory are graspable
[32:04.260 --> 32:09.300] But it also ensures that new objects that we may have never seen before can be grasped by the human hand
[32:09.300 --> 32:11.300] And by our robotic hand as well
[32:12.580 --> 32:18.020] The converse there is pretty interesting, because it's saying that these objects are designed for our hand, instead of having to make changes
[32:18.020 --> 32:20.020] to our hand to accommodate a new object
[32:22.180 --> 32:25.620] Some basic stats about our hand is that it has six actuators and 11 degrees of freedom
[32:26.260 --> 32:31.220] It has an in-hand controller which drives the fingers and receives sensor feedback
[32:32.020 --> 32:35.620] Sensor feedback is really important to learn a little bit more about the objects that we're grasping
[32:35.620 --> 32:41.060] And also for proprioception and that's the ability for us to recognize where our hand is in space
[32:42.900 --> 32:45.460] One of the important aspects of our hand is that it's adaptive
[32:46.020 --> 32:51.540] This adaptability essentially involves complex mechanisms that allow the hand to adapt to the object that's being grasped
[32:53.060 --> 32:56.020] Another important part is that we have a non-back drivable finger drive
[32:56.420 --> 33:01.140] This clutching mechanism allows us to hold and transport objects without having to turn on the hand motors
[33:01.140 --> 33:05.380] You just heard how we went about designing the Tesla Bot hardware
[33:05.780 --> 33:08.980] Now I'll hand it off to Milan and our autonomy team to bring this robot to life
[33:11.380 --> 33:13.380] Thanks Michael
[33:17.380 --> 33:19.380] All right
[33:19.380 --> 33:22.420] So all those cool things we've shown earlier in the video
[33:22.980 --> 33:29.460] were possible in just a matter of a few months, thanks to the amazing work that we've done on Autopilot over the past few years
[33:29.460 --> 33:33.380] Most of those components ported quite easily over to the bot's environment
[33:33.700 --> 33:38.260] If you think about it, we're just moving from a robot on wheels to a robot on legs
[33:38.580 --> 33:42.980] So some of the components are pretty similar, and some others require more heavy lifting
[33:44.260 --> 33:47.220] So for example our computer vision neural networks
[33:47.700 --> 33:51.700] were ported directly from Autopilot to the bot
[33:52.340 --> 33:58.180] It's exactly the same occupancy network, which we'll talk about in a little bit more detail later with the Autopilot
[33:58.180 --> 34:00.740] team, that is now running on the bot here in this video
[34:01.300 --> 34:04.420] The only thing that changed really is the training data that we had to recollect
[34:07.460 --> 34:10.820] We're also trying to find ways to improve those occupancy networks
[34:11.780 --> 34:16.660] using work done on neural radiance fields to get really great volumetric
[34:17.140 --> 34:23.220] renderings of the bot's environment, for example here some machinery that the bot might have to interact with
[34:23.220 --> 34:30.580] Another interesting problem to think about is, in indoor environments, mostly with the absence of GPS signal
[34:31.140 --> 34:37.140] how do you get the bot to navigate to its destination? Say, for instance, to find its nearest charging station
[34:37.620 --> 34:39.620] So we've been training
[34:39.700 --> 34:42.820] More neural networks to identify high-frequency features
[34:43.300 --> 34:45.300] key points within the bot's camera streams
[34:45.780 --> 34:50.420] and track them across frames over time as the bot navigates through its environment
[34:50.420 --> 34:57.780] And we're using those points to get a better estimate of the bot's pose and trajectory within its environment as it's walking
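A heavily simplified sketch of that idea, in 2D with invented matched keypoints (the real system works on learned high-frequency features across multi-camera video, not this toy planar alignment):

    import numpy as np

    def estimate_planar_motion(prev_pts, curr_pts):
        """Least-squares rigid 2D transform (rotation + translation) between two
        sets of matched keypoints: a toy stand-in for visual odometry."""
        p_mean, c_mean = prev_pts.mean(axis=0), curr_pts.mean(axis=0)
        H = (prev_pts - p_mean).T @ (curr_pts - c_mean)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:              # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = c_mean - R @ p_mean
        return R, t

    # Matched keypoints from two consecutive camera frames (invented numbers).
    prev = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    theta = np.deg2rad(5.0)
    R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    curr = prev @ R_true.T + np.array([0.10, 0.02])    # robot turned and stepped

    R, t = estimate_planar_motion(prev, curr)
    print("estimated yaw (deg):", np.degrees(np.arctan2(R[1, 0], R[0, 0])), "translation:", t)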
[35:00.900 --> 35:05.540] We also did quite some work on the simulation side and this is literally the autopilot simulator
[35:06.180 --> 35:10.900] To which we've integrated the robots locomotion code and this is a video of the
[35:11.540 --> 35:14.100] Motion control code running in the autopilot simulator
[35:14.100 --> 35:22.260] showing the evolution of the robot's walk over time. So as you can see, we started quite slowly in April and started accelerating as we unlocked more joints
[35:22.580 --> 35:26.260] And deeper more advanced techniques like arms balancing over the past few months
[35:28.020 --> 35:34.100] And so locomotion is specifically one component that's very different as we're moving from the car to the bot's environment
[35:34.580 --> 35:39.220] And so I think it warrants a little bit more depth and I'd like my colleagues to start talking about this now
[35:39.220 --> 35:45.140] Thank you Milan. Hi, everyone. I'm Felix. I'm a robotics engineer on the project and I'm going to talk about walking
[35:46.100 --> 35:52.020] Walking seems easy, right? People do it every day. You don't even have to think about it
[35:52.580 --> 36:10.740] But there are some aspects of walking which are challenging from an engineering perspective. For example
[36:11.780 --> 36:16.420] Physical self-awareness that means having a good representation of yourself
[36:16.980 --> 36:22.180] What is the length of your limbs? What is the mass of your limbs? What is the size of your feet? All that matters
[36:23.380 --> 36:30.100] Also having an energy-efficient gait. You can imagine there's different styles of walking, and not all of them are equally efficient
[36:30.100 --> 36:34.740] Most important keep balance. Don't fall
[36:35.860 --> 36:39.140] And of course also coordinate the motion of all of your limbs together
[36:40.260 --> 36:42.580] So now humans do all of this naturally
[36:43.140 --> 36:46.580] But as engineers or roboticists we have to think about these problems
[36:47.220 --> 36:51.220] And in the following I'm going to show you how we address them in our locomotion planning and control stack
[36:51.940 --> 36:53.940] So we start with locomotion planning
[36:53.940 --> 37:00.980] and our representation of the bot, that means a model of the robot's kinematics, dynamics and the contact properties
[37:01.940 --> 37:09.860] And using that model and the desired path for the bots our locomotion planner generates reference trajectories for the entire system
[37:11.140 --> 37:16.020] This means feasible trajectories with respect to the assumptions of our model
[37:16.020 --> 37:23.380] The planner currently works in three stages. It starts with planning footsteps and ends with the entire motion for the whole system
[37:24.180 --> 37:26.660] And let's dive a little bit deeper in how this works
[37:27.460 --> 37:31.540] So in this video we see footsteps being planned over a planning horizon
[37:32.020 --> 37:34.020] following the desired path
[37:34.100 --> 37:36.100] And we start from this and then add
[37:36.820 --> 37:42.180] foot trajectories that connect these footsteps using toe-off and heel strike, just as
[37:42.180 --> 37:48.980] humans do, and this gives us a larger stride and less knee bend for high efficiency of the system
[37:50.180 --> 37:53.140] The last stage is then finding a center of mass trajectory
[37:53.540 --> 37:57.940] Which gives us a dynamically feasible motion of the entire system to keep balance
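A toy sketch of that three-stage structure (footsteps, then swing-foot trajectories, then a center-of-mass reference). All geometry here is invented, and the real planner solves for dynamic feasibility rather than using these closed-form stand-ins:

    import numpy as np

    STRIDE, STANCE_WIDTH, STEPS = 0.4, 0.2, 6

    # Stage 1: footsteps along a straight desired path, alternating left/right.
    footsteps = [(i * STRIDE, STANCE_WIDTH / 2 if i % 2 == 0 else -STANCE_WIDTH / 2)
                 for i in range(STEPS)]

    # Stage 2: a swing-foot trajectory connecting consecutive footsteps of one foot,
    # with a simple lift-off / touch-down height profile.
    def swing_trajectory(start, end, clearance=0.08, samples=20):
        s = np.linspace(0.0, 1.0, samples)
        xy = (1 - s)[:, None] * np.array(start) + s[:, None] * np.array(end)
        z = clearance * np.sin(np.pi * s)          # lift off, clear the ground, land
        return np.column_stack([xy, z])

    # Stage 3: a center-of-mass reference that shifts toward the stance foot each step
    # (a crude stand-in for the dynamically feasible CoM optimization).
    com_ref = [((x0 + x1) / 2, 0.6 * y_stance)
               for (x0, y_stance), (x1, _) in zip(footsteps[:-1], footsteps[1:])]

    print("footsteps:", footsteps)
    print("one swing trajectory sample:", swing_trajectory(footsteps[0], footsteps[2])[10])
    print("CoM keypoints:", com_ref)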
[37:57.940 --> 38:12.900] As we all know, plans are good, but we also have to realize them in reality. Let's see how we can do this
[38:14.900 --> 38:20.100] Thank you Felix. Hello everyone. My name is Anand and I'm going to talk to you about controls
[38:20.100 --> 38:27.700] So let's take the motion plan that Felix just talked about and put it in the real world on a real robot
[38:28.260 --> 38:30.260] Let's see what happens
[38:32.020 --> 38:34.740] It takes a couple steps and falls down
[38:35.700 --> 38:37.700] Well, that's a little disappointing
[38:37.780 --> 38:41.060] But we are missing a few key pieces here, which will make it walk
[38:41.060 --> 38:51.140] Now as Felix mentioned the motion planner is using an idealized version of itself and a version of reality around it
[38:52.340 --> 38:54.340] This is not exactly correct
[38:54.420 --> 38:57.300] It also expresses its intention
[38:57.860 --> 39:05.140] through trajectories and wrenches, wrenches being forces and torques that it wants to exert on the world to locomote
[39:05.140 --> 39:13.060] Reality is way more complex than any simple model. Also, the real robot is not that simplified
[39:13.140 --> 39:18.820] It's got vibrations and modes, compliance, sensor noise and on and on and on
[39:20.580 --> 39:23.940] So what does that do to the real world when you put the bot in the real world?
[39:24.900 --> 39:32.500] Well, the unexpected forces cause unmodeled dynamics, which essentially the planner doesn't know about, and that causes destabilization
[39:32.500 --> 39:37.780] Especially for a system that is dynamically stable like bipedal locomotion
[39:39.060 --> 39:42.500] So what can we do about it? Well, we measure reality
[39:43.220 --> 39:47.220] We use sensors and our understanding of the world to do state estimation
[39:47.860 --> 39:54.020] And here you can see the attitude and pelvis pose, which is essentially the vestibular system in a human
[39:55.220 --> 40:00.020] Along with the center of mass trajectory being tracked when the robot is walking in the office environment
[40:00.020 --> 40:05.380] Now we have all the pieces we need in order to close the loop
[40:06.020 --> 40:08.020] So we use our better bot model
[40:08.820 --> 40:12.580] We use the understanding of reality that we've gained through state estimation
[40:13.140 --> 40:19.860] and we compare what we want versus what we observe reality doing to us, in order to
[40:21.380 --> 40:24.180] add corrections to the behavior of the robot
[40:24.180 --> 40:31.140] Here the robot certainly doesn't appreciate being poked, but it does an admirable job of staying upright
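A minimal sketch of that closed loop under big simplifying assumptions (a single scalar center-of-mass state and a PD correction; the real controller works with full state estimates, wrenches, and footstep adjustments):

    # Gains, states, and the correction channel (an ankle torque offset) are all
    # illustrative, not values from the talk.
    KP, KD = 400.0, 40.0        # PD gains on center-of-mass error

    def balance_correction(com_planned, com_estimated, comdot_planned, comdot_estimated):
        """Extra sagittal ankle torque to push the measured CoM back toward the plan."""
        error = com_planned - com_estimated
        derror = comdot_planned - comdot_estimated
        return KP * error + KD * derror

    # e.g. an unexpected poke left the CoM 3 cm behind the plan and moving backward:
    tau = balance_correction(com_planned=0.02, com_estimated=-0.01,
                             comdot_planned=0.10, comdot_estimated=-0.05)
    print(f"corrective ankle torque: {tau:.1f} N*m")   # ~18 N*m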
[40:33.140 --> 40:37.460] The final point here is a robot that walks is not enough
[40:37.460 --> 40:54.100] We need it to use its hands and arms to be useful. Let's talk about manipulation
[40:55.220 --> 40:58.660] Hi everyone, my name is Eric, robotics engineer on Tesla Bot
[40:59.300 --> 41:03.940] And I want to talk about how we've made the robot manipulate things in the real world
[41:03.940 --> 41:11.140] We wanted to manipulate objects while looking as natural as possible and also get there quickly
[41:11.780 --> 41:15.140] So what we've done is we've broken this process down into two steps
[41:15.540 --> 41:18.260] First is generating a library of natural motion references
[41:19.060 --> 41:25.460] Or we could call them demonstrations and then we've adapted these motion references online to the current real world situation
[41:27.620 --> 41:30.420] So let's say we have a human demonstration of picking up an object
[41:30.420 --> 41:35.460] We can get a motion capture of that demonstration, which is visualized right here as
[41:35.940 --> 41:39.940] A bunch of keyframes representing the location of the hands the elbows the torso
[41:40.820 --> 41:43.700] We can map that to the robot using inverse kinematics
[41:44.260 --> 41:48.020] And if we collect a lot of these now we have a library that we can work with
[41:50.340 --> 41:55.460] But a single demonstration is not generalizable to the variation in the real world
[41:55.700 --> 41:58.740] For instance, this would only work for a box in a very particular
[41:58.740 --> 42:00.740] Location
[42:00.740 --> 42:02.740] So what we've also done is run these
[42:03.620 --> 42:10.900] Reference trajectories through a trajectory optimization program which solves for where the hand should be how the robot should balance
[42:11.860 --> 42:13.540] during
[42:13.540 --> 42:17.700] When it needs to adapt the motion to the real world. So for instance, if the box is
[42:18.660 --> 42:23.380] In this location, then our optimizer will create this trajectory instead
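A toy sketch of that "adapt the demonstration online" step, with invented keyframes (the real pipeline runs a trajectory optimizer that also solves for balance; this only shows the goal-warping idea):

    import numpy as np

    # A demonstrated hand trajectory (keyframes, metres), retargeted from motion
    # capture via inverse kinematics. Positions are invented for illustration.
    reference = np.array([[0.0, 0.3, 0.9],    # hand at the side
                          [0.2, 0.4, 0.7],
                          [0.4, 0.5, 0.5],
                          [0.5, 0.5, 0.4]])   # hand on the demonstrated box location

    def adapt_to_goal(ref, new_goal):
        """Blend a goal offset into the reference so the final keyframe reaches the
        new object location while early keyframes stay close to the demonstration.
        A crude stand-in for the full trajectory optimization."""
        offset = np.asarray(new_goal) - ref[-1]
        weights = np.linspace(0.0, 1.0, len(ref))[:, None]   # 0 at start, 1 at goal
        return ref + weights * offset

    adapted = adapt_to_goal(reference, new_goal=[0.6, 0.3, 0.4])
    print(adapted[-1])   # [0.6, 0.3, 0.4] -> ends exactly at the new box position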
[42:23.380 --> 42:31.780] Next, Milan's going to talk about what's next for the Optimus Tesla Bot. Thanks
[42:39.540 --> 42:43.940] Right, so hopefully by now you guys got a good idea of what we've been up to over the past few months
[42:44.580 --> 42:51.140] Um, we started having something that's usable, but it's far from being useful. There's still a long and exciting road ahead of us
[42:51.140 --> 42:55.060] um, I think the first thing within the next few weeks is to
[42:55.620 --> 43:01.300] get Optimus at least at par with Bumble C, the other bot prototype you saw earlier, and probably beyond
[43:02.020 --> 43:08.740] We are also going to start focusing on the real use case at one of our factories and really going to try to
[43:09.620 --> 43:14.340] nail this down and iron out all the elements needed to deploy this product in the real world
[43:14.900 --> 43:18.020] As I was mentioning earlier, you know, indoor navigation
[43:18.020 --> 43:24.820] graceful fall management, or even servicing, all components needed to scale this product up
[43:25.700 --> 43:29.380] But um, I don't know about you, but after seeing what we've shown tonight
[43:29.700 --> 43:32.260] I'm pretty sure we can get this done within the next few months or years
[43:32.660 --> 43:36.580] Um, and and make this product a reality and change the entire economy
[43:37.140 --> 43:42.180] So I would like to thank the entire Optimus team for their hard work over the past few months
[43:42.180 --> 43:48.180] I think it's pretty amazing. All of this was done in barely six or eight months. Thank you very much
[44:02.340 --> 44:04.340] Hey everyone
[44:05.380 --> 44:08.180] Hi, I'm Ashok. I lead the autopilot team alongside Milan
[44:08.180 --> 44:12.020] Oh god, it's going to be so hard to top that Optimus section
[44:13.620 --> 44:15.620] We'll try nonetheless
[44:16.180 --> 44:18.180] anyway
[44:18.260 --> 44:20.820] Every tesla that has been built over the last several years
[44:21.300 --> 44:24.420] We think has the hardware to make the car drive itself
[44:25.620 --> 44:29.780] We have been working on the software to add higher and higher levels of autonomy
[44:31.380 --> 44:36.100] This time around last year, we had roughly 2,000 cars driving our FSD Beta software
[44:36.100 --> 44:41.700] Since then we have significantly improved the software's robustness and capability
[44:42.420 --> 44:46.020] such that we have now shipped it to 160,000 customers as of today
[44:54.740 --> 44:59.460] This did not come for free it came from the sweat and blood of the engineering team over the last one year
[44:59.460 --> 45:05.700] Um, for example, we trained 75,000 neural network models just last one year
[45:06.340 --> 45:08.660] That's roughly a model every eight minutes
[45:09.700 --> 45:16.420] That's, you know, coming out of the team, and then we evaluate them on our large clusters, and then we shipped 281 of those models
[45:16.660 --> 45:18.660] That actually improved the performance of the car
[45:19.380 --> 45:22.420] And this pace of innovation is happening throughout the stack
[45:22.420 --> 45:29.380] The planning software, the infrastructure, the tools, even hiring, everything is progressing to the next level
[45:32.180 --> 45:35.060] The fsd beta software is quite capable of driving the car
[45:36.660 --> 45:40.660] It should be able to navigate from parking lot to parking lot handling city street driving
[45:41.300 --> 45:43.300] stopping for traffic lights and stop signs
[45:43.860 --> 45:47.300] Negotiating with objects at intersections making turns and so on
[45:47.300 --> 45:49.300] All of this comes from the
[45:49.940 --> 45:53.940] Uh camera streams that go through our neural networks that run on the car itself
[45:53.940 --> 45:55.620] It's not coming back to the server or anything
[45:55.620 --> 46:02.660] It runs on the car and produces all the outputs to form the world model on the car, and the planning software drives the car based on that
[46:04.900 --> 46:08.260] Today we'll go into a lot of the components that make up the system
[46:09.300 --> 46:13.940] The occupancy network acts as the base geometry layer of the system
[46:13.940 --> 46:17.060] This is a multi-camera video neural network
[46:18.340 --> 46:24.180] That from the images predicts the full physical occupancy of the world around the robot
[46:25.220 --> 46:28.500] So anything that's physically present: trees, walls, buildings
[46:29.220 --> 46:34.500] cars, balls, whatever. If it's physically present, it predicts them, along with their future motion
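As a rough mental model only (not Tesla's architecture, which is covered in detail later in the talk), a toy head that maps fused multi-camera features to per-voxel occupancy and motion might look like this:

    import torch
    import torch.nn as nn

    class ToyOccupancyHead(nn.Module):
        """Grossly simplified stand-in for the idea of an occupancy network: fuse
        per-camera features for each map cell and predict, for every voxel in a
        column above that cell, an occupancy logit and a 3D motion (flow) vector.
        All sizes and the architecture are invented for illustration only."""
        def __init__(self, cams=8, feat=32, z_voxels=16):
            super().__init__()
            self.z = z_voxels
            self.fuse = nn.Sequential(nn.Linear(cams * feat, 256), nn.ReLU())
            self.occ = nn.Linear(256, z_voxels)          # occupancy logit per voxel
            self.flow = nn.Linear(256, z_voxels * 3)     # future motion per voxel

        def forward(self, cam_feats):                    # (B, cams, feat, X, Y)
            b, c, f, x, y = cam_feats.shape
            cells = cam_feats.permute(0, 3, 4, 1, 2).reshape(b, x, y, c * f)
            fused = self.fuse(cells)                     # (B, X, Y, 256)
            occ = self.occ(fused)                        # (B, X, Y, Z)
            flow = self.flow(fused).view(b, x, y, self.z, 3)
            return occ, flow

    occ, flow = ToyOccupancyHead()(torch.randn(1, 8, 32, 100, 100))
    print(occ.shape, flow.shape)   # (1, 100, 100, 16) and (1, 100, 100, 16, 3)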
[46:37.940 --> 46:39.940] On top of this base level of geometry
[46:39.940 --> 46:44.980] We have more semantic layers in order to navigate the roadways. We need the lanes, of course
[46:46.660 --> 46:50.260] But then the roadways have lots of different lanes and they connect in all kinds of ways
[46:50.820 --> 46:55.940] So it's actually a really difficult problem for typical computer vision techniques to predict the set of lanes and their
[46:55.940 --> 46:57.140] Connectivities
[46:57.140 --> 47:01.620] So we reached all the way into language technologies and pulled the state of the art from other
[47:02.100 --> 47:05.300] domains, not just computer vision, to make this task possible
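To make the "language technologies" remark concrete, here is a toy autoregressive decoder over a made-up lane-token vocabulary. It only illustrates the idea of predicting a lane graph token by token, conditioned on scene features; it is not the actual network:

    import torch
    import torch.nn as nn

    class ToyLaneDecoder(nn.Module):
        """Toy illustration of treating the lane graph like a language problem:
        predict a sequence of discrete tokens (lane point indices / connectivity
        markers) autoregressively. Vocabulary and sizes are invented."""
        def __init__(self, vocab=1000, dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, vocab)

        def forward(self, scene_feat, tokens):           # scene_feat: (B, dim)
            x = self.embed(tokens)                        # (B, T, dim)
            h0 = scene_feat.unsqueeze(0)                  # condition the decoder
            out, _ = self.rnn(x, h0)
            return self.head(out)                         # next-token logits

    model = ToyLaneDecoder()
    scene = torch.randn(1, 64)
    tokens = torch.tensor([[1]])                          # start token
    for _ in range(5):                                    # greedy autoregressive rollout
        logits = model(scene, tokens)
        nxt = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
    print(tokens)   # an (untrained, meaningless) lane-token sequence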
[47:05.300 --> 47:08.420] For vehicles, we need their full kinematics state to control for them
[47:10.180 --> 47:15.780] All of this directly comes from neural networks: raw video streams come into the networks
[47:16.660 --> 47:23.460] go through a lot of processing, and then they output the full kinematic state; the positions, velocities, accelerations, jerk, all of that
[47:23.860 --> 47:30.420] directly comes out of the networks with minimal post-processing. That's really fascinating to me, because how is it
[47:30.420 --> 47:37.300] even possible? What world do we live in that this magic is possible, that these networks predict fourth derivatives of these positions, and people thought
[47:37.620 --> 47:39.620] we couldn't even detect these objects
[47:41.700 --> 47:43.700] My opinion is that it did not come for free
[47:44.340 --> 47:51.140] It required tons of data, so we had to build sophisticated auto-labeling systems that churn through raw sensor data
[47:51.940 --> 48:03.380] run a ton of offline compute on the servers, which can take a few hours, run expensive neural networks
[48:03.940 --> 48:08.260] and distill the information into labels that train our in-car neural networks
[48:09.940 --> 48:15.940] On top of this we also use our simulation system to synthetically create images and since it's a simulation
[48:16.180 --> 48:18.180] We trivially have all the labels
[48:18.180 --> 48:27.620] All of this goes through a well oiled data engine pipeline where we first train a baseline model with some data
[48:28.260 --> 48:32.420] Ship it to the car see what the failures are and once we know the failures
[48:33.140 --> 48:35.540] we mine the fleet for the cases where it fails
[48:36.100 --> 48:39.060] Provide the correct labels and add the data to the training set
[48:40.020 --> 48:45.140] This process systematically fixes the issues and we do this for every task that runs in the car
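A stub-level sketch of that loop (all helper functions are invented placeholders, not real Tesla tooling; only the shape of the iteration follows the description above):

    def train(train_set):            return {"data_seen": len(train_set)}           # stub trainer
    def find_failures(model):        return [f"clip_{model['data_seen'] + i}" for i in range(2)]
    def mine_fleet_for(failures):    return failures                                 # stub fleet trigger
    def auto_label(clips):           return [(clip, "corrected_label") for clip in clips]

    def data_engine(train_set, iterations=3):
        model = train(train_set)                       # 1. train a baseline model
        for _ in range(iterations):
            failures = find_failures(model)            # 2. ship it, see where it fails
            clips = mine_fleet_for(failures)           # 3. mine the fleet for those cases
            train_set += auto_label(clips)             # 4. label them, grow the training set
            model = train(train_set)                   # 5. retrain; repeat per task
        return model

    print(data_engine(train_set=[("clip_0", "label")]))   # {'data_seen': 7}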
[48:45.140 --> 48:47.780] Yeah, and to train these new massive neural networks
[48:48.260 --> 48:52.020] This year we expanded our training infrastructure by roughly 40 to 50 percent
[48:52.420 --> 48:59.220] So that sits us at about 14,000 GPUs today across multiple training clusters in the United States
[49:00.260 --> 49:05.860] We also worked on our AI compiler which now supports new operations needed by those neural networks
[49:06.180 --> 49:10.180] and maps them to the best of our underlying hardware resources
[49:10.180 --> 49:18.100] And our inference engine today is capable of distributing the execution of a single neural network across two independent system on chips
[49:18.420 --> 49:23.460] Essentially two independent computers interconnected within the same full self-driving computer
[49:24.100 --> 49:29.140] And to make this possible we have to keep a tight control on the end-to-end latency of this new system
[49:29.460 --> 49:33.060] So we deployed more advanced scheduling code across the full FSD platform
[49:35.700 --> 49:37.700] All of these neural networks running in the car
[49:37.700 --> 49:42.180] Together produce the vector space, which is again the model of the world around the robot or the car
[49:42.820 --> 49:47.860] And then the planning system operates on top of this, coming up with trajectories that avoid collisions, are smooth
[49:48.180 --> 49:52.420] and make progress towards the destination, using a combination of model-based optimization
[49:52.820 --> 49:56.020] plus neural networks that help optimize it to be really fast
[49:58.820 --> 50:02.820] Today we are really excited to present progress on all of these areas
[50:02.820 --> 50:08.980] We have the engineering leads standing by to come in and explain these various blocks and these power not just the car
[50:09.220 --> 50:13.060] But the same components also run on the Optimus robot that Milan showed earlier
[50:13.940 --> 50:16.900] With that I welcome Paril to start talking about the planning section
[50:24.580 --> 50:26.580] Hi all, I'm Paril Jain
[50:28.420 --> 50:36.020] Let's use this intersection scenario today to dive straight into how we do the planning and decision making in Autopilot
[50:37.940 --> 50:42.500] So we are approaching this intersection from a side street and we have to yield to all the crossing vehicles
[50:43.860 --> 50:45.940] right as they are about to enter the intersection
[50:47.140 --> 50:51.380] The pedestrian on the other side of the intersection decides to cross the road without a crosswalk
[50:52.500 --> 50:54.500] Now we need to yield to this pedestrian
[50:54.500 --> 51:02.100] Yield to the vehicles from the right and also understand the relation between the pedestrian and the vehicle on the other side of the intersection
[51:03.620 --> 51:06.180] So there are a lot of these inter-object dependencies
[51:06.980 --> 51:08.980] That we need to resolve in a quick glance
[51:10.180 --> 51:12.180] And humans are really good at this
[51:12.420 --> 51:15.220] We look at a scene understand all the possible interactions
[51:15.860 --> 51:17.860] evaluate the most promising ones
[51:18.180 --> 51:20.180] And generally end up choosing a reasonable one
[51:20.180 --> 51:25.300] So let's look at a few of these interactions that autopilot system evaluated
[51:26.420 --> 51:30.660] We could have gone in front of this pedestrian with a very aggressive longitudinal and lateral profile
[51:31.380 --> 51:35.380] Now obviously we are being a jerk to the pedestrian, and we would spook the pedestrian and his cute pet
[51:36.900 --> 51:38.900] We could have moved forward slowly
[51:39.380 --> 51:42.580] shot for a gap between the pedestrian and the vehicle from the right
[51:43.460 --> 51:45.620] Again, we are being a jerk to the vehicle coming from the right
[51:45.620 --> 51:51.540] But you should not outright reject this interaction, in case this is the only safe interaction available
[51:53.540 --> 51:55.780] Lastly the interaction we ended up choosing
[51:56.740 --> 52:01.300] Stay slow initially find the reasonable gap and then finish the maneuver after all the agents pass
[52:03.940 --> 52:06.820] Now evaluation of all of these interactions is not trivial
[52:07.860 --> 52:12.020] Especially when you care about modeling the higher order derivatives for other agents
[52:12.020 --> 52:18.020] For example, what is the longitudinal jerk required by the vehicle coming from the right when you assert in front of it?
[52:19.220 --> 52:26.020] Relying purely on collision checks with marginal predictions will only get you so far because you will miss out on a lot of valid interactions
[52:27.220 --> 52:34.020] This basically boils down to solving a multi-agent joint trajectory planning problem over the trajectories of ego and all the other agents
[52:36.020 --> 52:40.820] Now however much you optimize, there's going to be a limit to how fast you can run this optimization problem
[52:40.820 --> 52:46.420] It will be close to the order of 10 milliseconds even after a lot of incremental approximations
[52:48.580 --> 52:50.980] Now for a typical crowded unprotected left
[52:51.780 --> 52:53.780] Say you have more than 20 objects
[52:54.180 --> 52:59.860] Each object having multiple different future modes the number of relevant interaction combinations will blow up
[53:02.660 --> 53:07.700] The planner needs to make a decision every 50 milliseconds. So how do we solve this in real time?
[53:07.700 --> 53:13.700] We rely on a framework we call interaction search, which is basically a parallelized tree search over a bunch of maneuver trajectories
[53:15.700 --> 53:23.700] The state space here corresponds to the kinematic state of ego, the kinematic state of other agents, their nominal multi-modal future predictions
[53:24.500 --> 53:26.500] and all the static entities in the scene
[53:28.500 --> 53:31.700] The action space is where things get interesting
[53:31.700 --> 53:39.700] We use a set of maneuver trajectory candidates to branch over a bunch of interaction decisions and also incremental goals for a longer horizon maneuver
[53:41.700 --> 53:45.700] Let's walk through this tree search very quickly to get a sense of how it works
[53:46.900 --> 53:51.700] We start with a set of vision measurements namely lanes occupancy moving objects
[53:52.500 --> 53:55.700] These get represented as sparse abstractions as well as latent features
[53:55.700 --> 53:57.700] We use this to create a set of goal candidates
[53:59.700 --> 54:05.700] Lanes again from the lanes network or unstructured regions which correspond to a probability mask derived from human demonstrations
[54:07.700 --> 54:13.700] Once we have a bunch of these goal candidates, we create seed trajectories using a combination of classical optimization approaches
[54:13.700 --> 54:17.700] As well as our network planner again trained on data from the customer fleet
[54:19.700 --> 54:21.700] Now once we get a bunch of these seed trajectories
[54:21.700 --> 54:23.700] We use them to start branching on the interactions
[54:25.700 --> 54:27.700] We find the most critical interaction
[54:29.700 --> 54:31.700] In our case, this would be the interaction with respect to the pedestrian
[54:33.700 --> 54:35.700] Whether we assert in front of it or yield to it
[54:37.700 --> 54:41.700] Obviously the option on the left is a high penalty option, it likely won't get prioritized
[54:43.700 --> 54:47.700] So we branch further onto the option on the right and that's where we bring in more and more complex interactions
[54:47.700 --> 54:51.700] Building this optimization problem incrementally with more and more constraints
[54:53.700 --> 54:57.700] And the tree search keeps flowing, branching on more interactions, branching on more goals
[54:59.700 --> 55:03.700] Now a lot of the tricks here lie in the evaluation of each of these nodes of the tree search
[55:05.700 --> 55:09.700] Inside each node, initially we started with creating trajectories using classical optimization approaches
[55:11.700 --> 55:13.700] Where the constraints like I described would be added incrementally
[55:13.700 --> 55:17.700] And this would take close to 1 to 5 milliseconds per action
[55:19.700 --> 55:23.700] Now even though this is a fairly good number, when you want to evaluate more than 100 interactions, this does not scale
[55:25.700 --> 55:29.700] So we ended up building lightweight queryable networks that you can run in the loop of the planner
[55:31.700 --> 55:35.700] These networks are trained on human demonstrations from the fleet as well as offline solvers with relaxed time limits
[55:35.700 --> 55:39.700] With this, we were able to bring the run time down to close to 100 microseconds per action
[55:51.700 --> 55:55.700] Now doing this alone is not enough because you still have this massive tree search that you need to go through
[55:57.700 --> 55:59.700] And you need to efficiently prune the search space
[55:59.700 --> 56:03.700] So you need to do scoring on each of these trajectories
[56:05.700 --> 56:09.700] Few of these are fairly standard, you do a bunch of collision checks, you do a bunch of comfort analysis
[56:11.700 --> 56:13.700] What is the jerk and accel required for a given maneuver
[56:15.700 --> 56:17.700] The customer fleet data plays an important role here again
[56:19.700 --> 56:21.700] We run two sets of again lightweight queryable networks, both really augmenting each other
[56:23.700 --> 56:25.700] One of them trained from interventions from the FSD beta fleet
[56:25.700 --> 56:29.700] Which gives a score on how likely a given maneuver is to result in interventions over the next few seconds
[56:31.700 --> 56:35.700] And second, which is purely on human demonstrations, human driven data, giving a score on how close is your given selected action to a human driven trajectory
[56:41.700 --> 56:45.700] The scoring helps us prune the search space, keep branching further on the interactions and focus the compute on the most promising outcomes
[56:45.700 --> 56:51.700] The cool part about this architecture is that it allows us to create a cool blend between data driven approaches where you don't have to rely on a lot of hand engineered costs
[56:53.700 --> 56:55.700] But also ground it in reality with physics based checks
[56:57.700 --> 56:59.700] Now a lot of what I described was with respect to the agents, we could observe in the scene
[57:01.700 --> 57:05.700] But the same framework extends to all of the other systems that we have
[57:05.700 --> 57:11.700] We use the video feed from 8 cameras to generate the 3D occupancy of the world
[57:13.700 --> 57:15.700] The blue mask here corresponds to the visibility region, we call it
[57:17.700 --> 57:19.700] It basically gets blocked at the first occlusion you see in the scene
[57:31.700 --> 57:37.700] We consume this visibility mask to generate what we call ghost objects, which you can see on the top left
[57:39.700 --> 57:43.700] Now if you model the spawn regions and the state transitions of these ghost objects correctly
[57:45.700 --> 57:51.700] If you tune your control response as a function of their existence likelihood, you can extract some really nice human-like behaviors
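A tiny sketch of what "tuning the control response as a function of existence likelihood" could look like: the target speed blends smoothly between a normal speed and a cautious creep speed as the ghost object's probability grows. The numbers and the linear blend are purely illustrative assumptions, not Autopilot's actual tuning.

def speed_limit_for_ghost(existence_prob, base_speed=15.0, creep_speed=3.0):
    # existence_prob: likelihood that an unseen object occupies the occluded region
    # returns a target speed in m/s; the linear blend is chosen only for illustration
    p = min(max(existence_prob, 0.0), 1.0)
    return creep_speed + (base_speed - creep_speed) * (1.0 - p)

for p in (0.0, 0.3, 0.7, 1.0):
    print(p, round(speed_limit_for_ghost(p), 2), "m/s")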
[57:51.700 --> 58:01.700] Now I'll pass it on to Phil to describe more on how we generate these occupancy networks
[58:05.700 --> 58:11.700] Hey guys, my name is Phil, I will share the details of the occupancy network we built over the past year
[58:13.700 --> 58:17.700] This network is our solution to model the physical world in 3D around our cars
[58:17.700 --> 58:21.700] And it is currently not shown in our customer-facing visualization
[58:21.700 --> 58:26.700] What you will see here is the raw network output from our internal lab tool
[58:29.700 --> 58:33.700] The occupancy network takes video streams of all our 8 cameras as input
[58:34.700 --> 58:39.700] Produces a single unified volumetric occupancy in vector space directly
[58:39.700 --> 58:46.700] For every 3D location around our car, it predicts the probability of that location being occupied or not
[58:47.700 --> 58:54.700] Since it has video context, it is capable of predicting obstacles that are occluded instantaneously
[58:54.700 --> 59:08.700] For each location, it also produces a set of semantics such as curb, car, pedestrian, and road debris as color-coded here
[59:10.700 --> 59:13.700] Occupancy flow is also predicted for motion
[59:13.700 --> 59:20.700] Since the model is a generalized network, it does not tell static and dynamic objects apart explicitly
[59:20.700 --> 59:27.700] It is able to produce and model the random motion such as a swerving trailer here
[59:28.700 --> 59:32.700] This network is currently running in all Teslas with FSD computers
[59:32.700 --> 59:37.700] And it is incredibly efficient, runs about every 10 milliseconds with our neural network accelerator
[59:39.700 --> 59:42.700] So how does this work? Let's take a look at architecture
[59:43.700 --> 59:47.700] First, we rectify each camera image with a camera calibration
[59:47.700 --> 59:50.700] And the images we're showing here are given to the network
[59:50.700 --> 59:53.700] It's actually not the typical 8-bit RGB image
[59:53.700 --> 1:00:00.700] As you can see from the first image on top, we're giving the 12-bit raw photon count image to the network
[1:00:00.700 --> 1:00:08.700] Since it has 4 bits more information, it has 16 times better dynamic range as well as reduced latency
[1:00:08.700 --> 1:00:10.700] Since we don't have to run ISP in the loop anymore
[1:00:10.700 --> 1:00:17.700] We use a set of RegNets and BiFPNs as a backbone to extract image space features
[1:00:18.700 --> 1:00:27.700] Next, we construct a set of 3D position queries along with the image space features as keys and values fit into an attention module
[1:00:27.700 --> 1:00:32.700] The output of the attention module is high-dimensional spatial features
[1:00:32.700 --> 1:00:40.700] These spatial features are aligned temporally using vehicle odometry to derive motion
[1:00:41.700 --> 1:00:49.700] Next, these spatial temporal features go through a set of deconvolutions to produce the final occupancy and occupancy flow output
[1:00:49.700 --> 1:00:56.700] They're formed as fixed-size voxel grids, which might not be precise enough for planning and control
[1:00:56.700 --> 1:01:03.700] In order to get a higher resolution, we also produce per-voxel feature maps which we feed into an MLP
[1:01:03.700 --> 1:01:10.700] with 3D spatial point queries to get occupancy and semantics at any arbitrary location
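A minimal PyTorch sketch of the pipeline Phil describes: per-camera features, 3D position queries attending over them, temporal fusion, then occupancy and flow heads plus a queryable per-voxel MLP. All module choices and shapes are tiny invented placeholders (a single conv instead of RegNet/BiFPN, a GRU instead of the real odometry-based temporal alignment, linear heads standing in for the deconvolutions), so this is only the data flow, not Tesla's network.

import torch
import torch.nn as nn

class ToyOccupancyNet(nn.Module):
    def __init__(self, feat=32, grid=(8, 8, 2)):
        super().__init__()
        self.grid = grid
        self.backbone = nn.Conv2d(3, feat, 3, stride=2, padding=1)      # per-camera features
        n_queries = grid[0] * grid[1] * grid[2]
        self.queries = nn.Parameter(torch.randn(n_queries, feat))       # 3D position queries
        self.attn = nn.MultiheadAttention(feat, num_heads=4, batch_first=True)
        self.temporal = nn.GRU(feat, feat, batch_first=True)            # fuse features over time
        self.occ_head = nn.Linear(feat, 1)                              # occupancy probability
        self.flow_head = nn.Linear(feat, 3)                             # occupancy flow
        self.point_mlp = nn.Sequential(nn.Linear(feat + 3, feat), nn.ReLU(),
                                       nn.Linear(feat, 1))              # queryable decoder

    def forward(self, video):                        # video: (T, n_cams, 3, H, W)
        per_frame = []
        for t in range(video.shape[0]):
            f = self.backbone(video[t])                                 # (n_cams, feat, h, w)
            kv = f.flatten(2).permute(0, 2, 1).reshape(1, -1, f.shape[1])   # keys / values
            spatial, _ = self.attn(self.queries.unsqueeze(0), kv, kv)       # -> spatial features
            per_frame.append(spatial)
        seq = torch.cat(per_frame, dim=0).permute(1, 0, 2)              # (n_queries, T, feat)
        fused, _ = self.temporal(seq)
        voxel_feat = fused[:, -1]                                       # latest fused features
        occ = torch.sigmoid(self.occ_head(voxel_feat)).view(*self.grid)
        flow = self.flow_head(voxel_feat).view(*self.grid, 3)
        return occ, flow, voxel_feat

    def query_points(self, voxel_feat, points):      # points: (N, 3) grid coordinates
        # per-voxel features + 3D point -> occupancy at an arbitrary location
        idx = points.long().clamp(min=0)
        flat = idx[:, 0] * self.grid[1] * self.grid[2] + idx[:, 1] * self.grid[2] + idx[:, 2]
        return torch.sigmoid(self.point_mlp(torch.cat([voxel_feat[flat], points], dim=1)))

net = ToyOccupancyNet()
occ, flow, voxel_feat = net(torch.randn(2, 8, 3, 32, 48))   # 2 frames from 8 cameras
print(occ.shape, flow.shape, net.query_points(voxel_feat, torch.tensor([[3.2, 4.7, 1.1]])))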
[1:01:11.700 --> 1:01:15.700] After knowing the model better, let's take a look at another example
[1:01:15.700 --> 1:01:21.700] Here we have an articulated bus parked on the right side of the road, highlighted as an L-shaped voxel here
[1:01:21.700 --> 1:01:29.700] As we approach, the bus starts to move. The front of the bus turns blue first, indicating the model predicts
[1:01:29.700 --> 1:01:33.700] The front of the bus has a non-zero occupancy flow
[1:01:34.700 --> 1:01:44.700] As the bus keeps moving, the entire bus turns blue, and you can also see that the network predicts the precise curvature of the bus
[1:01:44.700 --> 1:01:55.700] This is a very complicated problem for a traditional object detection network, as you'll have to see whether you're going to use one cuboid or perhaps two to fit the curvature
[1:01:55.700 --> 1:02:05.700] But for an occupancy network, since all we care about is the occupancy in the visible space, we'll be able to model the curvature precisely
[1:02:05.700 --> 1:02:11.700] Besides the voxel grid, the occupancy network also produces a drivable surface
[1:02:11.700 --> 1:02:21.700] The drivable surface has both 3D geometry and semantics. They are very useful for control, especially on hilly and curvy roads
[1:02:21.700 --> 1:02:31.700] The surface and the voxel grid are not predicted independently. Instead, the voxel grid actually aligns with the surface implicitly
[1:02:31.700 --> 1:02:41.700] Here, we are at a hill crest where you can see the 3D geometry of the surface being predicted nicely
[1:02:41.700 --> 1:02:46.700] The planner can use this information to decide perhaps we need to slow down more for the hill crest
[1:02:46.700 --> 1:02:52.700] And as you can also see, the voxel grid aligns with the surface consistently
[1:02:52.700 --> 1:03:01.700] Besides the voxels and the surface, we're also very excited about the recent breakthrough in Neural Radiance Fields, or NeRF
[1:03:01.700 --> 1:03:13.700] We're looking into both incorporating some of the latest NeRF features into occupancy network training as well as using our network output as the input state for NeRF
[1:03:13.700 --> 1:03:21.700] As a matter of fact, Ashok is very excited about this. This has been his personal weekend project for a while
[1:03:21.700 --> 1:03:32.700] About these NeRFs, because I think academia is building out these foundation models for language using tons of large data sets for language
[1:03:32.700 --> 1:03:40.700] But I think for vision, NeRFs are going to provide the foundation models for computer vision because they are grounded in geometry
[1:03:40.700 --> 1:03:46.700] And geometry gives us a nice way to supervise these networks and frees us of the requirement to define an ontology
[1:03:46.700 --> 1:03:51.700] And the supervision is essentially free because you just have to differentiably render these images
[1:03:51.700 --> 1:04:01.700] So I think in the future, this occupancy network idea where images come in and then the network produces a consistent volumetric representation of the scene
[1:04:01.700 --> 1:04:05.700] That can then be differentiably rendered into any image that was observed
[1:04:05.700 --> 1:04:12.700] I personally think it's a future of computer vision and we do some initial work on it right now
[1:04:12.700 --> 1:04:23.700] But I think in the future, both at Tesla and in academia, we will see that this combination of one-shot prediction of volumetric occupancy will be the future
[1:04:23.700 --> 1:04:26.700] That's my personal bet
[1:04:26.700 --> 1:04:28.700] Thanks Ashok
[1:04:28.700 --> 1:04:33.700] So here's an example early result of a 3D reconstruction from our fleet data
[1:04:33.700 --> 1:04:43.700] Instead of focusing on getting perfect RGB reproduction in image space, our primary goal here is to accurately represent the world in 3D space for driving
[1:04:43.700 --> 1:04:49.700] And we want to do this for all our fleet data around the world in all weather and lighting conditions
[1:04:49.700 --> 1:04:54.700] And obviously this is a very challenging problem and we're looking for you guys to help
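The "supervision is essentially free" point above comes from standard differentiable volume rendering: composite per-sample occupancies along a camera ray into a pixel, compare with the captured image, and let gradients flow back into the volume. Below is the generic NeRF-style compositing step for a single ray as a small self-contained sketch; it is the textbook formulation, not Tesla's implementation.

import torch

def render_ray(occ_probs, colors):
    # occ_probs: (S,) per-sample occupancy/alpha in [0, 1], ordered near to far
    # colors:    (S, 3) per-sample color; returns the rendered pixel color
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - occ_probs[:-1]]), dim=0)
    weights = trans * occ_probs            # contribution of each sample to the pixel
    return (weights.unsqueeze(-1) * colors).sum(dim=0)

pixel = render_ray(torch.tensor([0.0, 0.1, 0.8, 0.9]), torch.rand(4, 3))
print(pixel)   # comparing this against the observed pixel supervises the volume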
[1:04:54.700 --> 1:05:01.700] Finally, the occupancy network is trained with large auto-labeled data sets without any human in the loop
[1:05:01.700 --> 1:05:06.700] And with that, I'll pass to Tim to talk about what it takes to train this network
[1:05:06.700 --> 1:05:08.700] Thanks Phil
[1:05:12.700 --> 1:05:14.700] Alright, hey everyone
[1:05:14.700 --> 1:05:17.700] Let's talk about some training infrastructure
[1:05:17.700 --> 1:05:21.700] So we've seen a couple of videos, no, four or five
[1:05:21.700 --> 1:05:26.700] I think, and we care and worry about a lot more clips than that
[1:05:26.700 --> 1:05:30.700] So if we look at the occupancy networks you just saw from Phil
[1:05:30.700 --> 1:05:36.700] Just Phil's videos, it takes 1.4 billion frames to train that network
[1:05:36.700 --> 1:05:41.700] What you just saw and if you have 100,000 GPUs, it would take one hour
[1:05:41.700 --> 1:05:46.700] But if you have one GPU, it would take 100,000 hours
[1:05:46.700 --> 1:05:50.700] So that is not a humane time period that you can wait for your training job to run, right?
[1:05:50.700 --> 1:05:52.700] We want to ship faster than that
[1:05:52.700 --> 1:05:55.700] So that means you're going to need to go parallel
[1:05:55.700 --> 1:05:57.700] So you need a more compute for that
[1:05:57.700 --> 1:06:00.700] That means you're going to need a supercomputer
[1:06:00.700 --> 1:06:06.700] So this is why we've built in-house three supercomputers comprising of 14,000 GPUs
[1:06:06.700 --> 1:06:12.700] Where we use 10,000 GPUs for training and around 4,000 GPUs for auto-labeling
[1:06:12.700 --> 1:06:18.700] All these videos are stored in 30 petabytes of a distributed managed video cache
[1:06:18.700 --> 1:06:22.700] You shouldn't think of our data sets as fixed
[1:06:22.700 --> 1:06:26.700] Let's say as you think of your ImageNet or something, you know, with like a million frames
[1:06:26.700 --> 1:06:28.700] You should think of it as a very fluid thing
[1:06:28.700 --> 1:06:33.700] So we've got half a million of these videos flowing in and out of this cluster
[1:06:33.700 --> 1:06:36.700] These clusters every single day
[1:06:36.700 --> 1:06:43.700] And we track 400,000 of these kind of Python video instantiations every second
[1:06:43.700 --> 1:06:45.700] So that's a lot of calls
[1:06:45.700 --> 1:06:51.700] We're going to need to capture that in order to govern the retention policies of this distributed video cache
[1:06:51.700 --> 1:06:58.700] So underlying all of this is a huge amount of infra, all of which we build and manage in-house
[1:06:58.700 --> 1:07:04.700] So you cannot just buy, you know, 14,000 GPUs and then 30 petabytes of Flash NVMe
[1:07:04.700 --> 1:07:07.700] And you just put it together and let's go train
[1:07:07.700 --> 1:07:11.700] It actually takes a lot of work and I'm going to go into a little bit of that
[1:07:11.700 --> 1:07:15.700] What you actually typically want to do is you want to take your accelerator
[1:07:15.700 --> 1:07:20.700] So that could be the GPU or dojo, which we'll talk about later
[1:07:20.700 --> 1:07:25.700] And because that's the most expensive component, that's where you want to put your bottleneck
[1:07:25.700 --> 1:07:32.700] And so that means that every single part of your system is going to need to outperform this accelerator
[1:07:32.700 --> 1:07:34.700] And so that is really complicated
[1:07:34.700 --> 1:07:40.700] That means that your storage is going to need to have the size and the bandwidth to deliver all the data down into the nodes
[1:07:40.700 --> 1:07:47.700] These nodes need to have the right amount of CPU and memory capabilities to feed into your machine learning framework
[1:07:47.700 --> 1:07:52.700] This machine learning framework then needs to hand it off to your GPU and then you can start training
[1:07:52.700 --> 1:07:58.700] But then you need to do so across hundreds or thousands of GPU in a reliable way in lockstep
[1:07:58.700 --> 1:08:02.700] And in a way that's also fast, so you're also going to need an interconnect
[1:08:02.700 --> 1:08:04.700] Extremely complicated
[1:08:04.700 --> 1:08:07.700] We'll talk more about dojo in a second
[1:08:07.700 --> 1:08:12.700] So first I want to take you through some optimizations that we've done on our cluster
[1:08:12.700 --> 1:08:19.700] So we're getting in a lot of videos and video is very much unlike, let's say, training on images or text
[1:08:19.700 --> 1:08:21.700] Which I think is very well established
[1:08:21.700 --> 1:08:25.700] Video is quite literally a dimension more complicated
[1:08:25.700 --> 1:08:31.700] And so that's why we needed to go end to end from the storage layer down to the accelerator
[1:08:31.700 --> 1:08:33.700] Optimize every single piece of that
[1:08:33.700 --> 1:08:38.700] Because we train on the photon count videos that come directly from our fleet
[1:08:38.700 --> 1:08:42.700] We train on those directly, we do not post-process those at all
[1:08:42.700 --> 1:08:47.700] The way this is done is we seek exactly to the frames we select for our batch
[1:08:47.700 --> 1:08:52.700] We load those in including the frames that they depend on, so these are your I-frames or your keyframes
[1:08:52.700 --> 1:08:57.700] We package those up, move them into shared memory, move them into a double buffer for the GPU
[1:08:57.700 --> 1:09:03.700] And then use the hardware decoder on the accelerator to actually decode the video
[1:09:03.700 --> 1:09:09.700] So we do that on the GPU natively, and this is all in a very nice PyTorch extension
[1:09:09.700 --> 1:09:14.700] Doing so unlocked more than 30% training speed increase for the occupancy networks
[1:09:14.700 --> 1:09:20.700] And freed up basically a whole CPU to do any other thing
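A skeleton of the batching logic just described, to make its shape clear: pick the frames for a sample, pull in the I-frames they depend on, and hand the still-compressed packets to a hardware decoder on the accelerator. read_packets and gpu_decode are hypothetical stand-ins defined here only so the sketch runs; they are not a real library API, and the real pipeline lives inside a PyTorch extension.

import bisect
from torch.utils.data import Dataset

def read_packets(path, frame_indices):
    # stand-in: in reality this seeks into the container and returns only
    # the compressed packets for exactly these frames
    return [(path, f) for f in frame_indices]

def gpu_decode(packets, keep):
    # stand-in for the hardware-accelerated decode step on the GPU
    return [p for p in packets if p[1] in keep]

class RawVideoClips(Dataset):
    def __init__(self, clips, keyframes, frames_per_sample=6):
        self.clips = clips            # paths to photon-count video clips
        self.keyframes = keyframes    # sorted I-frame indices for each clip
        self.n = frames_per_sample

    def __len__(self):
        return len(self.clips)

    def _iframe_before(self, i, frame):
        ks = self.keyframes[i]
        return ks[bisect.bisect_right(ks, frame) - 1]

    def __getitem__(self, i):
        wanted = list(range(5, 5 + self.n))                 # frames selected for this sample
        deps = sorted({self._iframe_before(i, f) for f in wanted} | set(wanted))
        return read_packets(self.clips[i], deps), wanted    # compressed bytes only

def collate_and_decode(batch):
    # move packets into shared/pinned memory, then decode on the accelerator
    return [gpu_decode(packets, keep=wanted) for packets, wanted in batch]

loader = RawVideoClips(clips=["clip_000.mp4"], keyframes=[[0, 4, 8]])
print(collate_and_decode([loader[0]]))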
[1:09:20.700 --> 1:09:25.700] You cannot just do training with just videos, of course you need some kind of a ground truth
[1:09:25.700 --> 1:09:28.700] And that is actually an interesting problem as well
[1:09:28.700 --> 1:09:33.700] The objective for storing your ground truth is that you want to make sure you get to your ground truth
[1:09:33.700 --> 1:09:37.700] That you need in the minimal amount of file system operations
[1:09:37.700 --> 1:09:43.700] And load in the minimal size of what you need in order to optimize for aggregate cross cluster throughput
[1:09:43.700 --> 1:09:50.700] Because you should see a compute cluster as one big device which has internally fixed constraints and thresholds
[1:09:50.700 --> 1:09:56.700] So for this we rolled out a file format that is native to us that's called smol
[1:09:56.700 --> 1:10:00.700] We use this for our ground truth, our feature cache and any inference outputs
[1:10:00.700 --> 1:10:02.700] So a lot of tensors that are in there
[1:10:02.700 --> 1:10:07.700] And so just a cartoon here, let's say this is your table that you want to store
[1:10:07.700 --> 1:10:10.700] Then that's how that would look if you rolled it out on disk
[1:10:10.700 --> 1:10:15.700] So what you do is you take anything you'd want to index on, so for example video timestamps
[1:10:15.700 --> 1:10:20.700] You put those all in the header so that in your initial header read you know exactly where to go on disk
[1:10:20.700 --> 1:10:28.700] Then if you have any tensors you're going to try to transpose the dimensions to put a different dimension last as the contiguous dimension
[1:10:28.700 --> 1:10:31.700] And then also try different types of compression
[1:10:31.700 --> 1:10:35.700] Then you check out which one was most optimal and then store that one
[1:10:35.700 --> 1:10:38.700] This is actually a huge tip if you do feature caching
[1:10:38.700 --> 1:10:41.700] For intermediate output from the machine learning network
[1:10:41.700 --> 1:10:47.700] Rotate the dimensions around a little bit, and you can get up to 20% increase in efficiency of storage
[1:10:47.700 --> 1:10:53.700] Then when you store that we also order the columns by size
[1:10:53.700 --> 1:10:56.700] So that all your small columns and small values are together
[1:10:56.700 --> 1:11:02.700] So that when you seek for a single value you're likely to overlap with a read on more values which you'll use later
[1:11:02.700 --> 1:11:06.700] So that you don't need to do another file system operation
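A toy sketch of those storage ideas in Python: put the index in a header so one read tells you where to seek, try dimension permutations and compressions per tensor and keep the smallest, and write small columns next to each other. This is a hypothetical layout written for illustration only, not the actual on-disk spec.

import itertools
import json
import zlib
import numpy as np

def pack_tensor(t):
    # try dimension orders x compressions, keep whichever encodes smallest
    best = None
    for perm in itertools.permutations(range(t.ndim)):
        raw = np.ascontiguousarray(t.transpose(perm)).tobytes()
        for codec, blob in (("none", raw), ("zlib", zlib.compress(raw, 6))):
            if best is None or len(blob) < len(best[2]):
                best = (perm, codec, blob)
    perm, codec, blob = best
    return {"perm": perm, "codec": codec, "dtype": str(t.dtype), "shape": t.shape}, blob

def write_columns(path, columns):
    # columns: name -> numpy array; smaller columns are laid out first so a
    # single small read tends to cover several of them
    metas, blobs = {}, {}
    for name, arr in columns.items():
        metas[name], blobs[name] = pack_tensor(np.asarray(arr))
    order = sorted(blobs, key=lambda n: len(blobs[n]))
    header, offset = {}, 0
    for name in order:
        header[name] = {"offset": offset, "size": len(blobs[name]), **metas[name]}
        offset += len(blobs[name])
    with open(path, "wb") as f:
        h = json.dumps(header).encode()
        f.write(len(h).to_bytes(8, "little"))   # header first: one read gives the full index
        f.write(h)
        for name in order:
            f.write(blobs[name])

write_columns("example_groundtruth.bin", {
    "timestamps": np.arange(100, dtype=np.int64),
    "features": np.random.rand(16, 8, 4).astype(np.float32),
})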
[1:11:06.700 --> 1:11:12.700] So I could go on and on, I just went on, touched on two projects that we have internally
[1:11:12.700 --> 1:11:18.700] This is actually part of a huge continuous effort to optimize the compute that we have in-house
[1:11:18.700 --> 1:11:21.700] So accumulating and aggregating through all these optimizations
[1:11:21.700 --> 1:11:26.700] We now train our occupancy networks twice as fast just because it's twice as efficient
[1:11:26.700 --> 1:11:33.700] And now if we add in a bunch more compute and go parallel we can now train this in hours instead of days
[1:11:33.700 --> 1:11:44.700] And with that I'd like to hand it off to the biggest user of compute, John
[1:11:44.700 --> 1:11:49.700] Hi everybody, my name is John Emmons, I lead the autopilot vision team
[1:11:49.700 --> 1:11:53.700] I'm going to cover two topics with you today, the first is how we predict lanes
[1:11:53.700 --> 1:11:58.700] And the second is how we predict the future behavior of other agents on the road
[1:11:58.700 --> 1:12:05.700] In the early days of autopilot we modeled the lane detection problem as an image space instance segmentation task
[1:12:05.700 --> 1:12:12.700] Our network was super simple though, in fact it was only capable of predicting lanes from a few different kinds of geometries
[1:12:12.700 --> 1:12:19.700] Specifically it would segment the ego lane, it could segment adjacent lanes, and then it had some special casing for forks and merges
[1:12:19.700 --> 1:12:24.700] This simplistic modeling of the problem worked for highly structured roads like highways
[1:12:24.700 --> 1:12:29.700] But today we're trying to build a system that's capable of much more complex maneuvers
[1:12:29.700 --> 1:12:35.700] Specifically we want to make left and right turns at intersections where the road topology can be quite a bit more complex and diverse
[1:12:35.700 --> 1:12:40.700] When we try to apply this simplistic modeling of the problem here, it just totally breaks down
[1:12:40.700 --> 1:12:48.700] Taking a step back for a moment, what we're trying to do here is to predict the sparse set of lane instances and their connectivity
[1:12:48.700 --> 1:12:54.700] And what we want to do is to have a neural network that basically predicts this graph where the nodes are the lane segments
[1:12:54.700 --> 1:12:59.700] And the edges encode the connectivity between these lanes
[1:12:59.700 --> 1:13:05.700] So what we have is our lane detection neural network, it's made up of three components
[1:13:05.700 --> 1:13:10.700] In the first component we have a set of convolutional layers, attention layers, and other neural network layers
[1:13:10.700 --> 1:13:17.700] That encode the video streams from our eight cameras on the vehicle and produce a rich visual representation
[1:13:17.700 --> 1:13:24.700] We then enhance this visual representation with a coarse road level map data
[1:13:24.700 --> 1:13:29.700] Which we encode with a set of additional neural network layers that we call the lane guidance module
[1:13:29.700 --> 1:13:40.700] This map is not an HD map, but it provides a lot of useful hints about the topology of lanes inside of intersections, the lane counts on various roads, and a set of other attributes that help us
[1:13:40.700 --> 1:13:45.700] The first two components here produce a dense tensor that sort of encodes the world
[1:13:45.700 --> 1:13:51.700] But what we really want to do is to convert this dense tensor into a sparse set of lanes and their connectivity
[1:13:51.700 --> 1:14:03.700] We approach this problem like an image captioning task where the input is this dense tensor and the output text is predicted in a special language that we developed at Tesla for encoding lanes and their connectivity
[1:14:03.700 --> 1:14:08.700] In this language of lanes, the words and tokens are the lane positions in 3D space
[1:14:08.700 --> 1:14:15.700] And the ordering of the tokens and the modifiers in the tokens encode the connectivity relationships between these lanes
[1:14:15.700 --> 1:14:24.700] By modeling the task as a language problem, we can capitalize on recent autoregressive architectures and techniques from the language community for handling the multimodality of the problem
[1:14:24.700 --> 1:14:32.700] We're not just solving the computer vision problem at Autopilot, we're also applying the state-of-the-art in language modeling and machine learning more generally
[1:14:32.700 --> 1:14:36.700] I'm now going to dive into a little bit more detail of this language component
[1:14:36.700 --> 1:14:43.700] What I have depicted on the screen here is a satellite image which sort of represents the local area around the vehicle
[1:14:43.700 --> 1:14:50.700] The set of nodes and edges is what we refer to as the lane graph, and it's ultimately what we want to come out of this neural network
[1:14:50.700 --> 1:14:53.700] We start with a blank slate
[1:14:53.700 --> 1:14:57.700] We're going to want to make our first prediction here at this green dot
[1:14:57.700 --> 1:15:03.700] This green dot's position is encoded as an index into a coarse grid which discretizes the 3D world
[1:15:03.700 --> 1:15:07.700] Now we don't predict this index directly because it would be too computationally expensive to do so
[1:15:07.700 --> 1:15:14.700] There's just too many grid points and predicting a categorical distribution over this has both implications at training time and test time
[1:15:14.700 --> 1:15:23.700] So instead what we do is we discretize the world coarsely first, we predict the heat map over the possible locations, and then we latch in the most probable location
[1:15:23.700 --> 1:15:28.700] Condition on this, we then refine the prediction and get the precise point
[1:15:28.700 --> 1:15:32.700] Now we know where the position of this token is, but we don't know its type
[1:15:32.700 --> 1:15:35.700] In this case though, it's a beginning of a new lane
[1:15:35.700 --> 1:15:38.700] So we predict it as a start token
[1:15:38.700 --> 1:15:42.700] And because it's a start token, there's no additional attributes in our language
[1:15:42.700 --> 1:15:48.700] We then take the predictions from this first forward pass, and we encode them using a learned positional embedding
[1:15:48.700 --> 1:15:52.700] Which produces a set of tensors that we combine together
[1:15:52.700 --> 1:15:55.700] Which is actually the first word in our language of lanes
[1:15:55.700 --> 1:15:58.700] We add this to the first position in our sentence here
[1:15:58.700 --> 1:16:04.700] We then continue this process by predicting the next lane point in a similar fashion
[1:16:04.700 --> 1:16:09.700] Now this lane point is not the beginning of a new lane, it's actually a continuation of the previous lane
[1:16:09.700 --> 1:16:13.700] So it's a continuation token type
[1:16:13.700 --> 1:16:17.700] Now it's not enough just to know that this lane is connected to the previously predicted lane
[1:16:17.700 --> 1:16:23.700] We want to encode its precise geometry, which we do by regressing a set of spline coefficients
[1:16:23.700 --> 1:16:28.700] We then take this lane, we encode it again, and add it as the next word in the sentence
[1:16:28.700 --> 1:16:33.700] We continue predicting these continuation lanes until we get to the end of the prediction grid
[1:16:33.700 --> 1:16:36.700] We then move on to a different lane segment
[1:16:36.700 --> 1:16:38.700] So you can see that cyan dot there
[1:16:38.700 --> 1:16:41.700] Now it's not topologically connected to that pink point
[1:16:41.700 --> 1:16:45.700] It's actually forking off of that green point there
[1:16:45.700 --> 1:16:48.700] So it's got a fork type
[1:16:48.700 --> 1:16:55.700] And fork tokens actually point back to previous tokens from which their fork originates
[1:16:55.700 --> 1:16:58.700] So you can see here the fork pointer is actually index zero
[1:16:58.700 --> 1:17:04.700] So it's actually referencing back to a token that is already predicted, like you would in language
[1:17:04.700 --> 1:17:09.700] We continue this process over and over again until we've enumerated all of the tokens in the lane graph
[1:17:09.700 --> 1:17:13.700] And then the network predicts the end of sentence token
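A schematic of that decoding loop as a toy PyTorch module: predict a coarse heatmap, latch the most probable cell, refine to a precise point, classify the token type, attach spline geometry for continuations, then embed the new "word" and append it to the sentence before the next step. Sizes and modules (a GRU in place of the real attention stack, no fork-index head) are invented for illustration and are not the production lane network.

import torch
import torch.nn as nn

class ToyLaneDecoder(nn.Module):
    TOKEN_TYPES = ("start", "continuation", "fork", "end")

    def __init__(self, d=32, coarse_cells=64):
        super().__init__()
        self.embed_pos = nn.Embedding(coarse_cells, d)
        self.embed_type = nn.Embedding(len(self.TOKEN_TYPES), d)
        self.core = nn.GRU(d, d, batch_first=True)         # stands in for attention blocks
        self.coarse_head = nn.Linear(d, coarse_cells)      # heatmap over the coarse grid
        self.refine_head = nn.Linear(d, 2)                 # precise offset inside the cell
        self.type_head = nn.Linear(d, len(self.TOKEN_TYPES))
        self.spline_head = nn.Linear(d, 4)                 # geometry for continuation tokens

    @torch.no_grad()
    def decode(self, scene_feat, max_tokens=10):
        words = scene_feat.view(1, 1, -1)                  # the "sentence" so far
        tokens = []
        for _ in range(max_tokens):
            h, _ = self.core(words)
            ctx = h[:, -1]
            cell = self.coarse_head(ctx).argmax(-1)        # latch the most probable cell
            ttype = self.type_head(ctx).argmax(-1).item()
            token = {"cell": cell.item(),
                     "offset": self.refine_head(ctx).squeeze(0).tolist(),  # refined point
                     "type": self.TOKEN_TYPES[ttype]}
            if token["type"] == "continuation":
                token["spline"] = self.spline_head(ctx).squeeze(0).tolist()
            tokens.append(token)
            if token["type"] == "end":                     # end-of-sentence token
                break
            word = self.embed_pos(cell) + self.embed_type(torch.tensor([ttype]))
            words = torch.cat([words, word.unsqueeze(1)], dim=1)   # append the new word
        return tokens

decoder = ToyLaneDecoder()
print(decoder.decode(torch.randn(32))[:2])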
[1:17:13.700 --> 1:17:18.700] Yeah, I just wanted to note that the reason we do this is not just because we want to build something complicated
[1:17:18.700 --> 1:17:21.700] It almost feels like a Turing complete machine here with neural networks though
[1:17:21.700 --> 1:17:28.700] Is that we try simple approaches, for example, trying to just segment the lanes along the road or something like that
[1:17:28.700 --> 1:17:32.700] But then the problem is when there's uncertainty, say you cannot see the road clearly
[1:17:32.700 --> 1:17:35.700] And there could be two lanes or three lanes and you can't tell
[1:17:35.700 --> 1:17:39.700] A simple segmentation-based approach would just draw both of them
[1:17:39.700 --> 1:17:41.700] It's kind of a 2.5 lane situation
[1:17:41.700 --> 1:17:45.700] And the post-processing algorithm would hilariously fail when the predictions are such
[1:17:45.700 --> 1:17:47.700] Yeah, the problems don't end there
[1:17:47.700 --> 1:17:51.700] I mean, you need to predict these connective lanes inside of intersections
[1:17:51.700 --> 1:17:54.700] Which is just not possible with the approach that Ashok's mentioning
[1:17:54.700 --> 1:17:56.700] Which is why we had to upgrade to this sort of approach
[1:17:56.700 --> 1:17:59.700] Yeah, when it overlaps like this, segmentation would just go haywire
[1:17:59.700 --> 1:18:03.700] But even if you try very hard to put them on separate layers, it's just a really hard problem
[1:18:03.700 --> 1:18:09.700] But language just offers a really nice framework for getting a sample from a posterior
[1:18:09.700 --> 1:18:13.700] As opposed to trying to do all of this in post-processing
[1:18:13.700 --> 1:18:15.700] But this doesn't actually stop for just autopilot, right?
[1:18:15.700 --> 1:18:17.700] John, this can be used for optimists
[1:18:17.700 --> 1:18:19.700] Yeah, I guess they wouldn't be called lanes
[1:18:19.700 --> 1:18:22.700] But you could imagine, sort of in this stage here
[1:18:22.700 --> 1:18:27.700] That you might have sort of paths that sort of encode the possible places that people could walk
[1:18:27.700 --> 1:18:33.700] Yeah, basically if you're in a factory or in a home setting, you can just ask the robot
[1:18:33.700 --> 1:18:39.700] Okay, please route to the kitchen or please route to some location in the factory
[1:18:39.700 --> 1:18:42.700] And then we predict a set of pathways that would go through the aisles, take the robot
[1:18:42.700 --> 1:18:44.700] And say, okay, this is how you get to the kitchen
[1:18:44.700 --> 1:18:48.700] It just really gives us a nice framework to model these different paths
[1:18:48.700 --> 1:18:51.700] That simplify the navigation problem for the downstream planner
[1:18:54.700 --> 1:18:58.700] Alright, so ultimately what we get from this lane detection network
[1:18:58.700 --> 1:19:01.700] Is a set of lanes and their connectivity, which comes directly from the network
[1:19:01.700 --> 1:19:07.700] There's no additional step here for sparsifying these dense predictions into sparse ones
[1:19:07.700 --> 1:19:09.700] This is just a direct unfiltered output of the network
[1:19:12.700 --> 1:19:14.700] Okay, so I talked a little bit about lanes
[1:19:14.700 --> 1:19:20.700] I'm going to briefly touch on how we model and predict the future paths and other semantics on objects
[1:19:20.700 --> 1:19:23.700] So I'm just going to go really quickly through two examples
[1:19:23.700 --> 1:19:28.700] The video on the right here, we've got a car that's actually running a red light and turning in front of us
[1:19:28.700 --> 1:19:34.700] What we do to handle situations like this is we predict a set of short time horizon future trajectories on all objects
[1:19:34.700 --> 1:19:38.700] We can use these to anticipate the dangerous situation here
[1:19:38.700 --> 1:19:42.700] And apply whatever breaking and steering actions required to avoid a collision
[1:19:42.700 --> 1:19:45.700] In the video on the right, there's two vehicles in front of us
[1:19:45.700 --> 1:19:49.700] The one on the left lane is parked, apparently it's being loaded, unloaded
[1:19:49.700 --> 1:19:51.700] I don't know why the driver decided to park there
[1:19:51.700 --> 1:19:55.700] But the important thing is that our neural network predicted that it was stopped
[1:19:55.700 --> 1:19:57.700] Which is the red color there
[1:19:57.700 --> 1:20:00.700] The vehicle in the other lane, as you notice, also is stationary
[1:20:00.700 --> 1:20:03.700] But that one's obviously just waiting for that red light to turn green
[1:20:03.700 --> 1:20:06.700] So even though both objects are stationary and have zero velocity
[1:20:06.700 --> 1:20:08.700] It's the semantics that is really important here
[1:20:08.700 --> 1:20:11.700] So that we don't get stuck behind that awkwardly parked car
[1:20:13.700 --> 1:20:18.700] Predicting all of these agent attributes presents some practical problems when trying to build a real-time system
[1:20:18.700 --> 1:20:21.700] We need to maximize the frame rate of our object detection stack
[1:20:21.700 --> 1:20:24.700] So that autopilot can quickly react to the changing environment
[1:20:24.700 --> 1:20:26.700] Every millisecond really matters here
[1:20:26.700 --> 1:20:31.700] To minimize the inference latency, our neural network is split into two phases
[1:20:31.700 --> 1:20:36.700] In the first phase, we identify the locations in 3D space where agents exist
[1:20:36.700 --> 1:20:40.700] In the second stage, we then pull out tensors at those 3D locations
[1:20:40.700 --> 1:20:43.700] Append it with additional data that's on the vehicle
[1:20:43.700 --> 1:20:46.700] And then we do the rest of the processing
[1:20:46.700 --> 1:20:51.700] This sparsification step allows the neural network to focus compute on the areas that matter most
[1:20:51.700 --> 1:20:54.700] Which gives us superior performance for a fraction of the latency cost
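A toy version of that two-stage split: a cheap dense head finds where agents probably are, then the heavier attribute head runs only on features gathered at those locations, appended with extra on-vehicle data. Shapes and layer choices here are invented for illustration.

import torch
import torch.nn as nn

class TwoPhaseDetector(nn.Module):
    def __init__(self, feat=32):
        super().__init__()
        self.presence = nn.Conv2d(feat, 1, 1)                   # phase 1: where are agents?
        self.attributes = nn.Sequential(nn.Linear(feat + 4, 64), nn.ReLU(),
                                        nn.Linear(64, 8))       # phase 2: per-agent attributes

    def forward(self, dense_feat, vehicle_data, top_k=20):
        # dense_feat: (feat, H, W) dense features; vehicle_data: (4,) extra on-vehicle inputs
        heat = self.presence(dense_feat.unsqueeze(0)).squeeze()            # (H, W)
        idx = heat.flatten().topk(top_k).indices                           # only likely spots
        picked = dense_feat.flatten(1)[:, idx].t()                         # (top_k, feat)
        extra = vehicle_data.expand(top_k, -1)
        return self.attributes(torch.cat([picked, extra], dim=1))          # (top_k, 8)

det = TwoPhaseDetector()
print(det(torch.randn(32, 16, 16), torch.randn(4)).shape)   # heavy compute on 20 spots, not 256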
[1:20:54.700 --> 1:20:57.700] So, putting it all together
[1:20:57.700 --> 1:21:00.700] The autopilot vision stack predicts more than just the geometry and kinematics of the world
[1:21:00.700 --> 1:21:05.700] It also predicts a rich set of semantics, which enables safe and human-like driving
[1:21:05.700 --> 1:21:09.700] I'm now going to hand things off to Sri who will tell us how we run all these cool neural networks on our FSD computer
[1:21:09.700 --> 1:21:10.700] Thank you
[1:21:18.700 --> 1:21:20.700] Hi everyone, I'm Sri
[1:21:20.700 --> 1:21:24.700] Today I'm going to give a glimpse of what it takes to run these FSD networks in the car
[1:21:24.700 --> 1:21:27.700] And how do we optimize for the inference latency?
[1:21:27.700 --> 1:21:32.700] Today I'm going to focus just on the FSD lanes network that John just talked about
[1:21:35.700 --> 1:21:42.700] So, when we started this track, we wanted to know if we can run this FSD lanes network natively on the trip engine
[1:21:42.700 --> 1:21:47.700] Which is our in-house neural network accelerator that we built in the FSD computer
[1:21:47.700 --> 1:21:54.700] When we built this hardware, we kept it simple and we made sure it can do one thing ridiculously fast
[1:21:54.700 --> 1:21:56.700] Dense dot products
[1:21:56.700 --> 1:22:00.700] But this architecture is autoregressive and iterative
[1:22:00.700 --> 1:22:05.700] Where it crunches through multiple attention blocks in the inner loop
[1:22:05.700 --> 1:22:08.700] Producing sparse points directly at every step
[1:22:08.700 --> 1:22:15.700] So, the challenge here was how can we do this sparse point prediction and sparse computation on a dense dot product engine
[1:22:15.700 --> 1:22:18.700] Let's see how we did this on the trip
[1:22:18.700 --> 1:22:25.700] So, the network predicts the heat map of most probable spatial locations of the point
[1:22:42.700 --> 1:22:47.700] To do this on trip, we actually built a lookup table in SRAM
[1:22:47.700 --> 1:22:55.700] And we engineered the dimensions of this embedding such that we could achieve all of this thing with just matrix multiplication
[1:22:55.700 --> 1:23:01.700] Not just that, we also wanted to store this embedding into a token cache
[1:23:01.700 --> 1:23:06.700] So that we don't recompute this for every iteration, rather reuse it for future point prediction
[1:23:06.700 --> 1:23:12.700] Again, we put some tricks here where we did all these operations just on the dot product engine
[1:23:12.700 --> 1:23:19.700] It's actually cool that our team found creative ways to map all these operations on the trip engine
[1:23:19.700 --> 1:23:24.700] In ways that were not even imagined when this hardware was designed
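The lookup-table trick reduces to a familiar identity: a gather is the same as multiplying a one-hot selector matrix against the embedding table, and a dense dot-product engine can do the latter natively. Here is a tiny numpy illustration of that identity, not the actual on-chip implementation.

import numpy as np

table = np.random.rand(64, 16).astype(np.float32)   # lookup table held in SRAM: 64 rows, dim 16

def lookup_via_matmul(indices):
    # one-hot selector @ table == gather, expressed purely as a dense dot product
    onehot = np.zeros((len(indices), table.shape[0]), dtype=np.float32)
    onehot[np.arange(len(indices)), indices] = 1.0
    return onehot @ table

print(np.allclose(lookup_via_matmul([3, 41, 7]), table[[3, 41, 7]]))   # True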
[1:23:24.700 --> 1:23:28.700] But that's not the only thing we had to do to make this work
[1:23:28.700 --> 1:23:34.700] We actually implemented a whole lot of operations and features to make this model compilable
[1:23:34.700 --> 1:23:39.700] To improve the int8 accuracy as well as to optimize performance
[1:23:39.700 --> 1:23:46.700] All of these things helped us run this 75 million parameter model just under 10 millisecond of latency
[1:23:46.700 --> 1:23:50.700] Consuming just 8 watts of power
[1:23:50.700 --> 1:23:54.700] But this is not the only architecture running in the car
[1:23:54.700 --> 1:23:58.700] There are so many other architectures, modules and networks we need to run in the car
[1:23:58.700 --> 1:24:05.700] To give a sense of scale, there are about a billion parameters of all the networks combined
[1:24:05.700 --> 1:24:08.700] Producing around 1000 neural network signals
[1:24:08.700 --> 1:24:16.700] So we need to make sure we optimize them jointly and such that we maximize the compute utilization
[1:24:16.700 --> 1:24:19.700] Throughput and minimize the latency
[1:24:19.700 --> 1:24:26.700] So we built a compiler just for neural networks that shares its structure with traditional compilers
[1:24:26.700 --> 1:24:33.700] As you can see, it takes the massive graph of neural nets with 150k nodes and 375k connections
[1:24:33.700 --> 1:24:37.700] Takes this thing, partitions them into independent subgraphs
[1:24:37.700 --> 1:24:43.700] And compiles each of those subgraphs natively for the inference devices
[1:24:43.700 --> 1:24:48.700] Then we have a neural network linker which shares its structure with a traditional linker
[1:24:48.700 --> 1:24:51.700] Where we perform this link time optimization
[1:24:51.700 --> 1:24:59.700] There we solve an offline optimization problem with compute, memory, and memory bandwidth constraints
[1:24:59.700 --> 1:25:04.700] So that it comes up with an optimized schedule that gets executed in the car
[1:25:04.700 --> 1:25:12.700] On the runtime, we designed a hybrid scheduling system which basically does heterogeneous scheduling on one SOC
[1:25:12.700 --> 1:25:18.700] And distributed scheduling across both the SOCs to run these networks in a model parallel fashion
[1:25:18.700 --> 1:25:25.700] To get 100 TOPS of compute utilization, we need to optimize across all the layers of software
[1:25:25.700 --> 1:25:33.700] Right from tuning the network architecture, the compiler, all the way to implementing a low latency high bandwidth RDMA link
[1:25:33.700 --> 1:25:43.700] Across both the SOCs and in fact going even deeper to understanding and optimizing the cache coherent and non-coherent data path of the accelerator in the SOC
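To give a flavor of what the link-time optimization is deciding, here is a toy greedy scheduler that places subgraphs onto the two SoCs in dependency order under a simple compute budget. The real linker solves a much richer offline problem with memory and bandwidth constraints; the costs, budgets, and heuristic below are invented for illustration.

def schedule_across_socs(subgraphs, deps, budgets=(50.0, 50.0)):
    # subgraphs: name -> compute cost (e.g. milliseconds); deps: name -> prerequisite names
    load = [0.0, 0.0]
    placement, done = {}, set()
    remaining = dict(subgraphs)
    while remaining:
        ready = [n for n in remaining if all(d in done for d in deps.get(n, []))]
        for name in sorted(ready, key=remaining.get, reverse=True):   # biggest subgraphs first
            soc = 0 if load[0] <= load[1] else 1                      # least-loaded SoC
            if load[soc] + remaining[name] > budgets[soc]:
                soc = 1 - soc                                         # spill to the other SoC
            placement[name] = soc
            load[soc] += remaining[name]
            done.add(name)
            del remaining[name]
    return placement, load

subgraphs = {"occupancy": 18.0, "lanes": 10.0, "objects": 12.0, "planner_nets": 6.0}
deps = {"planner_nets": ["occupancy", "lanes", "objects"]}
print(schedule_across_socs(subgraphs, deps))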
[1:25:43.700 --> 1:25:51.700] This is a lot of optimization at every level in order to make sure we get the highest frame rate and as every millisecond counts here
[1:25:51.700 --> 1:25:59.700] And this is just the visualization of the neural networks that are running in the car
[1:25:59.700 --> 1:26:02.700] This is our digital brain essentially
[1:26:02.700 --> 1:26:10.700] As you can see these operations are nothing but just the matrix multiplication, convolution to name a few real operations running in the car
[1:26:10.700 --> 1:26:16.700] To train this network with a billion parameters, you need a lot of labeled data
[1:26:16.700 --> 1:26:22.700] So Egan is going to talk about how we achieve this with the auto labeling pipeline
[1:26:30.700 --> 1:26:31.700] Thank you Sri
[1:26:31.700 --> 1:26:36.700] Hi everyone, I'm Egan Zhang and I'm leading geometric vision at autopilot
[1:26:36.700 --> 1:26:41.700] So yeah, let's talk about auto labeling
[1:26:41.700 --> 1:26:47.700] So we have several kinds of auto labeling frameworks to support various types of networks
[1:26:47.700 --> 1:26:52.700] But today I'd like to focus on the awesome lanes net here
[1:26:52.700 --> 1:27:05.700] So to successfully train and generalize this network to everywhere, we think we need tens of millions of trips from probably one million intersections or even more
[1:27:05.700 --> 1:27:09.700] So how do we do that?
[1:27:09.700 --> 1:27:21.700] So it is certainly achievable to source a sufficient amount of trips because, as Tim explained earlier, we already have like 500,000 trips per day being cached
[1:27:21.700 --> 1:27:28.700] However, converting all those data into a training form is a very challenging technical problem
[1:27:28.700 --> 1:27:35.700] To solve this challenge, we've tried various ways of manual and auto labeling
[1:27:35.700 --> 1:27:44.700] So from the first column to the second, from the second to the third, each advance provided us nearly 100x improvement in throughput
[1:27:44.700 --> 1:27:54.700] But still, we want an even better auto labeling machine that can provide us good quality, diversity and scalability
[1:27:54.700 --> 1:28:08.700] To meet all these requirements, despite the huge amount of engineering effort required here, we've developed a new auto labeling machine powered by multi-trip reconstruction
[1:28:08.700 --> 1:28:18.700] So this can replace 5 million hours of manual labeling with just 12 hours on the cluster for labeling 10,000 trips
[1:28:18.700 --> 1:28:28.700] So how did we solve it? There are three big steps. The first step is high precision trajectory and structure recovery by multi-camera visual inertial odometry
[1:28:28.700 --> 1:28:38.700] So here, all the features including ground surface are inferred from videos by neural networks, then tracked and reconstructed in the vector space
[1:28:38.700 --> 1:28:51.700] So the typical drift rate of this trajectory in the car is like 1.3 centimeters per meter and 0.45 milliradians per meter, which is pretty decent considering its compact compute requirement
[1:28:51.700 --> 1:28:59.700] Then the recovered surface and road details are also used as a strong guidance for the later manual verification step
[1:28:59.700 --> 1:29:10.700] This is also enabled in every FSD vehicle, so we get preprocessed trajectories and structures along with the trip data
[1:29:10.700 --> 1:29:16.700] The second step is multi-trip reconstruction, which is the big and core piece of this machine
[1:29:16.700 --> 1:29:26.700] So the video shows how the previously shown trip is reconstructed and aligned with other trips, basically other trips from different vehicles, not the same vehicle
[1:29:26.700 --> 1:29:35.700] So this is done by multiple internal steps like coarse alignment, pairwise matching, joint optimization, then further surface refinement
[1:29:35.700 --> 1:29:40.700] In the end, the human analyst comes in and finalizes the label
[1:29:40.700 --> 1:29:51.700] So each of these heavy steps is already fully parallelized on the cluster, so the entire process usually takes just a couple of hours
[1:29:51.700 --> 1:29:56.700] The last step is actually auto-labeling the new trips
[1:29:56.700 --> 1:30:04.700] So here we use the same multi-trip alignment engine, but only between pre-built reconstruction and each new trip
[1:30:04.700 --> 1:30:09.700] So it's much, much simpler than fully reconstructing all the clips altogether
[1:30:09.700 --> 1:30:18.700] That's why it only takes 30 minutes per trip to auto-label instead of several hours of manual labeling
[1:30:18.700 --> 1:30:25.700] And this is also the key of scalability of this machine
[1:30:25.700 --> 1:30:32.700] This machine easily scales as long as we have available compute and trip data
[1:30:32.700 --> 1:30:41.700] So about 50 trips were newly auto-labeled from this scene and some of them are shown here, so 53 from different vehicles
[1:30:41.700 --> 1:30:48.700] So this is how we capture and transform the space-time slices of the world into the network supervision
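The core geometric step in aligning a new trip against a prebuilt reconstruction can be illustrated with a least-squares rigid fit between matched points (a 2D Kabsch solve here). In the real pipeline the correspondences come from pairwise matching and the result is refined by joint optimization; this sketch assumes the matches are already given and is purely illustrative.

import numpy as np

def align_trip_to_reconstruction(trip_pts, recon_pts):
    # matched points from the new trip and the reconstruction -> rotation R, translation t
    a, b = np.asarray(trip_pts), np.asarray(recon_pts)
    ca, cb = a.mean(0), b.mean(0)
    U, _, Vt = np.linalg.svd((a - ca).T @ (b - cb))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # keep a proper rotation (no reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t

# toy check: the "new trip" is the reconstruction rotated by 10 degrees and shifted
theta = np.deg2rad(10)
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
recon = np.random.rand(50, 2) * 100
trip = (recon - 5.0) @ R_true.T
R, t = align_trip_to_reconstruction(trip, recon)
print(np.allclose(trip @ R.T + t, recon))   # True: the trip lands on the reconstruction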
[1:30:48.700 --> 1:30:54.700] One thing I'd like to note is that Egan just talked about how we auto-label our lanes
[1:30:54.700 --> 1:30:58.700] We have auto-labels for almost every task that we do, including our planner
[1:30:58.700 --> 1:31:01.700] And many of these are fully automatic, there's no humans involved
[1:31:01.700 --> 1:31:07.700] For example, for objects, all the kinematics, the shapes, the futures, everything just comes from auto-labeling
[1:31:07.700 --> 1:31:11.700] And the same is true for our occupancy too, and we have really just built a machine around this
[1:31:11.700 --> 1:31:15.700] Yeah, so if you can go back one slide
[1:31:15.700 --> 1:31:20.700] One more, it says parallelized on cluster
[1:31:20.700 --> 1:31:24.700] So that sounds pretty straightforward, but it really wasn't
[1:31:24.700 --> 1:31:27.700] Maybe it's fun to share how something like this comes about
[1:31:27.700 --> 1:31:33.700] So a while ago we didn't have any auto-labeling at all, and then someone makes a script
[1:31:33.700 --> 1:31:37.700] It starts to work, it starts working better, until you reach a volume that's pretty high
[1:31:37.700 --> 1:31:39.700] And we clearly need a solution
[1:31:39.700 --> 1:31:45.700] And so there were two other engineers in our team who were like, you know, that's an interesting, you know, thing
[1:31:45.700 --> 1:31:51.700] What we needed to do was build a whole graph of essentially Python functions that would need to run one after the other
[1:31:51.700 --> 1:31:56.700] First you pull the clip, then you do some cleaning, then you do some network inference, then another network inference
[1:31:56.700 --> 1:31:58.700] Until you finally get this
[1:31:58.700 --> 1:32:05.700] But so you need to do this at a large scale, so I tell them we probably need to shoot for, you know, 100,000 clips per day
[1:32:05.700 --> 1:32:08.700] Or like 100,000 items, that seems good
[1:32:08.700 --> 1:32:15.700] And so the engineers said, well, we can do, you know, a bit of Postgres and a bit of elbow grease, we can do it
[1:32:15.700 --> 1:32:22.700] Meanwhile, here we are a bit later and we're doing 20 million of these functions every single day
[1:32:22.700 --> 1:32:28.700] Again, we pull in around half a million clips and on those we run a ton of functions, each of these, in a streaming fashion
[1:32:28.700 --> 1:32:34.700] And so that's kind of the backend infra that's also needed to not just run training, but also auto-labeling
[1:32:34.700 --> 1:32:40.700] Yeah, it really is like a factory that produces labels, with production lines, yield, quality, inventory
[1:32:40.700 --> 1:32:46.700] Like all of these same concepts apply to this label factory just as they apply to, you know, the factory for our cars
[1:32:46.700 --> 1:32:48.700] That's right
[1:32:48.700 --> 1:32:52.700] Okay, thanks, Tim and Ashok
[1:32:52.700 --> 1:33:00.700] So, yeah, so concluding this section, I'd like to share a few more challenging and interesting examples, challenging for the network for sure
[1:33:00.700 --> 1:33:02.700] And even for humans, probably
[1:33:02.700 --> 1:33:12.700] So from the top, there are examples of low-light cases, or a foggy night, or a roundabout, and heavy occlusions by parked cars
[1:33:12.700 --> 1:33:15.700] And even rainy night with rain drops on camera lenses
[1:33:15.700 --> 1:33:23.700] These are challenging, but once their original scenes are fully reconstructed by other clips, all of them can be auto-labeled
[1:33:23.700 --> 1:33:27.700] So that our cars can drive even better through these challenging scenarios
[1:33:27.700 --> 1:33:33.700] So, now, let me pass the mic to David to learn more about how Sim is creating the new world on top of these labels
[1:33:33.700 --> 1:33:35.700] Thank you
[1:33:35.700 --> 1:33:40.700] Thank you, Egan
[1:33:40.700 --> 1:33:45.700] My name is David and I'm going to talk about simulation
[1:33:45.700 --> 1:33:52.700] So simulation plays a critical role in providing data that is difficult to source and or hard to label
[1:33:52.700 --> 1:33:56.700] However, 3D scenes are notoriously slow to produce
[1:33:56.700 --> 1:34:00.700] Take for example, the simulated scene playing behind me
[1:34:00.700 --> 1:34:05.700] A complex intersection from Market Street in San Francisco
[1:34:05.700 --> 1:34:07.700] It would take two weeks for artists to complete
[1:34:07.700 --> 1:34:10.700] And for us, that is painfully slow
[1:34:10.700 --> 1:34:21.700] However, I'm going to talk about using Egan's automated ground truth labels along with some brand new tooling that allows us to procedurally generate this scene and many like it in just five minutes
[1:34:21.700 --> 1:34:25.700] That's an amazing thousand times faster than before
[1:34:25.700 --> 1:34:29.700] So let's dive in to how a scene like this is created
[1:34:29.700 --> 1:34:37.700] We start by piping the automated ground truth labels into our simulated world creator tooling inside the software Houdini
[1:34:37.700 --> 1:34:44.700] Starting with road boundary labels, we can generate a solid road mesh and re-topologize it with the lane graph labels
[1:34:44.700 --> 1:34:50.700] This helps inform important road details like cross-road slope and detailed material blending
[1:34:50.700 --> 1:35:00.700] Next, we can use the line data and sweep geometry across its surface and project it to the road, creating lane paint decals
[1:35:00.700 --> 1:35:07.700] Next, using median edges, we can spawn island geometry and populate it with randomized foliage
[1:35:07.700 --> 1:35:10.700] This drastically changes the visibility of the scene
[1:35:10.700 --> 1:35:15.700] Now the outside world can be generated through a series of randomized heuristics
[1:35:15.700 --> 1:35:22.700] Modular building generators create visual obstructions while randomly placed objects like hydrants can change the color of the curbs
[1:35:22.700 --> 1:35:27.700] while trees can drop leaves below them obscuring lines or edges
[1:35:27.700 --> 1:35:34.700] Next, we can bring in map data to inform positions of things like traffic lights or stop signs
[1:35:34.700 --> 1:35:42.700] We can trace along its normal to collect important information like number of lanes and even get accurate street names on the signs themselves
[1:35:42.700 --> 1:35:51.700] Next, using lane graph, we can determine lane connectivity and spawn directional road markings on the road and their accompanying road signs
[1:35:51.700 --> 1:36:01.700] And finally, with lane graph itself, we can determine lane adjacency and other useful metrics to spawn randomized traffic permutations inside our simulator
[1:36:01.700 --> 1:36:06.700] And again, this is all automatic, no artist in the loop and happens within minutes
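A toy sketch of the "labels plus randomized heuristics" recipe: everything is derived from the ground-truth lane graph, and a seed controls the fuzzable parts, so every seed yields a different but consistent permutation of the same intersection. The element names and heuristics here are invented; the real tooling lives in Houdini and is far richer.

import random

def generate_scene(lane_graph, seed=0):
    # lane_graph: list of lanes with 'points' and 'connects_to' (indices of successor lanes)
    rng = random.Random(seed)
    scene = {"road_mesh": [lane["points"] for lane in lane_graph],
             "lane_paint": [lane["points"] for lane in lane_graph],
             "road_markings": [], "buildings": [], "foliage": []}
    for lane in lane_graph:
        if len(lane["connects_to"]) > 1:                    # multiple outgoing options: add arrows
            scene["road_markings"].append({"at": lane["points"][-1],
                                           "type": rng.choice(["left_arrow", "right_arrow", "straight"])})
    for _ in range(rng.randint(5, 15)):                     # randomized outside-world heuristics
        scene["buildings"].append({"height": rng.uniform(5, 40),
                                   "material": rng.choice(["brick", "glass", "stucco"])})
        scene["foliage"].append({"type": rng.choice(["oak", "palm", "shrub"]),
                                 "drops_leaves": rng.random() < 0.5})
    return scene

lanes = [{"points": [(0, 0), (0, 50)], "connects_to": [1, 2]},
         {"points": [(0, 50), (-30, 80)], "connects_to": []},
         {"points": [(0, 50), (30, 80)], "connects_to": []}]
print(len(generate_scene(lanes, seed=1)["buildings"]),
      len(generate_scene(lanes, seed=2)["buildings"]))      # different permutations, same labels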
[1:36:06.700 --> 1:36:10.700] And now this sets us up to do some pretty cool things
[1:36:10.700 --> 1:36:17.700] Since everything is based on data and heuristics, we can start to fuzz parameters to create visual variations of the single ground truth
[1:36:17.700 --> 1:36:26.700] It can be as subtle as object placement and random material swapping to more drastic changes like entirely new biomes or locations of environment
[1:36:26.700 --> 1:36:29.700] like urban, suburban, or rural
[1:36:29.700 --> 1:36:37.700] This allows us to create infinite, targeted permutations for specific ground truths that we need more ground truth for
[1:36:37.700 --> 1:36:41.700] And all this happens within a click of a button
[1:36:41.700 --> 1:36:46.700] And we can even take this one step further by altering our ground truth itself
[1:36:46.700 --> 1:36:53.700] Say John wants his network to pay more attention to directional road markings to better detect an upcoming captive left turn lane
[1:36:53.700 --> 1:37:01.700] We can start to procedurally alter our lane graph inside the simulator to help create entirely new flows through this intersection
[1:37:01.700 --> 1:37:07.700] to help focus the network's attention to the road markings to create more accurate predictions
[1:37:07.700 --> 1:37:14.700] And this is a great example of how this tooling allows us to create new data that can never be collected from the real world
[1:37:14.700 --> 1:37:22.700] And the true power of this tool is in its architecture and how we can run all tasks in parallel to infinitely scale
[1:37:22.700 --> 1:37:29.700] So you saw the tile creator tool in action converting the ground truth labels into their counterparts
[1:37:29.700 --> 1:37:37.700] Next we can use our tile extractor tool to divide this data into geohash tiles about 150 meters square in size
[1:37:37.700 --> 1:37:42.700] We then save out that data into separate geometry and instance files
[1:37:42.700 --> 1:37:49.700] This gives us a clean source of data that's easy to load and allows us to be rendering engine agnostic for the future
[1:37:49.700 --> 1:37:56.700] Then using a tile loader tool we can summon any number of those cache tiles using a geo hash ID
[1:37:56.700 --> 1:38:05.700] Currently we're loading these as roughly 5x5 or 3x3 tile sets, usually centered around fleet hotspots or interesting lane graph locations
[1:38:05.700 --> 1:38:12.700] And the tile loader also converts these tile sets into uassets for consumption by the Unreal Engine
[1:38:12.700 --> 1:38:17.700] and gives you a finished product from what you saw in the first slide
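A minimal sketch of the tile-caching idea described here, with plain JSON files and integer grid coordinates standing in for the real geohash IDs and geometry/instance formats; the actual tooling and the Unreal conversion step are not shown.

```python
import json
from pathlib import Path

CACHE_DIR = Path("tile_cache")  # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)

def save_tile(tile_id: str, geometry: dict, instances: dict) -> None:
    """Store geometry and instance data as separate, engine-agnostic files."""
    (CACHE_DIR / f"{tile_id}.geom.json").write_text(json.dumps(geometry))
    (CACHE_DIR / f"{tile_id}.inst.json").write_text(json.dumps(instances))

def load_neighborhood(center_x: int, center_y: int, n: int = 3) -> list:
    """Load an n x n block of cached tiles centered on a hotspot tile."""
    tiles, half = [], n // 2
    for dx in range(-half, half + 1):
        for dy in range(-half, half + 1):
            tile_id = f"{center_x + dx}_{center_y + dy}"
            geom = CACHE_DIR / f"{tile_id}.geom.json"
            inst = CACHE_DIR / f"{tile_id}.inst.json"
            if geom.exists() and inst.exists():
                tiles.append({
                    "id": tile_id,
                    "geometry": json.loads(geom.read_text()),
                    "instances": json.loads(inst.read_text()),
                })
    return tiles
```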
[1:38:17.700 --> 1:38:20.700] And this really sets us up for size and scale
[1:38:20.700 --> 1:38:26.700] And as you can see on the map behind us we can easily generate most of San Francisco city streets
[1:38:26.700 --> 1:38:31.700] And this didn't take years or even months of work but rather two weeks by one person
[1:38:31.700 --> 1:38:38.700] We can continue to manage and grow all this data using our PDG network inside of the tooling
[1:38:38.700 --> 1:38:43.700] This allows us to throw compute at it and regenerate all these tile sets overnight
[1:38:43.700 --> 1:38:50.700] This ensures all environments are consistent in quality and features, which is super important for training
[1:38:50.700 --> 1:38:54.700] since new ontologies and signals are constantly released
[1:38:57.700 --> 1:39:03.700] And now to come full circle, because we generated all these tile sets from ground truth data
[1:39:03.700 --> 1:39:06.700] They contain all the weird intricacies from the real world
[1:39:06.700 --> 1:39:14.700] We can combine that with the procedural, visual and traffic variety to create limitless, targeted data for the network to learn from
[1:39:14.700 --> 1:39:20.700] And that concludes the SIM section, I'll pass it to Kate to talk about how we can use all this data to improve autopilot
[1:39:20.700 --> 1:39:22.700] Thank you
[1:39:22.700 --> 1:39:36.700] Thanks David, hi everyone, my name is Kate Park and I'm here to talk about the data engine
[1:39:36.700 --> 1:39:40.700] Which is the process by which we improve our neural networks via data
[1:39:40.700 --> 1:39:45.700] We're going to show you how we deterministically solve interventions via data
[1:39:45.700 --> 1:39:48.700] And walk you through the life of this particular clip
[1:39:48.700 --> 1:39:56.700] In this scenario, autopilot is approaching a turn and incorrectly predicts that crossing vehicle as stopped for traffic
[1:39:56.700 --> 1:39:59.700] and thus a vehicle that we would slow down for
[1:39:59.700 --> 1:40:03.700] In reality, there's nobody in the car, it's just awkwardly parked
[1:40:03.700 --> 1:40:11.700] We've built this tooling to identify the mispredictions, correct the label and categorize this clip into an evaluation set
[1:40:11.700 --> 1:40:18.700] This particular clip happens to be one of 126 that we've diagnosed as challenging parked cars at turns
[1:40:18.700 --> 1:40:27.700] Because of this infra, we can curate this evaluation set without any custom engineering resources for this particular challenge case
[1:40:27.700 --> 1:40:33.700] To actually solve that challenge case requires mining thousands of examples like it
[1:40:33.700 --> 1:40:36.700] And it's something Tesla can trivially do
[1:40:36.700 --> 1:40:44.700] We simply use our data sourcing infra, request data and use the tooling shown previously to correct the labels
[1:40:44.700 --> 1:40:52.700] By surgically targeting the mispredictions of the current model, we're only adding the most valuable examples to our training set
[1:40:52.700 --> 1:41:00.700] We surgically fix 13,900 clips and because those were examples where the current model struggles
[1:41:00.700 --> 1:41:08.700] We don't even need to change the model architecture, a simple weight update with this new valuable data is enough to solve the challenge case
[1:41:08.700 --> 1:41:15.700] So you see we no longer predict that crossing vehicle as stopped, as shown in orange, but parked, as shown in red
[1:41:15.700 --> 1:41:22.700] In academia, we often see that people keep data constant, but at Tesla it's very much the opposite
[1:41:22.700 --> 1:41:31.700] We see time and time again that data is one of the best, if not the most, deterministic levers for solving these interventions
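To make the loop concrete, here is a schematic sketch of the data-engine cycle just described: mine clips where the current model is wrong, correct the labels, retrain with a weight update, and re-check the curated evaluation set. Every interface below (fleet, labeler, trainer) is a hypothetical placeholder, not Tesla's internal infrastructure.

```python
def data_engine_loop(model, challenge_case, fleet, labeler, trainer,
                     training_set, eval_set, target_accuracy=0.99, max_iters=5):
    """Schematic data-engine cycle: mine, relabel, retrain, re-evaluate."""
    for _ in range(max_iters):
        # 1. Mine fleet clips that match the challenge case and where the
        #    current model's prediction disagrees with a reviewed label.
        clips = fleet.request_clips(matching=challenge_case)
        suspects = [c for c in clips if model.predict(c) != labeler.review(c)]

        # 2. Correct the labels and add only these valuable examples
        #    to the training set.
        training_set.extend(labeler.correct(suspects))

        # 3. A simple weight update on the new data; no architecture change.
        model = trainer.update(model, training_set)

        # 4. Check the curated evaluation set for this challenge case and
        #    repeat until the intervention is resolved.
        if trainer.evaluate(model, eval_set) >= target_accuracy:
            break
    return model
```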
[1:41:31.700 --> 1:41:37.700] We just showed you the data engine loop for one challenge case, namely these parked cars at turns
[1:41:37.700 --> 1:41:41.700] But there are many challenge cases even for one signal of vehicle movement
[1:41:41.700 --> 1:41:49.700] We apply this data engine loop to every single challenge case we've diagnosed, whether it's buses, curvy roads, stopped vehicles, parking lots
[1:41:49.700 --> 1:41:55.700] And we don't just add data once, we do this again and again to perfect the semantic
[1:41:55.700 --> 1:42:02.700] In fact, this year we updated our vehicle movement signal five times and with every weight update trained on the new data
[1:42:02.700 --> 1:42:06.700] We push our vehicle movement accuracy up and up
[1:42:06.700 --> 1:42:17.700] This data engine framework applies to all our signals, whether they're 3D, multi-cam video, whether the data is human labeled, auto-labeled, or simulated
[1:42:17.700 --> 1:42:20.700] Whether it's an offline model or an online model
[1:42:20.700 --> 1:42:30.700] And Tesla is able to do this at scale because of the fleet advantage, the infra that our NG team has built, and the labeling resources that feed our networks
[1:42:30.700 --> 1:42:39.700] To train on all this data, we need a massive amount of compute, so I'll hand it off to Pete and Ganesh to talk about the Dojo supercomputing platform
[1:42:39.700 --> 1:42:40.700] Thank you
[1:42:40.700 --> 1:42:49.700] Thank you, Katie
[1:42:49.700 --> 1:42:53.700] Thanks everybody, thanks for hanging in there, we're almost there
[1:42:53.700 --> 1:42:59.700] My name is Pete Bannon, I run the custom silicon and low voltage teams at Tesla
[1:42:59.700 --> 1:43:03.700] And my name is Ganesh Renke, I run the Dojo program
[1:43:03.700 --> 1:43:10.700] Thank you
[1:43:10.700 --> 1:43:15.700] I'm frequently asked, why is a car company building a supercomputer for training?
[1:43:15.700 --> 1:43:21.700] And this question fundamentally misunderstands the nature of Tesla
[1:43:21.700 --> 1:43:25.700] At its heart, Tesla is a hardcore technology company
[1:43:25.700 --> 1:43:44.700] All across the company, people are working hard in science and engineering to advance the fundamental understanding and methods that we have available to build cars, energy solutions, robots, and anything else that we can do to improve the human condition around the world
[1:43:44.700 --> 1:43:51.700] It's a super exciting thing to be a part of, and it's a privilege to run a very small piece of it in the semiconductor group
[1:43:51.700 --> 1:43:58.700] Tonight we're going to talk a little bit about Dojo and give you an update on what we've been able to do over the last year
[1:43:58.700 --> 1:44:04.700] But before we do that, I wanted to give a little bit of background on the initial design that we started a few years ago
[1:44:04.700 --> 1:44:11.700] When we got started, the goal was to provide a substantial improvement to the training latency for our autopilot team
[1:44:11.700 --> 1:44:21.700] Some of the largest neural networks they train today run for over a month, which inhibits their ability to rapidly explore alternatives and evaluate them
[1:44:21.700 --> 1:44:29.700] So a 30X speedup would be really nice if we could provide it at a cost competitive and energy competitive way
[1:44:29.700 --> 1:44:37.700] To do that, we wanted to build a chip with a lot of arithmetic units that we could utilize at a very high efficiency
[1:44:37.700 --> 1:44:46.700] And we spent a lot of time studying whether we could do that using DRAM, various packaging ideas, all of which failed
[1:44:46.700 --> 1:44:53.700] And in the end, even though it felt like an unnatural act, we decided to reject DRAM as the primary storage medium for this system
[1:44:53.700 --> 1:44:57.700] And instead focus on SRAM embedded in the chip
[1:44:57.700 --> 1:45:08.700] SRAM provides, unfortunately, a modest amount of capacity, but extremely high bandwidth and very low latency, and that enables us to achieve high utilization with the arithmetic units
[1:45:08.700 --> 1:45:15.700] Those choices, that particular choice led to a whole bunch of other choices
[1:45:15.700 --> 1:45:22.700] For example, if you want to have virtual memory, you need page tables, they take up a lot of space, we didn't have space, so no virtual memory
[1:45:22.700 --> 1:45:35.700] So we also don't have interrupts, the accelerator is a bare-bones, raw piece of hardware that's presented to a compiler and the compiler is responsible for scheduling everything that happens in a deterministic way
[1:45:35.700 --> 1:45:39.700] So there's no need or even desire for interrupts in the system
[1:45:39.700 --> 1:45:55.700] We also chose to pursue model parallelism as a training methodology, which is not the typical approach; most machines today use data parallelism, which consumes additional memory capacity, which we obviously don't have
[1:45:55.700 --> 1:46:03.700] So all of those choices led us to build a machine that is pretty radically different from what's available today
[1:46:03.700 --> 1:46:09.700] We also had a whole bunch of other goals, one of the most important ones was no limits
[1:46:09.700 --> 1:46:17.700] So we wanted to build a compute fabric that would scale in an unbounded way for the most part, I mean obviously there's physical limits now and then
[1:46:17.700 --> 1:46:25.700] But pretty much if your model was too big for the computer, you just had to go buy a bigger computer, that's what we were looking for
[1:46:25.700 --> 1:46:35.700] Today the way machines are packaged, there's a pretty fixed ratio of, for example, GPUs, CPUs, DRAM capacity and network capacity
[1:46:35.700 --> 1:46:48.700] And we really wanted to disaggregate all that so that as models evolved, we could vary the ratios of those various elements and make the system more flexible to meet the needs of the autopilot team
[1:46:48.700 --> 1:46:58.700] And it's so true, no limits philosophy was our guiding star all the way, all of our choices were centered around that
[1:46:58.700 --> 1:47:08.700] And to the point that we didn't want traditional data center infrastructure to limit our capacity to execute these programs at speed
[1:47:08.700 --> 1:47:24.700] That's why we vertically integrated the entire data center
[1:47:24.700 --> 1:47:36.700] We could extract new levels of efficiency, we could optimize power delivery, cooling and as well as system management across the whole data center stack
[1:47:36.700 --> 1:47:43.700] Rather than doing box by box and integrating that, those boxes into data centers
[1:47:43.700 --> 1:47:53.700] And to do this, we also wanted to integrate early to figure out limits of scale for our software workloads
[1:47:53.700 --> 1:48:00.700] So we integrated Dojo environment into our autopilot software very early and we learned a lot of lessons
[1:48:00.700 --> 1:48:09.700] And today Bill Chang will go over our hardware update as well as some of the challenges that we faced along the way
[1:48:09.700 --> 1:48:18.700] And Rajiv Kurian will give you a glimpse of our compiler technology as well as go over some of our cool results
[1:48:18.700 --> 1:48:20.700] Great
[1:48:25.700 --> 1:48:28.700] Thanks Pete, thanks Ganesh
[1:48:28.700 --> 1:48:38.700] I'll start tonight with a high level vision of our system that will help set the stage for the challenges and the problems we're solving
[1:48:38.700 --> 1:48:43.700] And then also how software will then leverage this for performance
[1:48:43.700 --> 1:48:49.700] Now our vision for Dojo is to build a single unified accelerator, a very large one
[1:48:49.700 --> 1:49:03.700] Software would see a seamless compute plane with globally addressable, very fast memory and all connected together with uniform high bandwidth and low latency
[1:49:03.700 --> 1:49:08.700] Now to realize this, we need to use density to achieve performance
[1:49:08.700 --> 1:49:18.700] Now we leverage technology to get this density in order to break levels of hierarchy all the way from the chip to the scale out systems
[1:49:18.700 --> 1:49:23.700] Now silicon technology has done this for decades
[1:49:23.700 --> 1:49:31.700] Chips have followed Moore's law for density integration to get performance scaling
[1:49:31.700 --> 1:49:36.700] Now a key step in realizing that vision was our training tile
[1:49:36.700 --> 1:49:48.700] Not only can we integrate 25 dies at extremely high bandwidth, but we can scale that to any number of additional tiles by just connecting them together
[1:49:48.700 --> 1:49:57.700] Now last year we showcased our first functional training tile and at that time we already had workloads running on it
[1:49:57.700 --> 1:50:05.700] And since then the team here has been working hard and diligently to deploy this at scale
[1:50:05.700 --> 1:50:09.700] Now we've made amazing progress and had a lot of milestones along the way
[1:50:09.700 --> 1:50:13.700] And of course we've had a lot of unexpected challenges
[1:50:13.700 --> 1:50:21.700] But this is where our fail fast philosophy has allowed us to push our boundaries
[1:50:21.700 --> 1:50:26.700] Now pushing density for performance presents all new challenges
[1:50:26.700 --> 1:50:29.700] One area is power delivery
[1:50:29.700 --> 1:50:37.700] Here we need to deliver the power to our compute die and this directly impacts our top line compute performance
[1:50:37.700 --> 1:50:41.700] But we need to do this at unprecedented density
[1:50:41.700 --> 1:50:48.700] We need to be able to match our die pitch with a power density of almost 1 amp per millimeter squared
[1:50:48.700 --> 1:50:55.700] And because of the extreme integration this needs to be a multi-tiered vertical power solution
[1:50:55.700 --> 1:51:02.700] And because there's a complex heterogeneous material stack up we have to carefully manage the material transition
[1:51:02.700 --> 1:51:06.700] Especially CTE
[1:51:06.700 --> 1:51:10.700] Now why does the coefficient of thermal expansion matter in this case?
[1:51:10.700 --> 1:51:22.700] CTE is a fundamental material property and if it's not carefully managed that stack up would literally rip itself apart
[1:51:22.700 --> 1:51:28.700] We started this effort by working with vendors to develop this power solution
[1:51:28.700 --> 1:51:33.700] But we realized that we actually had to develop this in-house
[1:51:33.700 --> 1:51:41.700] Now to balance schedule and risk we built quick iterations to support both our system bring up in software development
[1:51:41.700 --> 1:51:47.700] And also to find the optimal design and stack up that would meet our final production goals
[1:51:47.700 --> 1:51:57.700] And in the end we were able to reduce CTE over 50% and improve our performance by 3x over our initial version
[1:51:57.700 --> 1:52:08.700] Now needless to say finding this optimal material stack up while maximizing performance at density is extremely difficult
[1:52:08.700 --> 1:52:12.700] Now we did have unexpected challenges along the way
[1:52:12.700 --> 1:52:19.700] Here's an example where we pushed the boundaries of integration that led to component failures
[1:52:19.700 --> 1:52:29.700] This started when we scaled up to larger and longer workloads and then intermittently a single site on a tile would fail
[1:52:29.700 --> 1:52:39.700] Now they started out as recoverable failures but as we pushed to higher and higher power these would become permanent failures
[1:52:39.700 --> 1:52:46.700] Now to understand this failure you have to understand why and how we build our power modules
[1:52:46.700 --> 1:52:53.700] Solving density at every level is the cornerstone of actually achieving our system performance
[1:52:53.700 --> 1:53:01.700] Now because our XY plane is used for high bandwidth communication everything else must be stacked vertically
[1:53:01.700 --> 1:53:08.700] This means all other components other than our die must be integrated into our power modules
[1:53:08.700 --> 1:53:15.700] Now that includes our clock and our power supplies and also our system controllers
[1:53:15.700 --> 1:53:21.700] Now in this case the failures were due to losing clock output from our oscillators
[1:53:21.700 --> 1:53:29.700] And after an extensive debug we found that the root cause was due to vibrations on the module from piezoelectric effects
[1:53:29.700 --> 1:53:33.700] of nearby capacitors
[1:53:33.700 --> 1:53:39.700] Now singing caps are not a new phenomenon and in fact very common in power design
[1:53:39.700 --> 1:53:46.700] But normally clock chips are placed in a very quiet area of the board and often not affected by power circuits
[1:53:46.700 --> 1:53:54.700] But because we needed to achieve this level of integration these oscillators need to be placed in very close proximity
[1:53:54.700 --> 1:53:59.700] Now due to our switching frequency and then the vibration resonance created
[1:53:59.700 --> 1:54:07.700] It caused out of plane vibration on our MEMS oscillator that caused it to crack
[1:54:07.700 --> 1:54:10.700] Now the solution to this problem is a multi-prong approach
[1:54:10.700 --> 1:54:16.700] We can reduce the vibration by using soft terminal caps
[1:54:16.700 --> 1:54:24.700] We can update our MEMS part with a lower Q factor for the out of plane direction
[1:54:24.700 --> 1:54:34.700] And we can also update our switching frequency to push the resonance further away from these sensitive bands
[1:54:34.700 --> 1:54:43.700] Now in addition to the density at the system level we've been making a lot of progress at the infrastructure level
[1:54:43.700 --> 1:54:53.700] We knew that we had to re-examine every aspect of the data center infrastructure in order to support our unprecedented power and cooling density
[1:54:53.700 --> 1:55:00.700] We brought in a fully custom designed CDU to support Dojo's dense cooling requirements
[1:55:00.700 --> 1:55:07.700] And the amazing part is we're able to do this at a fraction of the cost versus buying off the shelf and modifying it
[1:55:07.700 --> 1:55:15.700] And since our Dojo cabinet integrates enough power and cooling to match an entire row of standard IT racks
[1:55:15.700 --> 1:55:20.700] We need to carefully design our cabinet and infrastructure together
[1:55:20.700 --> 1:55:25.700] And we've already gone through several iterations of this cabinet to optimize this
[1:55:25.700 --> 1:55:30.700] And earlier this year we started load testing our power and cooling infrastructure
[1:55:30.700 --> 1:55:36.700] And we were able to push it over 2 megawatts before we tripped our substation and got a call from the city
[1:55:40.700 --> 1:55:44.700] Now last year we introduced only a couple of components of our system
[1:55:44.700 --> 1:55:51.700] The custom D1 die and the training tile, but we teased the ExaPod as our end goal
[1:55:51.700 --> 1:55:55.700] We'll walk through the remaining parts of our system that are required to build out this ExaPod
[1:55:58.700 --> 1:56:03.700] Now the system tray is a key part of realizing our vision of a single accelerator
[1:56:03.700 --> 1:56:11.700] It enables us to seamlessly connect tiles together, not only within the cabinet, but between cabinets
[1:56:11.700 --> 1:56:17.700] We can connect these tiles at very tight spacing across the entire accelerator
[1:56:17.700 --> 1:56:21.700] And this is how we achieve our uniform communication
[1:56:21.700 --> 1:56:30.700] This is a laminated bus bar that allows us to integrate very high power, mechanical and thermal support, and an extremely dense integration
[1:56:30.700 --> 1:56:37.700] It's 75 millimeters in height and supports 6 tiles at 135 kilograms
[1:56:37.700 --> 1:56:43.700] This is the equivalent of 3 to 4 fully loaded high performance racks
[1:56:46.700 --> 1:56:49.700] Next we need to feed data to the training tiles
[1:56:49.700 --> 1:56:53.700] This is where we've developed the Dojo interface processor
[1:56:53.700 --> 1:56:58.700] It provides our system with high bandwidth DRAM to stage our training data
[1:56:58.700 --> 1:57:09.700] And it provides full memory bandwidth to our training tiles using TTP, our custom protocol that we use to communicate across our entire accelerator
[1:57:09.700 --> 1:57:15.700] It also has high speed Ethernet that helps us extend this custom protocol over standard Ethernet
[1:57:15.700 --> 1:57:21.700] And we provide native hardware support for this with little to no software overhead
[1:57:21.700 --> 1:57:28.700] And lastly we can connect to it through a standard Gen4 PCIe interface
[1:57:30.700 --> 1:57:37.700] Now we pair 20 of these cards per tray and that gives us 640 gigabytes of high bandwidth DRAM
[1:57:37.700 --> 1:57:42.700] And this provides our disaggregated memory layer for our training tiles
[1:57:42.700 --> 1:57:48.700] These cards are a high bandwidth ingest path both through PCIe and Ethernet
[1:57:48.700 --> 1:57:56.700] They also provide a high-radix Z-connectivity path that allows shortcuts across our large Dojo accelerator
[1:57:58.700 --> 1:58:03.700] Now we actually integrate the host directly underneath our system tray
[1:58:03.700 --> 1:58:10.700] These hosts provide our ingest processing and connect to our interface processors through PCIe
[1:58:10.700 --> 1:58:17.700] These hosts can provide hardware video decoder support for video-based training
[1:58:17.700 --> 1:58:26.700] And our user applications land on these hosts so we can provide them with the standard X86 Linux environment
[1:58:29.700 --> 1:58:42.700] Now we can put two of these assemblies into one cabinet and pair it with redundant power supplies that do direct conversion of three-phase 480-volt AC power to 52-volt DC power
[1:58:42.700 --> 1:58:53.700] Now by focusing on density at every level we can realize the vision of a single accelerator
[1:58:53.700 --> 1:59:02.700] Now starting with the uniform nodes on our custom D1 die we can connect them together in our fully integrated training tile
[1:59:02.700 --> 1:59:10.700] And then finally seamlessly connecting them across cabinet boundaries to form our Dojo accelerator
[1:59:10.700 --> 1:59:19.700] And all together we can house two full accelerators in our Exapod for a combined one exa-flop of ML compute
[1:59:19.700 --> 1:59:28.700] Now all together this amount of technology and integration has only ever been done a couple of times in the history of compute
[1:59:28.700 --> 1:59:41.700] Next we'll see how software can leverage this to accelerate their performance
[1:59:41.700 --> 1:59:46.700] Thanks Bill, my name is Rajiv and I'm going to talk some numbers
[1:59:46.700 --> 1:59:54.700] So our software stack begins with the PyTorch extension that speaks to our commitment to run standard PyTorch models out of the box
[1:59:54.700 --> 2:00:01.700] We're going to talk more about our JIT compiler and the ingest pipeline that feeds the hardware with data
[2:00:01.700 --> 2:00:07.700] Abstractly, performance is tops times utilization times accelerator occupancy
[2:00:07.700 --> 2:00:15.700] We've seen how the hardware provides peak performance; it's the job of the compiler to extract utilization from the hardware while code is running on it
[2:00:15.700 --> 2:00:22.700] And it's the job of the ingest pipeline to make sure that data can be fed at a throughput high enough for the hardware to not ever starve
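That decomposition can be written down directly. In the sketch below the peak-TFLOPS and utilization figures are purely illustrative; the 4% and 97% occupancy values are the ingest improvement quoted later in this talk.

```python
def effective_tflops(peak_tflops: float, utilization: float, occupancy: float) -> float:
    """Performance = peak TFLOPS x compiler utilization x accelerator occupancy."""
    return peak_tflops * utilization * occupancy

# Peak and utilization below are illustrative; the occupancy values are the
# 4% -> 97% ingest improvement described later in the talk.
print(effective_tflops(peak_tflops=360.0, utilization=0.5, occupancy=0.04))  # 7.2   (starved by the data loader)
print(effective_tflops(peak_tflops=360.0, utilization=0.5, occupancy=0.97))  # 174.6 (once ingest keeps up)
```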
[2:00:22.700 --> 2:00:27.700] So let's talk about why communication-bound models are difficult to scale
[2:00:27.700 --> 2:00:32.700] But before that let's look at why ResNet 50-like models are easier to scale
[2:00:32.700 --> 2:00:37.700] You start off with a single accelerator, run the forward and backward passes, followed by the optimizer
[2:00:37.700 --> 2:00:42.700] Then to scale this up you run multiple copies of this on multiple accelerators
[2:00:42.700 --> 2:00:50.700] And while the gradients produced by the backward pass do need to be reduced and this introduces some communication, this can be pipelined with the backward pass
[2:00:50.700 --> 2:00:57.700] This setup scales fairly well, almost linearly
[2:00:57.700 --> 2:01:04.700] For models with much larger activations we run into a problem as soon as we want to run the forward pass
[2:01:04.700 --> 2:01:09.700] The batch size that fits in a single accelerator is often smaller than the batch norm surface
[2:01:09.700 --> 2:01:15.700] So to get around this researchers typically run this setup on multiple accelerators in sync batch norm mode
[2:01:15.700 --> 2:01:23.700] This introduces latency bound communication to the critical path of the forward pass and we already have a communication bottleneck
[2:01:23.700 --> 2:01:29.700] And while there are ways to get around this they usually involve tedious manual work best suited for a compiler
[2:01:29.700 --> 2:01:38.700] And ultimately there's no skirting around the fact that if your state does not fit in a single accelerator you can be communication bound
[2:01:38.700 --> 2:01:46.700] And even with significant efforts from our ML engineers we see such models don't scale linearly
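For reference, the two regimes being contrasted look roughly like this in generic PyTorch; this is standard data parallelism and sync batch norm, not Dojo's software stack, and it assumes a distributed process group and CUDA devices are already set up.

```python
# Generic PyTorch sketch; assumes torch.distributed.init_process_group() has
# already been called and a GPU is assigned to this process (not shown).
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def make_model():
    return nn.Sequential(nn.Conv2d(3, 64, 3), nn.BatchNorm2d(64), nn.ReLU())

# ResNet-50-like case: each replica computes its own batch-norm statistics and
# only the gradient all-reduce adds communication, which DDP overlaps with the
# backward pass -- this scales almost linearly.
ddp_model = DDP(make_model().cuda())

# Large-activation case: the per-accelerator batch is too small for stable
# statistics, so sync batch norm is used, which inserts a latency-bound
# all-reduce of mean/variance into the forward pass on every step.
sync_model = DDP(nn.SyncBatchNorm.convert_sync_batchnorm(make_model()).cuda())
```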
[2:01:46.700 --> 2:01:51.700] The Dojo system was built to make such models work at high utilization
[2:01:51.700 --> 2:01:59.700] The high density integration was built to not only accelerate the compute bound portions of a model but also the latency bound portions
[2:01:59.700 --> 2:02:06.700] Like a batch norm or the bandwidth bound portions like a gradient all-reduce or a parameter all-gather
[2:02:06.700 --> 2:02:11.700] A slice of the Dojo mesh can be carved out to run any model
[2:02:11.700 --> 2:02:18.700] The only thing users need to do is to make the slice large enough to fit a batch norm surface for their particular model
[2:02:18.700 --> 2:02:27.700] After that the partition presents itself as one large accelerator freeing the users from having to worry about the internal details of execution
[2:02:27.700 --> 2:02:32.700] And it's the job of the compiler to maintain this abstraction
[2:02:32.700 --> 2:02:41.700] Fine grain synchronization primitives and uniform low latency make it easy to accelerate all forms of parallelism across integration boundaries
[2:02:41.700 --> 2:02:47.700] Tensors are usually stored sharded in SRAM and replicated just in time for a layer's execution
[2:02:47.700 --> 2:02:51.700] We depend on the high Dojo bandwidth to hide this replication time
[2:02:51.700 --> 2:03:00.700] Tensor replication and other data transfers are overlapped with compute and the compiler can also recompute layers when it's profitable to do so
[2:03:00.700 --> 2:03:04.700] We expect most models to work out of the box
[2:03:04.700 --> 2:03:10.700] As an example we took the recently released stable diffusion model and got it running on dojo in minutes
[2:03:10.700 --> 2:03:16.700] Out of the box the compiler was able to map it in a model parallel manner on 25 dojo dies
[2:03:16.700 --> 2:03:23.700] Here are some pictures of a Cybertruck on Mars generated by stable diffusion running on dojo
[2:03:23.700 --> 2:03:36.700] Looks like it still has some ways to go before matching the Tesla design studio team
[2:03:36.700 --> 2:03:40.700] So we've talked about how communication bottlenecks can hamper scalability
[2:03:40.700 --> 2:03:46.700] Perhaps an acid test of a compiler and the underlying hardware is executing a cross die batch norm layer
[2:03:46.700 --> 2:03:49.700] Like mentioned before this can be a serial bottleneck
[2:03:49.700 --> 2:03:55.700] The communication phase of a batch norm begins with nodes computing their local mean and standard deviations
[2:03:55.700 --> 2:04:02.700] Then coordinating to reduce these values, then broadcasting these values back and then they resume their work in parallel
[2:04:02.700 --> 2:04:07.700] So what would an ideal batch norm look like on 25 dojo dies?
[2:04:07.700 --> 2:04:12.700] Let's say the previous layer's activations are already split across dies
[2:04:12.700 --> 2:04:20.700] We would expect the 350 nodes on each die to coordinate and produce die local mean and standard deviation values
[2:04:20.700 --> 2:04:26.700] Ideally these would get further reduced with the final value ending somewhere towards the middle of the tile
[2:04:26.700 --> 2:04:32.700] We would then hope to see a broadcast of this value radiating from the center
[2:04:32.700 --> 2:04:37.700] Let's see how the compiler actually executes a real batch norm operation across 25 dies
[2:04:37.700 --> 2:04:43.700] The communication trees were extracted from the compiler and the timing is from a real hardware run
[2:04:43.700 --> 2:04:52.700] We're about to see 8,750 nodes on 25 dies coordinating to reduce and then broadcast the batch norm mean and standard deviation values
[2:04:52.700 --> 2:04:59.700] Die local reduction followed by global reduction towards the middle of the tile
[2:04:59.700 --> 2:05:06.700] Then the reduced value broadcast radiating from the middle accelerated by the hardware's broadcast facility
[2:05:06.700 --> 2:05:14.700] This operation takes only 5 microseconds on 25 dojo dies
[2:05:14.700 --> 2:05:18.700] The same operation takes 150 microseconds on 24 GPUs
[2:05:18.700 --> 2:05:22.700] This is more than an order of magnitude improvement over GPUs
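The statistics being reduced in that cross-die batch norm are simple to state. The numpy sketch below only illustrates how per-die means and standard deviations combine into global ones; the tree-shaped reduce, the hardware broadcast, and the 5-microsecond timing are not modeled here.

```python
import numpy as np

def local_stats(x):
    """Each die computes mean and mean-of-squares over its local activations."""
    return x.mean(), (x ** 2).mean(), x.size

def combine_stats(stats):
    """Reduce step: combine per-die statistics into a global mean / std."""
    total = sum(n for _, _, n in stats)
    mean = sum(m * n for m, _, n in stats) / total
    mean_sq = sum(s * n for _, s, n in stats) / total
    return mean, np.sqrt(mean_sq - mean ** 2)

# 25 "dies", each holding a shard of the activations.
shards = [np.random.randn(1000) for _ in range(25)]
global_mean, global_std = combine_stats([local_stats(s) for s in shards])
# On Dojo, this reduce plus the broadcast of (mean, std) back to every node is
# what the 5-microsecond figure measures; here the math is just done in numpy.
```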
[2:05:22.700 --> 2:05:26.700] And while we talked about an all-reduce operation in the context of a batch norm
[2:05:26.700 --> 2:05:32.700] It's important to reiterate that the same advantages apply to all other communication primitives
[2:05:32.700 --> 2:05:37.700] And these primitives are essential for large scale training
[2:05:37.700 --> 2:05:39.700] So how about full model performance?
[2:05:39.700 --> 2:05:44.700] So while we think that ResNet 50 is not a good representation of real world Tesla workloads
[2:05:44.700 --> 2:05:48.700] It is a standard benchmark, so let's start there
[2:05:48.700 --> 2:05:51.700] We are already able to match the A100 die for die
[2:05:51.700 --> 2:05:58.700] However, perhaps a hint of dojo's capabilities is that we're able to hit this number with just a batch of 8 per die
[2:05:58.700 --> 2:06:02.700] But dojo was really built to tackle larger complex models
[2:06:02.700 --> 2:06:08.700] So when we set out to tackle real world workloads, we looked at the usage patterns of our current GPU cluster
[2:06:08.700 --> 2:06:14.700] And two models stood out, the autolabeling networks, a class of offline models that are used to generate ground truth
[2:06:14.700 --> 2:06:17.700] And the occupancy networks that you heard about
[2:06:17.700 --> 2:06:22.700] The autolabeling networks are large models that have high arithmetic intensity
[2:06:22.700 --> 2:06:25.700] While the occupancy networks can be ingest bound
[2:06:25.700 --> 2:06:30.700] We chose these models because together they account for a large chunk of our current GPU cluster usage
[2:06:30.700 --> 2:06:36.700] And they would challenge the system in different ways
[2:06:36.700 --> 2:06:38.700] So how do we do on these two networks?
[2:06:38.700 --> 2:06:46.700] The results we're about to see were measured on multi die systems for both the GPU and dojo, but normalized to per die numbers
[2:06:46.700 --> 2:06:51.700] On our autolabeling network, we're already able to surpass the performance of an A100
[2:06:51.700 --> 2:06:55.700] With our current hardware running on our older generation VRMs
[2:06:55.700 --> 2:07:01.700] On our production hardware with our newer VRMs, that translates to doubling the throughput of an A100
[2:07:01.700 --> 2:07:09.700] And our model showed that with some key compiler optimizations, we could get to more than 3x the performance of an A100
[2:07:09.700 --> 2:07:13.700] We see even bigger leaps on the occupancy network
[2:07:13.700 --> 2:07:27.700] Almost 3x with our production hardware, with room for more
[2:07:27.700 --> 2:07:29.700] So what does that mean for Tesla?
[2:07:29.700 --> 2:07:42.700] With a current level of compiler performance, we could replace the ML compute of 1, 2, 3, 4, 5 and 6 GPU boxes with just a single dojo tile
[2:07:42.700 --> 2:07:58.700] And this dojo tile costs less than one of these GPU boxes
[2:07:58.700 --> 2:08:07.700] What it really means is that networks that took more than a month to train now take less than a week
[2:08:07.700 --> 2:08:15.700] Alas, when we measured things, it did not turn out so well. At the PyTorch level, we did not see our expected performance out of the gate
[2:08:15.700 --> 2:08:23.700] And this timeline chart shows our problem. The teeny, tiny little green bars, that's the compiled code running on the accelerator
[2:08:23.700 --> 2:08:30.700] The row is mostly white space where the hardware is just waiting for data
[2:08:30.700 --> 2:08:38.700] With our dense ML compute, dojo hosts effectively have 10x more ML compute than the GPU hosts. The data loader, running on this one host,
[2:08:38.700 --> 2:08:42.700] Simply couldn't keep up with all that ML hardware
[2:08:42.700 --> 2:08:48.700] So to solve our data loader scalability issues, we knew we had to get over the limit of this single host
[2:08:48.700 --> 2:08:55.700] The Tesla transport protocol moves data seamlessly across hosts, tiles and ingest processors
[2:08:55.700 --> 2:09:04.700] So we extended the Tesla transport protocol to work over Ethernet. We then built the dojo network interface card, the D-NIC, to leverage TTP over Ethernet
[2:09:04.700 --> 2:09:11.700] This allows any host with a D-NIC card to be able to DMA to and from other TTP endpoints
[2:09:11.700 --> 2:09:20.700] So we started with the dojo mesh, then we added a tier of data loading hosts equipped with the D-NIC card
[2:09:20.700 --> 2:09:33.700] We connected these hosts to the mesh via an Ethernet switch. Now every host in this data loading tier is capable of reaching all TTP endpoints in the dojo mesh via hardware accelerated DMA
[2:09:33.700 --> 2:09:40.700] After these optimizations went in, our occupancy went from 4% to 97%
[2:09:40.700 --> 2:09:52.700] So the data loading sections have reduced drastically and the ML hardware has kept busy
[2:09:52.700 --> 2:09:55.700] We actually expect this number to go to 100% pretty soon
[2:09:55.700 --> 2:10:03.700] After these changes went in, we saw the full expected speed up from the PyTorch layer and we were back in business
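A back-of-the-envelope sketch of why a single ingest host starves this much ML compute, and why a tier of loader hosts fixes it; every number below is invented for illustration and none are published Dojo figures.

```python
# Purely illustrative capacity planning; none of these are real Dojo numbers.
samples_per_sec_needed = 20_000        # demand from the dense ML compute
bytes_per_sample = 1_500_000           # e.g. a chunk of decoded multi-camera video
single_host_ingest_bps = 3e9           # what one host's data loader can sustain

demand_bps = samples_per_sec_needed * bytes_per_sample
print(f"demand: {demand_bps / 1e9:.0f} GB/s")                                # 30 GB/s
print(f"occupancy with 1 host: {single_host_ingest_bps / demand_bps:.0%}")   # ~10%

# A tier of data-loading hosts behind an Ethernet switch, each DMA-ing batches
# into the mesh (TTP over Ethernet in the talk), scales ingest roughly linearly:
hosts_needed = -(-demand_bps // single_host_ingest_bps)  # ceiling division
print(f"loader hosts needed: {int(hosts_needed)}")                           # 10
```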
[2:10:03.700 --> 2:10:11.700] So we started with hardware design that breaks through traditional integration boundaries in service of our vision of a single giant accelerator
[2:10:11.700 --> 2:10:15.700] We've seen how the compiler and ingest layers build on top of that hardware
[2:10:15.700 --> 2:10:22.700] So after proving our performance on these complex real-world networks, we knew what our first large-scale deployment would target
[2:10:22.700 --> 2:10:26.700] Our high arithmetic intensity auto-labeling networks
[2:10:26.700 --> 2:10:31.700] Today that occupies 4,000 GPUs over 72 GPU racks
[2:10:31.700 --> 2:10:39.700] With our dense compute and our high performance, we expect to provide the same throughput with just 4 dojo cabinets
[2:10:47.700 --> 2:10:54.700] And these 4 dojo cabinets will be part of our first exapod that we plan to build by quarter one of 2023
[2:10:54.700 --> 2:11:05.700] This will more than double Tesla's auto-labeling capacity
[2:11:05.700 --> 2:11:14.700] The first exapod is part of a total of 7 exapods that we plan to build in Palo Alto right here across the wall
[2:11:14.700 --> 2:11:21.700] And we have a display cabinet from one of these exapods for everyone to look at
[2:11:21.700 --> 2:11:33.700] 6 tiles densely packed on a tray, 54 petaflops of compute, 640 gigabytes of high bandwidth memory, with power and hosts integrated
[2:11:33.700 --> 2:11:38.700] A lot of compute
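Working backwards from the numbers quoted for this display tray, the per-tile and per-card figures follow directly:

```python
# Derived only from figures stated in the talk: 6 tiles and 54 petaflops per
# tray, 640 GB of high-bandwidth DRAM spread across 20 interface cards.
tray_petaflops = 54
tiles_per_tray = 6
print(tray_petaflops / tiles_per_tray)    # 9.0  -> roughly 9 PFLOPS per training tile

tray_dram_gb = 640
dip_cards_per_tray = 20
print(tray_dram_gb / dip_cards_per_tray)  # 32.0 -> 32 GB of DRAM per interface card
```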
[2:11:38.700 --> 2:11:46.700] And we're building out new versions of all our cluster components and constantly improving our software to hit new limits of scale
[2:11:46.700 --> 2:11:53.700] We believe that we can get another 10x improvement with our next generation hardware
[2:11:53.700 --> 2:11:57.700] And to realize our ambitious goals, we need the best software and hardware engineers
[2:11:57.700 --> 2:12:01.700] So please come talk to us or visit tesla.com.
[2:12:01.700 --> 2:12:28.700] Alright, so hopefully that was enough detail
[2:12:28.700 --> 2:12:32.700] And now we can move to questions
[2:12:32.700 --> 2:12:40.700] And guys, I think the team can come out on stage
[2:12:40.700 --> 2:12:52.700] We really wanted to show the depth and breadth of Tesla in artificial intelligence, compute hardware, robotics actuators
[2:12:52.700 --> 2:13:01.700] And try to really shift the perception of the company away from, you know, a lot of people think we're like just a car company
[2:13:01.700 --> 2:13:03.700] Or we make cool cars, whatever
[2:13:03.700 --> 2:13:13.700] But most people have no idea that Tesla is arguably the leader in real world AI hardware and software
[2:13:13.700 --> 2:13:26.700] And that we're building what is arguably the most radical computer architecture since the Cray-1 supercomputer
[2:13:26.700 --> 2:13:35.700] And I think if you're interested in developing some of the most advanced technology in the world that's going to really affect the world in a positive way
[2:13:35.700 --> 2:13:38.700] Tesla's the place to be
[2:13:38.700 --> 2:13:42.700] So yeah, let's fire away with some questions
[2:13:42.700 --> 2:13:50.700] I think there's a mic at the front and a mic at the back
[2:13:50.700 --> 2:13:55.700] Just throw mics at people
[2:13:55.700 --> 2:13:57.700] Jump all for the mic
[2:13:57.700 --> 2:14:00.700] Yeah, hi, thank you very much
[2:14:00.700 --> 2:14:03.700] I was impressed here
[2:14:03.700 --> 2:14:09.700] I was impressed very much by Optimus, but I wonder about the hand
[2:14:09.700 --> 2:14:12.700] Why did you choose a tendon-driven approach for the hand?
[2:14:12.700 --> 2:14:15.700] Because tendons are not very durable
[2:14:15.700 --> 2:14:20.700] And why spring-loaded?
[2:14:20.700 --> 2:14:24.700] Cool, awesome, yes, that's a great question
[2:14:24.700 --> 2:14:32.700] You know, when it comes to any type of actuation scheme, there's trade-offs between, you know, whether or not it's a tendon-driven system or some type of linkage-based system
[2:14:32.700 --> 2:14:34.700] Keep the mic close to your mouth
[2:14:34.700 --> 2:14:36.700] A little bit closer, hear me?
[2:14:36.700 --> 2:14:38.700] Cool
[2:14:38.700 --> 2:14:49.700] Yeah, the main reason why we went for a tendon-based system is that, you know, first we actually investigated some synthetic tendons, but we found that metallic Bowden cables are, you know, a lot stronger
[2:14:49.700 --> 2:14:55.700] One of the advantages of these cables is that it's very good for part reduction
[2:14:55.700 --> 2:15:04.700] We do want to make a lot of these hands, so having a bunch of parts, a bunch of small linkages ends up being, you know, a problem when you're making a lot of something
[2:15:04.700 --> 2:15:12.700] One of the big reasons that, you know, tendons are better than linkages in a sense is that you can be anti-backlash
[2:15:12.700 --> 2:15:20.700] So anti-backlash essentially, you know, allows you to not have any gaps or, you know, stuttering motion in your fingers
[2:15:20.700 --> 2:15:27.700] Spring-loaded, mainly what spring-loaded allows us to do is allows us to have active opening
[2:15:27.700 --> 2:15:38.700] So instead of having to have two actuators to drive the fingers closed and then open, we have the ability to, you know, have the tendon drive them closed and then the springs passively extend
[2:15:38.700 --> 2:15:45.700] And this is something that's seen in our hands as well, right? We have the ability to actively flex and then we also have the ability to extend
[2:15:45.700 --> 2:15:47.700] Yeah
[2:15:47.700 --> 2:15:53.700] I mean, our goal with Optimus is to have a robot that is maximally useful as quickly as possible
[2:15:53.700 --> 2:15:58.700] So there's a lot of ways to solve the various problems of a humanoid robot
[2:15:58.700 --> 2:16:04.700] And we're probably not barking up the right tree on all the technical solutions
[2:16:04.700 --> 2:16:11.700] And I should say that we're open to evolving the technical solutions that you see here over time, they're not locked in stone
[2:16:11.700 --> 2:16:22.700] But we have to pick something, and we want to pick something that's going to allow us to produce the robot as quickly as possible and have it, like I said, be useful as quickly as possible
[2:16:22.700 --> 2:16:29.700] We're trying to follow the goal of fastest path to a useful robot that can be made at volume
[2:16:29.700 --> 2:16:38.700] And we're going to test the robot internally at Tesla in our factory and just see, like, how useful is it
[2:16:38.700 --> 2:16:45.700] Because you have to have a, you've got to close the loop on reality to confirm that the robot is in fact useful
[2:16:45.700 --> 2:16:52.700] And, yeah, so we're just going to use it to build things
[2:16:52.700 --> 2:16:56.700] And we're confident we can do that with the hand that we have currently designed
[2:16:56.700 --> 2:17:02.700] But I'm sure there'll be hand version 2, version 3, and we may change the architecture quite significantly over time
[2:17:02.700 --> 2:17:15.700] Hi, the Optimus robot is really impressive, you did a great job, bipedal robots are really difficult
[2:17:15.700 --> 2:17:24.700] But what I noticed might be missing from your plan is to acknowledge the utility of the human spirit
[2:17:24.700 --> 2:17:32.700] And I'm wondering if Optimus will ever get a personality and be able to laugh at our jokes while it folds our clothes
[2:17:32.700 --> 2:17:40.700] Yeah, absolutely. I think we want to have really fun versions of Optimus
[2:17:40.700 --> 2:17:50.700] And so that Optimus can both be utilitarian and do tasks, but can also be kind of like a friend and a buddy
[2:17:50.700 --> 2:17:58.700] And hang out with you, and I'm sure people will think of all sorts of creative uses for this robot
[2:17:58.700 --> 2:18:06.700] And, you know, the thing, once you have the core intelligence and actuators figured out
[2:18:06.700 --> 2:18:15.700] Then you can actually, you know, put all sorts of costumes, I guess, on the robot
[2:18:15.700 --> 2:18:22.700] I mean, you can make the robot look, you can skin the robot in many different ways
[2:18:22.700 --> 2:18:30.700] And I'm sure people will find very interesting ways to, yeah, versions of Optimus
[2:18:34.700 --> 2:18:36.700] Thanks for the great presentation
[2:18:36.700 --> 2:18:41.700] I wanted to know if there was an equivalent to interventions in Optimus
[2:18:41.700 --> 2:18:46.700] It seems like labeling through moments where humans disagree with what's going on is important
[2:18:46.700 --> 2:18:52.700] And in a humanoid robot, that might be also a desirable source of information
[2:19:00.700 --> 2:19:06.700] Yeah, I think we will have ways to remote operate the robot and intervene when it does something bad
[2:19:06.700 --> 2:19:09.700] Especially when we are training the robot and bringing it up
[2:19:09.700 --> 2:19:15.700] And hopefully we, you know, design it in a way that we can stop the robot from, if it's going to hit something
[2:19:15.700 --> 2:19:18.700] We can just, like, hold it and it will stop, it won't, like, you know, crush your hand or something
[2:19:18.700 --> 2:19:22.700] And those are all intervention data
[2:19:22.700 --> 2:19:24.700] Yeah, and we can learn a lot from our simulation systems, too
[2:19:24.700 --> 2:19:29.700] Where we can check for collisions and supervise that those are bad actions
[2:19:29.700 --> 2:19:38.700] Yeah, I mean, so Optimus, we want it over time to be, you know, an android, the kind of android that you've seen in sci-fi movies
[2:19:38.700 --> 2:19:41.700] Like Star Trek, The Next Generation, like data
[2:19:41.700 --> 2:19:46.700] But obviously we could program the robot to be less robot-like and more friendly
[2:19:46.700 --> 2:19:52.700] And, you know, you can obviously learn to emulate humans and feel very natural
[2:19:52.700 --> 2:19:58.700] So as AI in general improves, we can add that to the robot
[2:19:58.700 --> 2:20:07.700] And, you know, it should be obviously able to do simple instructions or even intuit what it is that you want
[2:20:07.700 --> 2:20:13.700] So you could give it a high level instruction and then it can break that down into a series of actions
[2:20:13.700 --> 2:20:16.700] And take those actions
[2:20:19.700 --> 2:20:29.700] Hi, yeah, it's exciting to think that with Optimus you can achieve orders of magnitude of improvement in economic output
[2:20:29.700 --> 2:20:32.700] That's really exciting
[2:20:32.700 --> 2:20:39.700] And when Tesla started, the mission was to accelerate the advent of renewable energy or sustainable transport
[2:20:39.700 --> 2:20:57.700] So with the Optimus, do you still see that mission being the mission statement of Tesla or is it going to be updated with, you know, mission to accelerate the advent of, I don't know, infinite abundance or limitless economy
[2:20:57.700 --> 2:21:09.700] Yeah, it is not strictly speaking, Optimus is not strictly speaking directly in line with accelerating sustainable energy
[2:21:09.700 --> 2:21:19.700] To the degree that it is more efficient at getting things done than a person, it does, I guess, help with sustainable energy
[2:21:19.700 --> 2:21:29.700] But I think the mission effectively does somewhat broaden with the advent of Optimus to, you know, I don't know, making the future awesome
[2:21:29.700 --> 2:21:37.700] So, you know, I think you look at Optimus and I know about you, but I'm excited to see what Optimus will become
[2:21:37.700 --> 2:21:51.700] And, you know, this is like, you know, if you could, I mean, you can tell like any given technology, do you want to see what it's like in a year, two years, three years, four years, five years, ten?
[2:21:51.700 --> 2:21:56.700] I'd say for sure, you definitely want to see what's happened with Optimus
[2:21:56.700 --> 2:22:01.700] Whereas, you know, a bunch of other technologies are, you know, sort of plateaued
[2:22:01.700 --> 2:22:17.700] Not to name names here, but, you know, so, I think Optimus is going to be incredible in like five years, ten years, like mind-blowing
[2:22:17.700 --> 2:22:23.700] And I'm really interested to see that happen, and I hope you are too
[2:22:23.700 --> 2:22:35.700] I have a quick question here, Justin, and I was wondering, like, are you planning to extend like conversational capabilities for the robot?
[2:22:35.700 --> 2:22:41.700] And my second full-on question to that is, what's like the end goal? What's the end goal with Optimus?
[2:22:41.700 --> 2:22:46.700] Yeah, Optimus would definitely have conversational capabilities
[2:22:46.700 --> 2:22:54.700] So, you'd be able to talk to it and have a conversation, and it would feel quite natural
[2:22:54.700 --> 2:23:09.700] So, from an end goal standpoint, I don't know, I think it's going to keep evolving, and I'm not sure where it ends up, but some place is interesting for sure
[2:23:09.700 --> 2:23:16.700] And, you know, we always have to be careful about the, you know, don't go down the terminator path
[2:23:16.700 --> 2:23:23.700] That's a, you know, I thought maybe we should start off with a video of like the terminator starting off with this, you know, skull crushing
[2:23:23.700 --> 2:23:27.700] But that might be, you know, people might take it too seriously
[2:23:27.700 --> 2:23:38.700] So, you know, we do want Optimus to be safe, so we are designing in safeguards where you can locally stop the robot
[2:23:38.700 --> 2:23:46.700] And, you know, with like basically a localized control ROM that you can't update over the internet
[2:23:46.700 --> 2:23:52.700] Which I think that's quite important, essential, frankly
[2:23:52.700 --> 2:24:03.700] So, like a localized stop button or remote control, something like that, that cannot be changed
[2:24:03.700 --> 2:24:11.700] But, I mean, it's definitely going to be interesting, it won't be boring
[2:24:11.700 --> 2:24:22.700] Okay, yeah, I see today you have a very attractive product with Dojo and its applications
[2:24:22.700 --> 2:24:25.700] So, I'm wondering what's the future for the Dojo platform?
[2:24:25.700 --> 2:24:34.700] So, you know, like provide like infrastructure and service like AWS or you will like sell the chip like the NVIDIA
[2:24:34.700 --> 2:24:41.700] So, basically, what's the future? Because I see you use 7nm, so the development cost is like easily over 10 million US dollars
[2:24:41.700 --> 2:24:46.700] How do you make the business like business wise?
[2:24:46.700 --> 2:24:55.700] Dojo is a very big computer and actually will use a lot of power and need a lot of cooling
[2:24:55.700 --> 2:25:01.700] So, I think it's probably going to make more sense to have Dojo operate in like an Amazon Web Services manner
[2:25:01.700 --> 2:25:05.700] Than to try to sell it to someone else
[2:25:05.700 --> 2:25:13.700] So, that would be the most efficient way to operate Dojo is just have it be a service that you can use
[2:25:13.700 --> 2:25:20.700] That's available online and that where you can train your models way faster and for less money
[2:25:20.700 --> 2:25:28.700] And as the world transitions to software 2.0
[2:25:28.700 --> 2:25:31.700] And that's on the bingo card
[2:25:31.700 --> 2:25:35.700] Someone I know now has to drink 5 tequilas
[2:25:35.700 --> 2:25:47.700] So, let's see, software 2.0 will use a lot of neural net training
[2:25:47.700 --> 2:25:54.700] So, it kind of makes sense that over time as there's more neural net stuff
[2:25:54.700 --> 2:26:01.700] People will want to use the fastest, lowest cost neural net training system
[2:26:01.700 --> 2:26:06.700] So, I think there's a lot of opportunity in that direction
[2:26:06.700 --> 2:26:11.700] Hi, my name is Ali Jahanian
[2:26:11.700 --> 2:26:15.700] Thank you for this event, it's very inspirational
[2:26:15.700 --> 2:26:28.700] My question is, I'm wondering what is your vision for humanoid robots that understand our emotions and art
[2:26:28.700 --> 2:26:34.700] And can contribute to our creativity
[2:26:34.700 --> 2:26:42.700] Well, I think you're already seeing robots that at least are able to generate very interesting art
[2:26:42.700 --> 2:26:47.700] Like DALL-E and DALL-E 2
[2:26:47.700 --> 2:26:55.700] And I think we'll start seeing AI that can actually generate even movies that have coherence
[2:26:55.700 --> 2:26:58.700] Like interesting movies and tell jokes
[2:26:58.700 --> 2:27:09.700] So, it's quite remarkable how fast AI is advancing at many companies besides Tesla
[2:27:09.700 --> 2:27:12.700] We're headed for a very interesting future
[2:27:12.700 --> 2:27:16.700] And yeah, so, any guys want to comment on that?
[2:27:16.700 --> 2:27:22.700] Yeah, I guess the Optimus Robot can come up with physical art, not just digital art
[2:27:22.700 --> 2:27:27.700] You can ask for some dance moves in text or voice and then it can produce those in the future
[2:27:27.700 --> 2:27:32.700] So, it's a lot of physical art, not just digital art
[2:27:32.700 --> 2:27:36.700] Oh, yeah, computers can absolutely make physical art, yeah, 100%
[2:27:36.700 --> 2:27:39.700] Yeah, like dance, play soccer or whatever you...
[2:27:39.700 --> 2:27:45.700] I mean, it needs to get more agile over time, for sure
[2:27:45.700 --> 2:27:47.700] Thanks so much for the presentation
[2:27:47.700 --> 2:27:55.700] Now, for the Tesla Autopilot slides, I noticed that the models that you were using were heavily motivated by language models
[2:27:55.700 --> 2:27:59.700] And I was wondering what the history of that was and how much of an improvement it gave
[2:27:59.700 --> 2:28:05.700] I thought that that was a really interesting, curious choice to use language models for the lane transitioning
[2:28:05.700 --> 2:28:09.700] So, there are sort of two aspects for why we transition to language modeling
[2:28:09.700 --> 2:28:10.700] So, the first...
[2:28:10.700 --> 2:28:12.700] Talk loud and close
[2:28:12.700 --> 2:28:15.700] Okay, got it
[2:28:15.700 --> 2:28:18.700] Yeah, so the language models help us in two ways
[2:28:18.700 --> 2:28:21.700] The first way is that it lets us predict lanes that we couldn't have otherwise
[2:28:21.700 --> 2:28:27.700] As Ashok mentioned earlier, basically when we predicted lanes in sort of a dense 3D fashion
[2:28:27.700 --> 2:28:32.700] You can only model certain kinds of lanes, but we want to get those criss-crossing connections inside of intersections
[2:28:32.700 --> 2:28:36.700] It's just not possible to do that without making it a graph prediction
[2:28:36.700 --> 2:28:39.700] If you try to do this with dense segmentation, it just doesn't work
[2:28:39.700 --> 2:28:42.700] Also, the lane prediction is a multimodal problem
[2:28:42.700 --> 2:28:48.700] Sometimes you just don't have sufficient visual information to know precisely how things look on the other side of the intersection
[2:28:48.700 --> 2:28:53.700] So you need a method that can generalize and produce coherent predictions
[2:28:53.700 --> 2:28:56.700] You don't want to be predicting two lanes and three lanes at the same time
[2:28:56.700 --> 2:29:00.700] You want to commit to one, and a general model like these language models provides that
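As a toy illustration of treating the lane graph as a token sequence, here is a minimal autoregressive decoder that commits to one discrete token at a time instead of blending modes; the vocabulary, architecture, and the omission of image conditioning and causal masking are all simplifications, and this is not Tesla's actual network.

```python
import torch
import torch.nn as nn

class TinyLaneDecoder(nn.Module):
    """Toy autoregressive decoder over an invented lane-token vocabulary."""
    def __init__(self, vocab_size=1024, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, start_token=0, max_len=64):
        # Causal masking and cross-attention to image features are omitted;
        # the point is only that each step commits to a single discrete token.
        tokens = [start_token]
        for _ in range(max_len):
            x = self.embed(torch.tensor([tokens]))       # (1, len, d_model)
            logits = self.head(self.backbone(x))[:, -1]  # next-token logits
            tokens.append(int(logits.argmax(-1)))        # pick one mode, no blending
        return tokens

print(TinyLaneDecoder().generate()[:8])  # a prefix of one (untrained) lane-graph hypothesis
```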
[2:29:04.700 --> 2:29:06.700] Hi
[2:29:06.700 --> 2:29:09.700] Hi, my name is Giovanni
[2:29:09.700 --> 2:29:12.700] Yeah, thanks for the presentation. It's really nice
[2:29:12.700 --> 2:29:15.700] I have a question for FSD team
[2:29:15.700 --> 2:29:21.700] For the neural networks, how do you test...
[2:29:21.700 --> 2:29:24.700] How do you do unit tests, software unit tests on that?
[2:29:24.700 --> 2:29:30.700] Do you have a bunch of, I don't know, many thousands of...
[2:29:30.700 --> 2:29:36.700] Test cases that the neural network, after you train it, has to pass
[2:29:36.700 --> 2:29:39.700] Before you release it as a product, right?
[2:29:39.700 --> 2:29:44.700] Yeah, what's your software unit testing strategies for this, basically?
[2:29:44.700 --> 2:29:50.700] Yeah, glad you asked. There's like a series of tests that we have defined starting from unit tests for software itself
[2:29:50.700 --> 2:29:56.700] But then for the neural network models, we have eval sets defined where you can define...
[2:29:56.700 --> 2:29:59.700] If you just have a large test set, that's not enough, we find
[2:29:59.700 --> 2:30:03.700] We need like sophisticated eval sets for different failure modes
[2:30:03.700 --> 2:30:06.700] And then we curate them and grow them over the life of the product
[2:30:06.700 --> 2:30:13.700] So over the years, we have like hundreds of thousands of examples where we have been failing in the past
[2:30:13.700 --> 2:30:19.700] That we have curated and so for any new model, we test against the entire history of these failures
[2:30:19.700 --> 2:30:21.700] And then keep adding to this test set
[2:30:21.700 --> 2:30:26.700] On top of this, we have shadow modes where we ship these models silently to the car
[2:30:26.700 --> 2:30:29.700] And we get data back on where they are failing or succeeding
[2:30:29.700 --> 2:30:33.700] And there's an extensive QA program
[2:30:33.700 --> 2:30:35.700] It's very hard to ship a regression
[2:30:35.700 --> 2:30:38.700] There's like nine levels of filters before it hits customers
[2:30:38.700 --> 2:30:41.700] But then we have really good infra to make this all efficient
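A minimal sketch of the kind of per-failure-mode regression gate described in this answer; the directory layout, clip format, threshold, and model interface are all hypothetical.

```python
import json
from pathlib import Path

def regression_gate(model, eval_sets_dir: str, min_accuracy: float = 0.99) -> bool:
    """Run a candidate model against every curated failure-mode eval set.

    Each subdirectory holds clips collected from past failures; the model
    must clear the threshold on all of them before it can ship.
    """
    for eval_set in Path(eval_sets_dir).iterdir():
        clips = [json.loads(p.read_text()) for p in eval_set.glob("*.json")]
        if not clips:
            continue
        correct = sum(model.predict(c["inputs"]) == c["label"] for c in clips)
        if correct / len(clips) < min_accuracy:
            print(f"FAIL: {eval_set.name} ({correct}/{len(clips)})")
            return False
    return True
```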
[2:30:43.700 --> 2:30:46.700] I'm one of the QA testers, so I QA the car...
[2:30:46.700 --> 2:30:48.700] Yeah, QA tester
[2:30:48.700 --> 2:30:57.700] Yeah, so I'm constantly in the car just QAing whatever the latest alpha build is that doesn't totally crash
[2:30:57.700 --> 2:30:59.700] Yeah, finds a lot of bugs
[2:31:02.700 --> 2:31:08.700] Hi, great event. I have a question about foundational models for autonomous driving
[2:31:08.700 --> 2:31:12.700] We have all seen that big models that really can...
[2:31:12.700 --> 2:31:19.700] When you scale up with data and model parameters from GPT-3 to PaLM, it can actually now do reasoning
[2:31:19.700 --> 2:31:26.700] Do you see that it's essential scaling up foundational models with data and size
[2:31:26.700 --> 2:31:32.700] And then at least you can get a teacher model that potentially can solve all the problems
[2:31:32.700 --> 2:31:35.700] And then you distill to a student model
[2:31:35.700 --> 2:31:40.700] Is that how you see foundational models relevant for autonomous driving?
[2:31:40.700 --> 2:31:42.700] That's quite similar to our auto labeling models
[2:31:42.700 --> 2:31:45.700] So we don't just have models that run in the car
[2:31:45.700 --> 2:31:51.700] We train models that are entirely offline that are extremely large that can't run in real time on the car
[2:31:51.700 --> 2:31:59.700] So we just run those offline on the servers producing really good labels that can then train the online networks
[2:31:59.700 --> 2:32:04.700] So that's one form of distillation of these teacher-student models
[2:32:04.700 --> 2:32:10.700] In terms of foundation models, we are building some really, really large datasets that are multiple petabytes
[2:32:10.700 --> 2:32:15.700] And we are seeing that some of these tasks work really well when we have these large datasets
[2:32:15.700 --> 2:32:21.700] Kinematics, like I mentioned, video in, all the kinematics out of all the objects and up to the fourth derivative
[2:32:21.700 --> 2:32:24.700] And people thought we couldn't do detection with cameras
[2:32:24.700 --> 2:32:26.700] Detection, depth, velocity, acceleration
[2:32:26.700 --> 2:32:32.700] And imagine how precise these have to be for these higher-order derivatives to be accurate
[2:32:32.700 --> 2:32:36.700] And this all comes from these kind of large datasets and large models
[2:32:36.700 --> 2:32:44.700] So we are seeing the equivalent of foundation models in our own way for geometry and kinematics and things like those
[2:32:44.700 --> 2:32:46.700] Do you want to add anything, John?
[2:32:46.700 --> 2:32:48.700] Yeah, I'll keep it brief
[2:32:48.700 --> 2:32:57.700] Basically, whenever we train on a larger dataset, we see big improvements in our model performance
[2:32:57.700 --> 2:33:03.700] And basically, whenever we initialize our networks with some pre-training steps from some other auxiliary tasks
[2:33:03.700 --> 2:33:05.700] We basically see improvements
[2:33:05.700 --> 2:33:08.700] Self-supervised or supervised pre-training with large datasets, both help a lot
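A hedged sketch of the offline-teacher / online-student pattern described above, written with PyTorch. The architectures, feature shapes, and random data are placeholders; only the overall flow (a large offline model produces auto-labels, a small real-time model trains on them) follows the description.

```python
# Sketch of auto-labeling as teacher-student distillation: a large offline model
# produces labels on raw clips, and a small real-time model is trained on them.
# Architectures and data here are placeholder assumptions.

import torch
import torch.nn as nn

class OfflineTeacher(nn.Module):   # too big/slow for the car; runs on servers
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10))
    def forward(self, x):
        return self.net(x)

class OnlineStudent(nn.Module):    # small enough to run in real time in the car
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 10))
    def forward(self, x):
        return self.net(x)

teacher, student = OfflineTeacher().eval(), OnlineStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(10):                            # stand-in for a real training loop
    clips = torch.randn(32, 512)               # placeholder for video features
    with torch.no_grad():
        auto_labels = teacher(clips)           # offline model produces the labels
    opt.zero_grad()
    loss = loss_fn(student(clips), auto_labels)
    loss.backward()
    opt.step()
```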
[2:33:08.700 --> 2:33:19.700] Hi, so at the beginning, Elon said that Tesla was potentially interested in building artificial general intelligence systems
[2:33:19.700 --> 2:33:23.700] Given the potentially transformative impact of technology like that
[2:33:23.700 --> 2:33:28.700] It seems prudent to invest in technical AGI safety expertise specifically
[2:33:28.700 --> 2:33:33.700] I know Tesla does a lot of technical, narrow AI safety research
[2:33:33.700 --> 2:33:42.700] I was curious if Tesla was intending to try to build expertise in technical artificial general intelligence safety specifically
[2:33:42.700 --> 2:33:50.700] Well, I mean, if we start looking like we're going to be making a significant contribution to artificial general intelligence
[2:33:50.700 --> 2:33:55.700] Then we'll for sure invest in safety. I'm a big believer in AI safety
[2:33:55.700 --> 2:34:01.700] I think there should be an AI sort of regulatory authority at the government level
[2:34:01.700 --> 2:34:07.700] Just as there is a regulatory authority for anything that affects public safety
[2:34:07.700 --> 2:34:13.700] So we have regulatory authority for aircraft and cars and sort of food and drugs
[2:34:13.700 --> 2:34:17.700] Because they affect public safety and AI also affects public safety
[2:34:17.700 --> 2:34:22.700] So I think, and this is not really something that governments, I think, understand yet
[2:34:22.700 --> 2:34:31.700] I think there should be a referee that is trying to ensure public safety for AGI
[2:34:31.700 --> 2:34:38.700] And you think of like, well, what are the elements that are necessary to create AGI?
[2:34:38.700 --> 2:34:44.700] Like the accessible dataset is extremely important
[2:34:44.700 --> 2:34:58.700] And if you've got a large number of cars and humanoid robots processing petabytes of video data and audio data from the real world
[2:34:58.700 --> 2:35:05.700] Just like humans, that might be the biggest dataset, probably is the biggest dataset
[2:35:05.700 --> 2:35:10.700] Because in addition to that, you can obviously incrementally scan the internet
[2:35:10.700 --> 2:35:18.700] But what the internet can't quite do is have millions or hundreds of millions of cameras in the real world
[2:35:18.700 --> 2:35:23.700] Like I said, with audio and other sensors as well
[2:35:23.700 --> 2:35:28.700] So I think we probably will have the most amount of data
[2:35:28.700 --> 2:35:33.700] And probably the most amount of training power
[2:35:33.700 --> 2:35:39.700] Therefore probably we will make a contribution to AGI
[2:35:39.700 --> 2:35:47.700] Hey, I noticed the semi was back there, but we haven't talked about it too much
[2:35:47.700 --> 2:35:53.700] I was just wondering for the semi truck, what are the changes you're thinking about from a sensing perspective?
[2:35:53.700 --> 2:35:57.700] I imagine there's very different requirements obviously than just a car
[2:35:57.700 --> 2:36:00.700] And if you don't think that's true, why is that true?
[2:36:00.700 --> 2:36:04.700] No, I think basically you can drive a car
[2:36:04.700 --> 2:36:11.700] Think about what drives any vehicle, it's a biological neural net with eyes
[2:36:11.700 --> 2:36:13.700] With cameras essentially
[2:36:13.700 --> 2:36:19.700] Your primary sensors are
[2:36:19.700 --> 2:36:24.700] Two cameras on a slow gimbal, a very slow gimbal
[2:36:24.700 --> 2:36:26.700] That's your head
[2:36:26.700 --> 2:36:34.700] So if a biological neural net with two cameras on a slow gimbal can drive a semi truck
[2:36:34.700 --> 2:36:39.700] Then if you've got like eight cameras with continuous 360 degree vision
[2:36:39.700 --> 2:36:42.700] Operating at a higher frame rate and a much higher reaction rate
[2:36:42.700 --> 2:36:47.700] Then I think it is obvious that you should be able to drive a semi or any vehicle much better than human
[2:36:50.700 --> 2:36:54.700] Hi, my name is Akshay, thank you for the event
[2:36:54.700 --> 2:37:02.700] Assuming Optimus would be used for different use cases and would evolve at different speeds for these use cases
[2:37:02.700 --> 2:37:10.700] Would it be possible to sort of develop and deploy different software and hardware components independently
[2:37:10.700 --> 2:37:25.700] And deploy them in Optimus so that the overall feature development is faster for Optimus
[2:37:25.700 --> 2:37:27.700] Okay, we did not comprehend
[2:37:27.700 --> 2:37:33.700] Unfortunately our neural net did not comprehend the question
[2:37:33.700 --> 2:37:38.700] Next question
[2:37:38.700 --> 2:37:40.700] Hi, I want to switch the gear to the autopilot
[2:37:40.700 --> 2:37:47.700] So when you guys plan to roll out the FSD beta to countries other than US and Canada
[2:37:47.700 --> 2:37:54.700] And also my next question is what's the biggest bottleneck or the technology or barrier you think in the current autopilot stack
[2:37:54.700 --> 2:38:02.700] And how you envision solving that to make autopilot considerably better than a human in terms of performance metrics
[2:38:02.700 --> 2:38:05.700] Like safety assurance and the human confidence
[2:38:05.700 --> 2:38:12.700] I think you also mentioned for the FSD V11 you are going to combine the highway and the city as a single stack
[2:38:12.700 --> 2:38:18.700] And some architectural big improvements, can you maybe expand a bit on that, thank you
[2:38:18.700 --> 2:38:21.700] Well, that's a whole bunch of questions
[2:38:21.700 --> 2:38:27.700] We're hopeful to be able to, I think from a technical standpoint
[2:38:27.700 --> 2:38:37.700] It should be possible to roll out FSD beta worldwide by the end of this year
[2:38:37.700 --> 2:38:41.700] But for a lot of countries we need regulatory approval
[2:38:41.700 --> 2:38:48.700] And so we are somewhat gated by the regulatory approval in other countries
[2:38:48.700 --> 2:38:57.700] But I think from a technical standpoint it will be ready to go to a worldwide beta by the end of this year
[2:38:57.700 --> 2:39:02.700] And there's quite a big improvement that we're expecting to release next month
[2:39:02.700 --> 2:39:09.700] That will also be especially good at assessing the velocity of fast-moving cross traffic
[2:39:09.700 --> 2:39:11.700] And a bunch of other things
[2:39:11.700 --> 2:39:17.700] So, anyone want to elaborate?
[2:39:17.700 --> 2:39:22.700] I guess so, there used to be a lot of differences between production autopilot and the full self driving beta
[2:39:22.700 --> 2:39:25.700] But those differences have been getting smaller and smaller over time
[2:39:25.700 --> 2:39:34.700] I think just a few months ago we now use the same vision only object detection stack in both FSD and in the production autopilot on all vehicles
[2:39:34.700 --> 2:39:39.700] There's still a few differences, the primary one being the way that we predict lanes right now
[2:39:39.700 --> 2:39:44.700] So we upgraded the modeling of lanes so that it could handle these more complex geometries like I mentioned in the talk
[2:39:44.700 --> 2:39:48.700] In production autopilot we still use a simpler lane model
[2:39:48.700 --> 2:39:54.700] But we're extending our current FSD beta models to work in all sort of highway scenarios as well
[2:39:54.700 --> 2:40:00.700] The version of FSD beta that I drive actually does have the integrated stack
[2:40:00.700 --> 2:40:08.700] So it uses the FSD stack both in city streets and highway and it works quite well for me
[2:40:08.700 --> 2:40:14.700] But we need to validate it in all kinds of weather like heavy rain, snow, dust
[2:40:14.700 --> 2:40:24.700] And just make sure it's working better than the production stack across a wide range of environments
[2:40:24.700 --> 2:40:26.700] But we're pretty close to that
[2:40:26.700 --> 2:40:35.700] I think it's, I don't know, maybe, it'll definitely be before the end of the year and maybe November
[2:40:35.700 --> 2:40:41.700] Yeah, in our personal drives, the FSD stack on highway drives already way better than the production stack we have
[2:40:41.700 --> 2:40:47.700] And we do expect to also include the parking lot stack as a part of the FSD stack before the end of this year
[2:40:47.700 --> 2:40:56.700] So that will basically bring us to: you sit in the car in a parking lot and it drives to a parking spot at the end of the drive, before the end of this year
[2:40:56.700 --> 2:41:06.700] And the fundamental metric to optimize against is how many miles between necessary interventions
[2:41:06.700 --> 2:41:18.700] So just massively improving how many miles the car can drive in full autonomy before an intervention is required that is safety critical
[2:41:18.700 --> 2:41:29.700] So, yeah, that's the fundamental metric that we're measuring every week and we're making radical improvements on that
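A minimal sketch of the fleet metric described here, miles driven per safety-critical intervention, tracked week over week. The trip records and field names are invented for the example.

```python
# Sketch: miles per safety-critical intervention, aggregated per week.
# The trip records and field names are hypothetical.

from collections import defaultdict

trips = [
    {"week": "2022-W38", "miles": 1250.0, "critical_interventions": 1},
    {"week": "2022-W38", "miles": 900.0,  "critical_interventions": 0},
    {"week": "2022-W39", "miles": 2100.0, "critical_interventions": 1},
]

totals = defaultdict(lambda: {"miles": 0.0, "interventions": 0})
for t in trips:
    totals[t["week"]]["miles"] += t["miles"]
    totals[t["week"]]["interventions"] += t["critical_interventions"]

for week, agg in sorted(totals.items()):
    rate = agg["miles"] / max(agg["interventions"], 1)
    print(f"{week}: {rate:.0f} miles per safety-critical intervention")
```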
[2:41:29.700 --> 2:41:36.700] Hi, thank you, thank you so much for the presentation, very inspiring
[2:41:36.700 --> 2:41:40.700] My name is Daisy, I actually have a non-technical question for you
[2:41:40.700 --> 2:41:47.700] I'm curious, if you are back to your 20s, what are some of the things you wish you knew back then?
[2:41:47.700 --> 2:41:51.700] What are some advice you would give to your younger self?
[2:42:01.700 --> 2:42:05.700] Well, I'm trying to figure out something useful to say
[2:42:08.700 --> 2:42:11.700] Yeah, joining Tesla would be one thing
[2:42:11.700 --> 2:42:21.700] Yeah, I think just trying to expose yourself to as many smart people as possible
[2:42:21.700 --> 2:42:24.700] And read a lot of books
[2:42:28.700 --> 2:42:31.700] You know, I did do that though
[2:42:31.700 --> 2:42:41.700] So, I think there's some merit to just also not being necessarily too intense
[2:42:41.700 --> 2:42:48.700] And enjoying the moment a bit more, I would say to 20-something me
[2:42:48.700 --> 2:42:55.700] Just to stop and smell the roses occasionally would probably be a good idea
[2:42:55.700 --> 2:43:04.700] You know, it's like when we were developing the Falcon 1 rocket on the Kwajalein Atoll
[2:43:04.700 --> 2:43:08.700] And we had this beautiful little island that we were developing the rocket on
[2:43:08.700 --> 2:43:12.700] And not once during that entire time did I even have a drink on the beach
[2:43:12.700 --> 2:43:17.700] I'm like, I should have had a drink on the beach, that would have been fine
[2:43:17.700 --> 2:43:22.700] Thank you very much
[2:43:22.700 --> 2:43:26.700] I think you have excited all of the robotics people with Optimus
[2:43:26.700 --> 2:43:29.700] This feels very much like 10 years ago in driving
[2:43:29.700 --> 2:43:35.700] But as driving has proved to be harder than it actually looked 10 years ago
[2:43:35.700 --> 2:43:42.700] What do we know now that we didn't 10 years ago that would make, for example, AGI on a humanoid come faster?
[2:43:42.700 --> 2:43:47.700] Well, I mean, it seems to me that AGI is advancing very quickly
[2:43:47.700 --> 2:43:53.700] Hardly a week goes by without some significant announcement
[2:43:53.700 --> 2:44:04.700] And, yeah, I mean, at this point, like, AI seems to be able to win at almost any rule-based game
[2:44:04.700 --> 2:44:12.700] It's able to create extremely impressive art
[2:44:12.700 --> 2:44:21.700] Engage in conversations that are very sophisticated, you know, write essays
[2:44:21.700 --> 2:44:25.700] And these just keep improving
[2:44:25.700 --> 2:44:31.700] And there's so many more talented people working on AI
[2:44:31.700 --> 2:44:35.700] And the hardware is getting better
[2:44:35.700 --> 2:44:40.700] AI is on a super, like, a strong exponential curve of improvements
[2:44:40.700 --> 2:44:44.700] Independent of what we do at Tesla
[2:44:44.700 --> 2:44:51.700] And obviously we'll benefit somewhat from that exponential curve of improvement with AI
[2:44:51.700 --> 2:44:54.700] Like, Tesla just also has to be very good at actuators
[2:44:54.700 --> 2:45:01.700] Motors, gearboxes, controllers, power electronics, batteries, sensors
[2:45:01.700 --> 2:45:09.700] And, you know, really, like, I'd say the biggest difference between the robot on four wheels
[2:45:09.700 --> 2:45:13.700] And the robot with arms and legs is getting the actuators right
[2:45:13.700 --> 2:45:16.700] It's an actuators and sensors problem
[2:45:16.700 --> 2:45:21.700] And obviously, how you control those actuators and sensors
[2:45:21.700 --> 2:45:27.700] But it's, yeah, actuators and sensors and how you control the actuators
[2:45:27.700 --> 2:45:32.700] I don't know, we have to have, like, the ingredients necessary to create a compelling robot
[2:45:32.700 --> 2:45:37.700] And we're doing it, so...
[2:45:37.700 --> 2:45:39.700] Hi, Elon
[2:45:39.700 --> 2:45:42.700] You are actually bringing humanity to the next level
[2:45:42.700 --> 2:45:46.700] Literally, Tesla and you are bringing humanity to the next level
[2:45:46.700 --> 2:45:52.700] So, you said Optimus Prime, Optimus will be used in next Tesla factory
[2:45:52.700 --> 2:45:58.700] My question is, will a new Tesla factory be fully run by Optimus program?
[2:45:58.700 --> 2:46:05.700] And when can general public order a humanoid?
[2:46:05.700 --> 2:46:11.700] Yeah, I think it'll, you know, we're going to start Optimus with very simple tasks in the factory
[2:46:11.700 --> 2:46:14.700] You know, like maybe just, like, loading a part, like you saw in the video
[2:46:14.700 --> 2:46:19.700] You know, carrying a part from one place to another
[2:46:19.700 --> 2:46:29.700] Or loading a part into one of our more conventional robot cells that, you know, welds the body together
[2:46:29.700 --> 2:46:34.700] So we'll start, you know, just trying to, how do we make it useful at all?
[2:46:34.700 --> 2:46:39.700] And then gradually expand the number of situations where it's useful
[2:46:39.700 --> 2:46:47.700] And I think that number of situations where Optimus is useful will grow exponentially
[2:46:47.700 --> 2:46:50.700] Like really, really fast
[2:46:50.700 --> 2:46:56.700] In terms of when people can order one, I don't know, I think it's not that far away
[2:46:56.700 --> 2:47:01.700] Well, I think you mean, when can people receive one?
[2:47:01.700 --> 2:47:07.700] So, I don't know, I'm like, I'd say probably within three years
[2:47:07.700 --> 2:47:09.700] And not more than five years
[2:47:09.700 --> 2:47:18.700] Within three to five years, you could probably receive an Optimus
[2:47:18.700 --> 2:47:24.700] I feel the best way to make the progress for AGI is to involve as many smart people across the world as possible
[2:47:24.700 --> 2:47:28.700] And given the size and resource of Tesla compared to robot companies
[2:47:28.700 --> 2:47:31.700] And given the state of humanoid research at the moment
[2:47:31.700 --> 2:47:38.700] Would it make sense for the kind of Tesla to sort of open source some of the simulation hardware parts?
[2:47:38.700 --> 2:47:43.700] I think Tesla can still be the dominant platform, where it can be something like Android OS
[2:47:43.700 --> 2:47:47.700] Or like iOS, for all of humanoid research
[2:47:47.700 --> 2:47:52.700] Would that be something that rather than keeping the Optimus to just Tesla researchers
[2:47:52.700 --> 2:48:04.700] Or the factory itself can open it and let the whole world explore humanoid research?
[2:48:04.700 --> 2:48:10.700] I think we have to be careful about Optimus being potentially used in ways that are bad
[2:48:10.700 --> 2:48:13.700] Because that is one of the possible things to do
[2:48:13.700 --> 2:48:23.700] So I think we would provide Optimus where you can provide instructions to Optimus
[2:48:23.700 --> 2:48:33.700] But where those instructions are governed by some laws of robotics that you cannot overcome
[2:48:33.700 --> 2:48:46.700] So not doing harm to others and I think probably quite a few safety related things with Optimus
[2:48:46.700 --> 2:48:52.700] We'll just take maybe a few more questions and then thank you all for coming
[2:48:52.700 --> 2:48:57.700] Questions, one deep and one broad
[2:48:57.700 --> 2:49:03.700] On the deep for Optimus, what's the current and what's the ideal controller bandwidth?
[2:49:03.700 --> 2:49:09.700] And then in the broader question, there's this big advertisement for the depth and breadth of the company
[2:49:09.700 --> 2:49:15.700] What is it uniquely about Tesla that enables that?
[2:49:15.700 --> 2:49:19.700] Anyone want to tackle the bandwidth question?
[2:49:19.700 --> 2:49:22.700] So the technical bandwidth of the...
[2:49:22.700 --> 2:49:24.700] Close to your mouth and loud
[2:49:24.700 --> 2:49:30.700] For the bandwidth question, you have to understand or figure out what is the task that you want it to do
[2:49:30.700 --> 2:49:35.700] And if you took a frequency transform of that task, what is it that you want your limbs to do?
[2:49:35.700 --> 2:49:37.700] And that's where you get your bandwidth from
[2:49:37.700 --> 2:49:41.700] It's not a number that you can specifically just say you need to understand your use case
[2:49:41.700 --> 2:49:44.700] And that's where the bandwidth comes from
[2:49:44.700 --> 2:49:48.700] What is the broad question?
[2:49:48.700 --> 2:49:58.700] The breadth and depth thing, I can answer the breadth and depth
[2:49:58.700 --> 2:50:04.700] On the bandwidth question, I think we probably will just end up increasing the bandwidth
[2:50:04.700 --> 2:50:13.700] Which translates to the effective dexterity and reaction time of the robot
[2:50:13.700 --> 2:50:21.700] It's safe to say it's not one hertz and maybe you don't need to go all the way to 100 hertz
[2:50:21.700 --> 2:50:25.700] But maybe 10, 25, I don't know
[2:50:25.700 --> 2:50:29.700] Over time, I think the bandwidth will increase quite a bit
[2:50:29.700 --> 2:50:33.700] Or translate it to dexterity and latency
[2:50:33.700 --> 2:50:38.700] You'd want to minimize that over time
[2:50:38.700 --> 2:50:42.700] Minimize latency, maximize dexterity
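One way to read the "frequency transform of the task" answer: record the joint trajectory a task requires, look at its spectrum, and size the controller bandwidth to cover the frequencies that carry most of the motion's energy. A toy NumPy sketch, with a synthetic trajectory and an arbitrary 95% energy cutoff standing in for a real recording and a real requirement:

```python
# Toy sketch: estimate required controller bandwidth from a recorded joint
# trajectory by looking at where its spectral energy lives. Signal is synthetic.

import numpy as np

fs = 200.0                                   # sample rate of the recording, Hz
t = np.arange(0, 10, 1 / fs)
# Pretend this is a recorded wrist angle for a pick-and-place task:
trajectory = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.sin(2 * np.pi * 6.0 * t)

spectrum = np.abs(np.fft.rfft(trajectory)) ** 2
freqs = np.fft.rfftfreq(len(trajectory), d=1 / fs)

cumulative = np.cumsum(spectrum) / np.sum(spectrum)
bandwidth_hz = freqs[np.searchsorted(cumulative, 0.95)]  # covers 95% of energy
print(f"Controller bandwidth needed for this task: ~{bandwidth_hz:.1f} Hz")
```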
[2:50:42.700 --> 2:50:49.700] In terms of breadth and depth, I guess we're a pretty big company at this point
[2:50:49.700 --> 2:50:53.700] So we've got a lot of different areas of expertise that we necessarily had to develop
[2:50:53.700 --> 2:50:59.700] In order to make electric cars and then in order to make autonomous electric cars
[2:50:59.700 --> 2:51:04.700] Tesla is like a whole series of startups basically
[2:51:04.700 --> 2:51:10.700] And so far they've almost all been quite successful
[2:51:10.700 --> 2:51:13.700] So we must be doing something right
[2:51:13.700 --> 2:51:19.700] And I consider one of my core responsibilities in running the company
[2:51:19.700 --> 2:51:24.700] Is to have an environment where great engineers can flourish
[2:51:24.700 --> 2:51:29.700] And I think in a lot of companies, I don't know, maybe most companies
[2:51:29.700 --> 2:51:35.700] If somebody's a really talented driven engineer, they're unable to actually
[2:51:35.700 --> 2:51:40.700] Their talents are suppressed at a lot of companies
[2:51:40.700 --> 2:51:45.700] And at some of the companies, the engineering talent is suppressed
[2:51:45.700 --> 2:51:49.700] In a way that is maybe not obviously bad
[2:51:49.700 --> 2:51:53.700] But where it's just so comfortable and you paid so much money
[2:51:53.700 --> 2:51:59.700] The output you actually have to produce is so low that it's like a honey trap
[2:51:59.700 --> 2:52:04.700] So there's a few honey trap places in Silicon Valley
[2:52:04.700 --> 2:52:08.700] Where they don't necessarily seem like bad places for engineers
[2:52:08.700 --> 2:52:13.700] But you have to say like a good engineer went in and what did they get out
[2:52:13.700 --> 2:52:20.700] And the output of that engineering talent seems very low
[2:52:20.700 --> 2:52:23.700] Even though they seem to be enjoying themselves
[2:52:23.700 --> 2:52:27.700] That's why I say there are a few honey trap companies in Silicon Valley
[2:52:27.700 --> 2:52:30.700] Tesla is not a honey trap. We're demanding, and it's like
[2:52:30.700 --> 2:52:34.700] You're going to get a lot of shit done and it's going to be really cool
[2:52:34.700 --> 2:52:38.700] And it's not going to be easy
[2:52:38.700 --> 2:52:43.700] But if you are a super talented engineer
[2:52:43.700 --> 2:52:52.700] Your talents will be used I think to a greater degree than anywhere else
[2:52:52.700 --> 2:52:56.700] You know, SpaceX also that way
[2:52:56.700 --> 2:53:01.700] Hi Ilan, I have two questions
[2:53:01.700 --> 2:53:03.700] So both to the autopilot team
[2:53:03.700 --> 2:53:07.700] So the thing is like I have been following your progress for the past few years
[2:53:07.700 --> 2:53:10.700] So today you have made changes on like the lane detection
[2:53:10.700 --> 2:53:13.700] Like you said that previously you were doing instance semantic segmentation
[2:53:13.700 --> 2:53:17.700] Now you guys have built transformer models for, like, building the lanes
[2:53:17.700 --> 2:53:21.700] So what are some other common challenges which you guys are facing right now
[2:53:21.700 --> 2:53:24.700] Like which you are solving in future as a curious engineer
[2:53:24.700 --> 2:53:27.700] So that like we as a researcher can work on those
[2:53:27.700 --> 2:53:28.700] Start working on those
[2:53:28.700 --> 2:53:31.700] And the second question is like I'm really curious about the data engine
[2:53:31.700 --> 2:53:36.700] Like you guys have like told a case like where the car is stopped
[2:53:36.700 --> 2:53:41.700] So how are you finding cases which is very much similar to that from the data which you have
[2:53:41.700 --> 2:53:44.700] So a little bit more on the data engine would be great
[2:53:46.700 --> 2:53:50.700] I'll answer the first question using occupancy network as an example
[2:53:50.700 --> 2:53:55.700] So what you saw in the presentation did not exist a year ago
[2:53:55.700 --> 2:53:58.700] So we only spent one year of time
[2:53:58.700 --> 2:54:00.700] We actually shipped more than 12 occupancy networks
[2:54:00.700 --> 2:54:06.700] And to have one foundation model actually represent the entire physical world
[2:54:06.700 --> 2:54:11.700] Around you everywhere, in all conditions, is actually really, really challenging
[2:54:11.700 --> 2:54:16.700] So only a year ago we were kind of driving in a 2D world
[2:54:16.700 --> 2:54:21.700] If there's a wall and if there's a curb, we kind of represented it with the same static edge
[2:54:21.700 --> 2:54:24.700] Which is obviously you know not ideal right
[2:54:24.700 --> 2:54:28.700] There's a big difference between a curb and a wall; when you drive you make different choices, right
[2:54:28.700 --> 2:54:31.700] So after we realized that we have to go to 3D
[2:54:31.700 --> 2:54:36.700] We have to basically rethink the entire problem and think about how we address that
[2:54:36.700 --> 2:54:42.700] So this would be like one example of a challenge we have conquered in the past year
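To make the 2D-versus-3D point concrete: a curb and a wall collapse onto the same "static edge" in a flat representation, but a voxelized occupancy grid keeps the height information that tells them apart. A toy sketch with arbitrary grid resolution, not the actual occupancy network:

```python
# Toy 3D occupancy grid: a curb and a wall occupy similar footprints on the
# ground plane, but differ in height, which only the 3D representation keeps.

import numpy as np

X, Y, Z = 100, 100, 20            # voxels; say 0.2 m per voxel
occupancy = np.zeros((X, Y, Z), dtype=bool)

occupancy[40, :, :1] = True       # a curb: ~0.2 m tall along one row
occupancy[70, :, :10] = True      # a wall: ~2.0 m tall along another row

def max_height_m(grid, x, voxel_m=0.2):
    """Tallest occupied voxel column at row x, in meters."""
    heights = grid[x].sum(axis=-1) * voxel_m
    return heights.max()

print("curb height:", max_height_m(occupancy, 40), "m")   # ~0.2 m
print("wall height:", max_height_m(occupancy, 70), "m")   # ~2.0 m

# A 2D "static edge" map would mark both rows the same way; the planner could
# not tell which one the car can safely brush past.
```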
[2:54:42.700 --> 2:54:50.700] Yeah to answer the question about how we actually source examples of the tricky stopped cars
[2:54:50.700 --> 2:54:57.700] There's a few ways to go about this but two examples are one we can trigger for disagreements within our signals
[2:54:57.700 --> 2:55:01.700] So let's say that parked bit flickers between parked and driving
[2:55:01.700 --> 2:55:05.700] We'll trigger on that and get it back, and the second is we can leverage more of the shadow mode logic
[2:55:05.700 --> 2:55:10.700] So if the customer ignores the car but we think we should stop for it we'll get that data back too
[2:55:10.700 --> 2:55:16.700] So these are just various kinds of trigger logic that allow us to get those data campaigns back
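A hedged sketch of the two trigger ideas just mentioned, a predicted attribute that flickers between states, and a shadow-mode model that disagrees with the driver. The clip format and thresholds are made up for the example.

```python
# Sketch of two data-engine triggers: (1) a predicted attribute flickering
# between states, (2) the silent (shadow) model wanting to stop when the
# driver did not. Frame/event structures are hypothetical.

def flicker_trigger(parked_flags, max_flips=3):
    """Fire if the 'parked' bit flips states too often within a clip."""
    flips = sum(a != b for a, b in zip(parked_flags, parked_flags[1:]))
    return flips >= max_flips

def shadow_disagreement_trigger(model_wants_stop, driver_stopped):
    """Fire if the shadow model would stop for the object but the driver did not."""
    return model_wants_stop and not driver_stopped

clip_flags = [False, True, False, True, True, False]   # parked bit per frame
if flicker_trigger(clip_flags) or shadow_disagreement_trigger(True, False):
    print("upload clip for labeling")                  # joins a data campaign
```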
[2:55:19.700 --> 2:55:21.700] Hi
[2:55:21.700 --> 2:55:24.700] Thank you for the amazing presentation thanks so much
[2:55:24.700 --> 2:55:29.700] So there are a lot of companies that are focusing on the AGI problem
[2:55:29.700 --> 2:55:34.700] And one of the reasons why it's such a hard problem is because the problem itself is so hard to define
[2:55:34.700 --> 2:55:38.700] Several companies have several different definitions they focus on different things
[2:55:38.700 --> 2:55:43.700] So how is Tesla defining the AGI problem, and what are you focusing on specifically?
[2:55:45.700 --> 2:55:49.700] Well we're not actually specifically focused on AGI
[2:55:49.700 --> 2:55:56.700] I'm simply saying that AGI seems likely to be an emergent property of what we're doing
[2:55:56.700 --> 2:56:02.700] Because we're creating all these autonomous cars and autonomous humanoids
[2:56:02.700 --> 2:56:11.700] That are actually with a truly gigantic data stream that's coming in and being processed
[2:56:11.700 --> 2:56:18.700] It's by far the most amount of real world data and data you can't get by just searching the internet
[2:56:18.700 --> 2:56:24.700] Because you have to be out there in the world and interacting with people and interacting with the roads
[2:56:24.700 --> 2:56:29.700] And just you know it's Earth is a big place and reality is messy and complicated
[2:56:29.700 --> 2:56:36.700] So I think it's sort of like it just seems likely to be an emergent property
[2:56:36.700 --> 2:56:43.700] If you've got tens or hundreds of millions of autonomous vehicles and maybe even a comparable number of humanoids
[2:56:43.700 --> 2:56:46.700] Maybe more than that on the humanoid front
[2:56:46.700 --> 2:56:52.700] Well that's just the most amount of data and if that video is being processed
[2:56:52.700 --> 2:57:00.700] It just seems likely that the cars will definitely get way better than human drivers
[2:57:00.700 --> 2:57:08.700] And the humanoid robots will become increasingly indistinguishable from humans perhaps
[2:57:10.700 --> 2:57:17.700] And so then like I said you have this emergent property of AGI
[2:57:17.700 --> 2:57:26.700] And arguably humans collectively are sort of a superintelligence as well
[2:57:26.700 --> 2:57:30.700] Especially as we improve the data rate between humans
[2:57:30.700 --> 2:57:38.700] It's like, way back in the early days, the internet was like humanity acquiring a nervous system
[2:57:38.700 --> 2:57:47.700] Where now all of a sudden any one element of humanity could know all of the knowledge of humans by connecting to the internet
[2:57:47.700 --> 2:57:50.700] Almost all the knowledge or certainly a huge part of it
[2:57:50.700 --> 2:57:54.700] Whereas previously we would exchange information by osmosis
[2:57:54.700 --> 2:57:59.700] Like in order to transfer data so you would have to write a letter
[2:57:59.700 --> 2:58:03.700] Someone would have to carry the letter by person to another person
[2:58:03.700 --> 2:58:09.700] And then a whole bunch of things in between and then it was like
[2:58:09.700 --> 2:58:13.700] Yeah I mean it's insanely slow when you think about it
[2:58:13.700 --> 2:58:18.700] And even if you were in the Library of Congress you still didn't have access to all the world's information
[2:58:18.700 --> 2:58:24.700] And you certainly couldn't search it and obviously very few people are in the Library of Congress
[2:58:24.700 --> 2:58:33.700] So I mean, one of the great sort of equalizing elements
[2:58:33.700 --> 2:58:41.700] Like the internet has been the biggest equalizer in history in terms of access to information and knowledge
[2:58:41.700 --> 2:58:45.700] And any student of history I think would agree with this
[2:58:45.700 --> 2:58:49.700] Because you know you go back a thousand years there were very few books
[2:58:49.700 --> 2:58:54.700] And books would be incredibly expensive, and only a few people knew how to read
[2:58:54.700 --> 2:58:57.700] And only a small number of people even had a book
[2:58:57.700 --> 2:59:02.700] Now look at it like you can access any book instantly
[2:59:02.700 --> 2:59:05.700] You can learn anything basically for free
[2:59:05.700 --> 2:59:07.700] It's pretty incredible
[2:59:07.700 --> 2:59:19.700] So you know, I was asked recently what period of history I would prefer to be in the most
[2:59:19.700 --> 2:59:22.700] And my answer was right now
[2:59:22.700 --> 2:59:27.700] This is the most interesting time in history and I read a lot of history
[2:59:27.700 --> 2:59:31.700] So let's do our best to keep that going
[2:59:31.700 --> 2:59:37.700] And to go back to one of the earlier questions I would ask
[2:59:37.700 --> 2:59:50.700] The thing that's happened over time with respect to Tesla autopilot is that the neural nets have gradually absorbed more and more software
[2:59:50.700 --> 2:59:57.700] And in the limit of course you could simply take the videos as seen by the car
[2:59:57.700 --> 3:00:02.700] And compare those to the steering inputs from the steering wheel and pedals
[3:00:02.700 --> 3:00:04.700] Which are very simple inputs
[3:00:04.700 --> 3:00:09.700] And in principle you could train with nothing in between
[3:00:09.700 --> 3:00:12.700] Because that's what humans are doing with the biological neural net
[3:00:12.700 --> 3:00:21.700] You could train based on video and what trains the video is the moving of the steering wheel and the pedals
[3:00:21.700 --> 3:00:24.700] With no other software in between
[3:00:24.700 --> 3:00:28.700] We're not there yet but it's gradually going in that direction
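As a toy illustration of the limit case described above (video in, steering and pedals out, nothing hand-written in between), here is a hedged PyTorch sketch of that supervised setup. The architecture, frame sizes, and random data are placeholders, not Tesla's networks.

```python
# Toy end-to-end setup: frames in, steering angle and pedal position out,
# trained directly against what the human driver actually did.
# Everything here (shapes, architecture, data) is a placeholder.

import torch
import torch.nn as nn

class VideoToControls(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 2)      # [steering, pedal]

    def forward(self, frames):
        return self.head(self.encoder(frames))

model = VideoToControls()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for _ in range(10):                               # stand-in training loop
    frames = torch.randn(8, 3, 96, 96)            # camera frames
    human_controls = torch.randn(8, 2)            # logged steering + pedals
    loss = nn.functional.mse_loss(model(frames), human_controls)
    opt.zero_grad(); loss.backward(); opt.step()
```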
[3:00:28.700 --> 3:00:33.700] Alright, one last question
[3:00:33.700 --> 3:00:36.700] How are you going?
[3:00:36.700 --> 3:00:38.700] I think we've got a question at the front here
[3:00:38.700 --> 3:00:40.700] Hello, they're right there
[3:00:40.700 --> 3:00:42.700] We'll do two questions, fine
[3:00:42.700 --> 3:00:44.700] They're here
[3:00:44.700 --> 3:00:46.700] Thanks for such a great presentation
[3:00:46.700 --> 3:00:48.700] We'll do your question last
[3:00:48.700 --> 3:00:49.700] Okay, cool
[3:00:49.700 --> 3:00:51.700] With FSD being used by so many people
[3:00:51.700 --> 3:00:57.700] How do you evaluate the company's risk tolerance in terms of performance statistics
[3:00:57.700 --> 3:01:01.700] And do you think there needs to be more transparency or regulation from third parties
[3:01:01.700 --> 3:01:10.700] As to what's good enough and defining thresholds for performance across many miles
[3:01:15.700 --> 3:01:20.700] The number one design requirement at Tesla is safety
[3:01:20.700 --> 3:01:23.700] And that goes across the board
[3:01:23.700 --> 3:01:26.700] So in terms of the mechanical safety of the car
[3:01:26.700 --> 3:01:30.700] We have the lowest probability of injury of any cars ever tested by the government
[3:01:30.700 --> 3:01:34.700] For just a passive mechanical safety
[3:01:34.700 --> 3:01:38.700] Essentially crash structure and airbags and what not
[3:01:38.700 --> 3:01:44.700] We have the highest rating for active safety as well
[3:01:44.700 --> 3:01:51.700] And I think it's going to get to the point where the active safety is so ridiculously good
[3:01:51.700 --> 3:01:55.700] It's just absurdly better than a human
[3:01:55.700 --> 3:01:59.700] And then with respect to autopilot
[3:01:59.700 --> 3:02:05.700] We do publish broadly speaking the statistics on miles driven
[3:02:05.700 --> 3:02:08.700] With cars that have no autonomy
[3:02:08.700 --> 3:02:10.700] Tesla cars with no autonomy
[3:02:10.700 --> 3:02:14.700] With hardware one, hardware two, hardware three
[3:02:14.700 --> 3:02:17.700] And then the ones that are in FSD beta
[3:02:17.700 --> 3:02:21.700] And we see steady improvements all along the way
[3:02:21.700 --> 3:02:25.700] And sometimes there's this dichotomy of
[3:02:25.700 --> 3:02:33.700] Should you wait until the car is three times safer than a person before deploying any technology
[3:02:33.700 --> 3:02:36.700] But I think that's actually morally wrong
[3:02:36.700 --> 3:02:45.700] At the point at which you believe that adding autonomy reduces injury and death
[3:02:45.700 --> 3:02:48.700] I think you have a moral obligation to deploy it
[3:02:48.700 --> 3:02:52.700] Even though you're going to get sued and blamed by a lot of people
[3:02:52.700 --> 3:02:57.700] Because the people whose lives you saved don't know that their lives are saved
[3:02:57.700 --> 3:03:01.700] And the people who do occasionally die or get injured
[3:03:01.700 --> 3:03:08.700] Definitely know, or their estate does, that there was a problem with autopilot
[3:03:08.700 --> 3:03:13.700] That's why you have to look at the numbers in total miles driven
[3:03:13.700 --> 3:03:17.700] How many accidents occurred, how many accidents were serious, how many fatalities
[3:03:17.700 --> 3:03:20.700] And we've got well over three million cars on the road
[3:03:20.700 --> 3:03:23.700] So that's a lot of miles driven every day
[3:03:23.700 --> 3:03:25.700] And it's not going to be perfect
[3:03:25.700 --> 3:03:31.700] But what matters is that it is very clearly safer than not deploying it
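A small sketch of the kind of aggregate comparison being described, miles per accident across cohorts. All numbers below are invented placeholders, not Tesla's published statistics.

```python
# Sketch: compare miles per accident across hardware/software cohorts.
# All figures are made up for illustration.

cohorts = {
    "no_autonomy":   {"miles": 5.0e9, "accidents": 3500},
    "hw2_autopilot": {"miles": 3.0e9, "accidents": 900},
    "hw3_fsd_beta":  {"miles": 6.0e7, "accidents": 15},
}

for name, c in cohorts.items():
    print(f"{name}: {c['miles'] / c['accidents']:,.0f} miles per accident")
```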
[3:03:31.700 --> 3:03:33.700] Yeah
[3:03:33.700 --> 3:03:36.700] So, I think, last question
[3:03:39.700 --> 3:03:42.700] I think, yeah, thanks
[3:03:42.700 --> 3:03:44.700] The last question here
[3:03:50.700 --> 3:03:52.700] Okay, hi
[3:03:52.700 --> 3:03:55.700] So, I do not work on hardware
[3:03:55.700 --> 3:03:59.700] So maybe the hardware team and you guys can enlighten me
[3:03:59.700 --> 3:04:05.700] Why is it required that there be symmetry in the design of Optimus?
[3:04:05.700 --> 3:04:09.700] Because humans, we have handedness, right?
[3:04:09.700 --> 3:04:13.700] We use some set of muscles more than others
[3:04:13.700 --> 3:04:16.700] Over time there's wear and tear, right?
[3:04:16.700 --> 3:04:21.700] So maybe you'll start to see some joint failures or some actuator failures more
[3:04:21.700 --> 3:04:25.700] Over time. I understand that this is extremely early stage
[3:04:25.700 --> 3:04:31.700] Also, we as humans have based so much fantasy and fiction
[3:04:31.700 --> 3:04:33.700] On superhuman capabilities
[3:04:33.700 --> 3:04:36.700] Like all of us don't want to walk right over there
[3:04:36.700 --> 3:04:40.700] We want to extend our arms and like we have all these, you know
[3:04:40.700 --> 3:04:43.700] A lot of fantasy, fantastical designs
[3:04:43.700 --> 3:04:47.700] So considering everything else that is going on
[3:04:47.700 --> 3:04:51.700] In terms of batteries and intensity of compute
[3:04:51.700 --> 3:04:57.700] Maybe you can leverage all those aspects into coming up with something
[3:04:57.700 --> 3:05:03.700] Well, I don't know, more interesting in terms of the robot that you're building
[3:05:03.700 --> 3:05:07.700] And I'm hoping you're able to explore those directions
[3:05:07.700 --> 3:05:11.700] Yeah, I think it would be cool to have like, you know, make Inspector Gadget real
[3:05:11.700 --> 3:05:13.700] That would be pretty sweet
[3:05:13.700 --> 3:05:19.700] So, yeah, I mean, right now we just want to make basic humanoid work well
[3:05:19.700 --> 3:05:24.700] And our goal is the fastest path to a useful humanoid robot
[3:05:24.700 --> 3:05:28.700] I think this will ground us in reality, literally
[3:05:28.700 --> 3:05:33.700] And ensure that we are doing something useful
[3:05:33.700 --> 3:05:37.700] Like one of the hardest things to do is to be useful
[3:05:37.700 --> 3:05:43.700] To actually be useful, and then to have high utility under the curve
[3:05:43.700 --> 3:05:50.700] Like how much help did you provide to each person on average
[3:05:50.700 --> 3:05:52.700] And then how many people did you help?
[3:05:52.700 --> 3:05:54.700] The total utility
[3:05:54.700 --> 3:05:58.700] Like trying to actually ship useful product that people like
[3:05:58.700 --> 3:06:02.700] To a large number of people is so insanely hard
[3:06:02.700 --> 3:06:04.700] It boggles the mind
[3:06:04.700 --> 3:06:08.700] You know, that's why I can say like, man, there's a hell of a difference between a company that has shipped product
[3:06:08.700 --> 3:06:10.700] And one that has not shipped product
[3:06:10.700 --> 3:06:12.700] This is night and day
[3:06:12.700 --> 3:06:16.700] And then even once you ship product, can you make the value of the output
[3:06:16.700 --> 3:06:19.700] Worth more than the cost of the input
[3:06:19.700 --> 3:06:22.700] Which is, again, insanely difficult, especially with hardware
[3:06:22.700 --> 3:06:28.700] So, but I think over time I think it would be cool to do creative things
[3:06:28.700 --> 3:06:30.700] And have like eight arms and whatever
[3:06:30.700 --> 3:06:33.700] And have different versions
[3:06:33.700 --> 3:06:37.700] And maybe, you know, there'll be some hardware
[3:06:37.700 --> 3:06:41.700] Like companies that are able to add things to an Optimus
[3:06:41.700 --> 3:06:45.700] Like maybe we, you know, add a power port or something like that
[3:06:45.700 --> 3:06:49.700] Or attachments, you can add attachments to your Optimus
[3:06:49.700 --> 3:06:51.700] Like you can add them to your phone
[3:06:51.700 --> 3:06:54.700] There could be a lot of cool things that could be done over time
[3:06:54.700 --> 3:06:58.700] And there could be maybe an ecosystem of small companies that, or big companies that
[3:06:58.700 --> 3:07:01.700] Make add-ons for Optimus
[3:07:01.700 --> 3:07:07.700] So, with that, I'd like to thank the team for their hard work
[3:07:07.700 --> 3:07:09.700] You guys are awesome
[3:07:09.700 --> 3:07:16.700] And thank you all for coming
[3:07:16.700 --> 3:07:19.700] And for everyone online, thanks for tuning in
[3:07:19.700 --> 3:07:23.700] And I think this will be one of those great videos where you can, like
[3:07:23.700 --> 3:07:27.700] Fast forward to the bits that you find most interesting
[3:07:27.700 --> 3:07:31.700] But we try to give you a tremendous amount of detail
[3:07:31.700 --> 3:07:35.700] Literally so that you can look at the video at your leisure
[3:07:35.700 --> 3:07:39.700] And you can focus on the parts that you find interesting and skip the other parts
[3:07:39.700 --> 3:07:43.700] So, thank you all, and we'll do this, try to do this every year
[3:07:43.700 --> 3:07:46.700] And we might do a monthly podcast even
[3:07:46.700 --> 3:07:53.700] So, but I think it'll be great to sort of bring you along for the ride
[3:07:53.700 --> 3:07:56.700] And like show you what cool things are happening
[3:07:56.700 --> 3:07:58.700] And yeah, thank you
[3:07:58.700 --> 3:08:27.700] Alright, thanks
[3:08:28.700 --> 3:08:54.700] Thank you