@rencryptofish
Created September 27, 2021 01:27
Jim Keller - Moore's Law Is Not Dead
Jim Keller 0:00
All right, we ready? Go.
Eric Paulos 0:08
Okay, hi. Alright, welcome everyone to the EECS Colloquium. Thank you for being here. My name is Eric Paulos. I'm just going to give a very brief pointer: next week we will have Rodney Brooks here as a speaker. But I don't want to take up our time. You heard it here: Moore's law is not dead. So today we have a really special introduction by our very own Dean, Tsu-Jae King Liu.
Tsu-Jae King Liu 0:41
Thanks very much, Eric. Good afternoon, everyone. I'm Tsu-Jae, Dean of the College of Engineering, but I'm a professor of electrical engineering and computer sciences. And it's my pleasure to introduce Jim Keller. He's a senior VP and general manager of the Silicon Engineering Group at Intel Corporation. You might know that Moore's law refers to Gordon Moore, who's an alumnus of Berkeley. And Moore's law has really set the pace for exponential advancement in computing performance over the last 50-plus years. There has been debate for many years about Moore's law being dead, and Jim will call out some people who have said that on the record. But in any case, Jim is really well qualified, let me tell you. Intel Corporation was co-founded by Gordon Moore, so I think we would hope that Intel would continue to be the company that ensures that Moore's law is not dead. Jim has an impressive background: before coming to Intel, he had more than 20 years of experience designing microchips, computer processing units, not only with Intel's x86 architecture but also the ARM architecture, which is derived from the reduced instruction set computing architecture developed here at Berkeley. He led the development of these chips for many applications: not only servers, but PCs and mobile devices. Before Intel, he was at Tesla, where he helped them design their chip for automated driving. Before that, he was at AMD; you might have heard of AMD's recent successes with the Zen architecture CPUs. That was Jim, okay, who led that chip development effort. And before that he was at Apple, which he joined through a company acquisition. There he led the design of the A4 processor, which powered the iPhone 4, and the subsequent Apple A5 processor. So, really impressive experience, great perspective to share with us today. He did get his bachelor's degree in electrical engineering a few years ago from Pennsylvania State University. So without further ado, please join me in welcoming Jim to Berkeley EECS.
Jim Keller 3:05
Isn't this great? Well, first, it's almost 40 years; it'll be 40 years of computer design for me next summer, which has started to make me feel old. And I'm an architect. I have a title that says SVP, and I really wanted that because of the big S; I was going to put that on my shirt. I don't mostly manage; well, my staff knows better. I'm not really a manager, but I have a really good staff, so we run a big organization. Silicon engineering at Intel is like 10,000 people. It's a wild phenomenon, and I'm delighted to be there. Super fun. But when I joined, everybody was saying Moore's law is dead. And since Intel's the Moore's law company, I thought, well, that's kind of a bad career move; what am I doing there? But I'd also been thinking: people have been telling me Moore's law will be dead in 10 to 15 years for my entire career. And I realized about 10 years ago that I had stopped worrying about it, because despite its imminent demise, it kept proceeding. So at some level, you know, I don't care that much. But then I had this funny problem. Let me just walk you through a little two-by-two matrix; everybody knows how they work, right? Two axes: Moore's law is not dead, or it's dead; and you believe in Moore's law, or you don't believe in it. All right, here's the matrix, and I'm going to walk you through it a little bit, and then I'll tell you what I think. So if it's not dead and you believe in it, well, it's challenging, because you get twice as many transistors every couple of years and your designs get bigger and harder, but you're sort of ready, right? If it's dead and you think it is, you're a little delusional, right? And many people think that's how the world works, and that's fun. And then if it is dead and you don't believe in Moore's law, well, it's going to be sad, because if you're in a high-tech computer company where things aren't really changing, it's going to be a race to the bottom. And that's a different thing. Here's my problem: if it's alive and you think it's dead, your design teams, your methodology, your architects aren't getting ready for the next wave of more transistors, right? And that's a tough situation. At Intel, we had some groups where literally the design team doubled when we got twice as many transistors, because they had twice as much to do and the tools didn't scale. And so we're there. Now, I don't want to be pedantic about the details of Moore's law, because my fundamental challenge is the scaling, right? And I really want to say: the endless change in transistor count has enabled a whole bunch of changes. People say computer design is X, but it's changed so much over the last 40 years that I've been doing it, and it continues to change, and we keep reinventing how we do it. So not only is the transistor math continuing, but adjacently, memory, packaging, many, many things. And that's causing us to rethink architectures and how we do things. So I want to walk through this a little bit. And one interesting thing, again, I'm intrigued by the phenomena of how things scale. If you sort of glance down here: transistors per core have gone up 1,000x, frequency 1,000x (I'll get back to some of the details later), wafer size hasn't changed that much, transistors on a chip are way up, cost per transistor is down something like 10-million-to-one, and operating voltages are down five to 10x, depending on where you started. So, you know, things have scaled very irregularly, right?
And it depends on which technology parameter you're looking at; it's a very uneven thing, but the net-net has been very strong scaling. So everybody's seen all these headlines: "Moore's law is dead, now what?" Well, I like the other one; this happens all the time: "Everything that can be invented has been invented," in 1899. All right, there's a deep intellectual framework about how things scale and move along, and it's easy to be attached to a set of technologies and then think that's kind of the limit. And this declaration of the end has happened multiple times. Now, when I worked at Apple, I had two vendors come to us to talk about CPU performance. One said CPU performance per clock has kind of plateaued, and the future is parallel programming and accelerators.
Intel showed up and said, we're going to make 5 or 10% a year CAGRs, you know, compound annual growth of 5 to 10% a year, which sounds a lot better than plateauing, right? And our internal plan was to do a new core that was twice as fast, and then do one after that that was twice as fast. If you look at the Apple roadmap, that's what happened, right? The expectation and mindset set your direction and your possibilities. It's important. So here are some famous skeptics, and my understanding is they're both very smart people. I've talked to Dave, and I can validate that. Jensen I've never quite gotten; his whole business is bigger, faster GPUs, and why he's running around saying Moore's law is dead, I don't know. But somebody told me that one year they announced 12 nanometer and AMD was going to announce seven, and he wanted everybody to think the technology didn't matter. I don't know if that's true, but it's a fine story. But that's sort of puzzling. And then there's the funnier version of this; there's a little bit of a cultural idea about this, right? And again, my problem is: if I'm getting twice as many transistors (sometimes we say transistors, sometimes transistor count times frequency; there's a whole bunch of things that drive it), what, as computer architects, designers, toolmakers, fabricators, are we going to do with it? So we've been digging into understanding this, and it was interesting how many different things popped up. Everybody's familiar with this curve, right? The diminishing return curve says you have a technology, you do an invention, and at first it accelerates pretty fast, and then you reach the limits of it. You know, think rotary phones: super exciting, right? And at some point you had the best rotary phone you could do, and then push buttons came along. That was super exciting; the buttons got better and better. And then the touchpad. You know, it's a cascade of these things. And if you go look at the S-curves of the computing stack, they're sort of everywhere you look. We wrote assembly, hit the limits of that, and we wrote C. I worked at Digital Equipment when the VMS operating system was written in assembly, right, and everybody was moving to C. And they thought, well, we'll just make a better assembly; they called it BLISS, which is essentially assembly with a compiler on it, right? And it keeps moving along. Processor architecture: single-thread sort of plateaued for various reasons, then we went to multithread, then to GPUs. And now we're building computers that know how to parse through petabytes of data and do computation on it highly effectively. So we keep changing the game as we go through it. So, what would Gordon say? I really liked this the first time I read it; I thought it was really funny: as the cost per component falls, the number of components per circuit rises; by '75 we'll get as many as 65,000 components on a chip. And I thought, he's only off by a factor of a million, because we're at 65 billion transistors on a chip; we now measure transistors in 100 million transistors per square millimeter. But of course, he was the founder, and he did say "by 75," right? So it's a pretty simple rule, which is fun. And Moore's law famously drove what's called Bell's law, right: mainframe, mini, workstation, PC, laptop; a faster, smaller version of the same computer every 10 years or so as we shrunk it down. And there are all kinds of incredible things in the history of this.
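A quick sketch of how far apart those two mindsets land once compounded; the 5 and 10% figures and the 2x-per-core plan are from the talk, while the two-year cadence and ten-year horizon are illustrative assumptions:

```python
# Compound a 5-10% yearly improvement versus doubling every generation.
years = 10

for cagr in (0.05, 0.10):
    print(f"{cagr:.0%}/year for {years} years: {(1 + cagr) ** years:4.1f}x")

doublings = years / 2  # assume one 2x core every ~2 years
print(f"2x every 2 years for {years} years: {2 ** doublings:.0f}x")
# 1.6x and 2.6x versus 32x: the mindset sets the roadmap.
```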
IBM built an out-of-order machine out of, essentially, tubes and discrete transistors years ago, right? And this transformation kind of hit a limit at smartphones; I'll return to this point at the end. Right? There are a lot of arguments about Moore's law. One is, well, there are limits to optical scaling; there's a limit to the cost, right? And what we keep finding is, as we drive it into mass production, we always solve for cost, we've always solved for power, we've always solved for whatever limiter is in the way. And then Kurzweil, I really liked this graph he went and plotted. I don't know if it's exactly true, but it's super fun. He said computing technology has been on this log-linear curve since the 1800s, right? It didn't just start, and there have been many transitions in what you build a computer out of, but it has continued to generate this exponential growth in computing power. Which is a really fun slide.
All right. So on the one hand, individual technologies tend to be on diminishing return curves, but total computing power is on an accelerating return curve, right? Build faster computers, you get more technology ideas, you have tools to analyze stuff, you have compound growth, and it keeps going. So what's going on under the sheets? And many people have observed this: accelerating return curves are typically made out of a cascade of diminishing return curves, often of very different kinds of things. Rotary phone, keypad phone, touch phone, like that. What drives the next wave of innovation is often a very different thing, right? And then here's the human nature problem: when you really understand something well, you're hyper-aware of what's going into the innovation curve you're standing on, and it's really easy to see that it's plateauing, when in fact, as we've seen over and over and over, the next innovation curve moves us along. So that makes sense. So, one of Intel's own scientists, Paul Packan, got quoted; this is pretty funny. He got interviewed by the Wall Street Journal and said, we're not really sure we can drive the technology much further. And the next day there was a press conference with Otellini, the then-CEO of Intel, and they said, so Intel engineers think Moore's law is dead. And he said, I talked to Gordon this morning, and he's still doing fine. And then he turned around and said, who the hell is Paul Packan? Interestingly enough, Intel's a very tolerant company; he's now driving our seven-nanometer transistor definition. And Ruth, appropriately named Ruth Brain, she's super smart: the factors that go into driving transistor shrinks are complicated, and there are many, many technologies. It's easy to point your finger at one or more of them and say, I see the limits. But the details of how we drive this forward, and the trade-offs we make, are really quite interesting. So, you know, when I first started hearing about Moore's law, it was about optical scaling; we had 2D transistors and we slowly made them smaller, right? Now we change materials, low-k dielectrics; we change the transistor architecture, right? The scaling things underneath that. Does everybody remember when planarization was the big problem of semiconductors? Actually, looking around, maybe nobody does. We put one metal layer down, and then another one had to go over it, and then the next one went over that, and the bumps got higher and higher. And all the technology development was about how you keep the metal from thinning as you go to higher metal stacks. And somebody figured out how to grow oxide on it and sand it flat. Literally, they call it planarized metal, but it was sandpaper, right? Transformative. Now when you look at a metal stack, you see these beautiful flat metal layers with vias. There's so much technology in that stack; there are different metal technologies. The sandpaper got better, right? Like, literally, very smooth. Super fun. Alright, so we keep working through the layers of this stuff. I worked at Tesla for a couple of years and got to know Elon a little bit. And one of the fantastic things is how he decides what he really wants to build. He used to say: first figure out the configuration of atoms, then figure out how to put them there. Right? Now, he meant that in terms of rocket ships, car factories, lithium batteries, right?
He started building electric cars confident in this model: when you look at the fundamental physics of lithium-ion batteries, it has a cost trajectory to get there. And everybody said, well, lithium is expensive. No, it's not; it's a relatively common metal in the earth. It was expensive because the lithium mines had been sized for, get this, lithium grease for industrial applications. It was a capital problem: it's a billion dollars to build a lithium mine, but once you build it, lithium is relatively cheap. It's very low-technology extraction. Right? So the summary I tell the engineers now is: don't let the "how" constrain the "what." When you want to build something, sometimes iterating on the current set of hows is important; sometimes it's completely different.
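A toy illustration of the cascade-of-S-curves point from a couple of minutes back: individually plateauing technologies, stacked, look like one exponential. The logistic midpoints and the 10x-per-generation ceilings are made-up parameters:

```python
import math

def logistic(t, midpoint, ceiling):
    # One technology: fast early growth, then a plateau (diminishing returns).
    return ceiling / (1 + math.exp(-(t - midpoint)))

# Assume a new technology arrives every 8 "years" with a 10x higher ceiling.
for t in range(0, 41, 4):
    total = sum(logistic(t, mid, 10 ** (i + 1))
                for i, mid in enumerate(range(4, 45, 8)))
    print(f"t={t:2d}  total={total:11.1f}  log10={math.log10(total):5.2f}")
# log10(total) rises roughly linearly with t: a cascade of individually
# plateauing curves adds up to one accelerating exponential.
```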
I want to say something about power scaling, because I often hear that power scaling is the limit. Back in the late '90s I was at Digital working on Alpha computers, and we were building a supercomputer for Cray. And their line was a megawatt per teraflop, right? And they said, we really need to improve this by 10 to 100x, and we're working on it. But a megawatt... getting 100x out of that, that's a hard number, right? We just crossed a watt per teraflop: a 1,000,000x improvement in power in 20 years. And then, I think it's almost ironic that even as power scaled down, we're starting to hit the thermal limits of power flux density through a square centimeter of silicon just at the time we can really stack it properly. That's one of the ironies. But we're using stacking for a whole bunch of things, currently in memory, and we're starting to look at logic architectures where we stack it. And even though the peak power flux through the vertical is too high, there's a whole bunch of tricks to manage that power and do something about it. So I'm personally confident that we're going to keep moving the power wall; the number of changes we've already made is high, and we're not done yet. Now let's just talk about the current thought process on where we are. We had planar transistors, we went to FinFET, and now we're all building nanowires in the fab, right? Intel, TSMC, Samsung have announced it; everybody's working on it. And there's a really interesting thing: while the world thinks Moore's law is dead, all the fabs and the technologists think it's not, and everybody has now announced a 10-year roadmap for Moore's law. And if you actually look at it, from the left side to the right side here is about a 5,000-times shrink. Actually, if we scaled this drawing properly, you wouldn't be able to see it, so we just scaled it a little bit. And then, what's driving this stuff? So here's a fun slide we were staring at, where we plotted the wavelengths of light. When I first heard about this, the worry was that as features got smaller than the wavelength, the light would sort of interfere with the printing. But if you actually look at it, the wavelength went from 436 to 193 nanometers while the feature scaling was actually exponential, right, which is pretty wild. And then EUV shows up, and that really resets the wavelength, so we get back to direct printing. We were talking earlier today: it should have been, you know, introduced way back here, before we had to go crazy, but look what they actually did. So here's a different way to graph it: the wavelength on the vertical axis against the printed dimension on the bottom, right? And this is a super fun thing. When the wavelength first got too small and started to interfere, they started by tweaking the pattern, they tweaked the wavelength, they tweaked the materials, and so on. And then they started doing, you know, computational lithography. So it started in 1990 with printing with no correction, and then they started doing minor corrections, and then basically they started printing on the mask the field that you needed, so that when the light interfered a little bit, you got what you wanted. Okay? And then things started to go crazy. That's the pattern you print on a mask to print an X, right? And they got something like 20 to 50x out of that. And then here's a set of patterns, and you're probably looking at it going: what is that going to print? That?
Right. So the limitation of wavelength that everybody thought they understood wasn't the limit. And at some level, you think, well, the semiconductor business is complicated: there are physicists and chemists and materials people and machine people and optics people. The diversity of technology driving this is so high. And now we're even using AI to evaluate how we build stuff. We have big data sets about what works and what doesn't, and they're starting to do closed loop: what do we build, what do we print, what causes defects, and what do we do about it? Super interesting. So I'm going to whip through a path to 50x. I talked to a bunch of engineers at Intel and said, I really want a path to 100x scaling; this is 100 times more transistors per square millimeter. And after about three weeks they came back, a roomful of people, and they looked kind of glum, because they only got to 50x in three weeks.
And I asked them to maybe spend a couple more weeks on it, and we're still looking at it. But this is the first graph. So there are fins, right, and we have a clear line of sight to pitch-scaling the fins. This is printing fins, and there's a whole bunch of metallurgy. I do want to point out that the top of the fin is still over 100 atoms wide, right? So we're not running out of atoms; we know how to print single layers of atoms. There's a whole bunch of process steps where the metallurgy is really interesting in terms of small numbers of atoms, but the fins themselves, they're mountains, right, in terms of atomic layers. We're going to nanowire stacks, and the way we build the nanowires, we get more drive current, because with gate-all-around you get way more control of the device, especially at low voltage. We're super excited about this; it's another factor of two. And then we're going to stack the nanowires, so in the same nanowire stack we get P and N devices. And we already do wafer-to-wafer bonding. I'm not going to go into all the details of this, but we're starting to build stacks of transistors and metal stacks; there's goodness both in how the metal stack drives down to the transistors and in how we build logic functions. 3D stacking is going to be more and more important as we build stuff. Just think about a big building: if you want lots of square feet and it's one layer, the distance is the X-Y from anywhere; if it's stacked, the distance is X, Y, Z; it's way shorter. Right? And then, finally, we already do die-to-die and die-to-wafer stacking, and it's not clear how high we can take that. Now, people have said, well, that's more wafer steps and more processing and more cost. I've been talking to fabs for like 30 years now, and they always promise me that at some point in the near future wafers are going to be too expensive and the cost per transistor is not going to cross over. And it's never, ever happened. It's unbelievable how fast costs come down when we go into high-volume manufacturing and work on it, and we've seen some remarkable things. When TSMC brought up the 16FF process, it was fairly complicated; by the time they got to FFC, they had radically reduced the number of wafer steps. You can put a wafer through Samsung 14, and I probably shouldn't quote the number of days, but it would shock you how simple it is. Right? So we always figure out how to do the what, and then we always radically improve the how. Here's a funny one: wire-bond technology way back in the '80s and '90s was the limit because of the inductance in these wires. And then the wire-bond guys and the flash guys got together, and they shipped production products where a wire bonder went one, two, three, four, five, six, seven, eight; this is eight die; they got up to 16 in production. Right? So the flash cells were getting smaller, and to get more flash in the package they stacked the die, and that got too complicated, so they decided to stack layers of flash cells vertically. And I found this picture; I thought it was really funny, because that's actually Babbage's Difference Engine from 1854, right? That was at a meter; modern devices are at a nanometer: ten-to-the-ninth, a nice shrink. And here's a modern flash device, which looks a whole lot like a Babbage engine, which cracks me up. They're now up to 128 layers thick. Now, those are very simple layers. Just imagine those layers getting even more complicated, and we start building 3D devices really deep. Right?
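A sketch of the building argument: the same million unit cells laid out as one flat layer versus folded into a cube. Uniform-random endpoints and Manhattan distance are modeling assumptions, not anything from the talk:

```python
import random

random.seed(0)

def mean_manhattan(dims, samples=100_000):
    # Average Manhattan distance between two random points in a box.
    total = 0.0
    for _ in range(samples):
        a = [random.uniform(0, d) for d in dims]
        b = [random.uniform(0, d) for d in dims]
        total += sum(abs(x - y) for x, y in zip(a, b))
    return total / samples

# One million unit cells, two ways:
print(mean_manhattan([1000, 1000]))     # ~667: one flat layer
print(mean_manhattan([100, 100, 100]))  # ~100: same cells, stacked in 3D
```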
So, the limits of shrinking: I sometimes feel we're so far away. Like, I'm not sure at how many atoms across things get sort of quantum mechanical, two or three; we know how to print single layers of atoms; we're going to know how to print down to small numbers of atoms, and this is going to keep going. Okay, back to how this changes computer design. Now, the really interesting thing, and I'm an architect, right: if you go look at it, frequency moved by 1,000, transistors per core moved by 1,000, transistors per millimeter moved by 100,000. The bottom two numbers are kind of sad, because memory latency, as seen by the processor, went up 100x. Right?
Now, at four gigahertz and 100 nanoseconds, that's a lot of cycles: 400 of them. And instructions per cycle, while it's improved a lot, isn't nearly like the rest of the curves, right? And that makes sense. So when I go sit down with my partners, you know, the frequency guys and the transistor density guys, the microarchitects are the low man on the totem pole, right? And a lot of times we look at graphs like this and think. This is a famous graph, which I quite like: performance ramped linearly for a while with frequency and process scaling, and it's been tougher to ramp as the frequency stopped ramping, but we have lots more transistors now. We are continuing to move this along, and my personal belief is that some of the limits here have been mindset about how far single-threaded programming can go, because we're actually going to do something about it. But there's a whole bunch of dimensions to this. The programming guys kind of crack me up, because here's an old C program, and it did something interesting: convolve three by three. And the OpenCL looks about the same. And we wrote a PyTorch program that looks like, you know, a one-line call. And I just want to talk about abstraction layers, because this graph right here is super interesting. When we wrote assembly, you wrote a line, you got a line. You write C, it's something like 10 lines of executed code for every line you write. You go to C++, it goes up. And depending on who you are and which programs you look at, you get a graph something like this. And there are two arguments about this. One is: boy, modern programmers are inefficient, right? And if you're using JavaScript to write hello world, it's probably true. But you can write one line of code today and fire up a data center and find a cat photo, and if you tried to do that in assembly, you'd have 1,000 people working for 1,000 years and they'd fail. Right? So, the interesting thing about the abstraction layers, and this goes for computing in general: the reason processor performance isn't linear with transistor count is that it's limited by predictability: branch predictability, data predictability, sometimes instruction predictability, right? But we are building new abstraction layers to meet that need. In some places it's linear: we add a lot of floating-point extensions, bigger floating point, and that can be linear with transistors. And then the AI stuff has been superlinear, because the architectural innovation that was allowed by the huge increase in transistors came along. Alright.
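For illustration, the contrast he describes: a three-by-three convolution written with explicit loops in the old style, next to the one-call PyTorch version. A sketch; the shapes and data are arbitrary:

```python
import torch
import torch.nn.functional as F

def conv3x3_loops(img, k):
    # The "old C program" style: walk every output pixel explicitly.
    H, W = len(img), len(img[0])
    out = [[0.0] * (W - 2) for _ in range(H - 2)]
    for y in range(H - 2):
        for x in range(W - 2):
            acc = 0.0
            for dy in range(3):
                for dx in range(3):
                    acc += img[y + dy][x + dx] * k[dy][dx]
            out[y][x] = acc
    return out

# The abstraction-layer version: one call, batched, and free to run
# on whatever hardware is underneath.
img = torch.rand(1, 1, 8, 8)     # batch, channels, height, width
kernel = torch.rand(1, 1, 3, 3)
out = F.conv2d(img, kernel)
```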
I gave a talk on microarchitecture recently, and the innovation track in microarchitecture has been hilarious. Computers used to be pretty dumb: fetch, execute, write back; super small, super short pipe. And then the RISC innovation, I think, really started it: they stretched out that pipeline, doing the minimal amount of work in each stage. It was super exciting. And then machines went superscalar, and then superscalar super-pipelined, and then they started to build big out-of-order machines, big superscalar super-pipelined machines, right, and we saw sequential improvements in performance. And one of my favorite computers, because I was the architect, was the EV6, Digital's first big out-of-order machine. We fetched four instructions; we had three integer pipes and three floating-point pipes, a big execution monster; and we had a whopping 24-instruction window in the scheduler and 100 instructions in flight. Here's a, let's say, abstracted diagram of a recent Sunny Cove. The 800-instruction window sustains between three and six x86 instructions per clock; massive data predictors, massive branch predictors, right? And if you look inside carefully, it's not one big computer anymore. There's a fetch predictor, a branch predictor, an instruction fetcher, a micro-op engine where we take decoded instructions and execute them; every single piece of that computer looks more complicated than a whole computer used to. Now, going from 0.3 IPC to, let's say, three or four IPC is only 10x, but memory latency went from one clock back in the good old days to hundreds of clocks. And when you're running floating-point code, where we can really apply that acceleration, it's completely different. Now, one thing: I was talking to some software people, and one guy looking at this said, you're scaring me with the complexity of this engine; how do you ever verify it? There have been people recently saying, well, you have a million flip-flops in a computer, so in principle it has two-to-the-million states. We don't really verify the whole thing, by the way. But it's been a long time since we actually executed C programs in a computer, right? The compiler generates basic blocks that average six instructions; we put them in an instruction window of 800 instructions; we then go find many, many parallel dependency graphs and execute those in parallel. And we're working on a generation that's significantly bigger than this, and closer to the linear curve on performance, right? This is a really big mindset change. So, let me finish up here in a few minutes. Why do we make new computers? This is just data from the web somewhere. If you just look at the total numbers: IPC went way up; frequency went from five megahertz, and we're shipping 4.2-gigahertz Ice Lakes; performance is up 14,000x. These are huge scaling numbers. And lots of people think, well, we're hitting some kind of limit. I really doubt it. We have a roadmap to 50x more performance on 50x more transistors, and huge steps to make on every single piece of the stack. Right? And remember, computers are built by large numbers of people, but actually many, many small teams, right: better prediction, better instruction set architecture, better optimization, better compilers, better libraries. The number of different places where we're doing innovation is really, really high. Right? It's not one thing.
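A toy sketch of what a big instruction window buys: once register dependencies are known, everything whose inputs are ready can issue together. The instruction list and the unit latency are invented:

```python
def dataflow_cycles(instrs):
    # Each instruction issues one cycle after its last source is produced
    # (unit latency, unlimited execution units: the ideal dataflow limit).
    ready = {}  # register -> cycle its value is available
    last = 0
    for dest, srcs in instrs:
        issue = max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = issue + 1
        last = max(last, issue + 1)
    return last

# Two independent dependency chains interleaved, the kind of parallel
# dependency graphs a big out-of-order window can discover:
window = [("r1", ["r0"]), ("r5", ["r4"]),
          ("r2", ["r1"]), ("r6", ["r5"]),
          ("r3", ["r2"]), ("r7", ["r6"])]
print(len(window) / dataflow_cycles(window))  # IPC = 2.0
```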
I call this Roger's law, because Roger was the first one to show me this graph, but I think it's an interesting phenomenon. Right? We had single-core, single-threaded computing, and it went up and kind of plateaued a little bit. And then we went to multithreading, right? And then we went to GPU computing, and now we're working on AI computing; we're working on, essentially, computers that compute across very large, very sparse data sets with very high computational intensity. And this is one place where Moore's law is enabling architects to have a field day, right? And you can see it when the idea sets are powerful. When we first built parallel computers, lots of people thought: we'll make a parallelizing compiler; it'll take your C program, parallelize it, and make it work. I saw the first one of those in 1985; it didn't work at all, right? And parallelizing code was something only a small number of people could do. I still remember when Google built massive data centers with low-cost servers. They built the Google File System anybody could use; they used MapReduce and a bunch of tools, and they got amazing scaling. Now, you can say it's not efficient: it took 10,000 cores to go 1,000 times as fast. But 1,000 times as fast is really good, right? So there's a real creative tension in that. The GPU computing guys, man, they started trying to do computing on GPUs when they didn't even get the right answer. They didn't have any tools; people remember trying to write, you know, OpenGL programs to get shaders to do what you wanted. But they co-evolved the software stack and the computing stack and the math engines underneath, and they started to do an incredible job of computing. Now we're starting to do the same thing for AI. It's super interesting, and there's so far to go. And this is what Roger said specifically: it's very hard to get software communities to move for 20% or 50% or even 2x; but 10x in computing enables a whole different genre of computing types. So the architecture challenge is to take all the transistors we're building, build the architectures that let you solve problems, and kind of work your way up the stack. Super interesting problem. And we're literally in this transition: there were 100 million devices, then a billion devices; we're going into a world of 10 billion smart devices. And that curve, I believe, is not stopping. We'll see how diverse they are: cloud computing, mobile computing, personal computing, image processors; the number of smart devices running all around is really incredible. So the opportunities for people in college today, looking at all kinds of different applications, are really phenomenal, and they will continue to grow. I want to end on this, and then we'll take some questions. Richard Feynman famously said there's a lot of room at the bottom. Right? Our current transistors are 1,000 by 1,000 by 1,000; if we can get that to 10 by 10 by 10, that's 1,000,000x scaling. The kind of computer we'd build out of that, I don't even know if we know how to imagine it yet.
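That last factor, spelled out:

```python
# A device ~1,000 atoms on a side, shrunk to ~10 atoms on a side,
# leaves room for (1000/10)^3 times as many in the same volume.
print((1_000 / 10) ** 3)  # 1,000,000x
```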
Right? There are going to be so many transitions over the next 10, 20 years. So I believe Moore's law is not dead; I hope I convinced somebody. But more deeply: if your idea set says this is going to keep going, then there are a whole bunch of challenges, and I think we'll rise to them; I've seen that over and over. If you think it's running out of gas, it will; if you think it's not, it's not going to. And there are many people in the industry working on this. So many people told me, well, computer architecture can't move any further. Really? How many times has it changed over the last 30 years? Like, over and over and over. The hardware-software contract is really interesting. We've definitely found out that a more open-source world is really interesting, because when everybody participates, you gather that together; we saw that first in open-source software. Amazingly, we're seeing it in security today. If you think security is keeping something secret, and the secret gets out, then you're dead, right? If you think security is getting the best people to look at your stuff and investigate and collaborate with you, you have a plan for security. We see this in software, and the co-optimization of software, architecture, and the actual chips we build is going to keep going. Alright, with that, any questions?
Eric Paulos 40:04
All right, great. So I know we have a lot of questions, so I'm gonna... this.
Jim Keller 40:12
Oh, if it doesn't work, I can always,
Eric Paulos 40:14
Oh, okay, we get to be interactive. Okay, so this is your question box. You're then responsible for passing it to the next person. We'll start here at the front, but it's going to work its way back.
Unknown Speaker 40:26
Thanks for your talk. You give a lot of hope for conventional architecture, but you just built a neural net accelerator at Tesla. Where does that fit in the computing landscape going forward? Is it going to revolutionize things, or is it a niche product?
Jim Keller 40:40
Yeah, I have mixed feelings about it. If people look at the human brain, you know, there's the motor cortex and the reptile cortex and the lower primate; there's a diversity of thinking in our own brains, right? So I think computers will have diverse programming. I'm more interested, I think, in the long run... right now the AI world is changing very fast, right? So there are a lot of accelerators that are pretty bespoke to a narrow set of problems, and then there are things that are much more programmable; I like those. The widget we built at Tesla was super interesting, because it runs the output of Caffe, right? That's a very programmable inference framework, and we can also port other things to it. And it had some very interesting properties: it's not hard to move the data around, because the on-chip RAMs hold the data, and it's not hard to run the instructions, because it runs Caffe instructions. So that was an interesting accelerator experiment, and it let us go from a standing start to driving a car in 18 months, right? Without the six years of software development. So, the worlds we're interested in: right now we're putting AI acceleration instructions in CPUs; the GPU guys are getting better at making them programmable; and then the bespoke accelerator kinds of things have a place where we've seen it stabilize, like video encoding. And this is Roger's point: there are encoders and decoders, and you can get hardware for both for H.264. But in the cloud, the arms race is on the encoder, right? They're all software encoders, because if you can tweak a little bit of compression with a new encoder, you can save a lot of bandwidth on the internet. But everybody's phone runs a hardware decoder; the decoder is the fixed target. And AI could be the same way: training is a monster that's going to keep evolving, and inference is a more simplified thing, but it's still pretty open-ended. And I think an experimental world is the right answer.
Unknown Speaker 42:46
If I could just sharpen up that question: imagine a pie chart with all the silicon coming off fabs in five years. In that pie chart, how much of it is CPU, how much of it is GPU, and how much of it is neural net accelerators?
Jim Keller 43:03
Software guys move slower than you think, and the AI is moving faster than you think. So today it's sort of 80 CPU, 20 GPU, zero everything else. And if things move quickly, it'd be a third, a third, a third. But I don't think it'll move that fast; but I couldn't call it. Okay, thanks.
Unknown Speaker 43:26
So, Moore's law is not dead; Dennard scaling is not dead, right? That's what you also said. But is the number of semiconductor companies shrinking?
Jim Keller 43:38
Well, so, a couple of years ago in the Valley...
Unknown Speaker 43:42
At the high end, right; they're trying to push to stay there.
Jim Keller 43:46
Here's the really wild thing. So, you know how pendulums work, right? They swing back and forth. Back in the late '90s, you couldn't kick over a rock without finding a chip startup; everybody was doing ASICs. And then five years later you couldn't get funding if you were doing hardware, because, quote, hardware doesn't make any money. And now there are 100 AI chip startups. Fabs? Yeah, the big ones are Intel, TSMC, Samsung, and behind them GF, UMC, SMIC, and then there are like five more. I thought some of them gave up on the leading edge; GF famously said they're not doing seven; they have a huge business doing other technologies. So here's a really interesting thing: the business model of a semiconductor company is complicated. TSMC brags, it's on their website, that they just keep building new fabs and they keep the old ones running. At Tesla, we had a whole bunch of products in 100-and-something nanometer. So I would say today there's consolidation at the leading edge: there are three. A step back, there are like three more; another step back, there are like 10 more. And then there's a whole bunch of bespoke technologies, like RF and special-purpose power devices, and there's actually some new stuff there. So that's pretty good, and the investment is good. And then people always say, well, fabs are getting so expensive, you can't build a new fab. You know one thing? The world doesn't have a shortage of money. Like, it's amazing how much we're spending on it. EUV machines, when they came out, were like $400 million. They used to show you, when you go to a fab, the billion-dollar row: there are ten of these $100 million things. And we're replacing them with $300 million machines; it's going to be a $3 billion row. The price of those things is crazy, but I don't think we're going to run out of money. The worldwide appetite for square feet of semiconductors is still going up; all the big fabs are building more capacity. So we'll see what happens in the long run. And you can now get a chip startup funded. You can actually build a chip in 18 months with 50 IPs, integrate them together, power it on, and drive a car. I proved that.
Unknown Speaker 46:05
Your abstract seemed to indicate you weren't that optimistic about voltage scaling and frequency scaling. But some of us think, especially if you have an infinite amount of money, you can actually solve those problems too. What's your attitude about voltage scaling and frequency scaling?
Jim Keller 46:21
So, the roadmap I showed here didn't include much for voltage scaling or frequency scaling. We're working on it; I don't want to disclose what the frequency numbers are, but they're pretty good. On voltage scaling: using silicon germanium, and fins and nanowires, is moving the voltage a little. There are bigger developments to move voltage scaling further, and I'm totally interested in it; we're definitely going to see significant changes over the next 10 years. But I have nothing to talk about today. Ask me again next year.
Eric Paulos 47:01
Some more students back there. There it goes.
Jim Keller 47:07
Nice throw off the roof.
Unknown Speaker 47:11
Thanks for the talk. I just want to know your view on trading off power, area, and frequency versus computation accuracy. Like, are you guys moving toward approximate computing, say a 12-transistor adder cell versus a 28-transistor one? Because if the data is statistical, why shouldn't we?
Jim Keller 47:29
So, does anybody in here know anything about AI? A couple of years ago there was a whole bunch of work around reduced precision, variable precision, and, let's say, fuzzy answers. And the problem is, people are going to bigger and bigger datasets, and they're trying to get repeatability and convergence, so having fixed answers and data types really mattered. And most of the first chips that bet on heavily reduced precision actually went away. And there are also some standards now: there are 16-bit floats, there are two formats of that, and of course people are working on some eight-bit stuff. Your intuition is that your brain seems to do a lot of computation, and it's pretty fuzzy, and that seems like the right answer. And we know that when you're training big datasets, you can sort of randomly nuke stuff without much loss of coherence. But the current thing is, when you're in development mode, repeatability is super important on the training side. And I know we did a lot of work like: get your new algorithms all working on 32-bit float, for Christ's sakes, because the variability is so high; and then as you refine it, you can drop down to 16-bit, and then the inference engines can go pretty low. So there seems to be a lot of variability in that problem today. Directionally, it seems like fuzzy compute and approximate answers are the right thing, but today it's been difficult to drive projects to completion that way. Next question: what are your thoughts on FPGAs? FPGAs, they're really cool; I've built a lot of stuff using FPGAs. Generically, when you go from a fixed accelerator to a programmable thing, you lose, let's say, an average number of 10x in performance per watt. And when you go to FPGAs, you might lose another 10x, right? Now, the good thing about FPGAs is you can build the algorithm you want, and that often works very well. And then there's a whole bunch of places in industry where you want a specific chip and you don't have enough volume to justify the capex cost to build it, and FPGAs are great. So I've been intrigued that the FPGA companies were able to put multipliers and adders in FPGAs and then use that to get effective compute density out of them. But there's a pretty big spread on the frequency and flops numbers. So the FPGA businesses are growing, but they're not, like, taking over computing, for example. So it's a little complicated. And the stuff you can do with FPGAs is great; we build FPGA models of everything we do now; they're the heart of all the emulation technology we have. And if you go into the networking space, there are FPGAs all over everything, because there are lots of special-purpose, low-volume devices where FPGAs are great. And I'm a fan; I've used them a lot. But as soon as a part of computing gets really big, building a purpose-built accelerator, or the right extension for a programmable computer, seems to win. The question behind?
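A tiny NumPy demonstration of the "start on 32-bit float" advice: a float16 accumulator silently stalls once the running sum gets large, while float32 stays exact here. The loop itself is illustrative, not from the talk:

```python
import numpy as np

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(10_000):
    # float16 spacing at 2048 is 2, so adding 1.0 rounds back to 2048.
    acc16 = acc16 + np.float16(1.0)
    acc32 = acc32 + np.float32(1.0)

print(acc16, acc32)  # 2048.0 10000.0
```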
Unknown Speaker 51:12
Well, it's interesting to see your graph of where architecture is going, from single-core to multicore, and it seems like for the next 20 years AI will be a focus. And I'm imagining that AI will be built into the Intel chip at the same level as the branch predictor and instruction fetcher. Or is it a different abstraction?
Jim Keller 51:38
I didn't quite understand your question.
Unknown Speaker 51:40
Like, how do you think the AI components of the chip will be integrated with the x86 instruction executor?
Jim Keller 51:49
You know, that's a really big question. Intel just recently introduced what's called CXL, which is basically a coherent memory port. Because, you know, we think there are going to be standard computers that run lots of C programs and JavaScript and all kinds of stuff; and then there's stuff that works really well on the vector architectures that GPUs have; and then the AI chips today tend to be dense computational things, like convolutions and matrix multiplies, something that can take tensors right out of TensorFlow and do the right operations on them. But the algorithms, you know, are going to want to talk to each other, and the customers are uncertain what to build. Like, one reason GPUs haven't gotten bigger in the data center is that if the GPU workload is half your workload, and it's 10% duty-cycle-wise, then standing up a GPU data center is super expensive, right? What people want is CPU and GPU acceleration in the same box, where it's really flexible to move the data back and forth, and they want to do it in a flexible way. Because if they think, well, I need this much GPU and AI computing, and it turns out not to be true, they don't want to be stuck with the wrong computer set. So it's a real question.
Unknown Speaker 53:12
Do you foresee that, say, a future Intel CPU will have instructions specialized for convolution?
Jim Keller 53:21
Yes. Yeah, so the short answer is: there's a whole bunch of stuff where the code is pretty good and some kind of data-type acceleration really works, and we'll definitely do that; we're working on multiple projects. And there are other things where either I have the data set and the math intensity, and it doesn't really care about, like, unpredictable C code, or the data set is so big and weird that having a CPU grovel through it is not the right answer, right? So AI has significant diversity in how you want to architect for it.
Unknown Speaker 54:05
My question is: is Intel looking at new materials, especially down at the scale Richard Feynman talks about, the few-atoms scale you highlighted?
Jim Keller 54:19
Yes. We made a funny slide, and then I saw TSMC crib it, where we did a walk through the periodic table. You know, it used to be silicon and aluminum and silicon oxide, and then the number of materials used in chip fabrication just kept going up; we use more and more atoms. And then, as you're alluding to, in terms of the actual device type, it's really interesting what's going to happen, but that's not my exact wheelhouse. You want an answer to that question? Carbon. We'll see what happens. Like, material science: we've just scratched the surface on material science. I mean, you put a couple of atoms together and have them talk, and it's so unpredictable. Everyone knows the Schrödinger equation works, right? You can write it down; it's pretty simple. Well, we solved hydrogen. Sad story. Material science is amazing.
Eric Paulos 55:14
One more, at the front.
Jim Keller 55:17
Oh, look at that handoff; relay cooperation. Hey, Mark.
Unknown Speaker 55:24
Yeah, I was interested in your comments about FPGAs, because one thing that's missing from all your graphs and all your charts, all these big multipliers, is that nowhere do you say anything about NRE, right? Non-recurring engineering: the cost of developing a new design. That's probably going up pretty steeply as well. You were a little, you know, hesitant about FPGAs, really.
Jim Keller 55:52
I can give you some numbers.
Unknown Speaker 55:54
What can you do? Have you got any other way to beat down the NRE, so that small companies can, you know, come up with new products quickly?
Jim Keller 56:03
At 130 nanometers, you can build a sophisticated device for about one to two million dollars, right? At 28 nanometer, I built multiple parts for 10 million; 14 nanometer, 40. And on an advanced process node, it depends on how you allocate the cost; who the hell knows: hundreds of millions if you take the process for granted, billions if you have to carry the cost. So the cost to do a new product ranges from, let's imagine, 2 million to 2 billion dollars. It's fairly diverse, right? Now, FPGAs are really interesting, because if it's a small number of parts, even 2 million takes a lot of parts to amortize, right? And if you're doing experiments in a research lab, and you don't care that much about frequency but you really want to do architecture experiments, modern FPGAs are great tools, way more than they used to be. You know, you can get a stack of HAPS boards with a boatload of FPGAs on each one and standard interfaces, and I'm sure some of the students here have built some really cool computers using FPGAs, with working memory controllers and Ethernet NICs and embedded CPUs, so you can do what you want. So I'm a huge fan of that. But I would say, just think: 2 million to 2 billion is the range, depending on what you're paying for. You know, advanced technology, 16 FinFET, you're at 25 to 45; you start going into seven nanometer, you're looking at, you know, five to ten million for mask costs alone, right? And everybody says, well, it's just going to explode, run out of control with the mask costs; and yet the nodes keep coming down. Like 16 FinFET, when it first came out, was $100 million minimum, and they're down under 40 and on their way to 20. So it keeps moving. And I have to say: what do I need to make a competitive product? If your architectural innovation is 10x over the other guy, then the 2x on the process isn't the thing. But if everybody is on the same process and has equal architecture, then the process is the differentiator. And that stuff is pretty stable.
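The amortization point as arithmetic: NRE spread over volume, plus a per-unit silicon cost. The $20 unit cost is an invented placeholder; the NRE figures are the ones from the answer:

```python
def unit_cost(nre, volume, silicon=20.0):
    # Per-chip cost: amortized NRE plus an assumed per-unit silicon cost.
    return nre / volume + silicon

for nre in (2e6, 40e6, 2e9):  # 130nm-ish, 14nm-ish, bleeding edge
    costs = [f"{v:>11,} units -> ${unit_cost(nre, v):>12,.2f}"
             for v in (10_000, 1_000_000, 100_000_000)]
    print(f"NRE ${nre:>13,.0f}: " + " | ".join(costs))
```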
Eric Paulos 58:15
All right, I'm going to take the last question... okay, let me take one more, and then I want to ask one myself. So, yeah.
Unknown Speaker 58:23
Yeah, hi. Great talk. One thing I was wondering: in that chart of processors you had, the whole computer seems to be scaling pretty fast still, but the size of the register files and caches isn't really scaling as fast. So is that going to bottleneck things, and how might we mitigate that?
Jim Keller 58:42
Yeah, so there's actually a big step function there. The L1 caches are like 32 to 64K, because you're trying to keep a four-cycle load-use pipeline; if you grow that and step up one more cycle, you gain 2% from the bigger cache and you lose five to 7% on the latency. So we build L1 caches, mid-level caches, last-level caches. The trade-off has been around the cache hierarchy, though the instruction-stream caches have gotten bigger and faster. And that trade-off comes out of a fairly sophisticated performance-model environment. And then the other thing is we build more load-store pipes, right? When you want to make the L1 caches bigger and you can't, you put more ports on them. So modern caches have a lot of ports, like five or six or eight read/write ports. So we've sort of been spending the bits and the transistors and the metal tracks on ports more than on size, and then trying to compensate with big block moves out of the mid- and last-level caches. Cool question. There's one more back there. Thanks.
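The same trade-off in a toy average-memory-access-time model; only the four-versus-five-cycle step comes from the answer, while the hit rates and the miss penalty are invented:

```python
def amat(hit_cycles, hit_rate, miss_penalty=40):
    # Average memory access time for one cache level (toy model).
    return hit_cycles + (1 - hit_rate) * miss_penalty

small = amat(hit_cycles=4, hit_rate=0.96)  # 32K L1, four-cycle load-use
big = amat(hit_cycles=5, hit_rate=0.97)    # 64K L1, one extra cycle
print(small, big)  # 5.6 vs 6.2: the bigger, slower L1 loses here
```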
Unknown Speaker 59:53
So, this may be a slightly off-topic question, but Intel and IBM have both come up with quantum computers with, granted, like 49 and 51 qubits. So is there a competition between the two technologies? Because they're two very different types of processing. Is there, like, a competition over which one is going to prevail? And if so, which do you think is going to win?
Jim Keller 1:00:13
I literally have no opinion. I've been watching quantum computing for years, and I had a friend who was working on it for a couple of years, and he said, I'm not sure it's not a fraud, because we haven't really gotten results out of it. And I'm curious, because, you know, the physics says a whole bunch of interesting things should be happening, but we haven't really gotten the results that we expect. But they are making progress. So I would have to direct you to the current papers for answers to that. Okay. Thanks, everybody.
Eric Paulos 1:00:57
All right, actually, I had one more question, which is a little bit about this amazing career you've had across different places where you've worked. You have an audience here of, certainly, the next generation of practitioners in this space. So: do you have advice for students? Or, what do you wish someone had told you when you were in their seat? What's your kind of message?
Jim Keller 1:01:17
Well, first of all, if they'd told me something, I wouldn't have listened. So I'm not really... you know, when I was in college, I was a double-E, and then senior year we had a two-inch wafer fab, and my adviser ran it. So I worked on semiconductor physics; I thought that was super cool. And then my first job I took in Florida, because I wanted to live next to the beach and surf. And the surfing was shitty, and the company was horrible. But the good part was, they threw me in the lab and I fixed stuff, and I got to be, you know, a scope jockey; I could fix bloody anything in the lab. And it was one of those worst-two-years-of-your-life, best-two-years-of-experience things. And then somebody told me I should work for Digital Equipment. Back then it was newspapers; I found a recruiting ad for Digital Equipment, and I read the VAX-11/780 manual and the 11/70 manual on the plane. And I talked to Bob Stewart, the chief architect of the VAX-11/780, which is a very famous computer. He said, all right, so why are we here? I said, I have a lot of questions for you. And it turns out he thought I was a complete goof, but he thought it'd be funny to have me work for him. And, you know, my thing is: I like to throw myself into the stuff and learn what I can, and make sure you're working with people who are excited. Like, I worked at Harris, that was the Florida company, and at lunch all people did was complain. And then at Digital, in the first eight years there, a friend of mine's wife said, what do they put in the water? All you guys do is work, and when you go out drinking, you talk about work. And it was super fun. And way back, I gave a talk at Microprocessor Forum on the VAX 8800, which was the first computer I worked on. It was kind of weird: I was a junior person, and my boss quit, and then his boss quit, and when we got all done, I was the chief architect of the cache. And there was no poison involved. And I gave a talk on it, and I was a nervous wreck, and after me was Marty Hopkins, who was a Fellow at IBM, a great guy. Afterwards we were all standing around in a little chat, and, because back then you'd go to a computer conference, I remember the RISC wars, you'd say this and they'd say that and everybody was off, he said: you know, we're so passionate about this because it's a great endeavor. And I thought that was a perfect way to say it. There are so many technologies involved; it's so transformative to society. So many good things have happened, some bad things obviously. So find some smart people that are excited. If everybody's all bummed out and ragging on the company, go somewhere else, right? Or be a change agent. You know, I went to AMD after they'd fired half the people; it was not a happy place. But my conviction was, if we go do the right thing, design Zen, and put our energy into it, we'll do something cool. Now it's a happy place. So Intel's a wild place right now, because, you know, we had record revenue; we have some stuff that's great, some stuff that's not so good. Super fun for me, because I like to think about those problems.
Eric Paulos 1:04:36
Oh, great. Thank you for bringing that energy here, and thanks, Jim, for giving a great talk. He's going to stick around a little bit, so if you want to interact with him, please come up. Thank you again.