Let's Reinvent Modern CPU Caches!
In The Beginning, programs were hard-coded, entered directly with switches. Values would be input and results would
be output, but couldn't really be stored. We'll draw this like so:
Input -> Fixed Calculations -> Output
An early improvement in generality was the addition of storage (ENIAC eventually gained 100 words of magnetic-core memory),
leaving us with something along these lines:
Input/Output <-> Memory <-> Fixed Calculations
The next increase in generality was the von Neumann architecture, in which programs are stored in the same memory as data,
rather than entered by directly configuring the hardware. Now we have
Input/Output <-> Memory -> Instructions \
                        -> Data          -> Calculations
                        <- Results      <-/
We still use essentially this architecture today. Programs and program data are stored in memory, loaded together into a processor
which applies computations described by the program to the data, then stores the results to the same memory.
In today's terms, the first column is USB and other I/O; the second is hard drives and RAM.
Columns three and four are the processor. The third column (currently loaded instructions and data, and just-calculated results)
is "registers", and the fourth (the actual computation) is "functional units".
We're focusing on the processor, so we'll omit column 1 from now on. For ease of diagramming, we'll draw everything in a line,
even though results are stored in the same registers and memory that values are loaded from.
So, same architecture as before:
Memory -> Load (Instruction, Values) -> Calculations -> Store (Results) -> Memory
Each time the processor "ticks", we load an instruction and some values from memory into registers,
apply the calculation the instruction describes to the values, and store the result.
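To make that tick loop concrete, here's a toy stored-program machine in C (a sketch: the opcodes, the single-register design,
and the memory layout are all invented for illustration):

    /* A minimal stored-program machine: instructions and data share one
     * memory. Each loop iteration is one "tick": load an instruction,
     * apply it, store the result. */
    #include <stdio.h>

    enum { LOAD, ADD, STORE, HALT };

    int main(void) {
        int mem[32] = {
            LOAD,  20,   /* r = mem[20]     */
            ADD,   21,   /* r = r + mem[21] */
            STORE, 22,   /* mem[22] = r     */
            HALT,  0,
        };
        mem[20] = 2;     /* the program's data lives in the same memory */
        mem[21] = 3;

        int pc = 0, r = 0;
        for (;;) {       /* one iteration = one "tick" */
            int op = mem[pc], arg = mem[pc + 1];
            pc += 2;
            if      (op == LOAD)  r = mem[arg];
            else if (op == ADD)   r += mem[arg];
            else if (op == STORE) mem[arg] = r;
            else                  break;   /* HALT */
        }
        printf("mem[22] = %d\n", mem[22]); /* prints 5 */
        return 0;
    }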
Improvements from this point on are about speed, rather than flexibility.
The first speedup is pipelining, which is quite simple conceptually: instead of waiting for all three steps above to finish each tick,
we start the next operation as soon as the current one moves on to step 2.
This lets us "overlap" 3 operations (the term "in-flight" operations is often used).
This technique can be extended further by breaking down "load", "calculate" and "store" into
smaller pieces; in the somewhat ill-fated "Prescott" Pentium 4 processor, these basic steps were broken down into 31(!)
tiny sub-steps in an attempt to keep work flowing as fast as possible.
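A back-of-the-envelope model shows how much the overlap buys us (a sketch; the 3 stages and the instruction count are illustrative):

    #include <stdio.h>

    int main(void) {
        int stages = 3;           /* load, calculate, store */
        int instructions = 1000;

        /* Without pipelining, each instruction occupies all stages in turn. */
        int serial_ticks = stages * instructions;

        /* With pipelining, a new instruction enters every tick once the
         * pipeline is full, so the full latency is paid only once. */
        int pipelined_ticks = stages + (instructions - 1);

        printf("serial: %d ticks, pipelined: %d ticks\n",
               serial_ticks, pipelined_ticks);   /* 3000 vs 1002 */
        return 0;
    }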
When several operations are chained, for example
a = b + c
d = a * 2
We can get another speedup by doing the second operation directly on the register the first one's result was stored into.
So the above becomes (ignoring instruction loads):
load b into register 1
load c into register 2
add register 1 to register 2, storing result in register 3
multiply register 3 by 2, storing result in register 4
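In C, this chain looks like the snippet below; a compiler at any reasonable optimization level will typically keep `a`
in a register and feed it straight into the multiply, never touching memory in between:

    int chain(int b, int c) {
        int a = b + c;   /* result stays in a register */
        int d = a * 2;   /* reads that register directly */
        return d;
    }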
Loading from memory is *immensely* (100x or more) slower than using a value already loaded into a register,
so this can be a huge improvement.
Now our architecture looks like this:
Memory -> Load (Instruction, Values) -> Calculations -> Store (Results) -> Memory
                    ^                                       |
                    <---------------------------------------+
However, this leaves us with a situation where a sequence of fast operations on values already in registers can get
backed up behind a slow operation that has to load a value from memory. An immense amount of effort has been devoted
to mitigating this problem ( see https://www.reddit.com/r/applehelp/comments/3dt39n/activity_monitor_shows_up_to_800_cpu/ct8gysn/
for an explanation of two approaches: out-of-order execution and hyperthreading ).
The most direct, and probably most effective, strategy is to take advantage of the fact that you can actually make memory faster
if you're willing to have less of it (fast memory is more power hungry, more expensive, takes more space, etc...).
We don't want to give up having lots of memory, so we get most of the benefits of both by introducing a small pool
of fast memory between the processor and the real memory, and keeping just the most recently used values in it:
Memory (~16GB, ~80 ticks) -> Cache Memory (64kB, ~4 ticks) -> Load (Instruction, Values) -> Calculations -> Store (Results) -> Cache Memory -> Memory
                                                                   ^                                            |
                                                                   <--------------------------------------------+
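Here's a sketch of that "small pool of recently used values" in C, as a direct-mapped cache simulator (the simplest
real design: each address maps to exactly one slot; real caches are larger, set-associative, and track whole blocks
of memory, and all sizes here are illustrative):

    #include <stdio.h>

    #define SLOTS 8

    int main(void) {
        long slot_tag[SLOTS];                  /* which address each slot holds */
        for (int i = 0; i < SLOTS; i++)
            slot_tag[i] = -1;                  /* start empty */

        long addresses[] = {1, 2, 3, 1, 2, 3, 1, 2, 3, 100, 1};
        int n = sizeof addresses / sizeof *addresses;
        int hits = 0, misses = 0;

        for (int i = 0; i < n; i++) {
            long a = addresses[i];
            int slot = a % SLOTS;              /* each address maps to one slot */
            if (slot_tag[slot] == a) {
                hits++;                        /* fast path: ~4 ticks */
            } else {
                misses++;                      /* slow path: ~80 ticks */
                slot_tag[slot] = a;            /* remember the new value */
            }
        }
        printf("%d hits, %d misses\n", hits, misses);  /* 7 hits, 4 misses */
        return 0;
    }

Because the address sequence keeps revisiting 1, 2 and 3, all but the first visits are hits: that's temporal locality doing the work.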
This means that as long as a program mostly operates on the same set of data repeatedly (a property called "temporal locality"),
and that data fits in 64kB, we get immensely faster loads. It turns out that many programs do have good temporal locality,
so the size of the data being worked on (called the "working set") is the more significant limitation. A rule of thumb is that the
"miss rate" (the % of loads that don't find their value in the cache) drops by roughly a factor of √2 each time the size of the cache doubles.
So, if we can make super fast tiny caches, can we make medium fast larger caches? Sure can! Intel's Skylake processor
has a series of caches that look approximately like so:
Memory (~16GB, ~80 ticks) -> Cache (8MB, ~40 ticks) -> Cache (256kB, ~12 ticks) -> Cache (64kB, ~4 ticks) -> Processor
This is called the "cache hierarchy" of the processor, and it's absolutely critical to good performance.
Many programs see better than 90% hit rates for even the smallest "level 1" cache, and very very few of their loads
have to go all the way to glacially slow main memory.
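We can put numbers on "critical to good performance" with the standard average-memory-access-time calculation, using the
latencies from the diagram above (the per-level hit rates are assumptions for illustration):

    #include <stdio.h>

    int main(void) {
        double l1 = 4, l2 = 12, l3 = 40, mem = 80;  /* ticks, from the diagram */
        double h1 = 0.90, h2 = 0.80, h3 = 0.80;     /* assumed hit rates */

        /* Each level's latency is only paid by the loads that reach it. */
        double avg = l1 + (1 - h1) * (l2 + (1 - h2) * (l3 + (1 - h3) * mem));

        printf("average load: %.1f ticks (vs ~%.0f with no caches)\n", avg, mem);
        return 0;
    }

With these made-up but plausible hit rates, the average load costs about 6.3 ticks instead of ~80: more than a 10x speedup from the hierarchy.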