Let's Reinvent Modern CPU Caches!
In The Beginning, programs were hard-coded, entered directly with switches. Values would be input, and then results would be output,
but couldn't really be stored. We'll draw this like so:
Input -> Fixed Calculations -> Output
An early improvement in generality was the addition of storage (ENIAC eventually gained 100 words of magnetic core memory),
leaving us with something along these lines:
Input/Output <-> Memory <-> Fixed Calculations
The next increase in generality was the Von Neumann architecture, in which programs are stored in the same memory that data is,
rather than entered by directly configuring the hardware. Now we have
Input/Output <-> Memory -> Instructions \
                        -> Data          -> Calculations
                        <- Results      </
We still use essentially this architecture today. Programs and program data are stored in memory, loaded together into a processor
which applies computations described by the program to the data, then stores the results to the same memory.
The first column is USB, etc...; the second column is hard drives and RAM.
Columns three and four are the processor: the third column (currently loaded instructions and data, plus just-calculated results)
is the "registers", and the fourth (the actual computation) is the "functional units".
We're focusing on the processor, so we'll omit column 1 from now on. For ease of diagramming, we'll draw everything in a line,
even though results are stored back into the same registers and memory that values are loaded from.
So, same architecture as before:
Memory -> Load (Instruction, Values) -> Calculations -> Store (Results) -> Memory
Each time the processor "ticks", we load an instruction and some values from memory into registers,
apply the calculation the instruction describes to the values, and store the result.
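To make the tick concrete, here's a toy sketch in C; the instruction encoding and the tiny "memory" array are invented purely for illustration, not how any real processor is organized:
#include <stdio.h>

// Toy instruction: read two memory slots, apply an operation, write one slot.
enum op { ADD, MUL };
struct instruction { enum op op; int src1, src2, dst; };

int main(void) {
    int memory[8] = {0, 3, 4, 0, 0, 0, 0, 0};         // the data lives in memory...
    struct instruction program[] = {                  // ...and so does the program
        { ADD, 1, 2, 3 },                             // memory[3] = memory[1] + memory[2]
        { MUL, 3, 3, 4 },                             // memory[4] = memory[3] * memory[3]
    };

    for (int tick = 0; tick < 2; tick++) {
        struct instruction ins = program[tick];       // load the instruction
        int a = memory[ins.src1];                     // load the values into "registers"
        int b = memory[ins.src2];
        int result = (ins.op == ADD) ? a + b : a * b; // apply the calculation
        memory[ins.dst] = result;                     // store the result
    }
    printf("%d %d\n", memory[3], memory[4]);          // prints "7 49"
    return 0;
}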
Improvements from this point on are about speed, rather than flexibility.
The first speedup is pipelining, which is quite simple conceptually: instead of waiting for all three steps above to finish each tick,
we start the next operation as soon as the current one moves to step 2.
This lets us "overlap" 3 operations (the term "in-flight" operations is often used).
This technique can be extended further by breaking down "load", "calculate" and "store" into
smaller pieces; in the somewhat ill-fated "Prescott" Pentium 4 processor, these basic steps were broken down into 31(!)
tiny sub-steps in an attempt to keep work flowing as fast as possible.
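To see why this pays off, here's a back-of-the-envelope sketch in C with idealized assumptions (one instruction enters per tick, nothing ever stalls, and the stage counts are the 3 basic steps and Prescott's 31 from above):
#include <stdio.h>

// Without pipelining, each instruction occupies the processor for `stages` ticks.
// With pipelining, a new instruction enters every tick once the pipeline is full.
long unpipelined_ticks(long instructions, long stages) { return instructions * stages; }
long pipelined_ticks(long instructions, long stages)   { return stages + (instructions - 1); }

int main(void) {
    long n = 1000000;  // a million instructions
    printf("3 stages:  %ld ticks unpipelined, %ld pipelined\n", unpipelined_ticks(n, 3),  pipelined_ticks(n, 3));
    printf("31 stages: %ld ticks unpipelined, %ld pipelined\n", unpipelined_ticks(n, 31), pipelined_ticks(n, 31));
    return 0;
}
In the ideal case both versions approach one completed instruction per tick; the deeper pipeline just takes longer to fill, and in practice takes longer to refill whenever something interrupts the flow, which is part of why those 31 stages proved ill-fated.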
When several operations are chained, for example
a = b + c
d = a * 2
We can get another speedup by doing the second operation directly on the register that the result of the first was stored into.
So the above becomes (ignoring having to load instructions):
load b into register 1
load c into register 2
add register 1 to register 2, storing result in register 3
multiply register 3 by 2, storing result in register 4
Loading from memory is *immensely* (100x or more) slower than using a value already loaded into a register,
so this can be a huge improvement.
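In C, the chain above looks like this; a sketch, but any optimizing compiler will keep the intermediate value in a register rather than round-tripping it through memory:
// The intermediate result `a` never touches memory: the multiply reads it
// straight out of the register the add wrote it into.
int chained(int b, int c) {
    int a = b + c;   // add:      two registers in, one register out
    int d = a * 2;   // multiply: reuses that output register directly
    return d;
}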
Now our architecture looks like this:
Memory -> Load (Instruction, Values) -> Calculations -> Store (Results) -> Memory
               ^                                               |
               <------------------------------------------------
However, this leaves us with a situation where a sequence of fast operations on values already in registers can get
backed up behind a slow operation that has to load a value from memory. An immense amount of effort has been devoted
to mitigating this problem (see https://www.reddit.com/r/applehelp/comments/3dt39n/activity_monitor_shows_up_to_800_cpu/ct8gysn/
for an explanation of two approaches: out-of-order execution and hyperthreading).
The most direct, and probably most effective, strategy is to take advantage of the fact that you can actually make memory faster
if you're willing to have less of it (fast memory is more power hungry, more expensive, takes more space, etc...).
We don't want to give up having lots of memory, so we get most of the benefits of both by introducing a small pool
of fast memory between the processor and the real memory, and keeping just the most recently used values in it:
Memory (~16GB, ~80 ticks) -> Cache Memory (64kB, ~4 ticks) -> Load (Instruction, Values) -> Calculations -> Store (Results) -> Cache Memory -> Memory
                                                                   ^                                               |
                                                                   <------------------------------------------------
This means that as long as a program mostly operates on the same set of data repeatedly (a property called "temporal locality"),
and that data fits in 64kB, we get immensely faster loads. It turns out that many programs do have good temporal locality,
so the size of the data (called the "working set") is the more significant limitation. A rule of thumb is that the
"miss rate" (the % of loads that don't find their value in the cache; its complement is the "hit rate") drops by roughly a factor of √2 each time the size of the cache doubles.
So, if we can make super fast tiny caches, can we make medium fast larger caches? Sure can! Intel's Skylake processor
has a series of caches that look approximately like so:
Memory (~16GB, ~80 ticks) -> Cache (8MB, ~40 ticks) -> Cache (256kB, ~12 ticks) -> Cache (64kB, ~4 ticks) ->
This is called the "cache hierarchy" of the processor, and it's absolutely critical to good performance.
Many programs see better than 90% hit rates for even the smallest "level 1" cache, and very very few of their loads
have to go all the way to glacially slow main memory.
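You can watch the hierarchy from software with a rough measurement sketch like the one below: chase pointers through buffers of increasing size, and the time per load steps up as the working set spills out of each cache level. The buffer sizes, iteration counts, and the timings you'll see are machine-dependent; this is an illustration, not a rigorous benchmark:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t g_sink;   // keeps the result "used" so the chase isn't optimized away

// Time a pointer-chase through `n` slots arranged in one big random cycle
// (Sattolo's algorithm), so every load depends on the previous one and the
// working set is the whole buffer.
static double ns_per_load(size_t n, size_t steps) {
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          // Sattolo's shuffle: yields a
        size_t j = rand() % i;                    // single cycle over all slots
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec start, end;
    size_t pos = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t s = 0; s < steps; s++) pos = next[pos];
    clock_gettime(CLOCK_MONOTONIC, &end);
    g_sink = pos;
    free(next);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    return ns / (double)steps;
}

int main(void) {
    // Working sets from 32kB up to 32MB; expect steps in ns-per-load roughly
    // where the buffer outgrows each cache level on whatever machine runs this.
    for (size_t kb = 32; kb <= 32 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(size_t);
        printf("%6zu kB working set: %5.1f ns per load\n", kb, ns_per_load(n, 5000000));
    }
    return 0;
}
On a typical machine the per-load time climbs from a nanosecond or two up to tens of nanoseconds as the working set outgrows each level, which is the cache hierarchy above showing through.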