computer-whisperer/why-no-gpu.txt

## why-no-gpu.txt
About generating k32s on GPUs:

At a high level, plotting alternates between calculating a bunch of values (parallel!) and sorting them
(parallel? but harder). The problem is that you have to finish every single value calculation for table i (and sort
the results somehow) before you can start table i+1. This means you have to put all that data somewhere, and then
read it all back in again. Here is a rough idea of how much data needs to be stored:

table1             table2             table3            table4             table5             table6              table7
  ^                  ^                  ^                 ^                  ^                  ^                   ^
  |                  |                  |                 |                  |                  |                   |
 26GB               26GB               26GB              26GB               26GB               26GB               26GB
  |                  |                  |                 |                  |                  |                   |
f1(x) -> 39GB -> f2(a,b) -> 56GB -> f3(a,b) -> 90GB -> f4(a,b) -> 90GB -> f5(a,b) -> 73GB -> f6(a,b) -> 56GB -> f7(a,b)

As you can see, if the plotting device (GPU?) doesn't have enough on-board memory to completely hold the data line
moving along the bottom, then you have to put that data somewhere else. This means you will need to write it all out
over PCIe, which tops out at about 32GB/s in each direction for a 16-lane PCIe 4.0 link. This is going to be a
critical bottleneck for the plotter, and it can't be overcome with faster GPUs. The destination for this data is also
a critical element. DDR4 can approach that speed, but not many systems can host 250GB+ of DDR4, and most NVMe
solutions are off the table.
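
To make the memory argument concrete, here is a minimal sketch that checks the intermediate sizes from the diagram against a device's on-board memory. The card sizes (24GB and 80GB) are hypothetical values chosen for illustration, and the check assumes an intermediate only needs to fit by itself, ignoring any scratch space the sort would need.

```python
# Intermediate sizes (GB) flowing between f1..f7, copied from the diagram.
INTERMEDIATE_GB = [39, 56, 90, 90, 73, 56]


def stages_that_spill(device_memory_gb):
    """Return the intermediates that are larger than the device's memory.

    Assumption: an intermediate "fits" if it is no larger than the memory
    size. Real plotting would also need working space for sorting.
    """
    return [size for size in INTERMEDIATE_GB if size > device_memory_gb]


if __name__ == "__main__":
    # Hypothetical card sizes, for illustration only.
    for memory_gb in (24, 80):
        spilled = stages_that_spill(memory_gb)
        print(f"{memory_gb}GB card: {len(spilled)} of "
              f"{len(INTERMEDIATE_GB)} intermediates must leave the device")
```

Under these assumptions, anything short of roughly 90GB of on-board memory has to push at least some of the intermediates over the bus.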

The total data transfer comes out to about 586GB coming out of the GPU (7 x 26GB of tables plus 404GB of
intermediates) and about 404GB of intermediates going back into the GPU. If you could max out your PCIe 4.0 bus, then
the minimum phase 1 plot time would be in the range of 18 seconds (586GB / 32GB/s).
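
The arithmetic behind those totals can be sketched as follows. The sizes come straight from the diagram. The function and constant names are made up for this example, and the time bound assumes the two PCIe directions run fully overlapped at 32GB/s each, so the busier direction sets the floor.

```python
# Sizes in GB, copied from the diagram.
TABLE_GB = 26
NUM_TABLES = 7
INTERMEDIATE_GB = [39, 56, 90, 90, 73, 56]

# Assumed peak bandwidth per direction for a 16-lane PCIe 4.0 link.
PCIE4_X16_GB_PER_S = 32


def transfer_totals():
    """Return (GB leaving the GPU, GB going back into the GPU)."""
    intermediates = sum(INTERMEDIATE_GB)
    out_gb = TABLE_GB * NUM_TABLES + intermediates  # tables + spilled data
    in_gb = intermediates                           # spilled data read back
    return out_gb, in_gb


def min_phase1_seconds(bandwidth_gb_per_s=PCIE4_X16_GB_PER_S):
    """Lower bound on phase 1 time if the bus is the only limit.

    Assumption: reads and writes overlap perfectly, so the direction
    carrying more data determines the minimum time.
    """
    out_gb, in_gb = transfer_totals()
    return max(out_gb, in_gb) / bandwidth_gb_per_s


if __name__ == "__main__":
    out_gb, in_gb = transfer_totals()
    print(f"out of GPU: {out_gb}GB, into GPU: {in_gb}GB")
    print(f"minimum phase 1 time: {min_phase1_seconds():.1f}s")
```

This is only a bandwidth floor. It ignores compute time, sort time, and whether the destination can actually absorb data at that rate.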