Created
June 14, 2021 13:12
About generating k32s on GPUs:
At a high level, plotting alternates between calculating a large batch of values (parallel!) and sorting them
(parallel too, but harder). The problem is that every single value calculation for table i has to finish (and the
results have to be sorted somehow) before table i+1 can start. This means you have to put all that data somewhere,
and then read it all back in again. Here is a rough idea of how much data needs to be stored:
table1            table2             table3             table4             table5             table6             table7
  ^                 ^                  ^                  ^                  ^                  ^                  ^
  |                 |                  |                  |                  |                  |                  |
 26GB              26GB               26GB               26GB               26GB               26GB               26GB
  |                 |                  |                  |                  |                  |                  |
f1(x) -> 39GB -> f2(a,b) -> 56GB -> f3(a,b) -> 90GB -> f4(a,b) -> 90GB -> f5(a,b) -> 73GB -> f6(a,b) -> 56GB -> f7(a,b)
As you can see, if the plotting device (GPU?) doesn't have enough on-board memory to completely hold the line of data
moving along the bottom, then you have to put that data somewhere else. This means writing it all out
over PCIe, which is only about 32GB/s in each direction for 16-lane PCIe 4.0. This is going to be a critical bottleneck
for the plotter, and it can't be overcome with faster GPUs. The destination for this data is also a critical element.
DDR4 can approach that speed, but not many systems can host 250GB+ of DDR4, and most NVMe solutions are off the table.
The total data transfer comes out to about 586GB coming out of the GPU (404GB of intermediate data plus 7 x 26GB of
tables) and about 404GB going back into the GPU. If you could max out your PCIe 4.0 bus, the minimum phase 1 plot time
would be in the range of 18 seconds (586GB / 32GB/s).
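
The totals above can be reproduced with a few lines of arithmetic. This is only a rough sketch under stated assumptions: the sizes are the rounded figures from the diagram, every intermediate buffer is assumed to be written out once and read back once, the two PCIe directions are assumed to run concurrently (so the larger direction sets the floor), and the names `transfer_totals` and `min_phase1_seconds` are made up for illustration.

```python
# Sizes in GB, copied from the diagram (rounded figures).
TABLE_GB = 26                                # final data written out per table
NUM_TABLES = 7
INTERMEDIATE_GB = [39, 56, 90, 90, 73, 56]   # data handed from f_i to f_(i+1)
PCIE4_X16_GB_PER_S = 32                      # approximate bandwidth per direction


def transfer_totals():
    """Return (GB leaving the GPU, GB returning to the GPU)."""
    # Assumption: each intermediate buffer is read back exactly once.
    gb_in = sum(INTERMEDIATE_GB)
    # Everything read back was first written out, plus the seven tables.
    gb_out = gb_in + NUM_TABLES * TABLE_GB
    return gb_out, gb_in


def min_phase1_seconds(bandwidth=PCIE4_X16_GB_PER_S):
    """Lower bound on phase 1 time if the bus is the only limit."""
    gb_out, gb_in = transfer_totals()
    # Assumption: both directions transfer concurrently, so the busier
    # direction determines the minimum time.
    return max(gb_out, gb_in) / bandwidth


if __name__ == "__main__":
    out_gb, in_gb = transfer_totals()
    print(f"{out_gb} GB out, {in_gb} GB in, {min_phase1_seconds():.1f} s minimum")
```

Passing a different `bandwidth` (for example a slower bus or a storage device's sustained write rate) gives the corresponding floor for that setup.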