# Thinking about what a small M1 Ultra cluster would look like

Ignore this post and read the new one instead: https://gist.github.com/FCLC/6e0f0e79e9d4f5740573f09d7579eb72

Originally this was a borderline copy/paste of a Mastodon exchange. It was fairly crap, so I rewrote the whole thing; the updated version is available via the link above. I prefer not to hide this sort of thing, so the archive will remain public.

# Warnings and alarm bells

"What Cursed thing are you talking about now?"

During today's #hpchuddle, the topic of HPC development and workloads on Apple silicon came up briefly.
Thinking on it, once #Asahi Linux has GPU compute support going, I can see a world where a device like the Mac Studio with M1 Ultra, augmented by Thunderbolt 4-to-networking adapters (TB4 guarantees 32 Gb/s of direct PCIe bandwidth, with up to 40 Gb/s total), makes sense.

## Down the rabbit hole

The current state of Metal is such that interop with other languages/frameworks is not happening without major work, so macOS is out.

### Basics

To get the most out of networking, you are looking at link aggregation: using each of the 6 Thunderbolt 4 controllers with an SFP28 NIC gives a total peak of 6 × 25 Gb/s = 150 Gb/s.
Technically, using QSFP28, SFP56 (or plain old QSFP+), each link is capped by TB4's 32 Gb/s of PCIe bandwidth, putting you at 192 Gb/s, and you can even aggregate the built-in 10GBASE-T Ethernet to get over the 200 Gb/s mark.
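
As a back-of-the-envelope sketch of that aggregation math (assuming 6 usable TB4 links per machine and the line rates above):

```python
# Rough aggregate-bandwidth sketch: 6 TB4 links per Mac Studio, 25 Gb/s per SFP28 NIC,
# a 32 Gb/s PCIe ceiling per TB4 link for faster NICs, plus the built-in 10GBASE-T port.
TB4_LINKS = 6
SFP28_GBPS = 25        # SFP28 line rate
TB4_PCIE_GBPS = 32     # guaranteed PCIe bandwidth per TB4 link (caps QSFP28/SFP56/QSFP+ NICs)
BUILTIN_10G_GBPS = 10

sfp28_total = TB4_LINKS * SFP28_GBPS              # 150 Gb/s
capped_total = TB4_LINKS * TB4_PCIE_GBPS          # 192 Gb/s with faster NICs
with_10g = capped_total + BUILTIN_10G_GBPS        # 202 Gb/s

print(sfp28_total, capped_total, with_10g)        # 150 192 202
```
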
For the specific machine, maxed-out memory, CPU, and storage comes to ~8k USD, or, assuming you can manage the EDU (or an EDU-like) discount, ~7,200 USD per machine.
When I originally posted the above on Mastodon, ECC was brought up; it is far from ideal, but we have done worse to get less.

For example, the US DoD/Air Force ran a PS3 cluster for years without ECC: https://phys.org/news/2010-12-air-playstation-3s-supercomputer.html

## Breakdown and estimates

Considering the relative cost, if you are "only" doing a single rack (say something "small" like a deep 42U), you can fit ~10 Mac Studios per 4U, with a max power draw of 213W per machine; that 4U is still under 2,200W. To "feed" that 4U you are looking at ~60 SFP28/SFP56/QSFP+ interconnects (6 per machine).
Assuming QSFP56 at the switch, you can do that with 15 ports, each breaking out to 4× SFP56.
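
The port math behind that 4U, spelled out as a quick sketch (assuming 6 links per Mac and 4× SFP56 per QSFP56 breakout):

```python
# Per-4U port and power math: 10 Mac Studios, 6 TB4-attached SFP56 links each,
# QSFP56 switch ports breaking out to 4x SFP56, 213 W peak per machine.
MACS_PER_4U = 10
LINKS_PER_MAC = 6
SFP56_PER_QSFP56 = 4
PEAK_W_PER_MAC = 213

sfp56_ends = MACS_PER_4U * LINKS_PER_MAC              # 60 links to feed
qsfp56_ports = sfp56_ends // SFP56_PER_QSFP56         # 15 switch ports
peak_w = MACS_PER_4U * PEAK_W_PER_MAC                 # 2130 W, under the 2200 W budget

print(sfp56_ends, qsfp56_ports, peak_w)               # 60 15 2130
```
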

Quick spec-out, assuming:

- The bottom of the rack holds 12U of UPSs
- 10 Macs per 4U
- For each 8U of Macs, 1U of switching (overkill) and 1U of adapter space
- Of the remaining 30U, you get 3 × (10U groups)

Per 10U / 20-Mac group, the costs are:

- 20 × 7,200 USD for the Macs = 144k USD
- 18k USD for a 40-port QSFP56 managed switch (https://www.fs.com/products/158523.html?attribute=51861&id=804125)
- 30 × QSFP56-to-4×SFP56 breakout cables (https://www.fs.com/products/88141.html) at 90 USD a piece = 2,700 USD; call it 2.7k
- 120 SFP56-to-Thunderbolt adapters (6 per Mac). The best I am finding is single-card Thunderbolt enclosures from Sonnet at ~700 USD each, plus a single-port SFP56 ConnectX-6 card at ~550 USD. Call it 1,350 per port; 120 ports = 162k. (I am sure you can do better here, but this is the baseline.)
 
### Costs once you group it together?

Per group of 10U:

- 144K in Macs
- 18K for the switch
- 2.7K in "optics" (DAC breakouts)
- 162K in TB4->SFP56 adapters

### And the whole rack?

~326.7K per group, × 3 ≈ 980K.
Add ~23K for the UPSs and we'll call it 1M for the rack.
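
Rolling those estimates up, as a sketch (all figures are the rough USD numbers above; the 30 breakout cables come from 120 SFP56 ends per group at 4 per QSFP56 port):

```python
# Per-10U-group and whole-rack cost rollup, using the rough estimates above (USD).
MACS_PER_GROUP = 20
MAC_COST = 7_200               # EDU-ish price per maxed-out Mac Studio
SWITCH_COST = 18_000           # 40-port QSFP56 managed switch
BREAKOUT_CABLES = 30           # 120 SFP56 ends / 4 per QSFP56 breakout
CABLE_COST = 90
ADAPTER_PORTS = 120            # 6 per Mac
ADAPTER_COST = 1_350           # Sonnet enclosure + ConnectX-6 SFP56 card
GROUPS = 3
UPS_COST = 23_000

group_cost = (MACS_PER_GROUP * MAC_COST + SWITCH_COST
              + BREAKOUT_CABLES * CABLE_COST + ADAPTER_PORTS * ADAPTER_COST)
rack_cost = GROUPS * group_cost + UPS_COST

print(group_cost, rack_cost)   # 326700 1003100 -> "call it 1M"
```
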
Bringing that together, the hardware pulls 213W × 20 × 3 (Macs plus their TB4 adapters) + 3 × switch (250W typical) ≈ 13.5 kW peak. Call it 14 kW.
You have 960 P-cores and 240 E-cores.
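
The power and core-count arithmetic, for reference (using the 213 W per-machine and 250 W per-switch figures above):

```python
# Rack-level peak power and core counts: 60 Mac Studios (213 W each, including TB4 adapters)
# plus 3 switches at 250 W typical; each M1 Ultra has 16 P-cores and 4 E-cores.
MACS = 60
MAC_PEAK_W = 213
SWITCHES = 3
SWITCH_W = 250
P_CORES_PER_MAC, E_CORES_PER_MAC = 16, 4

peak_kw = (MACS * MAC_PEAK_W + SWITCHES * SWITCH_W) / 1000
p_cores = MACS * P_CORES_PER_MAC
e_cores = MACS * E_CORES_PER_MAC

print(peak_kw, p_cores, e_cores)   # 13.53 960 240 -> "call it 14 kW"
```
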

CFD testing (NASA's USM3D) puts it at 180 Gflops per 16 P-cores (link at the end), or 10.8 CPU Tflops of SP for the rack.

GPU compute is in the ~20 Tflops (SP) range per Mac, or 1.2 Pflops (SP) for the rack. Double that for FP16.

Depending on whether the neural accelerator ever gets software support, that's up to an added ~40 Tflops of FP16 per machine: 0.8 Pflops per 20-machine group, 2.4 Pflops total.

Total RAM for the entire rack is 7.68 TB.

Storage for the rack is 0.48 PB of *extremely* fast NVMe.
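
A rollup of those per-machine figures into the rack-level numbers (128 GB of RAM and 8 TB of NVMe per machine is the maxed-out Mac Studio configuration assumed earlier):

```python
# Rack-level throughput, memory, and storage for 60 maxed-out M1 Ultra Mac Studios,
# using the per-machine estimates above.
MACS = 60
CFD_GFLOPS_PER_MAC = 180        # USM3D figure per 16 P-cores, i.e. per machine
GPU_SP_TFLOPS = 20              # single precision, per machine
NPU_FP16_TFLOPS = 40            # per machine, if the neural engine ever gets support
RAM_GB = 128
NVME_TB = 8

cfd_tflops = MACS * CFD_GFLOPS_PER_MAC / 1000        # 10.8 CPU Tflops (SP, CFD)
gpu_sp_pflops = MACS * GPU_SP_TFLOPS / 1000          # 1.2 Pflops SP
gpu_fp16_pflops = 2 * gpu_sp_pflops                  # 2.4 Pflops FP16
npu_fp16_pflops = MACS * NPU_FP16_TFLOPS / 1000      # 2.4 Pflops FP16
ram_tb = MACS * RAM_GB / 1000                        # 7.68 TB
nvme_pb = MACS * NVME_TB / 1000                      # 0.48 PB

print(cfd_tflops, gpu_sp_pflops, gpu_fp16_pflops, npu_fp16_pflops, ram_tb, nvme_pb)
```
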

## Bring it together
  
For ~1M USD you get:

- a 42U rack with 1,200 total CPU cores, 960 of which you would use for compute, leaving the 240 E-cores for shuffling/tasking data (think something like the assistant cores on A64FX in Fugaku)
- ~0.5 PB of some of the fastest storage around (but not aggregated)
- 14 kW of draw
- 10.8 CPU CFD Tflops (SP)
- 1.2 Pflops SP on the GPUs
- 2.4 Pflops FP16 on the GPUs
- up to another 2.4 Pflops of FP16 from the NPUs
- 7.68 TB of RAM
 
 
## Caveats?

- You can *definitely* get the Thunderbolt/SFP56 portion of this cheaper.
- I did not include any sort of large storage array, nor the cost of the cables/optics to get data to and from this rack.
- The GPU numbers rely on GPU compute support coming to (Asahi) Linux.
- Redundancy is not really a thing here.
- Budget ~40K for extra Macs, switches, etc.
- No ECC beyond what LPDDR5 already provides.
- ARM on Linux in an unsupported configuration: software will be "fun".
 
## Comparables: DGX H100 pods

Each pod is:

- 8 × NVIDIA H100 80GB
- 2 × 56-core Xeon (Sapphire Rapids) CPUs
- 2 TB of system memory
- 10 × 400 Gb/s network interfaces
- 8 × 3.84 TB of NVMe (~30 TB)
- ~10.2 kW peak power draw
- ~7U of rack space
 
Price is somewhere between [secret] and [don't ask]. The PCIe version of the H100 80GB is ~36K each. Between the 8 of them, the main board, the system memory, and the 2 CPUs, I'm estimating ~450K per pod. The CPUs are SPR 8480s or 9480s (10-12K each), meaning 112 cores / 224 threads per pod.

Assuming we fit 4 in our 42U, that's 28U of compute, leaving 14U for switching and power. A little tight, especially since we're hitting ~41 kW of peak draw.

Performance per H100 is listed at:

- 60 Tflops FP32/FP64
- 2 Pflops FP16

Across the 32 × H100:

- 1.92 Pflops FP32/FP64
- 64 Pflops FP16

For the 4 of them, plus switching, etc., I'm going to call it ~1.9M USD (I expect it to be higher in real life, probably closer to 2.5M).
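
The same kind of rollup for the 4-pod DGX rack, as a sketch using the listed per-GPU numbers and the ~450K-per-pod guess above:

```python
# Four-pod DGX H100 rack estimate, using the listed per-GPU figures and the rough
# ~450K USD per-pod cost guess above.
PODS = 4
GPUS_PER_POD = 8
SP_TFLOPS_PER_GPU = 60          # listed FP32/FP64 figure
FP16_PFLOPS_PER_GPU = 2
POD_PEAK_KW = 10.2
POD_COST = 450_000
NVME_TB_PER_POD = 8 * 3.84      # ~30 TB

gpus = PODS * GPUS_PER_POD                           # 32 H100s
sp_pflops = gpus * SP_TFLOPS_PER_GPU / 1000          # 1.92 Pflops
fp16_pflops = gpus * FP16_PFLOPS_PER_GPU             # 64 Pflops
peak_kw = PODS * POD_PEAK_KW                         # ~41 kW
hw_cost = PODS * POD_COST                            # 1.8M; call it ~1.9M with switching
nvme_tb = PODS * NVME_TB_PER_POD                     # ~123 TB

print(gpus, sp_pflops, fp16_pflops, peak_kw, hw_cost, nvme_tb)
```
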

### DGX H100 comparison

For:

- ~2x the cost
- ~3x the power

You get:

- ~1.6x the SP throughput (but you do get FP64)
- ~13.3x the FP16
- ~0.25x the storage (but it's aggregated, so there are nuances)
- ~0.33x the GPU memory (if you include system memory, you're at ~1.4x)
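
And the ratios behind that comparison (the 4-pod DGX rack relative to the Mac Studio rack, all taken from the estimates above):

```python
# Ratios of the 4x DGX H100 rack to the Mac Studio rack, using the estimates above.
dgx = {"cost_usd": 1.9e6, "peak_kw": 40.8, "sp_pflops": 1.92, "fp16_pflops": 64,
       "nvme_tb": 4 * 30.72, "gpu_mem_tb": 32 * 0.080, "sys_mem_tb": 4 * 2}
mac = {"cost_usd": 1.0e6, "peak_kw": 13.5, "sp_pflops": 1.2, "fp16_pflops": 2.4 + 2.4,
       "nvme_tb": 480, "unified_mem_tb": 7.68}

print(dgx["cost_usd"] / mac["cost_usd"])              # ~1.9x the cost
print(dgx["peak_kw"] / mac["peak_kw"])                # ~3x the power
print(dgx["sp_pflops"] / mac["sp_pflops"])            # ~1.6x the SP
print(dgx["fp16_pflops"] / mac["fp16_pflops"])        # ~13.3x the FP16 (GPU + NPU)
print(dgx["nvme_tb"] / mac["nvme_tb"])                # ~0.25x the storage
print(dgx["gpu_mem_tb"] / mac["unified_mem_tb"])      # ~0.33x the GPU memory
print((dgx["gpu_mem_tb"] + dgx["sys_mem_tb"]) / mac["unified_mem_tb"])  # ~1.4x incl. system memory
```
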

 
## Conclusion
At the very least, it is an interesting thought exercise. If you do this, let me know! I'd love to see it! It's cursed but I love this sort of stuff :-P 
 

 
CFD benchmarks here: http://hrtapps.com/blogs/20220427/
