Mac Studio Cluster: What if we threw sanity to the wind?

The Beginning

During the January 13th, 2023 HPC Huddle (now hosted by hpc.social), the topic of #HPC development and workloads on Apple Silicon came up briefly.

Thinking on it, once #Asahi Linux has GPU compute support squared away, I can see a world where devices like the Mac Studio with M1 Ultra are augmented by Thunderbolt 4 networking cards. Even if it is just for PR, vendors like Oracle, amongst others, have demonstrated a willingness to build weird and wonderful clusters as a “because we can.” It is far from ideal, but we have done worse to get less. Beyond Oracle and the Pi cluster, the US DOD/Air Force ran a PS3 cluster for years. https://phys.org/news/2010-12-air-playstation-3s-supercomputer.html

Setting the stage

A few baselines before I go on:

  • We are going to walk through this thought experiment from the point of view of a small laboratory/bootstrap cluster that can only use a single 48U, 42”-deep rack.
  • The current state of Metal is such that interop with other languages/frameworks is not happening without major work, so macOS is out and Linux is in.
  • I assume GPU compute on M1 is working, and that the Neural Engine is usable as well.
  • For networking, link aggregation is necessary. Using all 6 Thunderbolt 4 controllers, each adapted to 50 Gb/s SFP56, gets us to “acceptable” throughput per machine. Because of limitations around Thunderbolt 4, each NIC is limited to 32 Gb/s (see the sketch after this list).
  • TB4 is working on the Mac, and the chosen NIC is compatible.
  • You use the built-in 10GBASE-T Ethernet as a BMC of sorts.
  • M1 has both E-cores (using the Icestorm architecture) and P-cores (using Firestorm). The E-cores are basically unusable without messing with the scheduler. For the sake of simplifying the setup, we will think of them as auxiliary cores, the same way Fugaku’s A64FX has assistant cores meant to deal with IO, scheduling, etc.
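
To put the networking assumption in concrete terms, here is a minimal back-of-envelope sketch; the 6 controllers and the ~32 Gb/s of usable bandwidth per Thunderbolt 4 link are the assumptions from the list above.

```python
# Back-of-envelope: aggregate network bandwidth per Mac Studio node.
# Assumes 6 Thunderbolt 4 controllers, each feeding one SFP56 NIC that is
# capped at ~32 Gb/s of usable bandwidth by the TB4 PCIe tunnel (assumption above).
TB4_CONTROLLERS_PER_NODE = 6
USABLE_GBPS_PER_TB4_LINK = 32   # TB4 PCIe limit, not the 50 Gb/s SFP56 line rate

aggregate_gbps = TB4_CONTROLLERS_PER_NODE * USABLE_GBPS_PER_TB4_LINK
print(f"Usable bandwidth per node: {aggregate_gbps} Gb/s (~{aggregate_gbps / 8:.0f} GB/s)")
# -> 192 Gb/s (~24 GB/s) per node, before bonding/protocol overhead
```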

Density

Set up horizontally, you can fit 2 Mac Studios side by side in 3U, and with a Mac every 8 inches of depth that is 10 Macs per 3U. Better is mounting them vertically, which allows 4 Macs per row in 5U. Push the rows all the way to the back and we are at 20 Macs per 5U.
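
Here is a rough sketch of where those densities come from; the Mac Studio dimensions (7.7 in × 7.7 in × 3.7 in) are Apple’s published numbers, and the ~17.75 in of usable width inside a 19 in rack is my assumption.

```python
import math

# Mac Studio footprint and rack assumptions (see lead-in above).
MAC_W, MAC_D, MAC_H = 7.7, 7.7, 3.7   # inches
RACK_USABLE_WIDTH = 17.75             # inches, inside a standard 19" rack (assumption)
RACK_DEPTH = 42.0                     # inches
U_HEIGHT = 1.75                       # inches per rack unit

# Horizontal: Macs lie flat, side by side, one every 8" of depth.
horiz_per_row = math.floor(RACK_USABLE_WIDTH / MAC_W)   # 2 wide
horiz_rows = math.floor(RACK_DEPTH / 8.0)               # 5 deep
horiz_u = math.ceil(MAC_H / U_HEIGHT)                   # 3U tall
print(f"Horizontal: {horiz_per_row * horiz_rows} Macs per {horiz_u}U")

# Vertical: Macs stand on their side, so width becomes 3.7" and height 7.7".
vert_per_row = math.floor(RACK_USABLE_WIDTH / MAC_H)    # 4 wide
vert_rows = math.floor(RACK_DEPTH / MAC_D)              # 5 deep
vert_u = math.ceil(MAC_W / U_HEIGHT)                    # 5U tall
print(f"Vertical: {vert_per_row * vert_rows} Macs per {vert_u}U")
# -> 10 per 3U horizontal, 20 per 5U vertical
```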

Since we need access to power and the Thunderbolt ports on the front and the back, let us build in 1U below for the 2 front Thunderbolt ports, and 3U above for power and the 4 rear Thunderbolt ports.

With all these network interfaces, a 1U, 40-port QSFP56 switch seems right. Each QSFP56 port breaks out to 4× SFP56; the port math is sketched below.
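
A quick check of the port math per pod, assuming 20 Macs per pod, 6 SFP56 links per Mac, and one 40-port switch with 4× breakout per port:

```python
# Port math for one "Apple Pod" (assumptions: 20 Macs, 6 SFP56 links each,
# one 40-port QSFP56 switch with 4x SFP56 breakout per port).
MACS_PER_POD = 20
SFP56_LINKS_PER_MAC = 6
BREAKOUT_PER_QSFP56 = 4
SWITCH_QSFP56_PORTS = 40

sfp56_links = MACS_PER_POD * SFP56_LINKS_PER_MAC         # 120 links per pod
qsfp56_needed = -(-sfp56_links // BREAKOUT_PER_QSFP56)   # ceiling division -> 30
print(f"{sfp56_links} SFP56 links per pod -> {qsfp56_needed}/{SWITCH_QSFP56_PORTS} "
      f"QSFP56 ports used per switch")
# -> 120 links, 30 of 40 QSFP56 ports used per switch (10 spare)
```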

Unfortunately, the market has not seen fit to develop SFP56-to-Thunderbolt 4 adapters (I wonder why 😉). The best method of dealing with this is a PCIe NIC in a Thunderbolt 4-to-PCIe adapter. It is very expensive and involves disassembling 360 Thunderbolt 4 housings. Hope you have strong hands and a charged screwdriver. These, once stripped, will fit into the extra 4U we allowed for in the Mac Studio enclosure.
Outside of power, we now have all the components needed, and they fit in 10U. We will call it the “Apple Pod.”

Since we have 48U, the assumption is 3 “Apple Pods,” leaving 18U for power and other needs. First, we need to figure out performance and power needs, which leads nicely to:
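
Here is a minimal sketch of the rack-unit budget, assuming the 10U pod above, 3U per UPS, and 3U of PDUs (the UPS and PDU counts come from the Power section below):

```python
# Rack-unit budget for the full 48U rack (assumptions from the text).
RACK_U = 48
POD_U, PODS = 10, 3          # 3 "Apple Pods" at 10U each
UPS_U, UPS_COUNT = 3, 5      # Eaton 9PX6K, 3U each (see Power section)
PDU_U = 3                    # PDUs to split power out to each pod

used_u = PODS * POD_U + UPS_COUNT * UPS_U + PDU_U
print(f"Used: {used_u}U of {RACK_U}U")
# -> 30U of pods + 15U of UPS + 3U of PDU = 48U exactly
```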

Performance

CFD testing (USM3D from NASA) puts each Mac at roughly 180 Gflops using the CPU performance cores.

GPU compute sits at ~20 Tflops of FP32 per Mac, twice that in FP16.

Assuming the Neural Engine receives support, add ~40 Tflops of FP16 per Mac.

128 GB of unified memory per machine, shared between CPU and GPU.

For our 3 Apple Pod cluster:

  • 1.2 Pflops of FP32
  • 4.8 Pflops of FP16
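
For anyone who wants to check the roll-up, here is a minimal sketch using the per-Mac figures above; the FP16 total assumes GPU FP16 and Neural Engine FP16 simply add.

```python
# Cluster-level roll-up of the per-Mac performance figures above.
MACS = 60
CPU_GFLOPS = 180            # USM3D CFD estimate, performance cores only
GPU_FP32_TFLOPS = 20
GPU_FP16_TFLOPS = 40        # twice FP32
ANE_FP16_TFLOPS = 40        # assumes the Neural Engine becomes usable

fp32_pflops = MACS * GPU_FP32_TFLOPS / 1000
fp16_pflops = MACS * (GPU_FP16_TFLOPS + ANE_FP16_TFLOPS) / 1000
cpu_tflops = MACS * CPU_GFLOPS / 1000
print(f"FP32: {fp32_pflops} Pflops, FP16: {fp16_pflops} Pflops, CPU CFD: ~{cpu_tflops:.1f} Tflops")
# -> 1.2 Pflops FP32, 4.8 Pflops FP16, ~10.8 Tflops of CPU CFD throughput
```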

Power

Between 60 Macs, 360 sets of adapters, and 3 switches, I calculated 24.3 kW.

To deal with that, I chose Eaton 9PX6K UPSs. Each is 5.4 kW/6 kVA and 3U, leaving us roughly a 10% power buffer.
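
A quick sketch of the UPS sizing, assuming the 24.3 kW total above and the 5.4 kW rating per 9PX6K:

```python
import math

# UPS sizing check (assumptions: 24.3 kW total load, Eaton 9PX6K at 5.4 kW each).
TOTAL_LOAD_KW = 24.3
UPS_KW = 5.4

ups_needed = math.ceil(TOTAL_LOAD_KW / UPS_KW)   # 5 units
capacity_kw = ups_needed * UPS_KW                # 27 kW
headroom = capacity_kw / TOTAL_LOAD_KW - 1
print(f"{ups_needed} UPSs -> {capacity_kw} kW capacity, ~{headroom:.0%} headroom")
# -> 5 UPSs, 27 kW, ~11% headroom (the "roughly 10% buffer" above)
```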

Finally, the remaining 3U is for PDUs to split out power to each Apple Pod.

[Screenshot: Screen Shot 2023-01-13 at 7 56 06 PM]

Parts

Component selection:

  • Maxed-out Mac Studio (M1 Ultra) * 60
  • MQM8700-HS2F 40-port 200G QSFP56 switches * 3
  • QSFP56-to-SFP56 breakout direct attach copper cables * 15
  • Sonnet Thunderbolt 4-to-PCIe adapters * 360
  • Mellanox ConnectX-6 PCIe SFP56 network interfaces * 360
  • Custom Mac Studio rack shelf enclosures * 3
  • Eaton 9PX6K UPSs * 5

Cost

A few assumptions, largely in line with the market.

  • 10% off Apple EDU pricing, each Mac then costing $7,200.
  • You can buy at near street price.
  • You are not adding 60 PiKVMs.
  • You buy +1 of everything for hot swap.
  • Total taxes are 10%.

[Screenshot: Screen Shot 2023-01-13 at 7 54 59 PM]

Total after taxes: $1.12M
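
If you want to rebuild the tally yourself, here is a sketch of the structure; only the $7,200 Mac price comes from the text above, and the other unit prices are placeholders you would fill in from the spreadsheet linked at the end.

```python
# Sketch of the cost tally. Unit prices below are PLACEHOLDERS (only the
# $7,200 Mac price comes from the text); plug in the spreadsheet's figures
# to reproduce the $1.12M total.
parts = {
    # name: (quantity, unit_price_usd)
    "Mac Studio (M1 Ultra, maxed out)": (60, 7_200),
    "MQM8700-HS2F QSFP56 switch":       (3, 0),     # placeholder price
    "QSFP56->SFP56 breakout DAC":       (15, 0),    # placeholder price
    "Sonnet TB4->PCIe adapter":         (360, 0),   # placeholder price
    "ConnectX-6 SFP56 NIC":             (360, 0),   # placeholder price
    "Custom rack shelf enclosure":      (3, 0),     # placeholder price
    "Eaton 9PX6K UPS":                  (5, 0),     # placeholder price
}
TAX_RATE = 0.10
SPARES = 1   # "+1 of everything for hot swap"

subtotal = sum((qty + SPARES) * price for qty, price in parts.values())
total = subtotal * (1 + TAX_RATE)
print(f"Subtotal: ${subtotal:,.0f}  Total with {TAX_RATE:.0%} tax: ${total:,.0f}")
```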

Caveats

  • You can definitely get the Thunderbolt/SFP56 parts for less.
  • I did not include some sort of large storage array, nor the cost for cables/optics to get data to and from this rack.
  • The GPU portions rely on GPU compute coming to Linux.
  • Redundancy is not really a thing here.
  • No ECC beyond what LPDDR5 already provides.
  • ARM on Linux with unsupported hardware will be “fun.”

Conclusion

[Screenshot: Screen Shot 2023-01-13 at 7 51 33 PM]

At the very least, it is an interesting thought exercise.

**Edit:** followed up with a similar piece on Orin AGX: https://gist.github.com/FCLC/7d75d12e4c368c13e400fda1475da673

Here’s a spreadsheet with all the juicy data: https://www.icloud.com/numbers/05fF59zXuDyHbYG2L4Fzuz8Lg#Cursed_cluster

CFD benchmarks here: http://hrtapps.com/blogs/20220427/
