What if we built a Jetson Orin AGX micro cluster?

The follow-up to the M1 cluster: Jetson Orin AGX

Original M1 Piece here: https://gist.github.com/FCLC/6e0f0e79e9d4f5740573f09d7579eb72

No system exists in a vacuum, and so as a follow-up to the M1 cluster, I thought I’d look at a similar cluster based on another integrated ARM device.

Oracle typically builds an RPi cluster every few years. Their most recent unit, built from 1,060 RPi 3B+ boards, is an interesting piece of tech. Another is the 750-Pi cluster built by LANL. But Pi clusters seem like the domain of Jeff Geerling and co., so let’s look at something else. The most popular developer board family is the Nvidia Jetson series, and the most powerful unit is the latest Orin AGX 64GB.

Setting the stage

As with the M1 cluster, a few baselines before I go on:

  • We are going to go through this thought experiment from the point of view of a small laboratory/bootstrap cluster that can only use a single 48U, 42” deep rack.
  • You use the built-in 10GBase-T Ethernet as a BMC of sorts.

Density

There’s this interesting product from a Canadian company called Connect Tech that fits 12 Jetson Xavier AGX modules in 1U. I’m going to suppose that they’ll be following up with a model that supports Orin AGX Soon™. These 1U servers integrate switching for the built-in Ethernet and expose the built-in PCIe links, which we can use for storage and for connecting network interfaces. Speaking of which, if we’re going to be doing “proper” HPC, we need proper network bandwidth, and over the PCIe 4.0 x8 interface we can add a QSFP56 adapter.
12 modules means 12 QSFP56 ports per U, and at ~25U of compute we hit 300 modules. Accounting for the need to uplink, 8 * 40-port switches should do the trick (320 ports: 300 for nodes, 20 left for uplinks). That leaves us with 15U for power, which just barely works out, leaving us with ~3% peak-draw headroom.
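
A quick back-of-the-envelope sketch of the rack and port budget (a minimal Python sketch; the 1U-per-switch and 3U-per-UPS figures are assumptions that match the parts list below):

```python
# Rack/port budget sketch. Assumes 12 modules per 1U carrier, 1U per 40-port
# switch, and 3U per UPS (matching the parts chosen below).
RACK_U = 48
MODULES_PER_U = 12
SWITCH_PORTS = 40

compute_u = 25
modules = compute_u * MODULES_PER_U                  # 300 modules, one QSFP56 port each
switches = 8                                         # 8 * 40 = 320 switch ports
uplink_ports = switches * SWITCH_PORTS - modules     # 20 ports left over for uplinks

power_u = RACK_U - compute_u - switches * 1          # 15U left for UPSs
ups_count = power_u // 3                             # 5 UPSs at 3U each

# Note: PCIe 4.0 x8 tops out around 8 lanes * 16 GT/s * 128/130 ≈ 126 Gb/s per
# direction, so the NIC's host link, not the 200G QSFP56 port, is the ceiling.
print(modules, uplink_ports, power_u, ups_count)     # 300 20 15 5
```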

Performance

Unfortunately, the 64GB of built-in eMMC isn’t going to cut it. Adding an 8TB Sabrent NVMe SSD per module takes care of storage.

GPU compute is roughly 5.3 TFLOPS of FP32 per module, with FP16 in the ~95 TFLOPS range.

64 GB of memory per module, shared between CPU and GPU.

For our 25U of 12-module servers (300 modules):
  • ~1.6 PFLOPS of FP32
  • ~28.5 PFLOPS of FP16
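
Putting the per-module numbers together, as a sketch using the figures quoted above:

```python
# Aggregate throughput sketch from the per-module figures above.
modules = 25 * 12                 # 300 Orin AGX 64GB modules
fp32_per_module_tflops = 5.3
fp16_per_module_tflops = 95

fp32_pflops = modules * fp32_per_module_tflops / 1000   # ~1.6 PFLOPS
fp16_pflops = modules * fp16_per_module_tflops / 1000   # 28.5 PFLOPS
memory_tb = modules * 64 / 1000                         # 19.2 TB of shared CPU/GPU memory
nvme_tb = modules * 8                                   # 2,400 TB of NVMe scratch

print(f"{fp32_pflops:.1f} PFLOPS FP32, {fp16_pflops:.1f} PFLOPS FP16")
```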

Power

Between the 300 modules, network adapters, NVMe drives and 8 switches, I calculated ~26.2 kW.

As I did in the original Mac piece, I chose Eaton 9PX6K UPSs. Each is 5.4 kW/6 kVA in 3U; five of them supply 27 kW against 26.2 kW of draw, leaving us a ~3% power buffer.
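
The headroom math, as a sketch (the 26.2 kW figure is the total from above; the per-UPS rating is Eaton’s spec):

```python
# Power headroom sketch: 5 UPSs vs. the ~26.2 kW calculated above.
total_draw_kw = 26.2          # modules + NICs + NVMe + switches
ups_kw = 5.4                  # Eaton 9PX6K: 6 kVA / 5.4 kW
ups_count = 5

capacity_kw = ups_count * ups_kw              # 27.0 kW
headroom = 1 - total_draw_kw / capacity_kw    # ~0.03 -> ~3% at peak draw
print(f"{capacity_kw:.1f} kW of UPS capacity, {headroom:.1%} headroom")
```
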
[Screenshot: power budget table]

Parts

Component selection:

  • Nvidia Jetson Orin AGX 64GB * 300
  • MQM8700-HS2F 40-port 200G QSFP56 switches * 8
  • QSFP56 direct attach copper cables * 300
  • Mellanox ConnectX-6 PCIe to QSFP56 adapters * 300
  • Sabrent Rocket 8TB * 300
  • Eaton 9PX6K UPSs * 5

Cost

A few assumptions, in line with the market:

  • 1k-unit pricing
  • You can buy everything at near street prices.
  • You are not adding 300 PiKVMs.
  • You buy +2 of everything for hot swap.
  • Total taxes are 10%.

[Screenshot: cost breakdown table]

Total after taxes: $1.49M
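
To show how the total is built up, here’s a sketch of the cost model. The unit prices below are placeholders I made up (the real numbers live in the linked spreadsheet), but the structure — +2 hot spares per line item and 10% tax on the subtotal — follows the assumptions above.

```python
# Cost model sketch. Unit prices are placeholder guesses, NOT the real quotes;
# with these placeholders the output lands near, but not exactly on, the
# $1.49M total from the actual spreadsheet.
SPARES = 2
TAX = 0.10

# part name: (quantity, hypothetical unit price in USD)
parts = {
    "Jetson Orin AGX 64GB":    (300, 2000),
    "MQM8700-HS2F switch":     (8,   12000),
    "QSFP56 DAC cable":        (300, 100),
    "ConnectX-6 QSFP56 NIC":   (300, 900),
    "Sabrent Rocket 8TB NVMe": (300, 1100),
    "Eaton 9PX6K UPS":         (5,   4000),
}

subtotal = sum((qty + SPARES) * price for qty, price in parts.values())
total = subtotal * (1 + TAX)
print(f"${total / 1e6:.2f}M")
```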

Caveats

  • I did not include some sort of large storage array, nor the cost for cables/optics to get data to and from this rack.
  • Redundancy is not particularly great.
  • No ECC beyond what LPDDR5 already provides.

Conclusion

[Screenshot: summary table]

Once again, it’s an interesting thought exercise.

Here’s a spreadsheet with all the juicy data: https://www.icloud.com/numbers/05fF59zXuDyHbYG2L4Fzuz8Lg#Cursed_cluster
