@scyto
Last active July 20, 2024 20:32
proxmox cluster proof of concept

ProxMox Cluster - Soup-to-Nutz

aka what i did to get from nothing to done.

note: these are primarily a re-install guide for myself (writing things down helps me memorize the knowledge). As such, don't take any of this on blind faith - some areas are well tested and the docs are very robust, some items less so. YMMV

Purpose of Proxmox cluster project

Required Outcomes of cluster project

image

The first 3 NUCs are the new proxmox cluster, the second set of 3 NUCs is the old Hyper-V nodes.

Update as of 9/30/2023: this cluster is no longer a PoC and is my production cluster for all my VMs and docker containers (in a VM based swarm).

All my initial objectives have been achieved and then some. All VMs are migrated from Hyper-V and working (despite some stupidity on my part - though i learnt a lot!)

I will update if and when i make major changes, redesign or add new capabilities, but to be clear i now consider this gist set complete for my needs and have no more edits planned.

If you spot a critical typo let me know and I can change it, but as these are notes for me (not a tutorial) i make no promises :-)

Outcomes

  1. Hardware and Base Proxmox Install

  2. Thunderbolt Mesh Networking Setup

  3. Enable OSPF Routing On Mesh network - deprecated - old gist here

  4. Enable Dual Stack (IPv4 and IPv6) Openfabric Routing on Mesh Network

  5. Setup Cluster

  6. Setup Ceph and High Availability

  7. Create CephFS and storage for ISOs and CT Templates

  8. Setup HA Windows Server VM + TPM

  9. How to migrate Gen2 Windows VM from Hyper-V to Proxmox

    1. Notes on migrating my real world domain controller #2
    2. Notes on migrating my real world domain controller #1 (FSMO holder, AAD Sync and CA server)
    3. Notes on migrating my windows (server 2019) admin center VM
  10. Migrate HomeAssistant VM from Hyper-V

  11. Migrate my debian VM based docker swarm from Hyper-V to proxmox

  12. Extra Credit (optional):

    1. Enable vGPU Passthrough (+windows guest, CT guest configs)
    2. Install Lets Encrypt Cert (CloudFlare as DNS Provider)
    3. Azure Active Directory Auth
    4. Install Proxmox Backup Server (PBS) on synology with CIFS backend
    5. Send email alerts via O365 using Postfix HA Container
  13. Random Notes & Troubleshooting

TODO

  • add TLS to the mail relay? with LE certs? maybe?
  • maybe send syslog to my syslog server (securely)
  • figure out ceph public/cluster running on different networks - unclear if it's needed for this size of install
  • get all nodes listening to my network UPS and shut down before power runs out
  • For the docker VMs, implement both cephfs via virtiofs and a ceph docker volume, and test which i like best in a swarm - using this ceph volume guide and this mounting guide by Drallas, and one of these three ceph volume plugins: Brindster/docker-plugin-cephfs, flaviostutz/cepher, n0r1sk/docker-volume-cephfs. Each has different strengths and weaknesses (i will likely choose either the n0r1sk or the Brindster one).

Purpose of cluster

I have been using Hyper-V for my docker swarm cluster VM hosts (see other gists). The original intention was to get Thunderbolt Networking going for a Hyper-V cluster, along with clustered storage for the VMs. This turned out to be super hard when using NUCs as cluster nodes due to too few disks. I looked at solar winds as an alternative but this was both complex and not pervasive.

I had been watching proxmox for years and thought now was a good time to jump in and see what it is all about. (i had never booted or looked at proxmox UI before doing this - so this documentation is soup to nuts and intended for me to repro if needed)

Goals of Cluster

  1. VMs running on clustered storage {completed}
  2. Use of Thunderbolt for ~26Gbps Cluster VM operations (replication, failover etc)
    • Thunderbolt mesh with OSPF routing {completed}
    • Ceph over thunderbolt mesh {completed}
    • VM running with live migration {completed}
    • VM running with HA failover on node failure {completed}
    • Separate VM/CT Migration network over thunderbolt mesh {not started}
  3. Use low powered off the shelf Intel NUCs {completed}
  4. Migrate VMs from Hyper-V:
    • Windows Server Domain Controller / DNS / DHCP / CA / AAD SYNC VMs {not started}
    • Debian Docker Host (for my running 3 node swarm) VMs {not started}
    • HomeAssistant VM {not started}
  5. Sized to last me 5+ years (lol, yeah, right)

Hardware Selected

  1. 3x 13th Gen Intel NUCs (NUC13ANHi7):
    • Core i7-1360P Processor(12 Cores, 5.0 GHz, 16 Threads)
    • Intel Iris Xe Graphics
    • 64 GB DDR4 3200 CL22 RAM
    • Samsung 870 EVO SSD 1TB Boot Drive
    • Samsung 980 Pro NVME 2 TB Data Drive
    • 1x Onboard 2.5Gbe LAN Port
    • 2x Onboard Thunderbolt4 Ports
    • 1 x 2.5Gbe using Intel NUCIOALUWS nvme expansion port
  2. 3 x OWC TB4 Cables

Key Software Components Used

  1. Proxmox v8.x
  2. Ceph (included with Proxmox)
  3. LLDP (included with Proxmox)
  4. Free Range Routing - FRR OSPF - (included with Proxmox)
  5. nano ;-)

Key Resources Leveraged

Proxmox/Ceph Guide from packet pushers

Proxmox Forum - several community members were invaluable in providing me a breadcrumb trail.

systemd.link manual pages

udevadm manual

udev manual

@Mister-Odd

Mister-Odd commented Dec 9, 2023

Maybe I should return my NPB5's and wait until next spring when the (now) ASUS NUC 14 comes out with Thunderbolt 5 and 80Gb ...
I'm just doing this for S&Gs and it's getting obsessive - how bad do I really need to self-host and maintain cluster nodes? At this point I'm feeling like I'm trying to force a square peg into a round hole. Your gist is the most complete (and rare) guide even talking about this idea.
I'm super impressed you got all this working reliably!

@pieter-v-n

I am building a cluster with 3 Bee-link GTR7 mini-pc's (AMD Ryzen 7840HS), each of which provides 2 USB4 ports. When I configure them for fully meshed thunderbolt-net (mostly following @scyto's instructions), I get IPv4 and IPv6 working at approx. 12 Gbps when tested with iperf3. Further inspection of the USB values shows that the hosts interconnect only at 20 Gbps, not the expected 40 Gbps. I have tried Thunderbolt 4 cables from 2 different vendors (not Apple but from China), and the specs of these mini-pc's clearly state that USB4 provides 40 Gbps, as do the specs of the cables. For now, I am OK with the performance as my goal was to prove that thunderbolt-net works over USB4 on the AMD platform.
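
For anyone wanting to repeat the link-speed inspection described above, here is a minimal Python sketch. It assumes the Linux thunderbolt driver exposes per-lane speed and lane count as `rx_speed` (text such as `20.0 Gb/s`) and `rx_lanes` under `/sys/bus/thunderbolt/devices/`; that layout and text format are assumptions to verify on your own kernel, and the function names are made up for illustration.

```python
import re
from pathlib import Path


def aggregate_gbps(speed_text: str, lanes_text: str) -> float:
    """Multiply a per-lane speed such as '20.0 Gb/s' by a lane count such as '2'."""
    match = re.match(r"\s*([0-9.]+)\s*Gb/s", speed_text)
    if match is None:
        raise ValueError(f"unrecognised speed: {speed_text!r}")
    return float(match.group(1)) * int(lanes_text.strip())


def report_links(root: str = "/sys/bus/thunderbolt/devices") -> dict[str, float]:
    """Return {device name: aggregate RX Gb/s} for devices that expose both attributes."""
    links: dict[str, float] = {}
    base = Path(root)
    if not base.is_dir():
        # No thunderbolt bus visible (driver not loaded, or not Linux).
        return links
    for dev in sorted(base.iterdir()):
        speed, lanes = dev / "rx_speed", dev / "rx_lanes"
        if speed.is_file() and lanes.is_file():
            links[dev.name] = aggregate_gbps(speed.read_text(), lanes.read_text())
    return links


if __name__ == "__main__":
    for name, gbps in report_links().items():
        print(f"{name}: {gbps:.0f} Gb/s")
```

A link reported as 10 Gb/s per lane on 2 lanes would match the 20 Gbps interconnect described in this comment.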

@Mister-Odd

Mister-Odd commented Dec 10, 2023

Very cool that you did it with the AMD over USB4! How is the latency?
As for the cables - the Apple cables have some rare tech in them (why they are so over-priced). The chips, shielding, and twists in the Apple cables are specifically there to filter noise (interference) and handle other tasks. It's rather a bit (pun intended) of cool tech. I'm not an Apple fan normally - but those cables may actually increase your speeds. Watch Adam Savage's "Tested" [video] or the article for info - if you're interested.

@Mister-Odd

I have noticed that (normally) the 40Gbps USB4 ports are sharing the same 40Gbps lane(?) - so 20Gbps is expected when using both ports at once.
Mine looks like this:
/0/100/d/1 usb4 bus xHCI Host Controller
/0/100/d.2 bus Intel Corporation
/0/100/d.3 bus Intel Corporation

@scyto
Author

scyto commented Dec 11, 2023

20 Gbps not the expected 40 Gbps

on intel hardware the limit is the DMA controllers - that's why even at a 40gbps connection one only gets 26gbps
as for why you only have 20gb - sounds like a hardware limitation on your platform. It seems you may only have one PCIE lane, not 2, per controller for some reason - actually IIRC that's exactly the reason, because it's optional on USB4, and that is the difference between USB4 and TB4, which is a superset of the spec. It may also depend on how many retimers your HW implemented; i would need to go look at the spec but i think the # of retimers is an optional thing

@scyto
Author

scyto commented Dec 11, 2023

yup spec:

3 Electrical Layer
3.1 On-Board Re-timers
An On-Board Re-timer shall implement an Electrical Layer as defined in the USB4 Specification with the 
following changes:
• An On-Board Re-timer shall support Gen 2 speed of 10Gbps. Support for other speeds is optional.
• An On-Board Re-timer shall support two Lanes

and

3.2 Cable Re-timers
A Cable Re-timer shall meet the requirements in the USB4 Specification with the following changes:
• A Cable Re-timer shall support Gen 2 speed of 10Gbps and Gen 3 speed of 20Gbps.
• A Cable Re-timer shall support two Lanes

and from wiki

USB4 products must support 20 Gbit/s throughput and can support 40 Gbit/s throughput

as such 40 gbps seems optional.... and given that current DMA controllers limit real world throughput to 26Gbps - no wonder many won't bother supporting more than 20, saving some money...

this is why i only buy true TB4 certified hardware - it requires the 40gbps.
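
The arithmetic behind those figures can be sketched in a few lines of Python. This only restates numbers quoted in this thread: Gen 2 at 10Gbps and Gen 3 at 20Gbps per lane, two lanes, and the ~26Gbps real-world DMA ceiling seen on intel hardware. The constant and function names are made up for illustration, and the 26 figure is an observation, not a spec value.

```python
# Per-lane signalling rates as quoted in the re-timer spec excerpt above.
PER_LANE_GBPS = {"gen2": 10, "gen3": 20}
LANES = 2

# Observed real-world ceiling mentioned in this thread for intel DMA
# controllers; an empirical number, not part of any specification.
DMA_CEILING_GBPS = 26


def nominal_gbps(gen: str) -> int:
    """Nominal link rate: per-lane rate times the two lanes."""
    return PER_LANE_GBPS[gen] * LANES


def expected_ceiling_gbps(gen: str) -> int:
    """Best case to hope for: the lower of the link rate and the DMA ceiling."""
    return min(nominal_gbps(gen), DMA_CEILING_GBPS)


if __name__ == "__main__":
    for gen in ("gen2", "gen3"):
        print(gen, nominal_gbps(gen), expected_ceiling_gbps(gen))
```

So a Gen 2 link tops out at 20 nominal, while a Gen 3 link is 40 nominal but bounded by the 26 ceiling in practice.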

@scyto
Author

scyto commented Dec 29, 2023

hosts interconnect only at 20 Gbps not the expected 40 Gbps

This is the min spec of USB4, 40gbps is optional on USB4. on TB4 40Gbps is required. This is why i tell people to be very careful when selecting USB4 hardware - TB4 guarantees the superset of USB4 specs.

@scyto
Author

scyto commented Dec 29, 2023

Very cool that you did it with the AMD over USB4! How is the latency?

definitely, nice to know its working!

@scyto
Author

scyto commented Dec 29, 2023

the specs of these mini-pc's clearly state that USB4 provides 40 Gbps

i couldn't find that claim anywhere on their website (i see resellers making the claim, but not beelink). It might be worth contacting bee-link and asking, maybe they have a USB4 BIOS issue....

@DarkPhyber-hg

I am planning on setting this up once I get my mini PCs, question for you, 26gbe is that per port? Or aggregate? If you run iperf3 from node a to nodes b and c at the same time, do they both get 26gbps each?
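
One way to answer the per-port vs aggregate question empirically is to run iperf3 with JSON output (`-J`) against both peers at the same time and add up the results. The Python sketch below only does the adding up. It assumes the usual iperf3 JSON layout with `end.sum_received.bits_per_second`; the peer names and the throughput numbers in the example are placeholders, not measurements.

```python
import json


def received_gbps(iperf_json: str) -> float:
    """Extract receiver-side throughput in Gbps from one iperf3 -J result."""
    result = json.loads(iperf_json)
    return result["end"]["sum_received"]["bits_per_second"] / 1e9


def summarize(results: dict[str, str]) -> dict[str, float]:
    """Map each peer to its Gbps and add an 'aggregate' entry with the total."""
    per_peer = {peer: received_gbps(text) for peer, text in results.items()}
    per_peer["aggregate"] = sum(per_peer.values())
    return per_peer


if __name__ == "__main__":
    # Placeholder payloads standing in for two concurrent iperf3 -J runs.
    fake = {
        "node-b": json.dumps({"end": {"sum_received": {"bits_per_second": 13e9}}}),
        "node-c": json.dumps({"end": {"sum_received": {"bits_per_second": 12e9}}}),
    }
    print(summarize(fake))
```

If each peer alone reaches the single-link figure but the aggregate stays near that same figure when both run together, the limit is shared rather than per port.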

@rlabusiness

rlabusiness commented Apr 3, 2024

@scyto Thank you for sharing your depth and breadth of experience here!

A few questions for you:

  1. Would you mind sharing what BIOS/firmware version you're running on your Intel NUC 13 Pro i7's?
  2. Did you update your firmware before installing?
  3. Have you had any experience with different firmwares on these NUCs?
  4. Is there a version that you'd recommend?

I just bought 3 of the exact same model as you in order to replicate your build (after I experienced an issue with a single NUC 13 Pro i5 running Proxmox that caused me some major headaches).

@rlabusiness

@scyto Well - I'm responding to my own question here. Haha.

I was googling for more information on the BIOS firmware that came on all 3 of the NUCs that arrived this week (ANRPL357.0026.2023.0314.1458), and I quickly found the Proxmox forum thread below where you mention that this is the stable BIOS version you stuck with in your build. I'm thrilled about that!

https://forum.proxmox.com/threads/intel-nuc-13-pro-thunderbolt-ring-network-ceph-cluster.131107/post-582678

The last question I have before I do my deep dive is about which version of Proxmox to install. I recall reading in one of the 20 or so threads I've read that a later kernel version (6.5?) causes issues. So I'll be researching that a bit more before starting, but if you (or anyone) wants to provide a shortcut to a solid recommendation there, it would be much appreciated.

@scyto
Author

scyto commented Apr 4, 2024

yes you need the proxmox kernel version 6.2.16-14-pve or higher to ensure when nodes power cycle the mesh doesn't break and to enable IPv6 correctly

to be clear I haven't upgraded my kernel beyond 6.2.16-14-pve - so i haven't tested to ensure nothing else has broken since then, so let me know if you hit any issues
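
A quick way to check a node against that minimum is to compare the running kernel release numerically rather than as a string. The Python sketch below assumes release strings of the form `X.Y.Z-N-pve`, as in the version quoted above; the function names are made up for illustration, and other release formats are rejected rather than guessed at.

```python
import platform
import re

# Minimum kernel named in the comment above: 6.2.16-14-pve.
MINIMUM = (6, 2, 16, 14)


def parse_pve_kernel(release: str) -> tuple[int, ...]:
    """Turn '6.2.16-14-pve' into (6, 2, 16, 14) for numeric comparison."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-pve", release.strip())
    if match is None:
        raise ValueError(f"not a recognised pve kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(release: str, minimum: tuple[int, ...] = MINIMUM) -> bool:
    """True when the release is at or above the minimum version."""
    return parse_pve_kernel(release) >= minimum


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    try:
        print(running, "OK" if meets_minimum(running) else "too old")
    except ValueError as err:
        print(err)
```

Comparing tuples of integers avoids the string-comparison trap where `6.2.16-3` would sort after `6.2.16-14`.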

@scyto
Author

scyto commented Apr 4, 2024

fatal last words, all nodes now on 6.5.13-3-pve everything seems fine

of course install from 6.5 might be a different bag if there are setup issues so YMMV

@rlabusiness

@scyto - That’s great! Thanks for taking the plunge to validate 6.5. I’ll plan to start with the latest PVE installer and won’t shy away from updating. I’ll also keep an eye out for any anomalies in the process and will report back either way.

Unfortunately I’m traveling at the moment, so I won’t be able to get this built out until next week, but I’m even more excited now. If you notice anything strange with 6.5 over the next few days, please share; otherwise, I hope my next report will be one of success!

@Allistah

Allistah commented May 9, 2024

First off, thanks so very much for putting this guide together - really appreciate it! I had a question now that you have had your setup running for some time. You installed a 1TB SSD as the boot drive and a 2TB NVMe drive for the VMs. How many VMs are you running and how is your free space looking today? Was the 1TB boot drive too much? I'm curious if a 512GB SSD would have been plenty or not. Once I get two more NUC 13 Pros, I'm going to start over and give this guide a try from the ground up! I currently have the 13 Pro and two old MacBook Pros as a cluster, but replication and migrations are tough since it's over a 1Gb network and nodes 2 and 3 only have 16GB of RAM. Thanks again - really looking forward to trying this out!

@SchuFire

Greetings and thank you for this write up.

I am working on proving this out on a three NUC cluster. I have the network up and running. However, after reboots, sometimes the routing is set-up wonky where node1 routes through node2 to get to node3 even though node1 and node3 are directly connected. Was wondering if anyone has seen this behavior and how it can be addressed. I can get the routes correct but it takes some time restarting thunderbolt ports and/or restarting frr.

Thanks in advance.

Steve
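
To spot the indirect-route condition described above without reading the routing table by eye, a small check can compare each peer's route against the interface it is cabled to. The Python sketch below assumes the JSON output of `ip -j route show` (a list of objects with `dst` and `dev` keys); the peer addresses and interface names are hypothetical placeholders. It only detects the condition and does nothing to fix it.

```python
import json


def indirect_routes(route_json: str, expected: dict[str, str]) -> list[str]:
    """List peers whose route leaves via a different interface than expected.

    `expected` maps a peer address (exactly as it appears in the route's
    'dst' field) to the interface cabled directly to that peer.
    """
    problems = []
    for route in json.loads(route_json):
        dst = route.get("dst")
        if dst in expected and route.get("dev") != expected[dst]:
            problems.append(
                f"{dst} via {route.get('dev')} (expected {expected[dst]})"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical table on node1: node3 is reached through the node2 link.
    sample = json.dumps([
        {"dst": "192.0.2.2", "dev": "tb-a"},
        {"dst": "192.0.2.3", "dev": "tb-a"},
    ])
    wanted = {"192.0.2.2": "tb-a", "192.0.2.3": "tb-b"}
    for line in indirect_routes(sample, wanted):
        print(line)
```

Run after a reboot, an empty result would indicate every peer is reached over its direct link.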
