@scyto
Last active April 18, 2024 17:26
setting up the ceph cluster

CEPH HA Setup

Note: this should only be done once you are sure you have a reliable TB mesh network.

This is because the Proxmox UI seems fragile with respect to changing the underlying network after Ceph has been configured.

All installation is done via the command line because the GUI does not understand the mesh network.

This setup doesn't attempt to separate the Ceph public network and the Ceph cluster network (not the same as the Proxmox cluster network). The goal is to get an easy working setup.

This gist is part of this series.

Ceph Initial Install & Monitor Creation

  1. On all nodes, run `pveceph install --repository no-subscription`, accept all the packages, and install.
  2. On node 1, run `pveceph init --network 10.0.0.81/24` (see the example ceph.conf excerpt after this list).
  3. On node 1, run `pveceph mon create --mon-address 10.0.0.81`
  4. On node 2, run `pveceph mon create --mon-address 10.0.0.82`
  5. On node 3, run `pveceph mon create --mon-address 10.0.0.83`
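
For reference, `pveceph init` writes the network settings into `/etc/pve/ceph.conf`. With a single `--network` as above, the relevant part of the generated file should look roughly like this (a sketch; the fsid is generated for your cluster, the mon_host line fills in as monitors are created, and other defaults are omitted):

```
[global]
    public_network = 10.0.0.81/24
    cluster_network = 10.0.0.81/24
    mon_host = 10.0.0.81 10.0.0.82 10.0.0.83
    fsid = <generated-uuid>
```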

Now if you access the GUI at Datacenter > pve1 > Ceph > Monitor you should have 3 running monitors (ignore any errors on the root Ceph UI leaf for now).

If so, you can proceed to the next step. If not, you probably have something wrong in your network; check all settings.
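
If you prefer to check from the shell instead of the GUI, the standard Ceph status commands work on any node (a sketch; both should show all three monitors in quorum):

```
# overall cluster status, including monitor quorum and health
ceph -s

# just the monitor map / quorum summary
ceph mon stat
```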

Add Additional Managers

  1. On any node go to Datacenter > nodename > Ceph > Monitor and click Create in the Manager section.
  2. Select a node that doesn't have a manager from the drop-down and click Create.
  3. Repeat step 2 as needed (a CLI alternative is sketched below). If this fails, it probably means your networking is not working.
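
If the GUI gives you trouble, the managers can also be created from the shell with `pveceph` (a sketch; run it on each node that should host a manager):

```
# run on each node that does not yet have a manager
pveceph mgr create
```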

Add OSDs

  1. On any node go to Datacenter > nodename > Ceph > OSD.
  2. Click Create: OSD and accept all the defaults (again, this is for a simple setup).
  3. Repeat until all 3 nodes have an OSD (note it can take around 30 seconds for a new OSD to go green). A CLI alternative is sketched below.
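
The OSDs can also be created from the shell on each node (a sketch, assuming the dedicated disk shows up as /dev/nvme0n1; adjust the device path for your hardware):

```
# run on each node; creates one OSD on the dedicated disk with default settings
pveceph osd create /dev/nvme0n1
```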

If you find there are no available disks when you try to add an OSD, it probably means your dedicated NVMe/SSD has some other filesystem or an old OSD on it. To wipe the disk, use the wipe disk option in the UI. Be careful not to wipe your OS disk.
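
The wipe can also be done from the shell if you prefer (a sketch; this is destructive, so triple-check the device path before running it):

```
# remove old partition tables / filesystem signatures
wipefs --all /dev/nvme0n1

# zap any leftover Ceph LVM volumes from a previous OSD
ceph-volume lvm zap /dev/nvme0n1 --destroy
```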

Create Pool

  1. On any node go to Datacenter > nodename > Ceph > Pools and click Create.
  2. Name the pool, e.g. vm-disks, leave the defaults as they are, and click Create (a CLI alternative is sketched below).
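
The CLI equivalent is a single command (a sketch; `--add_storages` also registers the pool as RBD storage for VM disks and containers, which is what the GUI does by default):

```
# create a replicated pool with the default size 3 / min_size 2 and add it as Proxmox storage
pveceph pool create vm-disks --add_storages
```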

Configure HA

  1. On any node go to Datacenter > Options.
  2. Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance nodes as needed).
  3. Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs if you gracefully shut down a node).
  4. Leave the Migration Settings at the default (a separate gist will cover separating the migration network later). The resulting config entries are sketched below.
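
These options end up in `/etc/pve/datacenter.cfg`; after the changes above the relevant lines should look roughly like this (a sketch, assuming nothing else in the file is changed; the crs line may also carry an ha= scheduler setting if you picked one):

```
crs: ha-rebalance-on-start=1
ha: shutdown_policy=migrate
```
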
@DarkPhyber-hg

DarkPhyber-hg commented Feb 23, 2024

Following up on what I've done so far: I've reinstalled Proxmox quite a few times. I couldn't go 3 minutes into restoring a VM from PBS without at least 1 node locking up hard.

In an attempt to isolate the issue, I only used 2 nodes, and I was still having the exact same issue. I'm using 2/2 replication, and in corosync.conf I gave one node 2 votes.

I decided to eliminate OpenFabric, so I am just using standard IP addressing assigned to en05 with 2 hosts. I also used Reef instead of Quincy, so I changed 2 variables. It's been working perfectly for about 8 hours, so I think this is a success.

My next test, which I'm going to start working on now, will be to add OpenFabric to the working configuration. If this doesn't work, then there's some kind of issue with TB, OpenFabric, Ceph, PVE 8.1.4, and kernel 6.5; if it does work, then the issue is likely with Quincy and the combination of variables on kernel 6.5.

@DarkPhyber-hg

OK, going to Reef did the trick; no more lockups, even with OpenFabric.

@jacoburgin

I had to lock the kernel to 6.2 to get stability on my NUC 12s.

@DarkPhyber-hg

I forgot that I had commented out the MTU of 65520, so it was defaulting to 1500; when I put it back to 65520 I got an instant lockup! I'm playing around with various MTU sizes right now. What's strange is that with an extended iperf3 test I got no lockups with the higher MTU value.
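
(For context, the 65520 MTU discussed here is set on the Thunderbolt mesh interfaces in /etc/network/interfaces; a sketch of what that stanza can look like, assuming en05 and ifupdown2, is below. Your interface names and setup may differ.)

```
# /etc/network/interfaces (excerpt) - hypothetical example of where the MTU is set
auto en05
iface en05 inet manual
    mtu 65520
```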

@DarkPhyber-hg

DarkPhyber-hg commented Feb 25, 2024

OK, I've been playing around with various MTU sizes; there's no perceivable difference on my hardware in iperf3 speeds for an MTU between 1500 and 34,000. I always wind up with an iperf3 result of around 22-23 Gbps. Going to 35,000, I get lockups with Ceph.

Using the Ceph benchmark tool rados on a write test is a good way to stress test and see if I will get a lockup without having to use a real-world load. Additionally, I consistently get the best write throughput and IOPS with an MTU of 1500 on my current hardware. I am using consumer WD SN850X M.2 drives until I get some enterprise ones, so this could have an impact as well.
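
For anyone reproducing this kind of stress test, a typical `rados bench` write run looks like this (a sketch; the pool name vm-disks is the one created earlier in this gist, adjust as needed):

```
# 60-second write benchmark; keep the objects so a read benchmark can follow
rados bench -p vm-disks 60 write --no-cleanup

# optional sequential read benchmark against the objects written above
rados bench -p vm-disks 60 seq

# remove the benchmark objects when done
rados -p vm-disks cleanup
```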

I have some Samsung PM9A3 U.2 drives on the way, along with some PM983 M.2 drives. Once I get those I'll do another round of testing and hopefully put this stack into production to replace my R730.
