@scyto
Last active April 18, 2024 17:26
setting up the ceph cluster

CEPH HA Setup

Note: this should only be done once you are sure you have a reliable TB mesh network.

This is because the Proxmox UI seems fragile with respect to changing the underlying network after Ceph has been configured.

All installation is done via the command line because the GUI does not understand the mesh network.

This setup doesn't attempt to separate the Ceph public network and the Ceph cluster network (not the same as the Proxmox cluster network). The goal is to get an easy working setup.

this gist is part of this series

Ceph Initial Install & monitor creation

  1. On all nodes execute the command pveceph install --repository no-subscription, then accept all the packages and install
  2. On node 1 execute the command pveceph init --network 10.0.0.81/24
  3. On node 1 execute the command pveceph mon create --mon-address 10.0.0.81
  4. On node 2 execute the command pveceph mon create --mon-address 10.0.0.82
  5. On node 3 execute the command pveceph mon create --mon-address 10.0.0.83

Now if you access the gui Datacenter > pve1 > ceph > monitor you should have 3 running monitors (ignore any errors on the root ceph UI leaf for now).

If so you can proceed to next step. If not you probably have something wrong in your network, check all settings.
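The per-node commands above can be sketched as a small dry-run helper. It only prints the commands for a given node number and runs nothing. The function name ceph_setup_cmds is made up for illustration, and it assumes node N uses the mesh address 10.0.0.8N as in the steps above.

```shell
#!/bin/sh
# Dry-run helper: prints the commands from the steps above for node 1, 2 or 3.
# Assumption: node N uses mesh address 10.0.0.8N. Nothing is executed.
ceph_setup_cmds() {
    node="$1"
    case "$node" in
        1|2|3) ;;
        *) echo "usage: ceph_setup_cmds <1|2|3>" >&2; return 1 ;;
    esac
    addr="10.0.0.8${node}"
    echo "pveceph install --repository no-subscription"
    # Only the first node initialises the Ceph configuration.
    if [ "$node" -eq 1 ]; then
        echo "pveceph init --network ${addr}/24"
    fi
    echo "pveceph mon create --mon-address ${addr}"
}

ceph_setup_cmds 1
```

Once the real commands have been run on all three nodes, ceph mon stat from any node's shell should list three monitors, which is the same check as the GUI step above.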

Add Additional managers

  1. On any node go to Datacenter > nodename > ceph > monitor and click create manager in the manager section.
  2. Select a node that doesn't have a manager from the drop-down and click create
  3. Repeat step 2 as needed

If this fails it probably means your networking is not working.
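If the GUI route times out, a manager can probably also be created from the shell of the node that should host it. This is a minimal sketch, not something tested on the mesh setup in this gist. The run and add_manager helpers are hypothetical, and DRY_RUN defaults to 1 so the commands are only printed.

```shell
#!/bin/sh
# Sketch: create a manager on the local node from the shell instead of the GUI.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

add_manager() {
    run pveceph mgr create   # creates a manager on the node it is run on
    run ceph -s              # the "mgr:" line should show the active and standby managers
}

add_manager
```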

Add OSDs

  1. On any node go to Datacenter > nodename > ceph > OSD
  2. click create OSD and select all the defaults (again this is for a simple setup)
  3. repeat until all 3 nodes have an OSD (note it can take 30 seconds for a new OSD to go green)

If you find there are no available disks when you try to add an OSD, it probably means your dedicated NVMe/SSD has some other filesystem or an old OSD on it. To wipe the disk use the following UI. Be careful not to wipe your OS disk. [image]
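For reference, the wipe and OSD creation can also be sketched as shell commands. The helper below only prints them. The function name osd_cmds and the device /dev/nvme0n1 are placeholders, and ceph-volume lvm zap destroys everything on the device it is pointed at, so check the lsblk output before running anything for real.

```shell
#!/bin/sh
# Sketch: print the shell commands for wiping a disk and creating an OSD on it.
# Nothing is executed; the commands are only echoed for review.
osd_cmds() {
    dev="$1"
    case "$dev" in
        /dev/*) ;;
        *) echo "usage: osd_cmds /dev/<disk>" >&2; return 1 ;;
    esac
    # Inspect first: make sure this is NOT the OS disk.
    echo "lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT ${dev}"
    # Destroys all data on the device, including old OSD/LVM metadata.
    echo "ceph-volume lvm zap ${dev} --destroy"
    echo "pveceph osd create ${dev}"
}

osd_cmds /dev/nvme0n1   # hypothetical device name, yours will differ
```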

Create Pool

  1. On any node go to Datacenter > nodename > ceph > pools and click create
  2. name the pool, e.g. vm-disks, leave the defaults as they are, and click create
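A rough shell equivalent of the pool step, again as a printed command rather than something that runs. The helper name pool_create_cmd is made up, and --add_storages 1 is my assumption for matching the GUI default of also adding the pool as VM/CT storage.

```shell
#!/bin/sh
# Sketch: print the CLI command for creating a pool with default settings.
# Assumption: --add_storages 1 mirrors the GUI's "add as storage" default.
pool_create_cmd() {
    name="${1:-vm-disks}"
    echo "pveceph pool create ${name} --add_storages 1"
}

pool_create_cmd vm-disks
```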

Configure HA

  1. On any node go to Datacenter > options
  2. Set Cluster Resource Scheduling to ha-rebalance-on-start=1 (this will rebalance nodes as needed)
  3. Set HA Settings to shutdown_policy=migrate (this will migrate VMs and CTs if you gracefully shutdown a node).
  4. Leave migration settings as default (a separate gist will talk about separating the migration network later)
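The two settings above should end up as crs: and ha: lines in /etc/pve/datacenter.cfg. The sketch below checks a config file for them. The function name check_ha_cfg is hypothetical, and the exact line layout is an assumption based on how the datacenter options are normally stored.

```shell
#!/bin/sh
# Sketch: check that a datacenter.cfg contains the two HA settings from the steps above.
# Assumption: they are stored as "crs: ...ha-rebalance-on-start=1" and "ha: ...shutdown_policy=migrate".
check_ha_cfg() {
    cfg="${1:-/etc/pve/datacenter.cfg}"
    # -s keeps grep quiet if the file does not exist (e.g. not on a Proxmox node).
    grep -Eqs '^crs:.*ha-rebalance-on-start=1' "$cfg" &&
        grep -Eqs '^ha:.*shutdown_policy=migrate' "$cfg"
}

if check_ha_cfg; then
    echo "HA settings present"
else
    echo "HA settings missing or file not readable"
fi
```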
@zombiehoffa

I was migrating because I added the fourth node so it was rebalancing. I don't think it's an frr thing anymore, I can recreate with iperf3 all I have to do is have the path go through another node and I get 1-5MB/sec instead of 12 gbit/sec. It's really, really weird. I was expecting potentially a 50% performance drop, not nearly entire performance drop just by transiting a node.
it happens across ip4 and ip6. Direct connections 12 gbit/sec transit through one or more nodes to get to the node (I disconnected the ring to test it out) and it's 1-5 MB/sec. (it's weird I thought it would drop even more with 2 nodes in between but it basically didn't).

@scyto
Author

scyto commented Jan 6, 2024

my theories are:

  1. the traffic is getting routed onto the LAN and back into the mesh
  2. you found some weird and wonderful new bug
  3. you have some sort of lower level TB problem as IIRC you have USB4 not TB4 - so maybe some other bug i wont hit?

if i get time i will test the scenario again. i originally did it when i wasn't using fabricd but OSPF. i did test this exact scenario when i was fixing the bugs with intel and don't recall seeing any iperf3 drop off like this

@zombiehoffa

zombiehoffa commented Jan 6, 2024

Thanks,
That would be great if you could disconnect your ring and then run iperf3 between the two end nodes through the middle node and see what happens.

How would I figure out if 1 is happening? I am indeed on usb4, it's 4 beelink ser7's

Edit: I just ran iperf3 with --udp to see if behaviour was different in udp and it's basically the same but slightly worse.
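One way to look into theory 1 (traffic leaving the mesh via the LAN) might be to ask the kernel which interface it would use for a mesh address. This is a rough sketch: route_dev is a made-up helper that pulls the interface name out of ip route get output, and the sample line is illustrative, not captured from a real node. If the reported device is the LAN bridge rather than a thunderbolt interface such as en05, the traffic is not staying on the mesh.

```shell
#!/bin/sh
# Print the interface named after "dev" in `ip route get` output read from stdin.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Usage on a node (10.0.0.83 is a mesh address from the guide):
#   ip route get 10.0.0.83 | route_dev
# Illustrative input line, not captured from a real system:
echo "10.0.0.83 via 10.0.0.82 dev en05 src 10.0.0.81 uid 0" | route_dev
```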

@nicedevil007

nicedevil007 commented Jan 10, 2024

On the previous gist I added my notes about the IP route problems after a reboot. Whether I use IPv4 or IPv6 for everything, before a reboot I was able to get to the point where I had created all 3 monitors and they were up and running.

At the step on creating the manager I always get a timeout :(

image

image

Can someone help me out with this?

@zombiehoffa

I got something similar when I did it through the web interface of a node other than the one that I initialized the network on. Try logging into the Proxmox web interface for the init node and doing it from there.

@nicedevil007

Ok, I was able to test this, but still the same issue :( Maybe it is related to the other problem on the other gist, because after adding one monitor it loses the IP routes after some time :(

@nicedevil007

command to deploy reef version in shell:

pveceph install --repository no-subscription --version reef

@nicedevil007

The fix for the clock skew warning was to check the NTP settings: is time sync working? ;)

@zombiehoffa

my theories are:

1. the traffic is getting routed onto the LAN and back into the mesh

2. you found some weird and wonderful new bug

3. you have some sort of lower level TB problem as IIRC you have USB4 not TB4 - so maybe some other bug i wont hit?

if i get time i will test the scenario again. i originally did it when i wasn't using fabricd but OSPF. i did test this exact scenario when i was fixing the bugs with intel and don't recall seeing any iperf3 drop off like this

Did you get a chance to test the traversing a node scenario?

Thanks.

@jacoburgin

Just noticed, is that your pikvm in the pi rack?

@Kirkland-gh

I was migrating because I added the fourth node so it was rebalancing. I don't think it's an frr thing anymore, I can recreate with iperf3 all I have to do is have the path go through another node and I get 1-5MB/sec instead of 12 gbit/sec. It's really, really weird. I was expecting potentially a 50% performance drop, not nearly entire performance drop just by transiting a node. it happens across ip4 and ip6. Direct connections 12 gbit/sec transit through one or more nodes to get to the node (I disconnected the ring to test it out) and it's 1-5 MB/sec. (it's weird I thought it would drop even more with 2 nodes in between but it basically didn't).

I see the same behavior with 3 11th gen Intel NUCs using TB 3. Traversing another node to get to my destination takes me from 19 Gbps to 0.5-5 Mbps, tested with iperf. Did you manage to work around this?

@zombiehoffa

I was migrating because I added the fourth node so it was rebalancing. I don't think it's an frr thing anymore, I can recreate with iperf3 all I have to do is have the path go through another node and I get 1-5MB/sec instead of 12 gbit/sec. It's really, really weird. I was expecting potentially a 50% performance drop, not nearly entire performance drop just by transiting a node. it happens across ip4 and ip6. Direct connections 12 gbit/sec transit through one or more nodes to get to the node (I disconnected the ring to test it out) and it's 1-5 MB/sec. (it's weird I thought it would drop even more with 2 nodes in between but it basically didn't).

I see the same behavior with 3 11th gen Intel NUCs using TB 3. Traversing another node to get to my destination takes me from 19 Gbps to 0.5-5 Mbps, tested with iperf. Did you manage to work around this?

Nope. No solution yet. I am eyeing the ms01 instead, as it has dual 10 gig, which should be fine for my purposes. Pretty sad about this because if it worked it would be awesome.

@lettucebuns

I'm wondering if anyone else is having similar issues. I'm able to get through setup without issue, communication works over IPv4/IPv6, but as soon as I add an ISO to the CephFS disk or migrate a VM to the vm-disks Ceph storage, the nodes go offline. Usually, the node where the upload or migration started from stays online, but isn't able to get the status of Ceph components. The hosts cannot ping each other and I cannot ping them from my management workstation. I've wiped the cluster twice and configured it again, the 3rd time as IPv6 but the same issue occurred all 3 builds. I'm using 3 Intel NUCs 12 gen.

I reviewed logs using journalctl -xe but I couldn't find anything that pointed to what the issue could be. If anyone has any suggestions for logs to review I'm happy to do so.

It did look like the line to restart the frr.service did not work for me:

Jan 28 15:10:17 LAB-PX-01 /usr/sbin/ifup[715]: error: /etc/network/interfaces: line41: error processing line 'post-up /usr/bin/systemctl restart frr.service'
Jan 28 15:10:17 LAB-PX-01 /usr/sbin/ifup[715]: >>> Full logs available in: /var/log/ifupdown2/network_config_ifupdown2_43_Jan-28-2024_15:10:16.989774 <<<

My experience was if the Ceph cluster was configured using IPv4, then I needed to manually restart the frr service post-reboot. The 3rd time I configured the Ceph cluster to use IPv6 and it would come back up without needing to restart the frr service.

After seeing this entry, I did try setting the IOPS to 310000 and then 10000 but neither change made a difference:

706474281.7610672 osd.0 (osd.0) 1 : cluster 3 OSD bench result of 106755.004772 IOPS exceeded the threshold limit of 80000.000000 IOPS for osd.0. IOPS capacity is unchanged at 21500.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
1706474281.7613761 osd.1 (osd.1) 1 : cluster 3 OSD bench result of 115067.749781 IOPS exceeded the threshold limit of 80000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].

not sure if this is at all useful:

Jan 28 13:51:59 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:51:59.263-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:04 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:52:04.266-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:04 LAB-PX-02 kernel: libceph: mon1 (1)10.0.0.82:6789 socket closed (con state OPEN)
Jan 28 13:52:08 LAB-PX-02 fabricd[817]: [NBV6R-CM3PT] OpenFabric: Needed to resync LSPDB using CSNP!
Jan 28 13:52:09 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:52:09.266-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:09 LAB-PX-02 kernel: libceph: mon1 (1)10.0.0.82:6789 socket closed (con state OPEN)
Jan 28 13:52:14 LAB-PX-02 ceph-mon[1131]: 2024-01-28T13:52:14.266-0500 7f71b77cd6c0 -1 mon.LAB-PX-02@2(probing) e3 get_health_metrics reporting 1 slow ops, oldest is auth(proto 0 26 bytes epoch 0)
Jan 28 13:52:14 LAB-PX-02 kernel: libceph: mon1 (1)10.0.0.82:6789 socket closed (con state OPEN)

Thanks for reading - let me know if you have any tips!

@jacoburgin

I am having something similar. I have 3 NUC12s and they had been working fine until recently. Then out of the blue one node is shown as disconnected. I can't ping it, and my KVM shows the video output frozen and it won't accept input....

Did you use ceph reef or Quincy? I'm wanting to cross that variable off the list as I never had an issue with Quincy.

@lettucebuns

I've deployed the cluster using both versions - the issue existed for both.

@jacoburgin

Hmmmm. Mine has been fine. Perhaps do not run apt update && apt upgrade after the initial install, in case something new is breaking it compared to a fresh ISO install

@jacoburgin

Just restarted all 3 machines. As you say, as soon as I upload an ISO the other two machines crash completely. Will reinstall all 3 later today from the same ISO that has been working, but will not upgrade it, and test....

I'm getting good at reinstalling this!

@lettucebuns

Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

@jacoburgin

jacoburgin commented Jan 30, 2024

I think I'm up to at least 10 wipes 😭😂 keep breaking it on my own 😂

@jacoburgin

Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

Well I have learnt a lot more about removing cephfs...

But nothing has fixed the random node freezing and subsequently disconnecting.

I fresh installed with the 8.1-1 iso. Ran apt update only to get the package list or lldp won't install (maybe that was a mistake)?

I have tried with no cephfs for ISOs-Templates and used a NFS share instead.

This worked the longest but shortly after nodes froze...

I'm off to bed but tomorrow I'll try the 8.0-2 ISO, then maybe a kernel update on top if it is stable.

But something has completely broken it for us NUC12 users...

To me, though, it has to be some sort of driver issue, maybe for the CPU, as at least in my case when the node "disconnects" in the web UI the machine has actually locked up/frozen (I can see this through my KVM) and has to be hard reset.

@jacoburgin

jacoburgin commented Feb 7, 2024

Let me know how it goes! If it works for you maybe I'll consider wiping a 3rd time...

Some success: updating the microcode has made migrating a Windows VM possible. No crashes there. But uploading an ISO to CephFS still fails. That locked two nodes, which had to be hard reset.

Others are experiencing something similar after an update. Just not sure what broke it all

https://www.reddit.com/r/Proxmox/s/pDMvr9WKA8

@jacoburgin

SO I have reinstalled the 3 nuc12's to 7.5, zero issues as expected with Scyto's gist. Upgraded to PVE8 and kernel 6.5 and everything is broken.

Downgraded the kernel to 6.2.16.20 (which includes Scyto's TB fix) and have had zero issues so far! I can live migrate again and upload ISOs. No other "fixes" applied, just a change in kernel

@lettucebuns
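For anyone wanting to try the same kernel downgrade, pinning the boot kernel could look roughly like the sketch below. The run and pin_kernel helpers are hypothetical, the version string is a placeholder that should be copied from the kernel list output on your own node, and DRY_RUN defaults to 1 so the commands are only printed.

```shell
#!/bin/sh
# Sketch: pin an older kernel so the node keeps booting it after upgrades.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

pin_kernel() {
    version="$1"
    if [ -z "$version" ]; then
        echo "usage: pin_kernel <kernel-version>" >&2
        return 1
    fi
    run proxmox-boot-tool kernel list           # shows the installed kernel versions
    run proxmox-boot-tool kernel pin "$version" # boot this version until unpinned
}

pin_kernel "6.2.16-20-pve"   # hypothetical version string, check `kernel list` for yours
```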

@zombiehoffa

The later kernels reverted the fix???

@jacoburgin

The later kernels reverted the fix???

No, Scyto's thunderbolt fix is applied from 6.2.16-14 onwards.

@DarkPhyber-hg

i just got my ms-01's, i followed the guide and i've re-installed 3 or 4 times now. When using 10gbe for my ceph network, everything works fine. When using thunderbolt i keep getting random lock ups on any node whenever the ceph storage pool is under load. I am on kernel 6.5.13-1-pve and pve 8.1.4.

I wonder if something broke in the later kernel?

@DarkPhyber-hg

DarkPhyber-hg commented Feb 23, 2024

Following up on what i've done so far. I've reinstalled proxmox quite a few times. I couldn't go 3 minutes into restoring a VM from PBS without at least 1 node locking up hard.

In an attempt to isolate the issue, i only used 2 nodes, i was still having the exact same issue. I'm using 2/2 replication and in corosync.conf i gave one node 2 votes.

I decided to eliminate openfabric, so i am just using standard IP addressing assigned to en05 with 2 hosts. I also used reef instead of quincy, so I changed 2 variables. It's been working perfectly for like 8 hours so i think this is a success.

My next test that i'm gonna start working on now, will be to add openfabric to the working configuration. If this doesn't work then there's some kind of issue with TB, openfabric, ceph, PVE 8.1.4, and kernel 6.5, and if it does work then the issue is likely with quincy and the combination of variables on kernel 6.5

@DarkPhyber-hg

ok, going to reef did the trick, no more lockups even with openfabric

@jacoburgin

I had to lock the kernel to 6.2 to get stability on my nuc 12's

@DarkPhyber-hg

I forgot that i had commented out the MTU of 65520, so it was defaulting to 1500, when i put it back to 65520 i got an instant lock up! I'm playing around with various mtu sizes right now. What's strange is that with an extended iperf3 test i got no lockups with the higher mtu value.

@DarkPhyber-hg

DarkPhyber-hg commented Feb 25, 2024

ok, i've been playing around with various mtu sizes, there's no perceivable difference on my hardware in iperf3 speeds for an mtu between 1500 and 34,000. I always wind up with an iperf3 test of around 22-23gbps. Going to 35,000 i get lockups with ceph.

Using the ceph benchmark tool rados, on a write test, is a good way to stress test and see if i will get a lockup without having to use real world load. Additionally, i consistently get the best write throughput and iop performance with an mtu of 1500 with my current hardware. I am using consumer wd sn850x m.2 drives, until i get some enterprise ones, so this could have an impact on this as well.
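The MTU test loop described above can be sketched like this. It is only an outline of the idea: run and stress_at_mtu are made-up helpers, en05 and the vm-disks pool are taken from earlier in this page, and DRY_RUN defaults to 1 so nothing is changed. A real run would need the same MTU on every thunderbolt interface on every node, and ip link set does not persist across reboots.

```shell
#!/bin/sh
# Sketch: set an MTU on a mesh interface, then stress the pool with a rados write benchmark.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

stress_at_mtu() {
    iface="$1"
    mtu="$2"
    pool="${3:-vm-disks}"
    run ip link set dev "$iface" mtu "$mtu"   # temporary, lost on reboot
    run rados bench -p "$pool" 60 write       # 60 second write benchmark
}

stress_at_mtu en05 1500
```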

I have some Samsung PM9A3 u.2 drives on the way, along with some PM983 m.2 drives. Once i get those i'll do another round of testing and hopefully put this stack into production to replace my r730.
