weird stuff, troubleshooting, etc.

This gist is part of this series.

Proxmox Setup Weirdness Observed

  1. Most of the time the GUI setup didn't get a DHCP address from my server; in one of six installs it did.
  2. Sometimes the GUI setup populated an example IPv4 address, sometimes an actual IPv6 address, and one time a real DHCP IPv4 address.
  3. On one node (in six installs) it added a thunderbolt0 entry to /etc/network/interfaces; the fix was to remove the entry from the interfaces file (see the sketch below).
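A sketch of the kind of stanza to look for and remove; this example is an assumption about what the installer wrote, not copied from the broken file:

# /etc/network/interfaces -- remove any installer-added stanza like this
# (hypothetical example; the exact lines may differ)
auto thunderbolt0
iface thunderbolt0 inet dhcp

# after editing, reload networking (Proxmox uses ifupdown2)
ifreload -a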


  1. Sometimes after setup the thunderbolt modules load and sometimes they don't; I don't know why. I currently have one machine with them specified in /etc/modules and two machines without. I eventually figured out this is (stupidly) by design: one machine in the cluster needs to have the modules loaded manually, and this causes the other machines to load them on the fly. As such it's best to define them on all nodes; see the first sketch after this list.

  2. On node 3 I accidentally entered the wrong IP during the GUI install. After the install I changed it, IIRC in /etc/network/interfaces, and all seemed OK, but when I tried to join the cluster it failed, saying the address didn't exist on the local machine. The fix was to edit /etc/hosts, which still had the incorrect IP mapped to pve3; correcting the IP on that line fixed it. See the second sketch after this list.

  3. IPv6 over thunderbolt seems to be quite broken on Proxmox VE 8 and Debian 12, so this guide switched to all IPv4; you can find the IPv6 routed mesh page here if you are interested.
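A minimal sketch of defining the modules on all nodes, assuming the thunderbolt and thunderbolt-net module names used elsewhere in this series:

# append the modules to /etc/modules on every node so they load at boot
echo thunderbolt >> /etc/modules
echo thunderbolt-net >> /etc/modules
# load them immediately without a reboot
modprobe thunderbolt
modprobe thunderbolt-net
# confirm they are loaded
lsmod | grep thunderbolt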
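And for the /etc/hosts mismatch in point 2, the fix is simply making the mapped address agree with the one actually configured on the node; 192.0.2.3 and the domain below are placeholders, not the cluster's real values:

# /etc/hosts -- the pve3 line must carry the node's actual management IP
# (192.0.2.3 and the domain are placeholders)
192.0.2.3 pve3.example.local pve3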


In general, set up Ceph only once you know the mesh network is stable and can survive cable pulls, managed reboots, and node failure. Redoing the mesh networking after Ceph and the cluster were fully configured rendered my cluster nearly unusable (though never irrecoverably; it just took a ton of googling the forums). A crude soak test like the one below is enough to catch an unstable mesh first.
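A minimal soak-test sketch, assuming hypothetical mesh addresses for the other two nodes; run it on one node while pulling cables and rebooting the others:

# 10.0.0.82 and 10.0.0.83 are placeholders for the other nodes' mesh IPs
while true; do
  for ip in 10.0.0.82 10.0.0.83; do
    ping -c1 -W1 "$ip" >/dev/null || echo "$(date) $ip unreachable"
  done
  sleep 1
done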

UI very fragile to certain ceph configs

It seems the UI is super fragile to bad configs and will stop responding; for example, if one deletes a CephFS storage item and stops all the MDS daemons, it can hang. The only fix is to repair Ceph at the command line.

For CephFS I found this super useful to blow away CephFS pools in strange states: pveceph fs destroy ISOs-Templates --remove-pools 1 --remove-storages 1, where ISOs-Templates is the name of the pool without the _data or _metadata suffixes. A fuller invocation with sanity checks is sketched below.
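The same command with before-and-after checks; the filesystem name is from the example above, so adapt it to yours:

# see what exists before destroying anything
ceph fs ls
pveceph pool ls
# destroy the filesystem plus its _data/_metadata pools and the storage entry
pveceph fs destroy ISOs-Templates --remove-pools 1 --remove-storages 1
# verify the pools are gone
pveceph pool ls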

I also found doing this can leave orphaned directories in /mnt/pve/; if one of these has the same name as a new CephFS folder you want to create, you will get a mount error. I have not yet figured out how to remove them, as operations on these folders hang the shell...
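One thing that might be worth trying here, purely an assumption on my part and not verified in this setup, is lazily detaching the dead mount before removing the directory:

# untested suggestion: lazy-unmount the dead mount, then remove the directory
# (<name> is the orphaned entry under /mnt/pve)
umount -l /mnt/pve/<name>
rmdir /mnt/pve/<name>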

Resetting Ceph

It is possible to tear down Ceph; this is the general approach (a command sketch follows the list):

  1. remove pools
  2. remove OSDs
  3. remove mons and managers on all but one node
  4. on the last node, run pveceph purge
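Roughly, in pveceph terms; names and IDs in angle brackets are placeholders, and each destroy step repeats per pool/OSD/node:

# 1. remove pools (repeat per pool)
pveceph pool destroy <poolname>
# 2. remove OSDs (take each one out and stop it first)
ceph osd out osd.<id>
systemctl stop ceph-osd@<id>
pveceph osd destroy <id>
# 3. remove mons and managers on all but one node
pveceph mon destroy <nodename>
pveceph mgr destroy <nodename>
# 4. on the last node
pveceph purge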

I did find that when things were really screwy and the purge failed or mon deletion failed, one might need to delete one or more of the following:

  • the systemd unit file for the failed monitor on its node, i.e. rm /etc/systemd/system/<monitor name>
  • the data directory for the failed monitor on its node, i.e. rm /var/lib/ceph/mon/<monitor name>
  • the entries in ceph.conf for the ghost mon (never delete the last one this way)
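Putting those together, cleanup of a ghost monitor on its node looks something like this (paths as given above; double-check each one before deleting anything):

# stop the dead monitor if systemd still knows about it
systemctl stop ceph-mon@<monitor name> || true
# remove the unit file and the monitor's data directory (per the bullets above)
rm /etc/systemd/system/<monitor name>
rm -r /var/lib/ceph/mon/<monitor name>
# finally, hand-edit /etc/pve/ceph.conf to drop the ghost mon's entry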

OSD couldn't be created

My second and third OSDs (the first on node 2 and the first on node 3) couldn't be created; the task failed with:

create OSD on /dev/nvme0n1 (bluestore)
wiping block device /dev/nvme0n1
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 0.0956093 s, 2.2 GB/s
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 879da4b3-c227-41a8-823f-a2357fb01af7
 stderr: 2023-08-20T08:18:19.776-0700 7ff8c93b06c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
 stderr: 2023-08-20T08:18:19.776-0700 7ff8c93b06c0 -1 AuthRegistry(0x7ff8c4060610) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: 2023-08-20T08:18:19.780-0700 7ff8c27fc6c0 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
 stderr: 2023-08-20T08:18:19.780-0700 7ff8c2ffd6c0 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
 stderr: [errno 13] RADOS permission denied (error connecting to the cluster)
-->  RuntimeError: Unable to create a new OSD id
TASK ERROR: command 'ceph-volume lvm create --cluster-fsid 5e55fd50-d135-413d-bffe-9d0fae0ef5fa --data /dev/nvme0n1' failed: exit code 1

The fix was to run the following on each node:

ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
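A quick sanity check after running it, just confirming the keyring exists and parses before retrying the OSD creation:

# confirm the bootstrap-osd keyring is present and readable
ls -l /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph-authtool -l /var/lib/ceph/bootstrap-osd/ceph.keyring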

It was unclear to me why this file didn't exist, but I had purged and reinstalled Ceph more than a few times, so I suspect that was the issue.


My first migration failed with:

2023-08-20 08:35:44 # /usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve2' root@ /bin/true
2023-08-20 08:35:44 Host key verification failed.
2023-08-20 08:35:44 ERROR: migration aborted (duration 00:00:00): Can't connect to destination address using public key
TASK ERROR: migration aborted

This was fixed by doing one or both of the following on EVERY node, connecting explicitly to each node (not using one node to access another).

So maybe this was the magic:
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pve1"
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pve2"
ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "pve3"

Fix migrate ssh issues (hmm, why did this work? the names are wrong; maybe it was just clearing down the ssh known hosts files that did it):

ssh -o 'HostKeyAlias=pve1' root@
ssh -o 'HostKeyAlias=pve2' root@
ssh -o 'HostKeyAlias=pve3' root@
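For what it's worth, a less manual route that may achieve the same key refresh (an alternative, not what was done here) is Proxmox's built-in certificate/known-hosts refresh:

# regenerate node certificates and refresh cluster SSH known hosts
# (run on each node)
pvecm updatecerts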



B-C-C commented Jun 10, 2024

My Weird Error was slightly different:

I had thunderbolt set up and changed the migration network over to it per the information above...

could not get migration ip: no IP address configured on local node for network ''
TASK ERROR: command '/usr/bin/ssh -e none -o 'BatchMode=yes' -o 'HostKeyAlias=pve03' -o 'UserKnownHostsFile=/etc/pve/nodes/pve03/ssh_known_hosts' -o 'GlobalKnownHostsFile=none' root@ pvecm mtunnel -migration_network -get_migration_ip' failed: exit code 255

I can see it knows the host via root@ but it doesn't seem to know that host is also reachable over the Thunderbolt network...
If that makes any sense...

I expect the same issue on all three hosts, meaning no migration works...

Ceph is currently also on the old network, and I expect I'll need to update that as well to use the thunderbolt network, but that might be a bit more tricky as there are VMs up and running currently; not quite production, though.

Temporarily corrected by changing Datacenter > Options > Migration network back to the lower-speed network, but I'm interested in the correct method to change both Ceph and migration over to the higher-speed 10+ Gb thunderbolt network.
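For reference, pointing migration at the Thunderbolt subnet is a one-line entry in /etc/pve/datacenter.cfg; the subnet below is a placeholder for the actual mesh network:

# /etc/pve/datacenter.cfg -- route migration traffic over the TB mesh
# (10.0.0.0/24 is a placeholder for your thunderbolt subnet)
migration: secure,network=10.0.0.0/24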
