@scyto
Last active April 21, 2024 13:39
Enable Dual Stack (IPv4 and IPv6) OpenFabric Routing

This will result in an IPv4- and IPv6-routable mesh network that can survive any single node failure or any single cable failure. All the steps in this section must be performed on each node.

Note: for ceph, do not dual stack - use either IPv4 or IPv6 addresses for all the monitors, MDS and daemons. Despite the docs implying dual stack is ok, my findings on quincy are that it is flaky....
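
For reference, here is a minimal sketch of an IPv6-only ceph.conf fragment. The option names are standard Ceph settings and the fc00::/64 prefix matches the addressing used below, but treat the exact values as assumptions to adapt rather than a verified config:

# /etc/ceph/ceph.conf (fragment) - hypothetical IPv6-only example
[global]
        ms_bind_ipv6 = true
        ms_bind_ipv4 = false
        public_network = fc00::/64
        cluster_network = fc00::/64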

this gist is part of this series

Create Loopback interfaces

Doing this means we don't have to give each thunderbolt interface a manual IPv4 or IPv6 address, and these addresses stay constant no matter what. Add the following on each node using nano /etc/network/interfaces

This should go under the auto lo section; on each node replace X with 1, 2 or 3 depending on the node.

auto lo:0
iface lo:0 inet static
        address 10.0.0.8X/32
        
auto lo:6
iface lo:6 inet static
        address fc00::8X/128

so on the first node it would look something like this:

...
auto lo
iface lo inet loopback
 
auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

auto lo:6
iface lo:6 inet static
        address fc00::81/128
...

Also add this as the last line of the interfaces file:

# This must be the last line in the file
post-up /usr/bin/systemctl restart frr.service

Save file, repeat on each node.

Enable IPv4 and IPv6 forwarding

  1. use nano /etc/sysctl.conf to open the file
  2. uncomment #net.ipv6.conf.all.forwarding=1 (remove the # symbol)
  3. uncomment #net.ipv4.ip_forward=1 (remove the # symbol)
  4. save the file
  5. reboot, or apply the settings without a reboot as shown in the snippet below
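
For reference, the edited lines and a no-reboot apply/verify could look like this (a sketch; sysctl -p simply re-reads /etc/sysctl.conf):

# /etc/sysctl.conf - the two lines after uncommenting
net.ipv6.conf.all.forwarding=1
net.ipv4.ip_forward=1

# apply without rebooting and verify the values
sysctl -p
sysctl net.ipv4.ip_forward net.ipv6.conf.all.forwarding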

FRR Setup

Install FRR

Install Free Range Routing (FRR) with apt install frr

Enable the fabricd daemon

  1. edit the frr daemons file (nano /etc/frr/daemons) to change fabricd=no to fabricd=yes (or use the sed one-liner shown after this list)
  2. save the file
  3. restart the service with systemctl restart frr
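
If you prefer to script the edit, something like this sed one-liner should be equivalent (a sketch, not part of the original steps):

sed -i 's/^fabricd=no/fabricd=yes/' /etc/frr/daemons
systemctl restart frr
systemctl status frr --no-pager    # fabricd should now appear in the process list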

Configure OpenFabric (perform on all nodes)

  1. enter the FRR shell with vtysh
  2. optionally show the current config with show running-config
  3. enter the configure mode with configure
  4. apply the configuration below (it is possible to cut and paste this into the shell instead of typing it manually; you may need to press return to apply the last !. Also check that there were no errors in response to the pasted text).

Note: the X should be the number of the node you are working on - for example 0.0.0.1, 0.0.0.2 or 0.0.0.3.

ip forwarding
ipv6 forwarding
!
frr version 8.5.2
frr defaults traditional
hostname pve1
service integrated-vtysh-config
!
interface en05
ip router openfabric 1
ipv6 router openfabric 1
exit
!
interface en06
ip router openfabric 1
ipv6 router openfabric 1
exit
!
interface lo
ip router openfabric 1
ipv6 router openfabric 1
openfabric passive
exit
!
router openfabric 1
net 49.0000.0000.000X.00
exit
!

  1. you may need to press return after the last ! to get to a new line - if so, do this

  2. exit the configure mode with the command end

  3. save the config with write memory

  4. confirm the configuration applied correctly with show running-config - note the order of the items will be different from how you entered them, and that's ok. (If you made a mistake, I found the easiest way to fix it was to edit /etc/frr/frr.conf - but be careful if you do that.)

  5. use the command exit to leave setup

  6. repeat these steps on the other nodes

  7. once you have configured all 3 nodes, issue the command vtysh -c "show openfabric topology" - if you did everything right you will see:

Area 1:
IS-IS paths to level-2 routers that speak IP
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
10.0.0.81/32         IP internal  0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
10.0.0.82/32         IP TE        20     pve2                 en06      pve2(4)
10.0.0.83/32         IP TE        20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers that speak IPv6
Vertex               Type         Metric Next-Hop             Interface Parent
pve1                                                                  
fc00::81/128         IP6 internal 0                                     pve1(4)
pve2                 TE-IS        10     pve2                 en06      pve1(4)
pve3                 TE-IS        10     pve3                 en05      pve1(4)
fc00::82/128         IP6 internal 20     pve2                 en06      pve2(4)
fc00::83/128         IP6 internal 20     pve3                 en05      pve3(4)

IS-IS paths to level-2 routers with hop-by-hop metric
Vertex               Type         Metric Next-Hop             Interface Parent

Now you should be able to ping each node from every node across the thunderbolt mesh, using IPv4 or IPv6 as you see fit.
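
For example, from pve1 a quick check of both stacks might look like this (addresses as defined earlier in this gist):

ping -c 3 10.0.0.82
ping -c 3 10.0.0.83
ping -c 3 fc00::82
ping -c 3 fc00::83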

@scyto commented Dec 28, 2023

If you figure it out please let me know what settings you make :)

can you SSH from one node to another over the mesh IPv6 addresses? This needs to be working.

also what did you put in /etc/pve/datacenter.cfg

@scyto commented Dec 28, 2023

because i use ceph i never worried about migration traffic as state is already shared

@ricksteruk so i just set it up again, assuming you used my steps precisely this works:

migration: insecure,network=fc00::81/64

maybe i should add this to the cluster setup section :-)?

and pictorial evidence:
[screenshot]

@scyto commented Dec 28, 2023

but the IPv4 requires me to do a 'systemctl restart frr' on all machines before it eventually starts working. Annoying!

weird, i wonder why that's happening, i switched to IPv6 from the get-go for ceph

and i left the corosync network on its own dedicated private switch (cheap $50 2.5gbe one, not connected to anything else) using IPv4

@ricksteruk commented Dec 28, 2023

can you SSH from one node to another other over the mesh IPv6 addresses? This needs to be working.

also what did you put in /etc/pve/datacenter.cfg

I never tried SSH over IPv6 mesh network - but I was able to ping via the IPv6.

I can't remember what settings I tried when I attempted setting up the IPv6 migration network

@ricksteruk commented Dec 28, 2023

because i use ceph i never worried about migration traffic as state is already shared

@ricksteruk so i just set it up again, assuming you used my steps precisely this works:

migration: insecure,network=fc00::81/64

maybe i should add this to the cluster setup section :-)?

I will give this a try!

I assume the syntax of the line needs to read:
migration: network=fc00::81/64,type=insecure

@scyto commented Dec 28, 2023

I never tried SSH over IPv6 mesh network - but I could ping via the IPv6.

ping is not good enough, this only tests ICMP - not UDP or TCP (plus migration happens over an SSH pipe IIRC) - this was how i first realized that the thunderbolt-net code was broken... SSH not working.

Proxmox folks backported the fixes from a way later kernel to theirs, i want to make sure they didn't break that backport.... i haven't updated my kernel in some time and it's working ok.... if they broke the backport it's possible it might affect IPv4 too as there was a key thunderbolt reliability fix as part of the fix i got intel to do....

when doing this all for the first time, also make sure your firewall is disabled on all nodes - just in case that's the issue too
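
As a sketch of that sanity check from one node (loopback addresses as used in this gist; pve-firewall is the Proxmox firewall service):

# confirm the Proxmox firewall is not in the way
pve-firewall status

# confirm TCP (not just ICMP) works across the mesh
ssh root@fc00::82 hostname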

@jacoburgin

@scyto @ricksteruk Yes I used migration: network=fc00::81/64,type=insecure and did the ceph setup the same way with ipv6 instead and all works.

ipv4 only works after frr reboot

@ricksteruk

I just tested the SSH via IPv6 and it worked fine.
I also tested the IPv6 "migration: network=fc00::81/64,type=insecure" and that was successful as well!

Which is great news - as now I do not need to worry if the IPv4 is unreliable.

I would have liked it if the Thunderbolt was able to provide a backup for a failed main Ethernet connection to the internet, but I have not been able to get this to work. But I have two Ethernet ports on these old Mac Pros I'm using, so I will try and set one of those up as a backup connection.

@ricksteruk

I was thinking about getting some NUC 13s for my system, but now that Intel has stopped producing NUCs I am not sure if this is something I should base my set up on going forward.

I have seen that some of the other NUC-type AMD based machines have USB4 / Thunderbolt-type connections -- but I would not be keen on buying two or three of them and being a guinea pig. That's a bit too expensive for a test setup for me!

@scyto commented Dec 28, 2023

don't assume USB4 = TB4 - it might be, it might not; TB4 certified means they have to include all the optional USB4 specs.... so just research to make sure you really are getting 40Gbps - for example, the table in this post is a good illustration of the issues.... https://cravedirect.com/blogs/technology/differences-between-thunderbolt-4-and-usb4

on the intel NUCs - the business is now owned by ASUS; i have nothing but ASUS mobos and have used their US support a few times, and have always been happy.

tl;dr the intel NUC 13 form factor is not going anywhere and is fully supported. https://www.asus.com/us/displays-desktops/nucs/nuc-kits/filter?Series=NUC-Kits

@scyto commented Dec 28, 2023

I would have liked it if the Thunderbolt was able to provide a backup for a failed main Ethernet connection to the internet,

should be possible - you will need to set a higher-cost default route on each node to point to that mac. Of course you will need the Mac on a totally different physical network to your LAN (otherwise if your LAN goes down your Mac goes down too...)

you will also want a route defined on your router back to the IPv6 network

something like this but to the mac (i used this approach so ANY machine on my LAN can reach the private ceph network if needed.... even if one node goes down....)

[screenshot]

the last address on the right is the real IPv6 address on my LAN for each node
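
In command form, on a Linux box (or a router that accepts ip route style static routes) those entries would look roughly like this - the 2001:db8 addresses are documentation-prefix stand-ins for each node's real LAN IPv6 address:

ip -6 route add fc00::/64 via 2001:db8:0:1::81 metric 100   # via pve1
ip -6 route add fc00::/64 via 2001:db8:0:1::82 metric 200   # fallback via pve2
ip -6 route add fc00::/64 via 2001:db8:0:1::83 metric 300   # fallback via pve3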

and while i laugh at folks who redact IPv4 IPs, redacting IPv6 IPs makes sense as they are globally routable..... with no NAT to protect you - only a firewall (and anyone who uses NAT-v6 should be shot lol :-) )

@ricksteruk

don't assume USB4 = TB4 - it might, it might not, TB4 certified means they have to include all the optional USB4 specs.... so just research to make sure you really are getting 40Gbps for example

on the intel NUCS - the business is now owned by ASUS, i have nothing but ASUS mobos and have used their US support a few times, have always been happy,

tl;dr the intel NUC 13 form factor is not going anywhere and is fully supported. https://www.asus.com/us/displays-desktops/nucs/nuc-kits/filter?Series=NUC-Kits

Yep.. that is why I'm not keen on buying an AMD style NUC unless someone has tested its 'Thunderbolt compatibility' in this scenario

Good to know that ASUS owns the NUC business and will carry it on thanks 👍

@scyto commented Dec 28, 2023

Good to know that ASUS owns the NUC business and will carry it on thanks 👍

you are welcome, i also believe they will be handling all RMA and warranty claims for all units intel has sold to date...

@ricksteruk

I would have liked it if the Thunderbolt was able to provide a backup for a failed main Ethernet connection to the internet,

should be possible, you will need to set a higher cost default route on each node to point to that mac - of course you will need the Mac on a totally different physical network to your LAN (i.e. otherwise if your LAN goes down your Mac goes down too...)

you will also want a route defined on your router back to the IPv6 network

something like this but to the mac (i used this approach so ANY machine on my LAN can reach the private ceph network if needed.... even if one node goes down....

Sounds a bit beyond my noob networking skills! Where would I be setting this sort of thing up?
The Macs aren't running as Macs anymore, they are running Proxmox and hosting VMs with Ubuntu

@scyto commented Dec 28, 2023

Sounds a bit beyond my noob networking skills! Where would I be setting this sort of thing up?
The Macs aren't running as Macs anymore, they are running Proxmox and hosting VMs with Ubuntu

You would be setting up FRR on them so they participate in the routing network, configuring default routes on linux.... and then configuring the return route on your router (assuming it lets you do static routes)... on pve i know one can use the ip route command, not sure if there is a better way...

@ricksteruk

Sounds a bit beyond my noob networking skills! Where would I be setting this sort of thing up?
The Macs aren't running as Macs anymore, they are running Proxmox and hosting VMs with Ubuntu

You would be setting up FRR on them so they participate in the routing network and configuring default routes on linux.... and then configuring route return on your router (assuming it lets you do static routes)... on pve i know one can use the ip route command, not sure if there is a better way...

Thanks for the pointer! I will look into it. I've recently set up a small N100 mini PC with PFsense - so it should be possible, but I only know the basics of setting that up so far. I still need to set up my WAN failover on that with Starlink as the backup when I get round to it.

@jacoburgin commented Dec 29, 2023

I just tested the SSH via IPv6 and it worked fine. I also tested the IPv6 "migration: network=fc00::81/64,type=insecure" and that was successful as well!

Which is great news - as now I do not need to worry if the IPv4 is unreliable.

I would have liked it if the Thunderbolt was able to provide a backup for a failed main Ethernet connection to the internet, but I have not been able to get this to work. But I have two Ethernet ports on these old Mac Pros I'm using so I will try and set one of those up as a backup connection.

Awesome news!

I'm a noob too, dude; if you're interested in discussing this on discord I would be happy to share! https://discord.gg/rvW4zE6k

@tooeffayy commented Dec 30, 2023

The proxmox wiki has a page on OpenFabric that has timer settings plus a post-up command in /etc/network/interfaces to restart frr.service which may help to alleviate some problems.

I did try the dual stack setup as described in the gist and had the same issues with IPv4 installing incorrect routes on reboot while IPv6 worked correctly. I then tried IPv6 only and started experiencing identical issues with the v6-only stack, so there definitely is some sort of bug involved. I ended up with an OSPFv3 setup that appears to be working great for me so far, although I only set up routing and the cluster last night.
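
For anyone curious, a rough IPv6-only OSPFv3 equivalent of the OpenFabric config above might look like this in vtysh. This is an assumption based on FRR 8.x ospf6d syntax, not the poster's actual config; ospf6d=yes needs to be enabled in /etc/frr/daemons instead of fabricd, and X is the node number as before:

interface en05
ipv6 ospf6 area 0.0.0.0
exit
!
interface en06
ipv6 ospf6 area 0.0.0.0
exit
!
interface lo
ipv6 ospf6 area 0.0.0.0
ipv6 ospf6 passive
exit
!
router ospf6
ospf6 router-id 0.0.0.X
exit
!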

@pieter-v-n commented Jan 1, 2024

don't assume USB4 = TB4 - it might, it might not, TB4 certified means they have to include all the optional USB4 specs.... so just research to make sure you really are getting 40Gbps for example
on the intel NUCS - the business is now owned by ASUS, i have nothing but ASUS mobos and have used their US support a few times, have always been happy,
tl;dr the intel NUC 13 form factor is not going anywhere and is fully supported. https://www.asus.com/us/displays-desktops/nucs/nuc-kits/filter?Series=NUC-Kits

Yep.. that is why I'm not keen on buying an AMD style NUC unless someone has tested it's 'Thunderbolt compatibility' in this scenario

Good to know that ASUS owns the NUC business and will carry it on thanks 👍

I have built the 3-server setup with CEPH on AMD hardware (Ryzen 7 7840HS) using USB4. I have the same issues with IPv4 not starting automatically after reboot for the thunderbolt connections. Also, when applying the frr configuration, I get an error message on the "ip router openfabric 1" command, and this line does not show up in the frr.conf file. When it is added manually, the IPv4 links eventually get enabled.
Now I have CEPH using IPv4, but I want to switch to IPv6 because that always works. CEPH, however, seems to have problems with a dual stack setup.
As reported earlier, the mini-PCs I use (Bee-link GTR7) show iperf3 throughput of about 12 Gbps over USB4 between two hosts. I expected more, but the interfaces seem to connect at only 20 Gbps USB speed. For me, that is enough for now.
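
For anyone wanting to reproduce that kind of measurement, a typical test over the mesh would be something like this (using the IPv6 loopback addresses from this gist; -P 4 runs parallel streams, which often helps on these links):

# on node pve1
iperf3 -s

# on another node
iperf3 -c fc00::81 -P 4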

@nicedevil007 commented Jan 10, 2024

Hi guys, I have three NUC 13 Pros with the 1340P CPU here, and have connected all 3 TB4 cables in the mesh:

[screenshot]

After I did everything from here, I ended up being able to ping each node after step 11, vtysh -c "show openfabric topology".
I can ssh to each of the other nodes as well, iperf tests were done, etc. It looks like this:

[screenshot]

After a reboot, the ip routes are gone and nothing works anymore. Is there something I can do to fix this?

Oh I forgot, this is my /etc/network/interfaces config

auto lo
iface lo inet loopback

### thunderbolt part start
# > https://gist.github.com/scyto/4c664734535da122f4ab2951b22b2085
###

auto lo:0
iface lo:0 inet static
        address 10.0.0.11/32

auto lo:6
iface lo:6 inet static
        address fc00::11/128

### thunderbolt part end


iface enp86s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.21.1/29
        gateway 192.168.21.6
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0

### thunderbolt part start

auto en05
iface en05 inet static
        mtu 4000

iface en05 inet6 static
        mtu 4000

auto en06
iface en06 inet static
        mtu 4000

iface en06 inet6 static
        mtu 4000

### thunderbolt part end

source /etc/network/interfaces.d/*

post-up /usr/bin/systemctl restart frr.service

@zombiehoffa

Change all your statics in interfaces to manual. Also run an ip a command and see if en05 and en06 exist. If they don't, unplug and replug the thunderbolt cables. After that, and once they exist, do ifup en05, ifup en06, and ifup lo on all three nodes.
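
In command form, that suggestion is roughly (a sketch of the steps described above):

ip a | grep -E 'en05|en06'   # do the interfaces exist?
ifup en05
ifup en06
ifup lo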

@nicedevil007 commented Jan 10, 2024

Change all your statics in interfaces to manual. Also do an ip a command and see if en05 and en06 exist. If they don't, unplug and replug the thunderbolt cables. After that, and if they exist, do ifup en05, ifup en06, and ifup lo on all three

en05 and en06 exist after reboot every time, but I will try the manual stuff tomorrow (not at the site where my playground is located right now :()

@zombiehoffa
tested this as well, it doesn't make any difference.
What I noticed is that when doing an ifup en06, for example, it tells me that it couldn't run the post-up command. If I run the command directly in the shell, without the post-up, it works of course.

@scyto commented Jan 17, 2024

oops maybe my bad

i have this at the bottom of my interfaces file.... does this fix your issue? i have added it to the sample -
i think it got lost in a cut and paste, sorry

post-up /usr/bin/systemctl restart frr.service

@scyto commented Jan 17, 2024

for reference this is my running file... FWIW

root@pve1:/etc/frr# cat /etc/network/interfaces
# network interface settings; autogenerated
# Please do NOT modify this file directly, unless you know what
# you're doing.
#
# If you want to manage parts of the network configuration manually,
# please utilize the 'source' or 'source-directory' directives to do
# so.
# PVE will preserve these directives, but will NOT read its network
# configuration from sourced files, so do not attempt to move any of
# the PVE managed interfaces into external files!

auto lo
iface lo inet loopback

auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

auto lo:6
iface lo:6 inet static
        address fc00::81/128

iface enp86s0 inet manual
#part of vmbr0

auto en05
iface en05 inet manual
        mtu 65520

auto en06
iface en06 inet manual
        mtu 65520

iface wlo1 inet manual

auto enp87s0
iface enp87s0 inet manual

auto vmbr0
iface vmbr0 inet static
        address 192.168.1.81/24
        gateway 192.168.1.1
        bridge-ports enp86s0
        bridge-stp off
        bridge-fd 0

iface vmbr0 inet6 static
        address <redacted>
        gateway <redacted>

post-up /usr/bin/systemctl restart frr.service

Also i see some of you now have source /etc/network/interfaces.d/* in your interfaces file - make sure there are no files in /etc/network/interfaces.d/ that could be conflicting (it seems the proxmox team might be moving to one file per interface....)

@jacoburgin

You've dropped the "iface en05 inet6 manual" and "iface en06 inet6 manual"?

@scyto commented Apr 10, 2024

You've dropped the "iface en05 inet6 manual" and "iface en06 inet6 manual"?

oh i see what you mean, no, they are documented in the mesh network setup gist not this gist - in this gist i am showing just the changes / additions needed for FRR - not the whole file

@vdovhanych commented Apr 12, 2024

I've set up a cluster with quite a lot of help from these gists. Just wanted to mention this here for anyone else dealing with the network being broken after a node reboot.

post-up /usr/bin/systemctl restart frr.service was not working for me in /etc/network/interfaces, and when I ran ifreload -a it complained that it couldn't resolve the line.

I created a file restart-frr in /etc/network/if-up.d/ with a command to restart the service as an sh script:

#!/bin/sh

# Check if the interface is either en05 or en06
if [ "$IFACE" = "en05" ] || [ "$IFACE" = "en06" ]; then
    # Restart the frr service
    /usr/bin/systemctl restart frr.service
fi

Make the file executable; scripts in /etc/network/if-up.d/ run whenever any interface comes up, so this one checks for en05 or en06 and only then restarts frr. The check matters because otherwise the restart would run once for every interface that comes up, however many networks you have.
I tested it by rebooting one node, and the network came back online.
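
For completeness, making the hook executable is just:

chmod +x /etc/network/if-up.d/restart-frr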

@flx-666 commented Apr 20, 2024

I faced issues getting en05/en06 up after reboot; I ended up adding auto en05 and auto en06 in all my interfaces config files.
frr did not start properly either until I added this line before the restart of the service:

post-up /usr/bin/systemctl reset-failed frr.service

So, my /etc/network/interfaces files end like this:
auto en05
allow-hotplug en05
iface en05 inet manual
mtu 65520

iface en05 inet6 manual
mtu 65520

auto en06
allow-hotplug en06
iface en06 inet manual
mtu 65520

iface en06 inet6 manual
mtu 65520

#source /etc/network/interfaces.d/*
post-up /usr/bin/systemctl reset-failed frr.service
post-up /usr/bin/systemctl restart frr.service

Hope this helps.

BTW, I wonder if I should have just added the reset-failed command to the script /etc/network/if-up.d/restart-frr?

@vdovhanych

BTW, I wonder if I should have just added the reset-failed command to the script /etc/network/if-up.d/restart-frr?

If you did what I described in my post about setting up a simple script to restart the frr service, you shouldn't need anything else in /etc/network/interfaces. That said, if it's working for you as you described and you have the post-up lines to restart the service in /etc/network/interfaces, I would just get rid of /etc/network/if-up.d/restart-frr if you still have it - it will only restart the service multiple times (maybe that is what was putting the service into a failed state too). Also, the script I have in /etc/network/if-up.d/ runs after either en05 or en06 comes up; it won't run if those interfaces are not up.

My /etc/network/interfaces looks like this

<default proxmox configuration above>

# thunderbolt network configuration
allow-hotplug en05
iface en05 inet manual
       mtu 65520

allow-hotplug en06
iface en06 inet manual
        mtu 65520

auto lo
iface lo inet loopback

auto lo:0
iface lo:0 inet static
        address 10.0.0.81/32

source /etc/network/interfaces.d/*

And then i have what was described in my previous post.

@flx-666 commented Apr 21, 2024

@vdovhanych thanks for your answer!
I added the reset-failed command to the restart-frr script, removed all the related entries in /etc/network/interfaces, and now everything is running smoothly :)
It seems to me that routing comes back quicker, probably because I no longer restart the service multiple times.

I did, however, have to add the auto en0X entries in /etc/network/interfaces to have the interfaces come up at reboot.

Thanks a lot for your input!
