@spali
Last active December 31, 2024 08:47
Disable WAN Interface on CARP Backup
#!/usr/local/bin/php
<?php
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");

$subsystem = !empty($argv[1]) ? $argv[1] : '';
$type = !empty($argv[2]) ? $argv[2] : '';

if ($type != 'MASTER' && $type != 'BACKUP') {
    log_error("Carp '$type' event unknown from source '{$subsystem}'");
    exit(1);
}

if (!strstr($subsystem, '@')) {
    log_error("Carp '$type' event triggered from wrong source '{$subsystem}'");
    exit(1);
}

$ifkey = 'wan';

if ($type === "MASTER") {
    log_error("enable interface '$ifkey' due CARP event '$type'");
    $config['interfaces'][$ifkey]['enable'] = '1';
    write_config("enable interface '$ifkey' due CARP event '$type'", false);
    interface_configure(false, $ifkey, false, false);
} else {
    log_error("disable interface '$ifkey' due CARP event '$type'");
    unset($config['interfaces'][$ifkey]['enable']);
    write_config("disable interface '$ifkey' due CARP event '$type'", false);
    interface_configure(false, $ifkey, false, false);
}
@toddgonzo74

Anyone find a fix for this issue yet?

@skl283

skl283 commented Nov 7, 2024

I haven't tried it yet, but does this issue also occur on 24.7.8? @bitcoredotorg, have you perhaps tried the update?

@toddgonzo74

I just upgraded to 24.7.8. (I was on 24.7.7 and it was working fine, as it was on 24.7.6.) I run both my firewalls in Proxmox, so I took a backup snapshot before each upgrade, just in case. When the primary node came back up, the only thing I noticed was that it was stuck in persistent CARP maintenance mode. I enabled and disabled it, and the backup failed right over to the primary.

The only issue I still have is with Spectrum. For some reason, when I use a VLAN on my managed switch (Juniper EX3400 POE), the Spectrum WAN routinely fails to obtain a new address via DHCP (I have DHCP snooping and damn near everything else that could interfere disabled in that VLAN). For a goof, I grabbed an old Netgear gigabit switch and plugged the Spectrum primary/backup and circuit into it. It has been fine for 4 months now and fails over Spectrum with no issues.

Anyway... not seeing the problem in 24.7.8.

@mknap

mknap commented Dec 2, 2024

I am running 24.7.9_1, and I see the same error mentioned by bitcoredotorg.

I also tried the recent development branch as of this writing, and it is the same.

Implementing @bitcoredotorg's fix seemed to work well enough, though I had to edit it slightly. With his workaround, the script looks like this for me:

if ($type === "MASTER") {
    log_error("enable interface '$ifkey' due CARP event '$type'");
    $config['interfaces'][$ifkey]['enable'] = '1';
    write_config("enable interface '$ifkey' due CARP event '$type'", false);
    #interface_configure(false, $ifkey, false, false);
    # NOTE: $interface is not defined in the script as posted; it is assumed to map 'wan' to the device name.
    shell_exec("/sbin/ifconfig {$interface[$ifkey]} up");
} else {
    log_error("disable interface '$ifkey' due CARP event '$type'");
    unset($config['interfaces'][$ifkey]['enable']);
    write_config("disable interface '$ifkey' due CARP event '$type'", false);
    #interface_configure(false, $ifkey, false, false);
    shell_exec("/sbin/ifconfig {$interface[$ifkey]} down");
}

error stack:

[01-Dec-2024 21:24:07 America/Chicago] PHP Fatal error:  Uncaught Error: Call to undefined function system_routing_configure() in /usr/local/etc/inc/interfaces.inc:3777
Stack trace:
#0 /usr/local/etc/inc/interfaces.inc(2498): interfaces_restart_by_device(false, Array, false)
#1 /usr/local/etc/rc.syshook.d/carp/10-wancarp(28): interface_configure(false, 'wan', false, false)
#2 {main}
  thrown in /usr/local/etc/inc/interfaces.inc on line 3777
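
Since the workaround above interpolates a device name straight into a shell command, here is a minimal sketch of one way to guard that step. `ifconfig_command` is a hypothetical helper, not an OPNsense function, and the name pattern is only an assumption about what interface names look like. It returns null for an empty or odd-looking name, so nothing runs when the device lookup came back empty.

```php
<?php
// Hypothetical helper (not part of OPNsense): build the ifconfig command
// for a device, refusing anything that does not look like an interface name.
function ifconfig_command(string $device, bool $up): ?string
{
    // Assumed name shape: a letter followed by letters, digits, '_' or '.'.
    if (!preg_match('/^[a-z][a-z0-9_.]*$/i', $device)) {
        return null; // empty or suspicious name: do not run anything
    }
    // escapeshellarg() quotes the name so it is passed as a single argument.
    return '/sbin/ifconfig ' . escapeshellarg($device) . ($up ? ' up' : ' down');
}
```

In the hook, the result would only be handed to shell_exec when it is not null; a null result is worth logging instead.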

@huetruong

huetruong commented Dec 16, 2024

I'm running OPNsense 24.7.10_2-amd64 and incorporated bits and pieces of the code posted here. The fix I found for the undefined function system_routing_configure() was to include system.inc in the script; after that, interface_configure runs without crashing. I do still have CARP event issues, but they are unrelated to this.

require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");

// Ensure system_routing_configure is included
require_once("system.inc");
.
.
.
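
To make a missing include fail with a clear log line instead of a fatal error, a small guard along these lines might help. This is only a sketch: `missing_functions` is a hypothetical helper, and the list of function names to check is an assumption taken from the stack trace earlier in this thread.

```php
<?php
// Hypothetical helper (not part of OPNsense): return the names from $names
// that are not defined as functions after the require_once lines have run.
function missing_functions(array $names): array
{
    return array_values(array_filter($names, function (string $name): bool {
        return !function_exists($name);
    }));
}

// Intended use in the hook, right after the includes (left as a comment
// because log_error() only exists inside OPNsense):
//
// $missing = missing_functions(['interface_configure', 'system_routing_configure']);
// if ($missing) {
//     log_error('10-wancarp: missing functions: ' . implode(', ', $missing));
//     exit(1);
// }
```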

@MEntOMANdo

So is this script considered stable on OPNsense 24.7.10_2 (with the possible need to require system.inc as mentioned directly above)?

@huetruong

> So is this script considered stable on OPNsense 24.7.10_2 (with the possible need to require system.inc as mentioned directly above)?

Not sure. I had only just gotten the whole script installed and my installation troubleshot, and I figured I would share what I did to get it working without the crash. I have it running on one physical bare-metal machine and one Proxmox VM with 11 internal VIP VLANs. Stable? Not sure.

@bitcoredotorg

I upgraded today to 24.7.11_2. Adding:
require_once("system.inc");
does prevent the crashing issue. Nice find, huetruong.

I'm still having an issue where entering persistent maintenance mode does not cause a failover: opnsense/core#7877
I also haven't had enough time to find the best way to shut/no-shut the WAN interface so that rebooting the active or passive device leaves the interface in a consistent, desired state based on the CARP status. (I don't want my backup/passive device to have its WAN interface enabled upon boot and requesting a DHCP lease while the active device is already handling traffic.)

@MEntOMANdo

MEntOMANdo commented Dec 21, 2024 via email

@bitcoredotorg

Creative suggestions, MEntOMANdo. You could do that and probably achieve a workable setup, but I see potential problems with that approach for some users and ISPs.
In your VM example, although the interface will be "down" by default, I believe it will still be brought up during boot: if the OPNsense configuration says the interface is enabled, it will be brought up.
In your cron example, you may also run into a race condition: depending on when during the boot process that cron entry actually executes, your WAN interface may still come up and do things like request a DHCP lease, and the cron job may not shut it down even though the device is 'backup'.

Towards the end of boot, the interface configuration is read and then applied. So with either approach, you risk both the interface coming up in the first place and it not being shut down after the OPNsense scripts read the configuration and bring it up.

This is one reason why I said my workaround of using shell_exec to manually set the interfaces up or down is not very clean or ideal: I'm calling shell_exec in the first place (bad practice, a security no-no!), and the state of the interface will not persist across reboots.

IMO, it's better for the syshook.d CARP script to set the interface's configuration to down and save this in the configuration, so that the WAN interface is only brought up at all when CARP's state changes to "master". This way you don't have to change the default interface behavior; the script handles it for you.
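
As a rough illustration of that idea, the decision and the config edit can be modeled as two small pure functions. Both names are hypothetical and nothing here calls OPNsense. Treating INIT like BACKUP is an assumption on my part, and the array shape simply mirrors how the gist's script edits `$config`.

```php
<?php
// Hypothetical helpers, independent of OPNsense; they only model the decision.

// Map a CARP event type to the desired WAN state:
// true = enabled, false = disabled, null = unknown event (ignore it).
// Assumption: INIT is treated like BACKUP, i.e. WAN stays disabled.
function wan_should_be_enabled(string $type): ?bool
{
    switch ($type) {
        case 'MASTER':
            return true;
        case 'BACKUP':
        case 'INIT':
            return false;
        default:
            return null;
    }
}

// Return a copy of the config array with the interface's 'enable' flag
// set or removed, the same way the gist's script edits $config in place.
function with_interface_enabled(array $config, string $ifkey, bool $enable): array
{
    if ($enable) {
        $config['interfaces'][$ifkey]['enable'] = '1';
    } else {
        unset($config['interfaces'][$ifkey]['enable']);
    }
    return $config;
}
```

Because the flag is saved in the configuration rather than toggled with ifconfig, the backup node's saved state already has WAN disabled, which is the state the configuration applied at boot would use.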

Thoughts?

@huetruong

huetruong commented Dec 21, 2024

> I upgraded today to 24.7.11_2. Adding: require_once("system.inc"); does prevent the crashing issue. Nice find, huetruong.
>
> I'm still having an issue with entering persistent maintenance mode not causing a failover: opnsense/core#7877 I've also not had enough time to find the most optimal way to shut/noshut the WAN interface - to ensure active/passive device reboot behavior produces a consistent and desired state for the interface based on the CARP status. (I don't want my backup/passive device to have it's WAN interface enabled upon boot, and requesting a DHCP lease while the active device is already handling traffic)

I reread your comments. I have to disable the WAN interface of the instance that is in backup state when I update and reboot so it doesn’t switch over.

This script works fine as an automatic failover if something goes wrong with the master.

@vc1cv1

vc1cv1 commented Dec 27, 2024

Long story short, after finding out I couldn't unbridge my ONT, I went about testing the WAN failover between my OPNsense VMs again.
Either I hadn't tested it in a long time or I was mistaken the last time I tested it. I had most of the issues that everyone mentioned, most noticeably the WAN interface not being enabled or disabled properly on the master/backup node respectively.

Also, on the backup/master node, I noticed that it kept repeating master/backup messages (as logged by 10-wancarp).

2024-12-27T01:38:05-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: enable interface 'wan' due CARP event 'MASTER'
2024-12-27T01:38:05-05:00 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member " (172.30.67.254) (40@vlan009)" has resumed the state "BACKUP" for vhid 40
2024-12-27T01:38:05-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: disable interface 'wan' due CARP event 'BACKUP'
2024-12-27T01:38:04-05:00 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member " (172.30.67.254) (40@vlan009)" has resumed the state "INIT" for vhid 40
2024-12-27T01:38:04-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: disable interface 'wan' due CARP event 'INIT'
2024-12-27T01:42:05-05:00 Notice configd.py [c8268658-528e-4180-9efb-b4465da3c196] Carp event on subsystem 200@vtnet1 for type MASTER
2024-12-27T01:40:05-05:00 Notice configd.py [75707303-20a1-468e-add3-97c31659f7cf] Carp event on subsystem 215@vlan09 for type MASTER

What I believe fixed the inconsistent master/backup status messages from 10-wancarp was noticing that the type check in the if statement of 20-openvpn (in /usr/local/etc/rc.syshook.d/carp) was different from the one here. Thanks to everyone who posted their fixes.

https://gist.github.com/vc1cv1/f59273ce98fda57cf8000cca65193b6b

#!/usr/local/bin/php
<?php
# last updated for opnsense 24.7.11_2

require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");
require_once("system.inc");

$subsystem = !empty($argv[1]) ? $argv[1] : '';
$type = !empty($argv[2]) ? $argv[2] : '';

if (!in_array($type, ['MASTER', 'BACKUP', 'INIT'])) {
    log_msg("Carp '$type' event unknown from source '{$subsystem}'");
    exit(1);
}

if (!strstr($subsystem, '@')) {
    log_error("Carp '$type' event triggered from wrong source '{$subsystem}'");
    exit(1);
}

$ifkey = 'wan';
$real_if = get_real_interface($ifkey);

# Since all my CARP IPs fail over together, I only want this to run when the event matches the CARP status change for my LAN interface. You can find the subsystem name by searching your debug log for 'carp', or comment out the if statement entirely.
if ($subsystem === "200@vtnet1") {
    if ($type === "MASTER") {
        log_error("enable interface '$ifkey' due CARP event '$type' on '$subsystem'");
        $config['interfaces'][$ifkey]['enable'] = '1';
        write_config("enable interface '$ifkey' due CARP event '$type'", false);
        interface_configure(false, $ifkey, false, false);
        sleep(2);
        shell_exec("/sbin/ifconfig {$real_if} up");
        log_msg("Issuing dhclient command on '$real_if' to request a DHCP lease");
        sleep(1);
        shell_exec("dhclient {$real_if}");
    } else {
        log_error("disable interface '$ifkey' due CARP event '$type' on '$subsystem'");
        unset($config['interfaces'][$ifkey]['enable']);
        write_config("disable interface '$ifkey' due CARP event '$type'", false);
        interface_configure(false, $ifkey, false, false);
        shell_exec("/sbin/ifconfig {$real_if} down");
    }
} # if subsystem
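
The subsystem strings in the log above look like `vhid@interface` (for example `200@vtnet1`). As a sketch of an alternative to comparing the whole string, the two parts could be split out and matched separately. `parse_carp_subsystem` is a hypothetical helper, and the `vhid@interface` shape is inferred from those log lines rather than from any documented format.

```php
<?php
// Hypothetical helper (not part of OPNsense): split a CARP subsystem string
// such as "200@vtnet1" into its VHID and interface name.
// Returns null when the string does not have the assumed "digits@name" shape.
function parse_carp_subsystem(string $subsystem): ?array
{
    if (!preg_match('/^(\d+)@(.+)$/', $subsystem, $m)) {
        return null;
    }
    return ['vhid' => (int)$m[1], 'if' => $m[2]];
}
```

With that, the filter could become something like `$s = parse_carp_subsystem($subsystem); if ($s !== null && $s['if'] === 'vtnet1') { ... }`, which reacts to any VHID on the LAN device instead of one hard-coded pair.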

@vc1cv1

vc1cv1 commented Dec 27, 2024

> Creative suggestions, MEntOMANdo. You could do that and probably achieve a workable situation, but I see potential problems with that approach, and for some users and ISPs. In your VM example, though the interface will be "down" by default, I believe the interface will still be brought up by configuration during boot - if it's stored in the opnsense configuration for the interface to be up, it will be brought up during boot. In your CRON example, you may also run into a race condition, and still have your WAN interface come up, and do things like request a DHCP Lease, and possibly also not be shut down by the cron job if the device is 'backup' - depending on when the boot process that cron entry actually executes.
>
> Towards the end of 'boot', the interface configuration is read, and then applied. So, with either approach, you have both the risk of the interface coming up in the first place, or not being shut down after the opnsense scripts read the configuration and bring up the interface.
>
> This is one reason why I mention my workaround of using shell_exec to manually set the interfaces up or down is not very clean, or ideal - both because I'm calling shell_exec in the first place (bad practice, a security no-no!), and because the state of the interface will not persist across reboots).
>
> IMO, it's better for the syshook.d CARP script to set the interface's configuration to be down, and save this in the configuration - so that only when CARP's state changes to "master", will the WAN interface be brought up at all. This way, you don't have to change default interface behavior, the script handles this for you.
>
> Thoughts?

Agreed, it's better for the status of the interface to be saved. After testing my failovers, I saw nothing on my backup node at reboot that mentioned an attempt to bring the disabled 'wan' interface online and/or it being disabled by CARP status.
