Two changes were required to reproduce this consistently in a lab:
- Expand the window where the race condition can occur (time spent between port_update steps 12 and 22).
- Slow port creation for the remote secgroup members.
I picked some extreme delays here; in production, the difference between hitting the race condition and missing it is measured in milliseconds.
These steps were tested on a multi-node environment deployed from https://github.com/openstack/openstack-ansible/tree/436e777, but should work on any deployment with 2x compute nodes.
We will designate one compute node as the "server" hypervisor and the other as the "client" hypervisor. The distinction is important when we get to the "Expand race condition window" step below. I've chosen compute1 as the "client" hypervisor and compute2 as the "server" hypervisor.
Please see the attachment for helper scripts and configuration from neutron-server and neutron-linuxbridge-agent.
# Image
wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
openstack image create "cirros" --file cirros-0.3.4-x86_64-disk.img --disk-format qcow2 --container-format bare --public
# Flavor
openstack flavor create --disk 10 --vcpus 1 --ram 128 default
# Network
openstack network create network
openstack subnet create --network network --subnet-range 10.0.0.0/24 --allocation-pool start=10.0.0.10,end=10.0.0.254 subnet
openstack subnet set --dns-nameserver 8.8.8.8 subnet
# Secgroups
openstack security group create server
openstack security group create client
# Create server security group rules
openstack security group rule create --remote-group client --dst-port 9092 --protocol tcp server
openstack security group rule create --remote-group server --dst-port 9092 --protocol tcp server
# Create client security group rules
openstack security group rule create --egress --remote-group server --dst-port 9092 --protocol tcp client
- Patch SecurityGroupAgentRpc._apply_port_filter() in neutron/agent/securitygroups_rpc.py where neutron-linuxbridge-agent will run on the "server" hypervisor (compute2 in this example).
This extends the window between port_update steps 12 and 22 and makes it easier to land security_group_member_updated events in it.
root@compute2:~# diff -u /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py{-orig,}
--- /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py-orig 2020-07-15 12:12:29.585918642 -0500
+++ /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py 2020-07-15 12:12:58.741670522 -0500
@@ -142,6 +142,11 @@
devices.update(devices_info['devices'])
security_groups.update(devices_info['security_groups'])
security_group_member_ips.update(devices_info['sg_member_ips'])
+ import time
+ sleeptime = 10
+ LOG.info("neutron_diag: sleeping {}s after security_group_info_for_devices".format(sleeptime))
+ time.sleep(sleeptime)
+ LOG.info("neutron_diag: done sleeping {}s".format(sleeptime))
else:
devices = self.plugin_rpc.security_group_rules_for_devices(
self.context, list(device_ids))
root@compute2:~#
- Restart neutron-linuxbridge-agent
root@compute2:~# systemctl restart neutron-linuxbridge-agent
root@compute2:~#
- Record the id of the client security group
root@infra1-utility-container-25a610a0:~# openstack security group show client -cid -fvalue
3ac28f59-719c-44d8-9583-647dea1c6018
- Patch Ml2Plugin.create_port() in neutron/plugins/ml2/plugin.py where neutron-server will run. Update client_sg_id with the id from the previous step.
This slows down port_updates for the remote security group members, which in turn slows down receipt of security_group_member_updated events.
root@infra1-neutron-server-container-178b5047:/openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/plugins/ml2# diff -u plugin.py{-orig,}
--- plugin.py-orig 2020-07-14 16:28:05.465274317 -0500
+++ plugin.py 2020-07-15 13:13:39.619711712 -0500
@@ -1335,6 +1335,13 @@
@db_api.retry_if_session_inactive()
def create_port(self, context, port):
self._before_create_port(context, port)
+ client_sg_id = '3ac28f59-719c-44d8-9583-647dea1c6018'
+ sleeptime = 8
+ if client_sg_id in port['port']['security_groups']:
+ import time
+ LOG.info("neutron_diag: sleeping {}s before creating port".format(sleeptime))
+ time.sleep(sleeptime)
+ LOG.info("neutron_diag: done sleeping {}s".format(sleeptime))
result, mech_context = self._create_port_db(context, port)
return self._after_create_port(context, result, mech_context)
- Restart neutron-server
root@infra1-neutron-server-container-178b5047:/openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/plugins/ml2# systemctl restart neutron-server
root@infra1-neutron-server-container-178b5047:/openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/plugins/ml2#
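To make the interaction between the two delays concrete, here is a toy timing model. It is not Neutron code: the function name is made up, and it assumes (my simplification) that the server and client ports are requested at about the same moment, so the agent's stretched window opens at t=0 and the member update for the client port goes out once the delayed create_port() finishes.

```python
def update_lands_in_window(agent_delay_s, port_create_delay_s):
    """Toy check: does the member update arrive while the agent is still
    inside the stretched window?

    Assumption (not from Neutron): the window opens at t=0 and stays open
    for agent_delay_s; the security_group_member_updated event for the
    client port is sent at t=port_create_delay_s.
    """
    window_open, window_close = 0.0, float(agent_delay_s)
    update_at = float(port_create_delay_s)
    return window_open <= update_at < window_close


# sleeptime = 10 in the agent patch, sleeptime = 8 in the create_port() patch.
print(update_lands_in_window(10, 8))   # True: the update lands inside the window
print(update_lands_in_window(10, 12))  # False: the update arrives after the window
```

Under that assumption, any create_port() delay shorter than the agent delay should put the event inside the window, which is why 8s and 10s work as a pair.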
Source credentials and run build.sh to create VMs and wait for them to reach an active state:
root@infra1-utility-container-25a610a0:~# . ~/openrc
root@infra1-utility-container-25a610a0:~# ./build.sh
issuing server create for VMs ... done
All VMs in ACTIVE state
+--------------------------------------+----------+--------+-------------------+--------+---------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------+--------+-------------------+--------+---------+
| 572d1e98-62f6-4247-b39d-92305be188f1 | server01 | ACTIVE | network=10.0.0.16 | cirros | default |
| ed339f3c-9102-4686-9eb0-fa08887b264e | client01 | ACTIVE | network=10.0.0.28 | cirros | default |
+--------------------------------------+----------+--------+-------------------+--------+---------+
You're looking for a line starting with !!! in the output of check.sh.
Note that this script assumes the following:
- you can SSH to each hypervisor with a key from your current location
- you can run mysql and access the database from your current location
If those assumptions do not apply to your environment, it's probably easiest to just run ssh compute2 ipset list and look for sets with no members.
root@infra1-utility-container-25a610a0:~# ./check.sh
Starting at Wed Jul 15 18:23:41 UTC 2020
Instance info:
+--------------------------------------+----------+--------+-------------------+--------+---------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+----------+--------+-------------------+--------+---------+
| 572d1e98-62f6-4247-b39d-92305be188f1 | server01 | ACTIVE | network=10.0.0.16 | cirros | default |
| ed339f3c-9102-4686-9eb0-fa08887b264e | client01 | ACTIVE | network=10.0.0.28 | cirros | default |
+--------------------------------------+----------+--------+-------------------+--------+---------+
Security groups assigned to stack's instances
+--------------------------------------+--------+
| security_group_id | name |
+--------------------------------------+--------+
| 3ac28f59-719c-44d8-9583-647dea1c6018 | client |
| 16feb3e7-e37f-4d4a-af8f-e86e37071178 | server |
+--------------------------------------+--------+
Checking instances:
--- client01 (10.0.0.28) on compute1
--- port: e4cb184f-7e08-47b5-9226-7bf4fff8803b (tape4cb184f-7e)
--- iptables chains: in = neutron-linuxbri-ie4cb184f-7 out = neutron-linuxbri-oe4cb184f-7
--- references remote secgroup server (16feb3e7-e37f-4d4a-af8f-e86e37071178) ...
--- referencing secgroups:
--- client (3ac28f59-719c-44d8-9583-647dea1c6018)
--- ipset name: NIPv416feb3e7-e37f-4d4a-af8f-
--- expected IPs: 10.0.0.16
--- found IPs: 10.0.0.16
--- server01 (10.0.0.16) on compute2
--- port: 0e7eb2d3-5d14-42f5-b4ee-45ade16e50d2 (tap0e7eb2d3-5d)
--- iptables chains: in = neutron-linuxbri-i0e7eb2d3-5 out = neutron-linuxbri-o0e7eb2d3-5
--- references remote secgroup server (16feb3e7-e37f-4d4a-af8f-e86e37071178) ...
--- referencing secgroups:
--- server (16feb3e7-e37f-4d4a-af8f-e86e37071178)
--- ipset name: NIPv416feb3e7-e37f-4d4a-af8f-
--- expected IPs: 10.0.0.16
--- found IPs: 10.0.0.16
--- references remote secgroup client (3ac28f59-719c-44d8-9583-647dea1c6018) ...
--- referencing secgroups:
--- server (16feb3e7-e37f-4d4a-af8f-e86e37071178)
--- ipset name: NIPv43ac28f59-719c-44d8-9583-
--- expected IPs: 10.0.0.28
--- found IPs:
!!! IP MISMATCH DETECTED ^^^
Completed at Wed Jul 15 18:23:44 UTC 2020
root@infra1-utility-container-25a610a0:~#
root@infra1-utility-container-25a610a0:~# ssh compute2 ipset list NIPv43ac28f59-719c-44d8-9583- 2>/dev/null
Name: NIPv43ac28f59-719c-44d8-9583-
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 384
References: 1
Members:
root@infra1-utility-container-25a610a0:~#
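If check.sh does not fit your environment, the manual check can be scripted. The sketch below is not part of the attached helper scripts; both function names are mine. It assumes the set name is "N" plus the ethertype plus the security group id, cut to 29 characters, which I inferred from the names in the output above. It also assumes `ipset list` prints a "Name:" line per set and a "Members:" line followed by one entry per line, as shown above.

```python
IPSET_NAME_LENGTH = 29  # assumption: inferred from the set names in the output above


def ipset_name(sg_id, ethertype="IPv4"):
    """Guess the ipset name the agent uses for a remote security group."""
    return ("N" + ethertype + sg_id)[:IPSET_NAME_LENGTH]


def empty_sets(ipset_list_output):
    """Return the names of sets in `ipset list` output that have no members."""
    empty = []
    name, in_members, count = None, False, 0

    def flush():
        # Record the set we just finished if its Members: section was empty.
        if name is not None and in_members and count == 0:
            empty.append(name)

    for raw in ipset_list_output.splitlines():
        line = raw.strip()
        if line.startswith("Name:"):
            flush()
            name = line.split(":", 1)[1].strip()
            in_members, count = False, 0
        elif line == "Members:":
            in_members = True
        elif in_members and line:
            count += 1
    flush()
    return empty
```

One way to use it is to pipe the output of ssh compute2 ipset list into empty_sets() and compare the result against ipset_name() for the client security group id recorded earlier.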
root@infra1-utility-container-25a610a0:~# ./destroy.sh
deleting VMs ... done
root@infra1-utility-container-25a610a0:~#
This step is optional and included for convenience, since using 8s and 10s for the delays as described above gives me a 100% failure rate. This command will keep building, checking, and destroying until a failure occurs.
# while ./build.sh && ./check.sh && ./destroy.sh ; do continue; done
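The shell loop above never stops if the race is not hit. A bounded variant could look like the sketch below. It assumes, as the shell loop does, that check.sh exits non-zero when it detects a mismatch; the function name and the max_iterations cap are my additions, not part of the attached scripts.

```python
import subprocess


def run_until_failure(steps=("./build.sh", "./check.sh", "./destroy.sh"),
                      max_iterations=20, run=subprocess.call):
    """Run the steps in order, repeatedly, until one exits non-zero.

    Returns (iteration, step) for the first failing step, or None if
    max_iterations complete cycles pass without a failure.
    """
    for iteration in range(1, max_iterations + 1):
        for step in steps:
            if run([step]) != 0:
                return iteration, step
    return None


# Example: result = run_until_failure(); a result of (1, "./check.sh") would
# mean the mismatch was detected on the first build.
```

As with the shell version, the VMs are left in place when a check fails, so the failing state can be inspected before running destroy.sh by hand.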