Overview

Two changes were required to reproduce this consistently in a lab:

  1. Expand the window where the race condition can occur (time spent between port_update steps 12 and 22).
  2. Slow down port creation for the remote secgroup members.

I picked some extreme delays here. In production, the difference between hitting the race condition and missing it is measured in milliseconds.
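As a rough way to see why the lab delays make the race reliable, here is a toy timing model. It is only a sketch under a simplifying assumption (not taken from the neutron code): both delays are measured from the same instant, and the race is hit whenever the delayed event arrives while the agent's window is still open. The function name and the millisecond figures are hypothetical; the 10s and 8s values are the delays used in the patches in this write-up.

```python
def lands_in_window(window_start, window_length, event_time):
    """Return True if event_time falls inside the closed interval
    [window_start, window_start + window_length]."""
    return window_start <= event_time <= window_start + window_length


# Lab settings from this write-up (seconds).
AGENT_SLEEP = 10   # sleep added on the "server" hypervisor's agent
SERVER_SLEEP = 8   # sleep added before creating ports in the client secgroup

# With the lab delays, the delayed event arrives while the window is open.
print(lands_in_window(0, AGENT_SLEEP, SERVER_SLEEP))

# Hypothetical production-like numbers: a 5 ms window and an event at 8 ms.
print(lands_in_window(0, 0.005, 0.008))
```

The point of the model is only that the server-side delay has to be shorter than the agent-side delay for the event to land inside the window.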

These steps were tested on a multi-node environment deployed from https://github.com/openstack/openstack-ansible/tree/436e777, but should work on any deployment with two compute nodes.

We will designate one compute node as the "server" hypervisor and the other as the "client" hypervisor. The distinction is important when we get to the "Expand race condition window" step below. I've chosen compute1 as the "client" hypervisor and compute2 as the "server" hypervisor.

Please see the attachment for helper scripts and configuration from neutron-server and neutron-linuxbridge-agent.

Setup

OpenStack resources

# Image
wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
openstack image create "cirros" --file cirros-0.3.4-x86_64-disk.img   --disk-format qcow2 --container-format bare   --public

# Flavor
openstack flavor create --disk 10 --vcpus 1 --ram 128 default

# Network
openstack network create network
openstack subnet create --network network --subnet-range 10.0.0.0/24 --allocation-pool start=10.0.0.10,end=10.0.0.254 subnet
openstack subnet set --dns-nameserver 8.8.8.8 subnet

# Secgroups
openstack security group create server
openstack security group create client

# Create server security group rules
openstack security group rule create --remote-group client  --dst-port 9092 --protocol tcp server
openstack security group rule create --remote-group server  --dst-port 9092 --protocol tcp server

# Create client security group rules
openstack security group rule create --egress --remote-group server --dst-port 9092 --protocol tcp client

Expand race condition window

  1. Patch SecurityGroupAgentRpc._apply_port_filter() in neutron/agent/securitygroups_rpc.py on the "server" hypervisor (compute2 in this example), where neutron-linuxbridge-agent runs.

This extends the window between port_update steps 12 and 22 and makes it easier to land security_group_member_updated events in it.

root@compute2:~# diff -u /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py{-orig,}
--- /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py-orig  2020-07-15 12:12:29.585918642 -0500
+++ /openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/agent/securitygroups_rpc.py       2020-07-15 12:12:58.741670522 -0500
@@ -142,6 +142,11 @@
                 devices.update(devices_info['devices'])
                 security_groups.update(devices_info['security_groups'])
                 security_group_member_ips.update(devices_info['sg_member_ips'])
+            import time
+            sleeptime = 10
+            LOG.info("neutron_diag: sleeping {}s after security_group_info_for_devices".format(sleeptime))
+            time.sleep(sleeptime)
+            LOG.info("neutron_diag: done sleeping {}s".format(sleeptime))
         else:
             devices = self.plugin_rpc.security_group_rules_for_devices(
                 self.context, list(device_ids))
root@compute2:~#
  2. Restart neutron-linuxbridge-agent
root@compute2:~# systemctl restart neutron-linuxbridge-agent
root@compute2:~#

Slow down port creation for remote secgroup members

  1. Record the id of the client security group
root@infra1-utility-container-25a610a0:~# openstack security group show client -cid -fvalue
3ac28f59-719c-44d8-9583-647dea1c6018
  2. Patch Ml2Plugin.create_port() in neutron/plugins/ml2/plugin.py on the host where neutron-server runs. Update client_sg_id with the id from the previous step.

This slows down port_updates for the remote security group members, which in turn slows down receipt of security_group_member_updated events.

root@infra1-neutron-server-container-178b5047:/openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/plugins/ml2# diff -u plugin.py{-orig,}
--- plugin.py-orig      2020-07-14 16:28:05.465274317 -0500
+++ plugin.py   2020-07-15 13:13:39.619711712 -0500
@@ -1335,6 +1335,13 @@
     @db_api.retry_if_session_inactive()
     def create_port(self, context, port):
         self._before_create_port(context, port)
+        client_sg_id = '3ac28f59-719c-44d8-9583-647dea1c6018'
+        sleeptime = 8
+        if client_sg_id in port['port']['security_groups']:
+            import time
+            LOG.info("neutron_diag: sleeping {}s before creating port".format(sleeptime))
+            time.sleep(sleeptime)
+            LOG.info("neutron_diag: done sleeping {}s".format(sleeptime))
         result, mech_context = self._create_port_db(context, port)
         return self._after_create_port(context, result, mech_context)
  3. Restart neutron-server
root@infra1-neutron-server-container-178b5047:/openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/plugins/ml2# systemctl restart neutron-server
root@infra1-neutron-server-container-178b5047:/openstack/venvs/neutron-18.1.19.dev2/lib/python2.7/site-packages/neutron/plugins/ml2#
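The guard in the create_port() patch above can be tried out in isolation. The sketch below is a hypothetical standalone helper, not neutron code: maybe_delay and its sleep parameter are names made up for illustration. It also skips the delay when security_groups is missing or is not a container, which is an added precaution and not something the original patch does.

```python
import time


def maybe_delay(port, client_sg_id, sleeptime=8, sleep=time.sleep):
    """Sleep only when the port request lists the client security group.

    Returns True if a delay was applied. Hypothetical helper for illustration.
    """
    security_groups = port.get('port', {}).get('security_groups')
    # Assumption: anything that is not a list/tuple/set means "no groups given".
    if not isinstance(security_groups, (list, tuple, set)):
        return False
    if client_sg_id not in security_groups:
        return False
    sleep(sleeptime)
    return True


if __name__ == "__main__":
    client_sg_id = '3ac28f59-719c-44d8-9583-647dea1c6018'
    request = {'port': {'security_groups': [client_sg_id]}}
    # Use a 0s delay here so the demo returns immediately.
    print(maybe_delay(request, client_sg_id, sleeptime=0))
```

Passing sleep as a parameter keeps the helper testable without actually waiting 8 seconds.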

Execution

Create instances

Source credentials and run build.sh, which creates the VMs and waits for them to reach the ACTIVE state.

root@infra1-utility-container-25a610a0:~# . ~/openrc
root@infra1-utility-container-25a610a0:~# ./build.sh
issuing server create for VMs ... done

All VMs in ACTIVE state

+--------------------------------------+----------+--------+-------------------+--------+---------+
| ID                                   | Name     | Status | Networks          | Image  | Flavor  |
+--------------------------------------+----------+--------+-------------------+--------+---------+
| 572d1e98-62f6-4247-b39d-92305be188f1 | server01 | ACTIVE | network=10.0.0.16 | cirros | default |
| ed339f3c-9102-4686-9eb0-fa08887b264e | client01 | ACTIVE | network=10.0.0.28 | cirros | default |
+--------------------------------------+----------+--------+-------------------+--------+---------+

Check for IP mismatch

Run check.sh and look for a line starting with !!!. It marks an ipset whose members do not match the expected IPs.

Note that this script assumes the following:

  • you can SSH to each hypervisor with a key from your current location
  • you can run mysql and access the database from your current location

If those assumptions do not apply to your environment, it is probably easiest to run ssh compute2 ipset list and look for sets with no members.
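For the manual route, scanning the ipset list output for empty sets can be scripted. The following is a minimal sketch, assuming the output has the "Name:" / "Members:" layout shown in the manual validation output in this write-up. The function name empty_ipsets is hypothetical, and the sample text is an abbreviated, hand-written stand-in for real output.

```python
def empty_ipsets(ipset_output):
    """Return the names of sets whose "Members:" section lists no entries."""
    members = {}
    current = None
    in_members = False
    for raw_line in ipset_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Name:"):
            # A new set begins; start collecting its members from scratch.
            current = line.split(":", 1)[1].strip()
            members[current] = []
            in_members = False
        elif line == "Members:":
            in_members = True
        elif line and in_members and current is not None:
            members[current].append(line)
    return [name for name, ips in members.items() if not ips]


# Abbreviated sample (header lines other than Name/Type omitted).
SAMPLE = """\
Name: NIPv416feb3e7-e37f-4d4a-af8f-
Type: hash:net
Members:
10.0.0.16

Name: NIPv43ac28f59-719c-44d8-9583-
Type: hash:net
Members:
"""

print(empty_ipsets(SAMPLE))
```

To use it against a live hypervisor, the output of ssh compute2 ipset list could be captured (for example with subprocess.check_output) and passed to the function as text.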

root@infra1-utility-container-25a610a0:~# ./check.sh
Starting at Wed Jul 15 18:23:41 UTC 2020
Instance info:
+--------------------------------------+----------+--------+-------------------+--------+---------+
| ID                                   | Name     | Status | Networks          | Image  | Flavor  |
+--------------------------------------+----------+--------+-------------------+--------+---------+
| 572d1e98-62f6-4247-b39d-92305be188f1 | server01 | ACTIVE | network=10.0.0.16 | cirros | default |
| ed339f3c-9102-4686-9eb0-fa08887b264e | client01 | ACTIVE | network=10.0.0.28 | cirros | default |
+--------------------------------------+----------+--------+-------------------+--------+---------+

Security groups assigned to stack's instances
+--------------------------------------+--------+
| security_group_id                    | name   |
+--------------------------------------+--------+
| 3ac28f59-719c-44d8-9583-647dea1c6018 | client |
| 16feb3e7-e37f-4d4a-af8f-e86e37071178 | server |
+--------------------------------------+--------+


Checking instances:

--- client01 (10.0.0.28) on compute1
    --- port: e4cb184f-7e08-47b5-9226-7bf4fff8803b (tape4cb184f-7e)
    --- iptables chains: in = neutron-linuxbri-ie4cb184f-7 out = neutron-linuxbri-oe4cb184f-7
    --- references remote secgroup server (16feb3e7-e37f-4d4a-af8f-e86e37071178) ...
        --- referencing secgroups:
            --- client (3ac28f59-719c-44d8-9583-647dea1c6018)
        --- ipset name: NIPv416feb3e7-e37f-4d4a-af8f-
        --- expected IPs: 10.0.0.16
        --- found IPs: 10.0.0.16

--- server01 (10.0.0.16) on compute2
    --- port: 0e7eb2d3-5d14-42f5-b4ee-45ade16e50d2 (tap0e7eb2d3-5d)
    --- iptables chains: in = neutron-linuxbri-i0e7eb2d3-5 out = neutron-linuxbri-o0e7eb2d3-5
    --- references remote secgroup server (16feb3e7-e37f-4d4a-af8f-e86e37071178) ...
        --- referencing secgroups:
            --- server (16feb3e7-e37f-4d4a-af8f-e86e37071178)
        --- ipset name: NIPv416feb3e7-e37f-4d4a-af8f-
        --- expected IPs: 10.0.0.16
        --- found IPs: 10.0.0.16
    --- references remote secgroup client (3ac28f59-719c-44d8-9583-647dea1c6018) ...
        --- referencing secgroups:
            --- server (16feb3e7-e37f-4d4a-af8f-e86e37071178)
        --- ipset name: NIPv43ac28f59-719c-44d8-9583-
        --- expected IPs: 10.0.0.28
        --- found IPs:
!!! IP MISMATCH DETECTED ^^^

Completed at Wed Jul 15 18:23:44 UTC 2020
root@infra1-utility-container-25a610a0:~#

Manual validation

root@infra1-utility-container-25a610a0:~# ssh compute2 ipset list NIPv43ac28f59-719c-44d8-9583- 2>/dev/null
Name: NIPv43ac28f59-719c-44d8-9583-
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 384
References: 1
Members:
root@infra1-utility-container-25a610a0:~#
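To validate other ports or security groups by hand, it helps to know which device, chain, and set names to look for. The sketch below derives them from the IDs. The prefixes and truncation lengths are inferred only from the check.sh output shown above (for example, "NIPv4" plus the security group id, cut to 29 characters), so treat them as assumptions that may not hold for other drivers or releases. The function name derived_names is hypothetical.

```python
def derived_names(port_id, remote_sg_id):
    """Guess the tap device, iptables chains, and IPv4 ipset name for a port.

    Lengths are inferred from sample output, not from the neutron source.
    """
    return {
        "tap": ("tap" + port_id)[:14],
        "chain_in": "neutron-linuxbri-i" + port_id[:10],
        "chain_out": "neutron-linuxbri-o" + port_id[:10],
        "ipset": ("NIPv4" + remote_sg_id)[:29],
    }


names = derived_names(
    "0e7eb2d3-5d14-42f5-b4ee-45ade16e50d2",   # server01 port id
    "3ac28f59-719c-44d8-9583-647dea1c6018",   # client security group id
)
for key, value in names.items():
    print(key, value)
```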

Cleanup

root@infra1-utility-container-25a610a0:~# ./destroy.sh
deleting VMs ... done

root@infra1-utility-container-25a610a0:~#

Run in a loop (optional)

This step is included for convenience. With the 8s and 10s delays described above, I hit the failure on every run. This command will keep building, checking, and destroying until a failure occurs (that is, until one of the scripts exits with a non-zero status).

# while ./build.sh && ./check.sh && ./destroy.sh ; do continue; done
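The shell loop above never ends if no script fails. A bounded variant could look like the sketch below. It assumes, as the shell loop does, that a script signals failure through a non-zero exit status; whether check.sh does so on an IP mismatch is not shown here. The function name run_until_failure, the max_rounds cap, and the run parameter are hypothetical.

```python
import subprocess


def run_until_failure(steps=("./build.sh", "./check.sh", "./destroy.sh"),
                      max_rounds=20, run=subprocess.call):
    """Run the steps in order, round after round, stopping at the first
    non-zero exit status.

    Returns (round_number, failing_step), or (max_rounds, None) if every
    step succeeded in every round.
    """
    for round_number in range(1, max_rounds + 1):
        for step in steps:
            if run([step]) != 0:
                return round_number, step
    return max_rounds, None
```

Calling run_until_failure() from the directory holding the helper scripts reports which round and which script stopped the loop. The run parameter exists so the loop can be exercised with a stub instead of the real scripts.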