Skip to content

Instantly share code, notes, and snippets.

@jtaleric
Last active August 23, 2017 14:46
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jtaleric/21dbd06d011435c80a9f7f2b4a3971ad to your computer and use it in GitHub Desktop.
Save jtaleric/21dbd06d011435c80a9f7f2b4a3971ad to your computer and use it in GitHub Desktop.

Looking through the neutron.server logs, prior to the ovs-vswitchd spike:

First neutron-delete router:

2017-08-23 02:08:17.333 127073 INFO neutron.wsgi [req-ceeeb0bb-e553-4c92-b9ee-65a288cb5264 fffac8d9a827403599ea3814a27d43e5 500841a8912d4ed0abd92de33f2265b5 - default default] 172.16.0.10 "DELETE /v2.0/routers/090967b4-e423-4590-86e9-a3eb5e76fe0f.json HTTP/1.1" status: 204  len: 149 time: 4.7521341

ovs-vswitchd at this point: https://snapshot.raintank.io/dashboard/snapshot/sHfJUMVxcPgYUwUsYONozN0RTlfqXuM3?panelId=72&fullscreen&orgId=2

CPU Utilization from vswitchd seems normal at this time. No trashing.

Deletes continue, I am seeing (neutron.log):

2017-08-23 02:08:17.808 127088 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 111] ECONNREFUSED
2017-08-23 02:08:17.812 127072 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 111] ECONNREFUSED
2017-08-23 02:08:17.888 127081 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 111] ECONNREFUSED
2017-08-23 02:08:17.946 127084 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: error: [Errno 111] ECONNREFUSED
2017-08-23 02:08:18.026 127086 WARNING neutron.plugins.ml2.drivers.mech_agent [req-5db777b5-2bf4-4a2c-944e-76015b7a8550 - - - - -] Refusing to bind port f1cc9f9a-c4ed-4b74-ae63-15d21981d0b8 to dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2017, 8, 23, 2, 2, 8), 'availability_zone': None, 'alive': False, 'topic': u'N/A', 'host': u'overcloud-controller-2.localdomain', 'agent_type': u'Open vSwitch agent', 'resource_versions': {u'Subnet': u'1.0', u'Network': u'1.0', u'SubPort': u'1.0', u'SecurityGroup': u'1.0', u'SecurityGroupRule': u'1.0', u'Trunk': u'1.1', u'QosPolicy': u'1.6', u'Port': u'1.1', u'Log': u'1.0'}, 'created_at': datetime.datetime(2017, 8, 22, 10, 48, 13), 'started_at': datetime.datetime(2017, 8, 23, 1, 23, 30), 'id': u'5197ba41-036e-4d49-956f-023f4731add3', 'configurations': {u'ovs_hybrid_plug': True, u'in_distributed_mode': False, u'datapath_type': u'system', u'vhostuser_socket_dir': u'/var/run/openvswitch', u'tunneling_ip': u'172.17.0.10', u'arp_responder_enabled': False, u'devices': 1216, u'ovs_capabilities': {u'datapath_types': [u'netdev', u'system'], u'iface_types': [u'geneve', u'gre', u'internal', u'lisp', u'patch', u'stt', u'system', u'tap', u'vxlan']}, u'log_agent_heartbeats': False, u'l2_population': False, u'tunnel_types': [u'vxlan'], u'extensions': [u'qos'], u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex'}}}

While in /var/log/messages, I see :

Aug 23 02:08:17 localhost Keepalived_vrrp[106337]: VRRP_Instance(VR_3) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:08:17 localhost Keepalived_vrrp[130111]: VRRP_Instance(VR_3) Transition to MASTER STATE
Aug 23 02:08:17 localhost Keepalived_vrrp[210970]: VRRP_Instance(VR_11) Received advert with higher priority 50, ours 50
Aug 23 02:08:17 localhost Keepalived_vrrp[210970]: VRRP_Instance(VR_11) Entering BACKUP STATE
Aug 23 02:08:18 localhost Keepalived_vrrp[287231]: VRRP_Instance(VR_19) Received advert with higher priority 50, ours 50
Aug 23 02:08:18 localhost Keepalived_vrrp[287231]: VRRP_Instance(VR_19) Entering BACKUP STATE
Aug 23 02:08:19 localhost Keepalived_vrrp[106337]: VRRP_Instance(VR_3) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:08:19 localhost Keepalived_vrrp[93902]: VRRP_Instance(VR_6) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:08:19 localhost Keepalived_vrrp[130111]: VRRP_Instance(VR_3) Entering MASTER STATE
Aug 23 02:08:19 localhost Keepalived_vrrp[286771]: VRRP_Instance(VR_21) Transition to MASTER STATE
Aug 23 02:08:19 localhost Keepalived_vrrp[108139]: VRRP_Instance(VR_2) Received advert with higher priority 50, ours 50
Aug 23 02:08:19 localhost Keepalived_vrrp[108139]: VRRP_Instance(VR_2) Entering BACKUP STATE
Aug 23 02:08:20 localhost Keepalived_vrrp[176159]: VRRP_Instance(VR_7) Transition to MASTER STATE
Aug 23 02:08:20 localhost kernel: IN=eno1 OUT= MAC=ff:ff:ff:ff:ff:ff:40:b4:f0:79:62:01:08:00 SRC=10.12.23.253 DST=255.255.255.255 LEN=389 TOS=0x00 PREC=0x00 TTL=1 ID=8642 PROTO=UDP SPT=67 DPT=68 LEN=369
Aug 23 02:08:21 localhost Keepalived_vrrp[106337]: VRRP_Instance(VR_3) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:08:21 localhost Keepalived_vrrp[286771]: VRRP_Instance(VR_21) Entering MASTER STATE
Aug 23 02:08:21 localhost ovsdb-server: ovs|01323|reconnect|ERR|tcp:127.0.0.1:47002: no response to inactivity probe after 5 seconds, disconnecting
Aug 23 02:08:22 localhost Keepalived_vrrp[93902]: VRRP_Instance(VR_6) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:08:22 localhost Keepalived_vrrp[176159]: VRRP_Instance(VR_7) Entering MASTER STATE
Aug 23 02:08:22 localhost Keepalived_vrrp[174918]: VRRP_Instance(VR_8) Received advert with higher priority 50, ours 50
Aug 23 02:08:22 localhost Keepalived_vrrp[174918]: VRRP_Instance(VR_8) Entering BACKUP STATE
Aug 23 02:08:22 localhost kernel: IN=eno1 OUT= MAC=ff:ff:ff:ff:ff:ff:40:b4:f0:79:62:01:08:00 SRC=10.12.23.253 DST=255.255.255.255 LEN=381 TOS=0x00 PREC=0x00 TTL=1 ID=9705 PROTO=UDP SPT=67 DPT=68 LEN=361
Aug 23 02:08:23 localhost Keepalived_vrrp[106337]: VRRP_Instance(VR_3) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:08:23 localhost Keepalived_vrrp[99078]: VRRP_Instance(VR_5) Transition to MASTER STATE
Aug 23 02:08:23 localhost kernel: net_ratelimit: 8 callbacks suppressed

Soon after some of the deletes, I begin seeing:

Aug 23 02:08:49 localhost Keepalived_vrrp[287556]: VRRP_Instance(VR_18) Received advert with higher priority 50, ours 50
Aug 23 02:08:49 localhost Keepalived_vrrp[287556]: VRRP_Instance(VR_18) Entering BACKUP STATE
Aug 23 02:08:49 localhost neutron-keepalived-state-change: 2017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 10.102.0.1 on qr-f8ee32b1-19 in namespace qrouter-daeb75b6-d635-4a1c-a9ec-0f587915fec3: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address#012: ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib Traceback (most recent call last):#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 1084, in _arping#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib     ip_wrapper.netns.execute(arping_cmd, extra_ok_codes=[1])#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib     log_fail_as_error=log_fail_as_error, **kwargs)#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib     raise ProcessExecutionError(msg, returncode=returncode)#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address#0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib #0122017-08-23 06:08:49.575 100137 ERROR neutron.agent.linux.ip_lib
Aug 23 02:08:49 localhost Keepalived_vrrp[138774]: VRRP_Instance(VR_10) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:08:49 localhost neutron-keepalived-state-change: 2017-08-23 06:08:49.684 84177 ERROR neutron.agent.linux.utils [-] Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
Aug 23 02:08:49 localhost neutron-keepalived-state-change: 2017-08-23 06:08:49.687 74995 ERROR neutron.agent.linux.utils [-] Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address#012: ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address

^ I do see this in the logs previous to 02:08, but after 02:08 it becomes the dominit error.

If we look at the system utilization during or close to this time. https://snapshot.raintank.io/dashboard/snapshot/sHfJUMVxcPgYUwUsYONozN0RTlfqXuM3?panelId=72&fullscreen&orgId=2&from=1503454415657&to=1503454882447

Close to this point we see in /var/log/messages

Aug 23 02:13:53 localhost Keepalived_vrrp[108139]: VRRP_Instance(VR_2) Transition to MASTER STATE
Aug 23 02:13:53 localhost Keepalived_vrrp[174918]: VRRP_Instance(VR_8) Received advert with higher priority 50, ours 50
Aug 23 02:13:53 localhost Keepalived_vrrp[174918]: VRRP_Instance(VR_8) Entering BACKUP STATE
Aug 23 02:13:54 localhost Keepalived_vrrp[122976]: VRRP_Instance(VR_11) Entering MASTER STATE
Aug 23 02:13:54 localhost Keepalived_vrrp[146351]: VRRP_Instance(VR_10) Transition to MASTER STATE
Aug 23 02:13:54 localhost Keepalived_vrrp[166233]: VRRP_Instance(VR_5) Entering MASTER STATE
Aug 23 02:13:54 localhost Keepalived_vrrp[287556]: VRRP_Instance(VR_18) Transition to MASTER STATE
Aug 23 02:13:55 localhost Keepalived_vrrp[106337]: VRRP_Instance(VR_3) Received advert with lower priority 50, ours 50, forcing new election
Aug 23 02:13:55 localhost Keepalived_vrrp[286771]: VRRP_Instance(VR_21) Entering MASTER STATE
Aug 23 02:13:55 localhost Keepalived_vrrp[210052]: VRRP_Instance(VR_16) Entering MASTER STATE
Aug 23 02:13:55 localhost Keepalived_vrrp[108139]: VRRP_Instance(VR_2) Entering MASTER STATE
Aug 23 02:13:56 localhost Keepalived_vrrp[234984]: VRRP_Instance(VR_15) Transition to MASTER STATE
Aug 23 02:13:56 localhost Keepalived_vrrp[131035]: VRRP_Instance(VR_7) Transition to MASTER STATE
Aug 23 02:13:56 localhost Keepalived_vrrp[146351]: VRRP_Instance(VR_10) Entering MASTER STATE
Aug 23 02:13:56 localhost Keepalived_vrrp[98601]: VRRP_Instance(VR_8) Transition to MASTER STATE
Aug 23 02:13:56 localhost neutron-keepalived-state-change: 2017-08-23 06:13:56.450 75797 ERROR neutron.agent.linux.utils [-] Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address#012: ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
Aug 23 02:13:56 localhost neutron-keepalived-state-change: 2017-08-23 06:13:56.454 68762 ERROR neutron.agent.linux.utils [-] Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address#012: ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
Aug 23 02:13:56 localhost Keepalived_vrrp[302856]: VRRP_Instance(VR_22) Transition to MASTER STATE
Aug 23 02:13:56 localhost Keepalived_vrrp[287556]: VRRP_Instance(VR_18) Entering MASTER STATE
Aug 23 02:13:56 localhost Keepalived_vrrp[107037]: VRRP_Instance(VR_9) Transition to MASTER STATE
Aug 23 02:13:56 localhost crmd[21924]:  notice: High CPU load detected: 69.669998

keepalived thrashing

[root@overcloud-controller-0 heat-admin]# ps aux | grep 166233
root      166233  0.0  0.0 125100  2720 ?        S    01:39   0:25 keepalived -P -f /var/lib/neutron/ha_confs/7a73b714-1b9a-484e-9cdb-773a7bf1a276/keepalived.conf -p /var/lib/neutron/ha_confs/7a73b714-1b9a-484e-9cdb-773a7bf1a276.pid -r /var/lib/neutron/ha_confs/7a73b714-1b9a-484e-9cdb-773a7bf1a276.pid-vrrp
root      468863  0.0  0.0 112664   972 pts/0    S+   14:43   0:00 grep --color=auto 166233
[root@overcloud-controller-0 heat-admin]# ip netns | grep 7a73b714-1b9a-484e-9cdb-773a7bf1a276
qrouter-7a73b714-1b9a-484e-9cdb-773a7bf1a276
[root@overcloud-controller-0 heat-admin]# 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment