@shannonmitchell
Last active January 18, 2017 15:18
#################################
# May Need Further Investigating
#################################
- ca4cbc80-f4e7-4b10-b16f-83266b5761a9(research complete):
-> NeutronNetworks.create_and_list_networks(1): kill neutron-server service on one node
-> This one had a lot of on-and-off downtime around 150 -> 25 seconds into the test. This was well after the restart and before the 1k limit was hit.
- 48feed1c-c902-4bb1-b166-1b5cd62687b5(research complete)
-> NeutronNetworks.list_agents(1): restart keystone service on one node
-> NeutronNetworks.list_agents(2): restart keystone service on one node
-> NeutronNetworks.list_agents(3): restart keystone service on one node
-> These had failures around 50 seconds in on each run.
- 46bcfabc-2299-4822-8d08-cf8d4b212077
-> NovaFlavors.list_flavors(1): restart keystone service on one node
-> NovaFlavors.list_flavors(2): restart keystone service on one node
-> NovaFlavors.list_flavors(3): restart keystone service on one node
-> Errors around 50 seconds in on each run
- 58857254-5e0a-44f0-a3b7-27d65348332e(research complete)
-> NovaServers.boot_and_delete_server(1): kill nova-api-os-compute service on one node
- efc3cf41-1cb0-453b-b2d6-7396344f3bb2
-> NovaServers.boot_and_delete_server(1): kill nova-api-os-compute service on one node
- 59494bf7-f799-4b0a-bd30-05bf6fc9453a
-> NovaServers.boot_and_delete_server(1): kill nova-api-os-compute service on one node
- 65434a05-205f-4f1f-862f-da08dd0233e2
-> NovaServers.boot_and_delete_server(1): kill nova-api-os-compute service on one node
- 7e273eb5-08c1-4395-b8bd-523f9ba67830
-> NovaServers.boot_and_delete_server(1): kill nova-api-os-compute service on one node
- 89cdf8d1-29b6-4779-90f0-6d6765eb5253
-> NovaServers.boot_and_delete_server(1): kill nova-api-os-compute service on one node
-> Error around 130 seconds in on several runs.
- 21b6a490-c6a9-4ed1-a4c3-6859830a9115(research complete)
-> SwiftObjects.create_container_and_object_then_delete_all(1): restart swift-proxy service on one node
-> Errors around 130-140 seconds in.
- 84ff85e5-08d2-4e14-a10e-d8f9187ba04a(research complete)
-> SwiftObjects.create_container_and_object_then_delete_all(1): restart mysql service on one node
-> Errors from about 130 to 290 seconds. (I think they had to manually restart)
- 577d7dd9-0f17-4372-ad11-ee7e0cafe07b(research complete)
-> SwiftObjects.list_objects_in_containers(1): restart keystone service
-> SwiftObjects.list_objects_in_containers(2): restart keystone service
-> SwiftObjects.list_objects_in_containers(3): restart keystone service
-> Error at about 56 or so seconds in
################
# Neutron runs
################
# NeutronNetworks.create_and_list_networks(1): restart neutron-metadata-agent service on one node
# NeutronNetworks.create_and_list_networks(1): kill neutron-metering-agent service on one node
# NeutronNetworks.create_and_list_networks(1): restart neutron-l3-agent service on one node
# NeutronNetworks.create_and_list_networks(1): restart neutron-server service on one node
# NeutronNetworks.create_and_list_networks(1): kill neutron-linuxbridge-agent service on one node
# NeutronNetworks.create_and_list_networks(1): restart neutron-linuxbridge-agent service on one node
# NeutronNetworks.create_and_list_networks(1): restart neutron-dhcp-agent service on one node
# NeutronNetworks.create_and_list_networks(1): kill neutron-metadata-agent service on one node
# NeutronNetworks.create_and_list_networks(1): kill neutron-l3-agent service on one node
# NeutronNetworks.create_and_list_networks(1): kill neutron-dhcp-agent service on one node
# NeutronNetworks.create_and_list_networks(1): restart mysql service on one node
# NeutronNetworks.create_and_list_networks(2): restart mysql service on one node
# NeutronNetworks.create_and_list_networks(3): restart mysql service on one node
- You can see a small red line from the restart, but things recovered quickly and didn't differ much from the baseline.
- It dies after a while due to running past the 1k vxlan limit in neutron. (https://01.org/jira/browse/OSIC-904)
- Since the number of networks grows with every creation, it will always eventually hit a point where it exceeds the baseline threshold.
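The exhaustion described above could be spotted before a run dies. A minimal sketch, assuming the 1k VXLAN allocation cap from OSIC-904 (the function and limit name are illustrative; the count would normally come from `openstack network list -f value -c ID | wc -l`):

```shell
#!/bin/sh
# Warn when the network count approaches the assumed 1000-allocation
# VXLAN cap, so a long create_and_list_networks run can be stopped
# before it starts failing on allocation exhaustion.
VXLAN_LIMIT=1000

check_vxlan_headroom() {
    count=$1
    if [ "$count" -ge "$VXLAN_LIMIT" ]; then
        echo "EXHAUSTED: $count/$VXLAN_LIMIT vxlan allocations used"
        return 1
    fi
    echo "OK: $count/$VXLAN_LIMIT vxlan allocations used"
}
```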
################################
# All runs without major issues
################################
# Authenticate.keystone(1): restart mysql service on one node.
# Authenticate.keystone(2): restart mysql service on one node.
# Authenticate.keystone(3): restart mysql service on one node.
# Authenticate.keystone(4): restart mysql service on one node.
# Authenticate.keystone(5): restart mysql service on one node.
# Authenticate.keystone(1): kill memcached service on one node.
# Authenticate.keystone(2): kill memcached service on one node.
# Authenticate.keystone(3): kill memcached service on one node.
# Authenticate.keystone(1): stressmem keystone service on one node.
# Authenticate.keystone(1): restart memcached service on one node.
# Authenticate.keystone(2): restart memcached service on one node.
# Authenticate.keystone(3): restart memcached service on one node.
# Authenticate.keystone(1): stressdisk keystone service on one node.
# Authenticate.keystone(1): stresscpu keystone service on one node.
# Authenticate.keystone(1): restart rabbitmq service on one node.
# Authenticate.keystone(2): restart rabbitmq service on one node.
# Authenticate.keystone(3): restart rabbitmq service on one node.
# Authenticate.keystone(4): restart rabbitmq service on one node.
# Authenticate.keystone(5): restart rabbitmq service on one node.
# CinderVolumes.list_volumes(1): restart keystone service on one node.
# CinderVolumes.list_volumes(2): restart keystone service on one node.
# CinderVolumes.list_volumes(3): restart keystone service on one node.
# GlanceImages.list_images(1): kill glance-api service on one node.
# GlanceImages.list_images(1): restart rabbitmq service on one node
# GlanceImages.list_images(1): restart keystone service on one node
# GlanceImages.list_images(2): restart keystone service on one node
# GlanceImages.list_images(3): restart keystone service on one node
# GlanceImages.list_images(1): restart mysql service on one node
# GlanceImages.list_images(1): restart glance-api service on one node.
# GlanceImages.list_images(1): kill glance-registry service on one node.
# GlanceImages.list_images(1): restart memcached service on one node.
# GlanceImages.list_images(1): restart glance-registry service on one node.
# GlanceImages.list_images(1): kill memcached service on one node.
# NovaServers.boot_and_delete_server(1): restart memcached service on one node
# NovaServers.boot_and_delete_server(1): kill nova-cert service on one node
# NovaServers.boot_and_delete_server(1): restart nova-scheduler service on one node
# NovaServers.boot_and_delete_server(1): restart nova-api-metadata service on one node
# NovaServers.boot_and_delete_server(1): kill nova-consoleauth service on one node
# NovaServers.boot_and_delete_server(1): restart rabbitmq service on one node
# NovaServers.boot_and_delete_server(1): restart nova-consoleauth service on one node
# NovaServers.boot_and_delete_server(1): restart nova-compute service on one node
# NovaServers.boot_and_delete_server(1): restart nova-cert service on one node
# NovaServers.boot_and_delete_server(1): kill nova-api-metadata service on one node
# NovaServers.boot_and_delete_server(1): kill nova-compute service on one node
# NovaServers.boot_and_delete_server(1): reboot one node with rabbitmq service
# NovaServers.boot_and_delete_server(1): kill memcached service on one node
# NovaServers.boot_and_delete_server(1): restart mysql service on one node
# NovaServers.boot_and_delete_server(1): restart nova-api-os-compute service on one node
# NovaServers.boot_and_delete_server(1): restart nova-spicehtml5proxy service on one node
# NovaServers.boot_and_delete_server(1): kill nova-conductor service on one node
# NovaServers.boot_and_delete_server(1): kill nova-spicehtml5proxy service on one node
# NovaServers.boot_and_delete_server(1): restart nova-conductor service on one node
# NovaServers.boot_and_delete_server(1): kill nova-scheduler service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-object-auditor service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-object-server service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-container-sync service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-account-reaper service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-container-auditor service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-account-reaper service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-object-replicator service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-object-updater service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-container-server service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-object-updater service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-proxy-server service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-account-server service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-object-server service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-container-replicator service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-account-auditor service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-container-server service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-object-replicator service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-object-auditor service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-container-auditor service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-account-replicator service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-container-reconciler service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart memcached service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-account-auditor service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill memcached service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-account-server service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-container-replicator service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-object-expirer service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-container-updater service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-object-expirer service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-container-reconciler service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-account-replicator service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart rabbitmq service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): restart swift-container-updater service on one node
# SwiftObject.create_container_and_object_then_delete_all(1): kill swift-container-sync service on one node
- You can see a small red line from the restart, but things recovered quickly and didn't differ much from the baseline.
- Overall, it looks like we need to run some tests and establish a baseline for the 'Degradation Threshold'. These appear to be
set to a default which doesn't reflect the baseline for this environment.
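One way to derive an environment-specific threshold is from the per-iteration durations of an unperturbed baseline run. A sketch, assuming durations arrive one per line on stdin; the mean + 2*stddev cutoff is an assumption, not a Rally default:

```shell
#!/bin/sh
# Compute a degradation threshold as mean + 2*stddev of baseline
# iteration durations (seconds, one per line on stdin).
baseline_threshold() {
    awk '{ sum += $1; sumsq += $1 * $1; n++ }
         END {
             mean = sum / n
             sd = sqrt(sumsq / n - mean * mean)
             printf "%.3f\n", mean + 2 * sd
         }'
}
```

For example, `printf '1\n3\n' | baseline_threshold` yields 4.000 (mean 2, stddev 1).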
@pwnall1337

ca4cbc80-f4e7-4b10-b16f-83266b5761a9 : NeutronNetworks.create_and_list_networks(1):

Jan 3 17:54:29 localhost haproxy[45498]: 172.22.12.128:37414 [03/Jan/2017:17:54:29.609] neutron_server-front-1~ neutron_server-back/network-2_neutron_server_container-d26d379e 6/0/0/-1/234 502 204 - - SH-- 1881/1/1/0/0 0/0 "POST /v2.0/networks.json HTTP/1.1"

This was in regards to neutron restart on network02

"2017-01-03 17:54:26.934 46897 INFO neutron.common.config [-] Logging enabled!
2017-01-03 17:54:26.935 46897 INFO neutron.common.config [-] /openstack/venvs/neutron-14.0.2/bin/neutron-server version 9.1.0
2017-01-03 17:54:26.935 46897 INFO neutron.common.config [-] Logging enabled!
2017-01-03 17:54:26.935 46897 INFO neutron.common.config [-] /openstack/venvs/neutron-14.0.2"

@shannonmitchell (Author)

We account for the restart on the graph about 120 seconds in. I asked the folks who created the graphs what the "Downtime" around 180-200 seconds in represents, as it isn't showing up in the results.

@shannonmitchell (Author)

For ca4cbc80-f4e7-4b10-b16f-83266b5761a9, the neutron restart may have been the cause. The red actually spans from the first error to the last; the mean duration line overlaps the red and makes it look like failures only happen at certain times. We may want to find out why a restart happened after the 120-second mark, though.

@pwnall1337

Log event start/finish:

(rally) root@deploy:~/os-faults/tools/output/json# grep timestamp ca4cbc80-f4e7-4b10-b16f-83266b5761a9.json | awk '{ print $2 }' | cut -d, -f1 | while read;do date -d@$REPLY;done | head -1
Tue Jan 3 17:52:01 UTC 2017
(rally) root@deploy:~/os-faults/tools/output/json# grep timestamp ca4cbc80-f4e7-4b10-b16f-83266b5761a9.json | awk '{ print $2 }' | cut -d, -f1 | while read;do date -d@$REPLY;done | tail -1
Tue Jan 3 18:00:01 UTC 2017

Bad Gateway:
(rally) root@deploy:~/os-faults/tools/output/json# date -d@1483466064.038841
Tue Jan 3 17:54:24 UTC 2017

Neutron event triggered start/finish:

(rally) root@deploy:~/os-faults/tools/output/json# date -d@1483466042.004516
Tue Jan 3 17:54:02 UTC 2017
(rally) root@deploy:~/os-faults/tools/output/json# date -d@1483466064.521169
Tue Jan 3 17:54:24 UTC 2017

The bad gateway event seems to correlate with the neutron restart around the 120-second mark.
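The manual correlation above can be expressed as a simple window check on epoch timestamps. A sketch using the epochs already extracted with `date -d@...` (the `in_window` helper is illustrative):

```shell
#!/bin/sh
# Check whether an error timestamp (integer epoch seconds) falls inside
# a fault-injection window, as was done by hand with `date -d@...`.
in_window() {
    ts=$1; start=$2; end=$3
    [ "$ts" -ge "$start" ] && [ "$ts" -le "$end" ]
}

# The haproxy 502 epoch and the neutron restart window from the notes,
# truncated to whole seconds:
if in_window 1483466064 1483466042 1483466064; then
    echo "502 falls inside the neutron restart window"
fi
```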

@shannonmitchell (Author)

Created https://01.org/jira/browse/OSIC-933 with the results for 84ff85e5-08d2-4e14-a10e-d8f9187ba04a. Looking into 577d7dd9-0f17-4372-ad11-ee7e0cafe07b next.

@shannonmitchell (Author)

Starting on 21b6a490-c6a9-4ed1-a4c3-6859830a9115

@shannonmitchell (Author)

Created https://01.org/jira/browse/OSIC-936 for 21b6a490-c6a9-4ed1-a4c3-6859830a9115
