smalleni/Full-Lab-Deploy.md

## Full-Lab-Deploy.md

      
    Raw
  

              Full-Lab-Deploy.md
            
          
    Full Scale Lab Deployment

Test Plan


 Build Undercloud


 Install Monitoring


 Import Nodes


 Introspect


 1 Controller + 1 Compute Deploy with minimal nic-config


 3 Controllers + 1 Compute deploy with minimal nic-config


 3 Controllers + Max.Computes at once in multiples of 32 (72 nodes total)


 3 Controllers + Max.Computes at once with composable roles in multiples of 32 (72 nodes total)


 Repeat above when all nodes are available in Week 2


Daily Updates

03/06/17 Monday


Task1- Make an R620 with 128G RAM undercloud instead of the first R620 QUADs assumes as the undercloud


For this I set nullos=false on c02-h10(to make it undercloud) and true on b10-h27(to remove it as undercloud) and Rebooted both hosts into PXE.


However since the R930s didn't have their boot order set correctly through playbooks, Will set nullos to False on them and is working on them, so he also disabled the cronjobs that check for updates to host parameters and make changes. So, I cannot reassign my undercloud and change nic pxe ordering provided at this point. Waiting on scale lab devops to give the ack to go ahead.


Even after changing host parameters to change undercloud, the instackenv.json generation (cron job) did not take effect imediately. Waiting to hear about this.
[Update] This was fixed. Will use the new instackenv.json after the R930s are also ready to go.


Task 2- Set variables and ready undercloud playbook


Setting some variablees needed for the playbook and also want to run the playbook only upto the point of install of undercloud, want to install monitoring before trying introspection


Task 3- Install undercloud


Initial attempt failed due to the undercloud install bug linked below.


Rebuilt undercloud upgraded RHOP 10 to latest (2017-03-03.1)


Install proceeded smoothly-  Time for install 24:31.187, completed at 11:47am


   =============================================================================== 
   Install undercloud --------------------------------------------------- 1471.19s
   Wait for Machine Ready ------------------------------------------------ 237.55s
   Update Packages ------------------------------------------------------- 164.22s
   Install ipa and overcloud images -------------------------------------- 133.48s
   Install tripleo -------------------------------------------------------- 77.18s
   Upload images ---------------------------------------------------------- 64.12s
   Install terminal multiplexers ------------------------------------------ 18.76s
   Turn on Private external vlan interface --------------------------------- 5.76s
   Install rhos-release ---------------------------------------------------- 5.18s
   Setup DNS on Undercloud Neutron subnet ---------------------------------- 4.78s
   setup ------------------------------------------------------------------- 4.41s
   Untar ipa and overcloud images ------------------------------------------ 3.26s
   Get neutron subnet uuid ------------------------------------------------- 2.61s
   Setup OSP version to install -------------------------------------------- 2.29s
   Add stack user ---------------------------------------------------------- 1.77s
   Reboot machine ---------------------------------------------------------- 1.40s
   Setup tripleo directories ----------------------------------------------- 1.31s
   Get rhos-release -------------------------------------------------------- 1.19s
   Copy undercloud.conf ---------------------------------------------------- 1.02s
   Deploy Private external vlan interface ---------------------------------- 0.82s


Task 4- Install Browbeat and collectd/graphite/grafama based monitoring


Task 5- Configure Ironic for cleaning- Based on previous experience while deploying HCI with OpenStack I considered it best to configure Ironic for automated cleaning to workaround some issues seen with hosts with multiple disks having preinstalled OSes(supermicro hosts in the lab). Update ironic.conf with the following and restart openstack-ironic-conductor


   cleaning_network_uuid = <UUID of ctlplane>
   erase_devices_priority = 0
   erase_devices_metadata_priority = 10
   automated_clean = True


Task 6- Baseline idle undercloud

http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1488819641000&to=1488825130000
 [root@c02-h10-r620 stack]# pstree
systemd─┬─/usr/bin/python───ceilometer-agen───69*[{ceilometer-agen}]
        ├─/usr/bin/python───ceilometer-coll───132*[{ceilometer-coll}]
        ├─/usr/bin/python───ceilometer-poll───6*[{ceilometer-poll}]
        ├─/usr/bin/python───aodh-listener -───2*[{aodh-listener -}]
        ├─/usr/bin/python───aodh-notifier -───2*[{aodh-notifier -}]
        ├─/usr/bin/python───aodh-evaluator ───12*[{aodh-evaluator }]
        ├─agetty
        ├─atd
        ├─auditd───{auditd}
        ├─beam.smp─┬─inet_gethost───inet_gethost
        │          └─412*[{beam.smp}]
        ├─chronyd
        ├─collectd───11*[{collectd}]
        ├─crond
        ├─dbus-daemon───{dbus-daemon}
        ├─2*[dnsmasq]
        ├─dockerd-current─┬─docker-containe───9*[{docker-containe}]
        │                 └─30*[{dockerd-current}]
        ├─epmd
        ├─glance-api───6*[glance-api]
        ├─glance-registry───6*[glance-registry]
        ├─heat-api───6*[heat-api]
        ├─heat-api-cfn───6*[heat-api-cfn]
        ├─heat-engine───6*[heat-engine]
        ├─httpd─┬─8*[httpd]
        │       ├─2*[httpd───27*[{httpd}]]
        │       ├─httpd───28*[{httpd}]
        │       └─2*[httpd───26*[{httpd}]]
        ├─ironic-api───6*[ironic-api]
        ├─ironic-conducto
        ├─ironic-inspecto
        ├─irqbalance
        ├─lvmetad
        ├─master─┬─pickup
        │        └─qmgr
        ├─memcached───26*[{memcached}]
        ├─2*[mistral-server]
        ├─mistral-server───6*[mistral-server]
        ├─mongod───17*[{mongod}]
        ├─monitor───ovsdb-server
        ├─monitor───ovs-vswitchd───13*[{ovs-vswitchd}]
        ├─mysqld_safe───mysqld───104*[{mysqld}]
        ├─neutron-dhcp-ag───sudo───neutron-rootwra───{neutron-rootwra}
        ├─neutron-openvsw─┬─sudo───neutron-rootwra───{neutron-rootwra}
        │                 └─sudo───neutron-rootwra───ovsdb-client
        ├─neutron-server───14*[neutron-server]
        ├─nova-api───6*[nova-api]
        ├─nova-cert
        ├─nova-compute
        ├─nova-conductor───6*[nova-conductor]
        ├─nova-scheduler
        ├─polkitd───5*[{polkitd}]
        ├─puppet───{puppet}
        ├─registry───27*[{registry}]
        ├─rhel-push-plugi───4*[{rhel-push-plugi}]
        ├─rhnsd
        ├─rhsmcertd
        ├─rpcbind
        ├─rsyslogd───2*[{rsyslogd}]
        ├─sshd───sshd───bash───su───bash───sudo───su───bash───pstree
        ├─swift-account-r
        ├─swift-account-s───swift-account-s
        ├─swift-container───swift-container
        ├─swift-container
        ├─swift-object-se───swift-object-se───20*[{swift-object-se}]
        ├─swift-object-up
        ├─swift-proxy-ser───6*[swift-proxy-ser]
        ├─systemd-journal
        ├─systemd-logind
        ├─systemd-udevd
        ├─tmux───bash───sudo───su───bash───su───bash
        ├─tuned───4*[{tuned}]
        ├─xinetd
        ├─zaqar-server───4*[{zaqar-server}]
        └─zaqar-server───6*[{zaqar-server}]

Heat Resource Consumption
Memory
heat-api-rss: 857MB
heat-api-cfn-rss: 429Mb
heat-engine-rss: 568Mb
CPU
heat-engine-user- 2%

Ironic Resource Consumption
Memory
ironic-api: 439Mb
ironic-conductor: 80Mb
ironic-inspector: 58Mb
dnsmasq-ironic: 884Kb
dnsmasq-ironicinspector: 384Kb
CPU
ironicapi-user: 2%
ironic-inspector-user: 2%
ironic-conductor-user: 2%

Mistral Resource Consumption
Memory
mistral-server-api: 574Mb
mistral-server-engine: 108Mb
mistral-server-eexecutor: 107Mb
CPU
mistral-server-user: 3%
mistral-server-engine-user: 1%


Task 7 - Import nodes- openstack baremetal import was used to improt the nodes into ironic

Initially I was seeing nodes stuck in enroll and the baremetal import command hanging. Turns out due to scale lab issue mentioned below, the IPMI credentials weren't set correctly on the nodes. I would see in ironic-conductor logs
21:01:49.727 12380 ERROR ironic.drivers.modules.ipmitool
[req-f695c8a3-0e83-49c7-a913-4b758b12aad5 - - - - -] IPMI Error while
attempting "ipmitool -I lanplus -H mgmt-c07-h13-6048r.rdu.openstack.engin
eering.redhat.com -L ADMINISTRATOR -U quads -R 3 -N 5 -f /tmp/tmp1p7U2X power
status"for node 9c7ac724-bdf6-4994-baff-43dd864ac177. Error: Unexpected error
while running command. 2017-03-06 21:01:49.730 12380 ERROR
ironic.conductor.manager [req-f695c8a3-0e83-49c7-a913-4b758b12aad5 - - - - -]
Failed to get power state for node 9c7ac724-bdf6-4994-baff-43dd864ac177. Error:
IPMI call failed: power status. 

After this was fixed the nodes were enrolled. However. before moving from manageable to available due to the ironic automated cleaning we set up, ironic clears out the disks on each of these nodes so as avoid the multiple bootable OS problem we briefly mentioned earlier. It was seen that one 6018r(c04-h33-6018r.rdu.openstack.engineering.redhat.com) node seemed to be stuck in clean wait state. I tried to get it out of that state by using the following commands:
ironic node-set-maintenance UUID on
ironic --ironic-api-version 1.16 node-set-provision-state UUID abort
ironic node-set-maintenance UUID off
ironic node-set-provision-state UUID manage
ironic node-set-provision-state UUID provide

However, since cleaning is activated before setting node to available, the node went into clean wait once again. Cleaning requires the node to PXE off the control plane network, so at this point I suspect it to be a boot order issue and have raised a scale lab ticket.
Baremetal import initiated on Mon Mar  6 21:46:48 UTC 2017
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1488836734798&to=1488838534000
Network utilzation during cleaning- http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1488836734798&to=1488838534000&panelId=17&fullscreen

Task 8- Bulk Introspection Bulk Introspection failed with a bunch of timeouts and finished with errors

[stack@c02-h10-r620 ~]$ date; time openstack baremetal introspection bulk start; date                                                                 
                                                             
Mon Mar  6 23:01:10 UTC 2017
Setting nodes for introspection to manageable...
Starting introspection of manageable nodes
Started Mistral Workflow. Execution ID: 68fcc63f-fec7-4207-9dec-693acec985f1
Waiting for introspection to finish...
Introspection for UUID a5e1b09e-118b-4796-84a3-e83eb5d27f9f finished successfully.
Introspection for UUID 4e596284-5e19-4d00-bc57-173cc451f3a0 finished successfully.
Introspection for UUID e09b15b0-b0d3-4953-839d-5f064d2b5cdd finished successfully.
Introspection for UUID 08fed896-4a2a-4188-b37f-b877e1c1ce6c finished successfully.
Introspection for UUID e2f419b3-2195-4c0f-927c-cc047d2f662d finished successfully.
Introspection for UUID 27fbb501-82a4-45e6-a74d-dfbf64a8c2a1 finished successfully.
Introspection for UUID 86ce5b88-8631-499b-a42d-115b4429192f finished with error: Introspection timeout
Introspection for UUID 55e1da56-3d61-4981-b5d6-430d8592623c finished successfully.
Introspection for UUID 9c404557-e361-4a04-a5ed-0633278535a2 finished with error: Introspection timeout
Introspection for UUID a4258e81-6f62-4731-b343-4918806480a1 finished successfully.
Introspection for UUID 30f9320a-9f90-411b-978a-891ece2279df finished successfully.
Introspection for UUID 05f0a159-a4ce-4772-87d9-40972ed5215f finished successfully.
Introspection for UUID 1d39b3c2-2c3c-4f7a-bcab-100b7c83720a finished successfully.
Introspection for UUID 009be9e6-9206-4d48-a90a-57b7de7bac36 finished successfully.
Introspection for UUID d11d32b3-f220-4ac2-bb7d-0bbc761dfa2f finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found 
in {u'em4': {'ip': None, 'mac': u'b8:ca:3a:66:dd:d5'}, u'em1': {'ip': u'192.0.2.9', 'mac': u'b8:ca:3a:66:dd:d0'}, u'em3': {'ip': u'10.12.70.73', 'mac'
: u'b8:ca:3a:66:dd:d4'}, u'em2': {'ip': None, 'mac': u'b8:ca:3a:66:dd:d2'}}
Introspection for UUID d48283e2-7221-4796-a077-e6864284b11d finished successfully.
Introspection for UUID 7e3a56dd-f97b-48ce-b9a4-ffe49c83230f finished with error: Introspection timeout
Introspection for UUID 55c618a3-ebb4-492c-91fe-37f07a096c65 finished with error: Introspection timeout
Introspection for UUID a2542321-f87e-4c52-b0b3-294047c3d7b1 finished successfully.
Introspection for UUID afafb2ee-6b02-4ede-81c3-ed4a8ce2187c finished successfully.
Introspection for UUID 9ff5d576-f83e-42ba-9930-84405ff4ad21 finished successfully.
Introspection for UUID 332938ff-ac22-4187-a87c-62cc9eb40718 finished with error: Introspection timeout
Introspection for UUID 55ed03e9-3360-4cf3-be0c-80079b1c0c54 finished with error: Introspection timeout
Introspection for UUID 190a9c88-7e37-4ea8-bee5-7329ab500bc1 finished successfully.
Introspection for UUID 243ddedf-f62f-4800-b67b-e9d273ce9edd finished with error: Introspection timeout
Introspection for UUID 74b2d397-e733-4bf3-84b7-d9809087097b finished successfully.
Introspection for UUID 79c26cea-d700-4766-a207-7d5ea7e33928 finished with error: Preprocessing hook validate_interfaces: No suitable interfaces found 
in {u'em4': {'ip': None, 'mac': u'b8:ca:3a:66:e3:85'}, u'em1': {'ip': u'192.0.2.5', 'mac': u'b8:ca:3a:66:e3:80'}, u'em3': {'ip': u'10.12.70.69', 'mac'
: u'b8:ca:3a:66:e3:84'}, u'em2': {'ip': None, 'mac': u'b8:ca:3a:66:e3:82'}}
Introspection for UUID 0f940edd-0c6e-473f-92d6-e9607a4bd0cf finished with error: Introspection timeout
Introspection for UUID ce26d53b-ed65-4606-bda2-69e487aa9c62 finished successfully.
Introspection for UUID ca3a8ff1-da09-4630-b0c3-142d93b6404d finished successfully.
Introspection for UUID 67bb1732-3030-4538-872c-00f08d0923aa finished with error: Introspection timeout
Introspection for UUID 2f61e059-6cfe-414f-8966-09b70e69ec41 finished successfully.
Introspection for UUID bfdfdb36-5b1f-4d5c-9100-a3c060f05dbe finished successfully.
Introspection for UUID 4cac5f43-0b0d-48ed-848f-7b5bec951902 finished successfully.
Introspection for UUID 9bef615b-2033-4628-bcb0-530a62f683f0 finished with error: Introspection timeout
Introspection for UUID 084cf3c5-320e-4ebe-a00e-fc09bb3b1a2a finished with error: Introspection timeout
Introspection for UUID 00432fd9-08a3-4eb7-a3ba-25aedce9cd59 finished with error: Introspection timeout
Introspection for UUID 308981ea-0b97-4a6d-bcda-0b988bcda6ea finished successfully.
Introspection for UUID 02ea43e4-0aee-4e61-bb86-8d5b12b68646 finished successfully.
Introspection for UUID 0be18305-5fc8-4ddb-8a83-c901a7c73c80 finished successfully.
Introspection for UUID f913c51a-4ec7-461a-8376-6fa153d1862b finished successfully.
Introspection for UUID 5053c116-dcd8-45c4-826d-ab7c5d6bf015 finished with error: Introspection timeout
Introspection for UUID e9fe0a21-798c-49e9-8c38-f89aa6a15616 finished with error: Introspection timeout
Introspection for UUID cc3e74c7-cee0-49a0-8b42-9772078746f4 finished successfully.
Introspection for UUID 34bbcbc2-1f16-4b26-85c8-677c64f838f0 finished with error: Introspection timeout
Introspection for UUID 29e5697e-5564-4b10-a042-837f5b97167b finished with error: Introspection timeout
Introspection for UUID 44774623-b13f-486c-8a3f-883cd7f71e46 finished with error: Introspection timeout
Introspection for UUID bace316a-be0e-4f5d-89f9-cb4b43bd22df finished successfully.
Introspection for UUID 6cd20bf7-af25-4414-a46d-9d6eb3e168e0 finished successfully.
Introspection for UUID 8ab4ddf7-b286-4b49-b428-a04dee430e27 finished successfully.
Introspection for UUID f5acc2b9-888a-4437-9fb7-5f98994f2e96 finished with error: Introspection timeout
Introspection for UUID fa2a289e-f056-4ec9-9c2c-3d1f970f4e16 finished successfully.
Introspection for UUID 14298633-176e-498a-a2b7-ecf819f546ff finished with error: Introspection timeout
Introspection for UUID 1a980477-20a0-4d08-a11e-c13058c34626 finished with error: Introspection timeout
Introspection for UUID 5d622e7c-9ccf-4b84-b5ad-d60000340e24 finished with error: Introspection timeout
Introspection for UUID 5d62ed5c-9d09-40ce-be7e-eaad33b6b5b1 finished successfully.
Introspection for UUID f2f25c0f-84b3-4d7f-8335-0194d3f7b0ec finished with error: Introspection timeout
Introspection for UUID 266b78bc-89ba-4272-864f-e97c5069d5ff finished successfully.
Introspection for UUID 61927e24-60fc-448d-9154-c988974f8981 finished successfully.
Introspection for UUID 6bf0bb55-b88b-418b-8f6e-70af75187dfa finished with error: Introspection timeout
Introspection for UUID aae2090f-7478-4a37-a00a-957c80ae66af finished successfully.
Introspection for UUID 43e955ca-b469-45a7-9e09-a724f2f480b1 finished successfully.
Introspection for UUID 5abfe445-e1ff-4ced-a2b1-7b74af4337d2 finished successfully.
Introspection for UUID 05194d3d-2ee9-4753-89b1-4c0d9d7119ec finished successfully.
Introspection for UUID 5e4c190e-5b64-4b32-82f3-79603155d1ef finished successfully.
Introspection for UUID bba0e872-ffcb-458b-8bda-7f5a5453b4c1 finished with error: Introspection timeout
Introspection for UUID 86cc82e4-9b51-47bf-8f3b-401a6b843a78 finished successfully.
Introspection for UUID ff1e9110-6479-46e6-a2d0-caeb067e8a40 finished successfully.
Introspection for UUID 09467b4b-51da-4fdb-a3e5-faf5ef526a7b finished successfully.
Introspection for UUID 645aaf33-2e19-424c-bdb5-cbbf77994cd8 finished with error: Introspection timeout
Introspection for UUID 236c11f6-a8e0-4811-ad4c-81d6212202c4 finished with error: Introspection timeout
Introspection completed with errors:
86ce5b88-8631-499b-a42d-115b4429192f: Introspection timeout
9c404557-e361-4a04-a5ed-0633278535a2: Introspection timeout
d11d32b3-f220-4ac2-bb7d-0bbc761dfa2f: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'em4': {'ip': None, 'mac': u'b8:ca:3a:
66:dd:d5'}, u'em1': {'ip': u'192.0.2.9', 'mac': u'b8:ca:3a:66:dd:d0'}, u'em3': {'ip': u'10.12.70.73', 'mac': u'b8:ca:3a:66:dd:d4'}, u'em2': {'ip': Non
e, 'mac': u'b8:ca:3a:66:dd:d2'}}
7e3a56dd-f97b-48ce-b9a4-ffe49c83230f: Introspection timeout
55c618a3-ebb4-492c-91fe-37f07a096c65: Introspection timeout
332938ff-ac22-4187-a87c-62cc9eb40718: Introspection timeout
55ed03e9-3360-4cf3-be0c-80079b1c0c54: Introspection timeout
243ddedf-f62f-4800-b67b-e9d273ce9edd: Introspection timeout
79c26cea-d700-4766-a207-7d5ea7e33928: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'em4': {'ip': None, 'mac': u'b8:ca:3a:
66:e3:85'}, u'em1': {'ip': u'192.0.2.5', 'mac': u'b8:ca:3a:66:e3:80'}, u'em3': {'ip': u'10.12.70.69', 'mac': u'b8:ca:3a:66:e3:84'}, u'em2': {'ip': None, 'mac': u'b8:ca:3a:66:e3:82'}}
0f940edd-0c6e-473f-92d6-e9607a4bd0cf: Introspection timeout
67bb1732-3030-4538-872c-00f08d0923aa: Introspection timeout
9bef615b-2033-4628-bcb0-530a62f683f0: Introspection timeout
084cf3c5-320e-4ebe-a00e-fc09bb3b1a2a: Introspection timeout
00432fd9-08a3-4eb7-a3ba-25aedce9cd59: Introspection timeout
5053c116-dcd8-45c4-826d-ab7c5d6bf015: Introspection timeout
e9fe0a21-798c-49e9-8c38-f89aa6a15616: Introspection timeout
34bbcbc2-1f16-4b26-85c8-677c64f838f0: Introspection timeout
29e5697e-5564-4b10-a042-837f5b97167b: Introspection timeout
44774623-b13f-486c-8a3f-883cd7f71e46: Introspection timeout
f5acc2b9-888a-4437-9fb7-5f98994f2e96: Introspection timeout
14298633-176e-498a-a2b7-ecf819f546ff: Introspection timeout
1a980477-20a0-4d08-a11e-c13058c34626: Introspection timeout
5d622e7c-9ccf-4b84-b5ad-d60000340e24: Introspection timeout
f2f25c0f-84b3-4d7f-8335-0194d3f7b0ec: Introspection timeout
6bf0bb55-b88b-418b-8f6e-70af75187dfa: Introspection timeout
bba0e872-ffcb-458b-8bda-7f5a5453b4c1: Introspection timeout
645aaf33-2e19-424c-bdb5-cbbf77994cd8: Introspection timeout
236c11f6-a8e0-4811-ad4c-81d6212202c4: Introspection timeout

real    61m29.955s
user    0m2.261s
sys     0m0.296s
Tue Mar  7 00:02:40 UTC 2017

27 nodes failed bulk introspection. https://gist.github.com/jtaleric/fcca3811cd4d8f37336f9532e5b9c9ff was used to introspect the remaining nodes in batches.
Resource utilization during the script run
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1488848427000&to=1488850287000
Using https://github.com/smalleni/openstack-testing/blob/master/introspect.sh I verified that we have only one node(c06-h21-6048r.rdu.openstack.engineering.redhat.com) that didn't introspect. Also c04-h33-6018r.rdu.openstack.engineering.redhat.com was exlcuded due to the clean state problem.
Summary of Day1: At the end of Day 1 we have 70 nodes introspected out of a possible 72. Remaining two nodes have been excluded due to boot order issues.
03/07/17 Tuesday


Task 1- Set root device on the Supermicros. Root device was set on the 6048rs and 6018rs using https://github.com/smalleni/openstack-testing/blob/master/set_root_device.sh


Task 2- Seet root password for overcloud image


 virt-customize -a overcloud-full.qcow2 --root-password password:blah


Task 3- Deploy 1 Controller+ 1 Compute minimal deploy

Initially I was seeing Nova valid Host erros on trying a deploy and on looking at nova-scheduler logs I could see that all hostts were failing the RAMFilter. Turns out that since I introspected nodes one node at a time using the script referenced earlier, they were not set to available post introspection. I got past this error by doing
for i in `ironic node-list | grep None | awk {'print$2'}`; do ironic node-set-provision-state $i provide; done

[stack@c02-h10-r620 ~]$ date; time openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e ~/templates/network-environment.yaml --control-scal
e 1 --compute-scale 1 --ntp-server 10.5.26.10 --neutron-network-type vxlan --neutron-tunnel-types vxlan -t 60                                                                                                      
Tue Mar  7 16:36:25 UTC 2017
Removing the current plan files
Uploading new plan files
...................................................................................................................
Stack overcloud CREATE_COMPLETE 

Started Mistral Workflow. Execution ID: 89139693-3b82-4b89-a6b0-2a9394e2aa74
/home/stack/.ssh/known_hosts updated.
Original contents retained as /home/stack/.ssh/known_hosts.old
Overcloud Endpoint: http://172.21.0.10:5000/v2.0
Overcloud Deployed

real    44m38.359s
user    0m7.773s
sys     0m0.908s

http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1488904378857&to=1488907978857

Task4- Since the mock deploy went well, I was adventurous and decided to try a full deploy staraight away instead of 32 nodes at once and see how it goes. Everything seemed to be going until one of the 6048rs failed to deploy because of https://bugs.launchpad.net/ironic/+bug/1670916. I decided to exclude the supermicros and go ahead with depoloyment of remaining 53 servers. This deployment with 53 servers seemed to be going very well except that it was taking a lot of time for the nodes to be scheduled (no node pinning intentionally). We see that the deploy timeout because of heat waiting for a response from an rpc call https://gist.github.com/smalleni/a90133782d4f903ef995339293a45b8f. Looking at resource utilization it is very evident that keystone was pegged (1 process so it was taking 1 CPU).

http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?panelId=62&fullscreen&from=1488914760000&to=1488919656000&var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All
I tuned keystone-admin and keystone-main processes to have 24 processes with 1 thread each and also bumped heat rpc_response_timeout to 1200s from 600s. The nova-api-wsgi was allso bumped to 24 proccesses and 1 thread each i nthe http conf.

Task 5- Try deploy with avaiable nodes

(minus supermicros, 1 r620 that went into clean failed and one r930 that had a foreign disk)
With keystone tuned (no node pinning yet though) deploy srtaed at  Tue Mar 7 21:42:54 UTC 2017,went smoothly and finished in 115m1.931s
Summary of Day 2- OC up with 3 Controllers + 48 Computes without scheduler hints
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?from=1488922856000&to=1488930356000&var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All
03/08/17 Wednesday


Task 1- Enable debug for IPA logs to look at root disk issue with supermicros

Talking to Lucas about the root disk issue, he sugested enabling debug for IPA. Although with OSP 11 this can be set with debug=True in ironic.conf(https://review.openstack.org/#/c/410168/), the route is more indirect for OSP10. In ironic.conf/[pxe], pxe_append_params was set to nofb nomodeset vga=normal ipa-debug=1 (ipa-debug=1 was added)

Task 2- Reintrospect supermicros

Looks like a good idea to reinstrospect supermicros since I've set the root disk.
All supermicros were put into manageable mode and using the following command they were instrospected
openstack overcloud node introspect --all-manageable --provide
I saw that nodes were stuck in clean wait and never went further. Looking at consoles, it seemed lik ironic-python-agent was meerely sleeping.  I figured out that cleaning will not work if node is set to maintenance mode and these nodes had maintenance set to true. Taking the nodes off of maintenance put them in a clean failed state. I got them back to available by setting the clean failed nodes to manageable and then to provide.

Task 3- Try scale up with supermicros

The idea was to go to 66 computes from 48. However it failed because the root device WWN couldn;t be found on one of the nodes (similar to what was seen yesterday). I decided to retry since it was just one node that failed, and hoped a retry would go through. However, the very node that failed previous time went through this time but some other node failed again. This seems really surprising that sometimes the disk with the WWN is found and sometimes it isnt. This has been captured here:
https://gist.github.com/smalleni/81884e80f499fd5417bb2a45c71ff96b
I decided to not use all the supermicros in the scale lab since one node (not the same everytime) seems to be failing in each attempt, so I decided to use only 10 of the 16 available to me to scale up. However, this didn't go well either and the deploy was stuck. I could see the one nova instance went into ERROR state yet couldn't correlate it to an instance UUID because none of the ironic nodes seemed to have the instance UUID apparently. This can be seen here:
https://gist.github.com/smalleni/f26c9760cd56e2fed5f59e670ace80ea
I finally deleted stack and decided to not mess with the supermicros for the day

Task 4- Deploy avaialable R620s and R930s with scheduler hints

For this I had to delete previous overcloud stack and I saw a weird issue were it said delete failed but on doing a stack list that stack could be seen being deleted still. This has been captured here:
https://gist.github.com/smalleni/f26c9760cd56e2fed5f59e670ace80ea
I used https://github.com/smalleni/openstack-testing/blob/master/node_type_to_pinning.sh to for node pinning and pinned all R620s to compute type and R930s to controller type. The first deployment attempt failed because of one R620 not being able to pull the image. Not sure why this happens but the error is captured here https://gist.github.com/smalleni/d1225324d8d216303c8adf81f6db974f. I went for redeploy by deleting the stack and have an overcloud up with 3 controllers and 48 compute nodes. It took  86m5.503s for the deploy with scheduler hints. Thats almost 30m less than without scheduler hints.
03/09/17 Thursday

Thursday was mostly spent trying to debug the WWN Not Found issue on the supermicros. Lucas from the Ironic team and I worked on this. We pulled in https://review.openstack.org/#/c/443649/ and rebuilt the IPA image with it. The new image was uploaded to glance and the ramdisk UUID was updated in ironic. The ramdisk was named bm-deploy-ramdisk-mod so as to not confuse it with the original (https://gist.github.com/smalleni/a701dc4fb9b3a383c108fa9427d81afc). Going for deploy, TripleO complained that it was expecting a certain ramdisk but found a different one in ironic https://gist.github.com/smalleni/83560621a8467a0802d767da27250acc Hence I did openstack baremetal configure boot hoping it would set things right. It turns out that the name of the deploy ramdisk is hardcoded in tripleo-pythonclient and hence baremetal configure boot was unable to pick up our modified image and assign it. I BZed this hard reference and renamed our modified image to bm-deploy-ramdisk and configured boot image.
Even after trying the modified IPA image, supermicros weren't deploying. One node or the other was reporting that the specified WWN couldn't be found which led to scheduling errors.
Later after Lucas signed off for the day I decided to deploy supermicros without specifying root device and for this I needed the nodes not ave OS on more than one disk. SoI foreman reprovisioned al lthe supermicros from hardwarestore, with the partition table set to not do software RAID. Then i tried to deploy on the dells and supermicros but saw this which I've never seen before. The ironic folks also didn' seem postive that they saw this(https://gist.github.com/smalleni/86e6f39e4ca1ad7f07e03c8fcccda76e).
03/10/17 Friday

Putting the supermicros behind me, I decided to use teh 51 nodes I had working to do some testing with tunables. P{reviously we tuned Nova-API processes to 24 from 1 and did deployments. I decided to tune the nova processes to 4 and see how it affects the deployment timing. With node pinning it took a 103 minutes, clearly 17 minutes more.
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489158943000&to=1489165243000
It took 15 minutes to delete this stack https://gist.github.com/smalleni/0f5a51fd9329e226600fbab0240ed3ef
Next, tuning max_concurrent_builds in unedrcloud nova.conf to 20 (from default 10), and keepin the nova API processes at 24 and using pinning, we see that it took 90 minutes vs86 minutes with everything else same but mx_concurrent_builds set to 10. Buping the max_concurrent_builds did not really make the deployment faster.
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489171243000&to=1489176763000
03/11/17 Saturday

With whatever nodes were available, I decided to do composable roles, so 3 pcmk, 3 service api and 45 computes. There were multiple issues in the R930 templates that lead to failed deploys consistently when trying with composable roles. Even after fixing them with https://github.com/redhat-performance/openstack-templates/commit/6980b0a6e545ca6f12c1ade32c3dd5c11df1b256 I was only able to get a 3 pcmk, 3 service api, 1 compute environment up (Note: Since the service API node hosts keystone, neutron-l3-agent and others it needs to have access to external network or you will see nasty messages like
[stack@c02-h10-r620 ~]$ nova list
No handlers could be found for logger "keystoneauth.identity.generic.base"
ERROR (ConnectFailure): Unable to establish connection to http://172.21.0.10:5000/v2.0/tokens: HTTPConnectionPool(host='172.21.0.10', port=5000): Max 
retries exceeded with url: /v2.0/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x225ac90>: Fail
ed to establish a new connection: [Errno 113] No route to host',))
Service API didn't have external

03/12/17 Sunday

Verified that the WWN issue is not seen in the 6018Rs and also tried to get a 3 controller + 10 compute(6018r nodes only for compute) deployment up, since I was seeing failures due to WWN issues when I was trying to schedule all available supermicros only. I decided to use only a subset so even if some fail, the others can go through. From a scheduling and finding WWN of of the disk standpoint the deployment went fine when using only a subset of nodes, however the deployment was stuck at https://gist.github.com/smalleni/97dd6caff2beecbe9e495037dcf4f391
03/13/17 Monday

I started the day waiting for the rest of the nodes (they were handed off to me at about 2:10pm excluding a couple of hosts that had firmware issue I believe).
I was trying to do some deployments on the 6048Rs and see if I hit the WWN issue, while I wait, but I kept seeing that some nodes deploy fine but others don't and the stack gets stuck at overcloud.ControllerIpListMap. Some computes were pingable and others weren't/ Going through the management interface with the root password I set earlier, found out that the templates were referencing ens3f0 and ens3f1 when some of the supermicros instead had ens5f0 and ens5f1(http://imgur.com/a/PYm0z) The templates should have instead referred to nics as nic 1 and nic 2 due to the insonsistent naming.
Baremetal import got stuck due to some nodes failing cleaning. I put the nodes that failed cleaning, 6048Rs and 6018Rs into maintenance and tried bulk introspection. Some nodes succeeded and some failed as during the previous attempt.
e3ef06b7-52be-4106-8b58-5a16e6d553af: Introspection timeout
8df37ef9-3efe-491c-bfdc-8859865eb066: Introspection timeout
a7efbc16-3865-437f-8fb5-c1d0d3b9718d: Introspection timeout
9cfb1d52-17ab-4bb8-b7b3-602185596c89: Introspection timeout
bec4de89-a151-4a67-b11f-6d3ff5190bd1: Introspection timeout
3640e04e-7adf-4e2e-ba33-6afb9f2fef82: Introspection timeout
6e30d8e6-43d2-4aac-aeef-f7dbe59d35b3: Introspection timeout
9a5392ec-6df7-46fa-abad-0f72b74c6268: Introspection timeout
b72e14fe-1092-4bf8-8e4c-975fb1d09bf2: Introspection timeout
611c4b67-12d9-410a-bc27-97f81af24011: Preprocessing hook validate_interfaces: No suitable interfaces found in {u'em4': {'ip': None, 'mac': u'b8:ca:3a:66:d7:1d'}, u'em1': {'ip': u'192.0.2.33', 'mac': u'b8:ca:3a:6
6:d7:18'}, u'em3': {'ip': u'10.12.70.90', 'mac': u'b8:ca:3a:66:d7:1c'}, u'em2': {'ip': None, 'mac': u'b8:ca:3a:66:d7:1a'}}
34c5974d-b399-4901-894a-79d7ff4ffcd0: Introspection timeout
85465cbe-d335-4e25-b569-f9f2674a03ea: Introspection timeout
2a7decda-fea4-4c12-8d2c-d85a7896e78c: Introspection timeout

real    62m1.259s
user    0m3.072s
sys     0m0.300s
Mon Mar 13 22:09:39 UTC 2017

Going to use the batch introspection script to introspect the remaining nodes. The script worked and the rest of the nodes were introspected (barring supermicros)
Later I tried to deploy 128 nodes at once (3 controller+125 computes), the stack create failed after timing out and interestingly keystone which was routinely seen at 1500% utilization on previous smaller builds wasn't even seen to have 300% utilization in this case. I wonder if the keystone processes were starved because of seeting nova-api processes to 24. On the stderr of the deploy command command I saw for a few nodes 2017-03-14 07:14:59Z [overcloud.Compute.82.NovaCompute]: CREATE_FAILED  BadRequest: resources.NovaCompute: Networking client is experiencing an unauth
orized exception. (HTTP 400) (Request-ID: req-5f0b608d-ff9e-4a84-87b6-3eb73745c5ae) and the nova logs showed this https://gist.github.com/smalleni/ebcd22ae344c397a8a50f79649cae6b4. By the time the stack create failed 111 nodes were atleast pingable over the ctlplane address.The undercloud CPU was being slammed pretty hard the entire duration.
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?from=1489460415000&to=1489476615000&var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All
03/14/17 Tuesday

Tuning max_concurrent_build back to 10 (had it set to 20 earlier), keystone processes to 16, and nova-api processes to 8 to try and deploy 100 computes at once. Going to use scheduler hints this time (did not yesterday).
However, the stack update got stuck when creating a particular node and it took way too much effort and time to debug. Learnings from this are as below
TL;DR
It is highly recommended to use ironic node cleaning[1] when reusing nodes from one overcloud deployment to another to erase the metadata on this disk that could be from a previous workload/deployment. Setting a root password on your overcloud image before deploying should also be a priority, just to help with debug in case you run into issues. However, sometimes clean could fail and the node is put into maintenance and clean failed state, however the power state of the node is not tracked correctly before putting in maintenance. The node shows up as "power off" and anyhow the power state is not tracked after the node has been put in maintenance[2]. What this could mean is, there might be a node that is left powered on with a stale network configuration from a previous deploy hanging around there, so when you do a deployment later excluding this node, you might end up reusing an ip for the control plane that this node has leading to a failed deploy as the undercloud tries to configure this node instead of the other but this node has the wrong metadata url as can be seen from os-collect-config logs. (that is what I believe was happening)
Debugging:
So, we logged into the node that the deploy command output said it was stuck on creating (compute-62), through the management interface. I could see that /etc/os-net-config/config.json was unpopulated, and there were no errors in os-collect-config. I waited for a bit and net-config was populated so I assumed things were not going bad but were just slow. However, I found it interesting that ping from this compute node to the undercloud showed around 50% loss whereas the ping from the undercloud to the control plane IP of compute-62 showed no loss at all.
Then we sshed into "compute-62" using the control plane IP from the undercloud. The hostname of this host ended up being "compute-99" on login and I also realized the MAC address on compute-62 wasn't the same as the one populated in the ARP cache on the undercloud for the IP of compute-62 (192.0.2.167). And the compute-99 which we logged into was spitting out errors in os-collect-config about not being able to pull the metadata because of wrong metadata url[3]. The MAC on the undercloud for 192.0.2.167 corresponded to an interface on compute-99 and not compute-62,  so, this node apparently had the metadata url from a previous deploy and was trying to pull it but obviously it was wrong since this is a new overcloud stack.
The deployment of the 120 odd nodes as a whole failed, because of the 9 or so nodes that failed ironic cleaning and got stuck around powered on. Apart from one node, several other nodes complained of not being able to get an IP on networks such as Tenant, Storage etc[4]. because of a duplicate on the networks left behind from "clean failed yet powered on" nodes.
I got past this issue by powering off every other node I was not trying to deploy on.
It is also recommended to use WWN to set the root device before deploying as a best practice even when deploying controllers/computes and not only ceph nodes. Setting the root device helps avoid non-deterministic boot of the nodes when there are are multiple bootloaders.
[1]- https://docs.openstack.org/developer/ironic/deploy/cleaning.html
[2]- https://bugs.launchpad.net/ironic/+bug/1672877
[3]- http://imgur.com/a/N6RjY
[4]- http://imgur.com/a/PiHhi
Later that evening, powering off all the nodes that I wasn't using for deployment I was able to get a deployment of 50 compute nodes with 3 controlelrs and scaled it up to 100 compute nodes by midnight.
50 compute node deplyoment( 8 nova api processes, 16 keystone processes, node pinning and max_concurrent_builds=10)- 79 minutes 33 seconds

http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489530344000&to=1489536044000
Scale up to 80 computes- 76 minutes 39 seconds

http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489538144000&to=1489543244000
Scale up to 100 compute nodes- 91 minutes 42 seconds

http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489544744000&to=1489550444000
03/15/17 Wednesday

Since I had a 103 node cloud up, I wanted to use the 9 R730s, 2 R630s and 1 R920 that had introspected and were available to deploy, to scale up the cloud 115 before fighting with the supermicros.
On kicking off the scale up, I saw that that deployment got stuck at a point and quickly pulled a console to see that all the R730s were complaining about "NBP too big for base memory". I was interesting that the two R630s and 1 R930 that was also part of the scale up went into active mode. Clearly this seemed to be hardware related, so after chasing this for a while I raised a scale lab ticket.
Console image- http://imgur.com/a/p7tat
Things were left in a bad state by night as I tried to do a heat stack-cancel-update to cancel the update that was anyway going to fail due to R730s and the overcloud stack ended up in a ROLLBACK_FAILED state and it was not even possible to update it from that point. So I would have to rebuild the stack if I want to scale any further. Talking upstream the only currently to cancel an update that was triggered, is to either wait for a timeout or restart heat engine. Since this seems very rudmentary, I spoke to Steve Hardy upstream and have an RFE to have a way to cancel stack update through heat cli (https://bugs.launchpad.net/heat/+bug/1673794).
03/16/17- Thursday

I deleted the ROLLBACK_FAILED stack and rebuilt the entire stack of 103 nodes. But this time I preferred not to scale up but instead have all 103 deployed at once. This attempt went well (8 nova api processes, 16 keystone, max_concurrent_builds=10) I had a cloud in  153 minutes 17 seconds.
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489671644000&to=1489680944000
When I tried to introspect rest of the nodes including supermicros but excluding R730s, nodes were trying to pxe but did not get an IP through DHCP. I suspected something is going on witht he lab and mapped all the hosts I currently had deployed to racks, to see if the amchines failing introspection were isolated to certain racks. But that didn't see to be the case. I also saw a strange issue where one of my computes had a 192.x.x.x address (from introspeection range) on the em1 when it had os-nect-config set up its ctlplane network on em2. This address on em1 was sshable/pingable intermittently. The interface that are not used by director for networking, are set to dhcp and I assumed somehow em1 was getting a DHCP response when it clearly shouldn't be getting one.
03/17/17- Friday

I started the day by bumping max_concurrent_builds to 20 (nova api wgi processes 8, keystone admin and main processes 16) to deploy the the 103 node cloud. It took 139 minutes which is 14 minutes lesser than when max_concurrent_builds was set to 10.
I decided to scale beyond the 103 nodes we had and so tried to introspect the rest of the nodes (some failed introspection earlier, some were never introspected), however I saw that that machines never pxed. After a bit of research the guess I made yesterday turned out to be right. There were several unused NICs on the overcloud nodes that were set to dhcp (these nics weren't included in the nic-configs). These nics were stealing all the IPs that inspector was throwing at machines waiting for PXE. It should be documented that unused NICs must also be included in nic-configs with use_dhcp set to false, to that I have raised an upstream bug. I also raised a downstream bug on ironic inspector for handing out IP addresses to whoever asks, but it was closed as WONTFIX because it seems ironic-inspector operated under the assumption that it doesn't know any IPs.
max_concurrent_builds=20
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489763504000&to=1489774260000
Debugging:
See how I was able to loging to an already deeployed overcloud node with an IP from the inspection range https://gist.github.com/smalleni/d50347250e6c06f35eafdb3f89abd96f
03/15/18- Saturday

Tried to deploy 103 nodes that were available without node pinning (did not pass scheduler hints to deploy command). I raised the timeout of the deploy command to 360 minutes from the default 240 minutes. However, after 240 minutes the keystone token for the overcloud create expired and caused the entire stack create to fail. Specifically the overcloud.compute stack was what was taking that long
2017-03-18 19:54:22Z [overcloud.Compute]: CREATE_FAILED  Resource CREATE failed: Unauthorized: resources[81].resources.NovaCompute: The request you have made requires authentication. (HTTP 401) (Request-ID: req-f3373924-3da4-4349-8b46-b2430ad3dc3f)
2017-03-18 19:54:22Z [overcloud.Compute]: CREATE_FAILED  Unauthorized: resources.Compute.resources[81].resources.NovaCompute: The request you have made requires authentication. (HTTP 401) (Request-ID: req-      f3373924-3da4-4349-8b46-b2430ad3dc3f)
2017-03-18 19:54:23Z [overcloud]: CREATE_FAILED  Resource CREATE failed: Unauthorized: resources.Compute.resources[81].resources.NovaCompute: The request you have made requires authentication. (HTTP 401)        (Request-ID: req-f3373924-3da4-4349-8b46-b2430ad3dc3f)

 Stack overcloud CREATE_FAILED

For scale deployments, the keystone token expiration time which is set to a default of 14400 seconds (240 minutes) needs to be bumped. It could also be seen during this deploy that nova-scheduler was pegged, at 96% because of all the load the scheduling a large number of nodes puts.
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489866626000&to=1489881566000
03/15/18- Sunday

I decided to make use of the nodes I had an try a few tunable like max_concurrenct_builds and nova api wsgi processes. Bumping max_concurrent_builds to 40, I see the 10G provisioning interface being a bottleneck here.
http://norton.perf.lab.eng.rdu.redhat.com:3000/render/dashboard-solo/db/openstack-general-system-performance?var-Cloud=full-scale-deploy&var-NodeType=*&var-Node=undercloud&var-Interface=interface-br-ctlplane&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1489898748000&to=1489901208000&panelId=17&width=1000&height=500
OpenStack Issues


Undercloud install fails (fun right): https://bugzilla.redhat.com/show_bug.cgi?id=1428608

https://gist.github.com/smalleni/c02825ea2819cdab45c55ff0a15a3493
Moving from GA to latest(2017-03-03.1) due to this


https://bugs.launchpad.net/ironic/+bug/1670549 - Ironic Bulk Introspection
https://bugzilla.redhat.com/show_bug.cgi?id=1388286
https://bugzilla.redhat.com/show_bug.cgi?id=1384187
https://gist.github.com/smalleni/b8126cf40671d5254147bf2b253e6d9d - The 3 Controller + 67 compute node deployment failed because the root device set via WWN could not apparently be found by ironic. I verified that the value shown in ironic node-show matched with what introspection data provided. Need to BZ this.
https://bugzilla.redhat.com/show_bug.cgi?id=1430201 -  Bulk Introspection finished with message "finished with error: Preprocessing hook validate_interfaces" on some nodes (downstream)
https://bugs.launchpad.net/ironic/+bug/1670916 - Node Deployment failure due to " Error finding the disk or partition device to deploy the image onto"
https://bugzilla.redhat.com/show_bug.cgi?id=1430758 - Node Deployment failure due to " Error finding the disk or partition device to deploy the image onto" (downstream)
https://review.openstack.org/#/c/443649/ - Patch for WWN node found
https://review.openstack.org/#/c/443159/ - Add a debug log when comparing root device hints
https://bugzilla.redhat.com/show_bug.cgi?id=1430592 - Undercloud is deployed with docker bridge and docker loop devices
https://bugzilla.redhat.com/show_bug.cgi?id=1430766 - Deployment fails with stack still stuck in CREATE_IN_PROGRESS because of messaging timeout-Keystone pegged
https://bugzilla.redhat.com/show_bug.cgi?id=1430874 -  openstack baremetal configure boot is hardcoded to bm-deploy-ramdisk
https://bugs.launchpad.net/tripleo/+bug/1671597 - openstack baremetal configure boot is hardcoded to bm-deploy-ramdisk
https://bugzilla.redhat.com/show_bug.cgi?id=1431270 - Ironic Conductor is pegged during OC Deploy
https://gist.github.com/smalleni/d1225324d8d216303c8adf81f6db974f - Node failed to get image for deploy
https://gist.github.com/smalleni/36570a9595f240cace9fdc751f30902d - Ironic fails to set power state
https://gist.github.com/smalleni/b7bd80c3019f42d51d46f7ce9c1a61c9 - Stack delete failed but stack is shown as delete in progress on openstack stack list
https://gist.github.com/smalleni/f26c9760cd56e2fed5f59e670ace80ea - Instance doesn't have associated baremetal node
https://gist.github.com/smalleni/81884e80f499fd5417bb2a45c71ff96b - Node deploy fails on first attempt due to WWN node being found, goes through later
https://bugs.launchpad.net/ironic/+bug/1672877 - Node has to be powered off before putting in maintenace when clean fails
https://bugs.launchpad.net/heat/+bug/1673794 - [RFE]: Give user a way to cancel stack update through a heat command
https://bugzilla.redhat.com/show_bug.cgi?id=1433416 -  Ironic reports incorrect power state "power off" after a clean failed node is put into Maintenance
https://bugs.launchpad.net/tripleo/+bug/1673882 - The nic-config related documentation should mention that unused itnerfaces in overcloud should be set to use_dhcp: false
https://bugzilla.redhat.com/show_bug.cgi?id=1433978 - Ironic Inspector doesn't hand out IPs based on MAC of provisioning interface
https://bugzilla.redhat.com/show_bug.cgi?id=1438015 -  Heat doesn't renew token leading to authorization failure in deployments going past default 240 min timeout

Template Issues


https://github.com/redhat-performance/openstack-templates/commit/6980b0a6e545ca6f12c1ade32c3dd5c11df1b256
redhat-performance/openstack-templates#3

Scale Lab Issues


https://engineering.redhat.com/rt/Ticket/Display.html?id=439449 -  R930s don't seem ready for deployment
https://engineering.redhat.com/rt/Ticket/Display.html?id=439459 - Instackenv.json doesn't get regenerated after changes to nullOs host param [Fixed]
https://engineering.redhat.com/rt/Ticket/Display.html?id=439501 - IPMI credentials not set
https://engineering.redhat.com/rt/Ticket/Display.html?id=439506 - Boot order doesn't seem right
https://engineering.redhat.com/rt/Ticket/Display.html?id=439512 - Another Boot order issue
https://engineering.redhat.com/rt/Ticket/Display.html?id=439643 - c08-h17-r930.rdu.openstack.engineering.redhat.com has one of the disks in foreign mode
https://engineering.redhat.com/rt/Ticket/Display.html?id=440161 - Make the supermicros consistent in network interfaces

Issues found in pre-deployment testing

https://engineering.redhat.com/rt/Ticket/Display.html?id=439008 - Boot order not being correctly set through racadm
https://engineering.redhat.com/rt/Ticket/Display.html?id=439270 -  instackenv.json includes undercloud even though foreman host paremeter nullos is set to false
https://engineering.redhat.com/rt/Ticket/Display.html?id=439010 -  QUADs setup Operator account instead of sharing administrator account