
@jtaleric
Last active October 11, 2016 15:17

First successful deployment

Stack overcloud CREATE_COMPLETE
Overcloud Endpoint: http://172.21.0.18:5000/v2.0
Overcloud Deployed

real    59m46.251s
user    0m5.943s
sys     0m0.459s
Mon Oct  3 04:29:52 UTC 2016

Browbeat Genhost timing

Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
id_rsa                                                                                                                                                      100% 1679     1.6KB/s   00:00    

real	5m39.505s
user	0m1.558s
sys	0m3.230s

60 compute nodes

Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://172.21.0.18:5000/v2.0
Overcloud Deployed

real    234m50.263s
user    0m13.174s
sys     0m0.948s
Tue Oct  4 16:23:19 UTC 2016
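For comparison, a small helper (not part of the original runs) converts the `real` values from `time` above into seconds, which makes the 30-node vs 60-node delta easier to eyeball:

```shell
# Convert a `time` real value like "59m46.251s" into seconds.
# Assumes the minutes-and-seconds format shown above (no hours field).
to_seconds() {
  echo "$1" | awk -Fm '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

to_seconds 59m46.251s    # initial 30-node deploy
to_seconds 234m50.263s   # 60-node deploy
```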

Started over due to a failed scale deployment

Our scale deployment failed due to a Supermicro node becoming wedged. I tried to scale back down, but that ended in failure as well, so I have started over.

Deploying 30 nodes, the timings were nearly identical to the first run.

When attempting to deploy 60 nodes out of the gate, we saw the following on the overcloud controller:

Oct  6 20:47:50 localhost os-collect-config: <ErrorResponse><Error><Message>The request processing has failed due to an internal error: Timed out waiting for a reply to message ID ed3a0334366348479a8fdcbf65b3a68e</Message><Code>InternalFailure</Code><Type>Server</Type></Error></ErrorResponse>
+ rm /tmp/tmp.xZ5F0rDJNI
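To see how widespread these RPC timeouts are, grepping the controller logs for the timeout message is enough. A sketch, run here against a sample line copied from the failure above (on a real node you would grep `/var/log/messages` or the journal instead):

```shell
# Sample log line, abridged from the os-collect-config error above.
log_sample='Oct  6 20:47:50 localhost os-collect-config: <Message>The request processing has failed due to an internal error: Timed out waiting for a reply to message ID ed3a0334366348479a8fdcbf65b3a68e</Message>'

# Extract just the timeout message and the offending message ID.
echo "$log_sample" | grep -o 'Timed out waiting for a reply to message ID [0-9a-f]*'
```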

Speaking with Shardy on IRC, this could be due to:

05:36:30           shardy | rook: Yes, it can happen when a back end process (such as heat-engine) is overloaded and thus takes too long to respond to an RPC call
...
05:43:05           shardy | rook: it's possible we need to tune the swift config on the undercloud to cope with 60 nodes hitting the API at the same time?
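Following shardy's suggestion, the obvious knobs on the undercloud are the Swift proxy worker count and the heat-engine worker count. The values below are hypothetical, not what we ran with; the right numbers depend on the undercloud's CPU count:

```ini
# /etc/swift/proxy-server.conf on the undercloud
# (hypothetical value -- size to the undercloud CPU count)
[DEFAULT]
workers = 16
```

```ini
# /etc/heat/heat.conf -- more heat-engine workers to service RPC calls
# (hypothetical value)
[DEFAULT]
num_engine_workers = 8
```

Both services need a restart after changing these.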

Scaling from 30 to 60 Failure

We successfully deployed 3 controllers and 30 computes. When we attempted to scale to 60 compute nodes, computenode-10 became hung up on the resource below.

[stack@c04-h01-6048r ~]$ heat resource-list -n 5 overcloud | grep -i prog                                                                                      
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| ComputeAllNodesDeployment                     | 2c102ae9-e979-4e50-9201-d740de9b012e         | OS::Heat::StructuredDeployments                                                               | UPDATE_IN_PROGRESS | 2016-10-07T12:27:57 | overcloud                                                                                                                                                                          |
| 10                                            | ea656a19-71e7-4008-80d2-bfba03afd4e6         | OS::Heat::StructuredDeployment                                                                | UPDATE_IN_PROGRESS | 2016-10-07T12:28:55 | overcloud-ComputeAllNodesDeployment-oeyrhq6e7mor                                                                             
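With large stacks, the full resource list gets unwieldy; filtering the table output down to just the in-progress resource names helps spot the stuck node. A sketch over abridged rows mimicking the output above (the sample rows are shortened, not the real output):

```shell
# Pull the resource-name column out of `heat resource-list` / 
# `openstack stack resource list` style tables for rows still in progress.
filter_in_progress() {
  awk -F'|' '/IN_PROGRESS/ { gsub(/^ +| +$/, "", $2); print $2 }'
}

# Abridged sample rows for illustration:
cat <<'EOF' | filter_in_progress
| ComputeAllNodesDeployment | 2c102ae9 | OS::Heat::StructuredDeployments | UPDATE_IN_PROGRESS | 2016-10-07T12:27:57 |
| 10                        | ea656a19 | OS::Heat::StructuredDeployment  | UPDATE_IN_PROGRESS | 2016-10-07T12:28:55 |
EOF
```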

<rant> The first deployment works with computenode-10, but when scaling to 60 it fails? This is totally broken. </rant>

Scaling to 60, second attempt

Success!

Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://172.21.0.13:5000/v2.0
Overcloud Deployed

real    262m42.476s
user    0m13.942s
sys     0m1.072s
Fri Oct  7 19:18:20 UTC 2016

Scaling to 90 failed on the first attempt:

2016-10-07 19:50:09 [overcloud-Compute-gci7wjialpks]: UPDATE_FAILED ResourceInError: resources[71].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-10-07 19:50:09 [NovaComputeDeployment]: SIGNAL_IN_PROGRESS Signal: deployment d8f4ebbe-1937-4939-9725-71ad607bfbc8 succeeded
2016-10-07 19:50:09 [NovaComputeDeployment]: CREATE_COMPLETE state changed
2016-10-07 19:50:10 [overcloud-Compute-gci7wjialpks-83-nip4cihj75bx]: CREATE_FAILED Resource CREATE failed: Operation cancelled
2016-10-07 19:50:11 [Compute]: UPDATE_FAILED resources.Compute: ResourceInError: resources[71].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-10-07 19:50:11 [overcloud]: UPDATE_FAILED resources.Compute: ResourceInError: resources[71].resources.NovaCompute: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"
2016-10-07 19:50:15 [UpdateDeployment]: SIGNAL_IN_PROGRESS Signal: deployment 62cae6a1-a07a-43bf-87fe-4587c5a0d91f succeeded
2016-10-07 19:50:15 [UpdateDeployment]: CREATE_COMPLETE state changed
2016-10-07 19:50:18 [NetworkDeployment]: SIGNAL_IN_PROGRESS Signal: deployment 3f0dbe87-6b5e-4e53-9b5e-21e47c3e3ea4 succeeded
2016-10-07 19:50:18 [NetworkDeployment]: CREATE_COMPLETE state changed
2016-10-07 19:50:18 [overcloud-Compute-gci7wjialpks-77-2ofjvzs6pf6i]: CREATE_FAILED Resource CREATE failed: Operation cancelled
Stack overcloud UPDATE_FAILED
Heat Stack update failed.
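"No valid host was found" from nova-scheduler usually means it ran out of usable Ironic nodes (nodes that are `available` and not in maintenance). A quick sanity check before retrying is to count them; the sketch below runs over abridged `ironic node-list`-style rows (sample data, not from this deployment):

```shell
# Count Ironic nodes that are schedulable: provision state "available"
# and Maintenance column "False". Column order assumed:
# UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance
count_available() {
  awk -F'|' '$6 ~ /available/ && $7 ~ /False/ { n++ } END { print n + 0 }'
}

# Abridged sample rows for illustration (placeholder UUIDs):
cat <<'EOF' | count_available
| uuid-a | node-a | None | power off | available | False |
| uuid-b | node-b | None | power off | available | True  |
| uuid-c | node-c | None | power on  | active    | False |
EOF
```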

Scaling to 80

Stack overcloud CREATE_COMPLETE
Overcloud Endpoint: http://172.21.0.18:5000/v2.0
Overcloud Deployed

real    366m46.211s
user    0m15.143s
sys     0m2.439s

Memory Usage

Heat-Engine RSS
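To put a number on heat-engine memory during these deploys, summing RSS across all worker processes is enough. A sketch; on the undercloud you would feed it `ps -C heat-engine -o rss=`, and the sample values below are made up:

```shell
# Sum RSS values (KiB, one per line on stdin) across heat-engine workers.
sum_rss() {
  awk '{ total += $1 } END { printf "%d\n", total }'
}

# On the undercloud:
#   ps -C heat-engine -o rss= | sum_rss
# Illustration with four made-up worker RSS values:
printf '102400\n98304\n110592\n95232\n' | sum_rss
```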
