Without developer intervention, machine resources in a DC/OS environment are finite. Each containerized service in a DC/OS + Marathon cluster needs a certain amount of resources in order to run properly. As more and more services are deployed to a cluster, resource contention becomes a real concern. Eventually, resources will become so scarce that services will fail to deploy or scale at all! This is true of any microservice environment.
This Gist will show you how to identify Marathon services that are likely resource-starved, and walk through some common steps for resolving this state.
- You'll need the private SSH key corresponding to the public key that is authorized with your DC/OS cluster.
- You must have the ability to SSH to your cluster.
  - If your environment is Azure Container Services, see this article for more details.
- You must open an SSH tunnel to port 80 on the cluster. This tutorial will reference your locally-mapped port as `[port #]`.
  - If your environment is Azure Container Services, see the same article as above for details.
  - Or, if you have the Azure CLI, you can instead try the `az acs dcos browse` command to open an SSH tunnel for you.
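As a concrete sketch, the tunnel command looks something like the following. Every value here is a placeholder for your own cluster's details (the FQDN, username, and key path below are assumptions, not values from this tutorial); Azure Container Services DC/OS masters typically expose SSH on port 2200.

```shell
# Placeholder values throughout -- substitute your own master FQDN, user, and key.
# Maps local port 8080 to port 80 on the cluster, so [port #] would be 8080.
# (Binding local port 80 directly usually requires root, hence 8080 here.)
ssh -i ~/.ssh/id_rsa -p 2200 -fNL 8080:localhost:80 \
    azureuser@mycluster-mgmt.westus.cloudapp.azure.com
```

The `-fNL` flags background the session (`-f`), skip running a remote command (`-N`), and set up the local port forward (`-L`).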
Luckily, spotting these failing deployments is a relatively straightforward process. Once you've connected an SSH tunnel to your cluster, simply navigate your browser to: http://localhost:[port #]/marathon
Here you can inspect the current services deployed to your Marathon instance. If things are working properly, you should see a Status of 'Running' for all services. However, if you see the following 'Waiting' status, then it's a safe bet that there aren't enough resources in the cluster to run that particular service:
Unfortunately, Marathon doesn't provide any specific reason as to why a service is in the 'Waiting' state (exhausted CPU? Exhausted memory? Exhausted disk?). You can navigate to http://localhost:[port #]/mesos and look at the "Resources" section to try to identify the cause yourself, but regardless of the specific reason, the resolution steps are the same. In general, the cluster simply needs more resources. This can be accomplished by:
- Double-checking the specified resource requirements of your application
- Is there a typo in your service definition? For example, are you asking Marathon for 2560 MB of memory instead of 256 MB?
- Deleting any unneeded services
- You could scale an unneeded service to zero instances, or Destroy it entirely
- Adding more machines to your cluster
- If your cluster is running in Azure Container Services, see https://aka.ms/acs-scale for details
- In some rare circumstances, something in your cluster may be corrupt. There is no silver bullet for this state as the reason could be multi-faceted. But, if you are running in Azure Container Services (or another cloud provider), try provisioning another cluster and attempting your deployment again.
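Marathon's REST API, reachable through the same tunnel, can help with the first two steps above. A minimal sketch, assuming your tunnel is mapped to `[port #]` = 8080 and a hypothetical app id of `/my-service` (both are assumptions, not values from this tutorial):

```shell
# Inspect the launch queue: apps listed here are waiting for resource offers,
# and each entry includes the app's requested cpus/mem so you can spot typos.
curl -s http://localhost:8080/marathon/v2/queue

# Scale an unneeded service to zero instances to free its resources:
curl -s -X PUT -H "Content-Type: application/json" \
    -d '{"instances": 0}' \
    http://localhost:8080/marathon/v2/apps/my-service

# Or destroy it entirely:
curl -s -X DELETE http://localhost:8080/marathon/v2/apps/my-service
```

These are the same operations as the Scale and Destroy buttons in the Marathon UI, just scriptable.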
- I don't have access to the private SSH key used to create my cluster!
- If your environment is Azure Container Services, you can reset SSH keys, create/delete users, etc., with the Azure VMAccess extension.
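For example, the Azure CLI's `az vm user update` command (which is backed by the VMAccess extension) can push a replacement public key to a node. The resource group, VM name, and username below are hypothetical; substitute your own:

```shell
# Hypothetical names -- replace with your cluster's resource group and master VM.
az vm user update \
    --resource-group my-acs-rg \
    --name dcos-master-0 \
    --username azureuser \
    --ssh-key-value ~/.ssh/id_rsa.pub
```

After the key is replaced, you can open the SSH tunnel as described in the prerequisites above.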