Without developer intervention, machine resources in a DC/OS environment are finite. Each containerized service in a DC/OS + Marathon cluster needs a certain amount of resources in order to run properly. As more and more services are deployed to a cluster, resource contention becomes a real concern. Eventually, resources will become so scarce that services will fail to deploy or scale at all! This is true of any microservice environment.
This Gist will show you how to identify Marathon services that are likely resource-starved, and walk through some common steps for resolving this state.
- You'll need the private SSH key corresponding to the public key that is authorized with your DC/OS cluster.
- You must have the ability to SSH to your cluster.
  - If your environment is Azure Container Services, see this article for more details.
- You must open an SSH tunnel to port 80 on the cluster. This tutorial will reference your locally-mapped port as `[port #]`.
  - If your environment is Azure Container Services, see the same article as above for details.
  - Or, if you have the Azure CLI, you can instead try the `az acs dcos browse` command to open an SSH tunnel for you.
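As a concrete sketch, the tunnel command looks something like the following. Every value here is a placeholder for your own cluster's details (the FQDN, username, and key path below are assumptions, not values from this tutorial); Azure Container Services DC/OS masters typically expose SSH on port 2200.

```shell
# Placeholder values throughout -- substitute your own master FQDN, user, and key.
# Maps local port 8080 to port 80 on the cluster, so [port #] would be 8080.
# (Binding local port 80 directly usually requires root, hence 8080 here.)
ssh -i ~/.ssh/id_rsa -p 2200 -fNL 8080:localhost:80 \
    azureuser@mycluster-mgmt.westus.cloudapp.azure.com
```

The `-fNL` flags background the session (`-f`), skip running a remote command (`-N`), and set up the local port forward (`-L`).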
Luckily, spotting these failing deployments is a relatively straightforward process. Once you've connected an SSH tunnel to your cluster, simply navigate your browser to: http://localhost:[port #]/marathon
Here you can inspect the current services deployed to your Marathon instance. If things are working properly, you should see a Status of 'Running' for all services. However, if you see the following 'Waiting' status, then it's a safe bet that there aren't enough resources in the cluster to run that particular service:
Unfortunately, Marathon doesn't provide any specific reason as to why a service is in the 'Waiting' state (exhausted CPU? Exhausted memory? Exhausted disk?). You can navigate to http://localhost:[port #]/mesos and look at the "Resources" section to try to identify the cause yourself, but regardless of the specific reason, the resolution steps are the same. In general, the cluster simply needs more resources. This can be accomplished by:
- Double-checking the specified resource requirements of your application
- Is there a typo in your service definition? For example, are you asking Marathon for 2560 MB of memory instead of 256 MB?
- Deleting any unneeded services
- You could scale an unneeded service to zero instances, or Destroy it entirely
- Adding more machines to your cluster
- If your cluster is running in Azure Container Services, see https://aka.ms/acs-scale for details
- In some rare circumstances, something in your cluster may be corrupt. There is no silver bullet for this state as the reason could be multi-faceted. But, if you are running in Azure Container Services (or another cloud provider), try provisioning another cluster and attempting your deployment again.
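Marathon's REST API, reachable through the same tunnel, can help with the first two steps above. A minimal sketch, assuming your tunnel is mapped to `[port #]` = 8080 and a hypothetical app id of `/my-service` (both are assumptions, not values from this tutorial):

```shell
# Inspect the launch queue: apps listed here are waiting for resource offers,
# and each entry includes the app's requested cpus/mem so you can spot typos.
curl -s http://localhost:8080/marathon/v2/queue

# Scale an unneeded service to zero instances to free its resources:
curl -s -X PUT -H "Content-Type: application/json" \
    -d '{"instances": 0}' \
    http://localhost:8080/marathon/v2/apps/my-service

# Or destroy it entirely:
curl -s -X DELETE http://localhost:8080/marathon/v2/apps/my-service
```

These are the same operations as the Scale and Destroy buttons in the Marathon UI, just scriptable.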
- I don't have access to the private SSH key used to create my cluster!
- If your environment is Azure Container Services, you can reset SSH keys, create/delete users, etc., with the Azure VMAccess extension.
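For example, the Azure CLI's `az vm user update` command (which is backed by the VMAccess extension) can push a replacement public key to a node. The resource group, VM name, and username below are hypothetical; substitute your own:

```shell
# Hypothetical names -- replace with your cluster's resource group and master VM.
az vm user update \
    --resource-group my-acs-rg \
    --name dcos-master-0 \
    --username azureuser \
    --ssh-key-value ~/.ssh/id_rsa.pub
```

After the key is replaced, you can open the SSH tunnel as described in the prerequisites above.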