
DC/OS + Marathon Resource Contention

Introduction

Machine resources in a DC/OS environment are finite unless an operator adds more. Each containerized service in a DC/OS + Marathon cluster needs a certain amount of CPU, memory, and disk in order to run properly. As more and more services are deployed to a cluster, resource contention becomes a real concern. Eventually, resources will become so scarce that services will fail to deploy or scale at all! This is true for any microservice environment.

This Gist will show you how to identify Marathon services that are likely resource-starved, and provides some common steps for resolving this state.

Prerequisites

  • You'll need the private SSH key corresponding to the public key that is authorized with your DC/OS cluster.
  • You must have the ability to SSH to your cluster.
    • If your environment is Azure Container Services, see this article for more details.
  • You must open an SSH tunnel to port 80 on the cluster. This tutorial will reference your locally-mapped port as [port #].
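For reference, the tunnel can be opened with a single SSH command. This is only a sketch assuming the Azure Container Services defaults (admin user azureuser, master SSH on port 2200) and a placeholder master FQDN; substitute your own key path, port, and hostname:

```bash
# Forward local port 8080 to port 80 on the DC/OS master.
# With this tunnel in place, "[port #]" in the rest of this Gist is 8080.
ssh -i ~/.ssh/id_rsa -fNL 8080:localhost:80 -p 2200 azureuser@<your-master-fqdn>
```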

Identifying a resource-starved service

Luckily, spotting these failing deployments is a relatively straightforward process. Once you've connected an SSH tunnel to your cluster, simply navigate your browser to: http://localhost:[port #]/marathon

Here you can inspect the services currently deployed to your Marathon instance. If things are working properly, you should see a status of 'Running' for all services. However, if you see the following, it's a safe bet that there aren't enough resources in the cluster to run that particular service:

[Screenshot: a service stuck in the 'Waiting' deployment status (waiting.png)]
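The same information is available from the Marathon REST API over the tunnel. A minimal sketch, assuming the tunnel described in Prerequisites (local port 8080 in the earlier example) and the jq utility:

```bash
# List apps sitting in Marathon's launch queue (i.e. waiting for a
# suitable resource offer), along with what each one is requesting.
curl -s http://localhost:8080/marathon/v2/queue | \
  jq '.queue[] | {app: .app.id, cpus: .app.cpus, mem: .app.mem, overdue: .delay.overdue}'
```

An app that lingers in this queue is typically waiting for an offer large enough to satisfy its resource request.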

Ways to resolve the issue

Unfortunately, Marathon doesn't report a specific reason for a service being in the 'Waiting' state (exhausted CPU? Exhausted memory? Exhausted disk?). You can navigate to http://localhost:[port #]/mesos and look at the "Resources" section to try to identify the cause yourself, but regardless of the specific reason, the resolution steps are the same. In general, the cluster simply needs more resources. This can be accomplished by:

  • Double-checking the specified resource requirements of your application
    • Is there a typo in your service definition? For example, are you asking Marathon for 2560 MB of memory instead of 256 MB?
  • Deleting any unneeded services
    • You could scale an unneeded service to zero instances, or 'Destroy' it entirely (these steps are sketched on the command line after this list)
  • Adding more machines to your cluster
    • If your cluster is running in Azure Container Services, see https://aka.ms/acs-scale for details
  • In some rare circumstances, something in your cluster may be corrupt. There is no silver bullet for this state as the reason could be multi-faceted. But, if you are running in Azure Container Services (or another cloud provider), try provisioning another cluster and attempting your deployment again.
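These steps can also be performed from the command line. The following is a sketch, not a prescription: it assumes the tunnel from Prerequisites (local port 8080), a configured DC/OS CLI and Azure CLI, and hypothetical names (/my-service, my-acs-rg, my-acs-cluster) that you should replace with your own:

```bash
# 1. Compare cluster-wide resource usage against capacity via the Mesos metrics endpoint.
curl -s http://localhost:8080/mesos/metrics/snapshot | \
  jq '{cpus_used: ."master/cpus_used", cpus_total: ."master/cpus_total",
       mem_used: ."master/mem_used", mem_total: ."master/mem_total"}'

# 2. Double-check what a service definition is actually asking for
#    (e.g. catch a 2560 MB vs. 256 MB memory typo).
dcos marathon app show /my-service | jq '{cpus, mem, disk, instances}'

# 3. Free up resources by scaling an unneeded service to zero instances...
dcos marathon app update /my-service instances=0

# ...or by removing ("Destroying") it entirely.
dcos marathon app remove /my-service

# 4. If the cluster genuinely needs more capacity and runs on Azure Container
#    Services, add agent VMs (see https://aka.ms/acs-scale for details).
az acs scale --resource-group my-acs-rg --name my-acs-cluster --new-agent-count 10
```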

FAQ

  • I don't have access to the private SSH key used to create my cluster!
    • If your environment is Azure Container Services, you can reset SSH keys, create/delete users, etc., with the Azure VMAccess extension (an example follows below).
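For example, with the Azure CLI 2.0 you can push a new SSH public key to a VM through the VMAccess extension. A sketch assuming hypothetical resource group and VM names and the default azureuser account; run it against each master (and any agent) you need to reach:

```bash
# Install/refresh the SSH public key for 'azureuser' on a DC/OS master VM
# using the Azure VMAccess extension.
az vm user update \
  --resource-group my-acs-rg \
  --name dcos-master-0 \
  --username azureuser \
  --ssh-key-value ~/.ssh/id_rsa.pub
```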