@SEJeff SEJeff/AuroraStuff.md forked from zircote/AuroraStuff.md
Last active Aug 29, 2015

The mustache gotcha

When using “bound” objects in an .aurora file, you must never put spaces inside the “mustaches”.

Examples:

  • Bad: {{ profile.my_var }}
  • Good: {{profile.my_var}}
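
A quick way to catch this before submitting a job is to scan the config for padded mustaches. The following is a minimal sketch, not part of Aurora itself; the function name `find_spaced_mustaches` and the regular expression are my own assumptions about what counts as a "spaced" mustache (whitespace right after `{{` or right before `}}`).

```python
import re

# Matches a mustache with whitespace directly after "{{" or directly before "}}",
# e.g. "{{ profile.my_var }}" or "{{profile.my_var }}".
SPACED_MUSTACHE = re.compile(r"\{\{\s+[^{}]*\}\}|\{\{[^{}]*\s+\}\}")


def find_spaced_mustaches(text):
    """Return (line_number, mustache) pairs for mustaches padded with spaces."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in SPACED_MUSTACHE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

To check a file, pass its contents in, for example `find_spaced_mustaches(open("my_job.aurora").read())`, where `my_job.aurora` is a hypothetical file name.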

Mesos Slave Constraints

When scheduling a task on Aurora with Production=True, the 0.7-incubating scheduler sets a default constraint that prevents more than one instance of the task from running on the same rack.

"rack":{"limit":1}

When the task has Production=False or lacks the Production directive, the scheduler sets a default constraint to only allow one instance of the task per host.

"host":{"limit":1}
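
To make the effect of a limit constraint concrete, here is a toy model in Python. It is only an illustration of the behavior described in these notes, not the scheduler's actual code; the function `satisfies_limit` and its arguments are hypothetical.

```python
def satisfies_limit(slave_attrs, attribute, limit, placed):
    """Toy check: may one more instance land on this slave under a limit constraint?

    slave_attrs: attributes of the candidate slave, e.g. {"host": "h1", "rack": "r1"}
    attribute:   the constrained attribute name, e.g. "host" or "rack"
    limit:       maximum instances allowed per distinct attribute value
    placed:      attribute dicts of the slaves already running an instance
    """
    value = slave_attrs.get(attribute)
    if value is None:
        # The attribute is not set on this slave, so the constraint cannot be met.
        return False
    in_use = sum(1 for attrs in placed if attrs.get(attribute) == value)
    return in_use < limit
```

With one instance already on host `h1` in rack `r1`, a second slave `h2` in the same rack passes a host limit of 1 but fails a rack limit of 1, and a slave with no attributes fails both.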

If the rack and host attributes aren't set on the mesos slaves, the tasks can't meet their constraints and won't run. This behavior can be disabled by starting the aurora scheduler with -enable_legacy_constraints=false. Alternatively, set the attributes when starting mesos-slave, with something like:

--attributes="host:slave-hostname-fqdn;rack:rack-1"
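
Since a missing attribute leaves tasks unschedulable, it can help to sanity-check the attribute string before rolling it out. This is a rough sketch that only handles the simple `key:value;key:value` text form shown above; `parse_attributes` and `missing_constraint_attributes` are names I made up, not Mesos or Aurora APIs.

```python
def parse_attributes(spec):
    """Parse a 'key:value;key:value' attribute string into a dict."""
    attrs = {}
    for pair in spec.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition(":")
        if not sep:
            raise ValueError("malformed attribute: %r" % pair)
        attrs[key.strip()] = value.strip()
    return attrs


def missing_constraint_attributes(spec, required=("host", "rack")):
    """Return the required attribute names that are absent or empty in spec."""
    attrs = parse_attributes(spec)
    return [name for name in required if not attrs.get(name)]
```

An empty list back from `missing_constraint_attributes` means both `host` and `rack` are present.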

NOTE: Changing slave attributes invalidates the current slave state, so no running tasks can be resumed.

The mesos-slave will simply refuse to start with an invalid state. To clear it, run the following, where $MESOS_SLAVE_WORKDIR is the value of the -work_dir argument to mesos-slave:

 rm -rf ${MESOS_SLAVE_WORKDIR}/slave/meta/slaves/latest

Docker Container Snafus

When running a docker container, you must ensure that all of the libraries that thermos_executor.pex depends on are present in the docker container itself. The thermos_executor runs in the container, not in the mesos slave's environment.

If you run Debian docker containers on CentOS hosts, you will likely need to build a thermos_executor.pex inside the Debian docker container in addition to the one built for the CentOS host. You will also need a wrapper that can tell whether it is running in a docker container or on the host system and run the correct thermos_executor.pex. I wrote a simple bash script for this; nothing complicated, but effective. The good news is that I have seen plans to move away from libmesos.so and toward pesos, which will eliminate the pain associated with this.
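
For illustration, the wrapper idea could look roughly like this in Python. This is not the bash script mentioned above. The two executor paths are hypothetical, and the detection relies on an assumed heuristic (the string "docker" appearing in `/proc/1/cgroup`) that may not hold on every setup.

```python
import os
import sys

# Hypothetical install locations; adjust to wherever each build actually lives.
HOST_EXECUTOR = "/usr/local/bin/thermos_executor.host.pex"
CONTAINER_EXECUTOR = "/usr/local/bin/thermos_executor.debian.pex"


def in_docker(cgroup_text):
    """Heuristic (an assumption): Docker shows up in the cgroup paths of PID 1."""
    return "docker" in cgroup_text


def choose_executor(cgroup_text):
    """Pick the executor build that matches the current environment."""
    return CONTAINER_EXECUTOR if in_docker(cgroup_text) else HOST_EXECUTOR


def main():
    try:
        with open("/proc/1/cgroup") as handle:
            cgroup_text = handle.read()
    except IOError:
        # No cgroup information available: assume we are on the host.
        cgroup_text = ""
    executor = choose_executor(cgroup_text)
    # Replace this process with the chosen executor, passing arguments through.
    os.execv(executor, [executor] + sys.argv[1:])
```

Calling `main()` from the wrapper's entry point hands control to whichever build was selected; keeping the decision in `choose_executor` makes it easy to test without touching `/proc`.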

AWS Linux and the sasl hassle

It seems libmesos.so expects the file libsasl.so.3 to be present. The package that provides it is not available, so the thermos_executor.pex just dies. I managed to work around this, but I would rather not say how…

When the thermos_executor.pex fails to run

When it dies because of the MesosDriver, it won't give you a whole lot to go on. In my opinion, when the code handles the ImportError exception, it should pass that message on to the logger. The message it gives now is far too vague to figure out why it failed. I tracked the problem down by opening a Python REPL and attempting to import mesos.native; this should give you a good idea of what is really broken.
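
The REPL check can also be scripted so the full error is captured rather than swallowed. A small sketch follows; `diagnose_import` is a helper name I made up, and it should be run with the same Python interpreter and environment that the executor uses.

```python
import importlib
import traceback


def diagnose_import(module_name):
    """Try to import a module; return None on success or the full traceback text."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # A missing shared library also surfaces as ImportError, and the
        # traceback text usually names the file that could not be loaded.
        return traceback.format_exc()
    return None


report = diagnose_import("mesos.native")
print(report if report else "mesos.native imported cleanly")
```

If the import fails, the last line of the printed traceback is the message worth reading first.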

What are all these .pex files

When you build all the parts of Apache Aurora, you will see all sorts of .pex files that need to be put in various places. It can be a challenge for the uninitiated to discern the purpose of them all. I will give a list and a general idea of what they each do; keep in mind I am still learning a lot of this myself.

  • thermos_executor.pex: This is the executor, the governor of all mesos tasks that are scheduled. It resides on the mesos slaves, at the path you configured in your aurora-scheduler arguments.
  • gc_executor.pex: This is the janitor. The aurora-scheduler schedules it to seek out and clean up cruft, free up space, and reap dead jobs that are lingering around the mesos slaves. It too resides on the mesos slaves, at a path that you define in the aurora-scheduler arguments.
  • thermos_observer.pex: This is a service that runs on each mesos slave. It allows the consumers of aurora to look at the details of each process and the status of the jobs.
  • aurora.pex: This is the tool used to schedule jobs. It may run on any machine that can speak to the mesos hosts using LIBPROCESS_HOST:LIBPROCESS_PORT. It has a -h flag.
  • aurora_admin.pex: This is used for administrator-level tasks in the aurora cluster. It is where you set quotas for roles, as well as other features. It has a -h flag.

Diagnosing trouble

A good place to start is the logs on the mesos slave, for such things as LOST tasks. In my experience the cause is generally a failure of the thermos_executor.pex; beyond that, it is probably a broken .aurora file, which you can generally identify by looking at the log entries for each process in the thermos_observer.
