Cluster Management

Problem Statement

Distributed computing environments face two significant problems:

  • Computing resources need to be assigned to computing tasks in a coordinated way (i.e., if I have two computers and two tasks, I don't want both of the computers trying to do the same task)
  • Process failure needs to be detected and responded to (i.e., if I run out of disk space, another process should take over my task)

An ideal solution would satisfy several objectives:

  1. 'Failure' would be defined in-process (as opposed to simply asking if an instance is running)
  2. Task assignment would be able to tolerate instances entering & leaving the cluster
  3. There would be no single point of failure
  4. It would be language-agnostic (could be used with Ruby, Java, Node, etc)
  5. Task assignment could (optionally) be static unless the cluster topology changed

We've tried a few approaches to solving this:

  • Tell each computer what it's supposed to do when we provision it, and alert a human when something fails (OIB scheduler box). This fails #'s 2 & 3 (at least for singular resources like the scheduler)
  • Each instance claims a task for a duration of time and coordinates through a central data store. Cycle assignment regularly so failures can be healed (AMP instances). This fails #5
  • Use a process management framework to do it for us (Storm topology). This depends on the framework, but I don't know of any frameworks that satisfy all objectives (particularly #4)

Proposed Solution

Preliminaries

Consul provides a mechanism for service discovery and a consistent, distributed KV store. It exposes an HTTP API supporting blocking (long-poll) requests, which can be used to trigger event handlers when the KV store (or the cluster topology) changes. I propose we build a daemon which handles cluster management on top of these capabilities.
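As a concrete example, here's a minimal Go sketch of an event handler built on Consul's blocking queries (the `/v1/kv` endpoint and the `index`/`wait` parameters are Consul's standard HTTP API; the `cluster/partitions` key is just an assumed layout):

```go
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"time"
)

// watchKey long-polls a Consul KV prefix and invokes handler on the initial
// read and whenever the data changes. The ?index/&wait parameters are
// Consul's standard blocking-query mechanism.
func watchKey(consulAddr, key string, handler func(body []byte)) {
	index := "0"
	for {
		url := fmt.Sprintf("http://%s/v1/kv/%s?recurse&index=%s&wait=30s",
			consulAddr, key, index)
		resp, err := http.Get(url)
		if err != nil {
			time.Sleep(time.Second) // back off on connection errors
			continue
		}
		body, _ := ioutil.ReadAll(resp.Body)
		resp.Body.Close()

		newIndex := resp.Header.Get("X-Consul-Index")
		if newIndex != "" && newIndex != index {
			index = newIndex
			handler(body) // the KV data changed; fire the event handler
		}
	}
}

func main() {
	watchKey("127.0.0.1:8500", "cluster/partitions", func(body []byte) {
		fmt.Printf("partition set changed: %s\n", body)
	})
}
```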

Basic Architecture

We'll use the same basic terminology as Helix: a cluster is made up of a number of instances and a number of partitions. Each instance processes as many partitions as are required to fully consume the partitions. The task is to ensure that each partition is assigned to at-most-one instance and, eventually, exactly-one instance. (One difference between this proposal and Helix is that I'm not building in any type of restrictions on how instances can transition between partitions or what partitions can be co-located on an instance, e.g. Master/Slave separation.)

Partitions exist in one of two states: assumed partitions have been acknowledged by the instance which owns them, while assigned partitions have not. Partition allocation is deterministic given a set of partitions & instances (for example, using the RUSH algorithm), so there is always a mapping from a partition to an instance; however, instances must explicitly 'assume' responsibility for a partition. This ensures that partition assignment remains consistent during topology changes, and allows an instance to signal that it has released ownership of a partition and the partition is ready to be transferred to its new owner.
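RUSH itself supports weighted, hierarchical placement; as a simpler stand-in with the same key property - every instance computes the same mapping from the same inputs - here's a rendezvous (highest-random-weight) hashing sketch in Go:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// owner deterministically maps a partition to one instance: every node,
// given the same instance list, computes the same answer. When instances
// enter or leave, only partitions whose winner changed move.
func owner(partition string, instances []string) string {
	sort.Strings(instances) // make the result independent of input order
	best, bestScore := "", uint32(0)
	for _, inst := range instances {
		h := fnv.New32a()
		h.Write([]byte(partition + "/" + inst))
		if score := h.Sum32(); best == "" || score > bestScore {
			best, bestScore = inst, score
		}
	}
	return best
}

func main() {
	instances := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	for p := 0; p < 4; p++ {
		part := fmt.Sprintf("partition-%d", p)
		fmt.Println(part, "->", owner(part, instances))
	}
}
```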

Events

The nodes need to handle a few events:

  • Rebalance: When the cluster topology changes, a new partition mapping is computed. Processing on assumed partitions which are no longer assigned is terminated and those partitions are released. Assigned partitions which are not assumed are acquired as they become available and processing is started. (Executed by all instances; a sketch follows this list)
  • Update Partitions: When a daemon first starts, it updates the partition set to match its local definition and triggers rebalance. (Update executed by new instance, rebalance executed by all instances)
  • Enter: When an instance enters the cluster it registers itself with the cluster and executes a rebalance. (Executed by the instance entering the cluster)
  • Leave: When an instance leaves the cluster, it terminates processing on assumed partitions, releases its assumed partitions and deregisters itself from the cluster. (Executed by the instance leaving the cluster)
  • Local failure: When an instance loses its connection to the cluster, it terminates processing on assumed partitions. (Executed by the instance which failed)
  • Remote failure: When an instance becomes unresponsive, its partitions are released and it is deregistered from the cluster by another instance. (Executed by all instances)
  • Health check failure: When an instance's health check fails, it optionally executes user-defined handlers - we make no assumptions about the implications of a health check failure. (Executed by the instance whose health-check failed)
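Across these handlers, the core loop is the Rebalance step. Here's a rough Go sketch of how an instance might run it; every function it calls (computeMapping, terminate, release, tryAcquire, startProcessing) is a hypothetical stand-in for one of the actions listed in the next section, stubbed out here so the sketch compiles:

```go
package main

import "fmt"

// Hypothetical stand-ins for the actions listed in the next section; real
// versions would talk to Consul's KV store and the worker processes.
func computeMapping(parts, insts []string) map[string]string {
	m := map[string]string{}
	for i, p := range parts {
		m[p] = insts[i%len(insts)] // placeholder for the RUSH-style mapping
	}
	return m
}
func terminate(p string)       { fmt.Println("terminate", p) }
func release(p string)         { fmt.Println("release", p) }
func tryAcquire(p string) bool { fmt.Println("acquire", p); return true }
func startProcessing(p string) { fmt.Println("start", p) }

// rebalance: first release partitions we hold but no longer own, then pick
// up partitions newly mapped to us (their old owners release them the same way).
func rebalance(self string, partitions, instances []string, assumed map[string]bool) {
	mapping := computeMapping(partitions, instances) // deterministic: same on every node
	for part := range assumed {
		if mapping[part] != self {
			terminate(part)
			release(part) // signals the new owner that the handoff is safe
			delete(assumed, part)
		}
	}
	for part, inst := range mapping {
		if inst == self && !assumed[part] {
			if tryAcquire(part) { // at-most-one owner via the consistent KV store
				assumed[part] = true
				startProcessing(part)
			}
		}
	}
}

func main() {
	assumed := map[string]bool{"p0": true, "p1": true}
	rebalance("node-b", []string{"p0", "p1", "p2"}, []string{"node-a", "node-b"}, assumed)
}
```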

Actions

To handle these events, the following actions need to be implemented:

  • Compute partition map based on the cluster topology and the instance's identity
  • Update the partition set
  • Transition from one partition map to another
  • Enter the cluster
  • Leave the cluster
  • Force another node to leave the cluster
  • Listen for topology changes
  • Listen for partition assignment changes
  • Listen for partition set changes
  • Begin processing
  • Terminate processing
  • Acquire a partition (a Consul-based sketch of acquire/release follows this list)
  • Release a partition
  • Respond to a failing health check
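The acquire/release pair maps naturally onto Consul's session-based locks. Here's a minimal Go sketch against the standard `/v1/session/create` and `/v1/kv/...?acquire=`/`?release=` endpoints (the `cluster/assumed/<partition>` key layout is an assumption, not part of Consul):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io/ioutil"
	"net/http"
)

const consul = "http://127.0.0.1:8500"

// createSession ties locks to this daemon's liveness: if the session's TTL
// lapses or the node fails, Consul releases every lock it holds.
func createSession() (string, error) {
	req, _ := http.NewRequest("PUT", consul+"/v1/session/create",
		bytes.NewBufferString(`{"Name": "cluster-daemon", "TTL": "15s"}`))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct{ ID string }
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ID, nil
}

// acquire attempts to take ownership of a partition key; Consul returns
// "true" only if no other session holds the lock (at-most-one owner).
func acquire(session, partition string) (bool, error) {
	url := fmt.Sprintf("%s/v1/kv/cluster/assumed/%s?acquire=%s", consul, partition, session)
	req, _ := http.NewRequest("PUT", url, nil)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	return string(bytes.TrimSpace(body)) == "true", nil
}

// release gives the lock up cleanly so the partition's next owner can
// acquire it without waiting for a session timeout.
func release(session, partition string) error {
	url := fmt.Sprintf("%s/v1/kv/cluster/assumed/%s?release=%s", consul, partition, session)
	req, _ := http.NewRequest("PUT", url, nil)
	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		resp.Body.Close()
	}
	return err
}

func main() {
	session, err := createSession()
	if err != nil {
		panic(err)
	}
	ok, _ := acquire(session, "partition-0")
	fmt.Println("acquired partition-0:", ok)
	release(session, "partition-0")
}
```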

Cluster Description

Finally, we can enumerate the information required to describe a cluster:

  • A mapping from partitions to the instances which have assumed them (or null if not assumed)
  • A mapping from instances known to the cluster to their current state (alive or failed)
  • The action required to begin processing
  • The action required to terminate processing
  • The action required to respond to a health check failure
  • The address of at least one Consul server
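In Go-flavored terms, that description might look something like this (field names and types are illustrative, not a fixed schema):

```go
package main

// ClusterDescription collects the information above. Field names are
// illustrative; the proposal doesn't fix a schema.
type ClusterDescription struct {
	// Partition -> instance that has assumed it ("" if not assumed).
	Assumed map[string]string `json:"assumed"`

	// Instance -> current state ("alive" or "failed").
	Instances map[string]string `json:"instances"`

	// Commands the daemon runs for lifecycle events.
	StartCmd      string `json:"start_cmd"`
	TerminateCmd  string `json:"terminate_cmd"`
	HealthFailCmd string `json:"health_fail_cmd"`

	// At least one Consul server to bootstrap from.
	ConsulServers []string `json:"consul_servers"`
}

func main() {}
```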

Failure Modes

  • Listeners fail to trigger transitions. This should result in partitions not being released. Because partitions must be acquired before they're processed, the consistent KV store should prevent multiple machines from processing a partition, even if portions of the cluster have a different opinion about the current partition mapping.
  • Transient instance failures. This can result in partitions 'thrashing' between instances. It's probably worth establishing a minimum failure time before an instance is forced out of the cluster and its partitions re-assigned.
  • Local failure + listener failure. In this case the failed instance could be unaware that it has lost its connection to the cluster, which could result in partitions being processed by multiple nodes. Hopefully the types of failures that would cause this situation would also cause the instance to be unable to process its partition.
  • Performance discrepancy between instances. Because partition assignments are static, it's possible for some partitions to be processed more efficiently than others. It may be worth allowing the partition mappings to be shuffled occasionally to handle this situation (but it's probably better to just increase the cluster capacity).
  • Daemon failure. This could likewise result in partitions being processed by multiple instances. It will be important to ensure that the daemon is monitored. It might be worth doing some bash-fu to ensure that if the daemon process dies it also kills the processes doing the work.
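On the last point, one Linux-specific alternative to the bash-fu: if the daemon spawns the worker process itself, it can ask the kernel to kill the worker whenever the daemon dies. A sketch (PR_SET_PDEATHSIG is Linux-only, and the worker path is hypothetical):

```go
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	// Hypothetical worker command; stands in for whatever processes a partition.
	cmd := exec.Command("/usr/local/bin/partition-worker", "partition-0")

	// Linux-only: deliver SIGKILL to the child if this daemon process dies,
	// so a crashed daemon can't leave an orphaned worker holding a partition.
	cmd.SysProcAttr = &syscall.SysProcAttr{Pdeathsig: syscall.SIGKILL}

	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cmd.Wait()
}
```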

bpot commented May 5, 2014

This is cool, I have a couple initial questions:

It sounds like there is no leader or controller process for the cluster. Is this made unnecessary because assignments are deterministic given a certain cluster state? I guess I should learn more about the RUSH algorithm.

"Each instance processes as many partitions as are required to fully consume the partitions."

In some cases we may want to set a limit on how many partitions a single process can handle. There may be a point after which adding more partitions to a process causes the process to fail or work very slowly.


kerinin commented May 5, 2014

Yes, making the assignment deterministic means that the peers will all compute the same mapping, as long as the partition set and cluster membership are consistent (which is what Consul's Raft implementation gives us).

Good point on the partition limits. I'll have to think about how to accomplish that deterministically. Probably just assign all the partitions but only actually work on the first N. The problem that raises is how to deal with the partitions that aren't being worked - whether they should be rotated through the instance or just left offline.
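For example (purely illustrative), each instance could order its assigned partitions deterministically and only work the first N:

```go
package main

import (
	"fmt"
	"sort"
)

// workingSet sketches the cap described above: take the full deterministic
// assignment and work only the first N in a stable order, leaving the rest
// assigned-but-idle until cluster capacity grows.
func workingSet(assigned []string, max int) []string {
	sort.Strings(assigned) // stable order so every rebalance picks the same N
	if len(assigned) > max {
		return assigned[:max]
	}
	return assigned
}

func main() {
	fmt.Println(workingSet([]string{"p3", "p1", "p2"}, 2)) // [p1 p2]
}
```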
