Accounts Kinesis

We currently need to solve a few problems in the mailservice:

  1. We need to react to account deletions (i.e., terminating ongoing processes and preventing retries of failed processes)
  2. We need to increase processing capacity as the number of accounts being handled increases
  3. To reduce the probability of k8s scheduling conflicts, we need to minimize the resources requested by mailservice pods

The current approach to event notification is to write SQS events to signal state changes. For instance, when a new account is created, the process creating the account writes an event to one of 64 "created" queues, which is consumed by the sync process. We could create a partitioned set of "deleted" queues and write deletion events into them. This approach feels brittle, however: it relies on SQS events being written everywhere accounts are modified, and we're already modifying accounts in at least two places (the API and the CIO Kafka topic watcher).
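As a rough sketch of what the producer side might look like, here's the partitioned write in Go. The queue naming, region, and FNV hash are assumptions for illustration, not the existing convention; the only hard requirement is that every producer uses the same stable hash:

```go
package main

import (
	"fmt"
	"hash/fnv"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

const numPartitions = 64

// partitionFor maps an account ID onto one of the 64 event queues.
// FNV is an assumed choice; any stable hash works, as long as every
// producer (API, CIO watcher, etc.) agrees on it.
func partitionFor(accountID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return h.Sum32() % numPartitions
}

func main() {
	sess := session.Must(session.NewSession())
	svc := sqs.New(sess)

	accountID := "acct-1234"
	// Hypothetical queue URL; the real queues would follow whatever
	// convention the existing "created" queues use.
	queueURL := fmt.Sprintf(
		"https://sqs.us-east-1.amazonaws.com/123456789012/accounts-deleted-%d",
		partitionFor(accountID))

	_, err := svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:    aws.String(queueURL),
		MessageBody: aws.String(fmt.Sprintf(`{"event":"deleted","account_id":%q}`, accountID)),
	})
	if err != nil {
		panic(err)
	}
}
```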

The current approach to processing capacity is to split accounts across N (currently 64) partitions and have processes "lock" a subset of these partitions. Capacity is increased by reducing the number of partitions locked by each process and increasing process replication in k8s. We're currently targeting 2 partitions per process, which means we can only halve the load on a process; in other words, we can effectively only increase capacity by 100%. This can be mitigated by increasing the partition count (e.g., with 256 partitions we can target 8 partitions per process), but each such increase only buys time before we're back in the same situation.
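The arithmetic behind that ceiling is easy to check. A minimal sketch, comparing the current 64-partition / 2-per-process target with the hypothetical 256 / 8 configuration:

```go
package main

import "fmt"

// With P partitions and a target of t partitions per process, the
// steady-state deployment runs P/t replicas, and the hard ceiling is
// P replicas (one partition each). Capacity can therefore grow by at
// most a factor of t before the partition count must be raised again.
func main() {
	configs := []struct{ partitions, perProcess int }{
		{64, 2},  // current setup: capacity can only double
		{256, 8}, // proposed: 8x headroom, but the same shape of problem
	}
	for _, c := range configs {
		replicas := c.partitions / c.perProcess
		ceiling := c.partitions
		fmt.Printf("partitions=%d, target=%d/process: %d replicas, ceiling %d (max +%d%%)\n",
			c.partitions, c.perProcess, replicas, ceiling,
			(ceiling-replicas)*100/replicas)
	}
}
```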

Currently there's a conflict between the need to scale efficiently (which is accomplished by increasing the number of partitions locked by each process) and the need to schedule processes reliably (which is accomplished by minimizing the resources requested by each process). One solution would be to substantially increase the partition count (say, 8,192 partitions), but this incurs a fair amount of unnecessary overhead (e.g., heartbeats) and may lead to non-linear increases in "thrashing" when resources are near their limits and communication starts to degrade.
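For a sense of that fixed overhead, a back-of-the-envelope estimate (the 10-second heartbeat interval here is an assumed value for illustration, not the system's actual setting):

```go
package main

import (
	"fmt"
	"time"
)

// Per-partition bookkeeping scales linearly with the partition count,
// regardless of how much work each partition actually carries.
func main() {
	const partitions = 8192
	interval := 10 * time.Second // assumed heartbeat interval
	fmt.Printf("%d partitions, one heartbeat per partition every %v: ~%.0f heartbeats/s cluster-wide\n",
		partitions, interval, float64(partitions)/interval.Seconds())
}
```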
