Skip to content

Instantly share code, notes, and snippets.

@ryandotsmith
Last active July 21, 2023 14:13
Show Gist options
  • Star 82 You must be signed in to star a gist
  • Fork 12 You must be signed in to fork a gist
  • Save ryandotsmith/c95fd21fab91b0823328 to your computer and use it in GitHub Desktop.
Save ryandotsmith/c95fd21fab91b0823328 to your computer and use it in GitHub Desktop.
Distributed Locking With DynamoDB

Distributed Locking With DynamoDB

This article unpacks information supporting DynamoDB as an excellent choice for a distributed locking service while briefly exploring the what, why & how of locking.

Why DynamoDB is a good choice for locking

To build a lock, all you need is access to a file system. To build a distributed lock, all you need is network access to the file system. To build a highly available, distributed lock, all you need is a highly available, network reslient, durable filesystem. It is for this reason that your file system choice becomes difficult. There are several popular services that will satisfy one or all of the aforementioned scenarios:

  • Doozerd
  • DynamoDB
  • MySQL
  • PostgreSQL
  • Redis
  • Zookeeper

The data stores in the list are far from equal. Their feature sets are as varied as the integers. However, what makes DynamoDB stand out is the fact that it is highly available and easily accessible. You can setup a highly available, distributed lock service in minutes. DynamoDB offers a simple set of primitives that make distributed locking easy.

  • Create item unless exists
  • Delete item

Since DynamoDB offers conditional writes, we can attempt to lock an item by creating the item unless an item already exists. Once we are done with the lock, we delete it.

Gotchas

When using DynamoDB, we trade lock expiration robustness for ease of use. This tradeoff is manifested when we consider what would happen if a processor acquires a lock, then is partitioned from DynamoDB. What is the processor to do? The process of detecting and deleting stale locks is a difficult problem. Solutions like Doozerd & Zookeeper use complicated consensus algorithms to agree if a lock has been abounded before deleting. However, there are plenty of use cases which do not require this level of sophistication.

One work around is to set a TTL on the lock item in DynamoDB. Subsequent attempts to acquire the lock will check the TTL and preempt an abandoned lock. This implies that the process that acquired the lock needs to terminate execution past the TTL. This can be achieved by wrapping your critical section in a timeout.

How you can start using DynamoDB as a lock

The recipe for using DynamoDB as a locking service is quite simple, you do not need anything but and HTTP client to take full advantage of locking on top of DynamoDB. The basic algorithm is:

Prune expired lock. (see Gotchas)
Create Item unless exists
Yield to critical section inside of timeout
Delete Item

If you are looking for a codified recepie, the following libraries exist:

What's a lock

Locking is a another word for incorporating mutual exclusion into a set of parallel executing processes. Mutual exclusion was discovered by E. W. Dijkstra and summarized as: The problem of ensuring that no two processes or threads can be in their critical section at the same time. The critical section being any section of code that assume no other code is in the same section at the same instant.

For instance:

x = get_global_count
x = x + 1

Imagine if 2 processes were accessing the previous code at the same instant, the CPU might allow the 1st process to execute the first line, then pausing the 1st process to allow the 2nd process to execute the first line. It is now impossible for each process to accurately compute the global count. Thus we are motivated to build a system in which the 1st process can lock the code so it can execute both lines with the confidence that no other process is manipulating the global count.

lock
x = get_global_count
x = x + 1
unlock

Why you would use a lock

The previous example demonstrated a classic scenario which requires local locking. DynamoDB is not the best choice for that type of locking; it is likely that your progamming language offers locking APIs that are much more suitable. However, there are a plethora of good examples for which DynamoDB is an excellent choice for locking. Automated DNS configuration is one such example:

lock("herokuapp.com.")
dns_records = get_records
if !includes(dns_records, "foo.herokuapp.com.") {
  create_record("foo.herokuapp.com.")
}
unlock("herokuapp.com.")

In this example, we have a process that is updating DNS records. Presumably, some other service is providing "foo.herokuapp.com." as an input to the process. Thus we have at least 3 network-indepoendent processes at play. In addition to this example, other scenarios include process partitioning, failing over a redis master, aggregating data accross shards, etc...

Conclusion

DynamoDB is not the most advanced locking service. However, DynamoDB offers a lot of features at a great price. Most importantly, DynamoDB is offered as an Amazon service. This allows the engineer not to have to maintain a cluster of Zookeepers or Doozerds. Whenever we can trade a feature for a dramatic reduction in operational burden, we should.

Updates

  • 2012-10-21 - Inception

Author

Ryan Smith

@ryandotsmith, ace hacker, builds distributed systems at heroku.

This article was motivated by many success and failures experienced with production systems at Heroku.

Related Articles

@noobiemcfoob
Copy link

noobiemcfoob commented Nov 22, 2016

Using a Time-To-Live lock with Lambda seems like a natural fit. Set the the TTL to 6 minutes (longer than AWS Lambda would allow a process to run) and it seems to resolve the orphaned lock problem, if your system can suffer a orphaned locked lasting for 6 minutes :P

@bsamuel-ui
Copy link

I apologize for my naiveness, but I'm having trouble thinking of concrete scenarios where a locking service would be required.

Semaphores are fundamentally designed to restrict access to limited resources.

One example I've seen was an API that ran administration commands on third-party network devices. This basically had to shell into the devices and run commands like "reboot" through terminal emulation. The workers needed to be highly available, so they tried to obtain a lock in dynamo.

In a case I'm working on now, we've got a large job that's being kicked off, and the user wants to poll us to find out when it's done. That means we have two calls, "start work" and "check work". We want workers to grab a lock on the work payload to allow "start work" to be idempotent.

I wanted to give that background to offer what is often a better solution to expiring locks. I'm very skeptical of timeouts, because they can work just fine, but those are most liable to fail exactly when you're trying to recover from an outage. One exception would be if you're using a service like AWS lambda. You can use a timeout quite safely because you can call get_time_remaining on your context object to determine when your process is going to time out and when it's safe to retry.

Generally, if you're working in the clerd, your lock may be related to some managed service. In my network devices example, these were being updated by AWS SWF workflows. So the workers stored their workflow ID in the lock, and could clear a dead lock by querying SWF for the status of the workflow.

And in the heavy computation case, the workers are tasks in an EC2 container service cluster. Determining the task from within it is a bit tricky, but once you have it, you can call describe_task to determine if it has died and break a dead lock that way. Since I have a frontend API, its job is to clear that dead lock on a 'start work' request, and the 'check work' call will update the work record to note 'hard failure' and report the error back. The workers can then assume that the frontend has cleared any dead locks for them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment