
Distributed Locking With DynamoDB

This article makes the case for DynamoDB as a distributed locking service while briefly exploring the what, why & how of locking.

Why DynamoDB is a good choice for locking

To build a lock, all you need is access to a file system. To build a distributed lock, all you need is network access to that file system. To build a highly available, distributed lock, you need a highly available, network-resilient, durable file system. It is this last requirement that makes the choice of data store difficult. Several popular services satisfy one or all of the aforementioned scenarios:

  • Doozerd
  • DynamoDB
  • MySQL
  • PostgreSQL
  • Redis
  • Zookeeper

The data stores in this list are far from equal; their feature sets vary widely. What makes DynamoDB stand out is that it is highly available and easily accessible: you can set up a highly available, distributed lock service in minutes. DynamoDB offers a simple set of primitives that make distributed locking easy:

  • Create item unless exists
  • Delete item

Since DynamoDB offers conditional writes, we can attempt to acquire a lock by creating an item unless it already exists. Once we are done with the lock, we delete the item.
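
Here is a minimal sketch of that flow in Python using boto3, the AWS SDK; the table name "locks" and the key attribute "Name" are assumptions made for illustration, not part of the article.

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def acquire(lock_name):
    # Create the lock item unless it already exists.
    try:
        dynamodb.put_item(
            TableName="locks",
            Item={"Name": {"S": lock_name}},
            ConditionExpression="attribute_not_exists(#n)",
            ExpressionAttributeNames={"#n": "Name"},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else holds the lock
        raise

def release(lock_name):
    # Delete the lock item so another process can acquire it.
    dynamodb.delete_item(TableName="locks", Key={"Name": {"S": lock_name}})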

Gotchas

When using DynamoDB, we trade lock-expiration robustness for ease of use. The tradeoff shows up when we consider what happens if a process acquires a lock and is then partitioned from DynamoDB. What is the process to do? Detecting and deleting stale locks is a difficult problem. Solutions like Doozerd & Zookeeper use complicated consensus algorithms to agree that a lock has been abandoned before deleting it. However, there are plenty of use cases which do not require this level of sophistication.

One workaround is to set a TTL on the lock item in DynamoDB. Subsequent attempts to acquire the lock check the TTL and preempt an abandoned lock. This implies that the process holding the lock must stop executing its critical section before the TTL expires, which can be achieved by wrapping the critical section in a timeout.
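
One way to sketch this, continuing the hypothetical boto3 example above with an added numeric "Expires" attribute, is to fold the expiry check into the conditional write itself. Note that this leans on loosely synchronized clocks across the processes involved.

import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def acquire_with_ttl(lock_name, ttl_seconds):
    # Take the lock if it is free, or preempt it if the holder's TTL has lapsed.
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName="locks",
            Item={
                "Name": {"S": lock_name},
                "Expires": {"N": str(now + ttl_seconds)},
            },
            # Succeed when no lock exists, or when the existing lock has expired.
            ConditionExpression="attribute_not_exists(#n) OR #e < :now",
            ExpressionAttributeNames={"#n": "Name", "#e": "Expires"},
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise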

How you can start using DynamoDB as a lock

The recipe for using DynamoDB as a locking service is quite simple; you do not need anything but an HTTP client to take full advantage of locking on top of DynamoDB. The basic algorithm, sketched in code below, is:

  1. Prune expired locks (see Gotchas)
  2. Create item unless it exists
  3. Yield to the critical section inside of a timeout
  4. Delete the item
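
Under the same assumptions, a sketch of the whole recipe might look like the following. It reuses the hypothetical acquire_with_ttl and release helpers from the earlier sketches; the timeout uses SIGALRM, so this variant is Unix-only and assumes the critical section runs on the main thread.

import signal
from contextlib import contextmanager

class LockTimeout(Exception):
    pass

@contextmanager
def deadline(seconds):
    # Abort the critical section if it runs past the lock's TTL.
    def _expire(signum, frame):
        raise LockTimeout("critical section exceeded lock TTL")
    signal.signal(signal.SIGALRM, _expire)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def with_lock(lock_name, ttl_seconds, critical_section):
    # Steps 1 & 2: prune an expired lock and create the item in one conditional write.
    if not acquire_with_ttl(lock_name, ttl_seconds):
        return False
    try:
        # Step 3: yield to the critical section inside a timeout.
        with deadline(ttl_seconds):
            critical_section()
    finally:
        # Step 4: delete the item.
        release(lock_name)
    return True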

If you are looking for a codified recipe, the following libraries exist:

What's a lock

Locking is another word for incorporating mutual exclusion into a set of concurrently executing processes. Mutual exclusion was formalized by E. W. Dijkstra and can be summarized as: the problem of ensuring that no two processes or threads are in their critical section at the same time. The critical section is any section of code that assumes no other code is in the same section at the same instant.

For instance:

x = get_global_count
x = x + 1

Imagine two processes executing the previous code at the same instant. The CPU might allow the 1st process to execute the first line, then pause it to allow the 2nd process to execute the first line. It is now impossible for each process to accurately compute the global count. Thus we are motivated to build a system in which the 1st process can lock the code so it can execute both lines with confidence that no other process is manipulating the global count.

lock
x = get_global_count
x = x + 1
unlock
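
In a single process, the language's own primitives are enough; for example, a Python sketch of the same idea using threading.Lock:

import threading

global_count = 0
count_lock = threading.Lock()

def increment():
    global global_count
    with count_lock:           # lock
        x = global_count       # x = get_global_count
        x = x + 1              # no other thread can interleave here
        global_count = x
                               # unlock happens when the block exits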

Why you would use a lock

The previous example demonstrated a classic scenario which requires local locking. DynamoDB is not the best choice for that type of locking; it is likely that your programming language offers locking APIs that are much more suitable. However, there are plenty of good examples for which DynamoDB is an excellent choice. Automated DNS configuration is one such example:

lock("herokuapp.com.")
dns_records = get_records
if !includes(dns_records, "foo.herokuapp.com.") {
  create_record("foo.herokuapp.com.")
}
unlock("herokuapp.com.")

In this example, we have a process that is updating DNS records. Presumably, some other service is providing "foo.herokuapp.com." as an input to the process. Thus we have at least 3 network-independent processes at play. In addition to this example, other scenarios include process partitioning, failing over a Redis master, aggregating data across shards, etc.
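
As a concrete (if hypothetical) illustration, the DNS example could be wrapped in the with_lock sketch from earlier; get_records and create_record stand in for whatever DNS API the process actually uses.

def sync_dns():
    def critical_section():
        records = get_records()
        if "foo.herokuapp.com." not in records:
            create_record("foo.herokuapp.com.")

    # Hold the zone-level lock for at most 30 seconds while mutating records.
    with_lock("herokuapp.com.", ttl_seconds=30, critical_section=critical_section)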

Conclusion

DynamoDB is not the most advanced locking service. However, it offers a lot of features at a great price. Most importantly, DynamoDB is offered as an Amazon service, which spares the engineer from maintaining a cluster of ZooKeeper or Doozerd nodes. Whenever we can trade a feature for a dramatic reduction in operational burden, we should.

Updates

  • 2012-10-21 - Inception

Author

Ryan Smith

@ryandotsmith, ace hacker, builds distributed systems at Heroku.

This article was motivated by many successes and failures experienced with production systems at Heroku.

Comments

@mikehale

I apologize for my naiveness, but I'm having trouble thinking of concrete scenarios where a locking service would be required. The DNS example given above seems like it could be solved in other ways. For example, a distinct queue of changes to the domain could be applied by multiple workers at any concurrency. Am I missing something there?

However, assuming that I do need a locking service you make a very strong case for DynamoDB.

@nickyleach

@mikehale In our application we store some auxiliary data in a serialized format. When we update that data, we use a similar locking mechanism to ensure that simultaneous writes don't overwrite each other's changes.

@drevell

drevell commented Oct 29, 2012

This implies that the process that acquired the lock needs to terminate execution past the TTL. This can be achieved by wrapping your critical section in a timeout.

Google "juliet pause," where a worker is seen to be dead by the rest of the system but is in fact still alive. It can result in two threads running inside the critical section at the same time, and can be caused by various factors including high system load, faulty hardware, and garbage collection pauses. The solution (which is hard to implement) is to deny access to the underlying resource by non-lock holders. In ZooKeeper, for example, the "lock" (an ephemeral znode) is expired at the same time the session is broken (so further writes will not be accepted).

@ryandotsmith (Author)

@drevell I agree with your concern. Any implementation that is not based on Paxos will be incomplete. However, if you can relax the constraints a bit, using a TTL on the lock and wrapping workers in a hard timeout for the duration of the TTL can get you a long way.

@noobiemcfoob

noobiemcfoob commented Nov 22, 2016

Using a Time-To-Live lock with Lambda seems like a natural fit. Set the TTL to 6 minutes (longer than AWS Lambda would allow a process to run) and it seems to resolve the orphaned lock problem, if your system can suffer an orphaned lock lasting for 6 minutes :P

@bsamuel-ui

I apologize for my naiveness, but I'm having trouble thinking of concrete scenarios where a locking service would be required.

Semaphores are fundamentally designed to restrict access to limited resources.

One example I've seen was an API that ran administration commands on third-party network devices. This basically had to shell into the devices and run commands like "reboot" through terminal emulation. The workers needed to be highly available, so they tried to obtain a lock in dynamo.

In a case I'm working on now, we've got a large job that's being kicked off, and the user wants to poll us to find out when it's done. That means we have two calls, "start work" and "check work". We want workers to grab a lock on the work payload to allow "start work" to be idempotent.

I wanted to give that background to offer what is often a better solution than expiring locks. I'm very skeptical of timeouts; they can work just fine, but they are most liable to fail exactly when you're trying to recover from an outage. One exception is if you're using a service like AWS Lambda: you can use a timeout quite safely because you can call get_time_remaining on your context object to determine when your process is going to time out and when it's safe to retry.

Generally, if you're working in the cloud, your lock may be related to some managed service. In my network-devices example, these were being updated by AWS SWF workflows. So the workers stored their workflow ID in the lock and could clear a dead lock by querying SWF for the status of the workflow.

And in the heavy computation case, the workers are tasks in an EC2 container service cluster. Determining the task from within it is a bit tricky, but once you have it, you can call describe_task to determine if it has died and break a dead lock that way. Since I have a frontend API, its job is to clear that dead lock on a 'start work' request, and the 'check work' call will update the work record to note 'hard failure' and report the error back. The workers can then assume that the frontend has cleared any dead locks for them.
