Distributed Locking With DynamoDB


This article makes the case for DynamoDB as an excellent choice for a distributed locking service, while briefly exploring the what, why & how of locking.

Why DynamoDB is a good choice for locking

To build a lock, all you need is access to a file system. To build a distributed lock, all you need is network access to that file system. To build a highly available, distributed lock, all you need is a highly available, network-resilient, durable file system. It is this last requirement that makes your choice of file system difficult. There are several popular services that will satisfy one or all of the aforementioned scenarios:

  • Doozerd
  • DynamoDB
  • MySQL
  • PostgreSQL
  • Redis
  • Zookeeper

The data stores in the list are far from equal; their feature sets vary widely. What makes DynamoDB stand out is the fact that it is highly available and easily accessible. You can set up a highly available, distributed lock service in minutes. DynamoDB offers a simple pair of primitives that make distributed locking easy:

  • Create item unless exists
  • Delete item

Since DynamoDB offers conditional writes, we can attempt to acquire a lock by creating an item unless that item already exists. Once we are done with the lock, we delete the item.
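A minimal sketch of these two primitives in Python using the boto3 low-level client. The table name (`locks`) and key attribute (`lock_id`) are assumptions for illustration, not something the article prescribes; the client is passed in so the functions stay testable:

```python
def acquire_lock(client, table, lock_id):
    """Create the lock item unless it exists.

    Returns True if the lock was acquired, False if it is already held.
    """
    try:
        client.put_item(
            TableName=table,
            Item={"lock_id": {"S": lock_id}},
            # The write succeeds only if no item with this key exists.
            ConditionExpression="attribute_not_exists(lock_id)",
        )
        return True
    except Exception as err:  # botocore raises ClientError on a failed condition
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise


def release_lock(client, table, lock_id):
    """Delete the lock item, freeing it for the next acquirer."""
    client.delete_item(TableName=table, Key={"lock_id": {"S": lock_id}})
```

A second `acquire_lock` call for the same `lock_id` fails the condition check and returns `False` until `release_lock` removes the item.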

Gotchas

When using DynamoDB, we trade lock expiration robustness for ease of use. This tradeoff is manifested when we consider what would happen if a processor acquires a lock, then is partitioned from DynamoDB. What is the processor to do? Detecting and deleting stale locks is a difficult problem. Solutions like Doozerd & Zookeeper use complicated consensus algorithms to agree that a lock has been abandoned before deleting it. However, there are plenty of use cases which do not require this level of sophistication.

One workaround is to set a TTL on the lock item in DynamoDB. Subsequent attempts to acquire the lock will check the TTL and preempt an abandoned lock. This implies that the process that acquired the lock needs to terminate execution before the TTL elapses. This can be achieved by wrapping your critical section in a timeout.
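The TTL check folds naturally into the conditional write: acquire succeeds if no lock exists *or* the existing lock's expiry has passed. A sketch, again with assumed attribute names (`lock_id`, `expires` as epoch seconds) and a client passed in:

```python
import time


def acquire_lock_with_ttl(client, table, lock_id, ttl_seconds):
    """Create the lock unless a live one exists; preempt it if its TTL lapsed.

    Returns True on acquisition, False if a non-expired lock is held.
    """
    now = int(time.time())
    try:
        client.put_item(
            TableName=table,
            Item={
                "lock_id": {"S": lock_id},
                "expires": {"N": str(now + ttl_seconds)},
            },
            # Succeed if no lock exists, or the existing lock has expired.
            ConditionExpression="attribute_not_exists(lock_id) OR expires < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except Exception as err:  # botocore raises ClientError on a failed condition
        code = getattr(err, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise
```

Note that this preemption is only safe if every lock holder really does stop work before its own TTL elapses, which is what the timeout wrapper enforces.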

How you can start using DynamoDB as a lock

The recipe for using DynamoDB as a locking service is quite simple; you do not need anything but an HTTP client to take full advantage of locking on top of DynamoDB. The basic algorithm is:

1. Prune expired locks (see Gotchas)
2. Create item unless it exists
3. Yield to the critical section inside of a timeout
4. Delete the item
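A runnable sketch of the recipe, assuming a SIGALRM-based timeout (Unix only) and taking the acquire/release operations as plain callables so the DynamoDB calls stay out of the way. Pruning can be folded into the acquire callable itself:

```python
import signal
from contextlib import contextmanager


@contextmanager
def critical_section(acquire, release, timeout_seconds):
    """Acquire the lock, run the critical section under a hard
    timeout, then delete the lock — steps 2 through 4 of the recipe."""
    if not acquire():
        raise RuntimeError("lock is held by another process")

    def on_timeout(signum, frame):
        raise TimeoutError("critical section outlived the lock TTL")

    previous = signal.signal(signal.SIGALRM, on_timeout)
    signal.alarm(timeout_seconds)
    try:
        yield
    finally:
        signal.alarm(0)                        # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
        release()                              # always delete the lock
```

Usage is then `with critical_section(acquire, release, 30): ...`, where the timeout matches the TTL stored on the lock item.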

If you are looking for a codified recipe, client libraries exist for several languages.

What's a lock

Locking is another word for incorporating mutual exclusion into a set of processes executing in parallel. Mutual exclusion was described by E. W. Dijkstra and summarized as: the problem of ensuring that no two processes or threads can be in their critical section at the same time. The critical section is any section of code that assumes no other code is in the same section at the same instant.

For instance:

x = get_global_count
x = x + 1

Imagine if 2 processes were executing the previous code at the same instant. The CPU might allow the 1st process to execute the first line, then pause the 1st process to allow the 2nd process to execute the first line. It is now impossible for either process to accurately compute the global count. Thus we are motivated to build a system in which the 1st process can lock the code so it can execute both lines with the confidence that no other process is manipulating the global count.

lock
x = get_global_count
x = x + 1
unlock
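The race above is easy to reproduce and fix with a language-level lock. A minimal Python sketch (the thread and iteration counts are arbitrary):

```python
import threading

count = 0
lock = threading.Lock()


def increment(times):
    global count
    for _ in range(times):
        with lock:         # "lock" ... "unlock" from the pseudocode
            x = count      # x = get_global_count
            count = x + 1  # x = x + 1


threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(count)  # 40000; without the lock the result is unpredictable
```

Because the read-modify-write happens entirely inside the lock, the four threads always sum to exactly 40,000.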

Why you would use a lock

The previous example demonstrated a classic scenario which requires local locking. DynamoDB is not the best choice for that type of locking; it is likely that your programming language offers locking APIs that are much more suitable. However, there are a plethora of good examples for which DynamoDB is an excellent choice for locking. Automated DNS configuration is one such example:

lock("herokuapp.com.")
dns_records = get_records
if !includes(dns_records, "foo.herokuapp.com.") {
  create_record("foo.herokuapp.com.")
}
unlock("herokuapp.com.")

In this example, we have a process that is updating DNS records. Presumably, some other service is providing "foo.herokuapp.com." as an input to the process. Thus we have at least 3 network-independent processes at play. In addition to this example, other scenarios include process partitioning, failing over a Redis master, aggregating data across shards, etc.

Conclusion

DynamoDB is not the most advanced locking service. However, DynamoDB offers a lot of features at a great price. Most importantly, DynamoDB is offered as an Amazon service, which spares the engineer from maintaining a cluster of Zookeepers or Doozerds. Whenever we can trade a feature for a dramatic reduction in operational burden, we should.

Updates

  • 2012-10-21 - Inception

Author

Ryan Smith

@ryandotsmith, ace hacker, builds distributed systems at Heroku.

This article was motivated by many successes and failures experienced with production systems at Heroku.

Comments

I apologize for my naivety, but I'm having trouble thinking of concrete scenarios where a locking service would be required. The DNS example given above seems like it could be solved in other ways. For example, a distinct queue of changes to the domain could be applied by multiple workers at any concurrency. Am I missing something there?

However, assuming that I do need a locking service you make a very strong case for DynamoDB.

@mikehale In our application we store some auxiliary data in a serialized format. When we update that data, we use a similar locking mechanism to ensure that simultaneous writes don't overwrite each other's changes.

> This implies that the process that acquired the lock needs to terminate execution past the TTL. This can be achieved by wrapping your critical section in a timeout.

Google "juliet pause," where a worker is seen to be dead by the rest of the system but is in fact still alive. It can result in two threads running inside the critical section at the same time, and can be caused by various factors including high system load, faulty hardware, and garbage collection pauses. The solution (which is hard to implement) is to deny access to the underlying resource by non-lock holders. In ZooKeeper, for example, the "lock" (an ephemeral znode) is expired at the same time the session is broken (so further writes will not be accepted).

@drevell I agree with your concern. Any implementation that is not based on Paxos will be incomplete. However, if you can relax the constraints a bit, using a TTL on the lock and wrapping workers in a hard timeout for the duration of the TTL can get you a long way.
