Turning PostgreSQL into a queue serving 10,000 jobs per second
RDBMS-based job queues have been criticized recently for being unable to handle heavy loads. And they deserve it, to some extent, because the queries used to safely lock a job have been pretty hairy. SELECT FOR UPDATE followed by an UPDATE works fine at first, but then you add more workers, and each is trying to SELECT FOR UPDATE the same row (and maybe throwing NOWAIT in there, then catching the errors and retrying), and things slow down.
On top of that, they have to actually update the row to mark it as locked, so the rest of your workers are sitting there waiting while one of them propagates its lock to disk (and the disks of however many servers you're replicating to). QueueClassic got some mileage out of the novel idea of randomly picking a row near the front of the queue to lock, but I can't still seem to get more than an an extra few hundred jobs per second out of it under heavy load.
So, many developers have started going straight to Redis-backed queues (Resque, Sidekiq) or dedicated queues (beanstalkd, ZeroMQ...), but I see these as suboptimal solutions - they're each another moving part that can fail, and the jobs that you queue with them aren't protected by the same transactions and atomic backups that are keeping your precious relational data safe and consistent. Inevitably, a machine is going to fail, and you or someone else is going to be manually picking through your data, trying to figure out what it should look like.
Earlier this week I released Que, a Ruby gem that uses PostgreSQL's advisory locks to manage jobs. There are many advantages to this design, but the most striking is that the queries that workers have to run in order to find a job to work no longer block one another. There's no locked_at column to update anymore, so they're all just simple SELECT statements (well, recursive CTE select statements, but still), and readers don't block readers. And, since Postgres 9.2 brought optimizations that let read throughput scale linearly with the number of cores, I was hoping that benchmarking things on a set of AWS' new compute-optimized instances would show a linear increase in queue throughput as the number of cores increased.
Que scalability by number of cores:
|machine||processes (4 worker threads each)||jobs/sec (synchronous commit)||jobs/sec (asynchronous commit)|
|c3.large (2 vCPUs)||3||1564||2407|
|c3.xlarge (4 vCPUs)||5||2792||3970|
|c3.2xlarge (8 vCPUs)||8||5092||6752|
|c3.4xlarge (16 vCPUs)||11||7270||9806|
(The optimal number of processes was determined for each core count through trial and error - these seemed to be the peaks of performance. All tests done with vanilla Postgres 9.3, with shared_buffers bumped to 512MB and random_page_cost lowered to 1.1 to reflect that these machines use SSDs. Unfortunately, I wasn't able to reserve a 32-core instance.)
As you can see, it's not quite linear behavior. My wild guess is that the bottleneck is the advisory lock system - it's not a terribly popular feature, so it makes sense that nobody would invest the time to optimize it for heavy use - but I could be wrong.
Additionally, I did a second round of benchmarks comparing Que's locking mechanism alone (not any of the Ruby overhead) to those of DelayedJob and QueueClassic. Unfortunately, this benchmark is currently only able to run in a single process, so Ruby's GIL became the bottleneck for Que (and I believe for QueueClassic at the higher core counts and with synchronous commit turned off). I was also limited to 10 workers because DelayedJob didn't seem able to finish with any more than that. So these numbers don't represent Que's maximum throughput the way the above benchmark does, but it nicely shows the large difference between it and the other solutions:
DelayedJob, QueueClassic and Que scalability by number of cores
|machine||DJ jobs/sec sync, async||QC jobs/sec sync, async||Que jobs/sec sync, async|
|c3.large (2 vCPUs)||201, 138||127, 544||1459, 2035|
|c3.xlarge (4 vCPUs)||343, 345||267, 875||1892, 2763|
|c3.2xlarge (8 vCPUs)||202, 1185||323, 2535||1700, 2987|
|c3.4xlarge (16 vCPUs)||348, 1211||514, 2641||2051, 3089|
(The code for both of these benchmarks is in the Que repository.)
- EC2 performance is inconsistent, even with their newest instance types. I can't think of another explanation for the wild variations in DelayedJob's performance.
- If you don't want to move off of DelayedJob or QueueClassic just yet, SSDs help. These numbers are much better than what I got for them on a machine with a rotational disk (since Que doesn't write to disk when locking a job, it doesn't have the same issue). It'd be a good idea to put at least the write-ahead log on SSD, so that you don't have a bunch of workers sitting around waiting for another worker, which is in turn waiting for your disk to spin around.
- Que is fast! With the right hardware, you can process thousands of jobs per second with the same transactional guarantees as your data. If I had a big machine to run on that wasn't on EC2, I feel confident that Que could pass 10,000 jobs per second with the added safety of synchronous_commit.
I don't recommend that you try to run Que in production yet - the multi-threaded worker pool in particular is still a bit twitchy - but I'm very interested in comments and suggestions on it. My goal is to make it as reliable as possible while keeping its current speed.