rponte/taskq.md
Forked from tef/taskq.md
Last active September 19, 2020
THEORY: Implementing, rewriting, and rebuilding a task queue.


In every modern desktop application, there is a notion of foreground and background work, be it running a thread in the background, another process, or even something like a web worker. Web services have a similar notion, which I'm going to call interactive vs batch processing. I want to talk about the latter, but it's easier to describe as the opposite of interactive.

When I say interactive, I mean work done in response to a user's request, where the user expects a timely response. Web servers are the easiest example: responses should happen as quickly as possible, and requests are rarely worth retrying either. People love the refresh button.

By batch processing, I mean the work that happens in the background, often overnight, which can be collected together and processed in larger chunks for throughput. You can think of an interactive process as one that needs low-latency responses, and a batch process as one that needs throughput, but it isn't always this straightforward. Some batch processes run on a schedule, some run on events, and some can't be bundled together.

A batch process might be best defined as the thing users can't see running but can see failing, though it's easier to start with a more concrete example.

I have a website that lets users upload a picture, and I need to resize it for use later on. In the beginning everything ran inside the webserver, but as the site grew, requests started to fail. I could buy a larger webserver, or run several more with CPU room to spare, but I do what almost everyone else does: I hack up a task queue.

The first task queue is usually over a database

A task queue roughly means putting a list of jobs to do somewhere, then running workers to process that list, one by one. In this case, we cheat a little. The webserver handles uploading the images, but doesn't bother to resize them. A worker scans the database, looking for entries that have an image but no resized image, and processes them.
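The scan-the-database worker can be sketched in a few lines. This is a minimal sketch, not a production design: the `images` table schema, the `scan_and_resize` and `resize` names are all hypothetical, and `resize` is a stand-in for real image processing.

```python
import sqlite3

def resize(data):
    # Placeholder for real image processing.
    return data.lower()

def scan_and_resize(db_path):
    # Hypothetical schema: images(id, original, resized), where
    # resized is NULL until a worker has processed the row.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, original FROM images WHERE resized IS NULL"
    ).fetchall()
    for image_id, original in rows:
        conn.execute(
            "UPDATE images SET resized = ? WHERE id = ?",
            (resize(original), image_id),
        )
        conn.commit()
    conn.close()
```

The "queue" here is nothing more than a `WHERE resized IS NULL` predicate; the database itself records which work is done.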

For some workloads, this works reasonably well. A single worker taking a large batch of images to resize, rather than one by one, won't cause much trouble, but it might not be fast enough. Adding more workers means more contention, and eventually you hit a saturation point: adding more workers now slows down the system rather than speeding it up.

This rudimentary solution is good enough for a demo, but it rarely survives in production. It's the reason people say "Don't put a queue in a database". Several workers hammering over the most recent item creates a massive load spike on the system. High-volume updates, writes, or ephemeral data aren't well served by a database.

To scale it up, I could run more databases, I could make the workers use larger batch sizes to reduce contention, but instead I decide to add a queue.

The second task queue is usually a queue

Like before, the webservice handles the image upload, but now it sends a message into a queue somewhere within the system. Instead of scanning the database, workers poll the queue, racing each other to get the next item on the list. This fixes the scaling problem, but now you have a lot more errors to handle.
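The polling-and-racing behaviour can be shown with an in-process queue standing in for a real broker. Assumptions here: `queue.Queue` plays the role of the broker, `resized:` strings stand in for real work, and three threads stand in for worker processes.

```python
import queue
import threading

tasks = queue.Queue()      # stand-in for a real broker (Redis, RabbitMQ, ...)
results = []
results_lock = threading.Lock()

def worker():
    # Poll the queue, racing the other workers; give up when it stays empty.
    while True:
        try:
            name = tasks.get(timeout=0.1)
        except queue.Empty:
            return
        with results_lock:
            results.append(f"resized:{name}")
        tasks.task_done()

for name in ("a.png", "b.png", "c.png"):
    tasks.put(name)

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each item is handed to exactly one worker, which is the property that fixes the contention of the database scan.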

Before, when an image didn't resize, or a worker crashed, the problem was small enough to fix by hand. Now, the slower the fix, the bigger the problem. When a worker crashes out, the queue starts to fill up, eventually uses all available memory, and goes offline. The error finally surfaces as requests to the website start breaking.

The first thing you do when you add a queue is add a dashboard, or a graph somewhere, and check it to see how healthy the queue is. The second thing is usually some error handling: a timeout on workers to stop them crashing out, and something to restart them automatically.

When it comes to failure handling, I have several more choices ahead. I could run multiple queues in parallel, and let the workers poll one, or the other. This works when neither queue is overloaded, or one queue can handle the load of the other, but fails catastrophically otherwise. Although now less likely, workers can still crash out, and a backlog big enough to crash one queue is big enough to crash every queue the load is redirected to.

In the end, the queue has to be able to drop old or new work to get some of the work done. Load shedding, or dropping old work, is more useful when new work is more relevant. Backpressure, or dropping new work, is far easier, and the software trying to enqueue items can make an informed decision about how to handle the failure.

Rate limiting becomes a vital part of a healthy system, trying to prevent clots from impeding progress. The problem is that the things we do to fix an unhealthy system often make it worse. If backpressure isn't implemented all the way through a system, it will explode in the gaps. Someone will turn up the number of workers when a queue gets big, accidentally flooding the rest of the system downstream.

Making the queue bigger only means things spend more time in queues. The system is now slower, and when it does crash again, it's even more spectacular. Week-long backlogs of data need to be massaged through the system to rebuild things, or you just abandon the mess and move on. Sometimes, it's easier to say sorry and shrug.

Without rate limiting, backpressure, or load shedding, the only solution is overprovisioning: making sure you always have the excess capacity to handle spikes in load. There isn't really an alternative; if you don't implement backpressure, you still feel the effects, because backpressure is how slowly you DDoS yourself. This is one of the few things that modern web services and stream engines have in common: it has taken decades to build one that doesn't explode at random.

Once we've tempered the queues to handle excess load, and made our choices about which behaviours we accept under load (loss of new or old data), we can take steps to handle them elsewhere. The other thing we need to decide is how to handle a different sort of error: a failed task that could be retried later.

The task queue becomes a pipeline

Without abandoning the image resizing example, let's move on to uploading images to a third party service. The service is up most of the time, but occasionally it fails and we'd like to retry sending these requests later on.

With the queue, we have two options: put it back in the same queue, or put it in another queue to handle later, perhaps by hand. These aren't mutually exclusive, but they create other problems. Retries can still get dropped under load, or worse, retries create the excess load in the queue, slowing it to a halt. Either way, we've begun to use the queue to store which state a task is in.
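Both options together look something like a retry counter plus a dead-letter queue. A sketch of that pattern, under stated assumptions: `main_q`, `dead_letters`, `process_once`, and `MAX_ATTEMPTS` are all illustrative names, and the task's attempt count is carried inside the message, which is exactly the state-smuggling the paragraph above warns about.

```python
import queue

main_q = queue.Queue()
dead_letters = queue.Queue()   # failures to handle later, perhaps by hand
MAX_ATTEMPTS = 3

def process_once(task, upload):
    # task is (payload, attempts); on failure, requeue or dead-letter it.
    payload, attempts = task
    try:
        upload(payload)
    except Exception:
        if attempts + 1 >= MAX_ATTEMPTS:
            dead_letters.put(payload)
        else:
            main_q.put((payload, attempts + 1))
```

Note that a retried task re-enters the same queue it came from, so a burst of failures turns directly into a burst of extra load.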

Instead, we go back to the first queue worker, and change the code a little. The worker does large, slow scans across the database like before, but now puts work onto a queue. It's less of a worker, and more of a message pump. We don't need to run as many of these workers to keep things running, and we shouldn't need to scale them as dramatically.

Without the pump, we'd never be able to empty a queue, and when recovering from failure, we'd have to replay each of the items stuck in the queue before being ready to handle new load. Making the queues persistent means that each and every queue has become a tiny database, and sometimes that state is divided across the network.

When things go wrong, there are duplicates everywhere. A task in a queue might have multiple different copies of itself ahead or behind it. A queue makes a good way to buffer work for distribution, but a terrible lock and a nightmarish database.

As our long running tasks grow more states, from 'waiting, done' to 'enqueued, active, error, complete', or 'creating, starting, online, offline, restarting, stopped, archived', the queues struggle to keep up. The third task queue is built around these difficult, long running processes.

Eventually, the task queue has a scheduler

A scheduler is a message pump that has taken on a little more responsibility: it tracks the state of the processes it is responsible for running. It might use a queue to distribute work, but it must use a database to persist state in some way.

Once tasks have multiple states, you can't just work off a list of tasks: you need an index across them to look up this data quickly, and to track which version is the most recent so as not to create race conditions when updating the state. When I say database, I don't necessarily mean MySQL or PostgreSQL, but I do mean some key-value table with atomic updates.
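The version-tracking trick is a compare-and-swap: the update only applies if nobody else has touched the row since we read it. A sketch against a hypothetical `tasks(id, state, version)` table; the `advance_state` name and schema are assumptions, not a real API:

```python
import sqlite3

def advance_state(conn, task_id, expected_version, old_state, new_state):
    # Atomic conditional update: succeeds only if the task is still in
    # the state and version we last read, avoiding a lost update when
    # two schedulers race to advance the same task.
    cur = conn.execute(
        "UPDATE tasks SET state = ?, version = version + 1 "
        "WHERE id = ? AND state = ? AND version = ?",
        (new_state, task_id, old_state, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1   # False: we raced and must re-read
```

A caller that gets `False` back re-reads the row and decides again, which is the whole control-loop pattern in miniature.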

The scheduler takes responsibility for the long-lived background processes, dispatching work to worker processes run elsewhere on the system, but this doesn't mean being much more than a message pump. A scheduler can take on other forms too.

Kubernetes' controllers are the same idea: break a long running process into different states, store that state somewhere, and build the long running process out of a loop that polls the database, takes any actions necessary, and updates the state if needed. Orleans has a similar idea: each long running process is represented as a persistent object, distributed across a pool of workers with a distributed hash table.

In both cases, a long running process is stored somewhere, and something else makes sure work is scheduled and dispatched to a worker. They might use some queues underneath, but they aren't being used to handle errors, but to distribute load.
