Backpressure Backfills

When backfilling millions of records, it becomes dangerous to try to update them all at once by hand. This technique mitigates some of that danger by creating a testable script that is sensitive to the current queue depth. It's most useful for data backfills, but it can be applied to any type of Sidekiq work that requires millions of jobs.

The goal of this technique is to allow running backfills at any time, including peak hours, without stressing the Procore infrastructure.

Goals

  • Maximize dispersion of work - we want to spread the work over time in a way that responds to the amount of strain currently on Procore
  • Minimize heavy database use - if the database is getting hammered, it can cause issues with locks and queuing
  • Decrease variability - we want the results of the backfill to be as predictable as possible
Maximize dispersion of work

There are two ways to approach this:

  1. Run your migration during off hours only, watch the infrastructure using DataDog, and do the migration by hand. This speeds up the migration, but you risk something going wrong and have to constantly monitor both the infrastructure and your migration.

  2. Use Sidekiq and the tooling provided to roughly monitor the current load on Procore and only enqueue jobs when that load is low.

The second approach is the better one, and it is the technique described here.
Minimize heavy database use

Assuming you are using Sidekiq, you will hit the database both when you create your jobs and when you update the records you are backfilling. For the reads, you can use the follower (read replica) to fetch the data you'll need to create your jobs.
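
As a rough illustration, here is a minimal sketch of collecting the IDs for the backfill from the follower. It assumes a Rails 6+ multi-database setup with a :reading role configured; the model and column names are made-up placeholders, and the actual follower helper in your app may differ.

# Hedged sketch: read the IDs to backfill from the read replica so the
# primary database isn't touched while building the job list.
def ids_to_backfill
  ActiveRecord::Base.connected_to(role: :reading) do
    # SomeModel and needs_backfill are illustrative placeholders
    SomeModel.where(needs_backfill: true).pluck(:id)
  end
end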

Decrease variability

By using Sidekiq and meeting the other two goals, we put ourselves in a good position with respect to the risk of the backfill blowing up. As a general guideline, a backfill should be well tested and should be safe to stop at any time. Only enqueuing jobs when the queue is shallow lets us be reasonably sure that the backfill won't overwhelm the infrastructure: if jobs are hanging or taking a long time to complete, no more jobs get added.

Putting it all together

THE SCRIPT
THE WORKER
Creating a script to queue our workers allows us to sample backpressure while we fill the queue and to stop the queue fill at any time. The first thing to look at is checking the queue depth:

def ready_to_enqueue?
  # Total number of jobs currently waiting across all Sidekiq queues
  queue_depth = Sidekiq::Queue.all.map(&:size).sum
  queue_depth < SOME_REASONABLE_QUEUE_DEPTH
end

where SOME_REASONABLE_QUEUE_DEPTH is somewhere around 100-1000. Then we sleep until the queue has drained below that threshold:

def wait_for_backpressure
  loop do
    break if ready_to_enqueue?
    sleep SECONDS_TO_WAIT_FOR_BACK_PRESSURE
  end
end

In this worker we also use Sidekiq batches (a Sidekiq Pro feature). With batches we can specify a callback to run when all the jobs complete, and we can monitor or cancel individual jobs.

The important parts of the batch are getting the job arguments:

def get_job_args
  [
    ["job 1 arg 1", "job 1 arg 2"],
    ["job 2 arg 1", "job 2 arg 2"],
  ]
end

And then using Sidekiq::Client.push_bulk to fill the queue while waiting for backpressure:

def perform
  batch.jobs do
    get_job_args.each_slice(BATCH_SIZE) do |args|
      wait_for_backpressure
      Sidekiq::Client.push_bulk(
        'class' => WorkerClass,
        'args' => args
      )
    end
  end
end
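
For reference, here is a minimal sketch of what the per-record worker referenced as WorkerClass might look like. The model and column names are illustrative; the important property is that the update is idempotent, which is what makes it safe to stop and restart the backfill at any point.

# Hedged sketch of the per-record backfill worker. Each job receives one
# argument list produced by get_job_args above.
class WorkerClass
  include Sidekiq::Worker
  sidekiq_options retry: 3

  def perform(record_id, new_value)
    # SomeModel and some_column are illustrative placeholders
    record = SomeModel.find_by(id: record_id)
    return if record.nil? || record.some_column == new_value

    record.update!(some_column: new_value)
  end
end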

And then in the callback class we define a method to run after all the jobs have finished:

class BackfillCallback
  def on_complete(status, options)
    # do something
  end
end
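
For completeness, here is a minimal sketch of how the script might create the batch, register the callback, and kick off the queue-filling worker whose perform method is shown above. Sidekiq::Batch is a Sidekiq Pro feature, and BackfillEnqueuerWorker is an illustrative name rather than anything defined in this gist.

# Hedged sketch: wire the callback to the batch and start the backfill.
batch = Sidekiq::Batch.new
batch.description = 'Backfill records with backpressure'
batch.on(:complete, BackfillCallback)
batch.jobs do
  BackfillEnqueuerWorker.perform_async
end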

Summary

With all these pieces in place, you can be relatively sure that your backfill won't overwhelm the infrastructure, even if you have to run it at peak hours. And because the worker is well tested and runs from a script, you can leave it running all night and be confident it won't break anything.
