@andrewle
Forked from ryandotsmith/process-partitioning.md
Created April 16, 2012 02:06
Leader Election With Ruby & PostgreSQL

The Problem

I have a table with millions of rows that needs processing. The type of processing is not really important for this article. We simply need to process some data with Ruby, or any other user-space programming language.

A Solution

My environment already depends on Ruby & PostgreSQL, so I want a solution that leverages my existing technologies. I also don't want to create any table other than the one I need to process. Consequently, solutions like queue_classic are off the table. Thus I have devised the following algorithm:

  • Ensure the table has a serial integer column.
  • Each Ruby process will take a unique integer.
  • The process's integer will be at least 0 and strictly less than the max number of running processes.
  • The process will only work rows such that the id of the row mod the max number of running processes equals the process's unique integer.

This algorithm ensures that no two workers will attempt to process the same rows, because every id has exactly one remainder when divided by the number of workers. This reduces contention on our table and allows greater throughput than if we were only running a single process.
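The partitioning rule in the list above can be checked without a database. The sketch below is only an illustration: `partition_for`, the worker count of 3, and the sample ids are hypothetical and not part of the original code. It groups a handful of ids by `id % max_workers` to show that each id lands in exactly one worker's partition.

```ruby
# Hypothetical helper: the worker number responsible for a given row id.
def partition_for(id, max_workers)
  id % max_workers
end

max_workers = 3
ids = (1..10).to_a

# Group the sample ids by the worker that would claim them.
partitions = ids.group_by { |id| partition_for(id, max_workers) }

partitions.sort.each do |worker, rows|
  puts "worker #{worker}: #{rows.inspect}"
end
# worker 0: [3, 6, 9]
# worker 1: [1, 4, 7, 10]
# worker 2: [2, 5, 8]
```

Because the remainder is always in the range 0 to `max_workers - 1`, worker numbers must come from that same range, otherwise some rows would never be picked up.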

The Code

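require "sequel"

# Assumption: DB is a Sequel database handle and DATABASE_URL points at
# the PostgreSQL database that holds table t.
DB = Sequel.connect(ENV.fetch("DATABASE_URL"))

MAX_WORKERS = 10
worker = nil

# Claim the first free slot. pg_try_advisory_lock returns true if this
# session got the lock and false if another session already holds it.
MAX_WORKERS.times do |i|
  if DB["SELECT pg_try_advisory_lock(?)", i].get
    worker = i
    break
  end
end

if worker
  # Work only the rows in this worker's partition: id mod MAX_WORKERS == worker.
  # process is the application's own per-row handler and is not defined here.
  DB["SELECT * FROM t WHERE MOD(id, ?) = ?", MAX_WORKERS, worker].each do |r|
    process(r)
  end
else
  puts("unable to work. increase MAX_WORKERS")
end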

Make this file executable and you will be able to run it in 10 separate processes, all of which work in parallel.

The elegant feature of this code is the use of PostgreSQL's advisory lock (a lightweight locking utility). Each PostgreSQL session is eligible to lock an integer key. When the PostgreSQL session is disconnected, the lock is released and the key becomes available to another session.
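A worker that finishes early does not have to wait for its session to end before giving its key back. The sketch below is one possible way to do that, not the article's method: `with_advisory_lock` is a hypothetical helper, and `db` is assumed to behave like a Sequel database handle (`db[sql, *args].get` returns the first column of the first row, and `db.synchronize` pins one connection for the block). It takes the lock with `pg_try_advisory_lock` and releases it with `pg_advisory_unlock` on the same connection, since the lock belongs to the session that acquired it.

```ruby
# Hypothetical helper. Runs the given block only if the advisory lock for
# `key` could be acquired. Returns true if the block ran, false otherwise.
def with_advisory_lock(db, key)
  # Pin one connection so that lock and unlock happen in the same session.
  db.synchronize do
    acquired = db["SELECT pg_try_advisory_lock(?)", key].get
    return false unless acquired

    begin
      yield
    ensure
      # Release explicitly instead of waiting for the session to disconnect.
      db["SELECT pg_advisory_unlock(?)", key].get
    end
  end
  true
end

# Usage, assuming DB is a connected Sequel database:
#   with_advisory_lock(DB, 3) { puts "working as worker 3" }
```

If the process crashes before the `ensure` clause runs, PostgreSQL still releases the lock when the connection closes, so the explicit unlock is a courtesy rather than a requirement.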

Note that the key space is shared by every session connected to the same PostgreSQL database. If you use advisory locks for more than one purpose, two unrelated pieces of code can end up contending for the same integer key.
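One way to reduce the risk of key collisions is PostgreSQL's two-argument form, `pg_try_advisory_lock(key1, key2)`, which takes two 32-bit integers. The sketch below uses the first as a namespace for this job and the second as the worker number. `LOCK_NAMESPACE`, its value, and `worker_lock_args` are hypothetical names chosen for illustration.

```ruby
# Hypothetical namespace for this job. Any 32-bit integer that no other
# advisory lock user in the same database uses as its first key.
LOCK_NAMESPACE = 4242

# Builds the SQL and bind arguments for claiming a worker slot inside
# the namespace.
def worker_lock_args(worker)
  ["SELECT pg_try_advisory_lock(?, ?)", LOCK_NAMESPACE, worker]
end

sql, *args = worker_lock_args(3)
puts sql           # SELECT pg_try_advisory_lock(?, ?)
puts args.inspect  # [4242, 3]

# With a Sequel handle this could be run as: DB[sql, *args].get
```

The one-argument and two-argument key spaces do not overlap, so locks taken this way would not collide with code that still uses the single-integer form.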

Links

Author

Ryan Smith

@ryandotsmith, ace hacker, builds distributed systems at Heroku.

This article was motivated by the many successes and failures experienced with production systems at Heroku.

Last updated: 2012-04-12
