igal/gist:983749

# Resque durability

Resque (https://github.com/defunkt/resque) is nice, but doesn't provide durability. When a worker "reserves" a job, it actually just pops it from the data store, which deletes the job. If anything happens to the worker after it pops the job, the job is lost forever. However, the author of Resque, defunkt, doesn't want reservation or retries added, and has rejected such patches in the past. Therefore, this may not be worth doing with Resque unless someone wants to maintain a fork of it forever.
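The following minimal simulation in plain Ruby shows the failure mode. It assumes a plain `Array` can stand in for the Redis list behind one queue. Resque itself talks to Redis. The `work_off` method and the payload are hypothetical.

```ruby
# Assumption: a plain Array stands in for the Redis list behind one queue.
queue = ['{"class":"SendEmail","args":[42]}']

# Hypothetical worker step: "reserving" a job is just a destructive pop.
def work_off(queue)
  job = queue.shift          # the job now exists only in this worker's memory
  raise "worker killed"      # simulate a crash before the job finishes
  job
end

begin
  work_off(queue)
rescue RuntimeError
  # Nothing to recover: the queue no longer holds the job and no other
  # record of it exists anywhere.
end

puts queue.empty?   # => true, the job is gone for good
```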
## Background


- https://github.com/defunkt/resque/issues/16: where defunkt writes "Resque is explicitly designed to never re-try jobs. Ever, under any circumstance." and "If you need jobs to never fail and never slip through the cracks due to failure you may want [something else]".
- https://github.com/defunkt/resque/issues/93: where defunkt writes "Resque does not advertise itself as a system that will never lose your jobs. Part of the design is we don't care if jobs are lost."
- https://github.com/defunkt/resque/pull/165: where defunkt rejects a patch that does basic reserve/retries, saying "Resque doesn't do retries on purpose".
- https://github.com/tobowers/resque/commits/rpoplpush: a rejected patch containing a different solution. My issue with it is that each worker must have its own uniquely named queue to store in-progress work in, workers are responsible for re-enqueueing their own incomplete jobs when they are restarted, and each worker can process jobs from only one queue.

## Possible solution

A UUID is added to each job to make it unique and trackable. Workers record the jobs they have accepted and completed in Redis lists, so each job's progress can be tracked. A new Nanny daemon watches the lists of accepted and completed jobs. It retries jobs whose timeout has expired when appropriate and fails jobs that have been retried too many times.
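The first step is tagging each job with a UUID. Resque stores a job as a JSON object with `"class"` and `"args"` keys. This sketch adds an extra `"uuid"` key through a `build_payload` helper. Both the key and the helper are assumptions for illustration and are not part of Resque.

```ruby
require "json"
require "securerandom"

# Hypothetical helper: a Resque-style payload plus a "uuid" key, so two jobs
# with identical class and args can still be told apart and tracked.
def build_payload(klass, *args)
  { "class" => klass, "args" => args, "uuid" => SecureRandom.uuid }
end

first  = build_payload("SendEmail", 42)
second = build_payload("SendEmail", 42)

puts JSON.generate(first)               # the payload now carries its own identity
puts first["uuid"] == second["uuid"]    # => false: same work, distinct jobs
```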
## New data structures


- `"accepted:#{queue_name}"` list: jobs workers have accepted
- `"completed"` list: jobs workers have completed
- `"expirations:#{queue_name}"` sorted set: jobs, with their expiration times as scores
- `"retries"` hash: jobs and how many times each has been retried
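The in-memory model below shows how the four keys relate. It assumes a `Hash` can stand in for the Redis keyspace, with Arrays for the two lists and Hashes for the sorted set (member to score) and the retries hash. The `"email"` queue, the job strings, and the timestamps are made up. The last lines show why `expirations` is a sorted set: finding overdue jobs becomes a range query on the score.

```ruby
# Key helpers for the per-queue structures (key names from the proposal).
def accepted_key(queue_name)
  "accepted:#{queue_name}"
end

def expirations_key(queue_name)
  "expirations:#{queue_name}"
end

# Assumption: a Hash stands in for the Redis keyspace.
store = {
  accepted_key("email")    => [],  # LIST: jobs workers have accepted
  "completed"              => [],  # LIST: jobs workers have completed
  expirations_key("email") => {},  # ZSET: job => expiration time (score)
  "retries"                => {}   # HASH: job => times retried
}

now = 1_000
store[expirations_key("email")]["job-a"] = now - 5   # past its deadline
store[expirations_key("email")]["job-b"] = now + 60  # still within its timeout

# Same idea as: ZRANGEBYSCORE expirations:email -inf 1000
overdue = store[expirations_key("email")].select { |_job, at| at <= now }.keys
puts overdue.inspect   # => ["job-a"]
```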

## Specification


### Job

- should instantiate with a UUID
- should create with a UUID
- should destroy job including its UUID
- should record completion when
  - it succeeded
  - it failed

### Resque

- should push new job into queue (with `lpush`)
- should pop accepted job (from 'queue' to 'accepted' list with `rpoplpush`)
- should record 'completed' job
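The push/pop pair can be modeled with Arrays to show the data flow. This sketch treats index 0 as the list head, where `LPUSH` inserts, and the last element as the tail, where `RPOPLPUSH` takes from. `ListStore` and the `queue:email` key are hypothetical. In Redis the move is a single atomic command, so a job always sits in either the queue or the accepted list and never only in a worker's memory. This Ruby model is not atomic across threads.

```ruby
# Hypothetical Array-backed model of the Redis lists used here.
# Index 0 is the head (LPUSH side); the last element is the tail.
class ListStore
  def initialize
    @lists = Hash.new { |h, k| h[k] = [] }
  end

  def lpush(key, value)
    @lists[key].unshift(value)
  end

  # Models RPOPLPUSH src dst: take the tail of src, put it at the head of dst.
  # (Redis does this atomically; this model does not.)
  def rpoplpush(src, dst)
    value = @lists[src].pop
    @lists[dst].unshift(value) unless value.nil?
    value
  end

  def items(key)
    @lists[key].dup
  end
end

redis = ListStore.new
redis.lpush("queue:email", "job-1")
redis.lpush("queue:email", "job-2")

job = redis.rpoplpush("queue:email", "accepted:email")  # oldest job first
redis.lpush("completed", job)                           # worker reports completion

puts job                                    # => job-1
puts redis.items("queue:email").inspect     # => ["job-2"]
puts redis.items("accepted:email").inspect  # => ["job-1"]
```

Since Redis 6.2, `RPOPLPUSH` is deprecated in favor of `LMOVE` with the `RIGHT` and `LEFT` arguments, which performs the same move.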


### Nanny

#### When started as a daemon

- should process accepted jobs
- should process completed jobs
- should process expired jobs

#### When specifying timeouts

- should use the specific timeout assigned to a queue
- should use the default timeout for a queue without its own timeout

#### When processing 'accepted' jobs

- should create 'retries' entry for new job
- should increment 'retries' entry for retried job
- when job is retried too many times
  - should create 'failure' entry
  - should remove 'retries' entry
  - should remove 'accepted' entry
- when job hasn't exceeded retries limit
  - should create 'expirations' entry
  - should remove 'accepted' entry

#### When processing 'completed' jobs [1]

- should remove 'expirations' entry
- should remove 'retries' entry
- should remove 'completed' tracking

#### When processing expired jobs

- when job has completed (was found in 'completed' list)
  - should treat it just like a normal completed job [see 1]
- when job hasn't completed (wasn't found in 'completed' list)
  - when job has retries left
    - should increment 'retries' hash entry
    - should re-add job to appropriate 'queue' list
    - should remove from 'expirations' sorted set
  - when job has no retries left
    - should create 'failure' list entry
    - should treat it just like a normal completed job [see 1]
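The sketch below follows the expired-job branches above. Plain Ruby objects stand in for Redis, and a job is represented by a string. `MAX_RETRIES`, the "has retries left" test (`count < MAX_RETRIES`), and the `finish` and `process_expired` names are assumptions.

```ruby
# Hypothetical sketch of the Nanny's pass over "expirations:#{queue_name}".
MAX_RETRIES = 3

# The "normal completed job" handling [see 1].
def finish(job, s)
  s[:expirations].delete(job)
  s[:retries].delete(job)
  s[:completed].delete(job)
end

def process_expired(s, queue_name, now)
  overdue = s[:expirations].select { |_job, at| at <= now }.keys
  overdue.each do |job|
    if s[:completed].include?(job)
      finish(job, s)                                  # it finished after all
    elsif s[:retries].fetch(job, 0) < MAX_RETRIES
      s[:retries][job] = s[:retries].fetch(job, 0) + 1
      s[:queues][queue_name].unshift(job)             # LPUSH back onto its queue
      s[:expirations].delete(job)
    else
      s[:failures] << job                             # out of retries
      finish(job, s)
    end
  end
end

s = {
  expirations: { "j1" => 990, "j2" => 990, "j3" => 990, "j4" => 5_000 },
  retries:     { "j1" => 0, "j2" => 1, "j3" => 3 },
  completed:   ["j1"],
  queues:      Hash.new { |h, k| h[k] = [] },
  failures:    []
}
process_expired(s, "email", 1_000)

p s[:queues]["email"]    # => ["j2"]
p s[:failures]           # => ["j3"]
p s[:expirations].keys   # => ["j4"]
```

Read literally, the spec increments the retry counter both here and again when the re-queued job is next accepted. An implementation would probably need to choose one of those two places to count.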