@igal
Created May 20, 2011
Resque: possible way to add durability

Resque durability

Resque (https://github.com/defunkt/resque) is nice, but doesn't provide durability. When a worker "reserves" a job, it actually just pops it from the data store, which deletes the job. If anything happens to the worker after it pops the job, the job is lost forever. However, the author of Resque, defunkt, doesn't want reservation or retries added, and has rejected such patches in the past. Therefore, this may not be worth doing with Resque unless someone wants to maintain a fork of it forever.
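
The failure mode described above can be illustrated with a minimal sketch. It uses a plain Ruby Array as a stand-in for the data store rather than Resque's actual code, and `work_off` is a hypothetical name:

```ruby
# A plain Array stands in for the queue held in the data store.
queue = ['{"class":"SendEmail","args":[42]}']

# Hypothetical worker step: take a job, then die before finishing it.
def work_off(queue)
  _job = queue.shift        # the job now exists only in this worker's memory
  raise 'worker killed'     # simulated crash before the job finishes
end

begin
  work_off(queue)
rescue RuntimeError
  # Nothing is left to retry: the queue no longer holds the job.
end

lost = queue.empty?
```

After the simulated crash, `lost` is true: no copy of the job remains anywhere that another process could recover it from.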

Background

  1. https://github.com/defunkt/resque/issues/16 -- where defunkt writes "Resque is explicitly designed to never re-try jobs. Ever, under any circumstance." and "If you need jobs to never fail and never slip through the cracks due to failure you may want [something else]".
  2. https://github.com/defunkt/resque/issues/93 -- where defunkt writes "Resque does not advertise itself as a system that will never lose your jobs. Part of the design is we don't care if jobs are lost."
  3. https://github.com/defunkt/resque/pull/165 -- where defunkt rejects a patch which does basic reserve/retries by saying "Resque doesn't do retries on purpose".
  4. https://github.com/tobowers/resque/commits/rpoplpush -- a rejected patch that contains a different solution. My issue with it is that each worker must have its own uniquely named queue to store in-progress work in, workers are responsible for re-enqueueing their own incomplete jobs when they're restarted, and each worker can only process jobs from one queue.

Possible solution

A UUID is added to jobs to make them unique and trackable. Workers report jobs they accepted and completed in a way that can be tracked. A new Nanny daemon watches the lists of accepted and completed jobs, retries expired jobs when appropriate, and fails jobs that retry too many times.
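
As a rough illustration of why the UUID matters, here is a minimal sketch. It assumes a Resque-style payload with "class" and "args" fields; the "uuid" field name and the `build_payload` helper are hypothetical, not part of Resque:

```ruby
require 'securerandom'
require 'json'

# Hypothetical helper: a Resque-style payload with an added "uuid" field.
def build_payload(klass, *args)
  { 'class' => klass.to_s, 'args' => args, 'uuid' => SecureRandom.uuid }
end

# Without a UUID, two identical jobs serialize to the same string, so they
# cannot be told apart once they sit in tracking lists.
plain_a = JSON.generate({ 'class' => 'SendEmail', 'args' => [42] })
plain_b = JSON.generate({ 'class' => 'SendEmail', 'args' => [42] })
indistinguishable = (plain_a == plain_b)

# With a UUID, each enqueued job is unique and can be tracked individually.
a = build_payload('SendEmail', 42)
b = build_payload('SendEmail', 42)
encoded = JSON.generate(a)
```

Here `indistinguishable` is true for the plain payloads, while `a` and `b` carry different "uuid" values even though their class and arguments match.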

New data structures

  • "accepted:#{queue_name}" list -- jobs workers have accepted
  • "completed" list -- jobs workers have completed
  • "expirations:#{queue_name}" sorted set -- jobs with expiration times as scores
  • "retries" hash -- jobs and how many times they've been retried
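
To make the shapes above concrete, here is a small in-memory sketch. Ruby Arrays and Hashes stand in for the Redis list, sorted set, and hash types; the helper names, the "mail" queue, and the choice to store a bare job identifier are assumptions made for illustration only:

```ruby
# Hypothetical key helpers; the key names follow the list above.
def accepted_key(queue)
  "accepted:#{queue}"
end

def expirations_key(queue)
  "expirations:#{queue}"
end

# Plain-Ruby stand-ins for the Redis types.
store = {
  accepted_key('mail')    => [],          # list of accepted jobs
  'completed'             => [],          # list of completed jobs
  expirations_key('mail') => {},          # sorted set: job => expiration time
  'retries'               => Hash.new(0)  # hash: job => retry count
}

job     = 'job-1'   # stand-in for a job identified by its UUID
now     = 1_000     # current time in seconds (fixed here for clarity)
timeout = 60

store[accepted_key('mail')].unshift(job)            # like LPUSH
store[expirations_key('mail')][job] = now + timeout # like ZADD with a score
store['retries'][job] += 1                          # like HINCRBY

# Expired jobs are those whose score is at or before the given time,
# comparable to ZRANGEBYSCORE key -inf <time>.
expired_at = lambda do |time|
  store[expirations_key('mail')].select { |_, score| score <= time }.keys
end
```

Using the expiration time as the sorted-set score is what lets the Nanny find every overdue job with a single range query instead of scanning all tracked jobs.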

Specification

  • Job
    • should instantiate with a UUID
    • should create with a UUID
    • should destroy job including its UUID
    • should record completion when
      • it succeeded
      • it failed
  • Resque
    • should push new job into queue (with 'lpush')
    • should pop accepted job (from 'queue' to 'accepted' list with 'rpoplpush')
    • should record 'completed' job
  • Nanny
    • when started as daemon
      • should process accepted jobs
      • should process completed jobs
      • should process expired jobs
    • when specifying timeouts
      • should use specific timeout assigned for specific queue
      • should use default timeout for a queue without its own timeout
    • when processing 'accepted' jobs
      • should create 'retries' entry for new job
      • should increment 'retries' entry for retried job
      • when job is retried too many times
        • should create 'failure' entry
        • should remove 'retries' entry
        • should remove 'accepted' entry
      • when job hasn't exceeded retries limit
        • should create 'expirations' entry
        • should remove 'accepted' entry
    • when processing 'completed' jobs [1]
      • should remove 'expirations' entry
      • should remove 'retries' entry
      • should remove 'completed' tracking
    • when processing expired jobs
      • when job has completed (was found in 'completed' list)
        • should treat it just like a normal completed job [see 1]
      • when job hasn't completed (wasn't found in 'completed' list)
        • when job has retries left
          • should increment 'retries' hash entry
          • should re-add job to appropriate 'queue' list
          • should remove from 'expirations' sorted set
        • when job has no retries left
          • should create 'failure' list entry
          • should treat it just like a normal completed job [see 1]
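
The Nanny rules above can be sketched in memory as follows. This is an illustration under stated assumptions, not a Redis-backed implementation: Arrays and Hashes stand in for the Redis structures, jobs are plain strings standing in for UUID-tagged payloads, and every class, method, and parameter name is hypothetical. The specification lists a 'retries' increment in both the accepted and expired steps; to avoid counting each retry twice, this sketch increments only when a re-queued job is accepted again.

```ruby
class Nanny
  attr_reader :queues, :accepted, :completed, :expirations, :retries, :failures

  def initialize(max_retries: 2, default_timeout: 60, timeouts: {})
    @max_retries     = max_retries
    @default_timeout = default_timeout
    @timeouts        = timeouts                       # per-queue overrides
    @queues      = Hash.new { |h, k| h[k] = [] }      # queue => jobs
    @accepted    = Hash.new { |h, k| h[k] = [] }      # queue => jobs
    @expirations = Hash.new { |h, k| h[k] = {} }      # queue => { job => deadline }
    @completed   = []
    @retries     = {}                                 # job => retry count
    @failures    = []
  end

  def timeout_for(queue)
    @timeouts.fetch(queue, @default_timeout)
  end

  # Worker side: move a job from the queue to the accepted list (like RPOPLPUSH).
  def accept(queue)
    job = @queues[queue].pop
    @accepted[queue].unshift(job) if job
    job
  end

  # Worker side: record completion whether the job succeeded or failed.
  def complete(job)
    @completed.unshift(job)
  end

  def process_accepted(now)
    @accepted.each do |queue, jobs|
      while (job = jobs.pop)                          # removes the 'accepted' entry
        @retries[job] = @retries.key?(job) ? @retries[job] + 1 : 0
        if @retries[job] > @max_retries
          @failures << job
          @retries.delete(job)
        else
          @expirations[queue][job] = now + timeout_for(queue)
        end
      end
    end
  end

  def process_completed
    finish(@completed.last) until @completed.empty?
  end

  def process_expired(now)
    @expirations.each do |queue, set|
      set.select { |_, deadline| deadline <= now }.each_key do |job|
        if @completed.include?(job)
          finish(job)
        elsif @retries.fetch(job, 0) < @max_retries   # retries left
          @queues[queue].unshift(job)
          set.delete(job)
        else
          @failures << job
          finish(job)
        end
      end
    end
  end

  private

  # Shared cleanup for a finished job (rule [1] above).
  def finish(job)
    @expirations.each_value { |set| set.delete(job) }
    @retries.delete(job)
    @completed.delete(job)
  end
end

# Usage: a worker accepts the job and crashes each time.
nanny = Nanny.new(max_retries: 1, default_timeout: 60)
nanny.queues['mail'].unshift('job-1')
now = 0
3.times do
  break unless nanny.accept('mail')
  nanny.process_accepted(now)
  now += 61
  nanny.process_expired(now)
end
```

In this run the job is attempted twice (once originally and once as a retry) before it lands in `failures`. A real Nanny would also need the queue name stored with each job so that the single "completed" list can be mapped back to the right "expirations" set; the sketch sidesteps that by searching every set.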