Kue (priority job queue) error debugging and repairs - Thanks to @dfoody ( https://github.com/dfoody )

Issue: Automattic/kue#130

Here's the rough set of steps we typically follow to repair the various Kue issues we see regularly. Note that some of these apply only to our fork, which adds the ability to do "locks" so that jobs for - in our case - a single user are serialized (users are differentiated based on their email address).

Failed Jobs Not Showing

When there are failed jobs (the sidebar says non-zero) but none show in the list, follow this procedure to repair them:

     redis-cli
     zrange q:jobs:failed 0 -1

For each job number, run hget q:job:NUM type until you find one whose 'type' is null (or where no 'type' field shows up at all). Then run hgetall q:job:NUM to see the data values for it.
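
If there are a lot of failed jobs, a quick way to scan for the corrupt one is a small shell loop - a rough sketch, assuming the default "q" prefix and a local redis:

     for NUM in $(redis-cli zrange q:jobs:failed 0 -1); do
       echo "$NUM -> $(redis-cli hget q:job:$NUM type)"
     done

Any job that prints an empty type is a candidate for the repair below.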

If there is no 'data' json blob, you can't recover - just delete the job as follows:

     hset q:job:NUM type bad
     zrem q:jobs:QUEUE:failed NUM
          (where QUEUE is the specific queue the job was in - if you don't know which, do this for each one)
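
For example, if the corrupt job were number 1234 and it had been in the mail queue (both values are hypothetical - substitute your own), that would be:

     hset q:job:1234 type bad
     zrem q:jobs:mail:failed 1234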

That should now make the jobs appear. Then go into the Kue UI and delete the 'bad' job.

If that doesn't work (e.g. it corrupts the failed queue again), here's how to manually delete a job:

     zrem q:jobs NUM
     zrem q:jobs:failed NUM
     del q:job:NUM
     del q:job:NUM:state
     del q:job:NUM:log

Even if there is a 'data' JSON blob, other fields might be messed up. It's best to find out what type of job it is and who it applies to (by looking in the log files), do the above procedure, and then kick off a new job (via the admin UI) to replace the corrupt one.
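
If the hash is still partly readable, its type and data fields can also help identify what needs to be re-created before you delete it (NUM as above):

     hget q:job:NUM type
     hget q:job:NUM data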

Jobs Staying in Queued

Sometimes jobs will stay in queued and not be allocated to a worker even if one is available. But as soon as another job is queued, one will move out of queued and get processed (while one or more will still be "stuck" in queued).

First, find the queue that's misbehaving. The example below assumes QUEUE is its name.

Find out how many jobs are queued:

     llen q:QUEUE:jobs
     zrange q:jobs:QUEUE:inactive 0 -1

There are two possible problems here:

  • The number doesn't match between these two commands.
  • The number matches and it's 0 for both, but a job is still showing in the UI

To solve these problems:

  1. Execute the following command as many times as needed to make the numbers the same (e.g. if llen returns 0 and zrange returns 2 items, run it 2 times; see the sketch after this list). The example uses the mail queue:
     lpush q:mail:jobs 1
  2. If the numbers match but jobs still show in the UI (and in zrange q:jobs:inactive 0 -1), then for each job that shows up in the UI but not in the commands above, either the job is actually in a different state, or its entries are invalid. Here's how to check:
     hget q:job:NUM state

     If the state is inactive, do the following, in this order:
          zadd q:jobs:mail:inactive 0 NUM
          lpush q:mail:jobs 1

     If the state is not inactive, then you should remove it from the inactive list:
          zrem q:jobs:inactive NUM
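
For step 1, here's a rough shell sketch that pushes until the counts match (again assuming QUEUE is the misbehaving queue and the default "q" prefix):

     LIST=$(redis-cli llen q:QUEUE:jobs)
     SET=$(redis-cli zcard q:jobs:QUEUE:inactive)
     while [ "$LIST" -lt "$SET" ]; do
       redis-cli lpush q:QUEUE:jobs 1
       LIST=$((LIST + 1))
     done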

Jobs Staying in Staged

If jobs for a user stay in staged, and there are no other jobs for that user in inactive, active, or failed, this likely means that a previous job never released the lock correctly. Check whether this is the case as follows (given the specific user's email):

     get q:lockowners:EMAIL

Assuming this shows a job number, get that job's current state:

     hget q:job:NUM state

If its current state is complete, you just need to delete the job and that should get the queue flowing. You may also need to repair the staged queue if it's corrupt after deleting the job:

     zrem q:jobs:staged NUM

If you can't get to the specific job, try clearing the completed queue.
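
If you want to clear the completed queue from redis directly, here's a rough sketch - it assumes the completed set follows the same naming pattern as the others (q:jobs:complete), so double-check that key exists first:

     for NUM in $(redis-cli zrange q:jobs:complete 0 -1); do
       redis-cli zrem q:jobs "$NUM"
       redis-cli zrem q:jobs:complete "$NUM"
       redis-cli del "q:job:$NUM" "q:job:$NUM:state" "q:job:$NUM:log"
     done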

If the current state of the job that has the lock is 'staged', then you should move that job directly to 'inactive' manually in the UI (since it already has the lock it can go ahead and be moved to execute).
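
If the UI isn't reachable, the redis equivalent should roughly mirror the inactive repair above - this is only a sketch (the staged keys come from our fork, so verify the exact key names in your install before running it):

     hset q:job:NUM state inactive
     zrem q:jobs:staged NUM
     zadd q:jobs:inactive 0 NUM
     zadd q:jobs:QUEUE:inactive 0 NUM
     lpush q:QUEUE:jobs 1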
