
Jobs stuck in inactive state #130

Closed

mikemoser opened this issue Sep 7, 2012 · 159 comments

@mikemoser

Jobs get stuck in the inactive state fairly often for us. We noticed that the length of q:[type]:jobs is zero, even when there are inactive jobs of that type, so when getJob calls blpop, there is nothing to process.

It looks like this gets set when a job is saved and the state is set to inactive using lpush q:[type]:jobs 1. We're wondering if this is failing in some cases and once the count is off, jobs remain unprocessed.

Has anyone else seen this issue?

@sebicas

sebicas commented Sep 8, 2012

Yes, I am having the same problem... do you have a patch?

@mikemoser
Author

We do not have a patch yet, as we're not sure of the root cause. Right now we are wondering if there is an intermittent problem with the lpush call below.

Job.prototype.state = function(state){
  var client = this.client;
  this.removeState();
  this._state = state;
  this.set('state', state);
  client.zadd('q:jobs', this._priority, this.id);
  client.zadd('q:jobs:' + state, this._priority, this.id);
  client.zadd('q:jobs:' + this.type + ':' + state, this._priority, this.id);
  //increase available jobs, used by Worker#getJob()
  if ('inactive' == state) client.lpush('q:' + this.type + ':jobs', 1);
  return this;
};

We have added some diagnostics in job.js ctor to log errors for the client and waiting to repro:

this.client.on('error', function (err) {
  console.log('redis job client error ' + err);
});

This may not be the cause. If anyone else has ideas or a patch, we would love to know. Jobs get stuck often for us, so I was surprised that more people have not run into this.

@sebicas

sebicas commented Sep 8, 2012

It is very strange... I am trying to find the cause...

When I call jobs.inactive() I get the job IDs: [ '147', '149', '144', '164', '168', '172', '176' ]

But for some reason jobs.process() doesn't see them or process them.

@sebicas

sebicas commented Sep 8, 2012

I am able to reproduce the problem when stopping a worker... tasks tacked by the worker will be stuck in the inactive state.

if I run:

jobs.process 'email', 4, (job, done) ->

1 task remains incomplete in the active state and 3 more remain stuck in the inactive state forever.

If I reactivate the worker, all the other pending tasks are processed, but the ones I mentioned stay stuck forever.

@spollack

We have just seen this issue as well (job stuck in the inactive state).

@mikemoser
Author

@sebicas do you have more details on your repro? Not sure what it means to be "tacked" by a worker while still inactive?

@sebicas

sebicas commented Sep 11, 2012

@mikemoser not sure if "tacked" was the right word... what I tried to say is that for some reason the number of stuck tasks is somehow related to the number of simultaneous tasks specified for the job.

For example if I do:

jobs.process 'email', 4, (job, done) ->

4 tasks will be stuck

jobs.process 'email', 6, (job, done) ->

6 tasks will be stuck

and so on...

@dfoody

dfoody commented Sep 19, 2012

We use kue to manage somewhere between 1k-20k jobs per day and see the same problems. For us it's sometimes once a week, other times multiple times per day.

Unfortunately, the root cause of these issues is likely fundamental to the way kue is written: since changes are applied serially in kue, not as an atomic transaction, any little glitch or crash can cause the items in a job to be partially applied, leading to the need to manually repair "broken" jobs.

We're at the stage where we're deciding whether to rewrite the innards of kue to be more reliable, or whether to move to something else. Any thoughts would be appreciated.

@sebicas

sebicas commented Sep 20, 2012

Unfortunately we are in the same situation as @dfoody :(

@tj
Contributor

tj commented Sep 20, 2012

should be pretty trivial to make things atomic. I don't have time to look at it right now, but at worst we could use a little Lua script. Though even if this portion is fully atomic, there's always the chance of something being stuck if the process is killed etc.. really I think the bigger problem is that we need to recover from half-processed jobs etc

@sebicas

sebicas commented Sep 20, 2012

I agree. Besides making things atomic... if the process is killed in the middle of a job's execution, that causes the job to get stuck... @visionmedia any suggestions on how to solve that?

@tj
Contributor

tj commented Sep 20, 2012

off the top of my head I can't think of any way to reliably differentiate between an active job and an active job whose process died. We could possibly tie PIDs into the whole thing, or alternatively just "timeout" those jobs: if a job has been active for N minutes and it's not complete, kill it and retry
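The "timeout" idea can be sketched as a pure staleness check, independent of kue's internals (the helper name and fields here are hypothetical, not kue API):

```javascript
// Decide whether an "active" job should be considered dead and retried.
// lastUpdatedMs is the job's last-touched time in epoch milliseconds;
// ttlMs is how long a job may stay active before we assume its worker died.
function isStale(lastUpdatedMs, nowMs, ttlMs) {
  return (nowMs - lastUpdatedMs) > ttlMs;
}

// A job last touched 10 minutes ago, with a 5-minute TTL, is stale:
console.log(isStale(0, 10 * 60 * 1000, 5 * 60 * 1000)); // true
```

A watchdog would periodically scan the active set and fail-and-requeue any job for which this returns true; the risk tj notes is that a slow-but-alive job looks identical to a dead one, so the TTL must exceed the worst-case job duration.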

@spollack

FYI, there is a pull request here #105 (thanks @dfoody) that incorporates a watchdog to check/fail jobs that have run more than a configurable period of time. We have been using this code successfully in our project.

@tj
Contributor

tj commented Sep 21, 2012

I do think they could be separated a bit; it's a pretty huge patch. I don't think some of that belongs in core, and it takes more time to review really big pull requests that have a larger scope.

@sebicas

sebicas commented Sep 21, 2012

@dfoody just mentioned he still has stuck jobs, so I guess his patch didn't solve the problem completely.

@dfoody

dfoody commented Sep 21, 2012

There are really two separate issues here:

(1) What do you do with jobs that legitimately fail? This is where the watchdog enhancement I put in works well - as long as you're sure that, when the watchdog fires, the job has really failed and is not just slow - so set your timeouts appropriately. The only alternative to really know whether jobs have died is to use something like ZooKeeper under the covers (which has a nice feature that locks can automatically be released when a process dies).

(2) What happens when kue's data structures get corrupted. This is happening to us a lot right now, due to a combination of factors, we believe: we're now doing calls across Amazon availability zones (Redis in a different AZ from the kue servers, increasing the latency between a series of redis requests) and we're now running significantly more kue servers than before. We think this combination of factors is causing us to see the corruptions much more often. This is where moving to atomic redis operations (with appropriate use of 'watch') will hopefully help.

@mikemoser
Author

@dfoody - thanks for clarifying. To be clear, this issue represents (2): we have a very basic setup and we see the redis indexes described above get out of sync before a job is ever processed, so jobs just stay in the inactive state. It happens a lot; however, we cannot get a consistent repro. Does anyone have a repro?

@dfoody

dfoody commented Sep 22, 2012

@mikemoser given what you describe - does your process that queues the job quit soon after queuing it?
If so, that's almost certainly your issue. Not all redis operations are complete by the time kue calls the callback, so you need to make sure the process doesn't die for at least a few seconds, to give kue/redis enough time to finish all the operations that are part of queuing a job.
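A sketch of that precaution, assuming kue's job.save(callback) API; the tracking helpers are hypothetical, and the stand-in job below simulates the asynchronous save:

```javascript
// Track outstanding saves so a short-lived producer process knows when
// it is safe to exit.
var pendingSaves = 0;

function trackedSave(job, cb) {
  pendingSaves++;
  job.save(function (err) {
    pendingSaves--; // the save callback has fired
    cb(err);
  });
}

function safeToExit() {
  return pendingSaves === 0;
}

// Stand-in for a kue job whose save completes immediately:
var fakeJob = { save: function (cb) { cb(null); } };
trackedSave(fakeJob, function () {});
console.log(safeToExit()); // true
```

Per the advice above, even after the callback fires it is safest to delay process exit by a few seconds, since some trailing redis operations may still be in flight.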

@behrad
Collaborator

behrad commented Oct 13, 2012

We've also seen the same issue; our queuing processes are long-lived ones. Any workarounds? Is this issue finally clarified?

@mikemoser
Author

@dfoody our worker process is always running, so kue should have all the time it needs to finish the operation of adding a job (e.g. call line 447 of kue/lib/jobs.js to increment the index by one for the new job). We are not able to get a consistent repro, so it's proving hard to fix; however, we see it happen all the time. I want to reiterate that this issue is about "new" jobs never getting out of the inactive state, not in-process jobs that get stuck. Those of you who said you've seen the same behavior: is it "new jobs stuck in the inactive state and never getting processed"?

@dfoody

dfoody commented Nov 2, 2012

I've not seen a case on our side where jobs get stuck in inactive without something else happening around the same time (e.g. a crash that corrupts the data, or AWS having "minor" issues like all of EBS going down). But when a job does get stuck, we have to manually correct things before it starts moving again.

That said, we're running off my fork, not the original (which has lots of QoS and reliability changes).

One other thing to try: Have you restarted Redis recently? We have seen that sometimes redis does need a restart and that fixes some things.

@edwardmsmith

We're seeing similar behavior as well.

What we see is that new jobs are stuck in an inactive state.

We have concurrency set to 1, but have a cluster of 4 processes.

Looking at Redis, we currently have two 'inactive' jobs.

When a new job is created, the oldest of the two inactive jobs suddenly gets processed.

So, we have, essentially, the two newest jobs always stuck - until they're displaced by new jobs.

@mikemoser
Author

There seem to be two causes for new jobs to never get processed and stuck in the inactive state.

  1. Indexes are out of sync (e.g. my original post)
  2. BLPOP not responding when new items are added

@edwardmsmith it sounds like your symptoms are related to #1. You can verify this by checking if llen q:[type]:jobs is less than zcard q:jobs:[type]:inactive. We added a watchdog for each job type that checks whether they are out of sync and corrects it by calling lpush q:[type]:jobs 1 for however many jobs are inactive and not in the key used by process job.

After correcting #1, we still noticed jobs stuck in the inactive state. It seems that BLPOP becomes unresponsive for certain job types and those jobs never process, even though the redis indexes look good. We don't have a high volume of jobs for these types, and our theory is that something goes wrong with the redis connection, but it fails silently and BLPOP just remains blocking and doesn't process any more jobs of that type. We have to restart our worker process and then it starts processing all the jobs properly. Has anyone seen BLPOP exhibit this behavior?

We're considering switching to LPOP and adding a setTimeout to throttle the loop; however, we'd prefer to keep BLPOP and not add what is essentially a high-frequency polling solution.
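The watchdog arithmetic described above reduces to comparing two redis counts; a minimal sketch of that calculation (helper name hypothetical - the real watchdog would read the counts with LLEN and ZCARD, then LPUSH that many tokens):

```javascript
// How many wake-up tokens are missing from q:[type]:jobs, given
// llen('q:' + type + ':jobs') and zcard('q:jobs:' + type + ':inactive').
function missingTokens(listLen, inactiveCount) {
  return Math.max(0, inactiveCount - listLen);
}

// Two inactive jobs but an empty signal list means two lpush calls:
console.log(missingTokens(0, 2)); // 2
```

For each missing token the watchdog would call client.lpush('q:' + type + ':jobs', 1) so that a blocked worker wakes up and drains the inactive set.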

@dfoody

dfoody commented Dec 7, 2012

This might help you.

Here's the rough set of steps we typically follow to repair various kue issues we see regularly:
(Note that some of these apply only to our fork, which has the ability to do "locks" so that jobs for - in our case - a single user are serialized; users are differentiated based on their email address.)

Failed Jobs Not Showing
When there are failed jobs (the sidebar says non-zero), but none show in the list, follow this procedure to repair them:

redis-cli

zrange q:jobs:failed 0 -1

For each, do hget q:job:NUM type until you find one whose 'type' is null (or no 'type' field shows up).
Then hgetall q:job:NUM to see its data values.

If there is no 'data' json blob, you can't recover - just delete the job as follows:
hset q:job:NUM type bad
zrem q:jobs:QUEUE:failed NUM
(where QUEUE is specific queue the job was in - if you don't know which do this for each one)

That should make the jobs now appear.
Then go into the Kue UI and delete the 'bad' job.

If that doesn't work (e.g. it corrupts the failed queue again), here's how to manually delete a job:
zrem q:jobs NUM
zrem q:jobs:failed NUM
del q:job:NUM
del q:job:NUM:state
del q:job:NUM:log

Even if there is a 'data' json blob, other fields might be messed up. It's best to find out what type of job it is and who it applies to (by looking in the log files), do the above procedure, and then kick off a new job (via the admin UI) to replace the corrupt one.
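The manual-delete steps above can also be scripted; this sketch (not part of kue) just emits the redis-cli commands for a given job id:

```javascript
// Build the redis-cli commands that fully remove job NUM, mirroring the
// manual deletion procedure above.
function deleteJobCommands(id) {
  return [
    'zrem q:jobs ' + id,
    'zrem q:jobs:failed ' + id,
    'del q:job:' + id,
    'del q:job:' + id + ':state',
    'del q:job:' + id + ':log'
  ];
}

console.log(deleteJobCommands(6063).join('\n'));
```

Piping the output into redis-cli (or running the commands one by one) reproduces the manual repair.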

Jobs Staying in Queued
Sometimes jobs will stay in queued and not be allocated to a worker even if one is available. But as soon as another job is queued, one will go out of queued and get processed (though one or more will still be "stuck" in queued).

First, find the queue that's misbehaving.
The example below assumes QUEUE is its name.

Find out how many jobs are queued:
llen q:QUEUE:jobs
zrange q:jobs:QUEUE:inactive 0 -1

There are two possible problems here:

  1. The number doesn't match between these two commands.
  2. The numbers match and are 0 for both, but a job still shows in the UI.
To solve these problems:

  1. Execute the following command as many times as needed to make the numbers the same (e.g. if llen returns 0 and zrange returns 2 items, run it 2 times):
    lpush q:QUEUE:jobs 1

  2. In this case (they show up in the UI and when you do zrange q:jobs:inactive 0 -1), for each job showing up in the UI but not in the above commands, it could be that the job is actually in a different state, or the entries are invalid. Here's how to check:
    hget q:job:NUM state

 If the state is inactive, do the following in this order:
      zadd q:jobs:QUEUE:inactive 0 NUM
      lpush q:QUEUE:jobs 1

 If the state is not inactive, then you should remove it from the inactive list:
      zrem q:jobs:inactive NUM

Jobs Staying in Staged
If jobs for a user stay in staged, and there are no other jobs for that user in inactive, active, or failed, this likely means a previous job never released the lock correctly. Check whether this is the case as follows (given the specific user's email):
get q:lockowners:EMAIL

Assuming this shows a job number, get that job's current state:
hget q:job:NUM state

If its current state is complete, you just need to delete the job and that should get the queue flowing. You may also need to repair the staged queue if it's corrupt after deleting the job:
zrem q:jobs:staged NUM

If you can't get to the specific job, try clearing the completed queue.

If the current state of the job that has the lock is 'staged', then you should move that job directly to 'inactive' manually in the UI (since it already has the lock it can go ahead and be moved to execute).

  • Dan


@edwardmsmith

@mikemoser - Thanks for the reply - interestingly, I don't have a q:email:jobs key at all (my job type is 'email'):

> keys *email*
1) "q:jobs:email:active"
2) "q:jobs:email:failed"
3) "q:jobs:email:complete"
4) "q:jobs:email:inactive"

So I had two stuck jobs:

> zrange q:jobs:email:inactive 0 -1
1) "6063"
2) "6064"
> lpush q:email:jobs 1
(integer) 1
> lpush q:email:jobs 1
(integer) 1
> zrange q:jobs:email:inactive 0 -1
(empty list or set)
> 

So, that seems to have cleared out the stuck items for now.

@dfoody - Wow, thanks for that!

@mikemoser
Author

@edwardmsmith looks like your key was empty, and it does seem that the indexes were out of sync. You can add a watchdog for each type to check this and correct it, like we have.

@dfoody thanks for sharing - looks like y'all are having a lot of issues. We hope this is not a sign of what's to come for us as we push more volume through kue. You state only 2 reasons for "Jobs Staying in Queued"; however, we have seen a third, where the numbers match on the indexes and are greater than zero. In this case we just see the worker for that type sitting on the BLPOP command even though we are pushing new jobs to the key it's blocking on (e.g. lpush q:[type]:jobs 1). It really seems like BLPOP is just not responding when it should, and never throwing an error. I'm not very experienced with redis in a high-volume scenario; is BLPOP prone to this type of behavior? We are using OpenRedis to host, not sure if the extra network layers would affect this.

@dfoody

dfoody commented Dec 7, 2012

We've not seen issues with BLPOP.
But, we're on redis 2.4.x and it looks like OpenRedis is on 2.6.x (and, coincidentally, the latest 2.6 release has a fix for a BLPOP issue…)

We host redis ourselves, and it's entirely possible - if you're not local to your redis server - that that could be the cause of issues (though I've not looked at the underlying redis protocol layer to see how they implement it to know more concretely if that type of thing could be an issue - e.g. does it heartbeat the connection to detect failures, etc.).


@mikemoser
Author

We're thinking about changing the kue/lib/queue/worker.js getJob() function to no longer use BLPOP and instead use LPOP with a setTimeout. Here is a change we've been testing locally. Any thoughts?

/**
 * Attempt to fetch the next job. 
 *
 * @param {Function} fn
 * @api private
 */

Worker.prototype.getJob = function(fn){
  var self = this;

  // alloc a client for this job type
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // BLPOP indicates we have a new inactive job to process
  // client.blpop('q:' + self.type + ':jobs', 0, function(err, result) {
  //   self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
  //     if (err) return fn(err);
  //     if (!id) return fn();
  //     Job.get(id, fn);
  //   });
  // });

  client.lpop('q:' + self.type + ':jobs', function(err, result) {
    setTimeout(function () {
      self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    }, result ?  0 : self.interval);
  });
};

@Jellyfrog
Contributor

Any news on this? Looking to run > 200k jobs/day and need something stable, since it will be practically impossible to handle errors/stuck jobs manually.

@mikemoser
Author

We have determined and fixed the cause of BLPOP not responding. There were a few factors in play:

  1. We use OpenRedis, and they had some logic that would kill idle connections (not considering blocking operations).
  2. We use a database index greater than 0 (i.e. not the default on a connection), and the reconnect logic in kue did not consider the currently selected database.

So, the reason BLPOP appeared unresponsive was because it had reconnected to the wrong database (i.e. back to index 0). We fixed this by:

  1. OpenRedis made a change to stop killing idle connections for blocking commands.
  2. We added some logic in kue to ensure the correct database is selected before calling BLPOP again.

kue/lib/queue/worker.js getJob()

/**
 * Attempt to fetch the next job. 
 *
 * @param {Function} fn
 * @api private
 */

Worker.prototype.getJob = function(fn){
  var self = this;

  // alloc a client for this job type
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // FIX: Ensure the correct database is selected
  // Note: We added selected_db when originally connecting 
  client.select(client.selected_db, function () {
    // BLPOP indicates we have a new inactive job to process
    client.blpop('q:' + self.type + ':jobs', 0, function(err) {
      // FIX: BLPOP may return an error and further commands should not be attempted.
      if (err) 
        return fn(err);

      self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    });
  });
};

This is not the best place for this logic; I'm assuming we'd want to make the change in the core reconnect logic and ensure it does not execute BLPOP until we're sure the database has been selected. However, we have had this fix in place for several weeks and things are looking much better for us.

We continue to have a watchdog to fix the indexes; however, we're observing to see whether that issue is also related to the selected-db issue on reconnect.

@behrad
Collaborator

behrad commented Dec 30, 2014

I've not posted any related issues yet; it should be investigated more first to reach a better understanding of the problem @tobalsgithub

@cburatto

cburatto commented Jan 7, 2016

I am having this issue and it behaves as described by @knation - when the queue is inactive for too long (e.g. a test environment left up over the weekend), any new jobs for that worker get stuck. When I restart the worker, it processes the jobs. Is there any more information about this? I'll try @knation's solution, but maybe a keep-alive should be standard.

@behrad
Collaborator

behrad commented Jan 8, 2016

when the queue is inactive for too long

If your case is when kue workers are idle for a long time, that is a different issue, and could be related to node_redis and its connection properties. It may be the connection that is dropped for inactivity.
Can you investigate more and open another issue if so?

@cburatto

cburatto commented Jan 8, 2016

Thanks for pointing in a direction. I suspect using a keep-alive (socket_keepalive: true) option when creating the queue might help. I'll test ASAP, but if you have any information or know of any drawbacks, that would help.

var queue = kue.createQueue({
  prefix: 'q',
  redis: {
    "host": REDIS_KUE_HOST,
    "port": REDIS_KUE_PORT,
    "socket_keepalive": true
  }
});

@mathieugerard

I recently moved my kue workers from Heroku to an Ubuntu machine on Azure. On Heroku everything works fine, but on Azure I have the same issue of workers that stop taking jobs, as described above.

As soon as the Azure workers have been idle for about 5 minutes, they stop taking jobs forever. If I restart them, they start processing jobs again.
Meanwhile, the workers on Heroku running exactly the same code keep taking jobs at any time, even after days of being idle.

I do not see anything in the Azure or Redis logs saying anything is wrong, but I suspect it is indeed the BLPOP that is never responding.

Since both sets of workers are running the exact same code, I believe this could be linked to a configuration on the Ubuntu Server 14.04.4 machine or the environment. Any idea where to search?

@behrad
Collaborator

behrad commented Apr 11, 2016

As soon as the Azure workers get idle for about 5 minutes

This may mean that your redis client connections are being closed or dropped after some idle time; some have seen this on cloud deployments. You can monitor your redis instance's connections, or increase the idle connection timeout. I don't remember the option name in node_redis; you can search the kue issues for it.
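As a sketch of that direction, a queue could be configured with node_redis connection options that address dropped idle connections. socket_keepalive and retry_strategy are node_redis options, but whether kue forwards them depends on the kue and node_redis versions in use, so verify against your setup:

```javascript
var kue = require('kue');

// Hedged sketch: keep the TCP connection alive and reconnect with backoff
// if the server or an intermediate proxy drops an idle connection.
var queue = kue.createQueue({
  prefix: 'q',
  redis: {
    host: '127.0.0.1',
    port: 6379,
    socket_keepalive: true,
    retry_strategy: function (options) {
      // reconnect after a growing delay, capped at 3 seconds
      return Math.min(options.attempt * 100, 3000);
    }
  }
});
```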

@victusfate

victusfate commented Sep 8, 2016

I noticed "Graceful workers shutdown" mentioned in the features; I don't see how this can be handled only by a queue-process kill-signal listener (workers can be in dozens of different processes connected to redis).

I ran into something similar with gearman workers/jobs a few years back (@enobrev wrestled with this issue). I believe it could be related, as I can consistently recreate stuck jobs in kue.js by queueing them up and killing the worker.

Whenever our workers (which aren't running in the same process as the queue) restart or are killed (any deployment), they are not cleanly deregistering:

  • deleting worker (don't grab any new jobs)
  • resetting all active jobs in that process to inactive
  • calling done (with an error) for all active jobs? This may be needed to unblock the queue

something like this snippet from kue.js, but just for a given process's workers and active jobs

    self.workers.forEach(function( worker ) {
      if( self.shuttingDown || worker.type == type ) {
        worker.shutdown(timeout, fn);
      } else {
        fn && fn();
      }
    });

or even better this!

queue.process('email', function(job, ctx, done){
  ctx.pause( 5000, function(err){
    console.log("Worker is paused... ");
    setTimeout( function(){ ctx.resume(); }, 10000 );
  });
});

This happens on every code push when I'm deploying (to Heroku, locally, EC2, etc.), and it is highly correlated with where I run into stuck jobs/queues. I noticed the reference to gracefully shutting down or restarting the queue, but nothing about the workers that execute the jobs themselves.

What I believe is needed is something like the code below in each and every worker process (I'm testing it locally today, since I can recreate the stuck queue - version -> master git+https://git@github.com/Automattic/kue.git )

process.on( 'SIGINT', function() {
  // delete any active workers in this process
  // reset any active jobs in this process to inactive, and possibly ensure their done w/ error is called
  process.exit( );
})

If memory serves, there may be other signals we'll need to listen for as well. Please let me know if there's a simple wrapper I can add somewhere; I'm hoping I can add it safely to my Job class.

@victusfate

victusfate commented Sep 8, 2016

OK, I put together a gist with graceful queue and worker shutdown. I'm still seeing a stuck active job, so I think the worker pause is not moving active jobs into an inactive state. I'm working on that last bit now.

Here's the gist:
https://gist.github.com/victusfate/1e2ce9eb73de32b78d2690d660f0f9c8

@victusfate

victusfate commented Sep 8, 2016

Updated the gist above to check if a job is active and, if so, set its state to inactive so this worker or other workers can pick it up when they resume.

Trying to ensure now that there are no race conditions, and that I don't have too many process SIGINT/SIGTERM listeners (the default limit is 10; it can be bumped higher within reason). The timing is a little weird: I want to deregister all the workers so they don't grab any jobs, and then I want to make all current jobs inactive. But after the workers are shut down, the job's ctx makes setting its state raise an error. After ctx.pause() you can't make a kue.Job.get call and use the returned job to set it inactive, if it's still active.

{ action: 'setJobInactive.err',
  err: 
   { [AbortError: EXEC can't be processed. The connection is already closed.]
     command: 'EXEC',
     code: 'NR_CLOSED',
     errors: [ [Object], [Object], [Object], [Object], [Object], [Object] ] } }

Ideally a single SIGINT/SIGTERM listener pair would cover all workers and active jobs per process, so those workers can shut down gracefully, and any active jobs with those worker IDs are made inactive after the workers are made inactive (so they don't try to grab the freshly made inactive jobs).

update
OK, I believe my latest version of that gist works as expected: it pauses the worker and makes any incomplete jobs inactive so that other or future workers can pick them up.

@manast

manast commented Mar 1, 2017

If you really care about this issue: the latest bull is on par feature-wise with kue, but with a non-polling and mostly atomic design. Why wait for kue 1.0 when you can use bull? :) https://www.npmjs.com/package/bull

DISCLAIMER: I am the author of the package. I started it out of frustration with some of the long-standing issues in kue, which even today are not completely fixed, and I can tell from experience that it is not completely trivial to rewrite everything using Lua scripts and blocking redis calls...

@Caspain

Caspain commented Mar 31, 2017

To remedy this issue, just create another job that periodically wakes the queue, roughly every minute, and remove it when complete. This way all jobs will get woken and never get truly stuck.

@victusfate

victusfate commented Apr 4, 2017

Who watches the Watchmen? @Caspain

@victusfate

victusfate commented Jun 15, 2017

@knation do you have a complete snippet for the keep-alive interval? We're still seeing stuck inactive jobs, only in our dev environment, likely related to long periods of inactivity.

Is this sufficient? I noticed your ... above

queue.process('stuck_queue', 10, function(job, done) {
  if (job.data.keepAlive) return done();
});

setInterval(function() { queue.create('stuck_queue', { keepAlive: true }).save(); }, 300000);

It looks like it just queues a job every 300000 ms (5 min).

any word back from redis @behrad ?

@knation

knation commented Jun 15, 2017

@victusfate Honestly, we abandoned this quite some time ago and migrated to a pubsub message system. That said, I believe it's close to what we had.

@victusfate

Thanks @knation I ended up not needing it (just had a worker issue).

@christianguevara

Hey @victusfate, maybe you can comment on what kind of issues? I sometimes get stuck jobs and could not find the reason yet.

@victusfate

I do a graceful job-to-inactive shift on process kill or term, and that normally handles any stuck jobs. My issue was just a faulty worker.

@theoutlander

Still facing this issue! I have all the error handlers, etc. as mentioned in the documentation. There was no failed job event raised either. I wasted the last three weeks implementing a solution with Kue/Redis that's completely unreliable! Going to switch to something else... will try Bull and if that doesn't work, I will move to RabbitMQ.

@theoutlander

The number of jobs that hang each time is equal to the number of parallel threads processing jobs.

@manast

manast commented Sep 11, 2017

@theoutlander try bull which has a similar API to Kue, and if you need help ask in the gitter channel: https://gitter.im/OptimalBits/bull
The reason I wrote bull in the first place was due to the stuck jobs issue.

@theoutlander

@manast Thanks for creating this. I'm loving it so far. I ran into a weird issue today (https://github.com/OptimalBits/bull/issues/170)... not sure why. It went away after a while / restarting the IDE (WebStorm).

I haven't faced any issues with stuck jobs so far! Good work! And great job keeping a similar API...the transition was seamless!
