
Reconsider the introduction of a default maximum_retry_count #251

Closed
hlascelles opened this issue Jul 11, 2019 · 4 comments

@hlascelles
Contributor

hlascelles commented Jul 11, 2019

Que 1.x introduces maximum_retry_count, after which a job stops being retried.

From https://github.com/chanks/que/blob/master/docs/error_handling.md

There is a maximum_retry_count option for jobs. It defaults to 15 retries, which with the default retry interval means that a job will stop retrying after a little more than two days.
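
For context, here is roughly where the "little more than two days" figure comes from, assuming the default retry interval of count ** 4 + 3 seconds that the same error handling doc describes (the formula is my reading of that doc, not something stated in this thread):

```ruby
# Back-of-the-envelope total wait across 15 retries, assuming Que's default
# backoff of (error_count ** 4 + 3) seconds between attempts.
total_wait = (1..15).sum { |error_count| error_count**4 + 3 }
puts "#{total_wait} seconds ≈ #{(total_wait / 86_400.0).round(2)} days"
# => 178357 seconds ≈ 2.06 days
```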

This is a great addition, but the fact that it has a default that stops the job concerns me. Major versions can of course include breaking changes, but we only noticed this one by chance.

One of the best things about Que is its resilience and the fact that jobs aren't lost (which we experienced constantly with Resque). I expect the change is related to the presence of the history table, but I'd say that is a bonus, not the main job flow.

Can I request the default be changed to "retry forever" in Que 1.x?
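
For anyone who wants the old behaviour back on a per-project basis, a minimal sketch, assuming maximum_retry_count is the class-level setting the quoted docs describe (ApplicationJob and SendWelcomeEmailJob are hypothetical names):

```ruby
# Opt back into "retry forever" for every job in the app by raising the limit
# on a shared base class; a very large integer would work equally well.
class ApplicationJob < Que::Job
  self.maximum_retry_count = Float::INFINITY
end

class SendWelcomeEmailJob < ApplicationJob
  def run(user_id)
    # ... job body ...
  end
end
```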

@airhorns
Contributor

@hlascelles do you think that might be kind of surprising for developers? To me the new default makes sense: if a job has failed 15 times in a row over two days, it seems reasonable to assume that's because of some systemic failure, and that it's unlikely to succeed without something external changing. I think it'd be prudent for que to protect itself (and the system) by not retrying indefinitely; if there is ever a systematically failing job and enough of them enqueued, they will pile up and consume all worker resources. I think the default of this "self-healing" (and, I guess, less correctness-oriented) approach makes sense so that developers don't accidentally shoot themselves in the foot with the above problem. Jobs like the que-scheduler jobs are, I think, the exception to the rule?

Both Sidekiq and delayed_job have a fixed number of retries with exponential backoff as well.

@siegy22
Member

siegy22 commented Sep 23, 2019

I agree that "retrying forever" is kind of a dangerous default. I can't imagine a job failing for more than 2 days and then suddenly succeeding.

In general I'm not a fan of (blindly) retrying jobs.

In the applications I've built, I let certain (very few) jobs retry automatically on specific errors, e.g. a network error when connecting to a 3rd-party API. But in general I want my job to fail if there's an exception. 😄
That's just my opinion, but the current default feels pretty good to me.

@chanks What are your thoughts on this?
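
A rough sketch of the selective-retry pattern described above, assuming the handle_error hook and the retry_in/expire helpers from Que 1.x's error handling docs (SyncInvoiceJob and ThirdPartyApi are hypothetical names):

```ruby
require "net/http"

class SyncInvoiceJob < Que::Job
  # Errors worth retrying automatically: transient network failures only.
  NETWORK_ERRORS = [Errno::ECONNRESET, Net::OpenTimeout, Net::ReadTimeout].freeze

  def run(invoice_id)
    ThirdPartyApi.sync_invoice(invoice_id)
  end

  # Called when run raises; decides what happens to the failed job.
  def handle_error(error)
    if NETWORK_ERRORS.any? { |klass| error.is_a?(klass) }
      retry_in(30) # transient network issue: try again in 30 seconds
    else
      expire       # anything else is treated as a bug: stop retrying
    end
  end
end
```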

@chanks
Collaborator

chanks commented Sep 23, 2019 via email

@hlascelles
Contributor Author

hlascelles commented Sep 25, 2019

Yes, good points all.

I was surprised to learn that the default for the delayed_job gem is complete job deletion - not something we could tolerate. EDIT: Sidekiq does the same, albeit after some months. I'm glad que doesn't do that.

OK, I'll close this knowing that the 1.x branch will, by default, keep failed (expired) jobs in a dead letter queue, which will aid manual morning retries 👍.
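
A hedged sketch of what that manual morning retry could look like, assuming Que 1.x marks expired jobs by setting expired_at on the que_jobs row (the column names here are my reading of the 1.x schema, not something confirmed in this thread):

```ruby
# Clear the "expired" state and reschedule every dead-lettered job for an
# immediate retry. Run from a console or a small rake task.
Que.execute <<~SQL
  UPDATE que_jobs
  SET expired_at = NULL,
      error_count = 0,
      run_at = now()
  WHERE expired_at IS NOT NULL
SQL
```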
