
Reconsider the introduction of a default maximum_retry_count #251

Closed
hlascelles opened this issue Jul 11, 2019 · 4 comments

@hlascelles
Contributor

hlascelles commented Jul 11, 2019

Que 1.x introduces maximum_retry_count, after which a job stops being retried.

From https://github.com/chanks/que/blob/master/docs/error_handling.md

There is a maximum_retry_count option for jobs. It defaults to 15 retries, which with the default retry interval means that a job will stop retrying after a little more than two days.
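
For context, here is roughly where the "little more than two days" figure comes from, assuming the default retry interval of count ** 4 + 3 seconds that the same error handling doc describes (the formula is my reading of that doc, not something stated in this thread):

```ruby
# Back-of-the-envelope total wait across 15 retries, assuming Que's default
# backoff of (error_count ** 4 + 3) seconds between attempts.
total_wait = (1..15).sum { |error_count| error_count**4 + 3 }
puts "#{total_wait} seconds ≈ #{(total_wait / 86_400.0).round(2)} days"
# => 178357 seconds ≈ 2.06 days
```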

This is a great addition, but the fact that it has a default that stops the job concerns me. Major versions can of course include breaking changes, but we only noticed this one by chance.

One of the best things about Que is its resilience and the fact that jobs aren't lost (which we experienced constantly with Resque). I expect the change is related to the presence of the history table, but I'd say that is a bonus, not the main job flow.

Can I request the default be changed to "retry forever" in Que 1.x?
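
For anyone who wants the old behaviour back on a per-project basis, a minimal sketch, assuming maximum_retry_count is the class-level setting the quoted docs describe (ApplicationJob and SendWelcomeEmailJob are hypothetical names):

```ruby
# Opt back into "retry forever" for every job in the app by raising the limit
# on a shared base class; a very large integer would work equally well.
class ApplicationJob < Que::Job
  self.maximum_retry_count = Float::INFINITY
end

class SendWelcomeEmailJob < ApplicationJob
  def run(user_id)
    # ... job body ...
  end
end
```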

@airhorns
Contributor

@hlascelles do you think that might be kind of surprising for developers? To me the new default makes sense: if a job has failed 15 times in a row over two days, it seems reasonable to assume that's because of some systemic failure, and that it's unlikely to succeed without something external changing. I think it'd be prudent for que to protect itself (and the system) by not retrying indefinitely; if there is ever a systematically failing job and enough of them enqueued, they will pile up and consume all worker resources. I think the default of this "self-healing" (and, I guess, less correctness-oriented) approach makes sense so that developers don't accidentally shoot themselves in the foot with the above problem. Jobs like the que-scheduler jobs are, I think, the exception to the rule?

Both Sidekiq and delayed_job have a fixed number of retries with exponential backoff as well.

@siegy22
Member

siegy22 commented Sep 23, 2019

I agree that "retrying forever" is kind of a dangerous default. I can't imagine a job failing for more than 2 days and then suddenly succeeding.

In general I'm not a fan of (blindly) retrying jobs.

In the applications I've built, I let certain (very few) jobs retry automatically on specific errors, e.g. a network error when connecting to a 3rd-party API. But in general I want my job to fail if there's an exception. 😄
That's just my opinion, but the current default feels pretty good to me.

@chanks What are your thoughts on this?
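
A rough sketch of the selective-retry pattern described above, assuming the handle_error hook and the retry_in/expire helpers from Que 1.x's error handling docs (SyncInvoiceJob and ThirdPartyApi are hypothetical names):

```ruby
require "net/http"

class SyncInvoiceJob < Que::Job
  # Errors worth retrying automatically: transient network failures only.
  NETWORK_ERRORS = [Errno::ECONNRESET, Net::OpenTimeout, Net::ReadTimeout].freeze

  def run(invoice_id)
    ThirdPartyApi.sync_invoice(invoice_id)
  end

  # Called when run raises; decides what happens to the failed job.
  def handle_error(error)
    if NETWORK_ERRORS.any? { |klass| error.is_a?(klass) }
      retry_in(30) # transient network issue: try again in 30 seconds
    else
      expire       # anything else is treated as a bug: stop retrying
    end
  end
end
```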

@chanks
Collaborator

chanks commented Sep 23, 2019 via email

@hlascelles
Contributor Author

hlascelles commented Sep 25, 2019

Yes, good points all.

I was surprised to learn that the default for the delayed_job gem is complete job deletion - not something we could tolerate. EDIT: Sidekiq does the same, albeit after some months. I'm glad que doesn't do that.

OK, I'll close this knowing that the 1.x branch will, by default, keep failed (expired) jobs in a dead letter queue, which will aid manual morning retries 👍.
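
A hedged sketch of what that manual morning retry could look like, assuming Que 1.x marks expired jobs by setting expired_at on the que_jobs row (the column names here are my reading of the 1.x schema, not something confirmed in this thread):

```ruby
# Clear the "expired" state and reschedule every dead-lettered job for an
# immediate retry. Run from a console or a small rake task.
Que.execute <<~SQL
  UPDATE que_jobs
  SET expired_at = NULL,
      error_count = 0,
      run_at = now()
  WHERE expired_at IS NOT NULL
SQL
```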
