Rethinking concurrency_limits#duration #336

hms · 2024-09-08T19:34:58Z

Both my reading of the documentation and my own observations suggest that concurrency_limits#duration: functions as a soft limit. While this may be correct for many use cases, it can be problematic for others. I have jobs where no more than one should ever be in flight at a time. While it's easy enough to code for this situation (a combination of locking and skip on lock), I would prefer the option of being explicit via concurrency_limits.

I propose breaking the notion of duration into two parts:

- a soft duration
- a hard duration

I would propose enhancing concurrency_limits with two optional parameters and behaviors:

notify_at:
- This would invoke a SolidQueue.log_subscriber to notify the user that a job is exceeding runtime expectations.

at_duration:
- This would take one of the following as its parameter:
  - :release (default). This is the current behavior and would make this implementation transparent to existing deployments.
  - :fail_job. Terminate the job and set it to a failed state.

The existing duration parameter would not be impacted.

If this makes sense and is acceptable, I'm happy to take a run at implementing it.

rosa · 2024-09-09T08:37:24Z

Hmm... I think there's a mismatch between what duration means and the intention of this proposal 🤔 duration here refers to the time we keep jobs blocked before they're candidates for release; it's not about the runtime of the job that's currently running. You could have a job that takes a second to run but blocks other jobs for 10 minutes because it stays enqueued for much longer (for example, if you have a big backlog).

The concurrency limits work as follows: when a job is enqueued, we check if it specifies concurrency controls. If it does, we try to see if we have permission to put it as "ready" (that's the semaphore check). Ready means it can be picked up by workers for execution. If we have permission, we do that, and we don't release the semaphore and try to unblock the next job until it finishes (be it successfully or unsuccessfully). Unblocking the next job doesn't mean running that job right away, but moving it from blocked to ready. Since something can happen that prevents the first job from releasing the semaphore and unblocking the next job, we have the duration as a failsafe. Jobs that have been blocked for more than duration can be released, but only one of them, following the same rules. So, duration is not really about the job that's enqueued or being run, it's about the jobs that are blocked waiting.
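
To make the flow above concrete, here is a minimal in-memory sketch in plain Ruby. It is not Solid Queue's implementation: `ToyQueue` and all of its method names are hypothetical, time is passed in as plain numbers, and the failsafe is simplified to promoting at most one job that has been blocked for longer than duration, without re-checking the semaphore.

```ruby
# Toy model of the blocked/ready flow described above.
# Hypothetical names throughout; this is not Solid Queue code.
class ToyQueue
  Blocked = Struct.new(:job, :expires_at)

  attr_reader :ready, :blocked

  def initialize(limit:, duration:)
    @limit = limit        # max jobs holding the semaphore at once
    @duration = duration  # seconds a job may stay blocked before the failsafe applies
    @held = 0             # semaphore slots currently taken
    @ready = []           # jobs that workers may pick up (or are running)
    @blocked = []         # jobs waiting for the semaphore
  end

  # On enqueue, the semaphore check decides between ready and blocked.
  def enqueue(job, now:)
    if @held < @limit
      @held += 1
      @ready << job
    else
      @blocked << Blocked.new(job, now + @duration)
    end
  end

  # When a job finishes (successfully or not), release its slot and
  # move the next blocked job to ready. It is not run right away.
  def finish(job)
    @ready.delete(job)
    @held -= 1
    unblock_next
  end

  # Failsafe: if nothing released the semaphore, promote at most one job
  # that has been blocked for longer than duration (simplified assumption).
  def release_expired(now:)
    expired = @blocked.find { |entry| entry.expires_at <= now }
    return unless expired

    @blocked.delete(expired)
    @ready << expired.job
  end

  private

  def unblock_next
    return if @blocked.empty? || @held >= @limit

    @held += 1
    @ready << @blocked.shift.job
  end
end

queue = ToyQueue.new(limit: 1, duration: 600)
queue.enqueue("job-a", now: 0)   # takes the semaphore, goes to ready
queue.enqueue("job-b", now: 1)   # blocked
queue.enqueue("job-c", now: 2)   # blocked
queue.finish("job-a")            # releases the semaphore, job-b becomes ready
ready_after_finish = queue.ready.dup

# Suppose job-b never finishes, so job-c is never unblocked normally.
queue.release_expired(now: 700)  # job-c has been blocked for more than 600s
ready_after_failsafe = queue.ready.dup
```

In this sketch, duration never limits how long job-b runs. It only bounds how long job-c can sit in the blocked list before it becomes a candidate for release.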

hms · 2024-09-09T23:27:39Z

@rosa Thank you for this explanation.

I completely misunderstood the description in the readme and the purpose of the implementation. For what it's worth, this description was extremely helpful and I think it would add real value as an enhancement to the current documentation.

rosa · 2024-09-11T17:34:30Z

That's a great point! I'm going to add this explanation to the README. Thank you! 🙏

hms closed this as completed Sep 9, 2024

rosa added a commit that referenced this issue Sep 11, 2024

Add further clarification about how concurrency controls work (d768187)

As brought up by @hms in #336.
