Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethinking concurrency_limits#duration #336

Closed
hms opened this issue Sep 8, 2024 · 3 comments
Closed

Rethinking concurrency_limits#duration #336

hms opened this issue Sep 8, 2024 · 3 comments

Comments

@hms
Copy link
Contributor

hms commented Sep 8, 2024

Between my reading of the documentation and observationally, concurrency_limits#duration: functions as a soft limit. While this might be correct for some/many use-cases, it can be problematic for others. I have jobs that should never allow for more than one to ever be inflight at a time. While it's easy enough to code for this situation (a combination of locking and skip on lock), I would prefer the option of being explicit via concurrency_limits.

I propose to breaking notion of duration into two parts:

  • a soft duration.
  • a hard duration

I would propose enhancing concurrency_limits with two optional parameters and behaviors:

  • notify_at:
    • This would invoke a SolidQueue.log_subscriber to message the user they have a job that is exceeding runtime expectations
  • at_duration
    • This would take following as parameters:
      • :release (default). This is the current behavior and would make this implementation transparent to existing deployments.
      • :fail_job. Terminate the job and set it to a failed state
  • The existing duration parameter would not be impacted.

If this makes sense and is acceptable, I'm happy to take a run at implementing.

@rosa
Copy link
Member

rosa commented Sep 9, 2024

Hmm... I think there's a mismatch of what duration means and the intention of this proposal 🤔 duration here refers to the time we keep jobs blocked before they're candidates for release, but it's not about runtime or the current job running. You could have a job that takes a second to run but blocks other jobs for 10 minutes because it stays enqueued for much longer (for example, if you have a big backlog).

The concurrency limits work as follows: when a job is enqueued, we check if it specifies concurrency controls. If it does, we try to see if we have permission to put it as "ready" (that's the semaphore check). Ready means it can be picked up by workers for execution. If we have permission, we do that, and we don't release the semaphore and try to unblock the next job until it finishes (be it successfully or unsuccessfully). Unblocking the next job doesn't mean running that job right away, but moving it from blocked to ready. Since something can happen that prevents the first job from releasing the semaphore and unblocking the next job, we have the duration as a failsafe. Jobs that have been blocked for more than duration can be released, but only one of them, following the same rules. So, duration is not really about the job that's enqueued or being run, it's about the jobs that are blocked waiting.

@hms
Copy link
Contributor Author

hms commented Sep 9, 2024

@rosa Thank you for this explanation.

I completely misunderstood the description in the readme and the purpose of the implementation. For what it's worth, this description was extremely helpful and I think it would add real value as an enhancement to the current documentation.

@hms hms closed this as completed Sep 9, 2024
@rosa
Copy link
Member

rosa commented Sep 11, 2024

That's a great point! I'm going to add this explanation to the README. Thank you! 🙏

rosa added a commit that referenced this issue Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants