Conversation
cc: @james-deee
if (index == 0) {
    return Optional.of(0L);
}
return Optional.of(Math.min((long) Math.pow(2, index), defaultOffset));
What happens if someone has their systemTaskWorkerCallbackDuration set to one second, though? The offset will always still be 1 second, so this won't be an exponential backoff in that scenario. It feels like defaultOffset should be a separately configurable maximum offset or something like that, maybe defaulting to 30 seconds?
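For illustration only, here is a sketch of what that variant might look like, with a dedicated, configurable cap instead of reusing defaultOffset (the name maxBackoffOffset and the method shape are assumptions, not the actual implementation):

```java
import java.util.Optional;

class JoinBackoffSketch {
    // Hypothetical variant: cap the exponential backoff at a dedicated maximum
    // (e.g. 30 seconds) instead of at defaultOffset / systemTaskWorkerCallbackDuration.
    static Optional<Long> evaluateOffset(int index, long maxBackoffOffset) {
        if (index == 0) {
            return Optional.of(0L); // first poll: check immediately
        }
        // Grows 2, 4, 8, 16, ... seconds until it hits the configured maximum.
        return Optional.of(Math.min((long) Math.pow(2, index), maxBackoffOffset));
    }
}
```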
Hi @james-deee, I think if we are setting systemTaskWorkerCallbackDuration to 1 second, then ideally we don't want the exponential backoff at all. Here we are trying to back off exponentially from 1 to 30 seconds.
@manan164 On one hand I see what you're saying, but in practice we are seeing that using 1 second doesn't alleviate the problem. The issue is that we have a lot of long-running workflows (and thus JOINs) in the queue. If we do NOT apply an exponential backoff on top of our 1-second setting, our JOIN tasks don't get processed fast enough, because the async task worker is trying to process, say, 10k JOINs, which means JOINs that should be processed immediately get backed up and slowed down.
My proposal, which is what we changed locally, fixes all of our issues with processing the async JOINs. In our setup we use a duration of 1 second but cap the backoff at 30 seconds in place of that defaultOffset, and it completely solves our issue. The result is that JOINs in fast-running workflows get processed almost instantaneously, while JOINs in long-running workflows get pushed back to the maximum of 30 seconds, which is what frees up processing of the immediate ones.
I really think these are two different levers/values that need to be present. Always using 1 second won't work on heavily loaded Conductor instances where there might be a LOT of JOINs to process.
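To make that scenario concrete, here is a rough demo of the offsets a long-running JOIN would see with a 1-second callback duration and a 30-second cap, assuming the 2^index formula from the diff (the numbers are illustrative, not taken from a real run):

```java
public class BackoffDemo {
    public static void main(String[] args) {
        long cap = 30L; // proposed maximum backoff in seconds
        for (int index = 0; index <= 8; index++) {
            long offset = (index == 0) ? 0L : Math.min((long) Math.pow(2, index), cap);
            System.out.println("poll " + index + " -> next check in " + offset + "s");
        }
        // Prints 0, 2, 4, 8, 16, 30, 30, 30, 30: JOINs that resolve quickly are checked
        // within the first few seconds, while long-running JOINs settle at the 30s cap
        // and stop crowding out the fast ones.
    }
}
```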
Or maybe what you're saying is that we should always use a setting of 30 seconds for systemTaskWorkerCallbackDuration, in which case it sounds like no one should ever use 1 second for that value.
I think that could get confusing, because someone would naturally set the value to 1 expecting fast processing, without knowing that 30 seconds actually gives the fastest/best processing.
@manan164 Maybe a compromise is to use a hardcoded 30 seconds for that defaultOffset. Otherwise I really think it will be confusing, and could cause issues, for people who use 1 second for that setting, because it doesn't actually give them the best performance.
Hi @james-deee, the expectation here is that when we set systemTaskWorkerCallbackDuration to 30 seconds, we are fine with tasks being checked and completed every 30 seconds, and if tasks happen to be checked and completed at 1, 2, 4, 8, or 16 seconds, that is also fine. But when we set systemTaskWorkerCallbackDuration to 1 second, it means we want the task to be checked every second. Hope this answers the question.
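Spelling that out against the formula in the diff: with defaultOffset = 1, every offset after the first poll is clamped to 1 second, so there is effectively no backoff, which matches the intent described above (a worked example, not actual Conductor code):

```java
public class NoBackoffAtOneSecond {
    public static void main(String[] args) {
        long defaultOffset = 1L; // systemTaskWorkerCallbackDuration set to 1 second
        for (int index = 1; index <= 5; index++) {
            // min(2^index, 1) == 1 for every index >= 1, so the task is re-checked every second
            System.out.println(Math.min((long) Math.pow(2, index), defaultOffset));
        }
    }
}
```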
Is it merged to main?
Any estimate on when this can be expected to be merged? It is crippling the performance of our workflows 🙏🏼
Exactly, our system is throttled. Database is also at high utilization.
@manan164 @apanicker-nflx Can we get this in?
Pull Request type
NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.
Changes in this PR
Implements an exponential backoff approach for the JOIN task callbackDuration.
Describe the new behavior from this PR, and why it's needed
Issue #
#3436