-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase the default for runner.poll.interval
config option to 60
#4150
Increase the default for runner.poll.interval
config option to 60
#4150
Conversation
I have the feeling we should maybe test this in the wild before merging. Maybe we have silently been relying a lot on the polling backup mechanism and by changing this now, we would maybe see an unexpected drop in throughput, albeit at a significantly lower CPU cost 😅 I will try to test myself soon, but if anyone else could give it a spin as well, that'd be great. Until then I will put the PR as on hold. |
567c972
to
7442aee
Compare
@sphuber Thanks. I guess the testing would be more applicable for a high-throughput case right (actively doing lots of processes)? I have added logging in my production instance anyway and will see if I can find any. It isn't that active though. There are many |
Codecov Report
@@ Coverage Diff @@
## develop #4150 +/- ##
===========================================
+ Coverage 78.89% 78.89% +0.01%
===========================================
Files 467 467
Lines 34500 34500
===========================================
+ Hits 27214 27215 +1
+ Misses 7286 7285 -1
Continue to review full report at Codecov.
|
def call_on_process_finish(self, pk, callback):
"""
Callback to be called when the process of the given pk is terminated
:param pk: the pk of the process
:param callback: the function to be called upon process termination
"""
process = load_node(pk=pk)
self._poll_process(process, callback) I am not very sure but I think poll calculation is the only mechanism here rather than the backup for the rmq broadcast. Since the Or did I get it wrong? |
I think you are right @unkcpz . I have also checked the code and I cannot see where the parent is subscribing to state transition messages from the child, so it continuing relies solely on the polling mechanism. I wonder now if the subscribing was broken by accident at some point or if it actually never went in the code. Anyway, I will open a new issue for this and add a fix. This might actually give us an unexpected performance boost since we were actually not relying on event-based processing at all. |
Ha, yes. I also suggest to add a logging message to |
7442aee
to
187c0c4
Compare
This option is used to define the `poll_interval` argument of all `Runner` instances that are created to run all processes. It determines the interval with which the state of a subprocess, that a `WorkChain` is waiting on, is checked whether it is terminated. If that is the case, a callback is called which will signal to the `WorkChain` that it can continue. This polling is a backup mechanism in case the broadcast by the process when it terminates is missed by the caller, which would cause it to wait indefinitely. The original default of 1 was causing unnecessary load on the CPUs as well as the database that each time had to query for the process state. When running many calculations this would spin the CPUs noticeably. Since this is supposed to be a fail-safe mechanism and it should only be required rarely, it is fine to increase the time significantly to reduce the load.
187c0c4
to
e605430
Compare
@muhrin care to give this daunting PR a go? ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK to me. I have been running with 60 seconds polling interval for a few days so far, no issue found.
By the way, is there a plan to re-enable the broadcast mechanism for informing process completion, so the polling is for fail-safe as originally intended? |
Already merged in #4154 😄 |
Fixes #4149
This option is used to define the
poll_interval
argument of allRunner
instances that are created to run all processes. It determinesthe interval with which the state of a subprocess, that a
WorkChain
iswaiting on, is checked whether it is terminated. If that is the case, a
callback is called which will signal to the
WorkChain
that it cancontinue.
This polling is a backup mechanism in case the broadcast by the process
when it terminates is missed by the caller, which would cause it to wait
indefinitely. The original default of 1 was causing unnecessary load on
the CPUs as well as the database that each time had to query for the
process state. When running many calculations this would spin the CPUs
noticeably. Since this is supposed to be a fail-safe mechanism and it
should only be required rarely, it is fine to increase the time
significantly to reduce the load.