-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #1913 Added minimum scheduler poll interval #1929
Fix #1913 Added minimum scheduler poll interval #1929
Conversation
aiida/work/job_calcs.py
Outdated
@@ -0,0 +1,235 @@ | |||
import contextlib | |||
from future.utils import iteritems, itervalues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: In the Python 3 PR I dropped future
completely in favor of six
, which has a iteritems
as well. But I dropped it in most cases in favor of just using d.items()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Tiziano! I'll convert over to six.
4563fda
to
183c0a7
Compare
Codecov Report
@@ Coverage Diff @@
## develop #1929 +/- ##
===========================================
- Coverage 67.61% 67.14% -0.48%
===========================================
Files 324 321 -3
Lines 33305 33291 -14
===========================================
- Hits 22520 22354 -166
- Misses 10785 10937 +152
Continue to review full report at Codecov.
|
@muhrin GitHub does not allow me to add inline comments for some reason. Why do you exclude |
aiida/work/utils.py
Outdated
yield sleep(interval) | ||
|
||
import traceback | ||
traceback.print_exc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we always want the print here? It will be dumped once it hits the exception handling of the process no?
@@ -513,6 +516,27 @@ def set_default_mpiprocs_per_machine(self, def_cpus_per_machine): | |||
raise TypeError("def_cpus_per_machine must be an integer (or None)") | |||
self._set_property("default_mpiprocs_per_machine", def_cpus_per_machine) | |||
|
|||
def get_minimum_job_poll_interval(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I saw you used metadata
as the container for this info, I was worried that this was useless, because as that is a column of the DbComputer
table it should be immutable once stored, which would mean that it could only be set once during setup. I then checked the code and saw that the is stored
check was commented out. So apparently, this is now possible, however, I wonder if it is the right thing. I don't think there is another viable option but I am just pointing it out here, so we can discuss and make sure this is what we want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I suspect this metadata is basically the extras of Computer in the sense that they can be updated (intentionally) even after storing.
scheduler.set_transport(transport) | ||
|
||
kwargs = {'as_dict': True} | ||
if scheduler.get_feature('can_query_by_user'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the scheduler cannot query by user? It used to default to just query for a specific job id. I don't see anymore how this is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Querying by job id is no longer necessary from this point of the code because it queries all jobs (and caches the results). In a way that's the point in this class. It's still useful to query by user in case it's possible just to reduce the number of results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That I understood, but I assumed that given there exists this scheduler.get_feature('can_query_by_user')
check, certain schedulers cannot query by user. If that is in fact the case, this user wide polling will break. So either certain scheduler cannot query on a per user basis and this will break, or all of them can and then this check is superfluous and should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I made an assumption here that if you don't specify a user then it will query jobs from all users - if that's not true then you're right that this could cause a problem. Let me check...
9f542ee
to
f142eba
Compare
f142eba
to
9e6e4aa
Compare
Ok, we are now looking by job id when we can't query by user. This is something we need to test in the field on each of the schedulers.
The others (i.e. direct and sge) can query by user. Does anyone have the ability to test any of these? |
1bc56a8
to
ee51ec1
Compare
According to the chat with Giovanni as the last meeting the schedulers are indeed capable (and designed to) take a list of job ids so this puppy should be set to go. |
Now the computer has a property that can be set to regulate the interval with which the scheduler command to get jobs can be called. This acts as a minimum and may in practice be longer if the transport minimum open interval is longer. With these changes all jobs in a single runner on the same computer also have their requests to get the jobs list batched and therefore there should be a significant decrease in the number of calls asking the scheduler to list the currently running jobs.
By default the `JobsList` will get the status of running jobs by requesting the status for the user, associated with the auth info, however, some schedulers do not support this functionality. Instead, one has to query for a list of job ids. In this case the `JobsList` will simply get the job id list from the internal mapping it keeps.
ee51ec1
to
597430f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A-mazing
Fix #1913
Now the computer has a property that can be set to regulate the interval
with which the scheduler command to get jobs can be called. This acts
as a minimum and may in practice be longer if the transport minimum open
interval is longer.
With these changes all jobs in a single runner on the same computer also
have their requests to get the jobs list batched and therefore there
should be a significant decrease in the number of calls asking the
scheduler to list the currently running jobs.