-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-36981: retry jobs by increasing memory with user defined task para… #58
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #58 +/- ##
==========================================
- Coverage 37.87% 37.78% -0.10%
==========================================
Files 9 9
Lines 528 532 +4
Branches 91 91
==========================================
+ Hits 200 201 +1
- Misses 319 322 +3
Partials 9 9 ☔ View full report in Codecov by Sentry. |
python/lsst/ctrl/bps/panda/utils.py
Outdated
if gwjob.request_memory and gwjob.memory_multiplier | ||
else 0 | ||
) | ||
task_rss_retry_offset = 0 if task_rss_retry_step else gwjob.request_memory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 323 suggests that even if gwjob.request_memory
evaluates to False
(e.g. is not set), the task will request for some default amount of memory instead of raising an error. So to cover the case when gwjob.request_memory
evaluates to False
, but gwjob.memory_multiplier
is set I would rewrite this fragment as follow
task_rss = gwjob.request_memory if gwjob.request_memory else PANDA_DEFAULT_RSS
task_rss_retry_step = task_rss * gwjob.memory_multiplier if gwjob.memory_mulitplier else 0
task_rss_retry_offset = 0 if task_rss_retry_step else task_rss
and change line 323 to
task_rss=task_rss
Also, can we add a comment explaining a bit how PanDA will use these values to scale the memory requirements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestions.
In fact, panda boost memory function will check the current task memory usage. If the memory after boosting is lower than the current task memory, it will not do anything.
python/lsst/ctrl/bps/panda/utils.py
Outdated
@@ -315,6 +321,9 @@ def _make_doma_work(config, generic_workflow, gwjob, task_count, task_chunk): | |||
}, | |||
encode_command_line=True, | |||
task_rss=gwjob.request_memory if gwjob.request_memory else PANDA_DEFAULT_RSS, | |||
task_rss_retry_offset=task_rss_retry_offset, | |||
task_rss_retry_step=task_rss_retry_step, | |||
task_rss_max=gwjob.request_memory_max if gwjob.request_memory_max else None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if task_rss_max = None
(i.e. gwjob.request_memory_max
evaluates to False
)? Will PanDA impose any limits on the amount of the memory a job can request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a variable PANDA_DEFAULT_RSS_MAX=512000. However Currently panda doesn't support task_rss_max, it will be ignored (will fix it later in panda). (for ATLAS, the default per core is 4GB. The max boost memory per core can be 16G, it's hardcoded. here we need to transfer PANDA_DEFAULT_RSS_MAX to panda server.)
810e8fe
to
a31f31f
Compare
…meters
Checklist
doc/changes