Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-36981: retry jobs by increasing memory with user defined task para… #58

Merged
merged 5 commits into from
Dec 6, 2023

Conversation

zhaoyuyoung
Copy link
Contributor

…meters

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

Copy link

codecov bot commented Oct 31, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (c52380c) 37.87% compared to head (8fad834) 37.78%.

Files Patch % Lines
python/lsst/ctrl/bps/panda/utils.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
- Coverage   37.87%   37.78%   -0.10%     
==========================================
  Files           9        9              
  Lines         528      532       +4     
  Branches       91       91              
==========================================
+ Hits          200      201       +1     
- Misses        319      322       +3     
  Partials        9        9              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

if gwjob.request_memory and gwjob.memory_multiplier
else 0
)
task_rss_retry_offset = 0 if task_rss_retry_step else gwjob.request_memory
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 323 suggests that even if gwjob.request_memoryevaluates to False (e.g. is not set), the task will request for some default amount of memory instead of raising an error. So to cover the case when gwjob.request_memory evaluates to False, but gwjob.memory_multiplier is set I would rewrite this fragment as follow

task_rss = gwjob.request_memory if gwjob.request_memory else PANDA_DEFAULT_RSS
task_rss_retry_step = task_rss * gwjob.memory_multiplier if gwjob.memory_mulitplier else 0
task_rss_retry_offset = 0 if task_rss_retry_step else task_rss

and change line 323 to

task_rss=task_rss

Also, can we add a comment explaining a bit how PanDA will use these values to scale the memory requirements?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good suggestions.
In fact, panda boost memory function will check the current task memory usage. If the memory after boosting is lower than the current task memory, it will not do anything.

@@ -315,6 +321,9 @@ def _make_doma_work(config, generic_workflow, gwjob, task_count, task_chunk):
},
encode_command_line=True,
task_rss=gwjob.request_memory if gwjob.request_memory else PANDA_DEFAULT_RSS,
task_rss_retry_offset=task_rss_retry_offset,
task_rss_retry_step=task_rss_retry_step,
task_rss_max=gwjob.request_memory_max if gwjob.request_memory_max else None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happen if task_rss_max = None (i.e. gwjob.request_memory_max evaluates to False)? Will PanDA impose any limits on the amount of the memory a job can request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a variable PANDA_DEFAULT_RSS_MAX=512000. However Currently panda doesn't support task_rss_max, it will be ignored (will fix it later in panda). (for ATLAS, the default per core is 4GB. The max boost memory per core can be 16G, it's hardcoded. here we need to transfer PANDA_DEFAULT_RSS_MAX to panda server.)

@wguanicedew wguanicedew merged commit ad75819 into main Dec 6, 2023
11 of 13 checks passed
@wguanicedew wguanicedew deleted the tickets/DM-36981 branch December 6, 2023 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants