Reducing resource requests #42
base: main
@@ -8,8 +8,8 @@
 walltimes = {
     "compute_bigmem": "01:00:00",
     "large_mem": "04:00:00",
-    "sharded_reproject": "04:00:00",
-    "gpu_max": "08:00:00",
+    "sharded_reproject": "01:00:00",
+    "gpu_max": "01:00:00",
 }
@@ -21,7 +21,7 @@ def klone_resource_config():
             os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat())
         ),
         run_dir=os.path.join("/gscratch/dirac/kbmod/workflow/run_logs", datetime.date.today().isoformat()),
-        retries=1,
+        retries=100,
Until we have a good way to catch and ignore pre-emption "failures" that would increment the retry counter, we can naively set the max retry number to something large.
         executors=[
             HighThroughputExecutor(
                 label="small_cpu",
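Regarding the retry-counter comment above: if a way to recognize pre-emptions does become available, Parsl's retry_handler hook on the Config is one place to plug it in, since it lets each failure be charged a custom cost instead of a flat 1. The sketch below is illustrative only; it assumes a Parsl version that supports retry_handler, and the "Lost" name check is a placeholder that would need to be matched to whatever exception Parsl actually raises when a ckpt-g2 job is pre-empted.

```python
from parsl.config import Config


def preemption_aware_retry_handler(exception, task_record):
    """Return the retry 'cost' to charge for a failed try.

    Placeholder logic: the exception type raised for a pre-empted ckpt job
    (e.g. a lost worker/manager) still needs to be confirmed.
    """
    if "Lost" in type(exception).__name__:
        return 0  # pre-emption: retry for free
    return 1  # genuine task failure: count it against the budget


# This would replace retries=100 in klone_resource_config(), e.g.:
config = Config(
    retries=3,  # budget spent only on real failures
    retry_handler=preemption_aware_retry_handler,
)
```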
@@ -30,19 +30,20 @@ def klone_resource_config():
                     partition="ckpt-g2",
                     account="astro",
                     min_blocks=0,
-                    max_blocks=4,
+                    max_blocks=16,
                     init_blocks=0,
                     parallelism=1,
                     nodes_per_block=1,
                     cores_per_node=1, # perhaps should be 8???
-                    mem_per_node=256, # In GB
+                    mem_per_node=32, # In GB
This executor is only used by the pre-TNO workflow to convert the URI file into an ImageCollection. So we probably never needed anywhere near the memory that was requested.
                     exclusive=False,
                     walltime=walltimes["compute_bigmem"],
                     # Command to run before starting worker - i.e. conda activate <special_env>
                     worker_init="",
                 ),
             ),
             HighThroughputExecutor(
+                # This executor was used for the pre-TNO reprojection task
                 label="large_mem",
                 max_workers=1,
                 provider=SlurmProvider(
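For context on why small_cpu can run this lean: Parsl apps are pinned to an executor by its label, so only tasks explicitly routed to "small_cpu" see the 1 core / 32 GB blocks. A minimal sketch, with a hypothetical task standing in for the real pre-TNO "URI file to ImageCollection" step (the decorator's executors argument is the only part taken from this config):

```python
from parsl import python_app


# Illustrative only: the body is a stand-in, not kbmod's actual API; the point
# is that executors=["small_cpu"] routes the task to the small 1-core blocks.
@python_app(executors=["small_cpu"])
def make_image_collection(uri_file, output_path):
    with open(uri_file) as f:
        uris = [line.strip() for line in f if line.strip()]
    # ... build the ImageCollection from `uris` and write it to output_path ...
    return output_path
```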
@@ -62,18 +63,19 @@ def klone_resource_config():
                 ),
             ),
             HighThroughputExecutor(
+                # This executor is used for reprojecting sharded WorkUnits
                 label="sharded_reproject",
                 max_workers=1,
                 provider=SlurmProvider(
                     partition="ckpt-g2",
                     account="astro",
                     min_blocks=0,
-                    max_blocks=2,
+                    max_blocks=16,
                     init_blocks=0,
                     parallelism=1,
                     nodes_per_block=1,
-                    cores_per_node=32,
-                    mem_per_node=128, # ~2-4 GB per core
+                    cores_per_node=8,
+                    mem_per_node=32, # ~2-4 GB per core
In this executor we're cranking up the maximum number of concurrent jobs running and decreasing the cores per node and memory.
                     exclusive=False,
                     walltime=walltimes["sharded_reproject"],
                     # Command to run before starting worker - i.e. conda activate <special_env>
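A back-of-envelope view of what this trades off (an illustrative calculation, not part of the PR): with max_workers=1 each block runs a single reprojection task, so the old settings allowed 2 concurrent tasks at 32 cores / 128 GB each, while the new ones allow up to 16 concurrent tasks at 8 cores / 32 GB each.

```python
# Rough comparison of the sharded_reproject executor before and after this
# change (one worker per block, one node per block).
old = {"max_blocks": 2, "cores_per_node": 32, "mem_per_node": 128}  # mem in GB
new = {"max_blocks": 16, "cores_per_node": 8, "mem_per_node": 32}

for name, cfg in [("old", old), ("new", new)]:
    tasks = cfg["max_blocks"]  # max_workers=1 -> one task per block
    cores = cfg["max_blocks"] * cfg["cores_per_node"]
    mem = cfg["max_blocks"] * cfg["mem_per_node"]
    print(f"{name}: {tasks} concurrent tasks, {cores} cores, {mem} GB in total")
# old: 2 concurrent tasks, 64 cores, 256 GB in total
# new: 16 concurrent tasks, 128 cores, 512 GB in total
```

The smaller per-block request should also generally be easier for Slurm to backfill on the shared ckpt-g2 partition.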
@@ -87,12 +89,12 @@ def klone_resource_config():
                     partition="ckpt-g2",
                     account="escience",
                     min_blocks=0,
-                    max_blocks=2,
+                    max_blocks=10,
                     init_blocks=0,
                     parallelism=1,
                     nodes_per_block=1,
-                    cores_per_node=2, # perhaps should be 8???
-                    mem_per_node=512, # In GB
+                    cores_per_node=1,
+                    mem_per_node=64, # In GB
                     exclusive=False,
                     walltime=walltimes["gpu_max"],
                     # Command to run before starting worker - i.e. conda activate <special_env>
I reduced the time requested for each of these. @DinoBektesevic I think that 1hr should generally be enough to finish a search, but let me know if this should be pushed back up.
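If a search does run past the limit, only the dictionary at the top of the file needs to change, since every provider reads its limit from walltimes[...] (e.g. walltime=walltimes["gpu_max"]). The values are Slurm-style "HH:MM:SS" strings; the 02:00:00 below is just an example of pushing it back up, not a recommendation:

```python
walltimes = {
    "compute_bigmem": "01:00:00",
    "large_mem": "04:00:00",
    "sharded_reproject": "01:00:00",
    "gpu_max": "02:00:00",  # example only: two hours per KBMOD search instead of one
}
```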