* basic launcher/config structure from official hydra examples in place
* remote_launcher_plugin progress
* basic launcher/config structure from official hydra examples in place
* remote_launcher_plugin progress
  Signed-off-by: César Miguel Valdez Córdova <cesar.valdez@mila.quebec>
* removed repo_dir param, config interpolation
* add uv lock
* remote launcher params, configs
* unrolled config to try params to circumvent pre-emption
* more configs
* Tweaks
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add a `cluster` config group, tweak resources
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Tweak the configs
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Remove hydra_plugins folder, tweak hydra `Plugins`
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Fix weird PL .log bug in callback
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* add first mock test
* removed direct mock call, overrides for argv
* add trainer params, assertion
* added assertion to mock calls
* avoid parameter overwriting on executor
* hotpatched uv not found error, path instead of string
* Debug issues with test for cluster group
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Try to make things more explicit
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add "cpu" and "gpu" resources configs
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add Beluga and Cedar configs
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Update the remote executor
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Update the remote slurm executor to latest commit
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Update executor plugin, add todos
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* debug config revamp
* Add back the one_gpu.yaml config (local only)
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* keep basic assertion within test
* Fix pre-commit version issue
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* empty line at the end of debug.yaml
* Fix pre-commit issues
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Rename test module
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add missing marks on test
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Remove debug config
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Show git diff when a pre-commit hook fails
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add back weirdly formatted import in main.py?!
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Remove duplicated, outdated test in main_test.py
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Fix the one_gpu.yaml config not having a target
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Don't include DRAC slurm account in configs
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* WIP: Use the same resource group local and remote
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add a patched config for submitit_slurm
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Use patched config in `cluster=current`
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Fix structure of log dirs with submitit launchers
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Remove the duplicate 'one_gpu' config
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Undo change to Trainer default config
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Silence tiny typing error in PatchedSlurmQueueConf
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add test to load the configs
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Fix test for remote launcher plugin
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Fix bug in overwriting of `setup` from executor
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
* Add marks to tests for the remote launcher plugin
  Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

---------

Signed-off-by: César Miguel Valdez Córdova <cesar.valdez@mila.quebec>
Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
Co-authored-by: Fabrice Normandin <normandf@mila.quebec>
1 parent f753a20, commit 0031536
Showing 19 changed files with 1,151 additions and 285 deletions.
@@ -0,0 +1,12 @@
# @package _global_
defaults:
  - narval

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!

hydra:
  launcher:
    executor:
      cluster_hostname: beluga
      internet_access_on_compute_nodes: false
@@ -0,0 +1,12 @@
# @package _global_
defaults:
  - narval

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!

hydra:
  launcher:
    executor:
      cluster_hostname: cedar
      internet_access_on_compute_nodes: true
@@ -0,0 +1,14 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm
hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: logs/${name}/multiruns
    subdir: ${hydra.job.id}
  launcher:
    stderr_to_stdout: true
    submitit_folder: ${hydra.sweep.dir}/%j
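The directory templates in this config combine a `${now:...}` interpolation, which takes a `strftime` pattern, with submitit's `%j` placeholder for the job id. The sketch below only imitates that expansion with plain Python string formatting so the resulting layout is easier to picture. The function names, the experiment name `example`, and the job id `123456` are hypothetical and not part of the commit; Hydra and submitit perform the real substitution.

```python
from datetime import datetime


def expand_run_dir(name: str, now: datetime) -> str:
    """Imitate `logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}` (illustration only)."""
    # datetime supports strftime patterns directly inside f-string format specs.
    return f"logs/{name}/runs/{now:%Y-%m-%d}/{now:%H-%M-%S}"


def expand_submitit_folder(name: str, job_id: str) -> str:
    """Imitate `${hydra.sweep.dir}/%j`, where `%j` stands for the job id."""
    sweep_dir = f"logs/{name}/multiruns"
    return f"{sweep_dir}/%j".replace("%j", job_id)


# Hypothetical values, chosen only for the illustration.
run_dir = expand_run_dir("example", datetime(2024, 1, 2, 3, 4, 5))
job_dir = expand_submitit_folder("example", "123456")
print(run_dir)  # logs/example/runs/2024-01-02/03-04-05
print(job_dir)  # logs/example/multiruns/123456
```

With this layout, single runs are grouped by date and time, while each job of a multirun gets its own folder under the sweep directory.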
@@ -0,0 +1,24 @@
# @package _global_
defaults:
  - override /hydra/launcher: remote_submitit_slurm

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!
hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: logs/${name}/multiruns
    subdir: ${hydra.job.id}

  launcher:
    executor:
      _target_: remote_slurm_executor.RemoteSlurmExecutor
      _partial_: true
      folder: "${hydra.sweep.dir}/%j"
      cluster_hostname: mila
      internet_access_on_compute_nodes: true

    stderr_to_stdout: true
@@ -0,0 +1,9 @@
# @package _global_
defaults:
  - mila.yaml

hydra:
  launcher:
    executor:
      cluster_hostname: narval
      internet_access_on_compute_nodes: false
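The cluster configs in this diff build on one another through their `defaults` lists: the narval config starts from `mila.yaml`, and the beluga and cedar configs start from narval, each overriding only the executor fields that differ. The sketch below imitates that layering with a plain recursive dictionary merge. It is an illustration under that assumption, not Hydra's actual composition code, and `deep_merge` is a hypothetical helper; only the executor keys shown in the diff are included.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which `override` wins over `base`, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Executor settings copied from the configs in this diff (all other keys omitted).
mila = {"hydra": {"launcher": {"executor": {
    "_target_": "remote_slurm_executor.RemoteSlurmExecutor",
    "_partial_": True,
    "cluster_hostname": "mila",
    "internet_access_on_compute_nodes": True,
}}}}
narval = {"hydra": {"launcher": {"executor": {
    "cluster_hostname": "narval",
    "internet_access_on_compute_nodes": False,
}}}}
beluga = {"hydra": {"launcher": {"executor": {
    "cluster_hostname": "beluga",
    "internet_access_on_compute_nodes": False,
}}}}

# Layering order assumed here: mila.yaml, then narval, then beluga (later entries win).
resolved = deep_merge(deep_merge(mila, narval), beluga)
executor = resolved["hydra"]["launcher"]["executor"]
print(executor["cluster_hostname"], executor["_target_"])
```

The point of the layering is that `_target_` and `_partial_` are written once in the base config, while each cluster file stays a few lines long.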
@@ -0,0 +1,23 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm
trainer:
  accelerator: cpu
  devices: auto

hydra:
  mode: MULTIRUN
  launcher:
    nodes: 1
    tasks_per_node: 1
    cpus_per_task: 8
    mem_gb: 16
    array_parallelism: 16 # max num of jobs to run in parallel
    # Other things to pass to `sbatch`:
    additional_parameters:
      time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)


    ## A list of commands to add to the generated sbatch script before running srun:
    # setup:
    # - export LD_PRELOAD=/some/folder/with/libraries/
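The `time` value above uses the `D-HH:MM:SS` form noted in the config comment, so `1-00:00:00` requests one day of wall time. A small sketch of how that form can be read is given below. `parse_slurm_time` is a hypothetical helper, not part of this commit or of submitit, and it handles only the `D-HH:MM:SS` form even though SLURM accepts other time formats.

```python
from datetime import timedelta


def parse_slurm_time(value: str) -> timedelta:
    """Parse a `D-HH:MM:SS` wall-time string into a timedelta (this form only)."""
    days_part, separator, clock = value.partition("-")
    if not separator:
        raise ValueError(f"expected D-HH:MM:SS, got {value!r}")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days_part), hours=hours, minutes=minutes, seconds=seconds)


wall_time = parse_slurm_time("1-00:00:00")
print(wall_time.total_seconds())  # 86400.0
```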
@@ -0,0 +1,22 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm

hydra:
  mode: MULTIRUN
  launcher:
    cpus_per_task: 4
    gpus_per_task: 1
    array_parallelism: 16 # max num of jobs to run in parallel
    # Other things to pass to `sbatch`:
    additional_parameters:
      time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)
      # TODO: It would be better to have those be arguments to the launcher (as is the case in the
      # RemoteLauncherPlugin), that way we could use only SLURM argument names..
      nodes: 1
      mem: 16G
      ntasks_per_node: 1

    ## A list of commands to add to the generated sbatch script before running srun:
    # setup:
    # - export LD_PRELOAD=/some/folder/with/libraries/
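The TODO above is about naming: entries such as `mem` and `ntasks_per_node` follow SLURM's own option names rather than the launcher's (`mem_gb`, `tasks_per_node`). As a rough picture of why SLURM names are needed for such pass-through entries, the sketch below turns a mapping like this one into `#SBATCH` directive lines. `to_sbatch_lines` is a hypothetical helper written for this illustration, not submitit's or the plugin's code, and it assumes the keys are meant to be forwarded to `sbatch` as options.

```python
def to_sbatch_lines(params: dict) -> list[str]:
    """Render a mapping of SLURM option names to `#SBATCH --option=value` lines."""
    # sbatch long options use dashes, so underscores in the keys are converted.
    return [f"#SBATCH --{key.replace('_', '-')}={value}" for key, value in params.items()]


# Values copied from the config in this diff.
pass_through = {
    "time": "1-00:00:00",
    "nodes": 1,
    "mem": "16G",
    "ntasks_per_node": 1,
}
for line in to_sbatch_lines(pass_through):
    print(line)
```

Under this reading, a launcher-style key such as `mem_gb` would produce an option that `sbatch` does not recognize, which is why the pass-through keys use SLURM's names.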
This file was deleted.
This file was deleted.