Remote launcher plugin (#73)
* basic launcher/config structure from official hydra examples in place

* remote_launcher_plugin progress

Signed-off-by: César Miguel Valdez Córdova <cesar.valdez@mila.quebec>

* removed repo_dir param, config interpolation

* add uv lock

* remote launcher params, configs

* unrolled config to try params to circumvent pre-emption

* more configs

* Tweaks

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add a `cluster` config group, tweak resources

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Tweak the configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove hydra_plugins folder, tweak hydra `Plugins`

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix weird PL .log bug in callback

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* add first mock test

* removed direct mock call, overrides for argv

* add trainer params, assertion

* added assertion to mock calls

* avoid parameter overwriting on executor

* hotpatched uv not found error, path instead of string

* Debug issues with test for cluster group

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Try to make things more explicit

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add "cpu" and "gpu" resources configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add Beluga and Cedar configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Update the remote executor

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Update the remote slurm executor to latest commit

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Update executor plugin, add todos

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* debug config revamp

* Add back the one_gpu.yaml config (local only)

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* keep basic assertion within test

* Fix pre-commit version issue

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* empty line at the end of debug.yaml

* Fix pre-commit issues

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Rename test module

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add missing marks on test

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove debug config

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Show git diff when a pre-commit hook fails

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add back weirdly formatted import in main.py?!

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove duplicated, outdated test in main_test.py

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix the one_gpu.yaml config not having a target

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Don't include DRAC slurm account in configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* WIP: Use the same resource group local and remote

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add a patched config for submitit_slurm

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Use patched config in `cluster=current`

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix structure of log dirs with submitit launchers

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove the duplicate 'one_gpu' config

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Undo change to Trainer default config

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Silence tiny typing error in PatchedSlurmQueueConf

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add test to load the configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix test for remote launcher plugin

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix bug in overwriting of `setup` from executor

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add marks to tests for the remote launcher plugin

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

---------

Signed-off-by: César Miguel Valdez Córdova <cesar.valdez@mila.quebec>
Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
Co-authored-by: Fabrice Normandin <normandf@mila.quebec>
cmvcordova and lebrice authored Oct 24, 2024
1 parent f753a20 commit 0031536
Showing 19 changed files with 1,151 additions and 285 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -31,7 +31,7 @@ jobs:
       - run: pip install 'pre-commit<4.0.0'
       - run: pre-commit --version
       - run: pre-commit install
-      - run: pre-commit run --all-files
+      - run: pre-commit run --all-files --show-diff-on-failure
 
   check_docs:
     needs: [linting]
2 changes: 1 addition & 1 deletion docs/profiling_test.py
@@ -83,7 +83,7 @@
 """
 experiment=profiling \
 algorithm=example \
-resources=one_gpu \
+resources=gpu \
 hydra.launcher.gres='gpu:a100:1' \
 hydra.launcher.cpus_per_task=4 \
 datamodule.num_workers=8 \
12 changes: 6 additions & 6 deletions project/algorithms/callbacks/samples_per_second.py
@@ -57,11 +57,11 @@ def on_shared_batch_end(
         if phase in self.last_step_times:
             elapsed = now - self.last_step_times[phase]
             batch_size = self.get_num_samples(batch)
-            self.log(
+            pl_module.log(
                 f"{phase}/samples_per_second",
                 batch_size / elapsed,
-                module=pl_module,
-                trainer=trainer,
+                # module=pl_module,
+                # trainer=trainer,
                 prog_bar=True,
                 on_step=True,
                 on_epoch=True,
@@ -114,11 +114,11 @@ def on_before_optimizer_step(
             key = "ups"
         else:
             key = f"optimizer_{opt_idx}/ups"
-        self.log(
+        pl_module.log(
             key,
             updates_per_second,
-            module=pl_module,
-            trainer=trainer,
+            # module=pl_module,
+            # trainer=trainer,
             prog_bar=False,
             on_step=True,
         )
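
This change routes the metric through `pl_module.log`, since a Lightning `Callback` has no `log` method of its own; the old `self.log(..., module=..., trainer=...)` call presumably went through the custom helper that the commit message calls the "weird PL .log bug". A minimal sketch of the same pattern, with a hypothetical callback name and an assumed (inputs, targets) batch layout:

    import time

    import lightning


    class ThroughputCallback(lightning.Callback):
        """Toy example: log a metric from a callback by going through the LightningModule."""

        def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
            self._t0 = time.perf_counter()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            elapsed = time.perf_counter() - self._t0
            inputs, _ = batch  # assumes an (inputs, targets) batch layout
            # LightningModule.log handles step/epoch aggregation and the progress bar.
            pl_module.log("train/samples_per_second", inputs.shape[0] / elapsed, on_step=True, prog_bar=True)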
1 change: 1 addition & 0 deletions project/algorithms/jax_rl_example_test.py
@@ -645,6 +645,7 @@ def get_num_samples(self, batch: TrajectoryWithLastObs) -> int:
     def on_fit_end(self, trainer: lightning.Trainer, pl_module: lightning.LightningModule) -> None:
         super().on_fit_end(trainer, pl_module)
 
+    @override
     def log(
         self,
         name: str,
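
For context: `@override` (from `typing` on Python >= 3.12, or `typing_extensions` on older versions) asks the type checker to verify that the decorated method really overrides one defined in a base class. A toy sketch:

    from typing_extensions import override  # `typing.override` on Python >= 3.12


    class Base:
        def log(self, name: str) -> None: ...


    class Child(Base):
        @override  # a type checker flags this if no base class defines `log`
        def log(self, name: str) -> None:
            print(f"logging {name}")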
13 changes: 11 additions & 2 deletions project/configs/__init__.py
@@ -3,24 +3,33 @@
from __future__ import annotations

from hydra.core.config_store import ConfigStore
from omegaconf import OmegaConf

from project.configs.algorithm.network import network_store
from project.configs.algorithm.optimizer import optimizers_store
from project.configs.config import Config
from project.configs.datamodule import datamodule_store

# from project.utils.env_vars import REPO_ROOTDIR, SLURM_JOB_ID, SLURM_TMPDIR
from project.utils.remote_launcher_plugin import RemoteSlurmQueueConf

cs = ConfigStore.instance()
cs.store(name="base_config", node=Config)

OmegaConf.register_new_resolver("eval", eval)


def add_configs_to_hydra_store():
    """Adds all configs to the Hydra Config store."""
    datamodule_store.add_to_hydra_store()
    network_store.add_to_hydra_store()
    optimizers_store.add_to_hydra_store()

    ConfigStore.instance().store(
        group="hydra/launcher",
        name="remote_submitit_slurm",
        node=RemoteSlurmQueueConf,
        provider="Mila",
    )


# todo: move the algorithm_store.add_to_hydra_store() here?

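
Among the additions above is an `eval` resolver registered with OmegaConf. A toy illustration of what this enables in config interpolations (the keys below are made up, not from this repo):

    from omegaconf import OmegaConf

    OmegaConf.register_new_resolver("eval", eval)  # as done in project/configs/__init__.py

    # hypothetical config: compute one field from another at interpolation time
    cfg = OmegaConf.create({"devices": 2, "num_workers": "${eval:'4 * ${devices}'}"})
    print(cfg.num_workers)  # -> 8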
12 changes: 12 additions & 0 deletions project/configs/cluster/beluga.yaml
@@ -0,0 +1,12 @@
# @package _global_
defaults:
  - narval

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!

hydra:
  launcher:
    executor:
      cluster_hostname: beluga
      internet_access_on_compute_nodes: false
12 changes: 12 additions & 0 deletions project/configs/cluster/cedar.yaml
@@ -0,0 +1,12 @@
# @package _global_
defaults:
  - narval

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!

hydra:
  launcher:
    executor:
      cluster_hostname: cedar
      internet_access_on_compute_nodes: true
14 changes: 14 additions & 0 deletions project/configs/cluster/current.yaml
@@ -0,0 +1,14 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm
hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: logs/${name}/multiruns
    subdir: ${hydra.job.id}
  launcher:
    stderr_to_stdout: true
    submitit_folder: ${hydra.sweep.dir}/%j
24 changes: 24 additions & 0 deletions project/configs/cluster/mila.yaml
@@ -0,0 +1,24 @@
# @package _global_
defaults:
  - override /hydra/launcher: remote_submitit_slurm

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!
hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: logs/${name}/multiruns
    subdir: ${hydra.job.id}

  launcher:
    executor:
      _target_: remote_slurm_executor.RemoteSlurmExecutor
      _partial_: true
      folder: "${hydra.sweep.dir}/%j"
      cluster_hostname: mila
      internet_access_on_compute_nodes: true

    stderr_to_stdout: true
9 changes: 9 additions & 0 deletions project/configs/cluster/narval.yaml
@@ -0,0 +1,9 @@
# @package _global_
defaults:
  - mila.yaml

hydra:
  launcher:
    executor:
      cluster_hostname: narval
      internet_access_on_compute_nodes: false
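
Together, mila.yaml, narval.yaml, beluga.yaml and cedar.yaml form the new `cluster` config group, which selects the SLURM cluster a job runs on. A sketch of how it might be combined with the `resources` group on the command line (the `project/main.py` entry point and the `algorithm=example` override are assumptions based on the rest of the template):

    python project/main.py algorithm=example cluster=narval resources=gpu

The `cluster` entry picks the executor and hostname (and forces MULTIRUN mode); the `resources` entries below control what is actually requested from SLURM.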
4 changes: 2 additions & 2 deletions project/configs/config.yaml
@@ -5,9 +5,9 @@ defaults:
   - optional datamodule: null
   - trainer: default.yaml
   - hydra: default.yaml
-
-  # Allows launching LOTS of runs in parallel on a cluster thanks to the submitit launcher.
   - resources: null
+  # Allows launching LOTS of runs in parallel on a cluster thanks to the submitit launcher.
+  - cluster: null
 
   # experiment configs allow for version control of specific hyperparameters
   # e.g. best hyperparameters for given model and datamodule
23 changes: 23 additions & 0 deletions project/configs/resources/cpu.yaml
@@ -0,0 +1,23 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm
trainer:
  accelerator: cpu
  devices: auto

hydra:
  mode: MULTIRUN
  launcher:
    nodes: 1
    tasks_per_node: 1
    cpus_per_task: 8
    mem_gb: 16
    array_parallelism: 16 # max num of jobs to run in parallel
    # Other things to pass to `sbatch`:
    additional_parameters:
      time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)


    ## A list of commands to add to the generated sbatch script before running srun:
    # setup:
    #   - export LD_PRELOAD=/some/folder/with/libraries/
22 changes: 22 additions & 0 deletions project/configs/resources/gpu.yaml
@@ -0,0 +1,22 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm

hydra:
  mode: MULTIRUN
  launcher:
    cpus_per_task: 4
    gpus_per_task: 1
    array_parallelism: 16 # max num of jobs to run in parallel
    # Other things to pass to `sbatch`:
    additional_parameters:
      time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)
      # TODO: It would be better to have those be arguments to the launcher (as is the case in the
      # RemoteLauncherPlugin), that way we could use only SLURM argument names..
      nodes: 1
      mem: 16G
      ntasks_per_node: 1

    ## A list of commands to add to the generated sbatch script before running srun:
    # setup:
    #   - export LD_PRELOAD=/some/folder/with/libraries/
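
These resource configs are defaults rather than fixed allocations; as the updated docstring in docs/profiling_test.py above shows, individual launcher fields can still be overridden per run, e.g.:

    resources=gpu hydra.launcher.gpus_per_task=2 hydra.launcher.additional_parameters.time=0-03:00:00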
26 changes: 0 additions & 26 deletions project/configs/resources/one_gpu.yaml

This file was deleted.

6 changes: 0 additions & 6 deletions project/configs/resources/two_gpus.yaml

This file was deleted.
