Remote launcher plugin (#73)
* basic launcher/config structure from official hydra examples in place

* remote_launcher_plugin progress

Signed-off-by: César Miguel Valdez Córdova <cesar.valdez@mila.quebec>

* removed repo_dir param, config interpolation

* add uv lock

* remote launcher params, configs

* unrolled config to try params to circumvent pre-emption

* more configs

* Tweaks

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add a `cluster` config group, tweak resources

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Tweak the configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove hydra_plugins folder, tweak hydra `Plugins`

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix weird PL .log bug in callback

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* add first mock test

* removed direct mock call, overrides for argv

* add trainer params, assertion

* added assertion to mock calls

* avoid parameter overwriting on executor

* hotpatched uv not found error, path instead of string

* Debug issues with test for cluster group

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Try to make things more explicit

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add "cpu" and "gpu" resources configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add Beluga and Cedar configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Update the remote executor

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Update the remote slurm executor to latest commit

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Update executor plugin, add todos

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* debug config revamp

* Add back the one_gpu.yaml config (local only)

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* keep basic assertion within test

* Fix pre-commit version issue

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* empty line at the end of debug.yaml

* Fix pre-commit issues

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Rename test module

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add missing marks on test

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove debug config

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Show git diff when a pre-commit hook fails

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add back weirdly formatted import in main.py?!

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove duplicated, outdated test in main_test.py

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix the one_gpu.yaml config not having a target

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Don't include DRAC slurm account in configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* WIP: Use the same resource group local and remote

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add a patched config for submitit_slurm

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Use patched config in `cluster=current`

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix structure of log dirs with submitit launchers

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Remove the duplicate 'one_gpu' config

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Undo change to Trainer default config

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Silence tiny typing error in PatchedSlurmQueueConf

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add test to load the configs

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix test for remote launcher plugin

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Fix bug in overwriting of `setup` from executor

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

* Add marks to tests for the remote launcher plugin

Signed-off-by: Fabrice Normandin <normandf@mila.quebec>

---------

Signed-off-by: César Miguel Valdez Córdova <cesar.valdez@mila.quebec>
Signed-off-by: Fabrice Normandin <normandf@mila.quebec>
Co-authored-by: Fabrice Normandin <normandf@mila.quebec>
cmvcordova and lebrice authored Oct 24, 2024
1 parent f753a20 commit 0031536
Showing 19 changed files with 1,151 additions and 285 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -31,7 +31,7 @@ jobs:
       - run: pip install 'pre-commit<4.0.0'
       - run: pre-commit --version
       - run: pre-commit install
-      - run: pre-commit run --all-files
+      - run: pre-commit run --all-files --show-diff-on-failure
 
   check_docs:
     needs: [linting]
2 changes: 1 addition & 1 deletion docs/profiling_test.py
@@ -83,7 +83,7 @@
 """
 experiment=profiling \
 algorithm=example \
-resources=one_gpu \
+resources=gpu \
 hydra.launcher.gres='gpu:a100:1' \
 hydra.launcher.cpus_per_task=4 \
 datamodule.num_workers=8 \
12 changes: 6 additions & 6 deletions project/algorithms/callbacks/samples_per_second.py
@@ -57,11 +57,11 @@ def on_shared_batch_end(
         if phase in self.last_step_times:
             elapsed = now - self.last_step_times[phase]
             batch_size = self.get_num_samples(batch)
-            self.log(
+            pl_module.log(
                 f"{phase}/samples_per_second",
                 batch_size / elapsed,
-                module=pl_module,
-                trainer=trainer,
+                # module=pl_module,
+                # trainer=trainer,
                 prog_bar=True,
                 on_step=True,
                 on_epoch=True,
@@ -114,11 +114,11 @@ def on_before_optimizer_step(
             key = "ups"
         else:
             key = f"optimizer_{opt_idx}/ups"
-        self.log(
+        pl_module.log(
             key,
             updates_per_second,
-            module=pl_module,
-            trainer=trainer,
+            # module=pl_module,
+            # trainer=trainer,
             prog_bar=False,
             on_step=True,
         )
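
This change routes the metric through `pl_module.log`, since a Lightning `Callback` has no `log` method of its own; the old `self.log(..., module=..., trainer=...)` call presumably went through the custom helper that the commit message calls the "weird PL .log bug". A minimal sketch of the same pattern, with a hypothetical callback name and an assumed (inputs, targets) batch layout:

    import time

    import lightning


    class ThroughputCallback(lightning.Callback):
        """Toy example: log a metric from a callback by going through the LightningModule."""

        def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
            self._t0 = time.perf_counter()

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            elapsed = time.perf_counter() - self._t0
            inputs, _ = batch  # assumes an (inputs, targets) batch layout
            # LightningModule.log handles step/epoch aggregation and the progress bar.
            pl_module.log("train/samples_per_second", inputs.shape[0] / elapsed, on_step=True, prog_bar=True)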
1 change: 1 addition & 0 deletions project/algorithms/jax_rl_example_test.py
@@ -645,6 +645,7 @@ def get_num_samples(self, batch: TrajectoryWithLastObs) -> int:
     def on_fit_end(self, trainer: lightning.Trainer, pl_module: lightning.LightningModule) -> None:
         super().on_fit_end(trainer, pl_module)
 
+    @override
     def log(
         self,
         name: str,
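
For context: `@override` (from `typing` on Python >= 3.12, or `typing_extensions` on older versions) asks the type checker to verify that the decorated method really overrides one defined in a base class. A toy sketch:

    from typing_extensions import override  # `typing.override` on Python >= 3.12


    class Base:
        def log(self, name: str) -> None: ...


    class Child(Base):
        @override  # a type checker flags this if no base class defines `log`
        def log(self, name: str) -> None:
            print(f"logging {name}")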
13 changes: 11 additions & 2 deletions project/configs/__init__.py
@@ -3,24 +3,33 @@
from __future__ import annotations

from hydra.core.config_store import ConfigStore
from omegaconf import OmegaConf

from project.configs.algorithm.network import network_store
from project.configs.algorithm.optimizer import optimizers_store
from project.configs.config import Config
from project.configs.datamodule import datamodule_store

# from project.utils.env_vars import REPO_ROOTDIR, SLURM_JOB_ID, SLURM_TMPDIR
from project.utils.remote_launcher_plugin import RemoteSlurmQueueConf

cs = ConfigStore.instance()
cs.store(name="base_config", node=Config)

OmegaConf.register_new_resolver("eval", eval)


def add_configs_to_hydra_store():
    """Adds all configs to the Hydra Config store."""
    datamodule_store.add_to_hydra_store()
    network_store.add_to_hydra_store()
    optimizers_store.add_to_hydra_store()

    ConfigStore.instance().store(
        group="hydra/launcher",
        name="remote_submitit_slurm",
        node=RemoteSlurmQueueConf,
        provider="Mila",
    )


# todo: move the algorithm_store.add_to_hydra_store() here?

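
Among the additions above is an `eval` resolver registered with OmegaConf. A toy illustration of what this enables in config interpolations (the keys below are made up, not from this repo):

    from omegaconf import OmegaConf

    OmegaConf.register_new_resolver("eval", eval)  # as done in project/configs/__init__.py

    # hypothetical config: compute one field from another at interpolation time
    cfg = OmegaConf.create({"devices": 2, "num_workers": "${eval:'4 * ${devices}'}"})
    print(cfg.num_workers)  # -> 8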
12 changes: 12 additions & 0 deletions project/configs/cluster/beluga.yaml
@@ -0,0 +1,12 @@
# @package _global_
defaults:
  - narval

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!

hydra:
  launcher:
    executor:
      cluster_hostname: beluga
      internet_access_on_compute_nodes: false
12 changes: 12 additions & 0 deletions project/configs/cluster/cedar.yaml
@@ -0,0 +1,12 @@
# @package _global_
defaults:
  - narval

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!

hydra:
  launcher:
    executor:
      cluster_hostname: cedar
      internet_access_on_compute_nodes: true
14 changes: 14 additions & 0 deletions project/configs/cluster/current.yaml
@@ -0,0 +1,14 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm
hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: logs/${name}/multiruns
    subdir: ${hydra.job.id}
  launcher:
    stderr_to_stdout: true
    submitit_folder: ${hydra.sweep.dir}/%j
24 changes: 24 additions & 0 deletions project/configs/cluster/mila.yaml
@@ -0,0 +1,24 @@
# @package _global_
defaults:
  - override /hydra/launcher: remote_submitit_slurm

# Use this to specify which remote slurm cluster the job should run on.
# Remember to also use the resources group to select the resources allocated to the job!
hydra:
  mode: MULTIRUN
  run:
    # output directory, generated dynamically on each run
    dir: logs/${name}/runs/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: logs/${name}/multiruns
    subdir: ${hydra.job.id}

  launcher:
    executor:
      _target_: remote_slurm_executor.RemoteSlurmExecutor
      _partial_: true
      folder: "${hydra.sweep.dir}/%j"
      cluster_hostname: mila
      internet_access_on_compute_nodes: true

    stderr_to_stdout: true
9 changes: 9 additions & 0 deletions project/configs/cluster/narval.yaml
@@ -0,0 +1,9 @@
# @package _global_
defaults:
  - mila.yaml

hydra:
  launcher:
    executor:
      cluster_hostname: narval
      internet_access_on_compute_nodes: false
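
Together, mila.yaml, narval.yaml, beluga.yaml and cedar.yaml form the new `cluster` config group, which selects the SLURM cluster a job runs on. A sketch of how it might be combined with the `resources` group on the command line (the `project/main.py` entry point and the `algorithm=example` override are assumptions based on the rest of the template):

    python project/main.py algorithm=example cluster=narval resources=gpu

The `cluster` entry picks the executor and hostname (and forces MULTIRUN mode); the `resources` entries below control what is actually requested from SLURM.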
4 changes: 2 additions & 2 deletions project/configs/config.yaml
@@ -5,9 +5,9 @@ defaults:
   - optional datamodule: null
   - trainer: default.yaml
   - hydra: default.yaml
-
-  # Allows launching LOTS of runs in parallel on a cluster thanks to the submitit launcher.
   - resources: null
+  # Allows launching LOTS of runs in parallel on a cluster thanks to the submitit launcher.
+  - cluster: null
 
   # experiment configs allow for version control of specific hyperparameters
   # e.g. best hyperparameters for given model and datamodule
23 changes: 23 additions & 0 deletions project/configs/resources/cpu.yaml
@@ -0,0 +1,23 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm
trainer:
  accelerator: cpu
  devices: auto

hydra:
  mode: MULTIRUN
  launcher:
    nodes: 1
    tasks_per_node: 1
    cpus_per_task: 8
    mem_gb: 16
    array_parallelism: 16 # max num of jobs to run in parallel
    # Other things to pass to `sbatch`:
    additional_parameters:
      time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)


    ## A list of commands to add to the generated sbatch script before running srun:
    # setup:
    #   - export LD_PRELOAD=/some/folder/with/libraries/
22 changes: 22 additions & 0 deletions project/configs/resources/gpu.yaml
@@ -0,0 +1,22 @@
# @package _global_
defaults:
  - override /hydra/launcher: patched_submitit_slurm

hydra:
  mode: MULTIRUN
  launcher:
    cpus_per_task: 4
    gpus_per_task: 1
    array_parallelism: 16 # max num of jobs to run in parallel
    # Other things to pass to `sbatch`:
    additional_parameters:
      time: 1-00:00:00 # maximum wall time allocated for the job (D-HH:MM:SS)
      # TODO: It would be better to have those be arguments to the launcher (as is the case in the
      # RemoteLauncherPlugin), that way we could use only SLURM argument names..
      nodes: 1
      mem: 16G
      ntasks_per_node: 1

    ## A list of commands to add to the generated sbatch script before running srun:
    # setup:
    #   - export LD_PRELOAD=/some/folder/with/libraries/
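
These resource configs are defaults rather than fixed allocations; as the updated docstring in docs/profiling_test.py above shows, individual launcher fields can still be overridden per run, e.g.:

    resources=gpu hydra.launcher.gpus_per_task=2 hydra.launcher.additional_parameters.time=0-03:00:00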
26 changes: 0 additions & 26 deletions project/configs/resources/one_gpu.yaml

This file was deleted.

6 changes: 0 additions & 6 deletions project/configs/resources/two_gpus.yaml

This file was deleted.
