Refactored PTH DDP env vars creation in SLURM #2206
Conversation
# To cover case 1), let's ensure that defined RANK == SLURM_PROCID, LOCAL_RANK == SLURM_LOCALID,
# WORLD_SIZE == SLURM_NTASKS. We will use defined MASTER_ADDR and MASTER_PORT instead of defining
# them by our means
# To cover case 2), let's check that defined RANK >= SLURM_PROCID, LOCAL_RANK >= SLURM_LOCALID,
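For reference, a minimal Python sketch of the kind of consistency check these comments describe (assuming all six variables are already set in the environment; the function name and return values are illustrative, not the actual ignite code):

import os

def check_user_ddp_vars_against_slurm():
    # Illustrative sketch only, not the actual ignite implementation.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    slurm_procid = int(os.environ["SLURM_PROCID"])
    slurm_localid = int(os.environ["SLURM_LOCALID"])
    slurm_ntasks = int(os.environ["SLURM_NTASKS"])

    if rank == slurm_procid and local_rank == slurm_localid and world_size == slurm_ntasks:
        # Case 1): user-defined values agree with SLURM; keep the user's
        # MASTER_ADDR and MASTER_PORT instead of deriving them from SLURM.
        return 1
    if rank >= slurm_procid and local_rank >= slurm_localid and world_size >= slurm_ntasks:
        # Case 2): e.g. a single srun task that itself spawns several
        # processes via torch.distributed.launch.
        return 2
    raise RuntimeError("User-defined DDP env vars are inconsistent with SLURM env vars")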
If I understand correctly what is done, the idea is to ensure that, in such a case, the user didn't use srun by mistake:
srun python -m torch.distributed.launch ...
Therefore, every process should have a rank, local rank and world size greater than or equal to what is defined by SLURM.
Is that correct?
Case 2) is to cover a use case like:
srun -N1 -n1 -G8 python -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=1 --node_rank=0 \
    --master_addr="localhost" --master_port=1234 \
    main.py
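To make the comparison concrete, here is a hypothetical snippet each of the 8 launched processes could run to print both sets of variables (the expected values in the comments are assumptions based on the command above, not output taken from the PR):

import os

# Print the launcher-defined and SLURM-defined variables side by side.
for name in ("RANK", "LOCAL_RANK", "WORLD_SIZE",
             "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"):
    print(name, "=", os.environ.get(name))

# Expected here: SLURM_PROCID=0, SLURM_LOCALID=0, SLURM_NTASKS=1 (a single srun task),
# while torch.distributed.launch sets RANK=0..7, LOCAL_RANK=0..7 and WORLD_SIZE=8,
# which is why case 2) uses ">=" comparisons rather than equality.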
For case 2), RANK >= SLURM_PROCID and LOCAL_RANK >= SLURM_LOCALID mean that one process is spawned on the node by srun, but WORLD_SIZE >= SLURM_NTASKS sounds weird. SLURM_NTASKS is the maximum number of tasks checked by SLURM. If WORLD_SIZE is greater than that value, the scheduler should kill the process, because more resources than allocated for the job are being used. I think it could be an issue when using gloo.
What do you think?
Otherwise, it works on my side.
In the example above, srun -N1 -n1 allocates SLURM_NTASKS=1, but the launcher creates a world size of 8. That's why WORLD_SIZE >= SLURM_NTASKS. Am I missing something?
See here as well : https://www.hpcworkshops.com/08-ml-on-parallelcluster/03-distributed-data-parallel.html
Yes, but the SLURM scheduler can kill the job if more resources than allocated are used. I suppose it depends on how the scheduler is configured. Imagine you schedule a job defining 4 tasks but actually use 8; it could be a big issue in production. Anyway, I think that does not really matter. This is a user constraint that we can't handle.
LGTM!
Fixes #2202
Related to #2048
Description:
Check list: