Refactored PTH DDP env vars creation in SLURM #2206
Conversation
# To cover case 1), let's ensure that defined RANK == SLURM_PROCID, LOCAL_RANK == SLURM_LOCALID,
# WORLD_SIZE == SLURM_NTASKS. We will use defined MASTER_ADDR and MASTER_PORT instead of defining
# them by our means
# To cover case 2), let's check that defined RANK >= SLURM_PROCID, LOCAL_RANK >= SLURM_LOCALID,
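For reference, a minimal Python sketch of the kind of consistency check these comments describe (assuming all six variables are already set in the environment; the function name and return values are illustrative, not the actual ignite code):

import os

def check_user_ddp_vars_against_slurm():
    # Illustrative sketch only, not the actual ignite implementation.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    slurm_procid = int(os.environ["SLURM_PROCID"])
    slurm_localid = int(os.environ["SLURM_LOCALID"])
    slurm_ntasks = int(os.environ["SLURM_NTASKS"])

    if rank == slurm_procid and local_rank == slurm_localid and world_size == slurm_ntasks:
        # Case 1): user-defined values agree with SLURM; keep the user's
        # MASTER_ADDR and MASTER_PORT instead of deriving them from SLURM.
        return 1
    if rank >= slurm_procid and local_rank >= slurm_localid and world_size >= slurm_ntasks:
        # Case 2): e.g. a single srun task that itself spawns several
        # processes via torch.distributed.launch.
        return 2
    raise RuntimeError("User-defined DDP env vars are inconsistent with SLURM env vars")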
If I understand correctly what is done, the idea is to ensure that, in such a case, the user didn't use srun by mistake:
srun python -m torch.distributed.launch ...
Therefore, every process should have a rank, local rank and world size greater than or equal to what is defined by SLURM.
Is that correct?
Case 2) is to cover a use case like:
srun -N1 -n1 -G8 python -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=1 --node_rank=0 \
    --master_addr="localhost" --master_port=1234 \
    main.py
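To make the comparison concrete, here is a hypothetical snippet each of the 8 launched processes could run to print both sets of variables (the expected values in the comments are assumptions based on the command above, not output taken from the PR):

import os

# Print the launcher-defined and SLURM-defined variables side by side.
for name in ("RANK", "LOCAL_RANK", "WORLD_SIZE",
             "SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS"):
    print(name, "=", os.environ.get(name))

# Expected here: SLURM_PROCID=0, SLURM_LOCALID=0, SLURM_NTASKS=1 (a single srun task),
# while torch.distributed.launch sets RANK=0..7, LOCAL_RANK=0..7 and WORLD_SIZE=8,
# which is why case 2) uses ">=" comparisons rather than equality.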
For case 2), RANK >= SLURM_PROCID and LOCAL_RANK >= SLURM_LOCALID mean that one process is spawned on the node by srun, but WORLD_SIZE >= SLURM_NTASKS sounds weird. SLURM_NTASKS is the maximum number of tasks checked by SLURM. If WORLD_SIZE is greater than that value, the scheduler should kill the process, because more resources than allocated for the job are being used. I think it could be an issue when using gloo.
What do you think?
Otherwise, it works on my side.
In the example above, srun -N1 -n1 allocates SLURM_NTASKS=1, but the launcher creates a world size of 8. That's why WORLD_SIZE >= SLURM_NTASKS. Am I missing something?
See here as well : https://www.hpcworkshops.com/08-ml-on-parallelcluster/03-distributed-data-parallel.html
Yes, but the SLURM scheduler can kill the job if more resources than allocated are used. I suppose it depends on how the scheduler is configured. Imagine you schedule a job defining 4 tasks but actually use 8; it could be a big issue in production. Anyway, I think that does not really matter. This is a user constraint that we can't handle.
LGTM!
Fixes #2202
Related to #2048
Description:
Check list: