PET_NNODES env var for PyTorchJobs is incorrect when elasticPolicy is set #2277

Open
alenawang opened this issue Oct 8, 2024 · 1 comment


@alenawang
Contributor

What happened?

When elasticPolicy is set in the manifest but minReplicas and maxReplicas are not passed explicitly, the PET_NNODES env var is set to x:x, where x is the number of worker replicas only; the master replica does not seem to be included in this count. When elasticPolicy is not set, PET_NNODES is set to a single number equal to the master plus the number of worker replicas, which seems correct.

What did you expect to happen?

We expected PET_NNODES to be set to x:x where x is the total number of replicas (master + workers). Does this make sense? If so we would be interested in contributing this fix.
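
To make the difference concrete, here is a minimal Python sketch of the reported and the expected values. It only models the behavior described in this issue; it is not the training operator's code, and the function names `reported_pet_nnodes` and `expected_pet_nnodes` are hypothetical.

```python
def reported_pet_nnodes(workers: int, master: int = 1, elastic: bool = False) -> str:
    """Model of the behavior reported above (hypothetical helper, not operator code)."""
    if elastic:
        # elasticPolicy set, minReplicas/maxReplicas unset:
        # both bounds come from the worker count only, the master is left out.
        return f"{workers}:{workers}"
    # No elasticPolicy: a single number, master + workers.
    return str(master + workers)


def expected_pet_nnodes(workers: int, master: int = 1, elastic: bool = False) -> str:
    """Model of what we expected: the master is always counted."""
    total = master + workers
    return f"{total}:{total}" if elastic else str(total)


if __name__ == "__main__":
    # One master and two workers, elasticPolicy set without explicit min/max.
    print("reported:", reported_pet_nnodes(workers=2, elastic=True))
    print("expected:", expected_pet_nnodes(workers=2, elastic=True))
```

For a job with one master and two workers, the sketch gives a worker-only range in the elastic case, while the non-elastic value already counts all three replicas.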

Environment

Kubernetes version:
v1.29.8

Training Operator version:
v1-855e096, also tested a local build using the latest on master

Training Operator Python SDK version:
N/A

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@kuizhiqing
Member

kuizhiqing commented Oct 10, 2024

Thanks for this feedback.

Actually, by design, there is no need to set a master at all; see https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml.

This design makes sense, since in the elastic scenario nodes are treated equally.
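
As a rough illustration of why a worker-only job avoids the mismatch, the sketch below parses an nnodes value in either the single-number or the min:max form and compares it with the pod count. `parse_nnodes` is a hypothetical helper written for this comment, not a function from PyTorch or the training operator, and the 3-worker job is an assumed example.

```python
from typing import Tuple


def parse_nnodes(value: str) -> Tuple[int, int]:
    """Parse "N" or "MIN:MAX" into a (min, max) pair (hypothetical helper)."""
    parts = value.split(":")
    if len(parts) == 1:
        n = int(parts[0])
        return n, n
    if len(parts) == 2:
        return int(parts[0]), int(parts[1])
    raise ValueError(f"unexpected nnodes value: {value!r}")


if __name__ == "__main__":
    # Assumed elastic job with 3 workers and no master replica.
    workers, master = 3, 0
    min_nodes, max_nodes = parse_nnodes(f"{workers}:{workers}")
    # With no master, the worker count is the total pod count, so the range matches.
    print(min_nodes, max_nodes, master + workers)
```

With no master replica defined, the worker count and the total replica count are the same number, so a worker-only x:x range covers every pod in the job.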
