PyTorchJob didn't create worker pod, seems to hang #2327

Open
Twilighter9527 opened this issue Nov 15, 2024 · 10 comments

Comments

@Twilighter9527

What happened?

I followed this example to create a PyTorchJob: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml

pytorchjob yaml
image-20241115112121748

pod status
image-20241115112542445

kubectl describe pytorchjob -n kubeflow
image-20241115112149697

training operator log
image-20241115112300634

What did you expect to happen?

This should create pytorch-simple-master-0 and pytorch-simple-worker-0, but pytorch-simple-worker-0 seems to hang. The log shows that something went wrong converting the YAML to JSON, but I don't think that is the real cause. First, kubectl create -f simple.yaml reported no error. Second, I used Python to read the YAML and convert it to JSON, and that worked. I had a similar issue a few weeks ago and solved it by setting sidecar.istio=false. This time, however, the job is not in the default namespace, and I have labeled the node with sidecar.istio=false.

Environment

Kubernetes version:

$ kubectl version

image

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

image

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@andreyvelich
Member

Thanks for creating this @Twilighter9527!
How did you add the Istio annotations to your PyTorchJob? Did you use quotes?

sidecar.istio.io/inject: "false"
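
The question about quotes matters because Kubernetes annotation and label values must be strings, while an unquoted `false` in YAML is parsed as a boolean. A generic YAML-to-JSON conversion still succeeds in that case, so it does not rule the problem out. The sketch below is a minimal illustration, not the operator's own validation: the function name `non_string_values` and the sample manifest are hypothetical, and the manifest is written as JSON only to simulate what a YAML parser would produce for an unquoted value. Whether this is the cause of the error in this issue is an assumption, since the log is only available as a screenshot.

```python
import json


def non_string_values(node, path=""):
    """Return descriptions of annotation/label values that are not strings.

    Walks a parsed manifest (nested dicts and lists) and inspects every
    mapping found under a key named "annotations" or "labels".
    """
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else key
            if key in ("annotations", "labels") and isinstance(value, dict):
                for name, item in value.items():
                    if not isinstance(item, str):
                        problems.append(
                            f"{here}[{name!r}] = {item!r} ({type(item).__name__})"
                        )
            else:
                problems.extend(non_string_values(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(non_string_values(item, f"{path}[{index}]"))
    return problems


# Hypothetical manifest fragment: the Worker annotation is a JSON boolean,
# which is what an unquoted `false` in YAML turns into after parsing.
SAMPLE = json.loads("""
{
  "kind": "PyTorchJob",
  "spec": {
    "pytorchReplicaSpecs": {
      "Master": {"template": {"metadata": {"annotations":
        {"sidecar.istio.io/inject": "false"}}}},
      "Worker": {"template": {"metadata": {"annotations":
        {"sidecar.istio.io/inject": false}}}}
    }
  }
}
""")

for problem in non_string_values(SAMPLE):
    print(problem)
```

Running this against the parsed form of the actual job manifest would flag only the values that need quoting; an empty result means every annotation and label value is already a string.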

@Twilighter9527
Author

Thanks for creating this @Twilighter9527! How did you add the Istio annotations to your PyTorchJob? Did you use quotes?

sidecar.istio.io/inject: "false"

Yes, just like the demo. I even set the node-level label to sidecar.istio.io/inject: "false", but nothing changed.
Untitled

@Twilighter9527
Author

image

@kuizhiqing
Member

The error in the training operator log strongly suggests there is a mistake in the YAML configuration.

@Twilighter9527
Author

The error in the training operator log strongly suggests there is a mistake in the YAML configuration.

But I can create the pod successfully (py-master-0), and when I use only the master part or only the worker part of the YAML, creation succeeds.
image

@kuizhiqing
Member

Have you tried the original YAML file without changing anything?

@Twilighter9527
Author

Twilighter9527 commented Nov 18, 2024

Have you tried the original YAML file without changing anything?

I used this simple.yaml and changed only the image field.

@kuizhiqing
Member

kuizhiqing commented Nov 18, 2024

Try it with NOTHING changed. Even if the image cannot be found, that error would only appear after the pod has been created.

@Twilighter9527
Author

Twilighter9527 commented Nov 18, 2024

It fails when creating the master pod, and the log is the same as before. So weird.
image
image

@andreyvelich
Member

Hi @Twilighter9527, it looks like your cluster doesn't have access to the public Docker Hub registry.
Do you need to configure proxies to pull the public images?
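
One rough way to check registry access from a cluster node is sketched below. It is only an illustration under stated assumptions: the helper name `registry_reachable` is hypothetical, the default URL `https://registry-1.docker.io/v2/` is assumed to be the Docker Hub registry endpoint, and any HTTP response, including an authentication error, is treated as proof that the network path works. It does not test whether the container runtime itself is configured with the same proxy settings.

```python
import urllib.error
import urllib.request


def registry_reachable(
    url="https://registry-1.docker.io/v2/",  # assumed Docker Hub endpoint
    timeout=5.0,
    opener=urllib.request.urlopen,  # injectable so the check can be exercised offline
):
    """Return True if the registry answered at all, False if it could not be reached."""
    try:
        opener(url, timeout=timeout).close()
    except urllib.error.HTTPError:
        # The server responded (for example with 401 Unauthorized),
        # so the network path to the registry is open.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or similar.
        return False
    return True
```

Calling `registry_reachable()` on a node that cannot reach the registry should return `False`. `urllib` honors the `HTTPS_PROXY` environment variable, so running it with and without that variable set gives a hint about whether a proxy is required.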
