
vizier-core stuck in CrashLoopBackoff due to failed pod checks #322

Closed
pdmack opened this issue Jan 13, 2019 · 5 comments

@pdmack
Member

pdmack commented Jan 13, 2019

#270 implemented gRPC health checking, but I'm trying to make sense of why vizier-core (0.4.0) falls into CrashLoopBackOff due to failed readiness/liveness checks in my master + 3 compute node deployment.

Events:
  Type     Reason     Age                  From                                 Message
  ----     ------     ----                 ----                                 -------
  Normal   Pulled     1h (x261 over 15h)   kubelet, node2.REDACTED.internal  Container image "gcr.io/kubeflow-images-public/katib/vizier-core:v0.4.0" already present on machine
  Warning  Unhealthy  46m (x800 over 15h)  kubelet, node2.REDACTED.internal  Readiness probe failed: timeout: failed to connect service ":6789" within 1s
  Warning  BackOff    6m (x3388 over 15h)  kubelet, node2.REDACTED.internal  Back-off restarting failed container
  Warning  Unhealthy  1m (x866 over 15h)   kubelet, node2.REDACTED.internal  Liveness probe failed: timeout: failed to connect service ":6789" within 1s
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
vizier-core                               NodePort    172.30.193.128   <none>        6789:30681/TCP      1d
vizier-core-rest                         ClusterIP   172.30.154.220   <none>        80/TCP              1d
vizier-db                                ClusterIP   172.30.8.147     <none>        3306/TCP            1d
vizier-suggestion-bayesianoptimization   ClusterIP   172.30.68.61     <none>        6789/TCP            1d
vizier-suggestion-grid                   ClusterIP   172.30.241.37    <none>        6789/TCP            1d
vizier-suggestion-hyperband              ClusterIP   172.30.251.108   <none>        6789/TCP            1d
vizier-suggestion-random                 ClusterIP   172.30.26.32     <none>        6789/TCP            1d
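
For reference, the "failed to connect service ... within 1s" wording matches what a grpc_health_probe-style client prints, so the readiness/liveness probe is presumably dialing the manager's gRPC port and calling the standard grpc.health.v1 Check RPC. A minimal Go sketch of that kind of check (illustrative only, not Katib's actual probe binary):

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Dial the manager's gRPC port inside the pod; if the connection or the
	// health Check RPC does not succeed within the timeout, exit non-zero so
	// kubelet records the probe as failed.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, ":6789", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		fmt.Printf("timeout: failed to connect service %q within 1s\n", ":6789")
		os.Exit(1)
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil || resp.Status != healthpb.HealthCheckResponse_SERVING {
		os.Exit(1)
	}
}

If the manager never starts listening on 6789, for example because it is blocked waiting on its database, every probe times out, kubelet restarts the container, and the pod cycles through CrashLoopBackOff exactly as in the events above.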
@johnugeorge
Member

@pdmack My deployment looks ok with 0.4.0. Did you change anything in your environment? Btw, are other Katib pods up?

@pdmack
Member Author

pdmack commented Jan 13, 2019

Yes, the other vizier pods are running. Is this a local check within the pod that verifies the vizier-manager is up and running?

@johnugeorge
Member

@pdmack
Yes. This is to check if the katib manager is ready to accept requests: https://github.com/kubeflow/katib/blob/master/cmd/manager/main.go#L346

  1. Can you verify that the vizier-db pod is up and active? I have seen a similar situation when the db pod was stuck in Pending due to PVC issues.
  2. Can you check the logs of the vizier-core pod? Do you see this log line: https://github.com/kubeflow/katib/blob/master/cmd/manager/main.go#L343
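
For context, a typical way a Go gRPC server exposes the readiness signal that this check queries is the stock grpc_health_v1 health service; the sketch below is illustrative and not the actual cmd/manager/main.go code:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	listener, err := net.Listen("tcp", ":6789")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	server := grpc.NewServer()

	// Register the standard health service so probes can call Check(). This
	// sketch marks SERVING immediately; a real manager would do so only after
	// its dependencies (e.g. the database) are reachable.
	healthServer := health.NewServer()
	healthpb.RegisterHealthServer(server, healthServer)
	healthServer.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

	log.Println("serving gRPC on :6789")
	if err := server.Serve(listener); err != nil {
		log.Fatalf("serve failed: %v", err)
	}
}

If the process never reaches Serve() because it is blocked on an unreachable database, the probe against :6789 fails, which is consistent with the behavior reported here.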

@pdmack
Member Author

pdmack commented Jan 14, 2019

This is OpenShift 3.11 with generous permissions, but I'm thinking there's something subtle I'm missing. A standalone 3.11 env (AIO) doesn't exhibit this problem.

@pdmack
Member Author

pdmack commented Jan 15, 2019

@johnugeorge yeah, it turned out that vizier-db was the culprit, even though it reported as Running. I got around this by fixing permissions on the backing store used by the storage provisioner.

I'll try to remember to file something for health/readiness checks on vizier-db.
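
A rough sketch of what a health/readiness check for vizier-db could look like: ping the database rather than trusting the pod phase. The DB_DSN environment variable and the DSN format below are illustrative assumptions, not Katib's actual configuration:

package main

import (
	"context"
	"database/sql"
	"fmt"
	"os"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// dbReady returns nil only if the database accepts connections and answers a
// ping within the timeout; a pod phase of Running alone does not guarantee that.
func dbReady(dsn string, timeout time.Duration) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return db.PingContext(ctx)
}

func main() {
	// Hypothetical DSN, e.g. "user:password@tcp(vizier-db:3306)/vizier"
	if err := dbReady(os.Getenv("DB_DSN"), 5*time.Second); err != nil {
		fmt.Fprintf(os.Stderr, "vizier-db not ready: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("vizier-db ready")
}

Something like this wired into a readiness probe on vizier-db would have surfaced the problem earlier, since the pod reported Running while the database was not actually usable.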

@pdmack pdmack closed this as completed Jan 15, 2019