
vizier-core stuck in CrashLoopBackoff due to failed pod checks #322

Closed
pdmack opened this issue Jan 13, 2019 · 5 comments

@pdmack
Member

pdmack commented Jan 13, 2019

#270 implemented gRPC health checking, but I'm trying to make sense of why vizier-core (0.4.0) falls into CrashLoopBackOff due to failed readiness/liveness checks in my master + 3 compute node deployment.

Events:
  Type     Reason     Age                  From                                 Message
  ----     ------     ----                 ----                                 -------
  Normal   Pulled     1h (x261 over 15h)   kubelet, node2.REDACTED.internal  Container image "gcr.io/kubeflow-images-public/katib/vizier-core:v0.4.0" already present on machine
  Warning  Unhealthy  46m (x800 over 15h)  kubelet, node2.REDACTED.internal  Readiness probe failed: timeout: failed to connect service ":6789" within 1s
  Warning  BackOff    6m (x3388 over 15h)  kubelet, node2.REDACTED.internal  Back-off restarting failed container
  Warning  Unhealthy  1m (x866 over 15h)   kubelet, node2.REDACTED.internal  Liveness probe failed: timeout: failed to connect service ":6789" within 1s
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
vizier-core                               NodePort    172.30.193.128   <none>        6789:30681/TCP      1d
vizier-core-rest                         ClusterIP   172.30.154.220   <none>        80/TCP              1d
vizier-db                                ClusterIP   172.30.8.147     <none>        3306/TCP            1d
vizier-suggestion-bayesianoptimization   ClusterIP   172.30.68.61     <none>        6789/TCP            1d
vizier-suggestion-grid                   ClusterIP   172.30.241.37    <none>        6789/TCP            1d
vizier-suggestion-hyperband              ClusterIP   172.30.251.108   <none>        6789/TCP            1d
vizier-suggestion-random                 ClusterIP   172.30.26.32     <none>        6789/TCP            1d
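
For reference, the "failed to connect service ... within 1s" wording matches what a grpc_health_probe-style client prints, so the readiness/liveness probe is presumably dialing the manager's gRPC port and calling the standard grpc.health.v1 Check RPC. A minimal Go sketch of that kind of check (illustrative only, not Katib's actual probe binary):

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Dial the manager's gRPC port inside the pod; if the connection or the
	// health Check RPC does not succeed within the timeout, exit non-zero so
	// kubelet records the probe as failed.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, ":6789", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		fmt.Printf("timeout: failed to connect service %q within 1s\n", ":6789")
		os.Exit(1)
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil || resp.Status != healthpb.HealthCheckResponse_SERVING {
		os.Exit(1)
	}
}

If the manager never starts listening on 6789, for example because it is blocked waiting on its database, every probe times out, kubelet restarts the container, and the pod cycles through CrashLoopBackOff exactly as in the events above.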
@johnugeorge
Member

@pdmack My deployment looks ok with 0.4.0. Did you change anything in your environment? Btw, are other Katib pods up?

@pdmack
Member Author

pdmack commented Jan 13, 2019

Yes, the other vizier pods are running. Is this a local check within the pod that verifies the vizier-manager is up and running?

@johnugeorge
Member

@pdmack
Yes. This is to check if the katib manager is ready to accept requests: https://github.com/kubeflow/katib/blob/master/cmd/manager/main.go#L346

  1. Can you verify that the vizier-db pod is up and active? I have seen a similar situation when the db pod was stuck in Pending due to PVC issues.
  2. Can you check the logs of the vizier-core pod? Do you see this log line: https://github.com/kubeflow/katib/blob/master/cmd/manager/main.go#L343
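
For context, a typical way a Go gRPC server exposes the readiness signal that this check queries is the stock grpc_health_v1 health service; the sketch below is illustrative and not the actual cmd/manager/main.go code:

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	listener, err := net.Listen("tcp", ":6789")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	server := grpc.NewServer()

	// Register the standard health service so probes can call Check(). This
	// sketch marks SERVING immediately; a real manager would do so only after
	// its dependencies (e.g. the database) are reachable.
	healthServer := health.NewServer()
	healthpb.RegisterHealthServer(server, healthServer)
	healthServer.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

	log.Println("serving gRPC on :6789")
	if err := server.Serve(listener); err != nil {
		log.Fatalf("serve failed: %v", err)
	}
}

If the process never reaches Serve() because it is blocked on an unreachable database, the probe against :6789 fails, which is consistent with the behavior reported here.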

@pdmack
Member Author

pdmack commented Jan 14, 2019

This is OpenShift 3.11 with generous permissions, but I'm thinking there's something subtle I'm missing. A standalone 3.11 env (AIO) doesn't exhibit this problem.

@pdmack
Member Author

pdmack commented Jan 15, 2019

@johnugeorge yeah, it turned out that vizier-db was the culprit, even though it reported as Running. I got around this by fixing permissions on the backing store used by the storage provisioner.

I'll try to remember to file something for health/readiness checks on vizier-db.
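
A rough sketch of what a health/readiness check for vizier-db could look like: ping the database rather than trusting the pod phase. The DB_DSN environment variable and the DSN format below are illustrative assumptions, not Katib's actual configuration:

package main

import (
	"context"
	"database/sql"
	"fmt"
	"os"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// dbReady returns nil only if the database accepts connections and answers a
// ping within the timeout; a pod phase of Running alone does not guarantee that.
func dbReady(dsn string, timeout time.Duration) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return db.PingContext(ctx)
}

func main() {
	// Hypothetical DSN, e.g. "user:password@tcp(vizier-db:3306)/vizier"
	if err := dbReady(os.Getenv("DB_DSN"), 5*time.Second); err != nil {
		fmt.Fprintf(os.Stderr, "vizier-db not ready: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("vizier-db ready")
}

Something like this wired into a readiness probe on vizier-db would have surfaced the problem earlier, since the pod reported Running while the database was not actually usable.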

@pdmack pdmack closed this as completed Jan 15, 2019