LoadBalancerNegNotReady #931

Closed
RafiGreenberg opened this issue Nov 6, 2019 · 8 comments
Labels: lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed)


RafiGreenberg commented Nov 6, 2019

I upgraded our GKE cluster to 1.13.11-gke.11

Since then, one newly created Service using NEG is failing to become healthy, even though the pods report that the container is healthy.

kubectl describe pod

<snip>

    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 06 Nov 2019 09:08:10 -0800
    Ready:          True
    Restart Count:  0

<snip>

Readiness Gates:
  Type                                       Status
  cloud.google.com/load-balancer-neg-ready
Conditions:
  Type                                       Status
  cloud.google.com/load-balancer-neg-ready
  Initialized                                True
  Ready                                      False
  ContainersReady                            True
  PodScheduled                               True

<snip>

Events:
  Type     Reason                   Age    From                                                 Message
  ----     ------                   ----   ----                                                 -------
  Normal   LoadBalancerNegNotReady  3m55s  neg-readiness-reflector                              Waiting for pod to become healthy in at least one of the NEG(s): [k8s1-74b928a2-default-ww-api-8080-a3c1e454]
  Normal   Scheduled                3m55s  default-scheduler                                    Successfully assigned default/ww-api-575475649c-q7dz4 to gke-dev-cluster-1-dev-pool-3-9b3e5c8e-7ms5
  Normal   Pulled                   3m54s  kubelet, gke-dev-cluster-1-dev-pool-3-9b3e5c8e-7ms5  Container image "gcr.io/ranker-infra/ww-api:0c9bae40cd44b8da075d79b7005e5ed0119f95d2" already present on machine
  Normal   Created                  3m54s  kubelet, gke-dev-cluster-1-dev-pool-3-9b3e5c8e-7ms5  Created container
  Normal   Started                  3m54s  kubelet, gke-dev-cluster-1-dev-pool-3-9b3e5c8e-7ms5  Started container
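
A minimal sketch for watching this state across pods, assuming they carry an app=ww-api label (that selector is only a guess based on the names above); kubectl's wide output includes a READINESS GATES column:

# Show pods together with their readiness-gate status; the NEG gate stays 0/1 while it is unsatisfied.
kubectl get pods -l app=ww-api -o wide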
@tornado67

Same issue on 1.13.11-gke.14.

tobiasbrodersen commented Jan 14, 2020

We're experiencing this issue on v1.13.11-gke.14 as well; it seems to be correlated with NEGs and the new readiness gates introduced in 1.13+.
Removing our annotations on the service:

beta.cloud.google.com/backend-config: '{"default": "istio"}'
cloud.google.com/app-protocols: '{"https":"HTTP2"}'
cloud.google.com/neg: '{"ingress": true}'

And reapplying them makes all health checks pass and traffic is forwarded again.
I'm trying to dig further into the documentation and will report back if I get any findings.
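
For anyone who wants to script the workaround described above, a rough sketch (the Service name ww-api is only a guess taken from the pod names earlier in the thread):

# Remove the NEG-related annotations from the Service (a trailing '-' deletes an annotation)...
kubectl annotate service ww-api \
  beta.cloud.google.com/backend-config- \
  cloud.google.com/app-protocols- \
  cloud.google.com/neg-

# ...and re-apply them, which re-triggers NEG attachment and health checking.
kubectl annotate service ww-api \
  beta.cloud.google.com/backend-config='{"default": "istio"}' \
  cloud.google.com/app-protocols='{"https":"HTTP2"}' \
  cloud.google.com/neg='{"ingress": true}'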

freehan commented Jan 14, 2020

Some background:
- the OSS Kubernetes pod readiness gate feature
- the usage of pod readiness gates for container-native load balancing

Based on the troubleshooting guide:

  1. Look for the neg-status annotation on the service that carries the neg annotation.
     It should contain the NEG name and the locations. More info here.

  2. Look for the backend service. More info here.

  3. Check whether the corresponding endpoints showed up in the backend service and whether they are healthy.

If not, then check a few things (a command sketch follows below):

  0. Validate that the cluster satisfies the requirements.
  1. Validate that the health check configuration on the backend service is correct and that it is health checking the backends as expected.
  2. Validate that the firewall is open so health check requests can pass and arrive at the destination.
  3. Validate that the backends are receiving health check requests and responding correctly.
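
A rough command sketch of the checks above (the Service name and the backend-service name are placeholders; the actual backend service created for a NEG has a generated k8s1-... name):

# 1. The Service should carry a cloud.google.com/neg-status annotation listing the NEG name and its zones.
kubectl get service ww-api -o yaml

# 2. The NEG and its backend service should exist and have endpoints attached.
gcloud compute network-endpoint-groups list
gcloud compute backend-services list

# 3. The endpoints in the backend service should report HEALTHY.
gcloud compute backend-services get-health BACKEND_SERVICE_NAME --global

# 4. A firewall rule must allow health-check traffic from 130.211.0.0/22 and
#    35.191.0.0/16 to reach the serving port (8080 in this case).
gcloud compute firewall-rules list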

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Apr 13, 2020.
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on May 13, 2020.
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ajaysourcedigital

If you define a livenessProbe and a readinessProbe inside the YAML definition file, it should go away.
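
A minimal sketch of what that might look like, assuming the Deployment and container are both named ww-api and the app serves a /healthz endpoint on port 8080 (all of these names are assumptions, adjust them to your workload):

# A strategic merge patch adding the probes to the container (kubectl patch accepts YAML or JSON).
kubectl patch deployment ww-api --type=strategic --patch '
spec:
  template:
    spec:
      containers:
      - name: ww-api             # assumed container name
        readinessProbe:
          httpGet:
            path: /healthz       # assumed health endpoint
            port: 8080
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
'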
