
GCE: ingress only shows the first backend's healthiness in backends annotation #35

Closed
bowei opened this issue Oct 11, 2017 · 21 comments
Labels
kind/bug - Categorizes issue or PR as related to a bug.
lifecycle/frozen - Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

bowei (Member) commented Oct 11, 2017

From @MrHohn on September 20, 2017 1:43

From kubernetes/enhancements#27 (comment).

We attach a backends annotation to the ingress object after LB creation:

  ...
  backends:		{"k8s-be-30910--7b4223ab4c1af15d":"UNHEALTHY"}

And from the implementation:
https://github.com/kubernetes/ingress/blob/937cde666e533e4f70087207910d6135c672340a/controllers/gce/backends/backends.go#L437-L452

Using only the first backend's healthiness to represent the healthiness for all backends seems incorrect.
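
For context, the effect of the linked code is roughly the following. This is a simplified, hypothetical paraphrase (made-up names, not the actual function from backends.go): the reported status is taken from the first health state only, so the remaining backends never influence the annotation.

    package main

    import "fmt"

    type instanceState struct {
        instance string
        status   string // "HEALTHY" or "UNHEALTHY"
    }

    // getHealth mirrors the reported behavior: only the first entry's
    // status ends up in the annotation.
    func getHealth(states []instanceState) string {
        if len(states) == 0 {
            return "Unknown"
        }
        return states[0].status
    }

    func main() {
        states := []instanceState{
            {"node-a", "UNHEALTHY"},
            {"node-b", "HEALTHY"},
        }
        fmt.Println(getHealth(states)) // prints UNHEALTHY even though node-b is healthy
    }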

cc @freehan

Copied from original issue: kubernetes/ingress-nginx#1395

bowei (Member Author) commented Oct 11, 2017

From @yastij on September 20, 2017 14:50

@MrHohn - is anyone working on this one? If not, I can send a PR.

bowei (Member Author) commented Oct 11, 2017

From @MrHohn on September 20, 2017 20:57

@yastij Nope, though I'm not quite sure how we should present the backends' healthiness in the annotation --- for a huge cluster, we might have too many backends (nodes), and it seems unwise to list all of them in the annotation...

cc @nicksardo

bowei (Member Author) commented Oct 11, 2017

From @yastij on September 20, 2017 21:13

Maybe report healthy when all the backends are (checking all backends), and unhealthy when some aren't (specifying which ones aren't healthy).
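
A rough sketch of that proposal, assuming a hypothetical helper (not the controller's actual API) that takes a per-node health map: report HEALTHY only when every backend is healthy, otherwise list just the unhealthy ones so the annotation stays small even on large clusters.

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // aggregateHealth is a hypothetical helper: HEALTHY only if every
    // backend (node) reports HEALTHY; otherwise list the unhealthy ones.
    func aggregateHealth(states map[string]string) string {
        var unhealthy []string
        for node, status := range states {
            if status != "HEALTHY" {
                unhealthy = append(unhealthy, node)
            }
        }
        if len(unhealthy) == 0 {
            return "HEALTHY"
        }
        sort.Strings(unhealthy)
        return "UNHEALTHY: " + strings.Join(unhealthy, ", ")
    }

    func main() {
        states := map[string]string{"node-a": "HEALTHY", "node-b": "UNHEALTHY"}
        fmt.Println(aggregateHealth(states)) // UNHEALTHY: node-b
    }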

bowei (Member Author) commented Oct 11, 2017

From @MrHohn on September 20, 2017 21:18

Maybe report healthy when all the backends are (checking all backends), and unhealthy when some aren't (specifying which ones aren't healthy).

Yeah, that sort of makes sense, though for the externalTrafficPolicy=Local case, some of the backends (nodes) may intentionally fail the LB health check so that traffic only goes to nodes that contain backend pods. Showing these as unhealthy may scare users, even though they aren't actually unhealthy :(

bowei (Member Author) commented Oct 11, 2017

From @yastij on September 20, 2017 21:59

@MrHohn - we can detect this case, no? If externalTrafficPolicy is set to Local, we can ignore the unhealthy status?

bowei (Member Author) commented Oct 11, 2017

From @nicksardo on September 25, 2017 23:29

I do not know if people use externalTrafficPolicy=Local with ingress (I've never tried it), and it's not something we document with ingress. It may technically work, but I don't know how well it holds up in production with rolling updates and other edge cases. That said, if we wanted to support that case, another option is to correlate the instance status with the pods' location (node).
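
A rough sketch of that correlation idea, with hypothetical inputs (not the controller's actual data structures): under externalTrafficPolicy=Local, a node that hosts no backend pods is expected to fail the LB health check, so its UNHEALTHY state should not count against the overall status.

    package main

    import "fmt"

    // effectiveHealth is a hypothetical helper that correlates per-node LB
    // health with whether the node actually hosts backend pods.
    func effectiveHealth(nodeHealth map[string]string, nodesWithPods map[string]bool, localPolicy bool) string {
        for node, status := range nodeHealth {
            if status == "HEALTHY" {
                continue
            }
            // With externalTrafficPolicy=Local, an unhealthy node without
            // local backend pods is the expected state, not a failure.
            if localPolicy && !nodesWithPods[node] {
                continue
            }
            return "UNHEALTHY"
        }
        return "HEALTHY"
    }

    func main() {
        health := map[string]string{"node-a": "HEALTHY", "node-b": "UNHEALTHY"}
        pods := map[string]bool{"node-a": true}          // only node-a hosts backend pods
        fmt.Println(effectiveHealth(health, pods, true)) // HEALTHY: node-b's failure is expected
    }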

I agree that this annotation is not accurate. Even if it shows a correct status, the annotation is only refreshed on every sync (which may be 10 minutes or longer if there are a lot of ingress objects). My question is whether this annotation is worth keeping. Wouldn't users be better off looking at the GCP Console for backend status? Do users have daemons that poll this annotation and fire alerts? If the only case we're concerned about is a bad health-check configuration breaking all backends, couldn't we create an alert saying "All backends are unhealthy - please investigate"?

cc @csbell @nikhiljindal

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 9, 2018
yastij (Member) commented Jan 9, 2018

@bowei @MrHohn @nicksardo - is this still open?

fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 10, 2018
yastij (Member) commented Feb 11, 2018

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label on Feb 11, 2018
bowei (Member Author) commented Feb 12, 2018

thanks -- let's keep this one open

fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 13, 2018
fejta-bot commented

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 12, 2018
yastij (Member) commented Jun 12, 2018

/remove-lifecycle

fejta-bot commented

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

nicksardo reopened this on Jul 12, 2018
nicksardo added the kind/bug label and removed the lifecycle/rotten label on Jul 16, 2018
fejta-bot commented

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Oct 14, 2018
bowei (Member Author) commented Nov 6, 2018

/lifecycle frozen

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Nov 6, 2018
ashi009 commented Aug 15, 2019

Any update on this?

In the kube-proxy world, showing the first backend's health status is probably OK, as traffic can hop between nodes when accessed via NodePort. Each backend group will report health regardless of whether there is a backend in that zone at all.

However, with NEG backends, it's possible to have zones with no corresponding backends at all. In that case, the health status for those backend groups will be constantly unknown, and the console looks broken even though the ingress works fine:

[Screenshot: GKE console showing the backend health as unknown]

For this reason, I think this should be fixed.
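
One possible shape of a fix, sketched with hypothetical types (not the actual NEG code): skip zonal NEGs that contain no endpoints when aggregating, so empty zones don't drag the overall status down to Unknown.

    package main

    import "fmt"

    // negStatus is a hypothetical stand-in for per-zone NEG health.
    type negStatus struct {
        zone          string
        endpointCount int
        health        string // "HEALTHY", "UNHEALTHY", or "UNKNOWN"
    }

    // aggregateNEGHealth ignores zones with no endpoints; their health is
    // meaningless and should not be reported.
    func aggregateNEGHealth(negs []negStatus) string {
        result := "UNKNOWN"
        for _, neg := range negs {
            if neg.endpointCount == 0 {
                continue // empty zone, nothing to health-check
            }
            if neg.health != "HEALTHY" {
                return neg.health
            }
            result = "HEALTHY"
        }
        return result
    }

    func main() {
        negs := []negStatus{
            {"us-central1-a", 3, "HEALTHY"},
            {"us-central1-b", 0, "UNKNOWN"}, // no backends scheduled in this zone
        }
        fmt.Println(aggregateNEGHealth(negs)) // HEALTHY
    }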

bowei (Member Author) commented Aug 15, 2019

@freehan -- can we put this in the backlog? It looks like a self-contained item.

ashi009 commented Aug 15, 2019

FTR: the GKE console issue is tracked at https://issuetracker.google.com/issues/130748827.

swetharepakula (Member) commented

This has been fixed with #936.
