Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

delete upstream healthcheck annotation #3207

Merged
merged 1 commit into from
Oct 9, 2018
Merged

delete upstream healthcheck annotation #3207

merged 1 commit into from
Oct 9, 2018

Conversation

ElvinEfendi
Copy link
Member

What this PR does / why we need it:
These annotation has little value in ingress-nginx. ingress-nginx relies on k8s for healthchecking.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2018
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 9, 2018
@ElvinEfendi ElvinEfendi changed the title [WIP] delete upstream healthcheck annotation delete upstream healthcheck annotation Oct 9, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 9, 2018
@ElvinEfendi
Copy link
Member Author

/assign @aledbf

@aledbf
Copy link
Member

aledbf commented Oct 9, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, ElvinEfendi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AndrewFarley
Copy link

AndrewFarley commented Jul 11, 2019

So I'm running into issues where we need to set max_fails and because of this "little value" removal we can not now, is there a workaround to still set max_fails?

See: socketio/socket.io#1739

@antoineco
Copy link
Contributor

@AndrewFarley max_fails translates to the failureThreshold field of Kubernetes' readiness probes. The recommendation is to use that on your target application container.

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes

@AndrewFarley
Copy link

This situation per the link is one in which I am not using this as so much for a health check mechanism but as a retry and anti-flapping mechanism between upstreams. The only solution I can find is to use max_fails, and since you've removed this, I am left with the option of reverse diff-ing this patch and maintaining our own fork for this purpose. Thoughts @antoineco ?

@AndrewFarley
Copy link

I'm sure I don't have all the context or reasoning here in this PR, but I generally don't understand why you would remove a feature even one of "little value" that didn't have any effect if it wasn't going to be used by anyone in favor of using k8s health checks.

@aledbf
Copy link
Member

aledbf commented Jul 11, 2019

The only solution I can find is to use max_fails, and since you've removed this, I am left with the option of reverse diff-ing this patch and maintaining our own fork for this purpose.

This was removed due to the Lua load balancer change in the ingress controller to avoid reloads when a pod is created/destroyed. Right now there is no easy way to implement an equivalent to the max_fails feature.
That said, the advice from @antoineco to use the probe feature from Kubernetes is the way to go. Additionally, you can use the https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/configmap/#proxy-next-upstream setting to configure in which cases you want to try a different pod from the service being exposed behind the ingress

@AndrewFarley
Copy link

@aledbf Proxy next upstream might work for this situation, I'll try to fiddle with it, thanks for the info and history on the thought behind this change. Thanks

@panpan0000
Copy link
Contributor

panpan0000 commented Sep 17, 2019

Hi, @AndrewFarley @aledbf
In short,
(a) the problem without timely failover in ingress is kind of big problem in some cases, other than little value.
(b) the proxy-next-upstream is not working, unfortunately.

details below:

(a) We cannot always rely on the liveness-probe. There's a situation which is the problem: a node crash (we can simulate it by reboot a node). Liveness-probe was taken care by kubelet, which is not working when node down. And kubernetes will not mark that node from ready to NotReady until 40s timeout. Until then, the pods on that node will be changed from Running to Unknown state.
That say, there will be 40+ seconds which ingress-nginx thought pods are alive but actually not, the requests which send to those dead pods will suffer from 504 Gateway Timeout.

This kind of 40+s duration and problem is unacceptable for many critical business.

(b) the proxy_next_upstream is not working , neither for ingress-nginx version 0.21.0 nor 0.25.1.
So without it, we don't have a timely failover mechanism in ingress-nginx !

(1)setup two pods for a deployment, running on two nodes.
(2) confirm the setting in ingress-nginx's nginx.conf has been properly setup via configmap.

			proxy_next_upstream                      error timeout http_504 http_503 http_500 http_502 http_404 invalid_header;
			proxy_next_upstream_timeout             2;

(3) looping curl -H "Host:xxx" ingressIp:port (yes, it's GET , not the same issue like #4174)
(4) reboot one node
(5) there will be 40+s , 50% request failed with 504 Gateway timeout

@panpan0000
Copy link
Contributor

panpan0000 commented Sep 17, 2019

I found a post mentioning proxy_next_upstream is not effective for upstreams by balancer_by_lua_block inside openrestry , the founder of openrestry @agentzh proposed to use set_more_tries in lua instead.
But I found ingress code does have this code ngx_balancer.set_more_tries(1)
https://github.com/kubernetes/ingress-nginx/blob/master/rootfs/etc/nginx/lua/balancer.lua#L244

I don't have idea why it's not working...

the original post:
openresty/lua-resty-core#51

@agentzh
Copy link

agentzh commented Sep 17, 2019

@panpan0000 I'm afraid you misinterpret my comment there. The proxy_next_upstream is definitely effective at the same time, see the official documentation for set_more_triesmentioned in that comment:

https://github.com/openresty/lua-resty-core/blob/master/lib/ngx/balancer.md#set_more_tries

@panpan0000
Copy link
Contributor

@agentzh thank you for your clarification !
So in theory, ingress-nginx code should work.
I filed a new issue to track this problem #4567

@ElvinEfendi ElvinEfendi deleted the delete-upstream-healthcheck branch September 17, 2019 21:07
@annProg
Copy link

annProg commented Dec 5, 2019

when use ingress for custom endpoint which not belong to any Pod, upstream healthcheck is necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants