feat: add validDns option and headless distributor to Nginx #366

Merged · 9 commits into cortexproject:master · Jun 10, 2022

Conversation

@locmai (Contributor) commented Jun 8, 2022

Signed-off-by: Loc Mai lmai@axon.com

What this PR does:

Addresses an NGINX issue on Kubernetes where a rolling restart of the distributor pods causes a number of ingestion failures on the write path.

This is more of a Kubernetes issue, but we can use this as a workaround. See kubernetes/kubernetes#24092.

Which issue(s) this PR fixes:
No issue created.

Template results

```yaml
# values.yaml
nginx:
  config:
    dnsValid: 15s
    distributorHeadless: true
```

nginx-config ConfigMap:

```
...
        resolver kube-dns.kube-system.svc.cluster.local 15s;

        # Distributor Config
        location = /ring {
          proxy_pass      http://RELEASE-NAME-cortex-distributor-headless.default.svc.cluster.local:8080$request_uri;
        }
```

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

cc @juan-ramirez-sp @gburton1

axonlmai added 2 commits June 8, 2022 12:54
Signed-off-by: Loc Mai <lmai@axon.com>
Signed-off-by: Loc Mai <lmai@axon.com>
@nschad (Collaborator) left a comment


Ah, this is very similar to #339 and probably also to #330. Quite frankly, I'm not sure which road we should go down. However, I do see the necessity.

@locmai (Contributor, Author) commented Jun 8, 2022

It is indeed. I think @juan-ramirez-sp tried #330 but it didn't work; see this Slack discussion: https://cloud-native.slack.com/archives/CCYDASBLP/p1654268570724589

We tested this change on our deployment and it works nicely. @gburton1 and I could take a look at #339 though; I think it's pretty similar.

@nschad (Collaborator) commented Jun 8, 2022

> It is indeed. I think @juan-ramirez-sp tried #330 but it didn't work; see this Slack discussion: https://cloud-native.slack.com/archives/CCYDASBLP/p1654268570724589
>
> We tested this change on our deployment and it works nicely. @gburton1 and I could take a look at #339 though; I think it's pretty similar.

Ah okay, upon further investigation: https://serverfault.com/questions/240476/how-to-force-nginx-to-resolve-dns-of-a-dynamic-hostname-everytime-when-doing-p. This seems to be an ongoing issue for a lot of people.

The best solution might be this or #339, which is effectively the same.
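
For reference, the pattern from that serverfault thread looks roughly like the sketch below (the service hostname and port are illustrative, not the chart's actual values). nginx resolves a literal proxy_pass hostname once at config load and caches it indefinitely; if the target is held in a variable instead, nginx re-resolves it through the resolver directive, honoring the valid= interval:

```nginx
# resolver as in the chart's rendered config; valid= caps how long a
# looked-up name may be reused before nginx queries DNS again
resolver kube-dns.kube-system.svc.cluster.local valid=15s;

location = /ring {
    # a variable target forces a runtime lookup instead of a one-time
    # resolution at startup (illustrative service name)
    set $distributor http://example-cortex-distributor.default.svc.cluster.local:8080;
    proxy_pass $distributor$request_uri;
}
```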

values.yaml (outdated)

```diff
@@ -1276,7 +1276,10 @@ nginx:
     # setHeaders:
     #   X-Scope-OrgID: $remote_user
     basicAuthSecretName: ""
+
+    # -- (optional) Including the valid parameter to the `resolver` directive to re-resolve names every `dnsValid` seconds/minutes
+    dnsValid: ""
```
Collaborator: Since this is broken now, I would be happy to see a default that behaves nicely.

Suggested change:

```diff
-dnsValid: ""
+dnsValid: "15s"
```

Collaborator: Good idea.

Contributor (Author): Yeah, great idea. Does anyone want to change the option name? I see the other PR uses dnsTTL.

By the way, after I re-tested the whole case here, it looks like valid=15s is all we need to solve the problem (no headless service change required).

Collaborator: Yeah, I like dnsTTL more.

```diff
@@ -71,7 +71,7 @@ data:

         # Distributor Config
         location = /ring {
-          proxy_pass      http://{{ template "cortex.fullname" . }}-distributor.{{ $rootDomain }}$request_uri;
+          proxy_pass      http://{{ template "cortex.fullname" . }}-distributor{{- if .Values.nginx.config.distributorHeadless }}-headless{{- end }}.{{ $rootDomain }}$request_uri;
```
Contributor: I think headless should be applied to all the distributor endpoints below as well.

Collaborator: Have you tried this? With headless you get multiple DNS results with the individual pod IP addresses. Does "load balancing" still work?
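
For context, a headless Service is just a Service with clusterIP: None; a minimal sketch (the name and selector label are illustrative, not the chart's actual manifest) looks like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-cortex-distributor-headless   # illustrative name
spec:
  clusterIP: None        # headless: DNS returns one A record per ready pod
  selector:
    app: distributor     # assumed pod label
  ports:
    - name: http
      port: 8080
      targetPort: 8080
```

With clusterIP: None the cluster DNS answers with every ready pod IP, so load balancing shifts from kube-proxy to whatever the client does with the record set; nginx documents that when a proxy_pass name resolves to several addresses, it uses them round-robin.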

values.yaml (outdated)

```diff
     # -- (optional) Including the valid parameter to the `resolver` directive to re-resolve names every `dnsValid` seconds/minutes
     dnsValid: ""
+    # -- (optional) If true, Nginx will use the distributor headless service for the `/ring` path. See https://github.com/kubernetes/kubernetes/issues/24092
+    distributorHeadless: false
```
Collaborator: As @gburton1 suggested, if we replace everything with the headless service, we don't need this config option here.

@nschad (Collaborator) commented Jun 10, 2022

But @locmai, I don't quite understand how ClusterIP is a problem. The virtual IP of the distributor service doesn't change when pods scale up/down; kube-proxy then takes over, routing that traffic to any distributor pod (fake round-robin, or actual round-robin with the IPVS backend).

@locmai (Contributor, Author) commented Jun 10, 2022

> But @locmai, I don't quite understand how ClusterIP is a problem. The virtual IP of the distributor service doesn't change when pods scale up/down; kube-proxy then takes over, routing that traffic to any distributor pod (fake round-robin, or actual round-robin with the IPVS backend).

Yeah, I think it's not. As I mentioned in the comment above, I tested again and valid=15s is all we need. My guess is that after nginx resolves the DNS for the service and retrieves the list of endpoints (the pods' IPs), it caches the result right there, so with the small TTL it just invalidates the old record and does the resolving process all over again.

This may put some more load on nginx, though.
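
One way to see the difference being discussed (a sketch; the service names, namespace, and pod image are taken from the Kubernetes DNS-debugging docs and placeholders, not from this chart) is to query both services from a throwaway debugging pod:

```sh
# throwaway pod with dig installed (image from the Kubernetes DNS debugging docs)
kubectl run -it --rm dnsutils \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sh

# inside the pod: the ClusterIP service resolves to one stable virtual IP
dig +short example-cortex-distributor.default.svc.cluster.local

# the headless service returns one A record per ready distributor pod
dig +short example-cortex-distributor-headless.default.svc.cluster.local
```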

@nschad (Collaborator) commented Jun 10, 2022

> > But @locmai, I don't quite understand how ClusterIP is a problem. The virtual IP of the distributor service doesn't change when pods scale up/down; kube-proxy then takes over, routing that traffic to any distributor pod (fake round-robin, or actual round-robin with the IPVS backend).
>
> Yeah, I think it's not. As I mentioned in the comment above, I tested again and valid=15s is all we need. My guess is that after nginx resolves the DNS for the service and retrieves the list of endpoints (the pods' IPs), it caches the result right there, so with the small TTL it just invalidates the old record and does the resolving process all over again.
>
> This may put some more load on nginx, though.

Do you just want to go ahead with only the dnsTTL change and keep the service as is?

axonlmai added 4 commits June 10, 2022 15:18
Signed-off-by: Loc Mai <lmai@axon.com>
…/cortex-helm-chart into nginx-to-use-headless-distributor

Signed-off-by: Loc Mai <lmai@axon.com>
Signed-off-by: Loc Mai <lmai@axon.com>
@locmai (Contributor, Author) commented Jun 10, 2022

Yeah, I just updated the PR. @nschad @kd7lxl

@nschad enabled auto-merge (squash) · June 10, 2022 08:31
@locmai (Contributor, Author) commented Jun 10, 2022

```
Error: timed out waiting for the condition
41
```

Sounds like a hiccup / flaky test @@

@nschad (Collaborator) commented Jun 10, 2022

> ```
> Error: timed out waiting for the condition
> 41
> ```
>
> Sounds like a hiccup / flaky test @@

```
2022/06/10 08:38:29 [emerg] 1#1: host not found in resolver "15s" in /etc/nginx/nginx.conf:21
```

Please render the chart locally and check that there are no mistakes.
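
Rendering locally could look something like this (the release name and chart path are assumptions, not taken from the repo):

```sh
# render the chart offline and grep the generated nginx config for the resolver line
helm template my-release . \
  --set nginx.config.dnsValid=15s \
  | grep -n 'resolver'
```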

@nschad disabled auto-merge · June 10, 2022 08:49
Signed-off-by: Loc Mai <lmai@axon.com>
@locmai (Contributor, Author) commented Jun 10, 2022

Oh shoot, got it now :D

@nschad (Collaborator) commented Jun 10, 2022

@locmai I think the nginx config is wrong. How does that work for you? o.O

It should be:

```
resolver 127.0.0.1 [::1]:5353 valid=30s;
```

http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver

README.md (outdated review thread, resolved)
Signed-off-by: Loc Mai <lmai@axon.com>
@locmai (Contributor, Author) commented Jun 10, 2022

@nschad sorry, my bad. It templated out to:

```
resolver kube-dns.kube-system.svc.cluster.local valid=15s;
```

I had missed the valid=. I had tested locally by simply editing the configmap.
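
The fix presumably guards the parameter in the template, something along these lines (a sketch of the templating, not the chart's exact source):

```
resolver kube-dns.kube-system.svc.cluster.local{{ if .Values.nginx.config.dnsValid }} valid={{ .Values.nginx.config.dnsValid }}{{ end }};
```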

@nschad merged commit 34b0650 into cortexproject:master · Jun 10, 2022