feat: add validDns option and headless distributor to Nginx #366

Merged · 9 commits into cortexproject:master · Jun 10, 2022

Conversation

@locmai (Contributor) commented Jun 8, 2022

Signed-off-by: Loc Mai lmai@axon.com

What this PR does:

Addresses an NGINX issue on Kubernetes where a rolling restart of the distributor pods causes a number of ingestion failures on the write path.

This is more of a Kubernetes issue, but we can use this as a workaround. See kubernetes/kubernetes#24092.

Which issue(s) this PR fixes:
No issue created.

Template results

```yaml
# values.yaml
nginx:
  config:
    dnsValid: 15s
    distributorHeadless: true
```

nginx-config ConfigMap:

```
...
        resolver kube-dns.kube-system.svc.cluster.local 15s;

        # Distributor Config
        location = /ring {
          proxy_pass      http://RELEASE-NAME-cortex-distributor-headless.default.svc.cluster.local:8080$request_uri;
        }
```

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

cc @juan-ramirez-sp @gburton1

axonlmai added 2 commits June 8, 2022 12:54
Signed-off-by: Loc Mai <lmai@axon.com>
Signed-off-by: Loc Mai <lmai@axon.com>
@nschad (Collaborator) left a comment


Ah, this is very similar to #339 and probably also to #330. Quite frankly, I'm not sure which road we should go down. However, I do see the necessity.

@locmai (Contributor, Author) commented Jun 8, 2022

It is indeed. I think @juan-ramirez-sp tried #330 but it didn't work; see this Slack discussion: https://cloud-native.slack.com/archives/CCYDASBLP/p1654268570724589

We tested this change on our deployment and it works nicely. @gburton1 and I could take a look at #339 though; I think it's pretty similar.

@nschad (Collaborator) commented Jun 8, 2022

> It is indeed. I think @juan-ramirez-sp tried #330 but it didn't work; see this Slack discussion: https://cloud-native.slack.com/archives/CCYDASBLP/p1654268570724589
>
> We tested this change on our deployment and it works nicely. @gburton1 and I could take a look at #339 though; I think it's pretty similar.

Ah okay, upon further investigation: https://serverfault.com/questions/240476/how-to-force-nginx-to-resolve-dns-of-a-dynamic-hostname-everytime-when-doing-p. This seems to be an ongoing issue for a lot of people.

The best solution might be this or #339, which is effectively the same.
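
For reference, the pattern from that serverfault thread looks roughly like the sketch below (the service hostname and port are illustrative, not the chart's actual values). nginx resolves a literal proxy_pass hostname once at config load and caches it indefinitely; if the target is held in a variable instead, nginx re-resolves it through the resolver directive, honoring the valid= interval:

```nginx
# resolver as in the chart's rendered config; valid= caps how long a
# looked-up name may be reused before nginx queries DNS again
resolver kube-dns.kube-system.svc.cluster.local valid=15s;

location = /ring {
    # a variable target forces a runtime lookup instead of a one-time
    # resolution at startup (illustrative service name)
    set $distributor http://example-cortex-distributor.default.svc.cluster.local:8080;
    proxy_pass $distributor$request_uri;
}
```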

values.yaml (outdated)

```diff
@@ -1276,7 +1276,10 @@ nginx:
     # setHeaders:
     #   X-Scope-OrgID: $remote_user
     basicAuthSecretName: ""
+
+    # -- (optional) Including the valid parameter to the `resolver` directive to re-resolve names every `dnsValid` seconds/minutes
+    dnsValid: ""
```
Collaborator: Since this is broken now, I would be happy to see a default that behaves nicely.

Suggested change:

```diff
-dnsValid: ""
+dnsValid: "15s"
```

Collaborator: Good idea.

Contributor (Author): Yeah, great idea. Does anyone want to change the option name? I see the other PR uses dnsTTL.

By the way, after I re-tested the whole case here, it looks like valid=15s is all we need to solve the problem (no headless service change required).

Collaborator: Yeah, I like dnsTTL more.

```diff
@@ -71,7 +71,7 @@ data:

         # Distributor Config
         location = /ring {
-          proxy_pass      http://{{ template "cortex.fullname" . }}-distributor.{{ $rootDomain }}$request_uri;
+          proxy_pass      http://{{ template "cortex.fullname" . }}-distributor{{- if .Values.nginx.config.distributorHeadless }}-headless{{- end }}.{{ $rootDomain }}$request_uri;
```
Contributor: I think headless should be applied to all the distributor endpoints below as well.

Collaborator: Have you tried this? With headless you get multiple DNS results with the individual pod IP addresses. Does "load balancing" still work?
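
For context, a headless Service is just a Service with clusterIP: None; a minimal sketch (the name and selector label are illustrative, not the chart's actual manifest) looks like:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-cortex-distributor-headless   # illustrative name
spec:
  clusterIP: None        # headless: DNS returns one A record per ready pod
  selector:
    app: distributor     # assumed pod label
  ports:
    - name: http
      port: 8080
      targetPort: 8080
```

With clusterIP: None the cluster DNS answers with every ready pod IP, so load balancing shifts from kube-proxy to whatever the client does with the record set; nginx documents that when a proxy_pass name resolves to several addresses, it uses them round-robin.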

values.yaml (outdated)

```diff
     # -- (optional) Including the valid parameter to the `resolver` directive to re-resolve names every `dnsValid` seconds/minutes
     dnsValid: ""
+    # -- (optional) If true, Nginx will use the distributor headless service for the `/ring` path. See https://github.com/kubernetes/kubernetes/issues/24092
+    distributorHeadless: false
```
Collaborator: As @gburton1 suggested, if we replace everything with the headless service, we don't need this config option here.

@nschad (Collaborator) commented Jun 10, 2022

But @locmai, I don't quite understand how ClusterIP is a problem. The virtual IP of the distributor service doesn't change when pods scale up/down; kube-proxy then takes over, routing that traffic to any distributor pod (fake round-robin, or actual round-robin with the IPVS backend).

@locmai (Contributor, Author) commented Jun 10, 2022

> But @locmai, I don't quite understand how ClusterIP is a problem. The virtual IP of the distributor service doesn't change when pods scale up/down; kube-proxy then takes over, routing that traffic to any distributor pod (fake round-robin, or actual round-robin with the IPVS backend).

Yeah, I think it's not. As I mentioned in the comment above, I tested again and valid=15s is all we need. My guess is that after nginx resolves the DNS for the service and retrieves the list of endpoints (the pods' IPs), it caches the result right there, so with the small TTL it just invalidates the old record and does the resolving process all over again.

This may put some more load on nginx, though.
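
One way to see the difference being discussed (a sketch; the service names, namespace, and pod image are taken from the Kubernetes DNS-debugging docs and placeholders, not from this chart) is to query both services from a throwaway debugging pod:

```sh
# throwaway pod with dig installed (image from the Kubernetes DNS debugging docs)
kubectl run -it --rm dnsutils \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  --restart=Never -- sh

# inside the pod: the ClusterIP service resolves to one stable virtual IP
dig +short example-cortex-distributor.default.svc.cluster.local

# the headless service returns one A record per ready distributor pod
dig +short example-cortex-distributor-headless.default.svc.cluster.local
```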

@nschad (Collaborator) commented Jun 10, 2022

> > But @locmai, I don't quite understand how ClusterIP is a problem. The virtual IP of the distributor service doesn't change when pods scale up/down; kube-proxy then takes over, routing that traffic to any distributor pod (fake round-robin, or actual round-robin with the IPVS backend).
>
> Yeah, I think it's not. As I mentioned in the comment above, I tested again and valid=15s is all we need. My guess is that after nginx resolves the DNS for the service and retrieves the list of endpoints (the pods' IPs), it caches the result right there, so with the small TTL it just invalidates the old record and does the resolving process all over again.
>
> This may put some more load on nginx, though.

Do you just want to go ahead with only the dnsTTL change and keep the service as is?

axonlmai added 4 commits June 10, 2022 15:18
Signed-off-by: Loc Mai <lmai@axon.com>
…/cortex-helm-chart into nginx-to-use-headless-distributor

Signed-off-by: Loc Mai <lmai@axon.com>
Signed-off-by: Loc Mai <lmai@axon.com>
@locmai (Contributor, Author) commented Jun 10, 2022

Yeah, I just updated the PR. @nschad @kd7lxl

@nschad enabled auto-merge (squash) · June 10, 2022 08:31
@locmai (Contributor, Author) commented Jun 10, 2022

```
Error: timed out waiting for the condition
41
```

Sounds like a hiccup / flaky test @@

@nschad (Collaborator) commented Jun 10, 2022

> ```
> Error: timed out waiting for the condition
> 41
> ```
>
> Sounds like a hiccup / flaky test @@

```
2022/06/10 08:38:29 [emerg] 1#1: host not found in resolver "15s" in /etc/nginx/nginx.conf:21
```

Please render the chart locally and check that there are no mistakes.
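
Rendering locally could look something like this (the release name and chart path are assumptions, not taken from the repo):

```sh
# render the chart offline and grep the generated nginx config for the resolver line
helm template my-release . \
  --set nginx.config.dnsValid=15s \
  | grep -n 'resolver'
```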

@nschad disabled auto-merge · June 10, 2022 08:49
Signed-off-by: Loc Mai <lmai@axon.com>
@locmai (Contributor, Author) commented Jun 10, 2022

Oh shoot, got it now :D

@nschad (Collaborator) commented Jun 10, 2022

@locmai I think the nginx config is wrong. How does that work for you? o.O

It should be:

```
resolver 127.0.0.1 [::1]:5353 valid=30s;
```

http://nginx.org/en/docs/http/ngx_http_core_module.html#resolver

README.md (outdated review thread, resolved)
Signed-off-by: Loc Mai <lmai@axon.com>
@locmai (Contributor, Author) commented Jun 10, 2022

@nschad sorry, my bad. It templated out to:

```
resolver kube-dns.kube-system.svc.cluster.local valid=15s;
```

I had missed the valid=. I had tested locally by simply editing the configmap.
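
The fix presumably guards the parameter in the template, something along these lines (a sketch of the templating, not the chart's exact source):

```
resolver kube-dns.kube-system.svc.cluster.local{{ if .Values.nginx.config.dnsValid }} valid={{ .Values.nginx.config.dnsValid }}{{ end }};
```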

@nschad merged commit 34b0650 into cortexproject:master · Jun 10, 2022