
Intermittent DNS Failure with Kong Gateway and multi-zone kuma service mesh #2829

Closed
arjunsalyan opened this issue Sep 23, 2021 · 26 comments
Labels
area/multizone triage/rotten closed due to lack of information for too long, rejected feature...

Comments

@arjunsalyan

Summary

I have a three-cluster setup on GKE:

  • Kuma global
  • Zone A
  • Zone B

Zone A has the Kong gateway and ingress controller installed along with some other services. To expose the services on Zone B, I create an ExternalName service on Zone A pointing to the service on Zone B and then create an Ingress for it.
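Roughly, the wiring looks like this (a minimal sketch; the service name, namespace, and Kuma DNS name match the error below, while the Ingress path and class are assumed):

kubectl apply -f - <<'EOF'
# ExternalName service on Zone A pointing at the Kuma DNS name of the Zone B service
apiVersion: v1
kind: Service
metadata:
  name: services-externalname
  namespace: services
spec:
  type: ExternalName
  externalName: fe.dev.svc.80.mesh
  ports:
    - port: 80
---
# Ingress routing gateway traffic to the ExternalName service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fe-ingress
  namespace: services
spec:
  ingressClassName: kong
  rules:
    - http:
        paths:
          - path: /fe
            pathType: Prefix
            backend:
              service:
                name: services-externalname
                port:
                  number: 80
EOF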

Everything works fine, except that intermittently (4-5 times a day) Kong throws this error:

[warn] 1102#0: *494 [lua] base.lua:659: queryDns(): [upstream:services-externalname.services.80.svc 20] querying dns for fe.dev.svc.80.mesh failed: dns server error: 4 not implemented. Tried ["(short)fe.dev.svc.80.mesh:(na) - cache-hit/stale","fe.dev.svc.80.mesh.kong-system.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","fe.dev.svc.80.mesh.svc.cluster.local:
....

Here, services-externalname is the name of the ExternalName service on Zone A, and fe.dev.svc.80.mesh is the Kuma DNS address of the service running on Zone B.
And then this appears:

*6377490 [lua] balancer.lua:258: callback(): [healthchecks] balancer 0dc6f45b-8f8d-40d2-a504-473544ee190b:services-externalname.services.80.svc reported health status changed to UNHEALTHY, context: ngx.timer, client: 127.0.0.1, server: 127.0.0.1:8444

It stays unhealthy for a minute or so and then automatically returns to healthy. During this period Kong cannot serve the service and throws this error when it is accessed through the gateway: Failure to get a peer from the ring balancer

Steps To Reproduce

Here is what I did (a command sketch follows this list):

  1. Set up three GKE clusters running k8s 1.19.12-gke.2101
  2. Installed Kuma on all three clusters (one global, two zones) using Helm chart v0.7.0 (Kuma v1.3.0)
  3. Installed Kong gateway on Zone A using Helm chart v2.3.0 (Kong v2.5, ingress controller v1.3)
  4. Set up the ExternalName service as explained in the Kuma docs
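For completeness, the installs were along these lines (a sketch, assuming the standard chart repositories and default values; the zone name and global KDS address are placeholders):

helm repo add kuma https://kumahq.github.io/charts
helm repo add kong https://charts.konghq.com

# Global control plane
helm install kuma kuma/kuma --version 0.7.0 \
  --namespace kuma-system --create-namespace \
  --set controlPlane.mode=global

# Zone control plane plus zone ingress (repeated per zone cluster)
helm install kuma kuma/kuma --version 0.7.0 \
  --namespace kuma-system --create-namespace \
  --set controlPlane.mode=zone \
  --set controlPlane.zone=zone-a \
  --set ingress.enabled=true \
  --set controlPlane.kdsGlobalAddress=grpcs://<GLOBAL_KDS_IP>:5685

# Kong gateway and ingress controller on Zone A
helm install kong kong/kong --version 2.3.0 \
  --namespace kong-system --create-namespace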

Additional Details & Logs

  • Version - Kuma 1.3.0 and Kong 2.4
  • Error logs - Mentioned above
  • Configuration - multi-zone with kong as gateway
  • Platform and Operating System - GKE (k8s 1.19.12-gke.2101)
  • Installation Method - helm

I have tried to follow all steps from the documentation. Did I make a mistake, or is this something that needs to be fixed?

@jpeach
Contributor

jpeach commented Sep 28, 2021

dns server error: 4 not implemented seems a bit odd. Is there a corresponding log message anywhere in coredns or kuma logs?
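One way to observe it outside of Kong would be to poll the name from a throwaway pod (a sketch; the image and interval are arbitrary, and rcode 4 should show up as status: NOTIMP in the dig header):

kubectl run dns-probe --rm -it --restart=Never --image=nicolaka/netshoot -- \
  sh -c 'while true; do dig fe.dev.svc.80.mesh | grep -E "status:|ANSWER:"; sleep 5; done'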

@arjunsalyan
Author

No, I do not see anything strange anywhere else in the logs. I also tried some monitoring with kumactl install metrics, but everything looks good there as well.

The kuma-control-plane on Zone A has only three types of logs.

Either it is annotating the services:

INFO	controllers.Service	annotating service which is part of the mesh	
{"service": "services/demo-service", "annotation": "ingress.kubernetes.io/service-upstream=true"}

or this:

INFO kds-zone updating a resource 
{"type": "ZoneIngress", "name": "gke-services.kuma-ingress-857896c4f9-gckjl.kuma-system.default.default", "mesh": ""}

or this:

INFO kds.reconcile detected changes in the resources. Sending changes to the client. 
{"resourceType": "DataplaneInsight", "client": "global"}

Also, I have tried a complete reinstall of the entire mesh and the services, but with the same result. I have been stuck with this since 1.2.0 and have moved on to 1.3.0, but the issue still exists.

@arjunsalyan
Author

I recently tried to change the gateway to nginx, but the results are the same:

[error] 305#305: *1096422 stream [lua] dns.lua:152: dns_lookup():
failed to query the DNS server for services-externalname.services.80.svc:
server returned error code: 3: name error
server returned error code: 3: name error, context: ngx.timer

This happens intermittently, roughly once an hour, and fixes itself in about a minute.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Nov 22, 2021
@github-actions
Contributor

This issue was inactive for 30 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

@jpeach jpeach removed the triage/stale Inactive for some time. It will be triaged again label Nov 22, 2021
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Dec 24, 2021
@github-actions
Contributor

This issue was inactive for 30 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

@lahabana
Contributor

lahabana commented Jan 6, 2022

This looks like the issue that was fixed in 1.3.1: #2756

@lahabana
Contributor

lahabana commented Jan 6, 2022

I'm going to mark this as a duplicate, but please reopen if you disagree!

@lahabana lahabana closed this as completed Jan 6, 2022
@lahabana lahabana added triage/duplicated already exists and removed triage/stale Inactive for some time. It will be triaged again labels Jan 6, 2022
@sravanakinapally

I recently tried to change the gateway to nginx, but the results are the same:

[error] 305#305: *1096422 stream [lua] dns.lua:152: dns_lookup():
failed to query the DNS server for services-externalname.services.80.svc:
server returned error code: 3: name error
server returned error code: 3: name error, context: ngx.timer

This happens intermittently, roughly once an hour, and fixes itself in about a minute.

I have a similar issue and am not sure what the fix is. Could you share what fixed it for you?

Kuma service mesh: 1 global and 6 zones, all Azure Kubernetes Service (AKS) clusters on 1.22.6, with Kuma version 1.3.1.
nginx-ingress: docker.io/bitnami/nginx-ingress-controller:1.1.0-debian-10-r13

We have six zone clusters, but we notice these errors in only one cluster, which has about 600+ pods.

ingress-nginx-controller-848dfd5464-6zz58:controller 2022/10/10 01:36:18 [error] 34#34: *2505451 [lua] dns.lua:152: dns_lookup(): failed to query the DNS server for reinstate-ui-nonprod.reubstate-nonprod.svc.80.mesh:
ingress-nginx-controller-848dfd5464-6zz58:controller server returned error code: 3: name error
ingress-nginx-controller-848dfd5464-6zz58:controller server returned error code: 3: name error, context: ngx.timer

@lahabana
Contributor

@sravanakinapally can you confirm this is intermittent? Could you show the log of the dataplane during this time? I'm wondering if we can maybe track this down to something in the Envoy config.
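Something like this would capture them (pod name and namespace are placeholders; Kuma injects the proxy as a container named kuma-sidecar):

kubectl logs -n <namespace> <pod-name> -c kuma-sidecar --since=10m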

@lahabana lahabana reopened this Oct 17, 2022
@lahabana lahabana added triage/needs-information Reviewed and some extra information was asked to the reporter and removed triage/duplicated already exists labels Oct 17, 2022
@github-actions github-actions bot added the triage/pending This issue will be looked at on the next triage meeting label Oct 17, 2022
@kleinfreund kleinfreund removed the triage/pending This issue will be looked at on the next triage meeting label Nov 2, 2022
@kleinfreund
Contributor

@sravanakinapally Hey. Do you have any updates for us on this? Is this still affecting you?

@slonka
Contributor

slonka commented Dec 12, 2022

@sravanakinapally - any updates here?

@sravanakinapally

sravanakinapally commented Dec 16, 2022

@slonka let me know if you need more logs

This was intermittent, but it's happening more often now. We upgraded the cluster to Azure AKS 1.24.6 with ingress-nginx docker.io/bitnami/nginx-ingress-controller:1.6.0-debian-11-r1.

kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.614Z	ERROR	Reconciler error	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "service": {"name":"dataprivacy-portal-ng-k8s-poc-nonprod-externalname","namespace":"dpp-ns-nonprod"}, "namespace": "dpp-ns-nonprod", "name": "dataprivacy-portal-ng-k8s-poc-nonprod-externalname", "reconcileID": "a178d82e-cbb9-478b-9519-e750e1e16fb7", "error": "unable to update ingress service upstream annotation on service dataprivacy-portal-ng-k8s-poc-nonprod-externalname: Operation cannot be fulfilled on services \"dataprivacy-portal-ng-k8s-poc-nonprod-externalname\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on services \"dataprivacy-portal-ng-k8s-poc-nonprod-externalname\": the object has been modified; please apply your changes to the latest version and try again\nunable to update ingress service upstream annotation on service dataprivacy-portal-ng-k8s-poc-nonprod-externalname\ngithub.com/kumahq/kuma/pkg/plugins/runtime/k8s/controllers.(*ServiceReconciler).Reconcile\n\t/home/circleci/project/pkg/plugins/runtime/k8s/controllers/service_controller.go:85\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/home/circleci/go/src/runtime/asm_amd64.s:1571"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.614Z	INFO	controllers.Service	annotating service which is part of the mesh	{"service": "dpp-ns-nonprod/dataprivacy-portal-ng-k8s-poc-nonprod-externalname", "annotation": "ingress.kubernetes.io/service-upstream=true"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.665Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "privacy-portal-dpo-stage-7d44c9b984-jhh8b", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.726Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "privacy-portal-dpo-test-965df7556-bzq28", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.744Z	INFO	controllers.Service	annotating service which is part of the mesh	{"service": "dpp-ns-nonprod/dataprivacy-portal-ng-k8s-poc-nonprod-externalname", "annotation": "ingress.kubernetes.io/service-upstream=true"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.787Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-springboot-k8s-poc-nonprod-5fb6db7699-765k5", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.857Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-ng-k8s-poc-nonprod-7dd66b7d94-vxph4", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.909Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-api-proxy-test-796f6cb685-n8t6g", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.973Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-router-service-dev-55985fd584-qq9vc", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-9rtml:control-plane 2022-12-16T21:19:58.001Z	INFO	leader	Not the leader. Waiting.
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:58.023Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "privacy-portal-dpo-dev-55c99479d9-r97rd", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}

@Automaat Automaat added triage/pending This issue will be looked at on the next triage meeting and removed triage/needs-information Reviewed and some extra information was asked to the reporter labels Dec 19, 2022
@jakubdyszkiewicz
Contributor

Triage: Hey, which version of Kuma are you running?
On Oct 10 you posted "Kuma Service mesh 1-Global 6-Zone all these are Azure Kubernetes cluster AKS 1.22.6 with Kuma version 1.3.1". Did you upgrade? In 1.5.0 we released a fix for a DNS issue that looks like what you are describing: https://github.com/kumahq/kuma/pull/3459
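Checking the running version and upgrading would look roughly like this (a sketch; the chart version is a placeholder):

# Check the control-plane image currently running
kubectl get pods -n kuma-system -o jsonpath='{.items[*].spec.containers[*].image}'

# Upgrade the chart in place, keeping existing values
helm repo update
helm upgrade kuma kuma/kuma --namespace kuma-system --reuse-values --version <CHART_VERSION>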

@jakubdyszkiewicz jakubdyszkiewicz removed the triage/pending This issue will be looked at on the next triage meeting label Dec 19, 2022
@slonka slonka added triage/pending This issue will be looked at on the next triage meeting and removed triage/needs-information Reviewed and some extra information was asked to the reporter labels Jan 23, 2023
@jakubdyszkiewicz jakubdyszkiewicz added triage/needs-reproducing Someone else should try to reproduce this and removed triage/pending This issue will be looked at on the next triage meeting labels Jan 23, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Apr 24, 2023
@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Apr 24, 2023
@bartsmykla
Contributor

@sravanakinapally could you update to a newer version of Kuma and let us know whether it still happens? I would like to push this forward and resolve the issue if necessary.

@sravanakinapally

@bartsmykla yes, this is still happening. I am working with John H on this.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Aug 15, 2023
@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@jakubdyszkiewicz jakubdyszkiewicz removed the triage/stale Inactive for some time. It will be triaged again label Aug 18, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Nov 17, 2023
@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@michaelbeaumont michaelbeaumont removed the triage/stale Inactive for some time. It will be triaged again label Nov 17, 2023
@slonka
Contributor

slonka commented Nov 27, 2023

@johnharris85 did we figure out what was wrong here?

@nowNick

nowNick commented Nov 27, 2023

I'm not sure if these issues are 100% related but they do look similar.

A short summary is that we've had some problems with Kong's DNS client. There was a quick fix in 3.5.0, and we've also merged a broader fix, but it's still to be released. More info here: Kong/kong#9959 (comment)

@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Feb 26, 2024
@jakubdyszkiewicz jakubdyszkiewicz removed the triage/stale Inactive for some time. It will be triaged again label Feb 26, 2024
@jakubdyszkiewicz
Contributor

That honestly looks like #8301, which was fixed in 2.5.0 and backported to patch versions of earlier minors, especially since we are talking about multizone like the original post.

@lahabana lahabana added triage/needs-information Reviewed and some extra information was asked to the reporter and removed triage/needs-reproducing Someone else should try to reproduce this labels Apr 8, 2024
@lahabana
Contributor

lahabana commented Apr 8, 2024

@arjunsalyan can you check if this still happens on a recent version?

@slonka
Contributor

slonka commented May 6, 2024

pinging @arjunsalyan again

@arjunsalyan
Author

Sorry guys, we no longer have the setup on which we had the issue, so there is no way for me to test or reproduce this. We can close this ticket if similar issues have been addressed.

@jakubdyszkiewicz jakubdyszkiewicz closed this as not planned Won't fix, can't repro, duplicate, stale May 6, 2024
@jakubdyszkiewicz jakubdyszkiewicz added triage/rotten closed due to lack of information for too long, rejected feature... and removed triage/needs-information Reviewed and some extra information was asked to the reporter labels May 6, 2024