
Intermittent DNS Failure with Kong Gateway and multi-zone kuma service mesh #2829

Closed
arjunsalyan opened this issue Sep 23, 2021 · 26 comments
Labels
area/multizone triage/rotten closed due to lack of information for too long, rejected feature...

Comments

@arjunsalyan

Summary

I have a three-cluster setup on GKE:

  • Kuma global
  • Zone A
  • Zone B

Zone A has the Kong gateway and ingress controller installed along with some other services. To expose the services on Zone B, I create an ExternalName service on Zone A pointing to the service on Zone B and then create an Ingress for it.
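Roughly, the wiring looks like this (a minimal sketch; the service name, namespace, and Kuma DNS name match the error below, while the Ingress path and class are assumed):

kubectl apply -f - <<'EOF'
# ExternalName service on Zone A pointing at the Kuma DNS name of the Zone B service
apiVersion: v1
kind: Service
metadata:
  name: services-externalname
  namespace: services
spec:
  type: ExternalName
  externalName: fe.dev.svc.80.mesh
  ports:
    - port: 80
---
# Ingress routing gateway traffic to the ExternalName service
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fe-ingress
  namespace: services
spec:
  ingressClassName: kong
  rules:
    - http:
        paths:
          - path: /fe
            pathType: Prefix
            backend:
              service:
                name: services-externalname
                port:
                  number: 80
EOF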

Everything works fine, except that intermittently (4-5 times a day) Kong throws this error:

[warn] 1102#0: *494 [lua] base.lua:659: queryDns(): [upstream:services-externalname.services.80.svc 20] querying dns for fe.dev.svc.80.mesh failed: dns server error: 4 not implemented. Tried ["(short)fe.dev.svc.80.mesh:(na) - cache-hit/stale","fe.dev.svc.80.mesh.kong-system.svc.cluster.local:1 - cache-hit/dns server error: 3 name error","fe.dev.svc.80.mesh.svc.cluster.local:
....

Here, services-externalname is the name of the ExternalName service on Zone A, and fe.dev.svc.80.mesh is the Kuma DNS address of the service running on Zone B.
And then this appears:

*6377490 [lua] balancer.lua:258: callback(): [healthchecks] balancer 0dc6f45b-8f8d-40d2-a504-473544ee190b:services-externalname.services.80.svc reported health status changed to UNHEALTHY, context: ngx.timer, client: 127.0.0.1, server: 127.0.0.1:8444

It stays unhealthy for a minute or so and then automatically returns to healthy. During this period Kong cannot serve the service and throws this error when it is accessed through the gateway: Failure to get a peer from the ring balancer

Steps To Reproduce

Here is what I did (a command sketch follows this list):

  1. Set up three GKE clusters running k8s 1.19.12-gke.2101
  2. Installed Kuma on all three clusters (one global, two zones) using Helm chart v0.7.0 (Kuma v1.3.0)
  3. Installed Kong gateway on Zone A using Helm chart v2.3.0 (Kong v2.5, ingress controller v1.3)
  4. Set up the ExternalName service as explained in the Kuma docs
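For completeness, the installs were along these lines (a sketch, assuming the standard chart repositories and default values; the zone name and global KDS address are placeholders):

helm repo add kuma https://kumahq.github.io/charts
helm repo add kong https://charts.konghq.com

# Global control plane
helm install kuma kuma/kuma --version 0.7.0 \
  --namespace kuma-system --create-namespace \
  --set controlPlane.mode=global

# Zone control plane plus zone ingress (repeated per zone cluster)
helm install kuma kuma/kuma --version 0.7.0 \
  --namespace kuma-system --create-namespace \
  --set controlPlane.mode=zone \
  --set controlPlane.zone=zone-a \
  --set ingress.enabled=true \
  --set controlPlane.kdsGlobalAddress=grpcs://<GLOBAL_KDS_IP>:5685

# Kong gateway and ingress controller on Zone A
helm install kong kong/kong --version 2.3.0 \
  --namespace kong-system --create-namespace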

Additional Details & Logs

  • Version - Kuma 1.3.0 and Kong 2.4
  • Error logs - Mentioned above
  • Configuration - multi-zone with kong as gateway
  • Platform and Operating System - GKE (k8s 1.19.12-gke.2101)
  • Installation Method - helm

I have tried to follow all steps from the documentation. Did I make a mistake, or is this something that needs to be fixed?

@jpeach
Contributor

jpeach commented Sep 28, 2021

dns server error: 4 not implemented seems a bit odd. Is there a corresponding log message anywhere in coredns or kuma logs?
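One way to observe it outside of Kong would be to poll the name from a throwaway pod (a sketch; the image and interval are arbitrary, and rcode 4 should show up as status: NOTIMP in the dig header):

kubectl run dns-probe --rm -it --restart=Never --image=nicolaka/netshoot -- \
  sh -c 'while true; do dig fe.dev.svc.80.mesh | grep -E "status:|ANSWER:"; sleep 5; done'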

@arjunsalyan
Author

No, I do not see anything strange anywhere else in the logs. I also tried some monitoring with kumactl install metrics, but everything looks good there as well.

The kuma-control-plane on Zone A has only three types of logs.

Either it is annotating the services:

INFO	controllers.Service	annotating service which is part of the mesh	
{"service": "services/demo-service", "annotation": "ingress.kubernetes.io/service-upstream=true"}

or this:

INFO kds-zone updating a resource 
{"type": "ZoneIngress", "name": "gke-services.kuma-ingress-857896c4f9-gckjl.kuma-system.default.default", "mesh": ""}

or this:

INFO kds.reconcile detected changes in the resources. Sending changes to the client. 
{"resourceType": "DataplaneInsight", "client": "global"}

Also, I have tried a complete reinstall of the entire mesh and the services, but with the same result. I have been stuck with this since 1.2.0 and have moved on to 1.3.0, but the issue still exists.

@arjunsalyan
Author

I recently tried to change the gateway to nginx, but the results are the same:

[error] 305#305: *1096422 stream [lua] dns.lua:152: dns_lookup():
failed to query the DNS server for services-externalname.services.80.svc:
server returned error code: 3: name error
server returned error code: 3: name error, context: ngx.timer

This happens intermittently, roughly once an hour, and fixes itself in about a minute.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Nov 22, 2021
@github-actions
Contributor

This issue was inactive for 30 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

@jpeach jpeach removed the triage/stale Inactive for some time. It will be triaged again label Nov 22, 2021
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Dec 24, 2021
@github-actions
Contributor

This issue was inactive for 30 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant please comment on it promptly or attend the next triage meeting.

@lahabana
Contributor

lahabana commented Jan 6, 2022

This looks like the issue that was fixed in 1.3.1: #2756

@lahabana
Contributor

lahabana commented Jan 6, 2022

I'm going to mark this as a duplicate, but please reopen if you disagree!

@lahabana lahabana closed this as completed Jan 6, 2022
@lahabana lahabana added triage/duplicated already exists and removed triage/stale Inactive for some time. It will be triaged again labels Jan 6, 2022
@sravanakinapally

I recently tried to change the gateway to nginx, but the results are the same:

[error] 305#305: *1096422 stream [lua] dns.lua:152: dns_lookup():
failed to query the DNS server for services-externalname.services.80.svc:
server returned error code: 3: name error
server returned error code: 3: name error, context: ngx.timer

This happens intermittently, roughly once an hour, and fixes itself in about a minute.

I have a similar issue and am not sure what the fix is. Could you share what fixed it for you?

Kuma service mesh: 1 global and 6 zones, all Azure Kubernetes Service (AKS) clusters on 1.22.6, with Kuma version 1.3.1.
nginx-ingress: docker.io/bitnami/nginx-ingress-controller:1.1.0-debian-10-r13

We have six zone clusters, but we notice these errors in only one cluster, which has about 600+ pods.

ingress-nginx-controller-848dfd5464-6zz58:controller 2022/10/10 01:36:18 [error] 34#34: *2505451 [lua] dns.lua:152: dns_lookup(): failed to query the DNS server for reinstate-ui-nonprod.reubstate-nonprod.svc.80.mesh:
ingress-nginx-controller-848dfd5464-6zz58:controller server returned error code: 3: name error
ingress-nginx-controller-848dfd5464-6zz58:controller server returned error code: 3: name error, context: ngx.timer

@lahabana
Contributor

@sravanakinapally can you confirm this is intermittent? Could you show the log of the dataplane during this time? I'm wondering if we can maybe track this down to something in the Envoy config.
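Something like this would capture them (pod name and namespace are placeholders; Kuma injects the proxy as a container named kuma-sidecar):

kubectl logs -n <namespace> <pod-name> -c kuma-sidecar --since=10m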

@lahabana lahabana reopened this Oct 17, 2022
@lahabana lahabana added triage/needs-information Reviewed and some extra information was asked to the reporter and removed triage/duplicated already exists labels Oct 17, 2022
@github-actions github-actions bot added the triage/pending This issue will be looked at on the next triage meeting label Oct 17, 2022
@kleinfreund kleinfreund removed the triage/pending This issue will be looked at on the next triage meeting label Nov 2, 2022
@kleinfreund
Contributor

@sravanakinapally Hey. Do you have any updates for us on this? Is this still affecting you?

@slonka
Contributor

slonka commented Dec 12, 2022

@sravanakinapally - any updates here?

@sravanakinapally

sravanakinapally commented Dec 16, 2022

@slonka let me know if you need more logs

This was intermittent, but it's happening more often now. We upgraded the cluster to Azure AKS 1.24.6 with ingress-nginx docker.io/bitnami/nginx-ingress-controller:1.6.0-debian-11-r1.

kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.614Z	ERROR	Reconciler error	{"controller": "configmap", "controllerGroup": "", "controllerKind": "ConfigMap", "service": {"name":"dataprivacy-portal-ng-k8s-poc-nonprod-externalname","namespace":"dpp-ns-nonprod"}, "namespace": "dpp-ns-nonprod", "name": "dataprivacy-portal-ng-k8s-poc-nonprod-externalname", "reconcileID": "a178d82e-cbb9-478b-9519-e750e1e16fb7", "error": "unable to update ingress service upstream annotation on service dataprivacy-portal-ng-k8s-poc-nonprod-externalname: Operation cannot be fulfilled on services \"dataprivacy-portal-ng-k8s-poc-nonprod-externalname\": the object has been modified; please apply your changes to the latest version and try again", "errorVerbose": "Operation cannot be fulfilled on services \"dataprivacy-portal-ng-k8s-poc-nonprod-externalname\": the object has been modified; please apply your changes to the latest version and try again\nunable to update ingress service upstream annotation on service dataprivacy-portal-ng-k8s-poc-nonprod-externalname\ngithub.com/kumahq/kuma/pkg/plugins/runtime/k8s/controllers.(*ServiceReconciler).Reconcile\n\t/home/circleci/project/pkg/plugins/runtime/k8s/controllers/service_controller.go:85\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:320\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/circleci/.go-kuma-go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.1/pkg/internal/controller/controller.go:234\nruntime.goexit\n\t/home/circleci/go/src/runtime/asm_amd64.s:1571"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.614Z	INFO	controllers.Service	annotating service which is part of the mesh	{"service": "dpp-ns-nonprod/dataprivacy-portal-ng-k8s-poc-nonprod-externalname", "annotation": "ingress.kubernetes.io/service-upstream=true"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.665Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "privacy-portal-dpo-stage-7d44c9b984-jhh8b", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.726Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "privacy-portal-dpo-test-965df7556-bzq28", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.744Z	INFO	controllers.Service	annotating service which is part of the mesh	{"service": "dpp-ns-nonprod/dataprivacy-portal-ng-k8s-poc-nonprod-externalname", "annotation": "ingress.kubernetes.io/service-upstream=true"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.787Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-springboot-k8s-poc-nonprod-5fb6db7699-765k5", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.857Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-ng-k8s-poc-nonprod-7dd66b7d94-vxph4", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.909Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-api-proxy-test-796f6cb685-n8t6g", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:57.973Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "dataprivacy-portal-router-service-dev-55985fd584-qq9vc", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}
kuma-control-plane-856b976ffc-9rtml:control-plane 2022-12-16T21:19:58.001Z	INFO	leader	Not the leader. Waiting.
kuma-control-plane-856b976ffc-2pl9k:control-plane 2022-12-16T21:19:58.023Z	INFO	discovery.k8s.pod-to-dataplane-converter	ignoring label when converting labels to tags, because it uses reserved Kuma prefix	{"pod": "privacy-portal-dpo-dev-55c99479d9-r97rd", "namespace": "dpp-ns-nonprod", "label": "kuma.io/region"}

@Automaat Automaat added triage/pending This issue will be looked at on the next triage meeting and removed triage/needs-information Reviewed and some extra information was asked to the reporter labels Dec 19, 2022
@jakubdyszkiewicz
Contributor

Triage: Hey, which version of Kuma are you running?
On Oct 10 you posted "Kuma Service mesh 1-Global 6-Zone all these are Azure Kubernetes cluster AKS 1.22.6 with Kuma version 1.3.1". Did you upgrade? In 1.5.0 we released a fix for a DNS issue that looks like what you are describing: https://github.com/kumahq/kuma/pull/3459
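Checking the running version and upgrading would look roughly like this (a sketch; the chart version is a placeholder):

# Check the control-plane image currently running
kubectl get pods -n kuma-system -o jsonpath='{.items[*].spec.containers[*].image}'

# Upgrade the chart in place, keeping existing values
helm repo update
helm upgrade kuma kuma/kuma --namespace kuma-system --reuse-values --version <CHART_VERSION>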

@jakubdyszkiewicz jakubdyszkiewicz removed the triage/pending This issue will be looked at on the next triage meeting label Dec 19, 2022
@slonka slonka added triage/pending This issue will be looked at on the next triage meeting and removed triage/needs-information Reviewed and some extra information was asked to the reporter labels Jan 23, 2023
@jakubdyszkiewicz jakubdyszkiewicz added triage/needs-reproducing Someone else should try to reproduce this and removed triage/pending This issue will be looked at on the next triage meeting labels Jan 23, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Apr 24, 2023
@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@lahabana lahabana removed the triage/stale Inactive for some time. It will be triaged again label Apr 24, 2023
@bartsmykla
Contributor

@sravanakinapally could you update to a newer version of Kuma and let us know whether it still happens? I would like to push this forward and resolve the issue if necessary.

@sravanakinapally

@bartsmykla yes, this is still happening. I am working with John H on this.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Aug 15, 2023
@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@jakubdyszkiewicz jakubdyszkiewicz removed the triage/stale Inactive for some time. It will be triaged again label Aug 18, 2023
@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Nov 17, 2023
@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@michaelbeaumont michaelbeaumont removed the triage/stale Inactive for some time. It will be triaged again label Nov 17, 2023
@slonka
Contributor

slonka commented Nov 27, 2023

@johnharris85 did we figure out what was wrong here?

@nowNick

nowNick commented Nov 27, 2023

I'm not sure if these issues are 100% related but they do look similar.

A short summary is that we've had some problems with Kong's DNS client. There was a quick fix in 3.5.0, and we've also merged a broader fix, but it's still to be released. More info here: Kong/kong#9959 (comment)

@github-actions
Contributor

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed.
If you think this issue is still relevant, please comment on it or attend the next triage meeting.

@github-actions github-actions bot added the triage/stale Inactive for some time. It will be triaged again label Feb 26, 2024
@jakubdyszkiewicz jakubdyszkiewicz removed the triage/stale Inactive for some time. It will be triaged again label Feb 26, 2024
@jakubdyszkiewicz
Contributor

That honestly looks like #8301, which was fixed in 2.5.0 and backported to patch versions of earlier minors, especially since we are talking about multizone like the original post.

@lahabana lahabana added triage/needs-information Reviewed and some extra information was asked to the reporter and removed triage/needs-reproducing Someone else should try to reproduce this labels Apr 8, 2024
@lahabana
Contributor

lahabana commented Apr 8, 2024

@arjunsalyan can you check if this still happens on a recent version?

@slonka
Contributor

slonka commented May 6, 2024

pinging @arjunsalyan again

@arjunsalyan
Author

Sorry guys, we no longer have the setup on which we had the issue, so there is no way for me to test or reproduce this. We can close this ticket if similar issues have been addressed.

@jakubdyszkiewicz jakubdyszkiewicz closed this as not planned Won't fix, can't repro, duplicate, stale May 6, 2024
@jakubdyszkiewicz jakubdyszkiewicz added triage/rotten closed due to lack of information for too long, rejected feature... and removed triage/needs-information Reviewed and some extra information was asked to the reporter labels May 6, 2024