
Reset watch retry count on successful connection to API Server #267

Merged: 4 commits merged into fabric8io:master on Jan 22, 2021
Conversation

@andrzej-stencel (Contributor) commented Oct 16, 2020

Fixes #249.

This pull request resets the pod and namespace watch retry counts after a connection to the Kubernetes API server is successfully re-established. This prevents Fluentd restarts in the following scenario (described for the pod watch; the namespace watch works the same way):

  1. This plugin successfully creates a pod watch connection to the Kubernetes API server.
  2. The API server drops the pod watch connection after a certain period of time (the API server is known to drop these connections every now and then, e.g. after 45 minutes or so).
  3. This plugin successfully renews the pod watch connection.
  4. Steps 2-3 repeat more than :watch_retry_max_times times, with no watch updates arriving from the API server in the meantime (an update would reset the watch retry count).

Nothing is actually going wrong in this scenario, so we don't want to raise a Fluent::UnrecoverableError in such a case (which causes the whole Fluentd instance to restart). To prevent this, I propose resetting the watch retry count not only on receiving an update from the watch (which might not happen if, for example, the namespaces of the k8s cluster don't change for a long time), but also on every successful re-connection to the API server.
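To make the proposed behavior concrete, here is a minimal Ruby sketch of such a watch retry loop. This is an illustration under assumptions, not the plugin's actual code: `start_pod_watch`, `process_pod_event`, and `run_pod_watch` are hypothetical placeholders; only the placement of the two resets mirrors the behavior described above.

```ruby
require 'fluent/error' # defines Fluent::UnrecoverableError (fluentd gem)

# Stub stand-ins so this sketch is self-contained; the real plugin talks
# to the Kubernetes API through the kubeclient gem.
def start_pod_watch
  [] # pretend the watch yields no events and is then closed by the server
end

def process_pod_event(event); end

# Hypothetical, simplified version of the pod watch loop (method and
# variable names are placeholders, not the plugin's actual API). The fix
# is the line marked NEW: the retry count is reset on every successful
# (re)connection, not only when a watch event arrives.
def run_pod_watch(max_retries:, base_interval: 1)
  retry_count = 0
  loop do
    begin
      watcher = start_pod_watch # open a watch connection to the API server
      retry_count = 0           # NEW: a successful connection resets the count
      watcher.each do |event|
        process_pod_event(event)
        retry_count = 0         # pre-existing behavior: an update resets the count
      end
      # The API server closed the watch (it does so periodically);
      # fall through and reconnect on the next loop iteration.
    rescue StandardError => e
      retry_count += 1
      if retry_count > max_retries
        # Reached only after max_retries *consecutive failed* connection
        # attempts, so routine connection drops no longer restart Fluentd.
        raise Fluent::UnrecoverableError, "pod watch failed: #{e.message}"
      end
      sleep(base_interval * (2**(retry_count - 1))) # exponential backoff
    end
  end
end
```

With the extra reset, :watch_retry_max_times bounds consecutive failed connection attempts rather than the total number of reconnections, so the periodic drops in step 2 can no longer exhaust the retry budget.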

@andrzej-stencel marked this pull request as ready for review on October 26, 2020, 12:36
@jcantrill (Contributor)

@jkohen @grosser might I get you to weigh in on these changes?

@jkohen (Contributor) commented Oct 26, 2020

@qingling128 FYI.

Resolved review discussion on test/plugin/test_watch_pods.rb
@andrzej-stencel (Contributor, Author)

Thanks a lot for the reviews, folks. Can we have this merged and released as a new version?

@andrzej-stencel (Contributor, Author)

Hey @jcantrill, is there anything stopping us from merging and releasing a new version? Is there anything I can help with?

@djj0809 commented Nov 13, 2020

@jcantrill our production Fluentd instances keep restarting all the time. Can we merge this and release a new version ASAP?

andrzej-stencel added commits to SumoLogic/sumologic-kubernetes-collection that referenced this pull request on Nov 26-27, 2020:

…1183)

This change bundles the `kubernetes_metadata` Fluentd plugin from https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/ into our repo, as the upstream is not active and our [pull request there](fabric8io/fluent-plugin-kubernetes_metadata_filter#267) is not being merged despite having been approved.

This code was copied from upstream at commit fabric8io/fluent-plugin-kubernetes_metadata_filter@84f66a8 on the `master` branch (Oct 8th, 2020), with the changes from pull request fabric8io/fluent-plugin-kubernetes_metadata_filter#267 applied on top.
andrzej-stencel added a commit to SumoLogic/sumologic-kubernetes-collection that referenced this pull request on Nov 30, 2020, with the same commit message as above.
@frankreno (Contributor)

@jcantrill @grosser @qingling128 - Is there an ETA for when this fix will be merged and a new release containing it will be available? We have numerous customers affected by this issue and waiting on the fix, which is why @astencel-sumo helped address it.

andrzej-stencel added a commit to SumoLogic/sumologic-kubernetes-collection that referenced this pull request on Dec 14, 2020, with the same commit message as above (#1193).
@elsesiy commented Jan 11, 2021

Any progress on this?

@djj0809 commented Jan 20, 2021

@jcantrill @grosser @qingling128 please merge it and release ASAP; it's a really annoying issue.

@grosser (Contributor) commented Jan 20, 2021

I can't merge; I'm not a maintainer 🤷. Also, I'm using fluent-plugin-kubelet_metadata now, if that is any help.

@qingling128 (Contributor)

Same here. I think @jcantrill has merge access.

@jcantrill merged commit d5fe9b7 into fabric8io:master on Jan 22, 2021
@jcantrill (Contributor)

Sorry for the delay here. I will release something shortly.

@jcantrill (Contributor)

Published 2.5.3: https://rubygems.org/gems/fluent-plugin-kubernetes_metadata_filter/versions/2.5.3

@andrzej-stencel deleted the Reset-watch-retry-count-on-successful-connection-to-API-server branch on July 21, 2023.

Successfully merging this pull request may close the issue "fail to parse watch call response and crashes pod."

8 participants