
Reset watch retry count on successful connection to API Server #267

Merged: 4 commits merged into fabric8io:master on Jan 22, 2021
Conversation

@andrzej-stencel (Contributor) commented Oct 16, 2020

Fixes #249.

This pull request resets the pod and namespace watch retry counts after a connection to the Kubernetes API server is successfully re-established. This prevents Fluentd restarts in the following scenario (described for the pod watch; the namespace watch works the same way):

  1. This plugin successfully creates a pod watch connection to the Kubernetes API server.
  2. The API server drops the pod watch connection after a certain period of time (the API server is known to drop these connections every now and then, e.g. after 45 minutes or so).
  3. This plugin successfully renews the pod watch connection.
  4. Steps 2-3 repeat more than :watch_retry_max_times times, with no watch updates arriving from the API server in the meantime (an update would reset the watch retry count).

Nothing is actually going wrong in this scenario, so we don't want to raise a Fluent::UnrecoverableError in such a case (which causes the whole Fluentd instance to restart). To prevent this, I propose resetting the watch retry count not only on receiving an update from the watch (which might not happen if, for example, the namespaces of the k8s cluster don't change for a long time), but also on every successful re-connection to the API server.
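To make the proposed behavior concrete, here is a minimal Ruby sketch of such a watch retry loop. This is an illustration under assumptions, not the plugin's actual code: `start_pod_watch`, `process_pod_event`, and `run_pod_watch` are hypothetical placeholders; only the placement of the two resets mirrors the behavior described above.

```ruby
require 'fluent/error' # defines Fluent::UnrecoverableError (fluentd gem)

# Stub stand-ins so this sketch is self-contained; the real plugin talks
# to the Kubernetes API through the kubeclient gem.
def start_pod_watch
  [] # pretend the watch yields no events and is then closed by the server
end

def process_pod_event(event); end

# Hypothetical, simplified version of the pod watch loop (method and
# variable names are placeholders, not the plugin's actual API). The fix
# is the line marked NEW: the retry count is reset on every successful
# (re)connection, not only when a watch event arrives.
def run_pod_watch(max_retries:, base_interval: 1)
  retry_count = 0
  loop do
    begin
      watcher = start_pod_watch # open a watch connection to the API server
      retry_count = 0           # NEW: a successful connection resets the count
      watcher.each do |event|
        process_pod_event(event)
        retry_count = 0         # pre-existing behavior: an update resets the count
      end
      # The API server closed the watch (it does so periodically);
      # fall through and reconnect on the next loop iteration.
    rescue StandardError => e
      retry_count += 1
      if retry_count > max_retries
        # Reached only after max_retries *consecutive failed* connection
        # attempts, so routine connection drops no longer restart Fluentd.
        raise Fluent::UnrecoverableError, "pod watch failed: #{e.message}"
      end
      sleep(base_interval * (2**(retry_count - 1))) # exponential backoff
    end
  end
end
```

With the extra reset, :watch_retry_max_times bounds consecutive failed connection attempts rather than the total number of reconnections, so the periodic drops in step 2 can no longer exhaust the retry budget.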

@andrzej-stencel marked this pull request as ready for review on October 26, 2020, 12:36
@jcantrill (Contributor)

@jkohen @grosser might I get you to weigh in on these changes?

@jkohen (Contributor) commented Oct 26, 2020

@qingling128 FYI.

Resolved review discussion on test/plugin/test_watch_pods.rb
@andrzej-stencel (Contributor, Author)

Thanks a lot for the reviews, folks. Can we have this merged and released as a new version?

@andrzej-stencel (Contributor, Author)

Hey @jcantrill, is there anything stopping us from merging and releasing a new version? Is there anything I can help with?

@djj0809 commented Nov 13, 2020

@jcantrill our production Fluentd instances keep restarting all the time. Can we merge this and release a new version ASAP?

andrzej-stencel added commits to SumoLogic/sumologic-kubernetes-collection that referenced this pull request on Nov 26-27, 2020:

…1183)

This change bundles the `kubernetes_metadata` Fluentd plugin from https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/ into our repo, as the upstream is not active and our [pull request there](fabric8io/fluent-plugin-kubernetes_metadata_filter#267) is not being merged despite having been approved.

This code was copied from upstream at commit fabric8io/fluent-plugin-kubernetes_metadata_filter@84f66a8 on the `master` branch (Oct 8th, 2020), with the changes from pull request fabric8io/fluent-plugin-kubernetes_metadata_filter#267 applied on top.
andrzej-stencel added a commit to SumoLogic/sumologic-kubernetes-collection that referenced this pull request on Nov 30, 2020, with the same commit message as above.
@frankreno (Contributor)

@jcantrill @grosser @qingling128 - Is there an ETA for when this fix will be merged and a new release containing it will be available? We have numerous customers affected by this issue and waiting on the fix, which is why @astencel-sumo helped address it.

andrzej-stencel added a commit to SumoLogic/sumologic-kubernetes-collection that referenced this pull request on Dec 14, 2020, with the same commit message as above (#1193).
@elsesiy commented Jan 11, 2021

Any progress on this?

@djj0809 commented Jan 20, 2021

@jcantrill @grosser @qingling128 please merge it and release ASAP; it's a really annoying issue.

@grosser (Contributor) commented Jan 20, 2021

I can't merge; I'm not a maintainer 🤷. Also, I'm using fluent-plugin-kubelet_metadata now, if that is any help.

@qingling128 (Contributor)

Same here. I think @jcantrill has merge access.

@jcantrill merged commit d5fe9b7 into fabric8io:master on Jan 22, 2021
@jcantrill (Contributor)

Sorry for the delay here. I will release something shortly.

@jcantrill (Contributor)

Published 2.5.3: https://rubygems.org/gems/fluent-plugin-kubernetes_metadata_filter/versions/2.5.3

@andrzej-stencel deleted the Reset-watch-retry-count-on-successful-connection-to-API-server branch on July 21, 2023.

Successfully merging this pull request may close the issue "fail to parse watch call response and crashes pod."

8 participants