Retry creating kubeclients in FluentD when error #855

samjsong · 2020-08-26T15:51:02Z

Edit: New proposal based on Dominik's comments is to retry for ~4 minutes (with exponential backoff) in the configure step. This should hopefully give enough leeway for transient issue, and throw an exception otherwise, causing the Fluentd worker process to die and restart (after having logged the exception). This will make it way easier to understand the issue than the previous Got exception undefined method.

Based on the issues that @frankreno mentioned some customers were facing with

Got exception undefined method _____ for #<String:0x00007ff19eaf4e60> ______

I found that this is because Ruby does a weird thing where if a method returns an exception, Ruby treats that exception as a String type. In this case, inside connect_kubernetes we call create_client which can return an exception. If for some reason we fail to create this client, we store the exception instead of the client - which means that when we try to call client.getEntity, we get the above undefined method for String error.

By creating the client in the configure step, this means that when the fluentd worker exits and restarts, we never re-initialize the client. ~~By moving connect_kubernetes to start, we will at least try to re-initialize the client.~~

Unfortunately I spent some time trying to figure out why the initial error happens but haven't been able to figure that out, but hopefully this change at least makes it so that fluentd can fix itself if a transient issue ends.

sumo-drosiek · 2020-08-26T15:58:04Z

Maybe we could put querying into begin rescue block and reconnect in case of connection is invalid

samjsong · 2020-08-26T16:01:07Z

I think we could also add new begin rescue blocks where we call kubeclient methods to re-init the clients in that same case

frankreno

LGTM - as discussed let's get this lined up for a 1.2.1 release.

samjsong · 2020-09-02T07:11:19Z

I've made a few changes that I hope will help further

return nil in create_clients so that @clients[api_version] will return nil rather than a String of the exception thrown
in the events plugin and enriched metadata plugin service monitor, add a nil check on relevant clients. If nil, attempt to recreate the client before proceeding. There's no guarantee this attempt will succeed (especially if there's a legitimate issue), but as these are infinite loops we'll at least try again soon.
in fetch_resource, fix the check for client not created. Again, attempt to recreate if nil.

I thought about doing some sort of retry mechanism on client creation, but I wasn't sure how long we'd want to retry - infinitely? If the team thinks adding some sort of retry into create_clients itself, I'd be happy to do that as well

fluent-plugin-enhance-k8s-metadata/lib/sumologic/kubernetes/connector.rb

samjsong · 2020-09-02T07:59:25Z

Hmm, Travis failed even though build passed locally, I’ll check tomorrow

perk · 2020-09-03T06:41:47Z

Travis vs kubeclient error has been fixed upstream by #870 - please rebase :)

sumo-drosiek

🎉

perk-sumo

LGTM 👍

samjsong requested review from frankreno, a team, perk-sumo and pmalek-sumo and removed request for a team August 26, 2020 15:51

frankreno approved these changes Aug 27, 2020

View reviewed changes

frankreno added this to the v1.2 milestone Sep 1, 2020

samjsong changed the title ~~initialize kube api clients at start, not config~~ initialize kubeclients at start; check and reinit when there is an error Sep 2, 2020

sumo-drosiek reviewed Sep 2, 2020

View reviewed changes

fluent-plugin-enhance-k8s-metadata/lib/sumologic/kubernetes/connector.rb Outdated Show resolved Hide resolved

samjsong force-pushed the ssong-init-client branch 2 times, most recently from fbfc20c to e0c3bf4 Compare September 3, 2020 06:51

samjsong added 4 commits September 2, 2020 23:52

initialize kube api clients at start, not config

9cec068

check and attempt to reinitialize client

95b6a91

revert previous changes, just retry in configure

fa4e56e

fix relative URL error in tests

11c463a

samjsong force-pushed the ssong-init-client branch from e0c3bf4 to 11c463a Compare September 3, 2020 06:52

sumo-drosiek approved these changes Sep 3, 2020

View reviewed changes

samjsong changed the title ~~initialize kubeclients at start; check and reinit when there is an error~~ retry creating kubeclients when error Sep 3, 2020

perk-sumo approved these changes Sep 3, 2020

View reviewed changes

perk-sumo merged commit 6648b63 into master Sep 3, 2020

perk-sumo deleted the ssong-init-client branch September 3, 2020 11:21

perk-sumo changed the title ~~retry creating kubeclients when error~~ Retry creating kubeclients in FluentD when error Sep 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry creating kubeclients in FluentD when error #855

Retry creating kubeclients in FluentD when error #855

samjsong commented Aug 26, 2020 •

edited

Loading

sumo-drosiek commented Aug 26, 2020

samjsong commented Aug 26, 2020

frankreno left a comment

samjsong commented Sep 2, 2020

samjsong commented Sep 2, 2020

perk commented Sep 3, 2020

sumo-drosiek left a comment

perk-sumo left a comment

Retry creating kubeclients in FluentD when error #855

Retry creating kubeclients in FluentD when error #855

Conversation

samjsong commented Aug 26, 2020 • edited Loading

sumo-drosiek commented Aug 26, 2020

samjsong commented Aug 26, 2020

frankreno left a comment

Choose a reason for hiding this comment

samjsong commented Sep 2, 2020

samjsong commented Sep 2, 2020

perk commented Sep 3, 2020

sumo-drosiek left a comment

Choose a reason for hiding this comment

perk-sumo left a comment

Choose a reason for hiding this comment

samjsong commented Aug 26, 2020 •

edited

Loading