Retry creating kubeclients in FluentD when error #855
Maybe we could put querying into |
I think we could also add new |
LGTM - as discussed let's get this lined up for a 1.2.1 release.
I've made a few changes that I hope will help further
I thought about doing some sort of retry mechanism on client creation, but I wasn't sure how long we'd want to retry - infinitely? If the team thinks adding some sort of retry into |
fluent-plugin-enhance-k8s-metadata/lib/sumologic/kubernetes/connector.rb
Hmm, Travis failed even though the build passed locally, I’ll check tomorrow
Travis vs kubeclient error has been fixed upstream by #870 - please rebase :)
fbfc20c to e0c3bf4
e0c3bf4 to 11c463a
🎉
LGTM 👍
Edit: New proposal based on Dominik's comments is to retry for ~4 minutes (with exponential backoff) in the `configure` step. This should hopefully give enough leeway for transient issues, and throw an exception otherwise, causing the Fluentd worker process to die and restart (after having logged the exception). This will make it much easier to understand the issue than the previous `Got exception undefined method` error.

Based on the issues that @frankreno mentioned some customers were facing with this error, I found that it happens because Ruby does a weird thing where if a method returns an exception, Ruby treats that exception as a String type. In this case, inside `connect_kubernetes` we call `create_client`, which can return an exception. If for some reason we fail to create this client, we store the exception instead of the client, which means that when we try to call `client.getEntity`, we get the above `undefined method for String` error.

Because the client is created in the `configure` step, it is never re-initialized when the Fluentd worker exits and restarts. By moving `connect_kubernetes` to `start`, we will at least try to re-initialize the client.

Unfortunately, I spent some time trying to figure out why the initial error happens but haven't been able to, but hopefully this change at least lets Fluentd recover on its own once a transient issue ends.
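
To make the failure mode concrete, here is a minimal sketch of how a non-client value can end up stored in place of the client. Everything in it is a hypothetical stand-in (`FakeClient`, `get_entity`, `create_client_swallowing`, `create_client_strict`), not the plugin's actual code. It assumes the failing method rescues the error itself, in which case the last expression of the `rescue` body becomes the method's return value.

```ruby
# Hypothetical stand-in for a Kubernetes client; not the real kubeclient API.
class FakeClient
  def get_entity(name)
    "entity:#{name}"
  end
end

# Swallows the error: the rescue body's last expression is returned,
# so on failure the caller receives a String instead of a client.
def create_client_swallowing(should_fail:)
  raise IOError, "connection refused" if should_fail
  FakeClient.new
rescue IOError => e
  "create client failed: #{e.message}"
end

# Lets the error propagate, so a failure is visible where it happens.
def create_client_strict(should_fail:)
  raise IOError, "connection refused" if should_fail
  FakeClient.new
end

client = create_client_swallowing(should_fail: true)
puts client.class # prints "String"

begin
  client.get_entity("pod")
rescue NoMethodError => e
  # The confusing symptom shows up here, far from the real cause.
  puts "#{e.class}: #{e.message}"
end
```

With the strict variant the caller sees the original `IOError` at creation time, which is the behavior the proposal relies on to surface the real problem in the logs.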
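
The retry proposed in the edit above (roughly 4 minutes with exponential backoff, then raise) could look something like the following sketch. The helper name `with_backoff` and its parameters are hypothetical. The 240-second budget mirrors the "~4 minutes" in the description, while the 1-second starting delay and 60-second cap are assumptions. Only time spent sleeping counts against the budget, not the time taken by each attempt.

```ruby
# Hypothetical helper, not the plugin's actual code. Runs the block, and on
# failure waits with exponentially growing delays before trying again.
# Once the next wait would exceed the budget, the last error is re-raised.
def with_backoff(max_elapsed: 240, base_delay: 1, max_delay: 60,
                 sleeper: ->(seconds) { sleep(seconds) })
  elapsed = 0        # total seconds slept so far
  delay = base_delay # next wait, doubled after each failure
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    # Out of budget: let the caller (and the logs) see the real error.
    raise e if elapsed + delay > max_elapsed
    sleeper.call(delay)
    elapsed += delay
    delay = [delay * 2, max_delay].min
    retry
  end
end

# Usage sketch (create_client here is a placeholder for the real call):
#   @client = with_backoff { create_client }
```

If every attempt fails, the exception propagates out of the caller, which is what would make the Fluentd worker exit and restart as described. The `sleeper` argument exists only so the waiting can be stubbed out in tests.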