-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(inputs.cisco_telemetry_mdt): Add GRPC Keepalive/timeout config options #11458
feat(inputs.cisco_telemetry_mdt): Add GRPC Keepalive/timeout config options #11458
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for digging into this! We get a number of cisco connection issues and it is always helpful when someone puts up a PR to make this plugin easier to use.
Couple of comments are below.
@@ -27,6 +27,12 @@ later. | |||
## Grpc Maximum Message Size, default is 4MB, increase the size. | |||
max_msg_size = 4000000 | |||
|
|||
## GRPC permit keepalives without calls, set to true if your clients are sending pings without calls in-flight. | |||
# permit_keepalive_without_calls = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does a user know if their clients are sending pings without calls in flight? Basically, if I am using Telegraf, what would I need to know to change this from the default? Similar to what you added for the keepalive_minimum_time below description, you specified an error message that helps guide the user.
@reimda thoughts on a different variable to avoid a boolean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more documentation to that setting - it's probably not going to be used as much as keepalive_minimum_time, but may be helpful for stablizing telemetry on older/unpatched Cisco code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - I'd still like @reimda's thoughts on a way to avoid the boolean or to go with it given it is directly used to configure the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple thoughts:
I'd usually ask if there are other ways telegraf users might want to handle things related to this setting. For example, besides the default of not allowing keepalive pings without active streams, and the new option of allowing them, maybe someone would want to log when this happens. Then the setting should be a string instead of boolean and it should expect values "permit", "send_goaway", and "log".
In this case since the grpc module only has a boolean setting, it doesn't look like there are going to be other options besides permit and don't permit. Maybe a boolean setting is the right way to go.
The opcua and modbus plugins have "workarounds" sections to handle things like this. If this setting is only useful to interoperate with code that is known to be quirky (or clearly buggy), putting it in a workarounds section away from the ordinary plugin settings is a way to tell telegraf users it's only useful in rare occasions, but at the same time make it visible enough that they'll see it if they need it.
The grpc module does the same thing as those "workarounds" sections but calls it "enforcement policy". Maybe we should mirror the module settings exactly and have a section "enforcement_policy" or "grpc_enforcement_policy" and settings "min_time" and "permit_without_stream".
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me - after all, these settings are only useful to get around implementation issues, and don't have any effect on the TCP transport anyway. I'm looking at the opcua plugin to see how the workaround section is implemented, I could modify this to have a "grpc_enforcement_policy" section with these settings.
Download PR build artifacts for linux_amd64.tar.gz, darwin_amd64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for your contribution @danialre!
Required for all PRs:
This PR adds configuration options for changing server-side GRPC keepalive/ping settings. We've found when some Cisco IOS-XE devices send telemetry data on GRPC-Dialout over a firewall, some timings on ping packets are affected so that pings arrive without data faster than the minimum ping interval time (5 minute default). This can also be exacerbated by using telemetry intervals longer than 5 minutes.
Each time this happens is a "strike", and if it happens more than 2 times the GRPC server in cisco_telemetry_mdt.go closes out the connection with a GRPC GOAWAY (ENHANCE_YOUR_CALM) error. Sometimes the telemetry device sets up a new connection to continue sending data; other times it fails and we lose data that should otherwise make it to Telegraf.
More information on the keepalive/strike behavior is documented here: https://github.com/grpc/grpc/blob/master/doc/keepalive.md
The two config options added are:
These settings can allow Cisco MDT users to adjust the sensitivity of these keepalive timeouts, so GRPC sessions won't be closed as often for misbehaving devices.
Default/unfilled options won't change the current plugin behavior.