Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(inputs.cisco_telemetry_mdt): Add GRPC Keepalive/timeout config options #11458

Merged
merged 6 commits into from
Jul 18, 2022

Conversation

danialre
Copy link
Contributor

@danialre danialre commented Jul 5, 2022

Required for all PRs:

This PR adds configuration options for changing server-side GRPC keepalive/ping settings. We've found when some Cisco IOS-XE devices send telemetry data on GRPC-Dialout over a firewall, some timings on ping packets are affected so that pings arrive without data faster than the minimum ping interval time (5 minute default). This can also be exacerbated by using telemetry intervals longer than 5 minutes.
Each time this happens is a "strike", and if it happens more than 2 times the GRPC server in cisco_telemetry_mdt.go closes out the connection with a GRPC GOAWAY (ENHANCE_YOUR_CALM) error. Sometimes the telemetry device sets up a new connection to continue sending data; other times it fails and we lose data that should otherwise make it to Telegraf.

More information on the keepalive/strike behavior is documented here: https://github.com/grpc/grpc/blob/master/doc/keepalive.md

The two config options added are:

  • permit_keepalive_without_calls (bool) - enable/disable enforcement of requiring in-flight calls for pings
  • keepalive_minimum_time (duration) - set a custom minimum time in between successive pings.

These settings can allow Cisco MDT users to adjust the sensitivity of these keepalive timeouts, so GRPC sessions won't be closed as often for misbehaving devices.
Default/unfilled options won't change the current plugin behavior.

@telegraf-tiger telegraf-tiger bot added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Jul 5, 2022
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for digging into this! We get a number of cisco connection issues and it is always helpful when someone puts up a PR to make this plugin easier to use.

Couple of comments are below.

@@ -27,6 +27,12 @@ later.
## Grpc Maximum Message Size, default is 4MB, increase the size.
max_msg_size = 4000000

## GRPC permit keepalives without calls, set to true if your clients are sending pings without calls in-flight.
# permit_keepalive_without_calls = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does a user know if their clients are sending pings without calls in flight? Basically, if I am using Telegraf, what would I need to know to change this from the default? Similar to what you added for the keepalive_minimum_time below description, you specified an error message that helps guide the user.

@reimda thoughts on a different variable to avoid a boolean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more documentation to that setting - it's probably not going to be used as much as keepalive_minimum_time, but may be helpful for stablizing telemetry on older/unpatched Cisco code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I'd still like @reimda's thoughts on a way to avoid the boolean or to go with it given it is directly used to configure the connection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple thoughts:

I'd usually ask if there are other ways telegraf users might want to handle things related to this setting. For example, besides the default of not allowing keepalive pings without active streams, and the new option of allowing them, maybe someone would want to log when this happens. Then the setting should be a string instead of boolean and it should expect values "permit", "send_goaway", and "log".

In this case since the grpc module only has a boolean setting, it doesn't look like there are going to be other options besides permit and don't permit. Maybe a boolean setting is the right way to go.

The opcua and modbus plugins have "workarounds" sections to handle things like this. If this setting is only useful to interoperate with code that is known to be quirky (or clearly buggy), putting it in a workarounds section away from the ordinary plugin settings is a way to tell telegraf users it's only useful in rare occasions, but at the same time make it visible enough that they'll see it if they need it.

The grpc module does the same thing as those "workarounds" sections but calls it "enforcement policy". Maybe we should mirror the module settings exactly and have a section "enforcement_policy" or "grpc_enforcement_policy" and settings "min_time" and "permit_without_stream".

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me - after all, these settings are only useful to get around implementation issues, and don't have any effect on the TCP transport anyway. I'm looking at the opcua plugin to see how the workaround section is implemented, I could modify this to have a "grpc_enforcement_policy" section with these settings.

plugins/inputs/cisco_telemetry_mdt/README.md Outdated Show resolved Hide resolved
@telegraf-tiger
Copy link
Contributor

@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Jul 18, 2022
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for your contribution @danialre!

@powersj powersj merged commit 1fa47c8 into influxdata:master Jul 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants