Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added an option to skip failed outputs in agent startup #10111

Closed
wants to merge 3 commits into from

Conversation

eran-gil2
Copy link

@eran-gil2 eran-gil2 commented Nov 16, 2021

Required for all PRs:

resolves #3723, resolves #9778

@telegraf-tiger
Copy link
Contributor

Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla

@telegraf-tiger telegraf-tiger bot added the fix pr to fix corresponding bug label Nov 16, 2021
@eran-gil2
Copy link
Author

!signed-cla

@eran-gil2 eran-gil2 force-pushed the master branch 2 times, most recently from ff6dc62 to c68533e Compare November 16, 2021 14:05
@eran-gil2
Copy link
Author

hey could you please approve other workflows so this feature could be merged? :)

@eran-gil2
Copy link
Author

@srebhan who is allowed to review my feature? anyone? or are there certain specifications for this?

Copy link
Contributor

@Hipska Hipska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General idea is good, but I would add a command line option for that as well.

It seems to me that it might be better if outputs that are not able to connect at startup are retried after some time instead of just being completely ignored for the duration of telegraf run.

@@ -199,6 +199,9 @@ type AgentConfig struct {

Hostname string
OmitHostname bool

// Allows running the agents even if some outputs fail
SkipFailedOutputs bool `toml:"skip_failed_outputs"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update related agentConfig with example configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As well as the etc/telegraf.conf and etc/telegraf_windows.conf

@Hipska
Copy link
Contributor

Hipska commented Nov 23, 2021

This PR also doesn't solve #3723 since it is requested to have x times or retry. This PR doesn't add this behaviour.

Comments in a related PR (#9051) mentioned that plugins should be implementing the Start or Connect method in order to let telegraf startup without halting if one of the plugins cannot be connected to. It seems that in order to have it like this, a change in agent.go needs to be done.

@srebhan
Copy link
Member

srebhan commented Nov 23, 2021

@eran-gil2 I'd like to discuss this with the core-devs as this is a very mighty feature IMO. To me it seems too easy to loose data when using this feature without caution as you will get an "everything alright" impression but you easily have no working output. I think this might be some fundamental thing to retry to connect to the outputs until we reach a buffer overflow or anything. We even might want to persist the metrics locally in this case...

@srebhan
Copy link
Member

srebhan commented Jun 22, 2022

@eran-gil2 after discussing this matter with the team, we came to the conclusion that a global option to ignore all errors is not the way we want to go. There have been similar proposals (#11313) and requests (#10694, #11289) for the input side.

However, we think it's better to fix the individual plugins to avoid connecting to services in Init but use Connect instead. Furthermore, we do see the demand for a more dynamic configuration of telegraf and other features such as life-cycle and connection management. We will start a discussion on how to integrate such features in future versions of telegraf.

For now, I'm closing this PR as a correct implementation requires a rework of Telegraf internals which is out-of-scope for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix pr to fix corresponding bug needs design review
Projects
None yet
3 participants