Telemetry message loss #141
Comments
No error messages in the telegraf consumer logs.
Unit tests added to telegraf (commit ed65d07806db4cbb0623c9b6cb296313990d3623): send 25 messages and expect 25 messages back. A rough sketch of that kind of check is below.
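For reference, a minimal send-N/expect-N sketch against a locally running NATS server. This is not the code from the commit above; it assumes the nats.go client, the default localhost:4222 URL, and an illustrative subject name "telegraf":

```go
package natsloss_test

import (
	"sync/atomic"
	"testing"
	"time"

	"github.com/nats-io/nats.go"
)

// Publish a fixed number of messages and verify the subscriber saw them all.
// Subject name and payload are placeholders, not the values used in the
// actual telegraf unit test.
func TestNoMessageLoss(t *testing.T) {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		t.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	var received int64
	sub, err := nc.Subscribe("telegraf", func(_ *nats.Msg) {
		atomic.AddInt64(&received, 1)
	})
	if err != nil {
		t.Fatalf("subscribe: %v", err)
	}
	defer sub.Unsubscribe()

	const expected = 25
	for i := 0; i < expected; i++ {
		if err := nc.Publish("telegraf", []byte("cpu usage_idle=99")); err != nil {
			t.Fatalf("publish: %v", err)
		}
	}
	if err := nc.Flush(); err != nil {
		t.Fatalf("flush: %v", err)
	}

	// Give the async subscription a moment to drain before counting.
	time.Sleep(500 * time.Millisecond)
	if got := atomic.LoadInt64(&received); got != expected {
		t.Fatalf("expected %d messages, got %d", expected, got)
	}
}
```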
I ran the test with the plain nats server instead of the nats-streaming server: same issue.
A unit test on the telegraf nats consumer (outside of the docker image) does not reproduce the bug.
Instead of the telegraf nats output, I tried to send data directly to nats. The telegraf nats input was then able to read all the packets (see the sketch below for what such a direct publisher might look like).
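A minimal sketch of publishing line-protocol points straight to NATS, bypassing the telegraf nats output plugin. Again assuming the nats.go client and an illustrative subject name "telegraf"; the payload contents are made up:

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

// Publishes a small batch of InfluxDB line-protocol points directly to NATS,
// so the telegraf nats_consumer input can be tested in isolation from the
// telegraf nats output plugin.
func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	for i := 0; i < 25; i++ {
		point := fmt.Sprintf("cpu,host=test-%d usage_idle=99", i)
		if err := nc.Publish("telegraf", []byte(point)); err != nil {
			log.Fatalf("publish: %v", err)
		}
	}
	// Flush to make sure everything left the client before exiting.
	if err := nc.Flush(); err != nil {
		log.Fatalf("flush: %v", err)
	}
	log.Println("published 25 points")
}
```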
@ndegory Helpful analysis so far with your last two comments, thanks. @bertrand-quenin @freignat91 I don't think we should have reverted (#143) before we had tested at-least-once delivery semantics. We can live with message loss for a bit while analyzing, and also check on the NATS Slack channel and/or open an issue on their GitHub repo. See the rest of my overall thoughts here: #113 (comment)
I don't think there's any message loss from nats itself. I think the telegraf plugins are not configured to work the way we want.
@bertrand-quenin I agree with that assessment. I do have overarching concerns about NATS Streaming in general, which I mention in #113 (comment), but I also don't want to prematurely foreclose on getting things working and doing a fair assessment. Not sure we needed to revert (#143) so quickly -- we barely gave it a chance. But as we've been discussing on Slack, I'm also fine with the idea that we don't deliver any telemetry over a message queue for now. Not sure if it will make a difference, but I also mentioned on Slack that influxdata/telegraf#1697 was merged a few hours ago, so @ndegory should make sure to update our Telegraf image so we're using the latest with official telegraf -> NATS output plugin support.
Patch https://patch-diff.githubusercontent.com/raw/influxdata/telegraf/pull/1697.patch (as of Sept 6th) applied on telegraf 1.0.0-rc1; same test, same result.
To check whether the issue is related to the input or the output plugin, I ran two separate tests (the docker-free consumer test and the direct-to-nats publish described above).
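Another way to tell the two plugins apart (not necessarily what was done here) is to run an independent subscriber that counts every message on the metrics subject, and compare that count against what the output plugin claims to have sent and what the input plugin acknowledges. A sketch, with the same assumed nats.go client and "telegraf" subject:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"

	"github.com/nats-io/nats.go"
)

// Counts every message observed on the metrics subject, independently of
// telegraf, until interrupted with Ctrl-C.
func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer nc.Close()

	var count int64
	if _, err := nc.Subscribe("telegraf", func(_ *nats.Msg) {
		atomic.AddInt64(&count, 1)
	}); err != nil {
		log.Fatalf("subscribe: %v", err)
	}

	// Run until interrupted, then report the total.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
	log.Printf("observed %d messages", atomic.LoadInt64(&count))
}
```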
We can probably close this issue now. |
I'd rather keep it open but low priority, as we'll have to eventually fix it. |
Metrics gathered by the telegraf agents are sent to the nats streaming server, and the telegraf worker consumes them and pushes them to influxdb.
We notice a loss of messages.
The UI of the nats server shows an equal number of inputs and outputs for the metrics messages, but the telegraf consumer acknowledges far fewer messages than expected.