
telemetry messages loss #141

Closed
ndegory opened this issue Sep 5, 2016 · 13 comments


ndegory commented Sep 5, 2016

Metrics gathered by the telegraf agents are sent to the NATS streaming server, and the telegraf worker consumes them and pushes them to InfluxDB.
We notice a loss of messages.
The UI of the NATS server shows an equal number of inputs and outputs for the metrics messages, but the telegraf consumer acknowledges far fewer messages than expected.

ndegory changed the title from "telemetry message loss" to "telemetry messages loss" on Sep 5, 2016
ndegory self-assigned this on Sep 5, 2016

ndegory commented Sep 5, 2016

No error messages in the telegraf consumer logs.
Changing the data collection interval has no effect on the issue.


ndegory commented Sep 5, 2016

Unit tests added to telegraf (ed65d07806db4cbb0623c9b6cb296313990d3623): send 25 messages and expect 25 messages.
The unit test passes for Kafka but fails for NATS (only 1 to 7 messages received).
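
For reference, a minimal sketch of the same 25-in/25-out check in isolation (not the committed test), using the nats.go client and assuming a NATS server on the default local port; the subject name "telegraf" mirrors the consumer configuration, and the measurement contents are illustrative:

package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Assumes a NATS server reachable at nats://127.0.0.1:4222.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	var received int64
	// Count every message delivered on the subject the telegraf consumer subscribes to.
	if _, err := nc.Subscribe("telegraf", func(m *nats.Msg) {
		atomic.AddInt64(&received, 1)
	}); err != nil {
		panic(err)
	}

	const expected = 25
	for i := 0; i < expected; i++ {
		// One InfluxDB line-protocol metric per message, as the telegraf nats output sends them.
		line := fmt.Sprintf("cpu,host=test value=%d %d", i, time.Now().UnixNano())
		if err := nc.Publish("telegraf", []byte(line)); err != nil {
			panic(err)
		}
	}
	nc.Flush()

	time.Sleep(2 * time.Second) // give the async subscriber time to drain
	fmt.Printf("received %d/%d messages\n", atomic.LoadInt64(&received), expected)
}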


ndegory commented Sep 5, 2016

$ sudo docker-compose -f ./docker-compose.nats.yml run sut
WARNING: The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use the bundle feature of the Docker experimental build.

More info:
https://docs.docker.com/compose/bundles

Creating network "dockertelegraf_default" with the default driver
Creating dockertelegraf_nats_1
Creating dockertelegraf_telegraf-agent_1
Creating dockertelegraf_telegraf-consumer_1
INFO - resetting the test file
wait for telegraf to be available... [OK]
wait 2 sec for queuer... [OK]
test send to tcp listener... [OK](25 msg)
test output file... [OK](0 sec)
test tcp listener measurement data...
failed (5/25 msg)
2016/09/05 12:20:00 Starting Telegraf (version 1.0.0-rc1)
2016/09/05 12:20:00 Loaded outputs: file
2016/09/05 12:20:00 Loaded inputs: nats_consumer
2016/09/05 12:20:00 Tags enabled: host=af287c90d5da
2016/09/05 12:20:00 Agent Config: Interval:2s, Debug:false, Quiet:false, Hostname:"af287c90d5da", Flush Interval:2s
2016/09/05 12:20:00 Started the NATS consumer service, nats: nats://nats:4222, subjects: [telegraf], queue: telegraf_consumers
2016/09/05 12:20:04 Output [file] buffer fullness: 1 / 1000 metrics. Total gathered metrics: 1. Total dropped metrics: 0.
2016/09/05 12:20:04 Output [file] wrote batch of 1 metrics in 208.81µs
2016/09/05 12:20:06 Output [file] buffer fullness: 4 / 1000 metrics. Total gathered metrics: 5. Total dropped metrics: 0.
2016/09/05 12:20:06 Output [file] wrote batch of 4 metrics in 232.129µs
2016/09/05 12:20:08 Output [file] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 5. Total dropped metrics: 0.
2016/09/05 12:19:59 Starting Telegraf (version 1.0.0-rc1)
2016/09/05 12:19:59 Loaded outputs: nats
2016/09/05 12:19:59 Loaded inputs: tcp_listener
2016/09/05 12:19:59 Tags enabled: host=22fb199e93de
2016/09/05 12:19:59 Agent Config: Interval:2s, Debug:false, Quiet:false, Hostname:"22fb199e93de", Flush Interval:2s
2016/09/05 12:19:59 TCP server listening on: [::]:8094
2016/09/05 12:19:59 Started TCP listener service on :8094
2016/09/05 12:20:02 Output [nats] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2016/09/05 12:20:04 Output [nats] buffer fullness: 25 / 1000 metrics. Total gathered metrics: 25. Total dropped metrics: 0.
2016/09/05 12:20:04 Output [nats] wrote batch of 25 metrics in 120.058µs
2016/09/05 12:20:06 Output [nats] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 25. Total dropped metrics: 0.
2016/09/05 12:20:08 Output [nats] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 25. Total dropped metrics: 0.


ndegory commented Sep 5, 2016

I ran the test with the NATS server instead of the NATS streaming server: same issue.


ndegory commented Sep 5, 2016

The unit test on the telegraf NATS consumer (run outside of the Docker image) does not exhibit the same bug.


ndegory commented Sep 5, 2016

Instead of using the telegraf NATS output, I tried sending data directly to NATS. The telegraf NATS input was then able to read all the packets.

subfuzion commented

@ndegory Helpful analysis so far with your last two comments, thanks.

@bertrand-quenin @freignat91 I don't think we should have reverted (#143) before we had tested at-least-once delivery messaging semantics. We can live with message loss for a bit while analyzing, checking the NATS Slack channel, and/or opening an issue on their GitHub repo.

See the rest of my overall thoughts here: #113 (comment)


bquenin commented Sep 6, 2016

I don't think there's any message loss from NATS itself; I think the telegraf plugins are not configured to work the way we want.
For log messages, there are no issues.


subfuzion commented Sep 6, 2016

@bertrand-quenin I agree with that assessment. I do have overarching concerns about NATS Streaming in general, which I mention in #113 (comment), but I also don't want to prematurely foreclose on getting things working and doing a fair assessment. Not sure we needed to revert (#143) so quickly -- we barely gave it a chance.

But as we've been discussing on Slack, I'm also fine with the idea that we don't deliver any telemetry over a message queue for now.

Not sure if it will make a difference, but I also mentioned on Slack that influxdata/telegraf#1697 was merged a few hours ago, so @ndegory should make sure to update our Telegraf image so we're using the latest with official telegraf -> NATS output plugin support.


ndegory commented Sep 6, 2016

Patch https://patch-diff.githubusercontent.com/raw/influxdata/telegraf/pull/1697.patch (as of Sept 6) applied to telegraf 1.0.0-rc1; same test, same result:

INFO - resetting the test file
wait for telegraf to be available... [OK]
wait 2 sec for queuer... [OK]
test send to tcp listener... [OK](25 msg)
test output file... [OK](0 sec)
test tcp listener measurement data...
failed (10/25 msg)


ndegory commented Sep 6, 2016

To check whether the issue is related to the input or the output plugin, I ran two separate tests:
50 msg into TCP listener input -> telegraf -> nats output plugin -> NATS -> nats-sub sample client
50 msg from nats-pub sample client -> NATS -> nats input plugin -> telegraf -> file output plugin
Both work fine; the messages are received on the other side.
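
The injection side of the first test can be reproduced with a short sketch like the one below (the :8094 listener address comes from the startup logs above; the measurement name and tag are illustrative):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Assumes the telegraf agent's tcp_listener input is reachable on localhost:8094,
	// as shown in the startup logs above.
	conn, err := net.Dial("tcp", "localhost:8094")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Write 50 InfluxDB line-protocol metrics, one per line; the agent forwards them
	// through the nats output plugin, and nats-sub reports what arrives on the subject.
	for i := 0; i < 50; i++ {
		line := fmt.Sprintf("cpu,host=test value=%d %d\n", i, time.Now().UnixNano())
		if _, err := conn.Write([]byte(line)); err != nil {
			panic(err)
		}
	}
}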


bquenin commented Sep 7, 2016

We can probably close this issue now.


ndegory commented Sep 7, 2016

I'd rather keep it open at low priority, as we'll eventually have to fix it.
