
telemetry messages loss #141

Closed
ndegory opened this issue Sep 5, 2016 · 13 comments


ndegory commented Sep 5, 2016

Metrics gathered by the telegraf agents are sent to the NATS streaming server, and the telegraf worker consumes them and pushes them to InfluxDB.
We notice a loss of messages.
The UI of the NATS server shows an equal number of inputs and outputs for the metrics messages, but the telegraf consumer acknowledges far fewer messages than expected.

ndegory changed the title from "telemetry message loss" to "telemetry messages loss" on Sep 5, 2016
ndegory self-assigned this on Sep 5, 2016

ndegory commented Sep 5, 2016

No error messages in the telegraf consumer logs.
Changing the data collection interval has no effect on the issue.


ndegory commented Sep 5, 2016

Unit tests added to telegraf (ed65d07806db4cbb0623c9b6cb296313990d3623): send 25 messages and expect 25 messages.
The unit test passes for Kafka but fails for NATS (only 1 to 7 messages received).
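
For reference, a minimal sketch of the same 25-in/25-out check in isolation (not the committed test), using the nats.go client and assuming a NATS server on the default local port; the subject name "telegraf" mirrors the consumer configuration, and the measurement contents are illustrative:

package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Assumes a NATS server reachable at nats://127.0.0.1:4222.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	var received int64
	// Count every message delivered on the subject the telegraf consumer subscribes to.
	if _, err := nc.Subscribe("telegraf", func(m *nats.Msg) {
		atomic.AddInt64(&received, 1)
	}); err != nil {
		panic(err)
	}

	const expected = 25
	for i := 0; i < expected; i++ {
		// One InfluxDB line-protocol metric per message, as the telegraf nats output sends them.
		line := fmt.Sprintf("cpu,host=test value=%d %d", i, time.Now().UnixNano())
		if err := nc.Publish("telegraf", []byte(line)); err != nil {
			panic(err)
		}
	}
	nc.Flush()

	time.Sleep(2 * time.Second) // give the async subscriber time to drain
	fmt.Printf("received %d/%d messages\n", atomic.LoadInt64(&received), expected)
}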


ndegory commented Sep 5, 2016

$ sudo docker-compose -f ./docker-compose.nats.yml run sut
WARNING: The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use the bundle feature of the Docker experimental build.

More info:
https://docs.docker.com/compose/bundles

Creating network "dockertelegraf_default" with the default driver
Creating dockertelegraf_nats_1
Creating dockertelegraf_telegraf-agent_1
Creating dockertelegraf_telegraf-consumer_1
INFO - resetting the test file
wait for telegraf to be available... [OK]
wait 2 sec for queuer... [OK]
test send to tcp listener... [OK](25 msg)
test output file... [OK](0 sec)
test tcp listener measurement data...
failed (5/25 msg)
2016/09/05 12:20:00 Starting Telegraf (version 1.0.0-rc1)
2016/09/05 12:20:00 Loaded outputs: file
2016/09/05 12:20:00 Loaded inputs: nats_consumer
2016/09/05 12:20:00 Tags enabled: host=af287c90d5da
2016/09/05 12:20:00 Agent Config: Interval:2s, Debug:false, Quiet:false, Hostname:"af287c90d5da", Flush Interval:2s
2016/09/05 12:20:00 Started the NATS consumer service, nats: nats://nats:4222, subjects: [telegraf], queue: telegraf_consumers
2016/09/05 12:20:04 Output [file] buffer fullness: 1 / 1000 metrics. Total gathered metrics: 1. Total dropped metrics: 0.
2016/09/05 12:20:04 Output [file] wrote batch of 1 metrics in 208.81µs
2016/09/05 12:20:06 Output [file] buffer fullness: 4 / 1000 metrics. Total gathered metrics: 5. Total dropped metrics: 0.
2016/09/05 12:20:06 Output [file] wrote batch of 4 metrics in 232.129µs
2016/09/05 12:20:08 Output [file] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 5. Total dropped metrics: 0.
2016/09/05 12:19:59 Starting Telegraf (version 1.0.0-rc1)
2016/09/05 12:19:59 Loaded outputs: nats
2016/09/05 12:19:59 Loaded inputs: tcp_listener
2016/09/05 12:19:59 Tags enabled: host=22fb199e93de
2016/09/05 12:19:59 Agent Config: Interval:2s, Debug:false, Quiet:false, Hostname:"22fb199e93de", Flush Interval:2s
2016/09/05 12:19:59 TCP server listening on: [::]:8094
2016/09/05 12:19:59 Started TCP listener service on :8094
2016/09/05 12:20:02 Output [nats] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 0. Total dropped metrics: 0.
2016/09/05 12:20:04 Output [nats] buffer fullness: 25 / 1000 metrics. Total gathered metrics: 25. Total dropped metrics: 0.
2016/09/05 12:20:04 Output [nats] wrote batch of 25 metrics in 120.058µs
2016/09/05 12:20:06 Output [nats] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 25. Total dropped metrics: 0.
2016/09/05 12:20:08 Output [nats] buffer fullness: 0 / 1000 metrics. Total gathered metrics: 25. Total dropped metrics: 0.


ndegory commented Sep 5, 2016

I ran the test with the NATS server instead of the NATS streaming server: same issue.


ndegory commented Sep 5, 2016

The unit test on the telegraf NATS consumer (run outside of the Docker image) does not exhibit the same bug.


ndegory commented Sep 5, 2016

Instead of using the telegraf NATS output, I tried sending data directly to NATS. The telegraf NATS input was then able to read all the packets.

subfuzion commented

@ndegory Helpful analysis so far with your last two comments, thanks.

@bertrand-quenin @freignat91 I don't think we should have reverted (#143) before we had tested at-least-once delivery messaging semantics. We can live with message loss for a bit while analyzing, checking the NATS Slack channel, and/or opening an issue on their GitHub repo.

See the rest of my overall thoughts here: #113 (comment)


bquenin commented Sep 6, 2016

I don't think there's any message loss from NATS itself; I think the telegraf plugins are not configured to work the way we want.
For log messages, there are no issues.


subfuzion commented Sep 6, 2016

@bertrand-quenin I agree with that assessment. I do have overarching concerns about NATS Streaming in general, which I mention in #113 (comment), but I also don't want to prematurely foreclose on getting things working and doing a fair assessment. Not sure we needed to revert (#143) so quickly -- we barely gave it a chance.

But as we've been discussing on Slack, I'm also fine with the idea that we don't deliver any telemetry over a message queue for now.

Not sure if it will make a difference, but I also mentioned on Slack that influxdata/telegraf#1697 was merged a few hours ago, so @ndegory should make sure to update our Telegraf image so we're using the latest with official telegraf -> NATS output plugin support.


ndegory commented Sep 6, 2016

Patch https://patch-diff.githubusercontent.com/raw/influxdata/telegraf/pull/1697.patch (as of Sept 6) applied to telegraf 1.0.0-rc1; same test, same result:

INFO - resetting the test file
wait for telegraf to be available... [OK]
wait 2 sec for queuer... [OK]
test send to tcp listener... [OK](25 msg)
test output file... [OK](0 sec)
test tcp listener measurement data...
failed (10/25 msg)


ndegory commented Sep 6, 2016

To check whether the issue is related to the input or the output plugin, I ran two separate tests:
50 msg into TCP listener input -> telegraf -> nats output plugin -> NATS -> nats-sub sample client
50 msg from nats-pub sample client -> NATS -> nats input plugin -> telegraf -> file output plugin
Both work fine; the messages are received on the other side.
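
The injection side of the first test can be reproduced with a short sketch like the one below (the :8094 listener address comes from the startup logs above; the measurement name and tag are illustrative):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Assumes the telegraf agent's tcp_listener input is reachable on localhost:8094,
	// as shown in the startup logs above.
	conn, err := net.Dial("tcp", "localhost:8094")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Write 50 InfluxDB line-protocol metrics, one per line; the agent forwards them
	// through the nats output plugin, and nats-sub reports what arrives on the subject.
	for i := 0; i < 50; i++ {
		line := fmt.Sprintf("cpu,host=test value=%d %d\n", i, time.Now().UnixNano())
		if _, err := conn.Write([]byte(line)); err != nil {
			panic(err)
		}
	}
}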


bquenin commented Sep 7, 2016

We can probably close this issue now.


ndegory commented Sep 7, 2016

I'd rather keep it open at low priority, as we'll eventually have to fix it.
