Two more Kafka input failures that break all other plugins #9778

Closed
daviesalex opened this issue Sep 18, 2021 · 1 comment · Fixed by #14884
Labels
area/kafka bug unexpected problem or unintended behavior

Comments

@daviesalex

We see two failures in Telegraf 1.20-rc0 in the Kafka plugin alone, despite #9051, which was supposed to fix this plugin:

1. If the Kafka backends are just down

Use this config to test:

[agent]
  interval = "1s"
  flush_interval = "1s"
  omit_hostname = true
  collection_jitter = "0s"
  flush_jitter = "0s"

[[outputs.kafka]]
  brokers = ["server1:9092","server2:9092","server3:9092"]
  topic = "xx"
  client_id = "telegraf-metrics-foo"
  version = "2.4.0"
  routing_tag = "host"
  required_acks = 1
  max_retry = 100
  sasl_mechanism = "SCRAM-SHA-256"
  sasl_username = "foo"
  sasl_password = "bar"
  exclude_topic_tag = true
  compression_codec = 4
  data_format = "msgpack"

[[inputs.cpu]]

[[outputs.file]]
  files = ["stdout"]

Make sure the client cannot reach server[1-3]. We null-routed them with ip route add x via 127.0.0.1, but you could also use a firewall or point the config at IPs that are not running Kafka.

What we expect:

  • Kafka output fails and tries to reconnect 100 times
  • I can still see the CPU input plugin sending data to stdout
  • Once Kafka manages to connect then I see the data there as well

What actually happens:

  • Kafka tries to connect a couple of times
  • CPU input plugin data is never passed to the stdout
  • After Kafka fails, Telegraf exits with an error

2. If the Kafka sasl_password is wrong and SASL auth is enabled

This is trivial to reproduce: just change the sasl_password in a working config.

What we expect:

  • Kafka output fails and tries to reconnect X times
  • Everything else works fine

What actually happens:

  • Telegraf immediately fails to start with this error (process exits):
[root@x ~]# /usr/local/telegraf/bin/telegraf -config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/conf.d
2021-09-16T11:23:51Z I! Starting Telegraf build-50
...
2021-09-16T11:23:51Z E! [agent] Failed to connect to [outputs.kafka], retrying in 15s, error was 'kafka server: SASL Authentication failed.'
@reimda
Contributor

reimda commented Sep 21, 2021

There are a few retry mechanisms built into sarama (the library that Telegraf uses for Kafka support). I did a quick test in #9786 to see if they affect connection retries like the ones described in this issue. I configured Telegraf to connect to localhost on a port that isn't listening. In this case the config.Producer.Retry and config.Admin.Retry settings don't seem to affect retries.

We will need to spend some more time understanding how sarama intends to handle connection failures and retries. If there is no provision for retrying connection failures in the library, we may need the plugin to detect failures and retry them.
