
Poor performance with http_listener input and influxdb output and large number of connections #4564

Closed
DanHoerst opened this issue Aug 16, 2018 · 1 comment

Comments


DanHoerst commented Aug 16, 2018

Relevant telegraf.conf:

# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)


# Global tags can be specified here in key="value" format.
[global_tags]
  service = "telegraf-influx-listener"
  region = "region"


# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "60s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at
  ## most metric_batch_size metrics.
  metric_batch_size = 1000
  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  metric_buffer_limit = 500000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "60s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "5s"

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = "/var/log/telegraf/telegraf.log"
  ## Override default hostname, if empty use os.Hostname()
  #hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false


###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

# Configuration for us-east-1 influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http://influx1"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "database" # required
  ## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  ## note: using "s" precision greatly improves InfluxDB compression.
  precision = "s"

  ## HTTP Content-Encoding for write request body, can be set to "gzip" to
  ## compress body or "identity" to apply no encoding.
  content_encoding = "gzip"

  ## Retention policy to write to.
  retention_policy = "default"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "300s"
  username = "user"
  password = "pass"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  user_agent = "telegraf-influx-listener"

# Configuration for local region influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://localhost:8089"] # UDP endpoint example
  urls = ["http:/influx2"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "database" # required
  ## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  ## note: using "s" precision greatly improves InfluxDB compression.
  precision = "s"

  ## HTTP Content-Encoding for write request body, can be set to "gzip" to
  ## compress body or "identity" to apply no encoding.
  content_encoding = "gzip"

  ## Retention policy to write to.
  retention_policy = "default"
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "300s"
  username = "user"
  password = "pass"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  user_agent = "telegraf-influx-listener"


###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Influx HTTP write listener
[[inputs.http_listener]]
  ## Address and port to host HTTP listener on
  service_address = ":8186"

  ## timeouts
  read_timeout = "300s"
  write_timeout = "300s"

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Read metrics about system load & uptime
[[inputs.system]]

# Read metrics about network interface usage
[[inputs.net]]
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status.
  interfaces = ["eth0"]

# Read TCP metrics such as established, time wait and sockets counts.
[[inputs.netstat]]

# Collect system resource usage of an individual process using its /proc data
# Telegraf must run as root in order for these metrics to be reported
[[inputs.procstat]]
  pattern = "telegraf"
  pid_tag = true

# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]

System info:


Telegraf version: 1.7.3
OS: Ubuntu 14.04

Running on AWS on m5.large and r5.large instances

Steps to reproduce:

  1. Spin up telegraf with the above config
  2. Open a large number of connections to telegraf (see the load-generation sketch below)
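
For step 2, here is a minimal Go sketch (not from the original report; the listener address, connection count, and payload are illustrative placeholders) that holds a configurable number of concurrent connections open against the http_listener:

// loadgen.go - rough sketch of a load generator for the http_listener input.
// Each worker keeps its own HTTP client and transport, so connections are not
// pooled across goroutines; the concurrent-connection count, not the request
// rate, is what this exercises.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	const (
		target      = "http://localhost:8186/write" // assumed listener address
		connections = 2500                          // just past the ~2000 observed ceiling
	)

	var wg sync.WaitGroup
	for i := 0; i < connections; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Dedicated client per worker -> one persistent TCP connection each.
			client := &http.Client{
				Transport: &http.Transport{MaxIdleConnsPerHost: 1},
				Timeout:   10 * time.Second,
			}
			for {
				// One line of line protocol per write keeps the request rate low
				// while the connection stays open.
				body := bytes.NewBufferString(
					fmt.Sprintf("loadgen,worker=%d value=1i %d\n", id, time.Now().UnixNano()))
				resp, err := client.Post(target, "text/plain", body)
				if err == nil {
					resp.Body.Close()
				}
				time.Sleep(time.Second)
			}
		}(i)
	}
	wg.Wait()
}

Running this against a single instance should reproduce the symptom: latency climbs with the number of open connections even though the write volume stays trivial.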

Expected behavior:

I'd expect Telegraf to be able to handle more than 2000 connections per telegraf instance.

Actual behavior:

Telegraf becomes very slow to respond and stops accepting new connections once a single instance reaches roughly 2000 concurrent connections. As far as I can tell, request count does not affect latency; the bottleneck appears to be the number of open connections, not the request rate.

Additional info:


The only errors we see in the logs are:

2018-08-16T14:19:16Z E! read tcp 10.4.13.159:8186->10.4.8.220:14434: i/o timeout
2018-08-16T14:19:16Z E! read tcp 10.4.13.159:8186->10.4.6.232:42200: i/o timeout
2018-08-16T14:19:16Z E! read tcp 10.4.13.159:8186->10.4.6.232:51620: i/o timeout

These messages flood the log once a node hits >= 2000 connections. However, there does not appear to be a CPU, network I/O, or memory bottleneck; in fact, memory and CPU usage are almost non-existent.

Here are the limits for the running telegraf process:

ubuntu@telegraf:~$ cat /proc/1545/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             61432                61432                processes
Max open files            24000                24000                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       61432                61432                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Please let me know if there is any additional info I can provide.

danielnelson (Contributor) commented:

This is a result of #2919. Telegraf is essentially getting backed up because it can't output as fast as it is receiving data. I'm currently working on a solution, perhaps you can help with testing when I'm a little further along?

In the meantime, the best workarounds are lowering the InfluxDB timeout (be careful, as this can put more load on InfluxDB when timed-out writes are retried), adding more Telegraf instances to your batching tier, or improving InfluxDB's write speed by allocating more resources to it. Depending on your durability requirements, you could also consider switching to UDP for sending to InfluxDB.
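
As a rough sketch of the config-level workarounds mentioned above (the URLs and database name below are placeholders, not taken from this report):

# Workaround sketch: lower the HTTP write timeout so a slow InfluxDB
# does not tie up the listener for minutes at a time.
[[outputs.influxdb]]
  urls = ["http://influx1"]
  database = "database"
  timeout = "15s"   # down from 300s; timed-out writes are retried, adding load on InfluxDB

# Alternative sketch: trade durability for throughput by writing over UDP.
# [[outputs.influxdb]]
#   urls = ["udp://influx1:8089"]
#   database = "database"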

I'm going to close this issue and I'll post on #2919 when the changes are ready.
