Relevant telegraf.conf:
# Telegraf Configuration
#
# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.
#
# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.
#
# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.
#
# Environment variables can be used anywhere in this config file, simply prepend
# them with $. For strings the variable must be within quotes (ie, "$STR_VAR"),
# for numbers and booleans they should be plain (ie, $INT_VAR, $BOOL_VAR)
# Global tags can be specified here in key="value" format.
[global_tags]
service = "telegraf-influx-listener"
region = "region"
# Configuration for telegraf agent
[agent]
## Default data collection interval for all inputs
interval = "60s"
## Rounds collection interval to 'interval'
## ie, if interval="10s" then always collect on :00, :10, :20, etc.
round_interval = true
## Telegraf will send metrics to outputs in batches of at
## most metric_batch_size metrics.
metric_batch_size = 1000
## For failed writes, telegraf will cache metric_buffer_limit metrics for each
## output, and will flush this buffer on a successful write. Oldest metrics
## are dropped first when this buffer fills.
metric_buffer_limit = 500000
## Collection jitter is used to jitter the collection by a random amount.
## Each plugin will sleep for a random time within jitter before collecting.
## This can be used to avoid many plugins querying things like sysfs at the
## same time, which can have a measurable effect on the system.
collection_jitter = "0s"
## Default flushing interval for all outputs. You shouldn't set this below
## interval. Maximum flush_interval will be flush_interval + flush_jitter
flush_interval = "60s"
## Jitter the flush interval by a random amount. This is primarily to avoid
## large write spikes for users running a large number of telegraf instances.
## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
flush_jitter = "5s"
## Logging configuration:
## Run telegraf with debug log messages.
debug = false
## Run telegraf in quiet mode (error log messages only).
quiet = false
## Specify the log file name. The empty string means to log to stderr.
logfile = "/var/log/telegraf/telegraf.log"
## Override default hostname, if empty use os.Hostname()
#hostname = ""
## If set to true, do not set the "host" tag in the telegraf agent.
omit_hostname = false
###############################################################################
# OUTPUT PLUGINS #
###############################################################################
# Configuration for us-east-1 influxdb server to send metrics to
[[outputs.influxdb]]
## The full HTTP or UDP endpoint URL for your InfluxDB instance.
## Multiple urls can be specified as part of the same cluster,
## this means that only ONE of the urls will be written to each interval.
# urls = ["udp://localhost:8089"] # UDP endpoint example
urls = ["http://influx1"] # required
## The target database for metrics (telegraf will create it if not exists).
database = "database" # required
## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
## note: using "s" precision greatly improves InfluxDB compression.
precision = "s"
## HTTP Content-Encoding for write request body, can be set to "gzip" to
## compress body or "identity" to apply no encoding.
content_encoding = "gzip"
## Retention policy to write to.
retention_policy = "default"
## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
write_consistency = "any"
## Write timeout (for the InfluxDB client), formatted as a string.
## If not provided, will default to 5s. 0s means no timeout (not recommended).
timeout = "300s"
username = "user"
password = "pass"
## Set the user agent for HTTP POSTs (can be useful for log differentiation)
user_agent = "telegraf-influx-listener"
# Configuration for local region influxdb server to send metrics to
[[outputs.influxdb]]
## The full HTTP or UDP endpoint URL for your InfluxDB instance.
## Multiple urls can be specified as part of the same cluster,
## this means that only ONE of the urls will be written to each interval.
# urls = ["udp://localhost:8089"] # UDP endpoint example
urls = ["http://influx2"] # required
## The target database for metrics (telegraf will create it if not exists).
database = "database" # required
## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
## note: using "s" precision greatly improves InfluxDB compression.
precision = "s"
## HTTP Content-Encoding for write request body, can be set to "gzip" to
## compress body or "identity" to apply no encoding.
content_encoding = "gzip"
## Retention policy to write to.
retention_policy = "default"
## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
write_consistency = "any"
## Write timeout (for the InfluxDB client), formatted as a string.
## If not provided, will default to 5s. 0s means no timeout (not recommended).
timeout = "300s"
username = "user"
password = "pass"
## Set the user agent for HTTP POSTs (can be useful for log differentiation)
user_agent = "telegraf-influx-listener"
###############################################################################
# INPUT PLUGINS #
###############################################################################
# Influx HTTP write listener
[[inputs.http_listener]]
## Address and port to host HTTP listener on
service_address = ":8186"
## timeouts
read_timeout = "300s"
write_timeout = "300s"
# Read metrics about cpu usage
[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## If true, collect raw CPU time metrics.
collect_cpu_time = false
# Get kernel statistics from /proc/stat
[[inputs.kernel]]
# no configuration
# Read metrics about memory usage
[[inputs.mem]]
# no configuration
# Read metrics about system load & uptime
[[inputs.system]]
# # Read metrics about network interface usage
[[inputs.net]]
# ## By default, telegraf gathers stats from any up interface (excluding loopback)
# ## Setting interfaces will tell it to gather these explicit interfaces,
# ## regardless of status.
# ##
interfaces = ["eth0"]
# # Read TCP metrics such as established, time wait and sockets counts.
[[inputs.netstat]]
## Collect system resource usage by an individual process using their /proc data
# Telegraf must run as root in order for these metrics to be reported
[[inputs.procstat]]
pattern = "telegraf"
pid_tag = true
# Read metrics about disk IO by device
[[inputs.diskio]]
## By default, telegraf will gather stats for all devices including
## disk partitions.
## Setting devices will restrict the stats to the specified devices.
# devices = ["sda", "sdb"]
## Uncomment the following line if you need disk serial numbers.
# skip_serial_number = false
#
## On systems which support it, device metadata can be added in the form of
## tags.
## Currently only Linux is supported via udev properties. You can view
## available properties for a device by running:
## 'udevadm info -q property -n /dev/sda'
# device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
#
## Using the same metadata source as device_tags, you can also customize the
## name of the device via templates.
## The 'name_templates' parameter is a list of templates to try and apply to
## the device. The template may contain variables in the form of '$PROPERTY' or
## '${PROPERTY}'. The first template which does not contain any variables not
## present for the device is used as the device name tag.
## The typical use case is for LVM volumes, to get the VG/LV name instead of
## the near-meaningless DM-0 name.
# name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]
System info:
Telegraf version: 1.7.3
OS: Ubuntu 14.04
Running on AWS on m5.large and r5.large instances
Steps to reproduce:
Spin up telegraf with the above config
Open a large number of connections to telegraf
Expected behavior:
I'd expect Telegraf to be able to handle more than 2000 connections per telegraf instance.
Actual behavior:
Telegraf becomes very slow to respond and stops accepting connections at 2000 connections per telegraf instance. As far as I can tell, request count does not affect latency; this looks like a maximum-connection problem.
Additional info:
The only errors we see in the logs are:
These messages spam the log once a node reaches 2000 or more connections. However, there does not appear to be a CPU, network I/O, or memory bottleneck. In fact, memory and CPU usage are almost nonexistent.
Here are the limits for the running telegraf process:
ubuntu@telegraf:~$ cat /proc/1545/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             61432                61432                processes
Max open files            24000                24000                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       61432                61432                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Please let me know if there is any additional info I can provide.
This is a result of #2919. Telegraf is essentially getting backed up because it can't write to its outputs as fast as it is receiving data. I'm currently working on a solution; perhaps you can help with testing when I'm a little further along?
In the meantime, the best workarounds are lowering the InfluxDB timeout (be careful, as this can put more load on InfluxDB when timed-out writes are retried), adding more Telegraf instances to your batching tier, or improving the write speed of InfluxDB by allocating more resources to it. Depending on your durability requirements, you could also consider switching to UDP for sending to InfluxDB.
I'm going to close this issue and I'll post on #2919 when the changes are ready.
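For anyone trying the workarounds above, here is a rough sketch of how the output section might change. The values are illustrative assumptions, not tested recommendations: `http://influx1` and `database` are the placeholders from the reported config, `5s` is the default mentioned in the config's own comment, and the UDP endpoint is the one from the sample comment.

```toml
# Sketch only: illustrative values, not a tested recommendation.
[[outputs.influxdb]]
  urls = ["http://influx1"]
  database = "database"
  ## Much lower than the 300s in the reported config, so a slow write
  ## fails sooner instead of holding up the listener.
  ## Timed-out writes are retried, which can add load on InfluxDB.
  timeout = "5s"

# Alternative, only if some metric loss is acceptable: send over UDP.
# The host and port below are placeholders from the sample config comment.
# [[outputs.influxdb]]
#   urls = ["udp://localhost:8089"]
```

Running `telegraf -config telegraf.conf -test`, as the config header suggests, is a quick way to check that the edited file still parses before restarting the service.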