internal_write.buffer_size metric not reset on timed writes #5298

Closed
pberlowski opened this issue Jan 16, 2019 · 2 comments · Fixed by #5314
Labels: area/agent, bug (unexpected problem or unintended behavior)

Comments

pberlowski (Contributor) commented:

Relevant telegraf.conf:

[global_tags]
  test = "test"

# Configuration for telegraf agent
[agent]
  interval = "30s"
  round_interval = true
  metric_batch_size = 500
  metric_buffer_limit = 500000
  collection_jitter = "0s"
  flush_interval = "30s"
  flush_jitter = "5s"

  ## By default, precision will be set to the same timestamp order as the
  ## collection interval, with the maximum being 1s.
  ## Precision will NOT be used for service inputs, such as logparser and statsd.
  ## Valid values are "ns", "us" (or "µs"), "ms", "s".
  precision = "1ns"

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = true
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = "/var/log/telegraf/telegraf.log"

  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false


[[inputs.http_listener]]
  # Gateway listens globally
  service_address = "0.0.0.0:8186"
  read_timeout = "10s"
  write_timeout = "10s"

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default, telegraf gathers stats for all mountpoints.
  ## Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points = ["/"]
  ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
  ## present on /run, /var/run, /dev/shm or /dev).
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

# Collect statistics about itself
[[inputs.internal]]
  ## If true, collect telegraf memory stats.
  collect_memstats = true

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Read metrics about network interface usage
[[inputs.net]]
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status.
  ##
  # interfaces = ["eth0"]

[[inputs.nstat]]
  fieldpass = ["Tcp*Opens","TcpCurrEstab"]

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

[[inputs.system]]
  fielddrop = [ "uptime_format" ]

[[inputs.netstat]]

[[inputs.processes]]

[[inputs.ntpq]]
  ## If false, set the -n ntpq flag. Can reduce metric gather times.
  dns_lookup = false

[[inputs.procstat]]
  systemd_unit = "telegraf"
  pid_tag = true
  fieldpass = ["*rss", "*rss_hard"]

System info:

Telegraf version: 1.9.2
OS: CentOS 7

Steps to reproduce:

  1. Create a chart of the internal_write.buffer_size metric
  2. Leave batch size sufficiently high that normal traffic never triggers a flush due to batch size
  3. Overflow batch size once (e.g. send 1000 metrics while metric_batch_size is 500; see the sketch after this list)
  4. Do not overflow batch size again (the agent will flush on the set flush period)
  5. Observe the reported buffer_size
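
For step 3, a quick way to overflow the batch with the config above is to POST more than metric_batch_size points to the http_listener in a single flush interval. A minimal sketch, assuming the plugin's InfluxDB-compatible /write endpoint and line protocol (the overflow_test measurement name is made up):

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	var body bytes.Buffer
	// 1000 points in one request: double the configured metric_batch_size of 500.
	for i := 0; i < 1000; i++ {
		fmt.Fprintf(&body, "overflow_test,seq=%d value=%d\n", i, i)
	}
	resp, err := http.Post("http://localhost:8186/write", "text/plain", &body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("listener responded:", resp.Status) // expect 204 No Content on success
}
```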

Expected behavior:

internal_write.buffer_size drops to 0, as there are no metrics left in the buffer.

Actual behavior:

The internal_write.buffer_size metric is reported as batch_size forever.

Additional info:

buffer_size is set and emitted only in the AddMetric method of running_output, and only if a full batch was written out of the buffer before the flush time.

buffer_size is not set in the Write method, so when the buffer is flushed on the timer, the metric is not reset.

This means buffer_size is only ever updated when we overflow the batch, and thus never resets to 0.
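
To make the asymmetry concrete, here is a minimal, self-contained toy model of the bug (not the actual Telegraf source; output, buf, and batchSize are invented names standing in for running_output's internals). The commented-out line in Write is the essence of the fix: update the gauge on every write, not only in AddMetric.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// bufferSize stands in for the internal_write.buffer_size selfstat gauge.
var bufferSize int64

// output is a toy model of running_output: metrics accumulate in buf and
// are written out as soon as a full batch of batchSize has been collected.
type output struct {
	buf       []int
	batchSize int
}

// AddMetric mirrors the buggy code path: the gauge is updated only here,
// just as the buffer reaches a full batch, so it always reads batchSize.
func (o *output) AddMetric(m int) {
	o.buf = append(o.buf, m)
	if len(o.buf) == o.batchSize {
		atomic.StoreInt64(&bufferSize, int64(len(o.buf))) // the only update site
		o.buf = o.buf[:0]                                 // overflow write drains the batch
	}
}

// Write mirrors the timed flush: it drains the buffer but never touches
// the gauge, so the last overflow value sticks forever.
func (o *output) Write() {
	o.buf = o.buf[:0]
	// The essence of the fix: update the gauge on every write as well.
	// atomic.StoreInt64(&bufferSize, int64(len(o.buf)))
}

func main() {
	o := &output{batchSize: 500}
	for i := 0; i < 1000; i++ { // overflow the batch once (steps 2-3)
		o.AddMetric(i)
	}
	o.Write() // timed flush (step 4)
	// Step 5: the gauge still reports 500 even though the buffer is empty.
	fmt.Println("buffer_size after flush:", atomic.LoadInt64(&bufferSize))
}
```

Running this prints "buffer_size after flush: 500", matching the stuck reading described above.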

pberlowski (Contributor, Author) commented:

We'll be testing this patch in our environment:
https://gist.github.com/pberlowski/6855a647f74b4d3c647e2d1ab344e525

@danielnelson danielnelson added this to the 1.9.3 milestone Jan 16, 2019
@danielnelson danielnelson self-assigned this Jan 16, 2019
pberlowski (Contributor, Author) commented:

The patch successfully zeroes the buffer count when relevant. One additional behavior noticed here: buffer_size is always reported as a multiple of the batch size, because the metric is only updated after a full batch is added to the buffer. This is interesting but not necessarily a problem.

@danielnelson danielnelson added bug unexpected problem or unintended behavior area/agent labels Jan 18, 2019