Basicstats counter not working as expected #5595

Closed
toni-moreno opened this issue Mar 17, 2019 · 14 comments · Fixed by #5644
Assignees: danielnelson
Labels: area/agent, bug (unexpected problem or unintended behavior)
Milestone: 1.10.2

Comments

@toni-moreno (Contributor)

Description

I've been trying to set up an easy way to get Apache statistics as described in #3991, but while testing with tail / logparser I noticed something is not working for me. Perhaps some config is missing? If not, could this be a bug in the aggregator (perhaps some points are filtered out before they are sent to the aggregation process)?

Relevant telegraf.conf:

This is the config to put in /etc/telegraf/telegraf.d/testlog.conf to reproduce:

[[inputs.tail]]
  name_override = "apachelog2"
  files = ["/var/log/apache2/test.log"]
  from_beginning = false
  pipe = false
  data_format = "grok"
  grok_patterns = ["%{NOTSPACE:time:drop} %{NOTSPACE:request:tag} %{NUMBER:resp_time:int}"]

[[aggregators.basicstats]]
  namepass = ["apachelog2"]
  period = "60s"
  drop_original = true
  fieldpass = ["resp_time"]
  stats = ["count","max","min","mean"]
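
For reference, each matching log line should turn into a single point of roughly this shape (illustrative line protocol; the exact tag set and timestamp will differ):

apachelog2,request=/200_1sec.php resp_time=1000000i 1552845000000000000
apachelog2,request=/200_2sec.php resp_time=2000000i 1552845000000000000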

System info:

root@ipasdev01:~# telegraf --version
Telegraf 1.10.0 (git: HEAD fe33ee89)
root@ipasdev01:~# cat /etc/debian_version 
buster/sid
root@ipasdev01:~# uname -a
Linux ipasdev01 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce:

1. Create the config file above (testlog.conf), put it in the telegraf.d directory, and restart Telegraf.

2. Create this script to simulate a log file (always with the same input):

#!/bin/bash
while true
do
	D=`date +"%D-%T"`
	echo "$D /200_1sec.php 1000000">>/var/log/apache2/test.log
	echo "$D /200_1sec.php 1000000">>/var/log/apache2/test.log
	echo "$D /200_2sec.php 2000000">>/var/log/apache2/test.log
	sleep 2
done

Execute it in the background:

nohup ~/telegraf_log_test.sh &

Expected behavior:

I expect to see:

  • the request 200_1sec.php 60 times/minute
  • the request 200_2sec.php 30 times/minute

since the script appends two 200_1sec.php lines and one 200_2sec.php line every 2 seconds (90 lines per minute), as counted with grep and wc -l:

root@ipasdev01:$ grep "19:29:"  /var/log/apache2/test.log  | grep "200_1sec" | wc -l
60
root@ipasdev01:$ grep "19:29:"  /var/log/apache2/test.log  | grep "200_2sec" | wc -l
30

Actual behavior:

The basicstats aggregator is only counting:

  • 8 times/min the request "200_1sec.php"
  • 4 times/min the request "200_2sec.php"
> select sum(resp_time_count) from apachelog2 where time > now()  - 2m group by time(1m),request;
name: apachelog2
tags: request=/200_1sec.php
time                 sum
----                 ---
2019-03-17T18:28:00Z 
2019-03-17T18:29:00Z 8
2019-03-17T18:30:00Z 8

name: apachelog2
tags: request=/200_2sec.php
time                 sum
----                 ---
2019-03-17T18:28:00Z 
2019-03-17T18:29:00Z 4
2019-03-17T18:30:00Z 4
@toni-moreno (Contributor, Author)

I forgot to include the complete config and log.

Full Telegraf config

/etc/telegraf/telegraf.conf

[global_tags]
  l_tag1 = "a"
  l_tag2 = "b"
  l_tag3 = "c"

[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "10s"
  debug = true
  quiet = false
  logfile = "/var/log/telegraf/telegraf.log"
  omit_hostname = false

[[inputs.internal]]
  # collect_memstats = true
  [inputs.internal.tags]
    ifx_db = "internal_telegraf"

[[outputs.influxdb]]
  urls = ["http://influxdb.mydomain.org:8086"]
  database = "_telegraf"
  username = "xxxxxxxxx"
  password = "xxxxxxxxx"

  # Filter influx output
  tagexclude = ["ifx_db"]
  [outputs.influxdb.tagpass]
    ifx_db = ["internal_telegraf"]

/etc/telegraf/telegraf.d/input_apache.conf

[[inputs.tail]]
  name_override = "apachelog2"
  files = ["/var/log/apache2/test.log"]
  from_beginning = false 
  pipe = false
  data_format = "grok"
  grok_patterns = ["%{NOTSPACE:time:drop} %{NOTSPACE:request:tag} %{NUMBER:resp_time:int}"]
  [inputs.tail.tags]
    instance="instance1"
    ifx_db="ws_apache"
[[aggregators.basicstats]]
  namepass = ["apachelog2"]
  period = "60s"
  drop_original = true
  fieldpass = ["resp_time"]
  stats = ["count","max","min","mean"]

/etc/telegraf/telegraf.d/output_apache.conf

[[outputs.influxdb]]
  urls = ["http://influxdb.mydomain.org:8086"]
  database = "apache_metrics"
  username = "xxxxxxx"
  password = "xxxxxxx"
  # Filter metrics
  tagexclude = ["ifx_db"]
  [outputs.influxdb.tagpass]
    ifx_db = ["ws_apache"]

Output Log

2019-03-18T05:18:09Z I! Loaded inputs: internal tail
2019-03-18T05:18:09Z I! Loaded aggregators: basicstats
2019-03-18T05:18:09Z I! Loaded processors: 
2019-03-18T05:18:09Z I! Loaded outputs: influxdb influxdb
2019-03-18T05:18:09Z I! Tags enabled: host=ipasdev01 l_tag1=a l_tag2=b l_tag3=c
2019-03-18T05:18:09Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ipasdev01", Flush Interval:10s
2019-03-18T05:18:09Z D! [agent] Connecting outputs
2019-03-18T05:18:09Z D! [agent] Attempting connection to output: influxdb
2019-03-18T05:18:09Z D! [agent] Successfully connected to output: influxdb
2019-03-18T05:18:09Z D! [agent] Attempting connection to output: influxdb
2019-03-18T05:18:09Z D! [agent] Successfully connected to output: influxdb
2019-03-18T05:18:09Z D! [agent] Starting service inputs
2019-03-18T05:18:09Z D! [inputs.tail] tail added for file: /var/log/apache2/test.log
2019-03-18T05:18:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:18:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:18:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:18:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:18:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:18:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:18:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:18:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:00Z D! [outputs.influxdb] wrote batch of 6 metrics in 17.384325ms
2019-03-18T05:19:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:19:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 11.926215ms
2019-03-18T05:20:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 17.559029ms
2019-03-18T05:20:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:20:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 24.70911ms
2019-03-18T05:21:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 28.525874ms
2019-03-18T05:21:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:21:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 15.273403ms
2019-03-18T05:22:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 15.221005ms
2019-03-18T05:22:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:22:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 11.496482ms
2019-03-18T05:23:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 17.704723ms
2019-03-18T05:23:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:23:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 12.81991ms
2019-03-18T05:24:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 52.024487ms
2019-03-18T05:24:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:24:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 14.451988ms
2019-03-18T05:25:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 14.894181ms
2019-03-18T05:25:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:25:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 15.351588ms
2019-03-18T05:26:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 20.242445ms
2019-03-18T05:26:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:26:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 12.704176ms
2019-03-18T05:27:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 17.389107ms
2019-03-18T05:27:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:27:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 10.827497ms
2019-03-18T05:28:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 15.290605ms
2019-03-18T05:28:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:30Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:40Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:28:50Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:29:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:29:00Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:29:10Z D! [outputs.influxdb] wrote batch of 6 metrics in 14.191643ms
2019-03-18T05:29:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:29:10Z D! [outputs.influxdb] wrote batch of 2 metrics in 18.257915ms
2019-03-18T05:29:10Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:29:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 
2019-03-18T05:29:20Z D! [outputs.influxdb] buffer fullness: 0 / 10000 metrics. 

@toni-moreno (Contributor, Author)

More interesting data...

It seems like the count changes over time.

In the morning:

> select sum(resp_time_count) from apachelog2 where time > now()  - 2m group by time(1m),request;
name: apachelog2
tags: request=/200_1sec.php
time                 sum
----                 ---
2019-03-18T05:29:00Z 
2019-03-18T05:30:00Z 10
2019-03-18T05:31:00Z 10

name: apachelog2
tags: request=/200_2sec.php
time                 sum
----                 ---
2019-03-18T05:29:00Z 
2019-03-18T05:30:00Z 5
2019-03-18T05:31:00Z 5

Some hours later:

> select sum(resp_time_count) from apachelog2 where time > now()  - 2m group by time(1m),request;
name: apachelog2
tags: request=/200_1sec.php
time                 sum
----                 ---
2019-03-18T09:35:00Z 
2019-03-18T09:36:00Z 36
2019-03-18T09:37:00Z 36

name: apachelog2
tags: request=/200_2sec.php
time                 sum
----                 ---
2019-03-18T09:35:00Z 
2019-03-18T09:36:00Z 18
2019-03-18T09:37:00Z 18


@danielnelson (Contributor)

@toni-moreno I'm not able to replicate. Can you try removing the tag routing to see if it helps? This is what I see:

select sum(resp_time_count) from apachelog2 where time > now()  - 4m group by time(1m),request;
name: apachelog2
tags: request=/200_1sec.php
time                 sum
----                 ---
2019-03-19T00:00:00Z 
2019-03-19T00:01:00Z 52
2019-03-19T00:02:00Z 60
2019-03-19T00:03:00Z 60
2019-03-19T00:04:00Z 60

name: apachelog2
tags: request=/200_2sec.php
time                 sum
----                 ---
2019-03-19T00:00:00Z 
2019-03-19T00:01:00Z 26
2019-03-19T00:02:00Z 30
2019-03-19T00:03:00Z 30
2019-03-19T00:04:00Z 30

Output with a file output:

apachelog2,host=loaner,path=test.log,request=/200_2sec.php resp_time_count=30,resp_time_max=2000000,resp_time_mean=2000000,resp_time_min=2000000 1552954029000000000
apachelog2,host=loaner,path=test.log,request=/200_1sec.php resp_time_count=60,resp_time_max=1000000,resp_time_mean=1000000,resp_time_min=1000000 1552954029000000000
apachelog2,host=loaner,path=test.log,request=/200_2sec.php resp_time_count=30,resp_time_max=2000000,resp_time_mean=2000000,resp_time_min=2000000 1552954089000000000
apachelog2,host=loaner,path=test.log,request=/200_1sec.php resp_time_count=60,resp_time_max=1000000,resp_time_mean=1000000,resp_time_min=1000000 1552954089000000000
apachelog2,host=loaner,path=test.log,request=/200_2sec.php resp_time_count=30,resp_time_max=2000000,resp_time_mean=2000000,resp_time_min=2000000 1552954149000000000
apachelog2,host=loaner,path=test.log,request=/200_1sec.php resp_time_count=60,resp_time_max=1000000,resp_time_mean=1000000,resp_time_min=1000000 1552954149000000000
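
(For anyone reproducing this, a minimal file output of roughly this form can be used to dump the points; this config is illustrative, not necessarily the exact one used above:)

[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"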

@sbengo commented Mar 19, 2019

Hi @danielnelson, I'm working with @toni-moreno and after some tests I have seen the same wrong results he described in this issue.

Version: Telegraf 1.10.0 (git: HEAD fe33ee8)
Config: Same
Data: Same script
Routing tag: disabled/enabled, no difference seen

Test:

  • I have done the tests on 2 different hosts, with invalid results on both.
  • After some restarts the values changed, but they were still invalid.

A: Debian VM on VirtualBox:

  • Linux snmpcoldev01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
  • debian_version: 8.4

Result:

> select sum(resp_time_count) from apachelog2 where time > now() - 4m group by time(1m), request;
name: apachelog2
tags: request=/200_1sec.php
time                sum
----                ---
1552992300000000000 
1552992360000000000 24
1552992420000000000 24
1552992480000000000 24
1552992540000000000 24

name: apachelog2
tags: request=/200_2sec.php
time                sum
----                ---
1552992300000000000 
1552992360000000000 12
1552992420000000000 12
1552992480000000000 12
1552992540000000000 12

B: Red Hat 7.5 VM on VMware vSphere:

  • Red Hat Enterprise Linux Server release 7.5 (Maipo)

Result:

> select sum(resp_time_count) from apachelog2 where time > now() - 4m group by time(1m), request;
name: apachelog2
tags: request=/200_1sec.php
time                sum
----                ---
1552992060000000000
1552992120000000000 26
1552992180000000000 26
1552992240000000000 26
1552992300000000000 26

name: apachelog2
tags: request=/200_2sec.php
time                sum
----                ---
1552992060000000000
1552992120000000000 13
1552992180000000000 13
1552992240000000000 13
1552992300000000000 13

@danielnelson (Contributor)

Can you try using the watch_method = "poll" option on the tail input and see if it has any effect?
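
For example (placement is illustrative; the rest of the tail config stays unchanged):

[[inputs.tail]]
  ## poll the file for changes instead of using inotify
  watch_method = "poll"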

If nothing changes, can you place the 3 log lines into a file and substitute the tail input with the file input:

[[inputs.file]]
  interval = "2s"
  name_override = "apachelog2"
  files = ["test.log"]
  data_format = "grok"
  grok_patterns = ["%{NOTSPACE:time:drop} %{NOTSPACE:request:tag} %{NUMBER:resp_time:int}"]

test.log:

03/18/19-17:00:11 /200_1sec.php 1000000
03/18/19-17:00:11 /200_1sec.php 1000000
03/18/19-17:00:11 /200_2sec.php 2000000

@sbengo commented Mar 20, 2019

Hi @danielnelson, thanks for the answer!

First of all, in each test we checked that, with drop_original set to false, the log parser retrieves all the lines and they are written to the DB correctly.
The problem seems to be related to the aggregator.

Following your recommendations:

Setting watch_method = "poll"

Same wrong results as described in the issue.

Test with the inputs.file plugin

Result:

The results still seem wrong; we expect counts of 60 (200_1sec.php) and 30 (200_2sec.php).

> select sum(resp_time_count) from apachelog2 where time > now() - 4m group by time(1m),request
name: apachelog2
tags: request=/200_1sec.php
time                sum
----                ---
1553079840000000000
1553079900000000000 20
1553079960000000000 20
1553080020000000000 20
1553080080000000000 20

name: apachelog2
tags: request=/200_2sec.php
time                sum
----                ---
1553079840000000000
1553079900000000000 10
1553079960000000000 10
1553080020000000000 10
1553080080000000000 10

More info:

With drop_original = false and the following query we can see that the points are being read from the file correctly.
Note: in order to keep all the points, we changed precision = "10s" to precision = "s".
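
(For reference, that setting lives in the [agent] section; the sketch below uses "1s" as the duration string, which is one accepted form:)

[agent]
  ## finer timestamp rounding so each 2-second test point keeps its own timestamp
  precision = "1s"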

> select count(resp_time), sum(resp_time_count) from apachelog2 where time > now() - 3m group by time(1m),request
name: apachelog2
tags: request=/200_1sec.php
time                count sum
----                ----- ---
1553079900000000000 6
1553079960000000000 30    20
1553080020000000000 30    20
1553080080000000000 20    20

name: apachelog2
tags: request=/200_2sec.php
time                count sum
----                ----- ---
1553079900000000000 6
1553079960000000000 30    10
1553080020000000000 30    10
1553080080000000000 20    10

Raw:

time                request       resp_time resp_time_count resp_time_max resp_time_mean resp_time_min
----                -------       --------- --------------- ------------- -------------- -------------
1553080082000000000 /200_1sec.php 1000000
1553080082000000000 /200_2sec.php 2000000
1553080084000000000 /200_1sec.php 1000000
1553080084000000000 /200_2sec.php 2000000
1553080086000000000 /200_1sec.php 1000000
1553080086000000000 /200_2sec.php 2000000
1553080088000000000 /200_1sec.php 1000000
1553080088000000000 /200_2sec.php 2000000
1553080090000000000 /200_1sec.php 1000000
1553080090000000000 /200_2sec.php 2000000
1553080092000000000 /200_1sec.php 1000000
1553080092000000000 /200_2sec.php 2000000
1553080094000000000 /200_1sec.php 1000000
1553080094000000000 /200_2sec.php 2000000
1553080096000000000 /200_1sec.php 1000000
1553080096000000000 /200_2sec.php 2000000
1553080098000000000 /200_1sec.php 1000000
1553080098000000000 /200_2sec.php 2000000
1553080100000000000 /200_1sec.php 1000000
1553080100000000000 /200_2sec.php 2000000
1553080102000000000 /200_1sec.php 1000000
1553080102000000000 /200_2sec.php 2000000
1553080104000000000 /200_1sec.php 1000000
1553080104000000000 /200_2sec.php 2000000
1553080106000000000 /200_1sec.php 1000000
1553080106000000000 /200_2sec.php 2000000
1553080108000000000 /200_1sec.php 1000000
1553080108000000000 /200_2sec.php 2000000
1553080110000000000 /200_1sec.php 1000000
1553080110000000000 /200_2sec.php 2000000
1553080112000000000 /200_1sec.php 1000000
1553080112000000000 /200_2sec.php 2000000
1553080114000000000 /200_1sec.php 1000000
1553080114000000000 /200_2sec.php 2000000
1553080116000000000 /200_1sec.php 1000000
1553080116000000000 /200_2sec.php 2000000
1553080118000000000 /200_1sec.php 1000000
1553080118000000000 /200_2sec.php 2000000
1553080120000000000 /200_1sec.php 1000000
1553080120000000000 /200_2sec.php 2000000
1553080122000000000 /200_1sec.php 1000000
1553080122000000000 /200_2sec.php 2000000
1553080124000000000 /200_1sec.php 1000000
1553080124000000000 /200_2sec.php 2000000
1553080126000000000 /200_1sec.php 1000000
1553080126000000000 /200_2sec.php 2000000
1553080128000000000 /200_1sec.php 1000000
1553080128000000000 /200_2sec.php 2000000
1553080130000000000 /200_1sec.php 1000000
1553080130000000000 /200_2sec.php 2000000
1553080132000000000 /200_1sec.php 1000000
1553080132000000000 /200_2sec.php 2000000
1553080134000000000 /200_1sec.php 1000000
1553080134000000000 /200_2sec.php 2000000
1553080136000000000 /200_1sec.php 1000000
1553080136000000000 /200_2sec.php 2000000
1553080138000000000 /200_1sec.php 1000000
1553080138000000000 /200_2sec.php 2000000
1553080140000000000 /200_1sec.php 1000000   20              1000000       1000000        1000000
1553080140000000000 /200_2sec.php 2000000   10              2000000       2000000        2000000

@danielnelson (Contributor)

If you enable the internal plugin, what is the metrics_dropped count on internal_aggregate?

internal_aggregate,aggregator=basicstats metrics_dropped=0i

@sbengo commented Mar 21, 2019

Hi @danielnelson,

We enabled the internal plugin and, as you can see in the following results, it seems that:

<metrics_dropped> = <expected_value> - <sum(resp_time_count)>

Test with internal

  • We did a restart to see if metrics_dropped would change:
      Expected: 90
      T1: 39 = 90 - (34 + 17)
      T2: 78 = 90 - (8 + 4)
  • We did the test with the file input, using the same config as in the comment above.

Result

select non_negative_derivative(last(metrics_dropped),1m) from "internal_aggregate" where time  > now() - 20m group by time(1m);
select sum(resp_time_count) from "apachelog2" where time > now() - 20m group by time(1m),request

name: internal_aggregate
time                non_negative_derivative
----                -----------------------
1553154840000000000 39
1553154900000000000 39
1553154960000000000 39
1553155020000000000 39
1553155080000000000 39
1553155140000000000 39
1553155200000000000 39
1553155260000000000 39
1553155320000000000 39
<telegraf_restart>
1553155500000000000 78
1553155560000000000 78
1553155620000000000 78
1553155680000000000 78
1553155740000000000 78
1553155800000000000 78
1553155860000000000 78

name: apachelog2
tags: request=/200_1sec.php
time                sum
----                ---
1553154840000000000 34
1553154900000000000 34
1553154960000000000 34
1553155020000000000 34
1553155080000000000 34
1553155140000000000 34
1553155200000000000 34
1553155260000000000 34
1553155320000000000 34
<telegraf_restart>
1553155500000000000 8
1553155560000000000 8
1553155620000000000 8
1553155680000000000 8
1553155740000000000 8
1553155800000000000 8
1553155860000000000 8

name: apachelog2
tags: request=/200_2sec.php
time                sum
----                ---
1553154840000000000 17
1553154900000000000 17
1553154960000000000 17
1553155020000000000 17
1553155080000000000 17
1553155140000000000 17
1553155200000000000 17
1553155260000000000 17
1553155320000000000 17
<telegraf_restart>
1553155500000000000 4
1553155560000000000 4
1553155620000000000 4
1553155680000000000 4
1553155740000000000 4
1553155800000000000 4
1553155860000000000 4

@toni-moreno (Contributor, Author)

Hi @danielnelson, I'm still working with the tail plugin.

(screenshot showing 21 dropped metrics)

This is my count while having 21 dropped metrics:

> select sum(resp_time_count) from apachelog2 where time > now()  - 5m group by time(1m),request;
name: apachelog2
tags: request=/200_1sec.php
time                 sum
----                 ---
2019-03-21T10:19:00Z
2019-03-21T10:20:00Z 46
2019-03-21T10:21:00Z 46
2019-03-21T10:22:00Z 46
2019-03-21T10:23:00Z 46
2019-03-21T10:24:00Z 48

name: apachelog2
tags: request=/200_2sec.php
time                 sum
----                 ---
2019-03-21T10:19:00Z
2019-03-21T10:20:00Z 23
2019-03-21T10:21:00Z 23
2019-03-21T10:22:00Z 23
2019-03-21T10:23:00Z 23
2019-03-21T10:24:00Z 24

As in the previous example from @sbengo with the file input plugin, the dropped metrics added to the counted ones give the correct total number of lines in the file per minute:

23 + 46 + 21 = 90 metrics/min

Could you explain to us why the aggregator is dropping metrics, and how we can fix it?

@danielnelson (Contributor)

There are 2 possibilities:

  • The selected metric does not have any fields because fieldpass has removed all of them. I think this is actually a bug and such metrics should be counted as metrics_filtered, but right now they are counted under metrics_dropped.
  • The metric does not have a timestamp within the current aggregation window (running_aggregator.go#L125-L128).

I'm not sure which case we are running into, though the code setting the aggregation window has been problematic, and it could possibly have something to do with the agent precision rounding. I can fix the bug where we report metrics as dropped when they should be filtered, and that will help narrow it down.
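
For readers following along, here is a simplified sketch of those two drop conditions (illustrative only: the type and field names are made up, and this is not the real Telegraf code in running_aggregator.go):

package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for illustration; these are NOT the real Telegraf types.
type metric struct {
	fields map[string]interface{}
	ts     time.Time
}

type runningAggregator struct {
	periodStart, periodEnd time.Time // current aggregation window
	metricsDropped         int64
}

// add returns true if the metric would be handed to the aggregator plugin,
// false if it would be dropped for one of the two reasons described above.
func (r *runningAggregator) add(m metric) bool {
	// Case 1: fieldpass stripped every field, so there is nothing to aggregate.
	if len(m.fields) == 0 {
		r.metricsDropped++ // arguably should be counted as "filtered" instead
		return false
	}
	// Case 2: the metric's timestamp falls outside [periodStart, periodEnd).
	if m.ts.Before(r.periodStart) || !m.ts.Before(r.periodEnd) {
		r.metricsDropped++
		return false
	}
	// ...otherwise the metric is passed on to the aggregator (omitted here)...
	return true
}

func main() {
	start := time.Now().Truncate(time.Minute)
	r := runningAggregator{periodStart: start, periodEnd: start.Add(time.Minute)}

	inWindow := metric{fields: map[string]interface{}{"resp_time": int64(1000000)}, ts: start.Add(10 * time.Second)}
	late := metric{fields: map[string]interface{}{"resp_time": int64(1000000)}, ts: start.Add(2 * time.Minute)}

	fmt.Println(r.add(inWindow)) // true: accepted into the window
	fmt.Println(r.add(late))     // false: outside the window, counted as dropped
}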

@danielnelson (Contributor)

I see what the issue is, working on a fix now.

@toni-moreno (Contributor, Author)

Hi @danielnelson, any news about the fix? Do you need any other tests from us?

When it's ready, we would like to test and validate the fix if you want.

Thank you again!

@ushuz (Contributor) commented Mar 26, 2019

I'm also testing aggregators and found that drop_original doesn't work well with max_undelivered_messages. It feels like drop_original = true doesn't reset some internal counters that track how many messages were processed. Maybe these problems share the same root cause?

ushuz mentioned this issue Mar 26, 2019
danielnelson self-assigned this Mar 26, 2019
danielnelson added the area/agent and bug (unexpected problem or unintended behavior) labels Mar 26, 2019
danielnelson added this to the 1.10.2 milestone Mar 26, 2019
@danielnelson (Contributor)

@toni-moreno I will definitely take you up on that offer to test, but I need a bit more time. Should have the fix ready on Wednesday.
