Logparser plugin doesn't process new lines after telegraf configuration reload. #3573

Closed
Isonami opened this issue Dec 12, 2017 · 3 comments · Fixed by #6074 or #6089
Labels: area/tail, bug (unexpected problem or unintended behavior)
Milestone: 1.11.2

Comments


Isonami commented Dec 12, 2017

Bug report

A freshly started telegraf works fine, but after a configuration reload via SIGHUP, new lines in the log are no longer parsed or sent to the output plugin. Other metrics are not affected.

Relevant telegraf.conf:

/etc/telegraf/telegraf.conf

[global_tags]

[agent]
  hostname = "hostname"
  omit_hostname = false
  interval = "30s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "30s"
  flush_jitter = "0s"
  logfile = ""
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
  database = "telegraf"
  insecure_skip_verify = true
  urls = ["https://influx.hostname"]

#
# INPUTS:
#
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.disk]]
[[inputs.io]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.swap]]
[[inputs.system]]

/etc/telegraf/telegraf.d/logparser.conf

[[inputs.logparser]]
  files = ["/var/log/nginx/web-*-access.log"]
  from_beginning = false
[inputs.logparser.grok]
  measurement = "nginx_access_log"
  patterns = ["\\[[0-9a-z-]+\\] %{CLIENT:client_ip} - %{NOTSPACE} \\[%{HTTPDATE:timestamp:ts-httpd}\\] \"(?:%{WORD:verb:tag} %{NOTSPACE:request}(?: HTTP/%{NUMBER:http_version:float})?|%{DATA})\" %{NUMBER:resp_code:tag} (?:%{NUMBER:resp_bytes:int}|-) \"%{NOTSPACE:referer}\" \"[^\"]+\" \"(?:%{CLIENT:forwarded_for}|-)\" \"(?:%{CLIENT:real_ip}|-)\" %{NOTSPACE} rt=%{NUMBER:request_time:float} uht=(?:%{NUMBER:upstream_header_time:float}|-) urt=(?:%{NUMBER:upstream_response_time:float}|-)"]
  timezone = ""

System info:

Centos 7.2
Telegraf 1.4.3, 1.4.5

Steps to reproduce:

  1. Start telegraf
  2. Write new line to log file: echo '[-] 192.168.1.1 - - [12/Dec/2017:14:00:47 +0300] "GET /app HTTP/1.1" 501 178 "-" "axios/0.16.2" "-" "127.0.0.1" (.) rt=1.001 uht=1.001 urt=1.001' >> /var/log/nginx/web-api-access.log
  3. Wait 30 seconds, then query the last value from InfluxDB: select last("request_time") from "nginx_access_log" limit 1. The value must be 1.001
  4. Reload telegraf: kill -HUP $PID
  5. Write new line: echo '[-] 192.168.1.1 - - [12/Dec/2017:14:05:47 +0300] "GET /app HTTP/1.1" 502 178 "-" "axios/0.16.2" "-" "127.0.0.1" (.) rt=2.002 uht=2.002 urt=2.002' >> /var/log/nginx/web-api-access.log
  6. Wait 30 seconds and run the same query again. The new value must be 2.002, but 1.001 is still returned.

Additional info:

Telegraf log output:

Dec 12 14:46:33 web2 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 12 14:46:33 web2 systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Starting Telegraf v1.4.5
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Loaded outputs: influxdb
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Loaded inputs: inputs.system inputs.cpu inputs.disk inputs.diskio inputs.mem inputs.net inputs.swap inputs.logparser
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Tags enabled: host=web2
Dec 12 14:46:33 web2 telegraf[24150]: 2017-12-12T11:46:33Z I! Agent Config: Interval:30s, Quiet:false, Hostname:"web2", Flush Interval:30s
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Reloading Telegraf config
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Hang on, flushing any cached metrics before shutdown
Dec 12 14:51:41 web2 systemd[1]: Reloaded The plugin-driven server agent for reporting metrics into InfluxDB.
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Starting Telegraf v1.4.5
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Loaded outputs: influxdb
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Loaded inputs: inputs.net inputs.swap inputs.system inputs.cpu inputs.disk inputs.diskio inputs.mem inputs.logparser
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Tags enabled: host=web2
Dec 12 14:51:41 web2 telegraf[24150]: 2017-12-12T11:51:41Z I! Agent Config: Interval:30s, Quiet:false, Hostname:"web2", Flush Interval:30s
danielnelson added the area/tail and bug (unexpected problem or unintended behavior) labels on Dec 12, 2017

bolek2000 commented Apr 3, 2018

To work around bug #3906 I do a service telegraf reload in my preprocessing script (which itself works around bug #3613). This causes my second logparser to stop working.
I have one logparser that uses from_beginning = true and reads a preprocessed file; it still works after the SIGHUP.
But the second logparser, which reads the live /var/log/nginx/access.log in conjunction with the basicstats aggregator, stops working after SIGHUP. Here I use from_beginning = false.
Both logparsers use inotify.

danielnelson commented

Closing based on #5347 (comment)

sgtsquiggs commented

This issue is still valid :(

I have found the issue though. In hpcloud/tail (influxdata/tail) there is a problem:

In the remove function in inotify_tracker.go, it is possible for shared.watchNums[fname] to go negative (a simplified sketch of this bookkeeping follows the stack traces below).

When debugging, I found that it would decrement once because of tomb.Tomb's Dying() triggering:

github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.remove at inotify_tracker.go:102
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.RemoveWatch at inotify_tracker.go:87
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.(*InotifyFileWatcher).ChangeEvents.func1 at inotify.go:94
runtime.goexit at asm_amd64.s:1337
 - Async stack trace
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.(*InotifyFileWatcher).ChangeEvents at inotify.go:77

and it decrements again when tail.Cleanup() is called:

github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.remove at inotify_tracker.go:102
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.RemoveWatch at inotify_tracker.go:87
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail/watch.Cleanup at inotify_tracker.go:129
github.com/influxdata/telegraf/vendor/github.com/influxdata/tail.(*Tail).Cleanup at tail.go:453
github.com/influxdata/telegraf/plugins/inputs/tail.(*Tail).Stop at tail.go:219
github.com/influxdata/telegraf/agent.(*Agent).stopServiceInputs at agent.go:651
github.com/influxdata/telegraf/agent.(*Agent).Run.func1 at agent.go:75
runtime.goexit at asm_amd64.s:1337
 - Async stack trace
github.com/influxdata/telegraf/agent.(*Agent).Run at agent.go:66
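
To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the reference counting the traces point at. The names watchNums and remove mirror inotify_tracker.go, but the logic is reduced to the bookkeeping only; this is not the vendored implementation.

package main

import "fmt"

// watchNums models the shared per-file watcher count kept by the inotify
// tracker in influxdata/tail (simplified; the real tracker also manages the
// underlying inotify watches).
var watchNums = map[string]int{}

// remove decrements the watcher count for fname with no lower bound, which is
// how the count can go negative when it is reached twice for the same tailer.
func remove(fname string) {
	watchNums[fname]--
}

func main() {
	const fname = "/var/log/nginx/web-api-access.log"

	watchNums[fname] = 1 // the logparser's tailer registered before the SIGHUP

	remove(fname) // first decrement: the tomb.Tomb Dying() path in ChangeEvents
	remove(fname) // second decrement: tail.Cleanup() called from the plugin's Stop()

	fmt.Println(watchNums[fname]) // -1

	// After the reload the new tailer increments the count again, but only back
	// to 0, so (presumably) the tracker still treats the file as unwatched and
	// new lines are never delivered to the logparser.
	watchNums[fname]++
	fmt.Println(watchNums[fname]) // 0
}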

danielnelson reopened this on Jul 3, 2019
sgtsquiggs added a commit to sgtsquiggs/telegraf that referenced this issue Jul 5, 2019
* Fixes influxdata#3522 - Not able to read rotated log file without missing lines
* Fixes influxdata#3573 - logparser/tail plugin stop after reload (removes unnecessary tailer.Cleanup() calls)
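
The commit message above indicates the direction of the fix: stop calling tailer.Cleanup() from the plugin's Stop(), since the shared tracker already drops its reference on the Dying() path. A rough, hypothetical sketch of that shape (the Tail and tailer types below are stand-ins for illustration, not the actual plugin code from #6074/#6089):

package tailinput

import "sync"

// tailer is a stand-in for the vendored influxdata/tail handle; only the
// method relevant to this sketch is modeled.
type tailer interface {
	Stop() error
}

// Tail is a reduced model of the tail/logparser input plugin.
type Tail struct {
	tailers []tailer
	wg      sync.WaitGroup
}

// Stop stops every tailer but deliberately does not call Cleanup() on them:
// the shared inotify tracker already releases its reference via the tomb
// Dying() path, so a per-tailer Cleanup() here would decrement the watch
// count a second time, as seen in the stack traces above.
func (t *Tail) Stop() {
	for _, tl := range t.tailers {
		_ = tl.Stop()
	}
	t.wg.Wait()
}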
danielnelson added this to the 1.11.2 milestone on Jul 9, 2019