Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.11.1] Ordering of config exposed potential race in addOutput() #1090

Closed
wrigtim opened this issue Apr 25, 2016 · 3 comments
Closed

[0.11.1] Ordering of config exposed potential race in addOutput() #1090

wrigtim opened this issue Apr 25, 2016 · 3 comments
Labels
bug unexpected problem or unintended behavior

Comments

@wrigtim
Copy link

wrigtim commented Apr 25, 2016

Currently running 0.11.1 with a few internal patches. I noticed our max_buffer_limit config was not being respected on some systems, and with some more testing noticed there seems to be a race on startup where the config isn't honored every time?

I've since re-ordered our standard config to define [agent] before [outputs], which seems to reduce how easy this is to reproduce (no scientific analysis here as yet!).

[root@host1 ~]# grep metric_buffer /etc/telegraf/telegraf.conf 
  metric_buffer_limit = 32768

Starting the agent twice in quick succession - first starts with metric_buffer_limit of 32768, and the second starts with zero...

[root@host1 ~]# telegraf -config /etc/telegraf/telegraf.conf 
2016/04/25 07:16:41 Creating runningoutput with metric_buffer_limit 32768
2016/04/25 07:16:41 Creating runningoutput with metric_buffer_limit 32768
2016/04/25 07:16:41 Starting Telegraf (version 0.11.3-4)
2016/04/25 07:16:41 Loaded outputs: influxdb influxdb
2016/04/25 07:16:41 Loaded inputs: cpu disk diskio mem swap system netif
2016/04/25 07:16:41 Agent Config: Interval:9s, Debug:false, Quiet:false, Hostname:"host1", Flush Interval:17.308906277s 
2016/04/25 07:16:41 Gathered metrics, (separate 5m0s interval), from disk in 442.32µs
2016/04/25 07:16:42 Hang on, flushing any cached metrics before shutdown
2016/04/25 07:16:42 Wrote 9 metrics to output influxdb in 78.334µs
2016/04/25 07:16:42 Wrote 9 metrics to output influxdb in 89.99µs
^C

[root@host1 ~]# telegraf -config /etc/telegraf/telegraf.conf 
2016/04/25 07:16:43 Creating runningoutput with metric_buffer_limit 0
2016/04/25 07:16:43 Creating runningoutput with metric_buffer_limit 0
2016/04/25 07:16:43 Starting Telegraf (version 0.11.3-4)
2016/04/25 07:16:43 Loaded outputs: influxdb influxdb
2016/04/25 07:16:43 Loaded inputs: disk diskio mem swap system netif cpu
2016/04/25 07:16:43 Agent Config: Interval:9s, Debug:false, Quiet:false, Hostname:"host1", Flush Interval:16.260703541s 
2016/04/25 07:16:43 Gathered metrics, (separate 5m0s interval), from disk in 426.628µs
2016/04/25 07:16:44 Hang on, flushing any cached metrics before shutdown
2016/04/25 07:16:44 Wrote 9 metrics to output influxdb in 82.83µs
2016/04/25 07:16:44 Wrote 9 metrics to output influxdb in 177.271µs
2016/04/25 07:16:45 Gathered metrics, (9s interval), from 6 inputs in 1.971490474s

PS: the 'created runningoutput' line was added with:

--- a/telegraf/src/github.com/influxdata/telegraf/internal/config/config.go
+++ b/telegraf/src/github.com/influxdata/telegraf/internal/config/config.go
@@ -422,6 +422,7 @@ func (c *Config) addOutput(name string, table *ast.Table) error {
        }

        ro := internal_models.NewRunningOutput(name, output, outputConfig)
+       log.Printf("Creating runningoutput with metric_buffer_limit %d\n", c.Agent.MetricBufferLimit)
        if c.Agent.MetricBufferLimit > 0 {
                ro.MetricBufferLimit = c.Agent.MetricBufferLimit
        }
@sparrc
Copy link
Contributor

sparrc commented Apr 25, 2016

thanks for the report, I'll dig into this

@sparrc sparrc added the bug unexpected problem or unintended behavior label Apr 25, 2016
sparrc added a commit that referenced this issue Apr 29, 2016
@sparrc
Copy link
Contributor

sparrc commented Apr 29, 2016

Found the problem, I was relying on the map order to parse various portions of the config, which appears to usually be returned in-order, but sometimes (about 15% of the time) gets returned out-of-order.

When out-of-order, the output configs would get loaded before the agent config, resulting in outputs that didn't have the agent config applied!

thanks for the report, I'll have the fix merged soon (pr #1130)

sparrc added a commit that referenced this issue Apr 29, 2016
sparrc added a commit that referenced this issue Apr 29, 2016
@sparrc
Copy link
Contributor

sparrc commented Apr 29, 2016

BTW @wrigtim @sebito91 there is also a change coming to the metric_buffer_limit config option that simplifies configuration and buffering. See the 2nd release note here: https://github.com/influxdata/telegraf/blob/master/CHANGELOG.md

@sparrc sparrc closed this as completed in 4e9798d Apr 30, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants