[feature request] Recalculating flush jitter #1296
I would prefer to change the behavior of Telegraf to sleep for a random (newly assigned) amount of time before flushing each output. This is currently the behavior of collection_jitter.
My only consideration is about users in the field currently using flush_jitter for its original purpose: avoiding having multiple agents send data at exactly the same time, while flushing happens at fixed intervals. Adding the new behavior as a separate feature & config parameter would enable us to support both use cases.
@kostasb currently flush_jitter does jitter the interval by a random amount, but for each instance it is a fixed length that gets set at startup. This change would just re-draw the jitter each interval.
@sparrc That is my consideration: users who currently use jitter to introduce a fixed delay that is consistent for each flush will be getting a random delay for each flush once this behavior is changed. For example, if there are data points emitted to Telegraf without a timestamp, this will result in random timestamps inside the jitter window, whereas with the current fixed jitter the interval would be consistent for every write. This might affect some users, which is why I suggested keeping both behaviors.
Telegraf doesn't emit metrics without a timestamp, and I don't think users would be relying on a "fixed" flushing interval that changes any time the service is restarted or reloaded. It seems to me like it makes more sense to make the two options behave consistently.
@sparrc +1. Anybody who wants this will really want true random (random on each flush).
@sparrc As long as we add a note in the docs about the change in jitter behavior and the use cases affected by data points arriving at Telegraf without timestamps (e.g. the udp and tcp input plugins), that sounds good to me.
@kostasb yes, will do :). Telegraf adds timestamps when it ingests points, so any udp or tcp points get timestamps as soon as they hit the telegraf listener.
@sparrc Thanks for clarifying that; then the change shouldn't really affect any real-world cases.
The current implementation isn't really achieving a terribly random effect. Here is real-world data from a random sample of 1,000 servers running telegraf, plotting the interval from the log file on startup: [chart omitted]. Looked at another way: [chart omitted]. The hope is that a random draw per node per 10 seconds will be far better distributed, achieving the goal of an overall uniform (or close to uniform) distribution. Equally important, it will also eliminate long-term biases for some nodes. Thanks guys!
@sparrc Will it be possible for jitter to go down to ms precision, or will we stick with seconds?
Use a common function between collection_jitter and flush_jitter, which creates the same behavior between the two options. Going forward, both jitters will be random sleeps that get re-evaluated at runtime for every interval (previously only collection_jitter did this). Also fixes behavior so that both jitters will exit in the event of a process exit. Closes #1296
It will be nanoseconds.
Thanks, @sparrc. We will get this into our environment ASAP and let you know!
Scope:
Change the behavior of Telegraf's flush_jitter to make it sleep for a random (newly assigned) amount of time before flushing each output.
Original topic:
Add a flush_randomize_jitter configuration parameter for the agent that forces recalculation of the flush interval (FlushInterval.Duration) for the next buffer flush cycle.
This would enable a varying flush-interval jitter within the same Telegraf agent run. The current flush_jitter implementation calculates the flush interval only once, at daemon startup, using the flush_jitter parameter, and then keeps that calculated interval for all subsequent cycles.
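For reference, the settings involved live in the agent section of the Telegraf config. This is an illustrative fragment with assumed values, showing where flush_jitter sits relative to the intervals it randomizes:

```toml
[agent]
  ## How often inputs are gathered; collection_jitter is added on top.
  interval = "10s"
  collection_jitter = "2s"

  ## How often outputs are flushed. With the change from this issue, a new
  ## random sleep in [0, flush_jitter) is drawn before every flush, rather
  ## than a single fixed offset chosen once at startup.
  flush_interval = "10s"
  flush_jitter = "5s"
```

The proposed flush_randomize_jitter parameter was ultimately not added; instead, flush_jitter itself was changed to re-draw the jitter every interval, matching collection_jitter.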