Conversation
Good catch on the bug fix; I'm not sure I agree with the second change, though.
pkg/agent/config.go
Outdated
```diff
-	if c.ScrapeTimeout > c.ScrapeInterval {
-		return fmt.Errorf("scrape timeout must be larger or equal to inverval for: %v", c.JobName)
-	}
 	if c.ScrapeTimeout == 0 {
-		c.ScrapeTimeout = c.ScrapeInterval
+		c.ScrapeTimeout = c.ScrapeInterval + model.Duration(3*time.Second)
 	}
+	if c.ScrapeTimeout <= c.ScrapeInterval {
+		return fmt.Errorf("scrape timeout must be larger than interval for: %v", c.JobName)
+	}
```
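For context, a runnable sketch of the validation behaviour after this change. This is a minimal sketch, not the actual patch: `scrapeConfig` is a trimmed-down stand-in for the real config type, and plain `time.Duration` replaces `model.Duration` to keep it self-contained:

```go
package main

import (
	"fmt"
	"time"
)

// scrapeConfig is a hypothetical stand-in for the real agent config type,
// reduced to the fields the validation touches.
type scrapeConfig struct {
	JobName        string
	ScrapeInterval time.Duration
	ScrapeTimeout  time.Duration
}

// validate mirrors the logic of the diff above: default the timeout to
// interval + 3s, then reject any timeout that does not exceed the interval.
func (c *scrapeConfig) validate() error {
	if c.ScrapeTimeout == 0 {
		c.ScrapeTimeout = c.ScrapeInterval + 3*time.Second
	}
	if c.ScrapeTimeout <= c.ScrapeInterval {
		return fmt.Errorf("scrape timeout must be larger than interval for: %v", c.JobName)
	}
	return nil
}

func main() {
	c := &scrapeConfig{JobName: "app", ScrapeInterval: 15 * time.Second}
	fmt.Println(c.validate(), c.ScrapeTimeout) // <nil> 18s: the default is interval + 3s
}
```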
I think we should only apply those defaults and checks if delta=true, because otherwise computing the profile will be instantaneous and we should allow shorter timeouts.
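A rough sketch of that suggestion, extending the stand-in `scrapeConfig` from the sketch above. The `hasDeltaProfile` flag is hypothetical; the real code would derive it from `ProfilingConfig.PprofConfig`:

```go
// validateTimeout applies the delta-specific default and check only when
// the job actually scrapes delta profiles; non-delta profiles are computed
// instantaneously, so shorter timeouts remain valid for them.
func (c *scrapeConfig) validateTimeout(hasDeltaProfile bool) error {
	if !hasDeltaProfile {
		return nil
	}
	if c.ScrapeTimeout == 0 {
		c.ScrapeTimeout = c.ScrapeInterval + 3*time.Second
	}
	if c.ScrapeTimeout <= c.ScrapeInterval {
		return fmt.Errorf("scrape timeout must be larger than interval for: %v", c.JobName)
	}
	return nil
}
```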
pkg/agent/profiles.go
Outdated
```diff
@@ -228,7 +228,7 @@ func (tg *TargetGroup) targetsFromGroup(group *targetgroup.Group) ([]*Target, []
 	}

 	if pcfg, found := tg.config.ProfilingConfig.PprofConfig[profType]; found && pcfg.Delta {
-		params.Add("seconds", strconv.Itoa(int(time.Duration(tg.config.ScrapeTimeout)/time.Second)-1))
+		params.Add("seconds", strconv.Itoa(int(time.Duration(tg.config.ScrapeInterval)/time.Second)))
```
I am not sure why this was removed. I always thought it would make sure we have a consistent scrape time, even when network latency is added (up to 1 second).
Let's take this example:

```
scrape_interval: 15s
scrape_timeout: 18s
profiling_time: 15s (was 14s before)
```

After this change, with a network latency of 500ms we would miss out on scrapes, because the real scrape interval becomes 15s plus the network latency. This creates problems when we look at things over time, e.g. in this histogram screenshot (similar to the Prometheus rate([twice_scrape_interval]) problem).
Here you can see a point missing at 11:12:47 because of that:
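To make the arithmetic concrete, a small standalone sketch of the scenario above. The numbers come from the example; the 1s of headroom reflects the removed `-1`, and the 14s figure assumes the old default of timeout == interval:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 15 * time.Second
	latency := 500 * time.Millisecond

	// Before: seconds = timeout - 1, and the timeout used to default to
	// the interval, so the profile ran for 14s.
	oldProfile := interval - 1*time.Second
	// After this change: seconds = interval, so the profile runs for 15s.
	newProfile := interval

	fmt.Println("old scrape fits the interval:", oldProfile+latency <= interval) // true: 14.5s <= 15s
	fmt.Println("new scrape fits the interval:", newProfile+latency <= interval) // false: 15.5s > 15s
}
```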
Co-authored-by: Christian Simon <simon@swine.de>
Fixes the scrape timeout validation.
Fixes #463