Return previous memstats if Go collector needs longer than 1s #568
Conversation
That's correct.
We'd have the right stats at the next scrape, though; this doesn't seem like a major issue to me. We wouldn't have updated stats without waiting until the GC completes anyway.
Couple of thoughts:
I think I lean more toward returning the last data we have, maybe collected on a background goroutine. But I haven't thought about how hard that would be.
That's right, and I think in scenarios with regular scrapes, it will work just fine. I'm more concerned about less conventional scenarios, e.g. a developer runs a test binary that isn't regularly scraped by Prometheus and manually looks at the metrics now and then, like every few hours. This might yield results that are hours old without any warning.

But as you both seem to like the idea of returning the last data, let's think a bit more about it. I wouldn't like to proactively poll memstats in a goroutine, as I want an unscraped binary to not do any significant activity only because it is instrumented. I do think we should have some sanity age check for the previous metrics, perhaps 5m, as that's the lookback delta of Prometheus; if the previous collection is longer ago, it's probably not a setup with regular scrapes. Thus, the proposed algorithm would be:
This will result in the following properties:
Let me know what you think. If you like it, I'll implement it.
That sounds reasonable to me, presuming you're okay with it.
I'll give it a spin.
Done. Please have another look, Bryan and Brian.
👍 Might need a `make format`.
What makes you think so?
(And wouldn't the CI catch that anyway?)
Some of the comments aren't lined up as I would expect.
I had looked for that, but it looks like I missed that it was checked.
Signed-off-by: beorn7 <bjoern@rabenste.in>
In this simple case, it's the fastest and easiest. Signed-off-by: beorn7 <bjoern@rabenste.in>
tl;dr: Return previous memstats if reading new ones takes longer than 1s. See the doc comment of NewGoCollector for details. Signed-off-by: beorn7 <bjoern@rabenste.in>
Force-pushed from e1ed4bf to 7cf0955.
I squashed the commits to a meaningful sequence. Will merge on green.
If memstats couldn't be collected within 1s, they will simply be
skipped.
Also, document the issue comprehensively.
Sooooo, this is the simplest possible way to address #551. Or let's say it is the second simplest way.
The simplest way would be to not do anything about it (besides describing the problem in the doc comment) and wait/hope for Go 1.13 to finally fix golang/go#19812. Until that has happened, users affected by #551 will have occasional scrape timeouts. That's at least clean, but since the Go collector is there by default, most of the affected users will probably not have read the warning in the doc comment and will be as puzzled by the timeouts as @bboreham was.
The idea in this PR is to simply skip the memstats metrics in the (rare) cases where the collection happens to coincide with a GC cycle that takes more than about a second to complete. The obvious drawback is that in those cases, the time series will have gaps. (They will even be marked as stale, if I understood staleness handling correctly. But I might be wrong. @brian-brazil will know for sure.)
Other ideas that I contemplated:

- Add `GoCollectorOpts`, similar to `ProcessCollectorOpts`, with the option to switch off memstats metrics altogether and set a timeout to consciously opt into the "metric skip" behavior above. A sane default needs to be discussed (could be the status quo). It's a nicely explicit solution, but it will (mildly) break all code using `NewGoCollector`.
- In case of a timeout, return the memstats from the previous collection cycle. Concern here is that the previous collection cycle might be very long ago. We could safeguard by only returning previous values that are at most 1m old or something. But that's still quite a contraption. Seems attractive at first, but I think that's the most problematic solution after all.
I'm really not sure what's the least evil here. @bboreham @brian-brazil I'd be grateful for your input.