Return previous memstats if Go collector needs longer than 1s #568
Conversation
That's correct.
We'd have the right stats at the next scrape, though; this doesn't seem like a major issue to me. We wouldn't have updated stats without waiting until the GC completes anyway.
Couple of thoughts:
I think I lean more toward returning the last data we have, maybe collected on a background goroutine. But I haven't thought about how hard that would be.
That's right, and I think in scenarios with regular scrapes, it will work just fine. I'm more concerned about less conventional scenarios, e.g. a developer runs a test binary that isn't regularly scraped by Prometheus and manually looks at the metrics now and then, like every few hours. This might yield results that are hours old without any warning.

But as you both seem to like the idea of returning the last data, let's think a bit more about it. I wouldn't like to proactively poll memstats in a goroutine, as I want an unscraped binary to not do any significant activity only because it is instrumented. I do think we should have some sanity age check for the previous metrics, perhaps 5m, as that's the lookback delta of Prometheus; if the previous collection is longer ago, it's probably not a setup with regular scrapes. Thus, the proposed algorithm would be:
This will result in the following properties:
Let me know what you think. If you like it, I'll implement it.
That sounds reasonable to me, presuming you're okay with it.
I'll give it a spin.
Done. Please have another look, Bryan and Brian.
👍 Might need a `make format`.
What makes you think so?
(And wouldn't the CI catch that anyway?)
Some of the comments aren't lined up as I would expect.
I had looked for that, but it looks like I missed that it was checked.
Signed-off-by: beorn7 <bjoern@rabenste.in>
In this simple case, it's the fastest and easiest. Signed-off-by: beorn7 <bjoern@rabenste.in>
tl;dr: Return previous memstats if reading new ones takes longer than 1s. See the doc comment of NewGoCollector for details. Signed-off-by: beorn7 <bjoern@rabenste.in>
Force-pushed from e1ed4bf to 7cf0955.
I squashed the commits to a meaningful sequence. Will merge on green.
If memstats couldn't be collected within 1s, they will simply be
skipped.
Also, document the issue comprehensively.
Sooooo, this is the simplest possible way to address #551. Or let's say it is the second simplest way.
The simplest way would be to not do anything about it (besides describing the problem in the doc comment) and wait/hope for Go 1.13 to finally fix golang/go#19812. Until that has happened, users affected by #551 will have occasional scrape timeouts. That's at least clean, but since the Go collector is there by default, most of the affected users will probably not have read the warning in the doc comment and will be as puzzled by the timeouts as @bboreham was.
The idea in this PR is to simply skip the memstats metrics in the (rare) cases where the collection happens to coincide with a GC cycle that takes more than about a second to complete. The obvious drawback is that in those cases, the time series will have gaps. (They will even be marked as stale, if I understood staleness handling correctly. But I might be wrong. @brian-brazil will know for sure.)
Other ideas that I contemplated:

- Add `GoCollectorOpts`, similar to `ProcessCollectorOpts`, with the option to switch off memstats metrics altogether and set a timeout to consciously opt into the "metric skip" behavior above. A sane default needs to be discussed (could be the status quo). It's a nicely explicit solution, but it will (mildly) break all code using `NewGoCollector`.
- In case of a timeout, return the memstats from the previous collection cycle. Concern here is that the previous collection cycle might be very long ago. We could safeguard by only returning previous values that are at most 1m old or something. But that's still quite a contraption. Seems attractive at first, but I think that's the most problematic solution after all.
I'm really not sure what's the least evil here. @bboreham @brian-brazil I'd be grateful for your input.