Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: cannot ReadMemStats during GC #19812

Closed
aclements opened this issue Mar 31, 2017 · 16 comments · Fixed by prometheus/client_golang#568
Closed

runtime: cannot ReadMemStats during GC #19812

aclements opened this issue Mar 31, 2017 · 16 comments · Fixed by prometheus/client_golang#568
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@aclements
Copy link
Member

GC holds the worldsema across the entire garbage collection, which means nothing else can stop the world, even during concurrent marking. Anything that does try to stop the world will block until GC is done. Probably the most interesting such operation is ReadMemStats.

We should fix GC to only hold worldsema during sweep termination and mark termination. The trickiest part of this is probably dealing with GOMAXPROCS changing during GC, which will affect mark worker scheduling.

/cc @RLH

@aclements aclements added this to the Go1.9Maybe milestone Mar 31, 2017
@aclements aclements modified the milestones: Go1.10, Go1.9Maybe Jun 21, 2017
@aclements aclements added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Jun 21, 2017
@aclements aclements modified the milestones: Go1.10, Go1.11 Nov 8, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018
@ianlancetaylor ianlancetaylor added this to the Go1.12 milestone May 31, 2018
@ianlancetaylor ianlancetaylor added the NeedsFix The path to resolution is known, but the work has not been done. label May 31, 2018
@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@andreimatei
Copy link
Contributor

This issue is kind of a pain for CockroachDB because we use ReadMemStats() to get periodic heap stats. Because the ReadMemStats() blocks for GC to finish (and we've seen it take many seconds), we've had to collect these heap stats separately from the loop that sample other stats in our server (which we care about collecting periodically, and in a timely manner), and so now our heap stats are out of sync with those other stats.

FWIW, we'd rather have more approximate heap stats that we can collect cheaply than more accurate heap stats that can only be collected at specific points.

@beorn7
Copy link

beorn7 commented May 16, 2019

Note that there is a work around in client_golang now, see prometheus/client_golang#568 (check the final state of the PR, not the whole discussion). It will be released any moment as v0.9.3.

@beorn7
Copy link

beorn7 commented May 16, 2019

Meaning: If you use prometheus/client_golang, you are good now. If you are doing it on your own, it might serve as inspiration.

@aclements
Copy link
Member Author

Thanks for the reports of this happening in real systems. I knew this was a theoretical problem, but this is the first data I've had on it happening in production.

Making the stats approximate doesn't really help in this case. We're just doing some heavy-handed synchronization that we need to figure out how to break up.

@zigo101
Copy link

zigo101 commented May 17, 2019

Is it acceptable to cache and return the stats before GC starts?

@mknyszek
Copy link
Contributor

I spoke with @aclements and there may be a simple way to do this. Most of the trickiness arises from dealing with GOMAXPROCS changing during the mark phase, but we can also ignore that case by just adding a new semaphore. Turns out we also have to retain the old behavior for trace start for other reasons.

I have a quick and dirty change which seems to work OK in all.bash locally. I'll upload it in a bit.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/182657 mentions this issue: runtime: don't hold worldsema across mark phase

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
t4n6a1ka pushed a commit to t4n6a1ka/go that referenced this issue Sep 5, 2019
This change makes it so that worldsema isn't held across the mark phase.
This means that various operations like ReadMemStats may now stop the
world during the mark phase, reducing latency on such operations.

Only three such operations are still no longer allowed to occur during
marking: GOMAXPROCS, StartTrace, and StopTrace.

For the former it's because any change to GOMAXPROCS impacts GC mark
background worker scheduling and the details there are tricky.

For the latter two it's because tracing needs to observe consistent GC
start and GC end events, and if StartTrace or StopTrace may stop the
world during marking, then it's possible for it to see a GC end event
without a start or GC start event without an end, respectively.

To ensure that GOMAXPROCS and StartTrace/StopTrace cannot proceed until
marking is complete, the runtime now holds a new semaphore, gcsema,
across the mark phase just like it used to with worldsema.

Fixes golang#19812.

Change-Id: I15d43ed184f711b3d104e8f267fb86e335f86bf9
Reviewed-on: https://go-review.googlesource.com/c/go/+/182657
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/215157 mentions this issue: runtime: preempt dedicated background mark workers for STW

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/216358 mentions this issue: Revert "runtime: don't hold worldsema across mark phase"

gopherbot pushed a commit that referenced this issue Jan 24, 2020
This reverts commit 7b294cd, CL 182657.

Reason for revert: This change may be causing latency problems
for applications which call ReadMemStats, because it may cause
all goroutines to stop until the GC completes.

https://golang.org/cl/215157 fixes this problem, but it's too
late in the cycle to land that.

Updates #19812.

Change-Id: Iaa26f4dec9b06b9db2a771a44e45f58d0aa8f26d
Reviewed-on: https://go-review.googlesource.com/c/go/+/216358
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@aclements
Copy link
Member Author

Reopening because of revert.

@aclements aclements reopened this Jan 25, 2020
@aclements aclements modified the milestones: Go1.14, Go1.15 Jan 25, 2020
@mknyszek
Copy link
Contributor

We're pretty sure we know how to fix this properly now. I'll have a more complete patch ready early in the Go 1.15 cycle.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/216557 mentions this issue: runtime: don't hold worldsema across mark phase

@dmitshur
Copy link
Contributor

This issue is currently labeled as early-in-cycle for Go 1.15. That time is now, so friendly ping. If it no longer needs to be done early in cycle, that label can be removed.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/220177 mentions this issue: runtime: add ReadMemStats latency benchmark

@mknyszek
Copy link
Contributor

mknyszek commented Feb 19, 2020

@dmitshur I'm pinging my reviewers. This should go in ASAP given how much trouble it gave us last release.

It should be a lot better this time though; the fix to this bug last time was a bit sloppy, and everything here has been re-done in a more principled manner (with a new benchmark).

@mknyszek mknyszek self-assigned this Mar 12, 2020
gopherbot pushed a commit that referenced this issue Mar 18, 2020
Currently, dedicated background mark workers are essentially always
non-preemptible.

This change makes it so that dedicated background mark workers park if
their preemption flag is set and someone is trying to STW, allowing them
to do so.

This change prepares us for allowing a STW to happen (and happen
promptly) during GC marking in a follow-up change.

Updates #19812.

Change-Id: I67fb6085bf0f0aebd18ca500172767818a1f15e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/215157
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Mar 18, 2020
This change adds a benchmark to the runtime which measures ReadMemStats
latencies. It generates allocations with lots of pointers to keep the GC
busy while hitting ReadMemStats and measuring the time it takes to
complete.

Updates #19812.

Change-Id: I7a76aaf497ba5324d3c7a7b3df32461b3e6c3ac8
Reviewed-on: https://go-review.googlesource.com/c/go/+/220177
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/228337 mentions this issue: runtime: prevent preemption while releasing worldsema in gcStart

gopherbot pushed a commit that referenced this issue Apr 16, 2020
Currently, as a result of us releasing worldsema now to allow STW events
during a mark phase, we release worldsema between starting the world and
having the goroutine block in STW mode. This inserts preemption points
which, if followed through, could lead to a deadlock. Specifically,
because user goroutine scheduling is disabled in STW mode, the goroutine
will block before properly releasing worldsema.

The fix here is to prevent preemption while releasing the worldsema.

Fixes #38404.
Updates #19812.

Change-Id: I8ed5b3aa108ab2e4680c38e77b0584fb75690e3d
Reviewed-on: https://go-review.googlesource.com/c/go/+/228337
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Austin Clements <austin@google.com>
@golang golang locked and limited conversation to collaborators Apr 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants