
tests: fix goroutine leak related to state snapshot generation #28974

Merged · 6 commits · Feb 14, 2024

Conversation

@holiman (Contributor) commented Feb 12, 2024:

This PR fixes two things:

  • A goroutine leak caused by snapshot generation: once generation completes, the generator wants to deliver its stats over a channel. In some cases we never read from that channel, which leaves thousands of goroutines blocked (example: all: update to go version 1.22.1 #28946 (comment)). A minimal sketch of the blocked-send pattern follows this list.
  • A t.Parallel call early in the state test execution causes massive parallelization: essentially all statetest files are read and initialized, and once initialized they are all halted (put into a wait) so that testing can proceed at the specified parallelism. I suspect that this early forking of testcases does not improve testing speed on appveyor/travis, so I removed it here. I will check which run is faster, with or without it.
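
For illustration, here is a minimal, runnable sketch of the blocked-send pattern behind the leak. All names (`generate`, `generatorStats`) are hypothetical stand-ins, not the actual go-ethereum code:

```go
package main

import "fmt"

// generatorStats stands in for the stats the snapshot generator wants
// to hand back when it finishes or is aborted.
type generatorStats struct{ accounts int }

// generate mimics the shape of the leak: the goroutine cannot exit
// until someone receives from the unbuffered done channel.
func generate(done chan *generatorStats) {
	// ... generation work ...
	done <- &generatorStats{accounts: 42} // blocks forever if nobody reads
}

func main() {
	done := make(chan *generatorStats)
	go generate(done)

	// A test that returns without draining (or disabling) the generator
	// leaks the goroutine; across thousands of state tests these blocked
	// sends pile up.
	stats := <-done // the fix amounts to ensuring this receive happens
	fmt.Println("generator finished:", stats.accounts)
}
```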

@holiman (Contributor, Author) commented Feb 12, 2024:

Exec speed of github.com/ethereum/go-ethereum/tests:

|               | Ubuntu AMD64 | Windows AMD64 | Windows 386 |
|---------------|--------------|---------------|-------------|
| less parallel | 1077.158s    | 781.898s      | 770.950s    |
| more parallel | 1102.093s    | 636.330s      | 653.223s    |

Ok, I guess we keep the parallelism!

@fjl changed the title from "tests: fix goroutineleak" to "tests: fix goroutine leak" on Feb 12, 2024
```diff
@@ -240,6 +240,9 @@ func (t *StateTest) RunNoVerify(subtest StateSubtest, vmconfig vm.Config, snapsh
 	vmconfig.ExtraEips = eips

 	block := t.genesis(config).ToBlock()
+	// MakePreState invokes snapshot.New, which will start a goroutine to
+	// generate.go:generate(). This goroutine cannot exit until the abort-stats
+	// can be delivered, either via a call to Journal or e.g. Disable.
 	triedb, snaps, statedb := MakePreState(rawdb.NewMemoryDatabase(), t.json.Pre, snapshotter, scheme)
```
Contributor commented:

How about adding this here?

```go
if snaps != nil {
    defer snaps.Disable()
}
```

Contributor commented:

And while here, we should probably also use `defer triedb.Close()`.

holiman (Contributor, Author) commented:

We should only close these things if we don't return them to the caller; if we do return them, it's the caller's job to clean up.

@fjl (Contributor) commented Feb 12, 2024:

Ahh, but that's a design smell then: there are quite a few cases where we need to take care of closing it. The other way of using defer here would be to name the return values and then add:

```go
defer func() {
    if err != nil {
        triedb.Close()
        if snaps != nil {
            snaps.Disable()
        }
    }
}()
```

What I'm trying to get at is that we should have worry-free returns whenever possible; having to add a condition before every return is not great.
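
As a self-contained sketch of that named-return idiom, with hypothetical stand-in types (`trieDB`, `snapTree`, `makePreState`) rather than the real go-ethereum ones:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the two resources being set up.
type trieDB struct{}

func (t *trieDB) Close() error { return nil }

type snapTree struct{}

func (s *snapTree) Disable() {}

// makePreState names its return values so the deferred function can
// inspect err: on an error return the cleanup fires automatically, and
// on success ownership passes untouched to the caller.
func makePreState(fail bool) (triedb *trieDB, snaps *snapTree, err error) {
	triedb = &trieDB{}
	snaps = &snapTree{}
	defer func() {
		if err != nil {
			triedb.Close()
			if snaps != nil {
				snaps.Disable()
			}
		}
	}()
	if fail {
		err = errors.New("pre-state setup failed")
		return // the deferred cleanup closes triedb and disables snaps
	}
	return triedb, snaps, nil // success: the caller owns the resources
}

func main() {
	if _, _, err := makePreState(true); err != nil {
		fmt.Println("cleaned up after:", err)
	}
}
```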

holiman (Contributor, Author) commented:

Yeah, I agree. Once the CI run is done, I can change it to use that defer.


@fjl (Contributor) left a review comment:

I'm not happy about this weird leak of snapshot internals, but I guess there is no good solution otherwise.

@fjl changed the title from "tests: fix goroutine leak" to "tests: fix goroutine leak related to state snapshot generation" on Feb 12, 2024
@fjl (Contributor) commented Feb 12, 2024:

It doesn't work because we sometimes want to return an error from RunNoVerify but also leave the databases open for some reason.

@holiman requested a review from s1na as a code owner on February 12, 2024 20:43
Comment on lines +490 to +491:

```go
st.Snapshots.Disable()
st.Snapshots.Release()
```
holiman (Contributor, Author) commented:

I think we need to make Disable safe to call multiple times. Right now it will halt, because nothing is listening on dl.genAbort. Maybe we should nil out genAbort when the generator becomes inactive.
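
A minimal, runnable sketch of that nil-out-under-lock idea, using hypothetical names rather than the actual snapshot package internals:

```go
package main

import (
	"fmt"
	"sync"
)

// layer is a hypothetical stand-in for the disk layer: genAbort is set
// to nil once the generator is gone, so repeated disables are no-ops.
type layer struct {
	lock     sync.Mutex
	genAbort chan chan struct{}
}

func (l *layer) disable() {
	l.lock.Lock()
	abort := l.genAbort
	l.genAbort = nil // later calls observe nil and return immediately
	l.lock.Unlock()

	if abort == nil {
		return // already disabled, or generation already finished
	}
	done := make(chan struct{})
	abort <- done // request the abort...
	<-done        // ...and wait for the generator to acknowledge it
}

func main() {
	abortCh := make(chan chan struct{})
	l := &layer{genAbort: abortCh}

	// Fake generator: waits for one abort request and acknowledges it.
	go func() {
		done := <-abortCh
		close(done)
	}()

	l.disable() // aborts the generator
	l.disable() // safe no-op instead of a send nobody receives
	fmt.Println("disable is idempotent")
}
```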

holiman (Contributor, Author) commented:

```diff
diff --git a/core/state/snapshot/snapshot.go b/core/state/snapshot/snapshot.go
index 6389842382..52e6fca56c 100644
--- a/core/state/snapshot/snapshot.go
+++ b/core/state/snapshot/snapshot.go
@@ -258,6 +258,13 @@ func (t *Tree) Disable() {
 	for _, layer := range t.layers {
 		switch layer := layer.(type) {
 		case *diskLayer:
+
+			layer.lock.RLock()
+			if layer.genMarker == nil {
+				// Generator is already aborted or finished
+				layer.lock.RUnlock()
+				break
+			}
 			// If the base layer is generating, abort it
 			if layer.genAbort != nil {
 				abort := make(chan *generatorStats)
```
I think this would fix it

Contributor commented:

Possibly a simpler fix could be to set st.Snapshots = nil here?


```go
layer.lock.RLock()
generating := layer.genMarker != nil
layer.lock.RUnlock()
```
Contributor commented:

Not sure this is safe. We are iterating the layers here and locking them individually. What happens if genMarker is updated in between?

holiman (Contributor, Author) commented:

That's a good question. However, there is only one singleton diskLayer, and genMarker is only ever set to nil after completion. I don't see how this can go wrong, other than as a false negative: after we release this lock, the generation completes, and we again get stuck trying to send over a channel where nobody is listening.

The genAbort mechanism is not wonderful to work with.

@fjl merged commit 8321fe2 into ethereum:master on Feb 14, 2024
3 checks passed
@fjl added this to the 1.13.13 milestone on Feb 14, 2024
jorgemmsilva pushed a commit to iotaledger/go-ethereum that referenced this pull request Jun 17, 2024