core: use channel for promoted transactions to improve order during high volume #16673

Closed

Conversation

ryanschneider
Contributor

Related: #16617 and #14669.

This works around what seems to be a source of contention when event.Feed.Send is called from multiple goroutines at high volume. For example, see the debug.stacks() output in #16617:

goroutine 43158053 [chan receive, 136 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc4b5ae3320, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

goroutine 43750113 [chan receive, 57 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc5403ca6c0, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

goroutine 43016550 [chan receive, 154 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc4935fddd0, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

goroutine 43928769 [chan receive, 35 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc5059bf710, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

As mentioned in #14669, I was unable to find the root cause of what appears to be the non-FIFO behavior of the semaphore in event.Feed.Send; however, by routing all of the TxPool.promoteTx notifications through a channel serviced in the pool's loop goroutine, the contention is removed.

IMO the root cause of event.Feed.Send issues should still be investigated, but this change is orthogonal to any fix in that subsystem.
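Roughly, the shape of the change looks like the minimal, self-contained sketch below. The names (pool, tx, send) are illustrative stand-ins rather than the real TxPool types, but the structure matches the diff: promoteTx queues a notification on a channel, and the pool's loop goroutine is the only caller of the feed.

// Sketch only: illustrative names, not the actual go-ethereum types.
package main

import (
	"fmt"
	"sync"
)

type tx struct{ nonce uint64 }

type pool struct {
	promotedTxCh chan *tx  // drained by loop; buffer size would need tuning
	send         func(*tx) // stand-in for txFeed.Send(TxPreEvent{tx})
	quit         chan struct{}
}

func newPool(send func(*tx)) *pool {
	p := &pool{
		promotedTxCh: make(chan *tx, 4096),
		send:         send,
		quit:         make(chan struct{}),
	}
	go p.loop()
	return p
}

// promoteTx queues the notification instead of calling the feed directly;
// mirroring the diff below, the channel write still happens on its own goroutine.
func (p *pool) promoteTx(t *tx) {
	go func(t *tx) {
		select {
		case p.promotedTxCh <- t:
		case <-p.quit:
		}
	}(t)
}

// loop is the single goroutine that talks to the feed, so the feed's send lock
// is always acquired from the same place instead of from thousands of goroutines.
func (p *pool) loop() {
	for {
		select {
		case t := <-p.promotedTxCh:
			p.send(t)
		case <-p.quit:
			return
		}
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(3)
	p := newPool(func(t *tx) { fmt.Println("promoted tx", t.nonce); wg.Done() })
	for i := uint64(0); i < 3; i++ {
		p.promoteTx(&tx{nonce: i})
	}
	wg.Wait()
	close(p.quit)
}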

@karalabe
Member

karalabe commented May 4, 2018

Neither the old code nor the new one guarantees ordering. If a lot of goroutines are started, it's up to the scheduler to decide on the order. If Feed.Send is stuck, it means some reader stopped fetching data and is blocking everyone else. We'd need to find the blocker; that's the issue here.
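To make the point concrete, here's a generic fan-out sketch (not the actual event.Feed code): if delivery to every subscriber must complete before Send returns, a single subscriber that stops reading blocks every sender.

package main

import (
	"fmt"
	"time"
)

type feed struct{ subs []chan int }

// send delivers v to every subscriber and only returns once all have taken it.
func (f *feed) send(v int) {
	for _, ch := range f.subs {
		ch <- v // blocks forever if this subscriber stopped receiving
	}
}

func main() {
	f := &feed{subs: []chan int{make(chan int), make(chan int)}}

	// A healthy subscriber that keeps reading.
	go func() {
		for v := range f.subs[0] {
			_ = v
		}
	}()
	// f.subs[1] is never read from: this is the blocker.

	done := make(chan struct{})
	go func() { f.send(1); close(done) }()

	select {
	case <-done:
		fmt.Println("send completed")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("send is stuck behind the subscriber that stopped reading")
	}
}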

core/tx_pool.go Outdated
@@ -653,7 +659,9 @@ func (pool *TxPool) add(tx *types.Transaction, local bool) (bool, error) {
log.Trace("Pooled new executable transaction", "hash", hash, "from", from, "to", tx.To())

// We've directly injected a replacement transaction, notify subsystems
- go pool.txFeed.Send(TxPreEvent{tx})
+ go func(tx *types.Transaction) {
Contributor


I think it would be better to just send on the same goroutine.

@ryanschneider
Contributor Author

I agree that this doesn't do anything to guarantee stricter ordering (especially with the send to the channel still happening on a separate goroutine, as Felix mentioned). I was really hesitant to send this as a PR; I mainly wanted to because of the improved performance we are seeing on our nodes running it, so it's great to get this sort of feedback.

FWIW, based on my experience with the issue, it's not that Feed.Send gets completely stuck, but rather that the behavior of acquiring f.sendLock goes from "mostly FIFO" to "almost completely LIFO", so you end up with some calls to Feed.Send waiting on the lock for minutes or hours, while newer goroutines spawned later seem to have no trouble getting it. Always acquiring sendLock from the same goroutine, the one running loop, seems to work around this issue.

Anyway, I 100% agree that the real fix is to find what causes f.sendLock to get into that state, but so far I've been unable to do so.
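For anyone unfamiliar with the pattern, here's a minimal sketch of a one-slot channel used as a mutex, which is what the "chan receive ... feed.go:133" frames above suggest f.sendLock is. The language makes no fairness guarantee about which blocked waiter gets the token next, which is why funneling all sends through one goroutine sidesteps the question entirely.

package main

import (
	"fmt"
	"sync"
)

// chanLock is a one-slot channel used as a mutex: acquire by receiving the
// token, release by putting it back. The "chan receive" state in the stack
// traces above is goroutines blocked in exactly this kind of acquire.
type chanLock chan struct{}

func newChanLock() chanLock {
	l := make(chanLock, 1)
	l <- struct{}{} // the single token
	return l
}

func (l chanLock) acquire() { <-l }
func (l chanLock) release() { l <- struct{}{} }

func main() {
	lock := newChanLock()
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			lock.acquire()
			fmt.Println("goroutine", i, "acquired the lock") // acquisition order is not guaranteed
			lock.release()
		}(i)
	}
	wg.Wait()
}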

@ryanschneider
Contributor Author

Here's a good example: I just restarted a node and was almost immediately seeing promoteTx goroutines blocked on the channel for long stretches, for example:

goroutine 50730 [chan receive, 2 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc425e64078, 0xe80460, 0xc4c39e9560, 0xc4c39e9560)
	/ext-go/1/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xc9
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx

However, I could clearly see that the total number of goroutines in that state was fluctuating up and down:

# while true; do geth attach /mnt/local//ethereum/geth.ipc --exec "debug.stacks()" | grep promoteTx | wc -l && sleep 2; done
1188
1073
930
875
829
829
975
859
^C

And even though that goroutine was "stuck" for 2 minutes, it eventually went away:

while true; do geth attach /mnt/local//ethereum/geth.ipc --exec "debug.stacks()" | grep 50730 && sleep 2; done
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:


# these blank lines are because the goroutine was no longer found

So there's definitely movement in the system; it's just that some goroutines get stuck not being serviced while others continue operating fine. I'd invite you guys to try a similar test with a fully synced node on mainnet (I was using an archive node, but fast sync exhibits the same issue). I haven't looked to see if the same thing happens on testnets, but my guess is it won't, simply because of the difference in tx volume.

@karalabe
Member

karalabe commented May 7, 2018

Just for reference, here's a similar stack dump from the Rinkeby bootnode. I'm seeing that it has some trouble keeping up with the chain. Posting it here as one more data point: https://gist.github.com/karalabe/83ab19a04f42d3d5a28505e3577dbc92

@holiman
Contributor

holiman commented May 17, 2018

For reference (this may also affect tickets about memory leaks): I saw that the monitoring node was up at 30G of memory. A pprof dump showed:

goroutine profile: total 1343045
1342813 @ 0x44132a 0x4413de 0x418a92 0x41874b 0x6c06ac 0x46edc1
#	0x6c06ab	github.com/ethereum/go-ethereum/event.(*Feed).Send+0xcb	/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133

32 @ 0x44132a 0x450fe0 0x9704ae 0x97ee9b 0x46edc1
#	0x9704ad	github.com/ethereum/go-ethereum/p2p.(*Peer).run+0x2cd		/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/peer.go:202
#	0x97ee9a	github.com/ethereum/go-ethereum/p2p.(*Server).runPeer+0x12a	/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/server.go:894

For context, that means there are 1.3 MILLION goroutines on the event feed. The runner-up is 32 goroutines.

This is probably the cause behind #16728 as well. cc @fjl

@holiman
Contributor

holiman commented May 17, 2018

Also for future reference, the info above was found via browsing

http://localhost:6060/debug/pprof/goroutine?debug=1
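For a standalone Go program the same endpoints come from net/http/pprof (geth wires up the equivalent handlers when profiling is enabled); a minimal sketch:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// After this, http://localhost:6060/debug/pprof/goroutine?debug=1 lists all
	// goroutines grouped by identical stacks, which is where the counts above come from.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}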

@fjl
Contributor

fjl commented May 17, 2018

I have pushed a change that makes tx event sending fully synchronous.

@karalabe
Member

karalabe commented May 17, 2018 via email

@fjl
Contributor

fjl commented May 17, 2018

I've said it before: events should be sent synchronously to create back-pressure if downstream cannot keep up. I have verified the consumers of the txFeed. The miner is the one to worry about, and it will process txs fast enough if we also include #16720 (it changes the miner to skip execution if we're over the gas limit).
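To make the back-pressure argument concrete, here's a minimal sketch using plain channels rather than the txFeed itself: with fire-and-forget sends the producers run ahead of a slow consumer and the blocked sender goroutines pile up (exactly the 1.3 million goroutines above), while synchronous sends keep the number of in-flight events bounded by the number of producers.

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	events := make(chan int) // unbuffered: a send completes only when the consumer takes it

	// A slow consumer, standing in for a subscriber that cannot keep up.
	go func() {
		for range events {
			time.Sleep(5 * time.Millisecond)
		}
	}()

	// Synchronous sends: the producer blocks on each event, so the goroutine
	// count stays flat no matter how slow the consumer is.
	for i := 0; i < 20; i++ {
		events <- i
	}
	fmt.Println("goroutines after synchronous sends:  ", runtime.NumGoroutine())

	// Fire-and-forget sends (the old `go feed.Send(...)` pattern): producers run
	// ahead of the consumer and the blocked sender goroutines pile up.
	for i := 0; i < 500; i++ {
		go func(i int) { events <- i }(i)
	}
	time.Sleep(50 * time.Millisecond)
	fmt.Println("goroutines after fire-and-forget sends:", runtime.NumGoroutine())
}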

@ryanschneider
Contributor Author

If the subscribers can be expected to process or queue immediately, then I agree with making the Send synchronous; however, I don't believe that's currently the case. For example, when I traced the p2p code, it appeared that it would block in response to this event while it wrote the tx to every peer on the same goroutine.

IMO a compromise would be to keep the promotedTxCh but write to it synchronously. Even that would require a fair amount of testing during high tx churn to make sure the buffered channel could keep up.

I agree that the long term goal should be synchronous/quick processing of events in every subsystem that subscribes to them.
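Something like the following rough sketch (illustrative names, not the real types): writes to a buffered promotedTxCh stay synchronous, so they are cheap while the buffer has room and only start blocking the caller once the drain loop falls behind; picking the buffer size is the part that would need testing under high churn.

package main

import (
	"fmt"
	"time"
)

func main() {
	promotedTxCh := make(chan int, 8) // buffer size would need tuning under real tx churn

	// Drain loop, standing in for TxPool.loop calling txFeed.Send serially.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for range promotedTxCh {
			time.Sleep(time.Millisecond) // pretend the feed send takes a while
		}
	}()

	for i := 0; i < 32; i++ {
		start := time.Now()
		promotedTxCh <- i // synchronous write, no goroutine per event
		if d := time.Since(start); d > 100*time.Microsecond {
			fmt.Printf("tx %d waited %v for the drain loop (back-pressure)\n", i, d)
		}
	}
	close(promotedTxCh)
	<-done
}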

@ryanschneider
Contributor Author

I just read #16720 and am happy to rebase off that; I think the perf changes there would help the issue a lot but still not fully resolve it. I'm traveling this week, but early next week I can try vanilla #16720, and then it in combination with my changes, on some production loads.

@fjl
Contributor

fjl commented May 29, 2018

#16720 is merged now, and the leak seems fixed on our infrastructure. I'm closing this; please reopen if you still need this change.
