core: use channel for promoted transactions to improve order during high volume #16673

Closed

Conversation

ryanschneider
Contributor

Related: #16617 and #14669.

This works around what seems to be a source of contention when event.Feed.Send is called from multiple goroutines at high volume. For example, see the debug.stacks() output in #16617:

goroutine 43158053 [chan receive, 136 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc4b5ae3320, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

goroutine 43750113 [chan receive, 57 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc5403ca6c0, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

goroutine 43016550 [chan receive, 154 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc4935fddd0, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

goroutine 43928769 [chan receive, 35 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc420313e38, 0xdef1a0, 0xc5059bf710, 0x60c977)
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xcc
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx
  /home/ubuntu/geth/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/core/tx_pool.go:750 +0x2ae

As mentioned in #14669, I was unable to find the root cause of what appears to be the non-FIFO behavior of the semaphore in event.Feed.Send; however, by routing all of the TxPool.promoteTx notifications through a channel serviced in the pool's loop goroutine, the contention is removed.

IMO the root cause of event.Feed.Send issues should still be investigated, but this change is orthogonal to any fix in that subsystem.
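Roughly, the shape of the change looks like the minimal, self-contained sketch below. The names (pool, tx, send) are illustrative stand-ins rather than the real TxPool types, but the structure matches the diff: promoteTx queues a notification on a channel, and the pool's loop goroutine is the only caller of the feed.

// Sketch only: illustrative names, not the actual go-ethereum types.
package main

import (
	"fmt"
	"sync"
)

type tx struct{ nonce uint64 }

type pool struct {
	promotedTxCh chan *tx  // drained by loop; buffer size would need tuning
	send         func(*tx) // stand-in for txFeed.Send(TxPreEvent{tx})
	quit         chan struct{}
}

func newPool(send func(*tx)) *pool {
	p := &pool{
		promotedTxCh: make(chan *tx, 4096),
		send:         send,
		quit:         make(chan struct{}),
	}
	go p.loop()
	return p
}

// promoteTx queues the notification instead of calling the feed directly;
// mirroring the diff below, the channel write still happens on its own goroutine.
func (p *pool) promoteTx(t *tx) {
	go func(t *tx) {
		select {
		case p.promotedTxCh <- t:
		case <-p.quit:
		}
	}(t)
}

// loop is the single goroutine that talks to the feed, so the feed's send lock
// is always acquired from the same place instead of from thousands of goroutines.
func (p *pool) loop() {
	for {
		select {
		case t := <-p.promotedTxCh:
			p.send(t)
		case <-p.quit:
			return
		}
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(3)
	p := newPool(func(t *tx) { fmt.Println("promoted tx", t.nonce); wg.Done() })
	for i := uint64(0); i < 3; i++ {
		p.promoteTx(&tx{nonce: i})
	}
	wg.Wait()
	close(p.quit)
}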

@karalabe
Member

karalabe commented May 4, 2018

Neither the old code nor the new one guarantees ordering. If a lot of goroutines are started, it's up to the scheduler to decide on the order. If Feed.Send is stuck, it means some reader stopped fetching data and is blocking everyone else. We'd need to find the blocker; that's the issue here.
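To make the point concrete, here's a generic fan-out sketch (not the actual event.Feed code): if delivery to every subscriber must complete before Send returns, a single subscriber that stops reading blocks every sender.

package main

import (
	"fmt"
	"time"
)

type feed struct{ subs []chan int }

// send delivers v to every subscriber and only returns once all have taken it.
func (f *feed) send(v int) {
	for _, ch := range f.subs {
		ch <- v // blocks forever if this subscriber stopped receiving
	}
}

func main() {
	f := &feed{subs: []chan int{make(chan int), make(chan int)}}

	// A healthy subscriber that keeps reading.
	go func() {
		for v := range f.subs[0] {
			_ = v
		}
	}()
	// f.subs[1] is never read from: this is the blocker.

	done := make(chan struct{})
	go func() { f.send(1); close(done) }()

	select {
	case <-done:
		fmt.Println("send completed")
	case <-time.After(200 * time.Millisecond):
		fmt.Println("send is stuck behind the subscriber that stopped reading")
	}
}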

core/tx_pool.go Outdated
@@ -653,7 +659,9 @@ func (pool *TxPool) add(tx *types.Transaction, local bool) (bool, error) {
log.Trace("Pooled new executable transaction", "hash", hash, "from", from, "to", tx.To())

// We've directly injected a replacement transaction, notify subsystems
- go pool.txFeed.Send(TxPreEvent{tx})
+ go func(tx *types.Transaction) {
Contributor


I think it would be better to just send on the same goroutine.

@ryanschneider
Contributor Author

I agree that this doesn't do anything to guarantee stricter ordering (especially with the send to the channel still happening on a separate goroutine, as Felix mentioned). I was really hesitant to send this as a PR; I mainly wanted to because of the improved performance we are seeing on our nodes running it, so it's great to get this sort of feedback.

FWIW, based on my experience with the issue, it's not that Feed.Send gets completely stuck, but rather that the behavior of acquiring f.sendLock goes from "mostly FIFO" to "almost completely LIFO", so you end up with some calls to Feed.Send waiting on the lock for minutes or hours, while newer goroutines spawned later seem to have no trouble getting it. Always acquiring sendLock from the same goroutine, the one running loop, seems to work around this issue.

Anyway, I 100% agree that the real fix is to find what causes f.sendLock to get into that state, but so far I've been unable to do so.
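For anyone unfamiliar with the pattern, here's a minimal sketch of a one-slot channel used as a mutex, which is what the "chan receive ... feed.go:133" frames above suggest f.sendLock is. The language makes no fairness guarantee about which blocked waiter gets the token next, which is why funneling all sends through one goroutine sidesteps the question entirely.

package main

import (
	"fmt"
	"sync"
)

// chanLock is a one-slot channel used as a mutex: acquire by receiving the
// token, release by putting it back. The "chan receive" state in the stack
// traces above is goroutines blocked in exactly this kind of acquire.
type chanLock chan struct{}

func newChanLock() chanLock {
	l := make(chanLock, 1)
	l <- struct{}{} // the single token
	return l
}

func (l chanLock) acquire() { <-l }
func (l chanLock) release() { l <- struct{}{} }

func main() {
	lock := newChanLock()
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			lock.acquire()
			fmt.Println("goroutine", i, "acquired the lock") // acquisition order is not guaranteed
			lock.release()
		}(i)
	}
	wg.Wait()
}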

@ryanschneider
Contributor Author

Here's a good example: I just restarted a node and was almost immediately seeing promoteTx goroutines blocked on the channel for long stretches, for example:

goroutine 50730 [chan receive, 2 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc425e64078, 0xe80460, 0xc4c39e9560, 0xc4c39e9560)
	/ext-go/1/src/github.com/ethereum/go-ethereum/event/feed.go:133 +0xc9
created by github.com/ethereum/go-ethereum/core.(*TxPool).promoteTx

However, I could clearly see that the total number of goroutines in that state was fluctuating up and down:

# while true; do geth attach /mnt/local//ethereum/geth.ipc --exec "debug.stacks()" | grep promoteTx | wc -l && sleep 2; done
1188
1073
930
875
829
829
975
859
^C

And even though that goroutine was "stuck" for 2 minutes, it eventually went away:

while true; do geth attach /mnt/local//ethereum/geth.ipc --exec "debug.stacks()" | grep 50730 && sleep 2; done
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:
goroutine 50730 [chan receive, 2 minutes]:


# these blank lines are because the goroutine was no longer found

So there's definitely movement in the system; it's just that some goroutines get stuck not being serviced while others continue operating fine. I'd invite you guys to try a similar test with a fully synced node on mainnet (I was using an archive node, but fast sync exhibits the same issue). I haven't looked to see if the same thing happens on testnets, but my guess is it won't, simply because of the difference in tx volume.

@karalabe
Member

karalabe commented May 7, 2018

Just for reference, here's a similar stack dump from the Rinkeby bootnode. I'm seeing that it has some trouble keeping up with the chain. Posting it here as one more data point: https://gist.github.com/karalabe/83ab19a04f42d3d5a28505e3577dbc92

@holiman
Contributor

holiman commented May 17, 2018

For reference (this may also affect tickets about memory leaks): I saw that the monitoring node was up at 30G of memory. A pprof dump showed:

goroutine profile: total 1343045
1342813 @ 0x44132a 0x4413de 0x418a92 0x41874b 0x6c06ac 0x46edc1
#	0x6c06ab	github.com/ethereum/go-ethereum/event.(*Feed).Send+0xcb	/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/event/feed.go:133

32 @ 0x44132a 0x450fe0 0x9704ae 0x97ee9b 0x46edc1
#	0x9704ad	github.com/ethereum/go-ethereum/p2p.(*Peer).run+0x2cd		/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/peer.go:202
#	0x97ee9a	github.com/ethereum/go-ethereum/p2p.(*Server).runPeer+0x12a	/go-ethereum/build/_workspace/src/github.com/ethereum/go-ethereum/p2p/server.go:894

For context, that means there are 1.3 MILLION goroutines on the event feed. The runner-up is 32 goroutines.

This is probably the cause behind #16728 as well. cc @fjl

@holiman
Contributor

holiman commented May 17, 2018

Also for future reference, the info above was found via browsing

http://localhost:6060/debug/pprof/goroutine?debug=1
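For a standalone Go program the same endpoints come from net/http/pprof (geth wires up the equivalent handlers when profiling is enabled); a minimal sketch:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// After this, http://localhost:6060/debug/pprof/goroutine?debug=1 lists all
	// goroutines grouped by identical stacks, which is where the counts above come from.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}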

@fjl
Contributor

fjl commented May 17, 2018

I have pushed a change that makes tx event sending fully synchronous.

@karalabe
Member

karalabe commented May 17, 2018 via email

@fjl
Contributor

fjl commented May 17, 2018

I've said it before: events should be sent synchronously to create back-pressure if downstream cannot keep up. I have verified the consumers of the txFeed. The miner is the one to worry about, and it will process txs fast enough if we also include #16720 (it changes the miner to skip execution if we're over the gas limit).
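To make the back-pressure argument concrete, here's a minimal sketch using plain channels rather than the txFeed itself: with fire-and-forget sends the producers run ahead of a slow consumer and the blocked sender goroutines pile up (exactly the 1.3 million goroutines above), while synchronous sends keep the number of in-flight events bounded by the number of producers.

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	events := make(chan int) // unbuffered: a send completes only when the consumer takes it

	// A slow consumer, standing in for a subscriber that cannot keep up.
	go func() {
		for range events {
			time.Sleep(5 * time.Millisecond)
		}
	}()

	// Synchronous sends: the producer blocks on each event, so the goroutine
	// count stays flat no matter how slow the consumer is.
	for i := 0; i < 20; i++ {
		events <- i
	}
	fmt.Println("goroutines after synchronous sends:  ", runtime.NumGoroutine())

	// Fire-and-forget sends (the old `go feed.Send(...)` pattern): producers run
	// ahead of the consumer and the blocked sender goroutines pile up.
	for i := 0; i < 500; i++ {
		go func(i int) { events <- i }(i)
	}
	time.Sleep(50 * time.Millisecond)
	fmt.Println("goroutines after fire-and-forget sends:", runtime.NumGoroutine())
}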

@ryanschneider
Contributor Author

If the subscribers can be expected to process or queue immediately, then I agree with making the Send synchronous; however, I don't believe that's currently the case. For example, when I traced the p2p code, it appeared that it would block in response to this event while it wrote the tx to every peer on the same goroutine.

IMO a compromise would be to keep the promotedTxCh but write to it synchronously. Even that would require a fair amount of testing during high tx churn to make sure the buffered channel could keep up.

I agree that the long term goal should be synchronous/quick processing of events in every subsystem that subscribes to them.
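Something like the following rough sketch (illustrative names, not the real types): writes to a buffered promotedTxCh stay synchronous, so they are cheap while the buffer has room and only start blocking the caller once the drain loop falls behind; picking the buffer size is the part that would need testing under high churn.

package main

import (
	"fmt"
	"time"
)

func main() {
	promotedTxCh := make(chan int, 8) // buffer size would need tuning under real tx churn

	// Drain loop, standing in for TxPool.loop calling txFeed.Send serially.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for range promotedTxCh {
			time.Sleep(time.Millisecond) // pretend the feed send takes a while
		}
	}()

	for i := 0; i < 32; i++ {
		start := time.Now()
		promotedTxCh <- i // synchronous write, no goroutine per event
		if d := time.Since(start); d > 100*time.Microsecond {
			fmt.Printf("tx %d waited %v for the drain loop (back-pressure)\n", i, d)
		}
	}
	close(promotedTxCh)
	<-done
}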

@ryanschneider
Contributor Author

I just read #16720 and am happy to rebase off that; I think the perf changes there would help the issue a lot but still not fully resolve it. I'm traveling this week, but early next week I can try vanilla #16720, and then it in combination with my changes, on some production loads.

@fjl
Contributor

fjl commented May 29, 2018

#16720 is merged now, and the leak seems fixed on our infrastructure. I'm closing this; please reopen if you still need this change.
