core: use channel for promoted transactions to improve order during high volume #16673
Conversation
…igh volume

Also send via channel here too.
Neither the old code nor the new one guarantees ordering. If a lot of goroutines are started, it's up to the scheduler to decide on the order.
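To illustrate that point with a standalone sketch (not code from this PR): whenever each event is handed to its own goroutine, the Go scheduler decides the execution order, so submission order is not preserved.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		order []int
	)

	// Launch ten goroutines in a fixed sequence; the order in which they
	// actually run (and append) is up to the scheduler.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			mu.Lock()
			order = append(order, n)
			mu.Unlock()
		}(i)
	}
	wg.Wait()

	// Typically prints something like [9 0 1 3 2 5 4 6 7 8], not [0 1 ... 9].
	fmt.Println(order)
}
```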
core/tx_pool.go (Outdated)

```diff
@@ -653,7 +659,9 @@ func (pool *TxPool) add(tx *types.Transaction, local bool) (bool, error) {
 	log.Trace("Pooled new executable transaction", "hash", hash, "from", from, "to", tx.To())

 	// We've directly injected a replacement transaction, notify subsystems
-	go pool.txFeed.Send(TxPreEvent{tx})
+	go func(tx *types.Transaction) {
```
I think it would be better to just send on the same goroutine.
I agree that this doesn't do anything to guarantee stricter ordering (especially with the send to the channel still on a separate goroutine, as Felix mentioned). I was really hesitant to send this as a PR; I mainly wanted to do so because of the improved performance we are seeing on our nodes running it, so it's great to get this sort of feedback. FWIW, based on my experience with the issue, it's not that the whole system stops making progress (more on that in my next comment). Anyways, I 100% agree that the real fix is to find what causes the event.Feed.Send contention in the first place.
Here's a good example: I just restarted a node and was almost immediately seeing the stuck goroutines again. However, I could clearly see that the total number of goroutines in that state was fluctuating up and down, and even though one of those goroutines was "stuck" for 2 minutes, it eventually went away.

So there's definitely movement in the system; it's just that some goroutines get stuck waiting to be serviced while others continue operating fine. I'd invite you guys to try a similar test with a fully synced node on mainnet (I was using an archive node, but fast sync exhibits the same issue). I haven't looked to see whether the same thing happens on testnets, but my guess is it won't, simply because of the difference in tx volume.
Just for reference, here's a similar stack dump from the Rinkeby bootnode. I'm seeing it has some trouble keeping up with the chain; posting it here as one more data point: https://gist.github.com/karalabe/83ab19a04f42d3d5a28505e3577dbc92
For reference (this may also affect tickets about memory leaks), I saw that the monitoring node was up at 30G of memory. A pprof goroutine dump put almost all of those goroutines on the event feed. For context, that means there are 1.3 MILLION goroutines on the event feed; the runner-up is 32 goroutines.

Also for future reference, the info above was found by browsing the node's pprof data.
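For anyone wanting to reproduce that kind of measurement, here is a minimal, generic sketch using Go's standard net/http/pprof package (not geth-specific; the listen address is an arbitrary choice). With the handlers registered, /debug/pprof/goroutine?debug=1 lists goroutines grouped by identical stacks, which is how a single call site such as an event feed send shows up with an enormous count; the debug.stacks() console call mentioned elsewhere in this thread produces a comparable dump.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Visit http://localhost:6060/debug/pprof/goroutine?debug=1 to see
	// goroutine counts grouped by stack; the largest group points at the
	// dominant blocking site (e.g. millions of goroutines in a feed send).
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```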
I have pushed a change that makes tx event sending fully synchronous.
That will break the pool hard, because shuffling transactions will depend
on external readers consuming the data fast enough. I don't randomly add
goroutines for the fun of it.
…On Thu, May 17, 2018, 13:10 Felix Lange wrote:
I have pushed a change that makes tx event sending fully synchronous.
I've said it before: events should be sent synchronously to create back-pressure if downstream cannot keep up. I have verified consumers of the tx event feed …
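To make the trade-off in this exchange concrete, here is a self-contained toy (plain channels standing in for the event feed; all names invented) contrasting the two strategies: a goroutine per event never blocks the producer but pins one goroutine for every delivery a slow subscriber has not accepted, which is the million-goroutine failure mode described above, while a synchronous send pushes the back-pressure onto the producer, which is exactly the concern raised earlier about breaking the pool.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type txEvent struct{ id int }

func main() {
	slowSubscriber := make(chan txEvent) // unbuffered and never read, i.e. "cannot keep up"

	// Strategy 1: one goroutine per event. The producer never blocks, but
	// every undelivered event pins a goroutine until the subscriber reads it.
	for i := 0; i < 1000; i++ {
		go func(ev txEvent) { slowSubscriber <- ev }(txEvent{i})
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines parked on deliveries:", runtime.NumGoroutine())

	// Strategy 2: synchronous send. The producer itself waits until the
	// subscriber accepts the event, so back-pressure reaches the caller.
	select {
	case slowSubscriber <- txEvent{1001}:
	case <-time.After(50 * time.Millisecond):
		fmt.Println("a synchronous send would have stalled the producer here")
	}
}
```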
If the subscribers can be expected to process/queue immediately, then I agree with making the Send synchronous; however, I don't believe that's currently the case. For example, when I traced the p2p code, it appeared that it would block in response to this event while it wrote the tx to every peer in the same goroutine. IMO a compromise would be to keep the … I agree that the long-term goal should be synchronous/quick processing of events in every subsystem that subscribes to them.
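As one sketch of what "process/queue immediately" could look like on the subscriber side (hypothetical names, deliberately not the actual p2p code): the subscriber hands each event to a bounded internal queue the moment it arrives and performs the slow per-peer work on a separate goroutine, so a synchronous feed send never waits on network writes.

```go
package main

import (
	"fmt"
	"time"
)

type txEvent struct{ hash string }

// startSubscriber drains events into an internal queue immediately and does
// the slow work (e.g. writing the tx to every peer) on its own goroutine, so
// whoever sends on `events` is never blocked by network speed.
func startSubscriber(events <-chan txEvent) {
	queue := make(chan txEvent, 4096) // bounded backlog instead of unbounded goroutines

	// Fast drain loop: accept events as quickly as they arrive.
	go func() {
		for ev := range events {
			select {
			case queue <- ev:
			default:
				// Backlog full: drop or count the event instead of stalling the feed.
			}
		}
		close(queue)
	}()

	// Slow worker: broadcasts at its own pace.
	go func() {
		for ev := range queue {
			fmt.Println("broadcasting", ev.hash) // stand-in for per-peer writes
		}
	}()
}

func main() {
	events := make(chan txEvent)
	startSubscriber(events)

	events <- txEvent{hash: "0xabc"} // returns as soon as the drain loop accepts it
	close(events)

	time.Sleep(100 * time.Millisecond) // demo only: give the worker time to print
}
```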
I just read #16720 and am happy to rebase off that; I think the perf changes there would help the issue a lot but still not fully resolve it. I'm traveling this week, but early next week I can try vanilla #16720 on its own and in combination with my changes under some production loads.
#16720 is merged now and the leak seems fixed on our infrastructure. I'm closing this; please reopen if you still need this change.
Related: #16617 and #14669.
This works around what seems to be a source of contention when event.Feed.Send is called from multiple goroutines at high volume. For example, see the debug.stacks() output in #16617.

As mentioned in #14669, I was unable to find the root cause of what appears to be the non-FIFO nature of the semaphore in event.Feed.Send; however, by routing all the TxPool.promoteTx calls to a channel serviced in loop, the contention is removed.

IMO the root cause of the event.Feed.Send issues should still be investigated, but this change is orthogonal to any fix in that subsystem.
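Since the full diff is not reproduced above, here is a self-contained toy model of the pattern the description refers to (Transaction, TxPreEvent, promotedTxCh and queueTxEvent are stand-ins, not the real go-ethereum identifiers): promoted transactions are pushed onto one channel that the pool's loop goroutine drains, so the feed send happens from a single goroutine, in arrival order, without spawning a goroutine per event.

```go
// Toy model only: a plain channel stands in for event.Feed, and all names
// are invented for illustration.
package main

import "fmt"

type Transaction struct{ Nonce uint64 }
type TxPreEvent struct{ Tx *Transaction }

type TxPool struct {
	txFeed       chan TxPreEvent   // stand-in for the pool's event feed
	promotedTxCh chan *Transaction // funnel for promoted transactions
	quit         chan struct{}
}

// queueTxEvent replaces a per-call `go pool.txFeed.Send(TxPreEvent{tx})`:
// the caller just hands the tx to the loop goroutine.
func (pool *TxPool) queueTxEvent(tx *Transaction) {
	pool.promotedTxCh <- tx
}

// loop is the single goroutine that publishes events, so sends happen in the
// order transactions were promoted and no goroutine per event is spawned.
func (pool *TxPool) loop() {
	for {
		select {
		case tx := <-pool.promotedTxCh:
			pool.txFeed <- TxPreEvent{Tx: tx}
		case <-pool.quit:
			return
		}
	}
}

func main() {
	pool := &TxPool{
		txFeed:       make(chan TxPreEvent, 16),
		promotedTxCh: make(chan *Transaction, 16),
		quit:         make(chan struct{}),
	}
	go pool.loop()

	for i := uint64(0); i < 3; i++ {
		pool.queueTxEvent(&Transaction{Nonce: i})
	}
	for i := 0; i < 3; i++ {
		fmt.Println("event for nonce", (<-pool.txFeed).Tx.Nonce) // prints 0, 1, 2
	}
	close(pool.quit)
}
```

The design point is that a buffered channel is FIFO, so serializing all sends through the loop goroutine restores ordering and caps the number of concurrent senders at one.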