eth/downloader: drop beacon head updates if the syncer is restarting #27397
Conversation
So if we have this flow of events piling up, it seems to me that a better solution would be to, even if the internals are busy, keep the best/latest update and discard the earlier ones. Once the api is done with whatever it was doing, it takes the latest update. How about something like that?
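For illustration, a minimal, self-contained Go sketch of that "keep only the latest update" idea; headUpdate, coalesceHeads and the channel wiring are hypothetical, not the actual skeleton syncer API:

```go
package main

import (
	"errors"
	"fmt"
)

// headUpdate is a stand-in for a beacon head announcement plus the reply
// channel back to the engine API caller (hypothetical, for illustration).
type headUpdate struct {
	number uint64
	errc   chan error
}

// coalesceHeads drains incoming head updates while the syncer is busy,
// keeping only the most recent one and rejecting the ones it supersedes,
// so engine API callers are never left blocked.
func coalesceHeads(in <-chan headUpdate, busyDone <-chan struct{}) *headUpdate {
	var latest *headUpdate
	for {
		select {
		case ev := <-in:
			if latest != nil {
				latest.errc <- errors.New("superseded by a newer head update")
			}
			latest = &ev
		case <-busyDone:
			return latest // freshest head, or nil if none arrived
		}
	}
}

func main() {
	in := make(chan headUpdate)
	busy := make(chan struct{})

	go func() {
		for i := uint64(1); i <= 3; i++ {
			in <- headUpdate{number: i, errc: make(chan error, 1)}
		}
		close(busy) // pretend the internal restart just finished
	}()
	if latest := coalesceHeads(in, busy); latest != nil {
		fmt.Println("resuming with head", latest.number)
	}
}
```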
Not sure it matters. If there's a header gap, you need to do a sync run afterwards. Now whether you keep the last header or drop all and wait for the next update is kind of similar, though I can see your version being a bit more stable. All in all I'm unhappy about both. That said, the moment the thing locks up, even if we block all goroutines somewhere, the order of insertion after that will be random, so it's gonna get wonky anyway. I guess a possibly better solution would be to have a queue of headers that arrived whilst we were offline and try to consume it when the syncer is restarted. That way we could preserve the inbound order at least and not lose any headers.
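A rough sketch of the "queue headers while offline, replay on restart" alternative mentioned above; offlineQueue and its methods are placeholders, not anything in eth/downloader, and the point is only that arrival order is preserved and nothing is dropped:

```go
package main

import (
	"fmt"
	"sync"
)

// header is a stand-in for the real downloader header type.
type header struct{ number uint64 }

// offlineQueue buffers headers that arrive while the syncer is restarting.
type offlineQueue struct {
	mu      sync.Mutex
	pending []header
}

// push appends a header in arrival order; callers return immediately
// instead of blocking on a busy syncer.
func (q *offlineQueue) push(h header) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, h)
}

// drain hands back the buffered headers in order and resets the queue;
// the restarted syncer would replay them one by one.
func (q *offlineQueue) drain() []header {
	q.mu.Lock()
	defer q.mu.Unlock()
	queued := q.pending
	q.pending = nil
	return queued
}

func main() {
	var q offlineQueue
	for _, n := range []uint64{101, 102, 103} {
		q.push(header{number: n})
	}
	for _, h := range q.drain() {
		fmt.Println("replaying header", h.number)
	}
}
```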
Ah, I think I can make this more elegant. I'll make the suspend run on its own goroutine instead of the syncer thread and have the syncer keep eating up events in between.
The change LGTM. Minor note though: it "looks" more complex than it actually is, in my humble opinion. For me at least, it would look simpler if you had moved the drainer into a new goroutine and left the other one where it was. Like this:
defer func() {
done := make(chan struct{})
go func() {
// Keep draining the headEvents, to not block sender (updates
// coming in via the beacon RPC api).
for {
select {
case <-done:
return
case event := <-s.headEvents:
event.errc <- errors.New("beacon syncer reorging")
}
}
}()
if filled := s.filler.suspend(); filled != nil {
// If something was filled, try to delete stale sync helpers. If
// unsuccessful, warn the user, but not much else we can do (it's
// a programming error, just let users report an issue and don't
// choke in the meantime).
if err := s.cleanStales(filled); err != nil {
log.Error("Failed to clean stale beacon headers", "err", err)
}
}
close(done)
}()
Do with it as you see fit :)
…thereum#27397)
* eth/downloader: drop beacon head updates if the syncer is restarting
* eth/downloader: v2 of the goroutine spike preventer
…tarting (ethereum#27397)" This reverts commit 0968b3d.
Fixes #27386.
When a reorg happens, the beacon/skeleton syncer needs to stop, swap out its internal state to the new head and restart. If the downloader is busy importing blocks in the meantime, it will prevent the beacon syncer from completing its restart loop until all queued blocks are consumed.
During this "hang", any newly arriving newPayload or forkchoiceUpdate messages will get blocked, as they cannot feed their new data to the beacon syncer. If the "hang" lasts 15 minutes, these update messages keep piling up, each one holding a goroutine and consuming RAM.
The PR "fixes" this issue by dropping update messages on the floor while in this limbo state. This should unblock engine API events and prevent bloat.
That said, it does introduce a wonky behavior: if a newPayload arrives back to back after a forkchoiceUpdate that triggered a reorg, that newPayload might get ignored. This isn't that big of a deal for a syncing node, but if the node is already in sync, it will "miss" a block and will need to wait for the next one to come back into sync. If the entire network does this, however, will it ever come back into sync? Not clear; seems a bit unsafe.