Inefficient response handling in Snap #25600

Closed
holiman opened this issue Aug 25, 2022 · 3 comments

holiman commented Aug 25, 2022

Note: This bug report is somewhat speculative; I am not 100% certain that the description here is correct.


I think I know why this PR (#25588) does not make any difference.

[Screenshot: Single Geth - Grafana machine charts]

A few observations can be made from these machine charts:

  • There are always 128-150 pending (outstanding) requests.
  • The Snap trienode handle times most commonly fall in the 3-7 s bucket, but quite often reach up to 20 seconds.

So if all ~128 outstanding responses have already arrived, handling them at roughly 10 seconds each will take ~1280 seconds, or about 21 minutes. By the time we actually get to handle them, they will have timed out in the snap layer, even though they were fine in the p2p packet tracker layer.

This is wasteful: we are making requests at a higher rate than we can handle them, and thus end up ignoring responses and refetching the same data. We need to adjust the mechanism so that we do not request more than we can handle -- alternatively, fix the timeout management so that we do not time out deliveries which have already arrived and are just waiting in the queue (see the sketch below).
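As an illustration of that second option (purely a sketch with hypothetical types and names, not the actual snap sync code): record when a response comes off the wire, and judge the timeout against the network round trip rather than against the moment the handler finally gets to it.

```go
package snap

import "time"

// Sketch only: hypothetical types and fields, not the real eth/protocols/snap
// structures. The idea is to separate "time on the wire" from "time spent
// waiting in the local handling queue", and only count the former against the
// request timeout.

const healRequestTimeout = 15 * time.Minute // assumed timeout window

type trienodeHealResponse struct {
	requestedAt time.Time // when the request was sent to the peer
	arrivedAt   time.Time // when the response came back over the network
	// ... payload ...
}

// timedOut reports whether the network round trip exceeded the timeout.
// Time the response spends queued locally after arrival is not counted.
func (r *trienodeHealResponse) timedOut() bool {
	return r.arrivedAt.Sub(r.requestedAt) > healRequestTimeout
}
```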


holiman commented Aug 25, 2022

In sync.go, every time the loop runs, we assign/issue another trienode heal request. So once an item has been lying on the queue for ~15 minutes, we issue and send out a new request. It may be served quickly by the remote peer, but it will not be handled until after it has timed out, 15-20 minutes later.

A better model in this case would be to tune the 128 pending requests down to maybe 10 (see the sketch below). The node in question has ~300 peers; I guess this problem doesn't normally occur unless the node has a lot of peers.

Note: this is a consequence of disk I/O speed, not network lag.
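For concreteness, here is a minimal sketch of such a hard cap at the scheduling site (hypothetical constant and field names, not actual geth code): stop assigning new trienode heal requests once the in-flight count reaches a small limit.

```go
package snap

// Sketch only: hypothetical names, not the actual eth/protocols/snap/sync.go
// code. Stop assigning new trienode heal requests once the in-flight count
// reaches a small hard cap, so responses don't sit in the handling queue past
// their timeout.

const maxTrienodeHealPending = 10 // hard cap on outstanding heal requests

type trienodeHealRequest struct {
	// ... request fields ...
}

type syncer struct {
	trienodeHealReqs map[uint64]*trienodeHealRequest // in-flight requests by id
}

// assignTrienodeHealTasks stands in for the scheduling loop: it bails out
// early instead of issuing yet another request that cannot be handled before
// it times out.
func (s *syncer) assignTrienodeHealTasks() {
	if len(s.trienodeHealReqs) >= maxTrienodeHealPending {
		return // enough outstanding work already; let the handlers catch up
	}
	// ... select an idle peer, build the request, track it and send it ...
}
```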


holiman commented Aug 25, 2022

Here's one hour of execution, where I have set a hard upper limit of 15 pending trie requests.

[Screenshot: Single Geth - Grafana]

The "Unexpected trienode heal" log messages almost (but not quite) go away completely.

[Screenshot: bootnode-azure-koreasouth-001 - Papertrail]

The code I used is on this branch: https://github.com/holiman/go-ethereum/tree/heal_timeout_fix

It places the cap very un-dynamically at the point where we schedule requests. It could be applied in several different places, and could also be made dynamic, e.g. adjusted so that the maximum number of pending requests goes down as the mean handle time goes up (see the sketch below). Suggestions appreciated.
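One way such a dynamic cap could work (a sketch under assumptions, with hypothetical names; not a concrete proposal for where it should live in sync.go): track a moving average of the per-response handle time and size the pending cap so the whole backlog can be drained well inside the timeout window.

```go
package snap

import "time"

// Sketch of a dynamic cap, all names hypothetical: allow only as many in-flight
// heal requests as can be handled, at the currently observed handling rate,
// within a fraction of the request timeout.

const (
	healRequestTimeout = 15 * time.Minute // assumed snap-layer timeout
	minHealPending     = 2                // never starve the pipeline entirely
	maxHealPending     = 128              // never exceed the old static limit
)

type healScheduler struct {
	avgHandleTime time.Duration // exponential moving average of handle times
}

// observeHandleTime folds a measured per-response handling duration into the
// moving average (simple EMA with alpha = 1/8).
func (h *healScheduler) observeHandleTime(d time.Duration) {
	if h.avgHandleTime == 0 {
		h.avgHandleTime = d
		return
	}
	h.avgHandleTime = (7*h.avgHandleTime + d) / 8
}

// pendingCap returns how many requests may be outstanding so that, at the
// current handling rate, the backlog drains in at most half the timeout.
func (h *healScheduler) pendingCap() int {
	if h.avgHandleTime == 0 {
		return minHealPending
	}
	limit := int((healRequestTimeout / 2) / h.avgHandleTime) // Duration/Duration gives a count
	if limit < minHealPending {
		limit = minHealPending
	}
	if limit > maxHealPending {
		limit = maxHealPending
	}
	return limit
}
```

Halving the timeout in pendingCap is an arbitrary safety margin: it leaves headroom for network latency and for handle times that spike above the average.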


holiman commented Sep 1, 2022

Fix: #25651

holiman closed this as completed on Sep 1, 2022