Fix Besu StacklessClosedChannelException errors and resulting timeout errors in CL clients #4410

Merged: 1 commit merged into hyperledger:main on Sep 19, 2022

Conversation

@ahamlat ahamlat commented Sep 17, 2022

Signed-off-by: Ameziane H ameziane.hamlat@consensys.net

PR description

StacklessClosedChannelException errors are thrown when some Eth calls, like eth_syncing or eth_getBlockByHash, have to wait for an engine_newPayloadV1 call to finish. As the latter sometimes takes more than 1 second to execute, depending on block size, transaction types and hardware setup, the CL (Lighthouse in this case) closes the connection because its timeout on these calls is set to 1 second.

In the screenshot below, we can see that one eth_getBlockByHash call and two eth_syncing calls are waiting for the engine_newPayloadV1 call to finish.

[Screenshot: eth_getBlockByHash and eth_syncing calls waiting for engine_newPayloadV1 to finish]

The Eth calls shouldn't execute sequentially with the Engine API calls; this fix allows several requests to execute concurrently on port 8551.
@garyschulte please confirm that the Engine API calls will still execute sequentially even with this new configuration.

You can find Lighthouse logs from before and after the fix.
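
For illustration only (this is not the code changed in this PR), here is a minimal Java sketch of the idea, using hypothetical names: engine_ methods stay on a single-threaded executor so they remain ordered and synchronous, while other JSON-RPC methods such as eth_syncing or eth_getBlockByHash run on a separate pool and can answer within the CL's 1 second timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch only, not Besu's JsonRpcExecutorHandler: engine_ methods
// share one thread (so newPayload/forkchoiceUpdated stay ordered), while other
// JSON-RPC methods run on a pool and are no longer blocked behind a slow import.
public class RpcDispatcherSketch {

  private final ExecutorService engineExecutor = Executors.newSingleThreadExecutor();
  private final ExecutorService ethExecutor = Executors.newFixedThreadPool(4);

  public CompletableFuture<Object> dispatch(final String method, final Supplier<Object> handler) {
    final ExecutorService target = method.startsWith("engine_") ? engineExecutor : ethExecutor;
    return CompletableFuture.supplyAsync(handler, target);
  }

  public static void main(final String[] args) {
    final RpcDispatcherSketch dispatcher = new RpcDispatcherSketch();
    // A slow engine_newPayloadV1 no longer delays eth_syncing past the CL's 1 second timeout.
    dispatcher.dispatch("engine_newPayloadV1", RpcDispatcherSketch::slowPayloadImport);
    final Object syncing = dispatcher.dispatch("eth_syncing", () -> Boolean.FALSE).join();
    System.out.println("eth_syncing answered immediately: " + syncing);
    dispatcher.engineExecutor.shutdown();
    dispatcher.ethExecutor.shutdown();
  }

  private static Object slowPayloadImport() {
    try {
      Thread.sleep(1500); // simulate a block import taking more than 1 second
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return "VALID";
  }
}
```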

Fixed Issue(s)

fixes #4398 and #4400

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

…e API requests are executed sequentially.

Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
@kayagoban

I installed a Besu build patched with this PR and it has had little effect on my missed attestations. The correctly-voted-head percentage is still floating between 58% and 81%; before the Merge I was at 98%. But maybe this helps fix other problems. Good luck!

ahamlat commented Sep 17, 2022

@kayagoban thanks for your feedback. What is your hardware setup and your CL client? Do you see timeout errors on the CL side, and could you share Besu logs so we can check block processing time?

@siladu siladu left a comment

LGTM, as long as the message ordering property is maintained for forkChoiceUpdated as per https://github.com/ethereum/execution-apis/blob/main/src/engine/specification.md#message-ordering
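
For illustration, a hypothetical, self-contained check of that ordering property (not Besu's test suite): two engine_ calls submitted back to back to a single-threaded executor still complete in submission order, even though the second finishes its work much sooner.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical check, not Besu's test suite: two engine_ calls submitted back to
// back to a single-threaded executor; the second is much faster but still
// completes after the first, matching the Engine API message ordering requirement.
public class EngineOrderingCheck {

  public static void main(final String[] args) {
    final ExecutorService engineExecutor = Executors.newSingleThreadExecutor();
    final List<String> completionOrder = new CopyOnWriteArrayList<>();

    final CompletableFuture<Void> newPayload = CompletableFuture.runAsync(() -> {
      sleep(300); // simulated slow block import
      completionOrder.add("engine_newPayloadV1");
    }, engineExecutor);

    final CompletableFuture<Void> forkChoice = CompletableFuture.runAsync(() -> {
      sleep(10); // fast call, submitted second
      completionOrder.add("engine_forkchoiceUpdatedV1");
    }, engineExecutor);

    CompletableFuture.allOf(newPayload, forkChoice).join();
    // Prints [engine_newPayloadV1, engine_forkchoiceUpdatedV1]: submission order preserved.
    System.out.println(completionOrder);
    engineExecutor.shutdown();
  }

  private static void sleep(final long millis) {
    try {
      Thread.sleep(millis);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```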

@garyschulte garyschulte left a comment

Confirmed that this fixes the eth_ concurrency issue and keeps the engine_ API calls ordered and synchronous 👍

As suggested by @jflo, we should also do additional WebSocket Engine API testing with Nimbus, to ensure that concurrent WebSocket requests are handled correctly when Nimbus uses WebSockets for the Engine API.

I will update with nimbus websocket testing results.

@kayagoban

I’m running Besu/Prysm. No timeout errors on the CL side. Maybe I can open a new issue with the logs I do have and my symptoms.

ahamlat commented Sep 18, 2022

@kayagoban Yes, please. We think we have different use cases around missed attestations. Please provide your hardware setup and your Besu and Prysm logs.

@ibhagwan

@garyschulte, running this patch on mainnet with nimbus websockets, just missed an attestation after about two hours of running the patched version.

@garyschulte

> @garyschulte, running this patch on mainnet with nimbus websockets, just missed an attestation after about two hours of running the patched version.

Thanks for testing. Other than continuing to miss attestations, did Nimbus/Besu work normally otherwise? What percentage of attestations were missed before and after?

ibhagwan commented Sep 19, 2022

> @garyschulte, running this patch on mainnet with nimbus websockets, just missed an attestation after about two hours of running the patched version.
>
> Thanks for testing. Other than continuing to miss attestations, did Nimbus/Besu work normally otherwise? What percentage of attestations were missed before and after?

Yes, other than the missed attestation it’s working normally. I can’t tell if there is a difference yet, as it was also missing roughly 1-2 attestations/hr before.

It will be interesting to see if effectiveness improves: before the patch it would hover between 90-95% due to the occasional sub-optimal inclusion distance. I’m hoping to see improvement after 12 hrs.

steflsd commented Sep 19, 2022

I'm seeing missed attestations with a Prysm/Besu configuration. Possibly relevant from Prysm is this repeating message:

time="2022-09-19 07:44:23" level=info msg="Subscribed to topic" prefix=sync topic="/eth2/4a26c58b/beacon_attestation_2/ssz_snappy"
time="2022-09-19 07:44:33" level=error msg="Could not handle p2p pubsub" error="could not process block: could not validate new payload: timeout from http.Client: received an undefined ee error" prefix=sync topic="/eth2/4a26c58b/beacon_block/ssz_snappy"

@ibhagwan

Sadly I can report that this PR did not fix missed attestations with Nimbus; roughly the same number of attestations are missed and effectiveness won’t go beyond 94-95% (it was 97-98% pre-Merge).

@jflo jflo merged commit 5319376 into hyperledger:main Sep 19, 2022

siladu commented Sep 19, 2022

Missed attestations despite this fix are being tracked in #4400.

eum602 pushed a commit to lacchain/besu that referenced this pull request Nov 3, 2023
…e API requests are executed sequentially. (hyperledger#4410)

Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>

Successfully merging this pull request may close these issues.

JsonRpcExecutorHandler | Error streaming JSON-RPC response