Fix Besu StacklessClosedChannelException errors and resulting timeout errors in CL clients #4410

Merged: 1 commit merged into hyperledger:main on Sep 19, 2022

Conversation

@ahamlat ahamlat commented Sep 17, 2022

Signed-off-by: Ameziane H ameziane.hamlat@consensys.net

PR description

StacklessClosedChannelException errors are thrown when some Eth calls, like eth_syncing or eth_getBlockByHash, have to wait for an engine_newPayloadV1 call to finish. As the latter sometimes takes more than 1 second to execute, depending on block size, transaction types and hardware setup, the CL (Lighthouse in this case) closes the connection because its timeout on these calls is set to 1 second.

In the screenshot below, we can see that one eth_getBlockByHash call and two eth_syncing calls are waiting for the engine_newPayloadV1 call to finish.

[Screenshot: eth_getBlockByHash and eth_syncing calls waiting for engine_newPayloadV1 to finish]

The Eth calls shouldn't execute sequentially with the Engine API calls; this fix allows several requests to execute concurrently on port 8551.
@garyschulte please confirm that the Engine API calls will still execute sequentially even with this new configuration.

You can find Lighthouse logs from before and after the fix.
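
For illustration only (this is not the code changed in this PR), here is a minimal Java sketch of the idea, using hypothetical names: engine_ methods stay on a single-threaded executor so they remain ordered and synchronous, while other JSON-RPC methods such as eth_syncing or eth_getBlockByHash run on a separate pool and can answer within the CL's 1 second timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch only, not Besu's JsonRpcExecutorHandler: engine_ methods
// share one thread (so newPayload/forkchoiceUpdated stay ordered), while other
// JSON-RPC methods run on a pool and are no longer blocked behind a slow import.
public class RpcDispatcherSketch {

  private final ExecutorService engineExecutor = Executors.newSingleThreadExecutor();
  private final ExecutorService ethExecutor = Executors.newFixedThreadPool(4);

  public CompletableFuture<Object> dispatch(final String method, final Supplier<Object> handler) {
    final ExecutorService target = method.startsWith("engine_") ? engineExecutor : ethExecutor;
    return CompletableFuture.supplyAsync(handler, target);
  }

  public static void main(final String[] args) {
    final RpcDispatcherSketch dispatcher = new RpcDispatcherSketch();
    // A slow engine_newPayloadV1 no longer delays eth_syncing past the CL's 1 second timeout.
    dispatcher.dispatch("engine_newPayloadV1", RpcDispatcherSketch::slowPayloadImport);
    final Object syncing = dispatcher.dispatch("eth_syncing", () -> Boolean.FALSE).join();
    System.out.println("eth_syncing answered immediately: " + syncing);
    dispatcher.engineExecutor.shutdown();
    dispatcher.ethExecutor.shutdown();
  }

  private static Object slowPayloadImport() {
    try {
      Thread.sleep(1500); // simulate a block import taking more than 1 second
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return "VALID";
  }
}
```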

Fixed Issue(s)

fixes #4398 and #4400

Documentation

  • I thought about documentation and added the doc-change-required label to this PR if updates are required.

Changelog

…e API requests are executed sequentially.

Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>
@kayagoban

I installed a Besu build patched with this PR and it has had little effect on my missed attestations. The correctly-voted-head percentage is still floating between 58% and 81%; before the Merge I was at 98%. But maybe this helps fix other problems. Good luck!

ahamlat commented Sep 17, 2022

@kayagoban thanks for your feedback. What is your hardware setup and your CL client? Do you see timeout errors on the CL side, and could you share Besu logs so we can check block processing time?

@siladu siladu left a comment

LGTM, as long as the message ordering property is maintained for forkChoiceUpdated as per https://github.com/ethereum/execution-apis/blob/main/src/engine/specification.md#message-ordering
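
For illustration, a hypothetical, self-contained check of that ordering property (not Besu's test suite): two engine_ calls submitted back to back to a single-threaded executor still complete in submission order, even though the second finishes its work much sooner.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical check, not Besu's test suite: two engine_ calls submitted back to
// back to a single-threaded executor; the second is much faster but still
// completes after the first, matching the Engine API message ordering requirement.
public class EngineOrderingCheck {

  public static void main(final String[] args) {
    final ExecutorService engineExecutor = Executors.newSingleThreadExecutor();
    final List<String> completionOrder = new CopyOnWriteArrayList<>();

    final CompletableFuture<Void> newPayload = CompletableFuture.runAsync(() -> {
      sleep(300); // simulated slow block import
      completionOrder.add("engine_newPayloadV1");
    }, engineExecutor);

    final CompletableFuture<Void> forkChoice = CompletableFuture.runAsync(() -> {
      sleep(10); // fast call, submitted second
      completionOrder.add("engine_forkchoiceUpdatedV1");
    }, engineExecutor);

    CompletableFuture.allOf(newPayload, forkChoice).join();
    // Prints [engine_newPayloadV1, engine_forkchoiceUpdatedV1]: submission order preserved.
    System.out.println(completionOrder);
    engineExecutor.shutdown();
  }

  private static void sleep(final long millis) {
    try {
      Thread.sleep(millis);
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```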

@garyschulte garyschulte left a comment

Confirmed that this fixes the eth_ concurrency issue and keeps the engine_ API calls ordered and synchronous 👍

As suggested by @jflo, we should also do additional WebSocket Engine API testing with Nimbus, to ensure that concurrent WebSocket requests are handled correctly when Nimbus uses WebSockets for the Engine API.

I will update with nimbus websocket testing results.

@kayagoban

I’m running Besu/Prysm. No timeout errors on the CL side. Maybe I can open a new issue with the logs I do have and my symptoms.

ahamlat commented Sep 18, 2022

@kayagoban Yes, please. We think we have different use cases around missed attestations. Please provide your hardware setup and your Besu and Prysm logs.

@ibhagwan

@garyschulte, running this patch on mainnet with nimbus websockets, just missed an attestation after about two hours of running the patched version.

@garyschulte

> @garyschulte, running this patch on mainnet with nimbus websockets, just missed an attestation after about two hours of running the patched version.

Thanks for testing. Other than continuing to miss attestations, did Nimbus/Besu work normally otherwise? What percentage of attestations were missed before and after?

ibhagwan commented Sep 19, 2022

> @garyschulte, running this patch on mainnet with nimbus websockets, just missed an attestation after about two hours of running the patched version.
>
> Thanks for testing. Other than continuing to miss attestations, did Nimbus/Besu work normally otherwise? What percentage of attestations were missed before and after?

Yes, other than the missed attestation it’s working normally. I can’t tell if there is a difference yet, as it was also missing roughly 1-2 attestations/hr before.

It will be interesting to see if effectiveness improves: before the patch it would hover between 90-95% due to the occasional sub-optimal inclusion distance. I’m hoping to see improvement after 12 hrs.

steflsd commented Sep 19, 2022

I'm seeing missed attestations with a Prysm/Besu configuration. Possibly relevant from Prysm is this repeating message:

time="2022-09-19 07:44:23" level=info msg="Subscribed to topic" prefix=sync topic="/eth2/4a26c58b/beacon_attestation_2/ssz_snappy"
time="2022-09-19 07:44:33" level=error msg="Could not handle p2p pubsub" error="could not process block: could not validate new payload: timeout from http.Client: received an undefined ee error" prefix=sync topic="/eth2/4a26c58b/beacon_block/ssz_snappy"

@ibhagwan

Sadly I can report that this PR did not fix missed attestations with Nimbus; roughly the same number of attestations are missed and effectiveness won’t go beyond 94-95% (it was 97-98% pre-Merge).

@jflo jflo merged commit 5319376 into hyperledger:main Sep 19, 2022

siladu commented Sep 19, 2022

Missed attestations despite this fix are being tracked in #4400.

eum602 pushed a commit to lacchain/besu that referenced this pull request Nov 3, 2023
…e API requests are executed sequentially. (hyperledger#4410)

Signed-off-by: Ameziane H <ameziane.hamlat@consensys.net>

Successfully merging this pull request may close these issues.

JsonRpcExecutorHandler | Error streaming JSON-RPC response