
fix(lit-core): LIT-4016 - Enhance error handling during epoch changes #710

Open — wants to merge 3 commits into base: master

Conversation

@MaximusHaximus (Contributor) commented Nov 4, 2024

Description

Updated error handling and logic flow for epoch change events, and for signalling errors during epoch change processing to consumers/listeners.

  • LitCore is now an event emitter
    • Used EventEmitter3 which is extremely widely used, ESM compatible, and written in ES3 for maximum compatibility (!)
  • Events for disconnected, connected and error are emitted
  • We no longer call _stopListeningForNewEpoch() during _connect(); concurrent calls to connect() are chained automatically already.
    • This fixes a corner case where failures to connect during event handling would cause LitCore to stop listening for future epoch changes.
  • Error handling / execution flow updated in _handleStakingContractStateChange()
    • Errors encountered anywhere in the process will emit error/disconnected events appropriately
    • this.ready is now set to false when epoch change events are not processed correctly
  • We now throw an explicit `NotReady` error from `getSessionSigs()` if we reach the point where we're going to map across `this.connectedNodes` but the LitCore instance is not ready.
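The event surface described above can be sketched roughly as follows. This is an illustrative shape, not the actual LitCore implementation: the class name, `handleEpochChange`, and the `refresh` callback are hypothetical, and Node's built-in `EventEmitter` stands in for EventEmitter3 (which exposes a compatible `on`/`emit` API). Only the event names (`connected`, `disconnected`, `error`) and the ready-flag behavior come from the PR description.

```typescript
import { EventEmitter } from "events";

// Hypothetical sketch of the new error-signalling flow: any failure
// during epoch change processing flips `ready` to false and notifies
// listeners via `disconnected` and `error` instead of failing silently.
class LitCoreLike extends EventEmitter {
  ready = false;

  async handleEpochChange(refresh: () => Promise<void>): Promise<void> {
    try {
      await refresh();
      this.ready = true;
      this.emit("connected");
    } catch (err) {
      // Epoch change was not processed correctly: mark not-ready and
      // surface the failure to consumers/listeners.
      this.ready = false;
      this.emit("disconnected");
      this.emit("error", err);
    }
  }
}
```

A consumer would subscribe with `core.on("disconnected", …)` and decide whether to retry `connect()` or surface the outage.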

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

  • Ran all local-tests

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

- LitCore is now an event emitter
- Events for `disconnected`, `connected` and `error` are emitted
- We no longer call `_stopListeningForNewEpoch()` during `_connect()` - concurrent calls to `connect()` are chained automatically
- Error handling / execution flow updated in `_handleStakingContractStateChange()`
@@ -492,11 +507,6 @@ export class LitCore {
}

private async _connect() {
// Ensure an ill-timed epoch change event doesn't trigger concurrent config changes while we're already doing that
@MaximusHaximus (Contributor Author) commented Nov 4, 2024:

Removed this call to stop listening -- if the code inside our listener calls connect(), it will just chain onto any already-pending connect() logic (see the connect() logic and the promise chain on `this._connectingPromise`).

This doesn't make the client self-healing, but it does mean that a failure won't leave the client in a state where it is no longer listening for further epoch change events.

Collaborator:

But will it chain the promises since we specifically check that if a pending connection is open then just return it?

@MaximusHaximus (Contributor Author):

We always re-set `_connectingPromise` to `null` when a `connect()` call finishes -- so while we are connecting, multiple calls to `connect()` will all return the same promise. This means that if we're still processing `connect()` when we receive another epoch change event, the second call is effectively a no-op.

It occurred to me that we could implement cancellation across this entire callstack, and then cancel the entire existing call to connect() (along with any pending fetch() calls that it is running), and then run it again from the top. I'd like to implement a much more robust network handling layer based on the v7 branch and land this as-is for now, since it's an improvement over what we've got.
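The "shared pending promise" pattern discussed in this thread can be sketched as below. The names (`Connector`, `doConnect`, `attempts`) are hypothetical stand-ins, not the actual LitCore members; only the behavior — concurrent callers join one in-flight attempt, and the slot is cleared when the attempt settles — reflects what the comments describe.

```typescript
// Minimal sketch: concurrent connect() calls share one in-flight
// promise rather than starting a second connection attempt.
class Connector {
  private _connectingPromise: Promise<void> | null = null;
  attempts = 0; // instrumentation for the sketch only

  connect(doConnect: () => Promise<void>): Promise<void> {
    if (this._connectingPromise) {
      // e.g. an epoch-change handler calling connect() mid-connect
      // just joins the pending attempt.
      return this._connectingPromise;
    }
    this.attempts += 1;
    this._connectingPromise = doConnect().finally(() => {
      // Clear the slot whether the attempt succeeded or failed, so a
      // later call starts a fresh attempt.
      this._connectingPromise = null;
    });
    return this._connectingPromise;
  }
}
```

Because the slot is cleared in `finally`, a failed attempt does not wedge the client: the next epoch change event can trigger a brand-new `connect()`.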

…ected` event is being emitted as expected

- Removed misleading globalThis console.warn -- it was backwards, in that our code doesn't actually override existing entries on `globalThis` -- it actually _skips_ initializing any that already exist
…ating across `this.connectedNodes` to build sessionSigs, but the client is disconnected - the sessionSigs map could be incomplete in this case
@MaximusHaximus MaximusHaximus marked this pull request as ready for review November 4, 2024 17:58
@DashKash54 (Collaborator):

Can you please confirm that in `_connect()`, even if we're unable to handshake with a single node, we never set `this.ready = true`? @MaximusHaximus @Ansonhkg

@DashKash54 (Collaborator):

Currently the loading of modules in `connect()` is coupled with the node handshake? It's possible that the handshake fails for some reason, so `this.ready = false`, but we'll still try to reload the WASM modules even though they've already been loaded onto `globalThis`? Maybe multiple loads aren't that bad, or shall we revisit this?

@DashKash54 (Collaborator):

Are we able to successfully and repeatedly reproduce this error, @MaximusHaximus? How has the fix been tested?

@DashKash54 (Collaborator) left a review:

Left a few comments

'Error while attempting to reconnect to nodes after epoch transition:',
message
} else {
// In case of centralised networks, we don't run `connect()` flow, so we will manually update epochInfo here
Collaborator:

Why is a centralized network treated differently? The only difference should be in the attestation but the handshake should be invariant to the network?

@MaximusHaximus (Contributor Author):

My understanding is that for centralized networks, the node list doesn't change, so re-handshaking with all the nodes on epoch change would be redundant; if that's not true, we should definitely fix it...


@@ -2203,6 +2203,18 @@ export class LitNodeClientNodeJs

const signatures: SessionSigsMap = {};

if (!this.ready) {
Collaborator:

Okay, `getSessionSigs()` was the only function missing the `this.ready` check. Don't we actually check this in `signSessionKey()`?

@MaximusHaximus (Contributor Author):

We do check it in signSessionKey(), but that method isn't called from getSessionSigs() -- it is only called from the authNeededCallback() defined in getPkpSessionSigs() and in some auth providers -- so I added it here for completeness 👍
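The guard being discussed can be sketched as follows. The function shape, `NotReadyError` class, and placeholder signatures are illustrative assumptions, not the SDK's actual types; the point carried over from the PR is that `getSessionSigs()` must fail loudly before mapping over connected nodes, so callers never receive a partially-populated sessionSigs map.

```typescript
// Hypothetical error type standing in for the SDK's NotReady error.
class NotReadyError extends Error {}

// Sketch of the new check: refuse to build sessionSigs while the
// client is disconnected, instead of silently returning an
// incomplete map keyed by whatever nodes happen to be present.
function buildSessionSigs(
  ready: boolean,
  connectedNodes: Set<string>
): Record<string, string> {
  if (!ready) {
    throw new NotReadyError(
      "Client is not ready; call connect() and wait for it to resolve"
    );
  }
  const signatures: Record<string, string> = {};
  for (const node of connectedNodes) {
    signatures[node] = `sig-for-${node}`; // placeholder signature value
  }
  return signatures;
}
```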

@MaximusHaximus (Contributor Author) replied:

> Can you please confirm that in `_connect()`, even if we're unable to handshake with a single node, we never set `this.ready = true`?

Yes, that's right -- if any handshake fails, the entire `connect()` chain rejects immediately.
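The fail-fast behavior described in this answer matches `Promise.all` semantics, sketched below. The function name and return shape are hypothetical, not the actual `_connect()` internals; the sketch only shows why a single failed handshake prevents `ready` from ever being set on partial success.

```typescript
// Sketch: handshaking with every node via Promise.all means one
// rejection rejects the whole attempt, so the `ready: true` line
// below is unreachable unless ALL handshakes succeed.
async function handshakeAll(
  nodes: string[],
  handshake: (node: string) => Promise<string>
): Promise<{ ready: boolean; keys: string[] }> {
  const keys = await Promise.all(nodes.map(handshake)); // fail-fast
  return { ready: true, keys };
}
```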

@MaximusHaximus (Contributor Author) replied:

> Currently the loading of modules in `connect()` is coupled with the node handshake? It's possible that the handshake fails, so `this.ready = false`, but we'll still try to reload the WASM modules even though they've already been loaded onto `globalThis`?

Our crypto module code checks whether the relevant property on `globalThis` has already been set, and if it has, it does nothing at all. I agree it's a bit confusing -- the good news is that in v7+, we no longer keep global state around and there is no loadModules() method -- so this is temporary logic, thankfully :)
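The guard being described follows this shape. The property name `__litWasmModule` and the loader function are invented for illustration (the real property name isn't shown in this thread); the behavior sketched — skip initialization when the global already exists, never overwrite it — is what the comment describes.

```typescript
// Instrumentation for the sketch only: counts real initializations.
let loadCount = 0;

// Sketch of an idempotent global loader: repeated calls (e.g. from a
// retried connect()) initialize the global at most once and never
// overwrite an existing entry.
function loadWasmModule(): void {
  const g = globalThis as Record<string, unknown>;
  if (g["__litWasmModule"] !== undefined) {
    // Already initialized elsewhere: skip, don't overwrite.
    return;
  }
  loadCount += 1;
  g["__litWasmModule"] = { loadedAt: Date.now() };
}
```

Under this pattern a re-run of `connect()` after a failed handshake re-invokes the loader harmlessly, which is why multiple loads end up being benign.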

@MaximusHaximus (Contributor Author) replied:

> Are we able to successfully and repeatedly reproduce this error? How has the fix been tested?

Unfortunately this is a very corner-case issue that is caused by failures in entirely internal code. It also requires that we trigger epoch changes to actually verify that the fix is working :(

With very creative Jest mocks and long-running Shiva tests, I could write a test case that reproduces it consistently and verifies this fix stays in place, but our current local-tests don't really facilitate this degree of testing, and we don't have live epoch-change tests yet :(. I was, however, able to test the fix by adding manual `throw new Error()` calls at specific spots in the SDK callstack and verifying that the SDK followed the right code paths 👍
