Implement retry count and error state for stateful subscription. #27833
@steveluscher I think this is the easiest path forward on the issue.
very nice
Hey! Thanks for starting this work.
There's a lot to love about this PR, but also a ton of work left, if you're up for it. The addition of new states that a subscription can be in explodes the complexity of the state machine quadratically, and all of those cases have to have coverage in code and in tests. Actually, we probably need to add two states: errored and retry-exhausted.
Some notes to get things started:
- Missed a spot at connection.ts:5183; another juncture to catch errors and perform retries
- Need coverage in connection-subscriptions.test.ts that covers every possible state transition
  - websocket closes implicitly while in an error state
  - websocket is explicitly closed while in an error state
  - websocket reopens with subscriptions still in the error state
  - probably way more
- The cases that I've seen so far report extremely rapid error rates. In that context, retrying 3 times is probably essentially the same as failing right away. What you might consider instead is a system of exponential backoff – where the delays between retries become longer the more you make.
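
To make the backoff suggestion concrete, here is a minimal sketch of what it could look like. The names `backoffDelayMs` and `retryWithBackoff`, the 500 ms base delay, the 30 s cap, and the default of 5 retries are all assumptions for illustration; none of them come from web3.js or from this PR.

```typescript
// Hypothetical helper, not part of web3.js: delay before a given retry.
function backoffDelayMs(
  attempt: number, // 0-based index of the retry about to be made
  baseMs = 500, // assumed base delay
  maxMs = 30000, // assumed cap so delays do not grow without bound
): number {
  // Double the delay on every attempt, capped at maxMs.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical wrapper: run `operation`, retrying with growing delays.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 5, // assumed default
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (e) {
      // Give up and surface the last error once retries are used up.
      if (attempt >= maxRetries) throw e;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The `sleep` parameter is injectable so that tests can record the requested delays instead of actually waiting.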
All of that said, every error report anyone's ever sent to me includes the message ‘socket not open.’ That tells me that our socket open/close tracking is most of the problem here (#25578). I'm sort of inclined to see that fixed first, before implementing retries. If there's a Real Bug with connection tracking, then all retries will do is to mask that problem, without actually making anyone's experience better.
console.error(
  `${method} error, max retries reached`,
  args,
  e instanceof Error ? e.message : `${e}`,
);

delete this._subscriptionsByHash[hash];
What we don't have here yet is a recovery strategy. It's enough to drop the subscription to stop the recursive setup attempts, but then at that point you have an application that thinks it has an open subscription, but doesn't.
What's the plan here?
- Create an API to notify apps their subscriptions are dead?
- Revive errored subscriptions at some logical juncture in the future?
Quite possibly related: #27859.
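
For discussion, both recovery options raised above could be sketched roughly as follows. This is only an illustration under assumed names: `SubscriptionRegistry`, `onSubscriptionDead`, `markDead`, and `reviveAll` are hypothetical and do not exist in web3.js, and the real `_subscriptionsByHash` bookkeeping is more involved.

```typescript
// Hypothetical sketch; none of these names are web3.js APIs.
type DeadSubscriptionListener = (hash: string, reason: string) => void;

class SubscriptionRegistry {
  private active = new Map<string, { method: string }>();
  private dead = new Map<string, { method: string }>();
  private listeners = new Set<DeadSubscriptionListener>();

  add(hash: string, method: string): void {
    this.active.set(hash, { method });
  }

  // Option 1: let the app register interest in dead subscriptions.
  // Returns a function that removes the listener again.
  onSubscriptionDead(listener: DeadSubscriptionListener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Called where the subscription would otherwise be silently deleted.
  markDead(hash: string, reason: string): void {
    const entry = this.active.get(hash);
    if (entry === undefined) return;
    this.active.delete(hash);
    this.dead.set(hash, entry); // remembered so it can be revived later
    this.listeners.forEach((listener) => listener(hash, reason));
  }

  // Option 2: revive dead subscriptions at a logical juncture,
  // for example when the socket reopens. Returns the revived hashes.
  reviveAll(): string[] {
    const revived = Array.from(this.dead.keys());
    this.dead.forEach((entry, hash) => {
      this.active.set(hash, entry);
    });
    this.dead.clear();
    return revived;
  }

  isActive(hash: string): boolean {
    return this.active.has(hash);
  }
}
```

With something like this, the app no longer believes it holds an open subscription once setup has been abandoned, and the library still keeps enough information to try again later.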
Co-authored-by: Steven Luscher <steveluscher@users.noreply.github.com>
Problem
Running into the infinite recursion bug described in solana-labs/solana-web3.js#1106, in an environment with Node v16 and web3.js 1.56.2. It causes the Lambda to stall and time out.
Error example:
Summary of Changes
Implement an error state type that tracks retries; once max retries are hit for a stateful subscription, it logs the error and deletes the subscription.
Fixes solana-labs/solana-web3.js#1106
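
As a rough illustration of the retry-tracking idea, the state transition on a failed setup attempt might look like the sketch below. The type and function names are hypothetical and the PR's actual types may differ; the split into `errored` and `retry-exhausted` follows the suggestion made in review, and the limit of 3 retries is an assumption.

```typescript
// Hypothetical shapes for illustration; not the types used in connection.ts.
type SetupState =
  | { state: "pending" }
  | { state: "errored"; retryCount: number } // retryCount = retries that have failed so far
  | { state: "retry-exhausted"; retryCount: number };

const MAX_RETRIES = 3; // assumed limit

// Compute the next state after a subscription setup attempt fails.
function nextStateAfterFailure(
  current: SetupState,
  maxRetries: number = MAX_RETRIES,
): SetupState {
  // Once exhausted, further failures change nothing.
  if (current.state === "retry-exhausted") return current;
  // The first failure comes from the initial attempt, so no retry has failed yet.
  const retryCount = current.state === "errored" ? current.retryCount + 1 : 0;
  return retryCount >= maxRetries
    ? { state: "retry-exhausted", retryCount }
    : { state: "errored", retryCount };
}
```

Keeping the transition a pure function makes it straightforward to cover each state transition in tests without a live websocket.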