-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: recover watch stream on more error types #9995
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Reviewed error codes map back to the Go list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting that several of these errors are noted as non-retryable under https://aip.dev/194 , however I don't have enough context specific to firestore and this client to understand whether this is problematic or not.
@schmidt-sebastian Can you take a look at the list of error codes here and weigh in if they are okay to retry or not? |
I should note in theory, |
Can we match Node? The list is even more permissive: https://github.com/googleapis/nodejs-firestore/blob/25472e11a0e1a4a5e1931b1652d125f9c8cabf11/dev/src/watch.ts#L817 |
@jadekler voiced concerns about many of the retry codes in Go that I copied. I also found that this didn't fully stop the issue. It is possible I could match node. I have a debug session running and am waiting for a failure so I can dig into what went on. |
I left this run for a few days. I think if we just retry
|
Did you figure out whether RST_STREAM was being returned as an INTERNAL error or not? If so, the change you propose makes sense to me.
INTERNAL in addition to UNAVAILABLE, or just by itself? (UNAVAILABLE should always be retried) |
I would prefer if we used the same retry configuration everywhere, and the Node SDK has the configuration that is most battle-tested. We should try to retry every Watch request unless we know beforehand that a retry will not help (e.g. "PERMISSION_DENIED"). If we don't do this, our users will, and they will do so without backoff. |
@jadekler I haven't gotten an answer on that yet, but discussion is ongoing at b/144734355. I think for now @schmidt-sebastian has a reasonable point. Customer of watch are determined to keep it running, likely to the point of personally retrying any of the codes. If we retry most all of them, but with sensible timeouts, that is likely better than leaving it to chance. I will modify this PR to match Node.js |
Merging this. We have further discussion internally on whether RST_STREAM should occur with the error type we are seeing (INTERNAL), but this ought to resolve the issues for users currently. We can always soften later if this becomes unecessary. |
Watch Retry is more permissive in Go. This PR replicates that in Python.
Fixes #9890 and b/144734355