storage: TestRaftRemoveRace failed under stress #15687
Comments
This doesn't look particularly scary, but it does appear to be new. I'll be curious to see whether it happens again. |
Bisected this to f67a508! Reproduces quite readily on my azworker with |
SHA: https://github.com/cockroachdb/cockroach/commits/4e68800f6b47c22e142b48fdb390a0e5d249a8cf Parameters:
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=253814&tab=buildLog |
SHA: https://github.com/cockroachdb/cockroach/commits/11987dc7e2f20f6417925c638f45c2067435b17f Parameters:
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=256725&tab=buildLog |
SHA: https://github.com/cockroachdb/cockroach/commits/7fb6d983651d77cbd200f3553ce842a82fcf30ff Parameters:
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=257860&tab=buildLog |
Reassigning to @petermattis |
Reproduces readily enough:
I'm seeing what appears to be an incorrect retry. The test is executing a series of
Not shown is that the last error is Two things seem suspicious here:
@spencerkimball This is in your area of expertise (or, at least, you touched it recently). @tamird Weren't you looking at the |
Also @bdarnell, since you've had thoughts about timeouts recently. |
I did look at the timeouts, but ended up only increasing the abandonment timeout (#16088), which shouldn't affect this scenario given your description above. There seems to be growing doubt about the usefulness of |
Ah, actually this might have been exacerbated by #16088 which also introduced not waiting on outstanding RPCs if there's no |
For context, removing SendNextTimeout has been discussed in #16119 (comment) and #6719 (comment).
It's certainly fragile and easy to break (and we have), but with the waiting restored in #16181 I think things are back in working order. Are there any remaining known issues? Either way, this feature is causing a lot of trouble for something that we have a hard time even articulating the benefit of. We should increase |
10m is intended to test the hypothesis that we can remove this timeout completely. If setting it to a high value causes problems, we want to know about it so we can see if there are better ways of handling them than this timeout. In the network partition scenario you describe, the gRPC connection health checking should abort the connection well before we'd hit a 10s timeout. |
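For readers who have not worked with this setting, the following is a minimal Go sketch of the general "send next" pattern as I understand it from this thread: if the first replica has not answered within the timeout, the request is also sent to the next replica, and the sender keeps waiting on RPCs that are still outstanding. This is not CockroachDB's actual DistSender code. The names `replica`, `fakeReplica`, and `sendToReplicas` are hypothetical, and the millisecond values are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// replica is a hypothetical stand-in for one RPC target.
type replica func() (string, error)

// fakeReplica returns a replica that answers with its name after
// delayMillis, or with an error if fail is set.
func fakeReplica(name string, delayMillis int, fail bool) replica {
	return func() (string, error) {
		time.Sleep(time.Duration(delayMillis) * time.Millisecond)
		if fail {
			return "", errors.New(name + " failed")
		}
		return name, nil
	}
}

type result struct {
	reply string
	err   error
}

// sendToReplicas tries replicas in order. It moves on to the next replica
// when the current attempt errors or when sendNextTimeoutMillis elapses,
// and it keeps waiting on outstanding RPCs once no replicas are left.
func sendToReplicas(replicas []replica, sendNextTimeoutMillis int) (string, error) {
	if len(replicas) == 0 {
		return "", errors.New("no replicas")
	}
	timeout := time.Duration(sendNextTimeoutMillis) * time.Millisecond
	// Buffered so that late replies never block their goroutine.
	done := make(chan result, len(replicas))
	next, pending := 0, 0
	var lastErr error

	sendNext := func() {
		r := replicas[next]
		next++
		pending++
		go func() {
			reply, err := r()
			done <- result{reply, err}
		}()
	}

	sendNext()
	for pending > 0 {
		// Arm the timer only while another replica is left to try.
		// A nil channel blocks forever, so the select then waits on done alone.
		var timer *time.Timer
		var timerC <-chan time.Time
		if next < len(replicas) {
			timer = time.NewTimer(timeout)
			timerC = timer.C
		}
		select {
		case res := <-done:
			if timer != nil {
				timer.Stop()
			}
			pending--
			if res.err == nil {
				return res.reply, nil
			}
			lastErr = res.err
			if next < len(replicas) {
				sendNext()
			}
		case <-timerC:
			sendNext()
		}
	}
	return "", lastErr
}

func main() {
	reply, err := sendToReplicas([]replica{
		fakeReplica("r1", 200, false), // slow first replica
		fakeReplica("r2", 10, false),  // fast second replica
	}, 50)
	fmt.Println(reply, err)
}
```

In this sketch, setting the timeout very high effectively disables the "send next" path, which is the experiment the 10m value is meant to run: the next replica is then only tried after an error.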
This hasn't been seen since #16181. |
SHA: https://github.com/cockroachdb/cockroach/commits/e41b5f3c13f492369d48d09c921b3732209f11e2
Parameters:
Stress build found a failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=242304&tab=buildLog