-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clientv3: only update initReq.rev == 0 with watch revision #7795
clientv3: only update initReq.rev == 0 with watch revision #7795
Conversation
2946c73
to
fd7e951
Compare
@@ -615,11 +615,17 @@ func (w *watchGrpcStream) serveSubstream(ws *watcherStream, resumec chan struct{ | |||
// send first creation event only if requested | |||
if ws.initReq.createdNotify { | |||
ws.outc <- *wr | |||
if ws.initReq.rev == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some comments here? (when resume for a disconnection, we start watcher from a known previous revision not from the current revision, so do not update it to header.revision)
fd7e951
to
fad4de4
Compare
@heyitsanthony How did you find this bug? from a test failure? |
clientv3/integration/watch_test.go
Outdated
testWatchResumeCompacted(t, nil, nil) | ||
} | ||
|
||
func TestWatchResumeCompactedFrequentDisconnect(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to develop a dedicated test for the case we fix?
i feel TestWatchResumeCompacted
is already hard enough to reason about (it is not deterministic. we have to sub case to care about internally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's non-deterministic by nature since it relies on disconnect behavior. If it's going to be deterministic I'd need to add failpoints to the testing infrastructure. If I wrote a separate test it'd be essentially duplicate what's there now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try some ways to do deterministic without failpoints but it'll change the behavior that's being tested. OK, whatever.
The fix looks good to me. |
fad4de4
to
317a75a
Compare
This pr fixes my issue as well. My issue:
It would be nice to have a simple test case that cover the above situation. |
317a75a
to
f9efacb
Compare
if resp, ok := <-wch; !ok || resp.Header.Revision != 4 { | ||
t.Fatalf("got (%v, %v), expected create notification rev=4", resp, ok) | ||
} | ||
// pause wch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible that before you DropConnections, watch receives and buffers event ("a", rev=3)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but I tested it without the fix and it fails reliably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the following code fail due to having events buffered?
clus.Members[0].DropConnections()
clus.Members[0].PauseConnections()
select {
case resp, ok := <-wch:
t.Fatalf("wch should block, got (%+v, %v)", resp, ok)
case <-time.After(100 * time.Millisecond):
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I can make it skip instead of fatal, but CI is typically slower rather than faster
@fanminshi the current test essentially already does that, all that would have to change is: if _, err := cli.Put(context.TODO(), "a", "4"); err != nil { instead having: if _, err := cli.Put(context.TODO(), "b", "4"); err != nil { Testing based on a hang won't fail fast-- it'll have to wait for the timeout to catch the failure, instead of failing as soon as it retrieving a bad value. I don't understand how a second test would catch any new failure modes since if there's a timeout for some other reason, it'll still fail. Seems redundant. |
@heyitsanthony understood! |
@heyitsanthony LGTM. Thanks! |
f9efacb
to
6fea1c5
Compare
lgtm, thanks! |
98203d4
to
cf08d28
Compare
Resetting connections sometimes isn't enough; need to stop/resume accepting connections for some tests while keeping the member up.
Always updating the initReq.rev on watch create will resume from the wrong revision if initReq is ever nonzero.
cf08d28
to
4ab818a
Compare
Always updating the initReq.rev on watch create will resume from the wrong
revision if initReq is ever nonzero.