Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clientv3: only update initReq.rev == 0 with watch revision #7795

Merged
merged 3 commits into from
Apr 22, 2017

Conversation

heyitsanthony
Copy link
Contributor

Always updating the initReq.rev on watch create will resume from the wrong
revision if initReq is ever nonzero.

@@ -615,11 +615,17 @@ func (w *watchGrpcStream) serveSubstream(ws *watcherStream, resumec chan struct{
// send first creation event only if requested
if ws.initReq.createdNotify {
ws.outc <- *wr
if ws.initReq.rev == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some comments here? (when resume for a disconnection, we start watcher from a known previous revision not from the current revision, so do not update it to header.revision)

@xiang90
Copy link
Contributor

xiang90 commented Apr 21, 2017

@heyitsanthony How did you find this bug? from a test failure?

@heyitsanthony
Copy link
Contributor Author

@xiang90 from investigating #7709, but likely an unrelated issue.

testWatchResumeCompacted(t, nil, nil)
}

func TestWatchResumeCompactedFrequentDisconnect(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to develop a dedicated test for the case we fix?

i feel TestWatchResumeCompacted is already hard enough to reason about (it is not deterministic. we have to sub case to care about internally)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non-deterministic by nature since it relies on disconnect behavior. If it's going to be deterministic I'd need to add failpoints to the testing infrastructure. If I wrote a separate test it'd be essentially duplicate what's there now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try some ways to do deterministic without failpoints but it'll change the behavior that's being tested. OK, whatever.

@xiang90
Copy link
Contributor

xiang90 commented Apr 21, 2017

The fix looks good to me.

@fanminshi
Copy link
Member

This pr fixes my issue as well.

My issue:

  1. send create req on initReq{key:foo, rev:2} and recv create resp back.
  2. update initReq.rev to resp.Header.Revision let's say it is 100.
  3. conn disconnect.
  4. watchGrpcStream resends initReq{key:foo, rev:100}.
  5. wch hangs because watch on rev 100 instead of rev 2.

It would be nice to have a simple test case that cover the above situation.
I can write one in a different pr when this gets merged.

if resp, ok := <-wch; !ok || resp.Header.Revision != 4 {
t.Fatalf("got (%v, %v), expected create notification rev=4", resp, ok)
}
// pause wch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible that before you DropConnections, watch receives and buffers event ("a", rev=3)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but I tested it without the fix and it fails reliably.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the following code fail due to having events buffered?

clus.Members[0].DropConnections()
clus.Members[0].PauseConnections()

select {
 	case resp, ok := <-wch:
 		t.Fatalf("wch should block, got (%+v, %v)", resp, ok)
 	case <-time.After(100 * time.Millisecond):
 	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I can make it skip instead of fatal, but CI is typically slower rather than faster

@heyitsanthony
Copy link
Contributor Author

@fanminshi the current test essentially already does that, all that would have to change is:

if _, err := cli.Put(context.TODO(), "a", "4"); err != nil {

instead having:

if _, err := cli.Put(context.TODO(), "b", "4"); err != nil {

Testing based on a hang won't fail fast-- it'll have to wait for the timeout to catch the failure, instead of failing as soon as it retrieving a bad value. I don't understand how a second test would catch any new failure modes since if there's a timeout for some other reason, it'll still fail. Seems redundant.

@fanminshi
Copy link
Member

@heyitsanthony understood!

@xiang90
Copy link
Contributor

xiang90 commented Apr 21, 2017

@heyitsanthony LGTM. Thanks!

@fanminshi
Copy link
Member

lgtm, thanks!

@heyitsanthony heyitsanthony force-pushed the dont-force-initrev branch 3 times, most recently from 98203d4 to cf08d28 Compare April 22, 2017 03:20
Anthony Romano added 3 commits April 21, 2017 20:22
Resetting connections sometimes isn't enough; need to stop/resume
accepting connections for some tests while keeping the member up.
Always updating the initReq.rev on watch create will resume from the wrong
revision if initReq is ever nonzero.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants