xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache #7851

purnesh42H · 2024-11-17T12:29:39Z

Addresses: #7819

If a watch is registered for a listener resource that is already present in the cache with an old good update as well as a more recent NACK error, the new watcher should receive both the good update and the error, without a new resource request being sent to the management server.

RELEASE NOTES:

xds: fixed an edge-case issue where some clients or servers would not receive errors if another channel or server with the same target was already in use.

codecov · 2024-11-17T17:52:23Z

Codecov Report

Attention: Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.89%. Comparing base (87f0254) to head (742da1b).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
xds/internal/xdsclient/authority.go	57.14%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7851      +/-   ##
==========================================
+ Coverage   81.74%   81.89%   +0.15%     
==========================================
  Files         375      375              
  Lines       37980    37986       +6     
==========================================
+ Hits        31045    31110      +65     
+ Misses       5622     5581      -41     
+ Partials     1313     1295      -18

Files with missing lines	Coverage Δ
xds/internal/xdsclient/authority.go	`76.82% <57.14%> (+1.38%)`	⬆️

... and 21 files with indirect coverage changes

easwars · 2024-11-18T23:27:32Z

I think this might be worth release noting.

xds/internal/xdsclient/authority.go

easwars · 2024-11-18T23:30:44Z

xds/internal/xdsclient/tests/lds_watchers_test.go

+
+	// Register another watch for the same resource. This should get the update
+	// and error from the cache.
+	lw2 := newListenerWatcherMultiple(2)


Why do we need this new listener watcher type? Why can't we handle this case with the existing listenerWatcher type?

The current listenerWatcher has a channel of size 1, and notifications get replaced. For this fix we need to verify both the good update and the error as two separate notifications, so we need a channel with buffer size > 1. But yeah, we don't need a new listenerWatcher struct; we can just modify the current one to have another constructor that accepts the size, and update OnError to not replace if size > 1, which is what I did.

Ah, the way I used the existing listenerWatcher struct, it was missing a resource update during the race test. I have added the separate struct back for handling a variable-size update channel, and the race went away. I didn't get a chance to fully debug why it was happening. But maybe it's fine to have a separate struct to hold multiple updates?

I think the correct way would be to change the listenerWatcher to have multiple channels: one each for update, error and resource not found. That way one callback will not interfere with another callback. But this change would touch a lot of tests.

I wanted to do this change when I was working on some of the refactors recently, but never got around to doing that. I would recommend making that change in a separate PR though. What do you think?

Oh yeah, I think I can send a separate PR for that. Having 3 channels, one for each callback, is a good idea. Should this fix be blocked on that though?

I'd be Ok if we create an issue for the same and add a TODO in here to remove this new listener watcher type once that issue is taken care of.

Filed an issue #7864. It should be simple as well. Added TODO for the new watcher in this PR.

xds/internal/xdsclient/tests/lds_watchers_test.go

purnesh42H · 2024-11-19T16:20:44Z

Actually, looks like there is some race condition in the test but feel free to review. Debugging the race condition.

easwars

We are still missing a test case for the scenario where the first response from the management server is NACKed, and a second watcher is registered for that same resource (and we expect the error callback to be invoked on that watcher). You could enhance the existing TestLDSWatch_NACKError test for this.

xds/internal/xdsclient/authority.go

xds/internal/xdsclient/tests/lds_watchers_test.go

easwars · 2024-11-19T22:07:39Z

xds/internal/xdsclient/tests/lds_watchers_test.go

+		t.Fatalf("timeout when waiting for a listener resource from the management server: %v", err)
+	}
+	gotErr = u.(listenerUpdateErrTuple).err
+	if gotErr == nil || !strings.Contains(gotErr.Error(), wantListenerNACKErr) {


The error type in the xdsresource package provides a way to get to the underlying error type. See: https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/xdsresource/errors.go#L61

I think we should use that instead of string comparison.

I agree existing tests are not doing that. But that is not a good enough reason to make new code not do that. In fact, it would be wonderful if existing code could be cleaned up as well.

I think for NACK error of no routing, we don't have any specific error type. So, type will always be unknown. That's why we are verifying the string.

What do you think about adding a new type for NACK errors? If you think it makes sense to do that, we should file an issue to track it and eventually get to it. Might be good to get to (and fixing tests) before making the client public.

Filed an issue #7863. I think it should be a simple change, as I described in the issue, to set the NACK error type while decoding. It can still be a separate PR, though, because we will have to update all the tests.

xds/internal/xdsclient/tests/lds_watchers_test.go

easwars · 2024-11-19T22:13:11Z

xds/internal/xdsclient/tests/lds_watchers_test.go

xds/internal/xdsclient/tests/lds_watchers_test.go

purnesh42H · 2024-11-20T15:44:13Z

We are still missing a test case for the scenario where the first response from the management server is NACKed, and a second watcher is registered for that same resource (and we expect the error callback to be invoked on that watcher). You could enhance the existing TestLDSWatch_NACKError test for this.

I had done the same in #7851 (comment), and this is the change: https://github.com/grpc/grpc-go/pull/7851/files#diff-33ea1a548fc69853905a83ab6f29daba065a266e4145ff1db859e74ca8064ad3R939. Did I miss anything?

easwars · 2024-11-20T20:14:48Z

xds/internal/xdsclient/tests/lds_watchers_test.go

+	if err != nil {
+		t.Fatalf("Timeout when waiting for a listener resource from the management server: %v", err)
+	}
+	gotErr := u.(listenerUpdateErrTuple).err
+	if gotErr == nil || !strings.Contains(gotErr.Error(), wantListenerNACKErr) {
+		t.Fatalf("Update received with error: %v, want %q", gotErr, wantListenerNACKErr)
+	}


Should we implement a helper for this?

// verifyListenerError waits for the next notification on updateCh and checks
// that it carries an error whose message contains wantErr.
func verifyListenerError(ctx context.Context, updateCh *testutils.Channel, wantErr string) error {
	u, err := updateCh.Receive(ctx)
	if err != nil {
		return fmt.Errorf("timeout when waiting for a listener update from the management server: %v", err)
	}
	gotErr := u.(listenerUpdateErrTuple).err
	if gotErr == nil || !strings.Contains(gotErr.Error(), wantErr) {
		return fmt.Errorf("update received with error: %v, want %q", gotErr, wantErr)
	}
	return nil
}

Done. Though we need the same helper for all other resource types as well. Will send a separate PR.

easwars · 2024-11-21T17:39:47Z

xds/internal/xdsclient/tests/lds_watchers_test.go

@@ -94,6 +94,8 @@ type listenerWatcherMultiple struct {
 	updateCh *testutils.Channel
 }

+// TODO: delete this once `newListenerWatcher` is modified to handle multiple


Please link the issue here.

purnesh42H force-pushed the new-watcher-caching-behavior branch from a3c89bb to 7e4c73d Compare November 17, 2024 17:47

purnesh42H force-pushed the new-watcher-caching-behavior branch 2 times, most recently from 30597fe to d2aae7d Compare November 17, 2024 18:42

purnesh42H changed the title ~~xds/internal/xdsclient/test: new watcher resource caching behavior~~ xds/internal/xdsclient/test: add test to verify a new watcher gets old good update and nack error from the cache Nov 17, 2024

purnesh42H changed the title ~~xds/internal/xdsclient/test: add test to verify a new watcher gets old good update and nack error from the cache~~ xdsclient/test/lds_watchers_test: add test to verify a new watcher gets old good update and nack error from the cache Nov 17, 2024

purnesh42H force-pushed the new-watcher-caching-behavior branch from d2aae7d to 5da90a2 Compare November 18, 2024 07:38

purnesh42H added this to the 1.69 Release milestone Nov 18, 2024

purnesh42H added the Type: Bug label Nov 18, 2024

purnesh42H changed the title ~~xdsclient/test/lds_watchers_test: add test to verify a new watcher gets old good update and nack error from the cache~~ xds/internal/xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache Nov 18, 2024

purnesh42H force-pushed the new-watcher-caching-behavior branch 2 times, most recently from a9ee1a6 to f57c599 Compare November 18, 2024 10:03

purnesh42H requested a review from easwars November 18, 2024 18:40

purnesh42H assigned easwars Nov 18, 2024

easwars reviewed Nov 18, 2024

easwars assigned purnesh42H and unassigned easwars Nov 18, 2024

purnesh42H requested a review from easwars November 19, 2024 15:04

purnesh42H assigned easwars and unassigned purnesh42H Nov 19, 2024

purnesh42H assigned purnesh42H and unassigned easwars Nov 19, 2024

purnesh42H changed the title ~~xds/internal/xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache~~ xds: fix new watcher to get both old good update and nack error (if exist) from the cache Nov 19, 2024

purnesh42H force-pushed the new-watcher-caching-behavior branch from d4bf100 to 8ed2a3a Compare November 19, 2024 18:39

purnesh42H assigned easwars and unassigned purnesh42H Nov 19, 2024

easwars reviewed Nov 19, 2024

easwars assigned purnesh42H Nov 19, 2024

easwars removed their assignment Nov 19, 2024

purnesh42H requested a review from easwars November 20, 2024 15:47

purnesh42H assigned easwars and unassigned purnesh42H Nov 20, 2024

purnesh42H force-pushed the new-watcher-caching-behavior branch from 1d8bbe7 to df88c19 Compare November 20, 2024 19:39

easwars reviewed Nov 20, 2024

easwars assigned purnesh42H and unassigned easwars Nov 20, 2024

purnesh42H requested a review from easwars November 21, 2024 11:48

purnesh42H assigned easwars and unassigned purnesh42H Nov 21, 2024

easwars approved these changes Nov 21, 2024

easwars assigned purnesh42H and unassigned easwars Nov 21, 2024

easwars changed the title ~~xds: fix new watcher to get both old good update and nack error (if exist) from the cache~~ xdsclient: fix new watcher to get both old good update and nack error (if exist) from the cache Nov 21, 2024

purnesh42H added 10 commits November 22, 2024 00:52

xds/internal/xdsclient/test: new watcher resource caching behavior

e337e2b

test to verify both update and error are sent

9d6bd0e

test without cache verification

9a51cf3

move nack error out of cache condition, use same test listener watcher, add cache check

757053e

Add test case for new watcher getting nack while registering

3570449

remove second request verification check

777e12b

separate watcher struct

9ca1ed3

address nits, verify caching without version number

65d5760

helper function for unknown errors

0e1bc31

link issue for todo

742da1b

purnesh42H force-pushed the new-watcher-caching-behavior branch from 86a241c to 742da1b Compare November 21, 2024 19:22

purnesh42H merged commit 44a5eb9 into grpc:master Nov 21, 2024
15 checks passed
