xdsclient: invoke watch callback when new update matches cached one, but previous update was NACKed (1/N) #7692

easwars · 2024-10-02T12:26:59Z

Consider the following sequence of events:

watcher registers a watch for resource A with the xDS client
xDS client requests resource A from the management server
management server sends resource A to the xDS client
xDS client ACKs the resource and invokes the watch callback
management server sends a new version of the resource
xDS client NACKs this version and invokes the watch callback with the error
- at this point, the previously ACKed version is still stored in the cache
management server sends a new version of the resource (the contents are the same as the first version which was ACKed)

This PR fixes a bug which was causing the xDS client to not invoke the watch callback at the last step of the above sequence of events.

This PR also contains a couple of other changes:

improve the error message when unmarshaling a listener resource
make the ACK/NACK tests e2e style
- instead of creating an xDS transport and calling methods on it, the tests now create an xDS client and rely on the xDS client API

This PR is the first in a sequence of PRs that refactor the xDS client codebase to support fallback.

#a71-xds-fallback
#xdsclient-refactor

Addresses #6902

RELEASE NOTES:

xdsclient: fix a bug where the watch callback was not invoked if a new update matches cached one, but previous update was NACKed

…but previous update was NACKed

codecov · 2024-10-02T12:29:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.84%. Comparing base (ca4865d) to head (333e5d8).
Report is 6 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7692      +/-   ##
==========================================
- Coverage   81.93%   81.84%   -0.09%     
==========================================
  Files         361      361              
  Lines       27822    27822              
==========================================
- Hits        22797    22772      -25     
- Misses       3838     3853      +15     
- Partials     1187     1197      +10

Files with missing lines	Coverage Δ
xds/internal/xdsclient/authority.go	`89.20% <100.00%> (ø)`
...ds/internal/xdsclient/xdsresource/unmarshal_lds.go	`88.46% <100.00%> (ø)`

... and 21 files with indirect coverage changes

zasweq

Some minor comments but overall LGTM.

zasweq · 2024-10-04T19:48:02Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+//  2. When a subsequent bad response is received, i.e. once is expected to be
+//     NACKed, the test verifies that a NACK is sent matching the previously
+//     ACKed version and current nonce from the response.
+//  3. When a subsequent goos response is received, the test verifies that an


Lol. When I first read this comment from you, I thought you were appreciating me for something :).

Fixed.

zasweq · 2024-10-04T19:49:49Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+		ResponseNonce: "",
+	}
+	if diff := cmp.Diff(gotReq, wantReq, protocmp.Transform()); diff != "" {
+		t.Fatalf("Unexpected diff in received discovery request, diff (-got, +want):\n%s", diff)


Hmmm you like new line now?

Did I comment any time in the past against newlines?

I like it here because the diff can be large. So, it makes sense to be on a separate line.

I've just never seen the diff outputted on a new line. I wasn't saying it's wrong, just interesting. Maybe I'll do this for large diffs too.

zasweq · 2024-10-04T19:51:15Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+	}
+	gotResp = r.(*v3discoverypb.DiscoveryResponse)
+
+	var wantNackErr = errors.New("unexpected http connection manager resource type")


wantNackErr := is better here?

Sure. Not sure why I had it the other way around.

Just FYI. There is a related topic (not exact though) in non-decisions: https://google.github.io/styleguide/go/decisions#non-decisions

zasweq · 2024-10-04T19:52:06Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+	// We expect the version to not change as this is a NACK.
+	r, err = streamRequestCh.Receive(ctx)
+	if err != nil {
+		t.Fatal("Timeout when waiting for ACK")


NACK. But it's not really specific, it's just a timeout waiting for a request on stream.

ACKs and NACKs are similar in a lot of ways to actual requests. They use the same message type and the same fields.

And the channel that we use here streamRequestCh is used to for all of them.

zasweq · 2024-10-04T19:54:27Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+	for {
+		r, err = streamResponseCh.Receive(ctx)
+		if err != nil {
+			t.Fatal("Timeout when waiting for the discovery response from client")
+		}
+		gotResp = r.(*v3discoverypb.DiscoveryResponse)
+
+		// Verify that the ACK contains the appropriate version and nonce.
+		r, err = streamRequestCh.Receive(ctx)
+		if err != nil {
+			t.Fatal("Timeout when waiting for ACK")
+		}
+		gotReq = r.(*v3discoverypb.DiscoveryRequest)
+		wantReq.VersionInfo = gotResp.GetVersionInfo()
+		wantReq.ResponseNonce = gotResp.GetNonce()
+		wantReq.ErrorDetail = nil
+		diff := cmp.Diff(gotReq, wantReq, protocmp.Transform())
+		if diff == "" {
+			break
+		}
+		t.Logf("Unexpected diff in received discovery request, diff (-got, +want):\n%s", diff)
+	}


Should this polling codeblock also exit with an error of never got expected discovery request if times out? I don't think it's right to fail because no discovery response or waiting for ack, as that is part of the polling operation under test but not the eventually consistent thing that is expected to happen.

If the test timeout expires, the first line in the for loop will see an error and the test will fail with Timeout when waiting for the discovery response from client. Do you want more here? If so, can you please be specific on what you want me to do? Thanks.

Oh this is related to the comment on the other PR. I didn't like the fact that failures occured within the operations of the polling loop, where the polling loop is testing another layer/behavior that is expected to be eventually consistent.

zasweq · 2024-10-04T19:55:09Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+	// Verify the update received by the watcher.
+	for ; ctx.Err() == nil; <-time.After(100 * time.Millisecond) {
+		if err := verifyListenerUpdate(ctx, lw.updateCh, wantUpdate); err != nil {
+			t.Logf("Failed to verify listener update, err: %v", err)


Optional nit: You fail earlier with just the err. Prepend the same text? Or remove this text?

The reason that this is a t.Logf and not a t.Errorf or t.Fatalf is because it will take some time for the watcher callback to be invoked. So, I want to log failures up until the expected listener update is received.

Actually I thought about this a little more and changed it to not log on every failure, but only to log the error when the test actually fails to receive the expected listener update and times out.

I was referring to the actual text in the error message (so fail with "something" + err vs. err).

zasweq · 2024-10-04T19:55:45Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+
+// Tests the case where the first response is invalid. The test verifies that
+// the NACK contains an empty version string.
+func (s) TestADS_ACK_NACK_InvalidFirstResponse(t *testing.T) {


Why is this test called ACK_NACK when it only NACK's?

Valid point. The only reason I have it the way I have it is to be able to do a go test <package name> -run Test/ADS_ACK_NACK and run all of these tests.

If you really want me to change it, do let me know and I will. Thanks.

Yeah I prefer it, but it's already merged so no worries haha.

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

purnesh42H · 2024-10-06T19:41:35Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+		t.Fatal("Timeout when waiting for the initial discovery request")
+	}
+	gotReq := r.(*v3discoverypb.DiscoveryRequest)
+	wantReq := &v3discoverypb.DiscoveryRequest{


Isn't the mentioned issue of "not acking the good update if last update was bad" exist for all resource types? Is it enough to just verify it for a listener resource?

not acking the good update if last update was bad
Where do we have this behavior? That doesn't sound right to me.

But anyways, we re-architected the xDS client transport to be agnostic of the resource type. So, there should not be any code in the xDS client transport layer that does different things based on the resource type. Any resource type specific behavior is mostly limited to parsing/decoding the resource and that is implemented in the xdsresource package which we sometimes refer to as the data-model layer. Therefore verifying ACK/NACK sematics behavior for one resource type is good enough.

purnesh42H · 2024-10-06T19:43:30Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+	}
+	gotReq = r.(*v3discoverypb.DiscoveryRequest)
+	if gotNonce, wantNonce := gotReq.GetResponseNonce(), gotResp.GetNonce(); gotNonce != wantNonce {
+		t.Errorf("Unexpected nonce in discovery request, got: %v, want: %v", gotNonce, wantNonce)


nit: is there any convention we follow on when to use t.Errorf or t.Fatalf? If not, why not always use t.Fatalf to be consistent?

https://google.github.io/styleguide/go/decisions#keep-going

Here we want the error message to also contain any diff in the error detail.

purnesh42H · 2024-10-06T19:46:24Z

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go

+
+	// Verify the update received by the watcher.
+	var lastErr error
+	for ; ctx.Err() == nil; <-time.After(100 * time.Millisecond) {


why are we using <-time.After instead of select with context

The time.After here is part of the for loop (to ensure that we pause a little in between iterations). Basically this loop could be written as follows as well:

for { err := verifyListenerUpdate(ctx, lw.updateCh, wantUpdate) if err == nil { break } lastErr = err select { case <-time.After(100 * time.Millisecond): continue case <-ctx.Done: t.Fatalf("Timeout when waiting for listener update. Last seen error: %v", lastErr) } }

purnesh42H

lgtm

xdsclient: invoke watch callback when new update matches cached one, …

159cde6

…but previous update was NACKed

easwars added the Type: Bug label Oct 2, 2024

easwars added this to the 1.68 Release milestone Oct 2, 2024

easwars requested review from zasweq and purnesh42H October 2, 2024 12:27

easwars assigned zasweq and purnesh42H Oct 2, 2024

fix a flake in the test

7c9f578

easwars force-pushed the transport_ack_nack_test branch from 2cf64a5 to e41e411 Compare October 2, 2024 13:26

use SendContext to ensure no goroutine leaks

a41f75b

easwars force-pushed the transport_ack_nack_test branch from e41e411 to a41f75b Compare October 2, 2024 16:57

zasweq approved these changes Oct 4, 2024

View reviewed changes

zasweq assigned easwars and unassigned zasweq Oct 4, 2024

first round of review comments from zasweq

333e5d8

easwars assigned zasweq and unassigned easwars Oct 4, 2024

purnesh42H reviewed Oct 6, 2024

View reviewed changes

purnesh42H self-requested a review October 7, 2024 19:19

purnesh42H approved these changes Oct 7, 2024

View reviewed changes

purnesh42H removed their assignment Oct 7, 2024

easwars merged commit 9afb232 into grpc:master Oct 7, 2024
14 checks passed

easwars deleted the transport_ack_nack_test branch October 7, 2024 20:01

tbg mentioned this pull request Nov 27, 2024

DEPS: upgrade grpc to v1.68.0 cockroachdb/cockroach#136278

Draft

16 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xdsclient: invoke watch callback when new update matches cached one, but previous update was NACKed (1/N) #7692

xdsclient: invoke watch callback when new update matches cached one, but previous update was NACKed (1/N) #7692

easwars commented Oct 2, 2024 •

edited

Loading

codecov bot commented Oct 2, 2024 •

edited

Loading

zasweq left a comment

zasweq Oct 4, 2024

easwars Oct 4, 2024

zasweq Oct 4, 2024

easwars Oct 4, 2024

zasweq Oct 7, 2024

zasweq Oct 4, 2024

easwars Oct 4, 2024

zasweq Oct 7, 2024

zasweq Oct 4, 2024

easwars Oct 4, 2024

zasweq Oct 4, 2024

easwars Oct 4, 2024

zasweq Oct 7, 2024

zasweq Oct 4, 2024

easwars Oct 4, 2024

easwars Oct 4, 2024

zasweq Oct 7, 2024 •

edited

Loading

zasweq Oct 4, 2024

easwars Oct 4, 2024

zasweq Oct 7, 2024

purnesh42H Oct 6, 2024

easwars Oct 7, 2024

purnesh42H Oct 6, 2024

easwars Oct 7, 2024

purnesh42H Oct 6, 2024

easwars Oct 7, 2024

purnesh42H left a comment

xdsclient: invoke watch callback when new update matches cached one, but previous update was NACKed (1/N) #7692

xdsclient: invoke watch callback when new update matches cached one, but previous update was NACKed (1/N) #7692

Conversation

easwars commented Oct 2, 2024 • edited Loading

codecov bot commented Oct 2, 2024 • edited Loading

Codecov Report

zasweq left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zasweq Oct 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

purnesh42H left a comment

Choose a reason for hiding this comment

easwars commented Oct 2, 2024 •

edited

Loading

codecov bot commented Oct 2, 2024 •

edited

Loading

zasweq Oct 7, 2024 •

edited

Loading