Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xdsclient: invoke watch callback when new update matches cached one, but previous update was NACKed (1/N) #7692

Merged
merged 4 commits into from
Oct 7, 2024

Conversation

easwars
Copy link
Contributor

@easwars easwars commented Oct 2, 2024

Consider the following sequence of events:

  • watcher registers a watch for resource A with the xDS client
  • xDS client requests resource A from the management server
  • management server sends resource A to the xDS client
  • xDS client ACKs the resource and invokes the watch callback
  • management server sends a new version of the resource
  • xDS client NACKs this version and invokes the watch callback with the error
    • at this point, the previously ACKed version is still stored in the cache
  • management server sends a new version of the resource (the contents are the same as the first version which was ACKed)

This PR fixes a bug which was causing the xDS client to not invoke the watch callback at the last step of the above sequence of events.

This PR also contains a couple of other changes:

  • improve the error message when unmarshaling a listener resource
  • make the ACK/NACK tests e2e style
    • instead of creating an xDS transport and calling methods on it, the tests now create an xDS client and rely on the xDS client API

This PR is the first in a sequence of PRs that refactor the xDS client codebase to support fallback.

#a71-xds-fallback
#xdsclient-refactor

Addresses #6902

RELEASE NOTES:

  • xdsclient: fix a bug where the watch callback was not invoked if a new update matches cached one, but previous update was NACKed

@easwars easwars added this to the 1.68 Release milestone Oct 2, 2024
@easwars easwars requested review from zasweq and purnesh42H October 2, 2024 12:27
Copy link

codecov bot commented Oct 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.84%. Comparing base (ca4865d) to head (333e5d8).
Report is 6 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7692      +/-   ##
==========================================
- Coverage   81.93%   81.84%   -0.09%     
==========================================
  Files         361      361              
  Lines       27822    27822              
==========================================
- Hits        22797    22772      -25     
- Misses       3838     3853      +15     
- Partials     1187     1197      +10     
Files with missing lines Coverage Δ
xds/internal/xdsclient/authority.go 89.20% <100.00%> (ø)
...ds/internal/xdsclient/xdsresource/unmarshal_lds.go 88.46% <100.00%> (ø)

... and 21 files with indirect coverage changes

@easwars easwars force-pushed the transport_ack_nack_test branch from 2cf64a5 to e41e411 Compare October 2, 2024 13:26
@easwars easwars force-pushed the transport_ack_nack_test branch from e41e411 to a41f75b Compare October 2, 2024 16:57
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments but overall LGTM.

// 2. When a subsequent bad response is received, i.e. once is expected to be
// NACKed, the test verifies that a NACK is sent matching the previously
// ACKed version and current nonce from the response.
// 3. When a subsequent goos response is received, the test verifies that an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol. When I first read this comment from you, I thought you were appreciating me for something :).

Fixed.

ResponseNonce: "",
}
if diff := cmp.Diff(gotReq, wantReq, protocmp.Transform()); diff != "" {
t.Fatalf("Unexpected diff in received discovery request, diff (-got, +want):\n%s", diff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm you like new line now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I comment any time in the past against newlines?

I like it here because the diff can be large. So, it makes sense to be on a separate line.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just never seen the diff outputted on a new line. I wasn't saying it's wrong, just interesting. Maybe I'll do this for large diffs too.

}
gotResp = r.(*v3discoverypb.DiscoveryResponse)

var wantNackErr = errors.New("unexpected http connection manager resource type")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wantNackErr := is better here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Not sure why I had it the other way around.

Just FYI. There is a related topic (not exact though) in non-decisions: https://google.github.io/styleguide/go/decisions#non-decisions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm.

// We expect the version to not change as this is a NACK.
r, err = streamRequestCh.Receive(ctx)
if err != nil {
t.Fatal("Timeout when waiting for ACK")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NACK. But it's not really specific, it's just a timeout waiting for a request on stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACKs and NACKs are similar in a lot of ways to actual requests. They use the same message type and the same fields.

And the channel that we use here streamRequestCh is used to for all of them.

Comment on lines +193 to +214
for {
r, err = streamResponseCh.Receive(ctx)
if err != nil {
t.Fatal("Timeout when waiting for the discovery response from client")
}
gotResp = r.(*v3discoverypb.DiscoveryResponse)

// Verify that the ACK contains the appropriate version and nonce.
r, err = streamRequestCh.Receive(ctx)
if err != nil {
t.Fatal("Timeout when waiting for ACK")
}
gotReq = r.(*v3discoverypb.DiscoveryRequest)
wantReq.VersionInfo = gotResp.GetVersionInfo()
wantReq.ResponseNonce = gotResp.GetNonce()
wantReq.ErrorDetail = nil
diff := cmp.Diff(gotReq, wantReq, protocmp.Transform())
if diff == "" {
break
}
t.Logf("Unexpected diff in received discovery request, diff (-got, +want):\n%s", diff)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this polling codeblock also exit with an error of never got expected discovery request if times out? I don't think it's right to fail because no discovery response or waiting for ack, as that is part of the polling operation under test but not the eventually consistent thing that is expected to happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test timeout expires, the first line in the for loop will see an error and the test will fail with Timeout when waiting for the discovery response from client. Do you want more here? If so, can you please be specific on what you want me to do? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is related to the comment on the other PR. I didn't like the fact that failures occured within the operations of the polling loop, where the polling loop is testing another layer/behavior that is expected to be eventually consistent.

// Verify the update received by the watcher.
for ; ctx.Err() == nil; <-time.After(100 * time.Millisecond) {
if err := verifyListenerUpdate(ctx, lw.updateCh, wantUpdate); err != nil {
t.Logf("Failed to verify listener update, err: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit: You fail earlier with just the err. Prepend the same text? Or remove this text?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that this is a t.Logf and not a t.Errorf or t.Fatalf is because it will take some time for the watcher callback to be invoked. So, I want to log failures up until the expected listener update is received.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I thought about this a little more and changed it to not log on every failure, but only to log the error when the test actually fails to receive the expected listener update and times out.

Copy link
Contributor

@zasweq zasweq Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the actual text in the error message (so fail with "something" + err vs. err).


// Tests the case where the first response is invalid. The test verifies that
// the NACK contains an empty version string.
func (s) TestADS_ACK_NACK_InvalidFirstResponse(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test called ACK_NACK when it only NACK's?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point. The only reason I have it the way I have it is to be able to do a go test <package name> -run Test/ADS_ACK_NACK and run all of these tests.

If you really want me to change it, do let me know and I will. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I prefer it, but it's already merged so no worries haha.

xds/internal/xdsclient/tests/ads_stream_ack_nack_test.go Outdated Show resolved Hide resolved
@zasweq zasweq assigned easwars and unassigned zasweq Oct 4, 2024
@easwars easwars assigned zasweq and unassigned easwars Oct 4, 2024
t.Fatal("Timeout when waiting for the initial discovery request")
}
gotReq := r.(*v3discoverypb.DiscoveryRequest)
wantReq := &v3discoverypb.DiscoveryRequest{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the mentioned issue of "not acking the good update if last update was bad" exist for all resource types? Is it enough to just verify it for a listener resource?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not acking the good update if last update was bad
Where do we have this behavior? That doesn't sound right to me.

But anyways, we re-architected the xDS client transport to be agnostic of the resource type. So, there should not be any code in the xDS client transport layer that does different things based on the resource type. Any resource type specific behavior is mostly limited to parsing/decoding the resource and that is implemented in the xdsresource package which we sometimes refer to as the data-model layer. Therefore verifying ACK/NACK sematics behavior for one resource type is good enough.

}
gotReq = r.(*v3discoverypb.DiscoveryRequest)
if gotNonce, wantNonce := gotReq.GetResponseNonce(), gotResp.GetNonce(); gotNonce != wantNonce {
t.Errorf("Unexpected nonce in discovery request, got: %v, want: %v", gotNonce, wantNonce)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is there any convention we follow on when to use t.Errorf or t.Fatalf? If not, why not always use t.Fatalf to be consistent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://google.github.io/styleguide/go/decisions#keep-going

Here we want the error message to also contain any diff in the error detail.


// Verify the update received by the watcher.
var lastErr error
for ; ctx.Err() == nil; <-time.After(100 * time.Millisecond) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we using <-time.After instead of select with context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time.After here is part of the for loop (to ensure that we pause a little in between iterations). Basically this loop could be written as follows as well:

for {
	err := verifyListenerUpdate(ctx, lw.updateCh, wantUpdate)
    if err == nil {
        break
    }
	lastErr = err
    select {
    case <-time.After(100 * time.Millisecond):
        continue
    case <-ctx.Done:
		t.Fatalf("Timeout when waiting for listener update. Last seen error: %v", lastErr)
    }
}

@purnesh42H purnesh42H self-requested a review October 7, 2024 19:19
Copy link
Contributor

@purnesh42H purnesh42H left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@purnesh42H purnesh42H removed their assignment Oct 7, 2024
@easwars easwars merged commit 9afb232 into grpc:master Oct 7, 2024
14 checks passed
@easwars easwars deleted the transport_ack_nack_test branch October 7, 2024 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants