Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ringhash: port e2e tests from c-core #7271

Merged
merged 18 commits into from
Jun 11, 2024
Merged

Conversation

atollena
Copy link
Collaborator

@atollena atollena commented May 27, 2024

Fixes #6072.

RELEASE NOTES: none

Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the overall approach looks good. So, I will wait for you to move it out of draft before I give any further comments. Thanks.

@easwars easwars assigned atollena and unassigned easwars May 28, 2024
@atollena atollena marked this pull request as ready for review May 30, 2024 08:43
@atollena
Copy link
Collaborator Author

I implemented 8 out of the 23 tests in https://github.com/grpc/grpc/blob/master/test/cpp/end2end/xds/xds_ring_hash_end2end_test.cc.

I'd suggest we try to get those in and add more tests in a separate commit.

Regarding potential flakyness: I ran those tests with -count 1000 which didn't yield any failure.

@easwars
Copy link
Contributor

easwars commented May 31, 2024

I see that this is still assigned to you. Please assign it to me when you think it is ready for me to take a pass. Thanks.

@atollena atollena assigned easwars and unassigned atollena Jun 3, 2024
@atollena
Copy link
Collaborator Author

atollena commented Jun 3, 2024

I see that this is still assigned to you. Please assign it to me when you think it is ready for me to take a pass. Thanks.

This is ready for review.

Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial pass.

Many of these comments inside of the test apply to multiple tests instead of just the one where I've made the comment.

Thanks.

internal/testutils/blocking_context_dialer.go Show resolved Hide resolved
internal/testutils/xds/e2e/clientresources.go Show resolved Hide resolved
internal/testutils/xds/e2e/setup_management_server.go Outdated Show resolved Hide resolved
internal/testutils/xds/e2e/setup_management_server.go Outdated Show resolved Hide resolved
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial pass.

Many of these comments inside of the test apply to multiple tests instead of just the one where I've made the comment.

Thanks.

@easwars easwars assigned atollena and unassigned easwars Jun 3, 2024
@easwars easwars assigned atollena and unassigned easwars Jun 6, 2024
@atollena
Copy link
Collaborator Author

Looks like one of the tests added in this PR is flaky here: https://github.com/grpc/grpc-go/actions/runs/9398809643/job/25884944530?pr=7271.

I looked into this, and I think there is a race which is not obvious.
Here is what may be happening:

  1. We create a stubserver and close it immediately, to simulate a server that is unreachable
  2. The stub server calls Serve in a separate goroutine spawned at
    go ss.S.Serve(lis)
    -- if this goroutine runs immediately, then the gRPC server is started properly. Calling Stop() on it will make sure the listener is closed synchronously at

    grpc-go/server.go

    Lines 1886 to 1892 in e40eb2e

    func (s *Server) stop(graceful bool) {
    s.quit.Fire()
    defer s.done.Fire()
    s.channelzRemoveOnce.Do(func() { channelz.RemoveEntry(s.channelz.ID) })
    s.mu.Lock()
    s.closeListenersLocked()
    .
  3. However, if server.Stop() is called before server.Serve() is scheduled, then the listener is not yet registered and the listener is not closed. This is what happened in this test run, and in my tests in most cases if you do:
lis, err := net.Listen
server := grpc.NewServer()
go server.Serve(lis)
server.Stop()

then Stop will run before Serve.
Even when this happens, in most cases the listener is then closed by Serve (behavior implemented at

grpc-go/server.go

Lines 843 to 848 in e40eb2e

if s.lis == nil {
// Serve called after Stop or GracefulStop.
s.mu.Unlock()
lis.Close()
return ErrServerStopped
}
).

However, since that happens in a separate goroutine, there's no guarantee that closing the listener will run before the first connections, later on in a test.

I think the best way to work around this is to directly create listeners instead of creating stubservers. This avoids the asynchronous close behavior described above, while still reserving an available port and making sure it's not used by another server. I did that in the last changes.

@atollena atollena assigned easwars and unassigned atollena Jun 10, 2024
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of all the comments.

Comment on lines 247 to 251
primaryClusterName := "new_cluster_1"
primaryServiceName := "new_eds_service_1"
secondaryClusterName := "new_cluster_2"
secondaryServiceName := "new_eds_service_2"
clusterName := "aggregate_cluster"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, should have caught this much earlier. Could we please use consts here, and in other tests as well. Thanks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed that in the last commit.

@easwars easwars assigned atollena and unassigned easwars Jun 10, 2024
@atollena atollena assigned easwars and unassigned atollena Jun 11, 2024
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Yay!!!

@easwars easwars assigned atollena and arvindbr8 and unassigned easwars Jun 11, 2024
@arvindbr8
Copy link
Member

Thanks @atollena for taking care of this!

@arvindbr8 arvindbr8 merged commit 4dd7f55 into grpc:master Jun 11, 2024
11 checks passed
atollena added a commit to atollena/grpc-go that referenced this pull request Jun 19, 2024
Follow up to grpc#7271 to fix
grpc#6072.

This adds a dozen more end to end tests.

There are tests that I did not port, specifically:

- TestRingHash_TransientFailureSkipToAvailableReady was flaky when I ported it,
so I removed it while investigating.

- TestRingHash_SwitchToLowerPriorityAndThenBack was also flaky, I also removed
it while investigating.

- TestRingHash_ContinuesConnectingWithoutPicksOneSubchannelAtATime, I'm not sure
we implement this behavior, and if we do, it's not working the same way as in
c-core, where the order of subchannel connection attempts is based on the
resolver address order rather than the ring order.

I will follow up with fixes for each one of the remaining tests.
atollena added a commit to atollena/grpc-go that referenced this pull request Jun 19, 2024
Follow up to grpc#7271 to fix
grpc#6072.

This adds a dozen more end to end tests.

There are tests that I did not port, specifically:

- TestRingHash_TransientFailureSkipToAvailableReady was flaky when I ported it,
so I removed it while investigating.

- TestRingHash_SwitchToLowerPriorityAndThenBack was also flaky, I also removed
it while investigating.

- TestRingHash_ContinuesConnectingWithoutPicksOneSubchannelAtATime, I'm not sure
we implement this behavior, and if we do, it's not working the same way as in
c-core, where the order of subchannel connection attempts is based on the
resolver address order rather than the ring order.

I will follow up with fixes for each one of the remaining tests.
atollena added a commit to atollena/grpc-go that referenced this pull request Jun 19, 2024
Follow up to grpc#7271 to fix
grpc#6072.

This adds a dozen more end to end tests.

There are tests that I did not port, specifically:

- TestRingHash_TransientFailureSkipToAvailableReady was flaky when I ported it,
so I removed it while investigating.

- TestRingHash_SwitchToLowerPriorityAndThenBack was also flaky, I also removed
it while investigating.

- TestRingHash_ContinuesConnectingWithoutPicksOneSubchannelAtATime, I'm not sure
we implement this behavior, and if we do, it's not working the same way as in
c-core, where the order of subchannel connection attempts is based on the
resolver address order rather than the ring order.

I will follow up with fixes for each one of the remaining tests.
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ringhash: improve test coverage
3 participants