Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: rework resolver and balancer wrappers to avoid deadlock #6804

Merged
merged 21 commits into from
Dec 5, 2023

Conversation

dfawley
Copy link
Member

@dfawley dfawley commented Nov 16, 2023

  • Balancer wrapper is now recreated for each trip out of idle mode (instead of supporting its own idle mode).
  • Removed mutexes or limited as much as possible.
  • All calls into resolver/balancer are done in serializer, with no mutexes held.
  • Calls from old resolver instances are ignored.
  • ExitIdleCond is no longer needed by closing the idlenessMgr at the start of Close.
  • Constructors figure out their own parameters from cc instead of being passed in.
  • Handle resolver.Build->cc.updateResolverState->balancer.UpdateState->resolver.ResolveNow properly
  • Idle Manager starts in idle state.
  • All channel idleness management goes through idle manager -- e.g. Connect().
  • Add lock and closed state to balancer_wrapper to prevent a race where the
    balancer creates a subchannel and we enter idle mode during that process.
  • Split resolver wrapper into constructor and start method so the serializer is
    available during resolver.Build.
  • Add EnterIdleModeForTesting to idleness manager and use idlenessMgr methods
    in the idleness test instead of methods on the ClientConn which should not be
    called directly.

Fixes #6783

RELEASE NOTES:

  • client: fix race that could cause a deadlock while entering idle mode and receiving a name resolver update

@dfawley dfawley added this to the 1.61 Release milestone Nov 16, 2023
@dfawley dfawley requested a review from arvindbr8 November 16, 2023 06:06
Copy link

codecov bot commented Nov 16, 2023

Codecov Report

Merging #6804 (a4ef6d7) into master (7935c4f) will increase coverage by 0.18%.
Report is 6 commits behind head on master.
The diff coverage is 85.97%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6804      +/-   ##
==========================================
+ Coverage   83.40%   83.59%   +0.18%     
==========================================
  Files         285      285              
  Lines       30879    30763     -116     
==========================================
- Hits        25754    25715      -39     
+ Misses       4053     3987      -66     
+ Partials     1072     1061      -11     
Files Coverage Δ
internal/channelz/funcs.go 95.02% <100.00%> (+0.04%) ⬆️
picker_wrapper.go 88.88% <50.00%> (-1.81%) ⬇️
internal/idle/idle.go 88.40% <92.45%> (+26.18%) ⬆️
clientconn.go 92.20% <86.95%> (-0.67%) ⬇️
balancer_wrapper.go 82.92% <87.95%> (+2.26%) ⬆️
resolver_wrapper.go 85.04% <77.61%> (-9.60%) ⬇️

... and 23 files with indirect coverage changes

- All channel idleness management goes through idle manager -- e.g. Connect().
- Add lock and closed state to balancer_wrapper to prevent a race where the
  balancer creates a subchannel and we enter idle mode during that process.
- Split resolver wrapper into constructor and start method so the serializer is
  available during resolver.Build.
- Add EnterIdleModeForTesting to idleness manager and use idlenessMgr methods
  in the idleness test instead of methods on the ClientConn which should not be
  called directly.
@dfawley
Copy link
Member Author

dfawley commented Nov 17, 2023

@zasweq I may have misunderstood, so if @arvindbr8 can review this tomorrow then feel free to ignore. Otherwise if you could review this next week so we can have it done in time to cherry-pick it to the release branch, that would be great. Thanks!

Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a style pass. Technically, I have a very minute understanding of the original Exit Idle code, and this fix. I think since Easwar is back next week he should be primary reviewer since he wrote the ExitIdle code. We should also have a technical sync where you briefly describe to us the original code, all the issues, the fix and why it overall fixes.

balancer_wrapper.go Show resolved Hide resolved
balancer_wrapper.go Show resolved Hide resolved
resolver_wrapper.go Show resolved Hide resolved
balancer_wrapper.go Outdated Show resolved Hide resolved
balancer_wrapper.go Outdated Show resolved Hide resolved
clientconn.go Outdated Show resolved Hide resolved
clientconn.go Show resolved Hide resolved
internal/idle/idle.go Show resolved Hide resolved
resolver_wrapper.go Outdated Show resolved Hide resolved
resolver_wrapper.go Show resolved Hide resolved
@zasweq zasweq requested review from easwars and removed request for arvindbr8 November 22, 2023 19:31
@zasweq zasweq assigned easwars and unassigned arvindbr8 and zasweq Nov 22, 2023
internal/idle/idle.go Outdated Show resolved Hide resolved
internal/idle/idle.go Outdated Show resolved Hide resolved
internal/idle/idle.go Outdated Show resolved Hide resolved
internal/idle/idle.go Show resolved Hide resolved
internal/idle/idle.go Outdated Show resolved Hide resolved
@easwars easwars assigned dfawley and unassigned easwars Nov 28, 2023
@easwars
Copy link
Contributor

easwars commented Nov 28, 2023

@dfawley : I've only looked through the idleness code so far. But assigning back to you to make the requested changes while I will continue to review other changes.

Copy link
Member Author

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All comments address (I think!)

balancer_wrapper.go Show resolved Hide resolved
balancer_wrapper.go Show resolved Hide resolved
balancer_wrapper.go Show resolved Hide resolved
balancer_wrapper.go Outdated Show resolved Hide resolved
balancer_wrapper.go Outdated Show resolved Hide resolved
internal/idle/idle.go Outdated Show resolved Hide resolved
internal/idle/idle.go Outdated Show resolved Hide resolved
internal/idle/idle.go Show resolved Hide resolved
internal/idle/idle.go Outdated Show resolved Hide resolved
internal/idle/idle.go Outdated Show resolved Hide resolved
@dfawley dfawley assigned easwars and zasweq and unassigned dfawley Nov 28, 2023
Copy link
Contributor

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM module some nits

balancer_wrapper.go Outdated Show resolved Hide resolved
balancer_wrapper.go Show resolved Hide resolved
clientconn.go Outdated Show resolved Hide resolved
clientconn.go Show resolved Hide resolved
clientconn.go Outdated Show resolved Hide resolved
clientconn.go Show resolved Hide resolved
resolver_balancer_ext_test.go Outdated Show resolved Hide resolved
@dfawley dfawley assigned easwars and unassigned dfawley Nov 30, 2023
balancer_wrapper.go Show resolved Hide resolved
balancer_wrapper.go Show resolved Hide resolved
balancer_wrapper.go Outdated Show resolved Hide resolved
clientconn.go Outdated Show resolved Hide resolved
clientconn.go Show resolved Hide resolved
clientconn.go Outdated Show resolved Hide resolved
@easwars easwars assigned dfawley and unassigned easwars Dec 1, 2023
@dfawley dfawley assigned easwars and unassigned dfawley Dec 1, 2023
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments.

@@ -184,115 +168,49 @@ func (ccb *ccBalancerWrapper) buildLoadBalancingPolicy(name string) {
ccb.curBalancerName = builder.Name()
}

// close initiates async shutdown of the wrapper. To determine the wrapper has
// finished shutting down, the channel should block on ccb.serializer.Done()
// without cc.mu held.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Could we also please mention here that the caller (channel) should hold cc.mu when invoking close().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// ccBalancerWrapper sits between the ClientConn and the Balancer.
//
// ccBalancerWrapper implements methods corresponding to the ones on the
// balancer.Balancer interface. The ClientConn is free to call these methods
// concurrently and the ccBalancerWrapper ensures that calls from the ClientConn
// to the Balancer happen synchronously and in order.
// to the Balancer happen in order by performing them in the serializer, without
// any mutexes held.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do hold ccb.mu in close(), and require that the caller hold cc.mu when calling into close(). Would it be better to have that spelled out here?

Copy link
Member Author

@dfawley dfawley Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is about how the calls into the plugin, themselves, are made without cc.mu held. Not that calls into ccb must not hold cc.mu.

Probably we should just document the functions. (Done, and it looks like the only one that matters either way is close, which needs cc.mu.)

@@ -313,10 +231,6 @@ func (ccb *ccBalancerWrapper) RemoveSubConn(sc balancer.SubConn) {
}

func (ccb *ccBalancerWrapper) UpdateAddresses(sc balancer.SubConn, addrs []resolver.Address) {
if ccb.isIdleOrClosed() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not want to check if the balancer is closed here? UpdateAddresses could lead to transport creation, right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more relevant whether the SubConn/addrConn was closed and not whether the balancer was closed. And we do that check here:

if ac.state == connectivity.Shutdown ||

@@ -0,0 +1,264 @@
/*
*
* Copyright 2014 gRPC authors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2023?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O_O

// the channel enters idle mode.
func (s) TestEnterIdleDuringResolverUpdateState(t *testing.T) {
enterIdle := internal.EnterIdleModeForTesting.(func(*grpc.ClientConn))
const name = "testeidrus"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: In each of these tests, could we use t.Name() as the resolver name instead? I think a shortened name could be very confusing for someone looking at a failed test.

Copy link
Member Author

@dfawley dfawley Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that originally, but it doesn't work because of case/slashes. I can fix it, though. Done.

@easwars easwars assigned dfawley and unassigned easwars Dec 5, 2023
@dfawley dfawley merged commit 5d7453e into grpc:master Dec 5, 2023
14 checks passed
@dfawley dfawley deleted the idlerace2 branch December 5, 2023 19:02
dfawley added a commit to dfawley/grpc-go that referenced this pull request Dec 7, 2023
@dfawley dfawley modified the milestones: 1.61 Release, 1.60 Release Dec 7, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race between calls from resolver and entering idle mode
4 participants