
fix potential deadlock in dynconfig resolver #3024

Merged
merged 1 commit into from
Jan 16, 2024

Conversation

realityone
Contributor

Description

This is a small patch that fixes a potential gRPC connection resolution deadlock in the dynconfig resolver.

Related Issue

Motivation and Context

grpc-go introduced the WithIdleTimeout feature in v1.56.0, and the default idleTimeout was set to 30 * time.Minute in v1.59.0. This may be the cause of the many reports of preheat failing with DeadlineExceeded.

It seems that grpc closes idle connections after 30 minutes. The scheduler doesn't health-check the seed peer, so when we create a preheat task, the previous idle connection may already have been closed by grpc; when a new connection is created, it blocks waiting for the resolver state to be resolved.

Here is the goroutine stack:

1 @ 0x43e56e 0x44ea25 0xac9d33 0xae44ab 0x109b2ed 0x109b1f8 0xb2509b 0x109b1f8 0xb2ca03 0x109b1f8 0xbed3f8 0x109b13a 0xae4003 0x10706c7 0x109be04 0x1611d4f 0x4c9067 0x4c8139 0x15545d8 0x160d3fa 0x1585617 0x158528f 0x471a01
#       0xac9d32        google.golang.org/grpc.(*ClientConn).waitForResolvedAddrs+0xd2                                          /go/pkg/mod/google.golang.org/grpc@v1.60.1/clientconn.go:680
#       0xae44aa        google.golang.org/grpc.newClientStream+0x40a                                                            /go/pkg/mod/google.golang.org/grpc@v1.60.1/stream.go:211
#       0x109b2ec       d7y.io/dragonfly/v2/pkg/rpc/cdnsystem/client.GetClient.RefresherStreamClientInterceptor.func10+0x6c     /var/run/nyx/dragonfly2-scheduler/pkg/rpc/interceptor.go:56
#       0x109b1f7       d7y.io/dragonfly/v2/pkg/rpc/cdnsystem/client.GetClient.ChainStreamClient.func12.1+0x57                  /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:143
#       0xb2509a        github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamClientInterceptor.func1+0xfa             /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/logging/zap/client_interceptors.go:41
#       0x109b1f7       d7y.io/dragonfly/v2/pkg/rpc/cdnsystem/client.GetClient.ChainStreamClient.func12.1+0x57                  /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:143
#       0xb2ca02        github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ClientMetrics).StreamClientInterceptor.func2+0xe2   /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/client_metrics.go:126
#       0x109b1f7       d7y.io/dragonfly/v2/pkg/rpc/cdnsystem/client.GetClient.ChainStreamClient.func12.1+0x57                  /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:143
#       0xbed3f7        d7y.io/dragonfly/v2/pkg/rpc.ConvertErrorStreamClientInterceptor+0x37                                    /var/run/nyx/dragonfly2-scheduler/pkg/rpc/interceptor.go:121
#       0x109b139       d7y.io/dragonfly/v2/pkg/rpc/cdnsystem/client.GetClient.ChainStreamClient.func12+0x179                   /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.4.0/chain.go:146
#       0xae4002        google.golang.org/grpc.(*ClientConn).NewStream+0x182                                                    /go/pkg/mod/google.golang.org/grpc@v1.60.1/stream.go:166
#       0x10706c6       d7y.io/api/v2/pkg/apis/cdnsystem/v1.(*seederClient).ObtainSeeds+0x66                                    /go/pkg/mod/d7y.io/api/v2@v2.0.76/pkg/apis/cdnsystem/v1/cdnsystem_grpc.pb.go:43
#       0x109be03       d7y.io/dragonfly/v2/pkg/rpc/cdnsystem/client.(*client).ObtainSeeds+0xa3                                 /var/run/nyx/dragonfly2-scheduler/pkg/rpc/cdnsystem/client/client.go:153
#       0x1611d4e       d7y.io/dragonfly/v2/scheduler/job.(*job).preheat+0x6ae                                                  /var/run/nyx/dragonfly2-scheduler/scheduler/job/job.go:200
#       0x4c9066        reflect.Value.call+0xce6                                                                                /usr/local/go/src/reflect/value.go:596
#       0x4c8138        reflect.Value.Call+0xb8                                                                                 /usr/local/go/src/reflect/value.go:380
#       0x15545d7       github.com/RichardKnop/machinery/v1/tasks.(*Task).Call+0x297                                            /go/pkg/mod/github.com/!richard!knop/machinery@v1.10.6/v1/tasks/task.go:142
#       0x160d3f9       github.com/RichardKnop/machinery/v1.(*Worker).Process+0x2d9                                             /go/pkg/mod/github.com/!richard!knop/machinery@v1.10.6/v1/worker.go:182
#       0x1585616       github.com/RichardKnop/machinery/v1/brokers/redis.(*Broker).consumeOne+0x2b6                            /go/pkg/mod/github.com/!richard!knop/machinery@v1.10.6/v1/brokers/redis/redis.go:349
#       0x158528e       github.com/RichardKnop/machinery/v1/brokers/redis.(*Broker).consume.func2+0x4e                          /go/pkg/mod/github.com/!richard!knop/machinery@v1.10.6/v1/brokers/redis/redis.go:312

The dynconfig resolver compares the seed peers' addresses, but the seed peer list has not changed, so there is no further chance for cc.UpdateState to be called, as in the code below:

	// Addresses unchanged: return early without ever calling cc.UpdateState.
	if reflect.DeepEqual(r.addrs, addrs) {
		return
	}
	r.addrs = addrs

	if err := r.cc.UpdateState(resolver.State{
		Addresses: addrs,
	}); err != nil {
		plogger.Errorf("resolver update ClientConn error %v", err)
	}

This causes the deadlock in the waitForResolvedAddrs method.

Another fix would be tracking the connection state, but that is more complicated. Or perhaps grpc-go should clarify the expected resolver behavior first.

I have two Dragonfly2 clusters, one for testing and one for production. After disabling the idleTimeout feature, I haven't seen the issue happen again in 4 hours.
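For reference, disabling the idleness feature looks roughly like the following dial-option sketch, assuming grpc-go v1.56.0 or later; the target and credentials are placeholders, and a zero duration tells grpc-go never to enter the idle state:

```go
// Sketch only: requires google.golang.org/grpc >= v1.56.0.
conn, err := grpc.Dial(target,
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	// A duration of 0 disables channel idleness, so the connection is
	// not torn down after the default 30 minutes of inactivity.
	grpc.WithIdleTimeout(0),
)
```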

I also changed a fair amount of code in the dynconfig resolver while digging into this state. I hope this is the root cause.

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation Update (if none of the other choices apply)

Checklist

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

@realityone realityone requested a review from a team as a code owner January 15, 2024 10:22

codecov bot commented Jan 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (abcf7ea) 51.38% compared to head (5d5916e) 51.37%.


@@            Coverage Diff             @@
##             main    #3024      +/-   ##
==========================================
- Coverage   51.38%   51.37%   -0.02%     
==========================================
  Files         162      162              
  Lines       22154    22154              
==========================================
- Hits        11384    11381       -3     
- Misses      10111    10112       +1     
- Partials      659      661       +2     
Flag                            Coverage Δ
Object-compatibility-e2etests   ?
e2etests                        ?
unittests                       51.37% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

see 1 file with indirect coverage changes

@gaius-qi (Member) left a comment:

LGTM

@gaius-qi gaius-qi merged commit 9a1c744 into dragonflyoss:main Jan 16, 2024
21 checks passed