kv/client: fix gRPC connection pool, don't close conn when meeting error #1196

Merged
merged 1 commit into from
Dec 11, 2020

@amyangfei amyangfei commented Dec 11, 2020

What problem does this PR solve?

In an internal test, TiCDC could not recover after a TiKV server crashed, nor after the crashed server came back.

After killing a TiKV server, TiCDC logged the following errors endlessly. Here store=172.16.4.197:21160 is the killed TiKV server.

[2020/12/10 17:05:01.000 +08:00] [WARN] [client.go:723] ["get grpc stream client failed"] [regionID=11544] [requestID=7814] [storeID=7] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Canceled desc = grpc: the client connection is closing"]
[2020/12/10 17:05:01.000 +08:00] [INFO] [region_cache.go:600] ["mark store's regions need be refill"] [store=172.16.4.197:21160]
[2020/12/10 17:05:01.000 +08:00] [INFO] [region_cache.go:414] ["invalidate current region, because others failed on same store"] [region=12557] [store=172.16.4.197:21160]
[2020/12/10 17:05:01.000 +08:00] [INFO] [client.go:656] ["cannot get rpcCtx, retry span"] [regionID=12557] [span="[7480000000000001ff2d5f728000000000ff4bd6de0000000000fa, 7480000000000001ff2d5f728000000000ff52290a0000000000fa)"]

Conversely, after the TiKV server was recovered, TiCDC reported the following error:

[2020/12/10 17:23:12.931 +08:00] [INFO] [client.go:363] ["establish stream to store failed, retry later"] [addr=172.16.4.197:21160] [error="[CDC:ErrTiKVEventFeed]rpc error: code = Canceled desc = grpc: the client connection is closing"] [errorVerbose="[CDC:ErrTiKVEventFeed]rpc error: code = Canceled desc = grpc: the client connection is closing\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201029093017-5a7df2af2ac7/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc/pkg/errors/helper.go:28\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream.func1\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:362\ngithub.com/pingcap/ticdc/pkg/retry.Run.func1\n\tgithub.com/pingcap/ticdc/pkg/retry/retry.go:32\ngithub.com/cenkalti/backoff.RetryNotify\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:37\ngithub.com/cenkalti/backoff.Retry\n\tgithub.com/cenkalti/backoff@v2.2.1+incompatible/retry.go:24\ngithub.com/pingcap/ticdc/pkg/retry.Run\n\tgithub.com/pingcap/ticdc/pkg/retry/retry.go:31\ngithub.com/pingcap/ticdc/cdc/kv.(*CDCClient).newStream\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:346\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).dispatchRequest\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:720\ngithub.com/pingcap/ticdc/cdc/kv.(*eventFeedSession).eventFeed.func1\n\tgithub.com/pingcap/ticdc/cdc/kv/client.go:477\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1374"]

What is changed and how it works?

We no longer close the gRPC conn here; instead we let it go into TransientFailure. If the store recovers, the gRPC conn can be reused.
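To illustrate why closing the conn is harmful, here is a minimal sketch in plain Go. It is a simplified model, not the TiCDC code: `fakeConn`, `connPool`, `closeOnError`, and the state names are hypothetical stand-ins for a gRPC client conn, the connection pool, and the gRPC connectivity states. It also assumes the pool keeps handing out the same cached conn for a store address, which is what the endless "client connection is closing" errors above suggest.

```go
package main

import (
	"errors"
	"sync"
)

// connState is a simplified stand-in for gRPC connectivity states.
type connState int

const (
	ready connState = iota
	transientFailure
	closed
)

// errClosing mimics the "client connection is closing" error from the logs.
var errClosing = errors.New("the client connection is closing")

// errUnavailable models a store that is currently down.
var errUnavailable = errors.New("store unavailable")

// fakeConn is a hypothetical model of a client conn; it is not grpc.ClientConn.
type fakeConn struct {
	mu    sync.Mutex
	state connState
}

// NewStream fails permanently once the conn is closed, but only
// temporarily while the conn is in transientFailure.
func (c *fakeConn) NewStream() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch c.state {
	case closed:
		return errClosing
	case transientFailure:
		return errUnavailable
	default:
		return nil
	}
}

// setState changes the state; a closed conn never leaves the closed state.
func (c *fakeConn) setState(s connState) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.state != closed {
		c.state = s
	}
}

// connPool caches one conn per store address.
type connPool struct {
	mu           sync.Mutex
	conns        map[string]*fakeConn
	closeOnError bool // true models the behavior before this PR (assumption)
}

func newConnPool(closeOnError bool) *connPool {
	return &connPool{conns: make(map[string]*fakeConn), closeOnError: closeOnError}
}

// get returns the cached conn for addr, creating it on first use.
func (p *connPool) get(addr string) *fakeConn {
	p.mu.Lock()
	defer p.mu.Unlock()
	c, ok := p.conns[addr]
	if !ok {
		c = &fakeConn{state: ready}
		p.conns[addr] = c
	}
	return c
}

// openStream tries to open a stream to addr through the pooled conn.
func (p *connPool) openStream(addr string) error {
	conn := p.get(addr)
	err := conn.NewStream()
	if err != nil && p.closeOnError {
		// Closing here leaves a dead conn in the pool: every later
		// caller gets errClosing, even after the store recovers.
		conn.setState(closed)
	}
	return err
}
```

With `closeOnError` set to `false`, a failed `openStream` leaves the conn in `transientFailure`, so a later call succeeds once the state returns to `ready`. With `true`, every later call on the same address returns `errClosing`.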

TODO:

  • Add an integration test later
  • Refine the gRPC connection pool

Check List

Tests

  • Unit test
  • Integration test

Release note

  • Fix a bug that TiCDC could fail to continue replicating when a TiKV crashes or recovers from a crash. The bug exists in v4.0.8 only.

@amyangfei amyangfei added status/ptal Could you please take a look? type/bugfix This PR fixes a bug. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. labels Dec 11, 2020
@amyangfei amyangfei added this to the v4.0.9 milestone Dec 11, 2020
@amyangfei
Contributor Author

/run-all-tests

@amyangfei amyangfei added the priority/P0 The issue has P0 priority. label Dec 11, 2020
@zier-one
Contributor

LGTM

@ti-srebot ti-srebot added the status/LGT1 Indicates that a PR has LGTM 1. label Dec 11, 2020
@liuzix
Contributor

liuzix commented Dec 11, 2020

LGTM

@ti-srebot ti-srebot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Dec 11, 2020
@zier-one
Contributor

/merge

@ti-srebot ti-srebot added the status/can-merge Indicates a PR has been approved by a committer. label Dec 11, 2020
@ti-srebot
Contributor

/run-all-tests

@ti-srebot
Contributor

@amyangfei merge failed.

@amyangfei
Contributor Author

/run-integration-tests

@codecov-io

Codecov Report

Merging #1196 (e70e4ef) into master (b3a87a8) will increase coverage by 0.0675%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##             master      #1196        +/-   ##
================================================
+ Coverage   45.8436%   45.9111%   +0.0675%     
================================================
  Files           112        112                
  Lines         11729      11727         -2     
================================================
+ Hits           5377       5384         +7     
+ Misses         5748       5740         -8     
+ Partials        604        603         -1     

@amyangfei amyangfei merged commit 24b5f00 into pingcap:master Dec 11, 2020
@amyangfei amyangfei deleted the fix-grpc-conn-2 branch December 11, 2020 09:28
ti-srebot pushed a commit to ti-srebot/ticdc that referenced this pull request Dec 11, 2020
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@ti-srebot
Contributor

cherry pick to release-4.0 in PR #1198

ti-srebot added a commit that referenced this pull request Dec 11, 2020
…ror (#1196) (#1198)

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
@amyangfei amyangfei added the type/bug The issue is confirmed as a bug. label Dec 17, 2020
@AkiraXie AkiraXie added the area/ticdc Issues or PRs related to TiCDC. label Mar 9, 2022