store/tikv: keepalive with pd #14118

nolouch · 2019-12-18T09:48:52Z

Signed-off-by: nolouch <nolouch@gmail.com>

What problem does this PR solve?

Enable keepalive on the connection to PD.
This PR should wait for "client: supports to add gRPC dial options" (tikv/pd#2035).
Problem:

After all 3 PD instances are killed on AWS (k8s environment), it takes a long time (15 minutes) for TiDB server instances to reconnect to the new PD instances. We found a stale TCP connection after all the pod IPs had changed.

The connection is still shown as established even though its remote IP no longer exists after the kill.
10.0.48.116 no longer exists, but the TCP connection to it is still ESTABLISHED.

Tue Dec 17 12:54:36 UTC 2019
tcp        0      1 10.0.38.181:53424       172.20.62.78:2379       SYN_SENT    1/tidb-server
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0   1140 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 12:54:37 UTC 2019
tcp        0      1 10.0.38.181:53424       172.20.62.78:2379       SYN_SENT    1/tidb-server
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0   1153 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 12:54:37 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0   1153 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server

....

Tue Dec 17 13:07:15 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0  26868 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 13:07:16 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0  26868 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 13:07:16 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0  26868 10.0.38.181:35378       10.0.48.116:2379        ESTABLISHED 1/tidb-server
Tue Dec 17 13:07:17 UTC 2019
tcp        0      0 10.0.38.181:53302       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53304       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53488       172.20.62.78:2379       ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:53300       10.0.22.167:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:37554       10.0.46.225:2379        ESTABLISHED 1/tidb-server
tcp        0      0 10.0.38.181:41754       10.0.55.94:2379         ESTABLISHED 1/tidb-server

This problem is the same as #7099. Possibly (not confirmed) the k8s CNI drops all packets sent to the removed node, which causes a stalled connection until the kernel's TCP retransmission times out and closes the connection.

What is changed and how it works?

Update the PD client and enable keepalive in gRPC.

Check List

Tests

Manual test (add detailed scripts or steps below)
No stale connection appears.

nolouch · 2019-12-18T10:02:38Z

/rebuild

lysu · 2019-12-18T12:06:09Z

@nolouch The plugin CI will be fixed after the PD repo PR is merged.

DanielZhangQD · 2019-12-19T03:14:51Z

"may k8s CNI dropping all packets sent to the removed node (Indeterminate), that cause a stalled connection,"

@nolouch As we have discussed, this is not a k8s CNI issue. PD does not close gracefully (no FIN is sent to TiDB), and that is what causes the stale connection.

nolouch · 2019-12-20T05:47:47Z

@DanielZhangQD I tested with kill -9 but did not see the same issue in k8s. Anyway, the keepalive is needed.

lysu

LGTM

siddontang · 2019-12-22T14:03:14Z

@nolouch

Have we already used keepalive for TiKV?

disksing · 2019-12-23T02:44:45Z

LGTM

bb7133

LGTM

go.sum

bb7133 · 2019-12-23T15:29:55Z

/run-all-tests

disksing · 2019-12-24T03:34:16Z

LGTM

nolouch · 2019-12-24T05:04:25Z

/merge

sre-bot · 2019-12-24T05:06:42Z

/run-all-tests

sre-bot · 2019-12-24T05:10:38Z

cherry pick to release-2.1 failed

sre-bot · 2019-12-24T05:11:19Z

cherry pick to release-3.0 failed

sre-bot · 2020-04-07T10:25:34Z

It seems that, not for sure, we failed to cherry-pick this commit to release-2.1. Please comment '/run-cherry-picker' to try to trigger the cherry-picker if we did fail to cherry-pick this commit before. @nolouch PTAL.

store/tikv: keepalive with pd

29652bf

Signed-off-by: nolouch <nolouch@gmail.com>

nolouch added the type/bugfix label Dec 18, 2019

nolouch requested review from breezewish, hicqu and disksing December 18, 2019 09:48

hicqu requested a review from lysu December 18, 2019 09:52

nolouch added 2 commits December 20, 2019 13:14

update pd

3a7cfb7

Signed-off-by: nolouch <nolouch@gmail.com>

Merge remote-tracking branch 'origin' into keepalive-pd

b1769ad

nolouch requested a review from bb7133 December 20, 2019 05:51

lysu reviewed Dec 20, 2019

Merge remote-tracking branch 'origin/master' into keepalive-pd

194edeb

nolouch added needs-cherry-pick-2.1 labels Dec 23, 2019

nolouch added 4 commits December 23, 2019 15:46

Merge remote-tracking branch 'origin/master' into keepalive-pd

a47b2a7

Merge branch 'master' into keepalive-pd

044c326

Merge branch 'master' into keepalive-pd

9454483

Merge branch 'master' into keepalive-pd

445f532

bb7133 approved these changes Dec 23, 2019

bb7133 reviewed Dec 23, 2019

sre-bot added the status/can-merge label Dec 24, 2019

Merge branch 'master' into keepalive-pd

d08d6a5

sre-bot merged commit 9e376cf into pingcap:master Dec 24, 2019

nolouch added a commit to nolouch/tidb that referenced this pull request Dec 25, 2019

store/tikv: keepalive with pd (pingcap#14118)

f55c1a2

nolouch mentioned this pull request Dec 25, 2019

store/tikv: keepalive with pd (#14118) #14233

Merged

nolouch added a commit to nolouch/tidb that referenced this pull request Dec 25, 2019

store/tikv: keepalive with pd (pingcap#14118)

d377ab0

nolouch mentioned this pull request Dec 25, 2019

store/tikv: keepalive with pd (#14118) #14234

Closed

hicqu mentioned this pull request Dec 26, 2019

store: keep alive for etcd client #14253

Merged

jackysp pushed a commit that referenced this pull request Dec 27, 2019

store/tikv: keepalive with pd (#14118) (#14233)

6adce23

nolouch deleted the keepalive-pd branch December 27, 2019 09:58

gotoxu mentioned this pull request Sep 18, 2021

grpc: keepalive with tikv tikv/client-java#279

Merged
