Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcs, tso, client: After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately. #6748

Closed
binshi-bing opened this issue Jul 4, 2023 · 2 comments · Fixed by #6749
Assignees
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@binshi-bing
Copy link
Contributor

binshi-bing commented Jul 4, 2023

Enhancement Task

Summary

It's a pure client issue during tso service discovery at the first time. During the incident, we restarted PD pods and TSO pods, but they won't help. The real action which mitigated the issue was we moved group2's primary to the same location of group 0's primary around 13:55, which I'll explain later.

What's the problem

After TiDB calls NewClientWithAPIContextV2 and returns, the first GetTS always uses the default keyspace group (failed) instead of the keyspace group discovered by the keyspace name passed from the NewClientWithAPIContextV2 API.
Although PD (TSO) client almost immediately uses the right keyspace id, queried by keyspace name, to discovery the right keyspace group that the keyspace belongs to, TiDB didn't have retry and failed before PD (TSO) client discovered the right keyspace group service info.
I can consistently repo the issue in staging env with -keyspacename option in pd-tso-bench.

What's the main scenearios hitting the issue?

The standby TiDB pods are provisioned on demand. It seems that in this case the tidb doesn't retry GetTS and surface the problem. We didn't hit this issue in Dev/QA/Unittest, and the reason seems to be the caller of GetTS have retries.

What's the impact

Only impacted the non-default keyspace groups in staging, so it doesn't impact the single-timeline version in eu-central prod.

Details

What's the problem

After TiDB calls NewClientWithAPIContextV2 and it returns, the first GetTS always uses the default keyspace group instead of the keyspace group discovered by the keyspace name passed from the NewClientWithAPIContextV2 API. Although PD (TSO) client almost immediately uses the right keyspace id, queried by keyspace name, to discovery the right keyspace group that the keyspace belongs to, TiDB finished the retry within 1ms and failed before PD (TSO) client discovers the right keyspace group service info

image
[2023/07/04 07:09:07.035 +00:00] [INFO] [tso_service_discovery.go:495] ["[tso] updated keyspace group service discovery info"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] 
    [keyspace-group-service="id:2 user_kind:\"basic\" members:<address:\"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379\" is_primary:true > members:<address:\"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379\" > "]
[2023/07/04 07:09:07.035 +00:00] [ERROR] [tso_dispatcher.go:459] ["[tso] getTS error"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [dc-location=global] [stream-addr=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, after processing requests"]
[2023/07/04 07:09:07.030 +00:00] [INFO] [client.go:522] ["[pd] create pd client with endpoints and keyspace"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [pd-address="[serverless-cluster-pd.tidb-serverless.svc:2379]"] [keyspace-name=2oVzWDxP4sSxu9f] [keyspace-id=18076]
[2023/07/04 07:09:07.024 +00:00] [INFO] [tso_service_discovery.go:495] ["[tso] updated keyspace group service discovery info"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] 
    [keyspace-group-service="user_kind:\"basic\" members:<address:\"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379\" > members:<address:\"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379\" is_primary:true > "]
[2023/07/04 07:09:07.019 +00:00] [INFO] [tso_service_discovery.go:584] ["update tso server addresses"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [addrs="[http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379,http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]"]
[2023/07/04 07:09:07.019 +00:00] [INFO] [tso_service_discovery.go:195] ["initializing tso service discovery"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [max-retry-times=100] [retry-interval=1s]

Root Cause Analysis

newClientWithKeyspaceName
image
We should synchronously discover keyspace group service with the updated keyspace ID at line 516 in the above picture.

@binshi-bing binshi-bing added the type/enhancement The issue or PR belongs to an enhancement. label Jul 4, 2023
@binshi-bing
Copy link
Contributor Author

It can be consistently reproed by pd-tso-bench with -keyspace-name option.

root@tso-bench:/# ./pd-tso-bench -v -duration 250000s -pd "http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379" -client 1 -c 1 -interval 30s -keyspace-name "2oVzWDxP4sSxu9f"

Start benchmark #0, duration: 250000s
Create 1 client(s) for benchmark
[2023/07/04 23:39:38.981 +00:00] [INFO] [pd_service_discovery.go:590] ["[pd] update member urls"] [old-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"] [new-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-1.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-2.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"]
[2023/07/04 23:39:38.981 +00:00] [INFO] [pd_service_discovery.go:616] ["[pd] switch leader"] [new-leader=http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379] [old-leader=]
[2023/07/04 23:39:38.981 +00:00] [INFO] [pd_service_discovery.go:193] ["[pd] init cluster id"] [cluster-id=7158110369020695519]
[2023/07/04 23:39:38.981 +00:00] [INFO] [client.go:590] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=API_SVC_MODE]
[2023/07/04 23:39:38.981 +00:00] [INFO] [tso_service_discovery.go:185] ["created tso service discovery"] [cluster-id=7158110369020695519] [keyspace-id=0] [default-discovery-key=/ms/7158110369020695519/tso/00000/primary]
[2023/07/04 23:39:38.981 +00:00] [INFO] [tso_service_discovery.go:195] ["initializing tso service discovery"] [max-retry-times=100] [retry-interval=1s]
[2023/07/04 23:39:38.982 +00:00] [INFO] [tso_service_discovery.go:594] ["update tso server addresses"] [addrs="[http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379,http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]"]
[2023/07/04 23:39:38.985 +00:00] [INFO] [tso_service_discovery.go:503] ["[tso] updated keyspace group service discovery info"] [keyspace-id-in-request=0] [tso-server-addr=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379] [keyspace-group-service="user_kind:"basic" members:<address:"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379" is_primary:true > members:<address:"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379" > "]
[2023/07/04 23:39:38.985 +00:00] [INFO] [tso_client.go:230] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379]
[2023/07/04 23:39:38.985 +00:00] [INFO] [tso_service_discovery.go:413] ["[tso] switch primary"] [new-primary=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379] [old-primary=]
[2023/07/04 23:39:38.994 +00:00] [FATAL] [main.go:120] ["get first time tso failed"] [client-number=0] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*tsoTSOStream).processRequests\n\t/go/src/github.com/tikv/pd/client/tso_stream.go:204\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\t/go/src/github.com/tikv/pd/client/tso_dispatcher.go:716\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\t/go/src/github.com/tikv/pd/client/tso_dispatcher.go:449\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\t/go/src/github.com/tikv/pd/client/tso_dispatcher.go:98\ngithub.com/tikv/pd/client.(*client).GetLocalTS\n\t/go/src/github.com/tikv/pd/client/client.go:843\nmain.bench\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:118\nmain.main\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:97\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172"] [stack="main.bench\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:120\nmain.main\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:97\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

@binshi-bing
Copy link
Contributor Author

binshi-bing commented Jul 5, 2023

Fix is verified

root@tso-bench:/# ./pd-tso-bench -v -duration 250000s -pd "http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379" -client 1 -c 1 -interval 10s -keyspace-name "2oVzWDxP4sSxu9f"

Start benchmark #0, duration: 250000s
Create 1 client(s) for benchmark
[2023/07/05 03:16:47.148 +00:00] [INFO] [pd_service_discovery.go:600] ["[pd] update member urls"] [old-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"] [new-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-1.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-2.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"]
[2023/07/05 03:16:47.148 +00:00] [INFO] [pd_service_discovery.go:626] ["[pd] switch leader"] [new-leader=http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379] [old-leader=]
[2023/07/05 03:16:47.148 +00:00] [INFO] [pd_service_discovery.go:195] ["[pd] init cluster id"] [cluster-id=7158110369020695519]
[2023/07/05 03:16:47.152 +00:00] [INFO] [client.go:586] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=API_SVC_MODE]
[2023/07/05 03:16:47.152 +00:00] [INFO] [tso_service_discovery.go:185] ["created tso service discovery"] [cluster-id=7158110369020695519] [keyspace-id=18076] [default-discovery-key=/ms/7158110369020695519/tso/00000/primary]
[2023/07/05 03:16:47.152 +00:00] [INFO] [tso_service_discovery.go:195] ["initializing tso service discovery"] [max-retry-times=100] [retry-interval=1s]
[2023/07/05 03:16:47.152 +00:00] [INFO] [tso_service_discovery.go:589] ["update tso server addresses"] [addrs="[http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379,http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]"]
[2023/07/05 03:16:47.155 +00:00] [INFO] [tso_service_discovery.go:498] ["[tso] updated keyspace group service discovery info"] [keyspace-id-in-request=18076] [tso-server-addr=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379] [keyspace-group-service="id:2 user_kind:"basic" members:<address:"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379" > members:<address:"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379" is_primary:true > "]
[2023/07/05 03:16:47.155 +00:00] [INFO] [tso_client.go:230] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]
[2023/07/05 03:16:47.155 +00:00] [INFO] [tso_service_discovery.go:408] ["[tso] switch primary"] [new-primary=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379] [old-primary=]
[2023/07/05 03:16:47.158 +00:00] [INFO] [tso_dispatcher.go:295] ["[tso] tso dispatcher created"] [dc-location=global]
[2023/07/05 03:16:47.158 +00:00] [INFO] [client.go:634] ["[pd] service mode changed"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=API_SVC_MODE]
[2023/07/05 03:16:47.158 +00:00] [INFO] [client.go:511] ["[pd] create pd client with endpoints and keyspace"] [pd-address="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"] [keyspace-name=2oVzWDxP4sSxu9f] [keyspace-id=18076]

count: 30787, max: 20.4116ms, min: 0.2205ms, avg: 0.3239ms
<1ms: 30747, >1ms: 22, >2ms: 13, >5ms: 3, >10ms: 2, >30ms: 0, >50ms: 0, >100ms: 0, >200ms: 0, >400ms: 0, >800ms: 0, >1s: 0

@ti-chi-bot ti-chi-bot bot closed this as completed in #6749 Jul 5, 2023
ti-chi-bot bot pushed a commit that referenced this issue Jul 5, 2023
…APIContext (#6749)

close #6748

After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately

Signed-off-by: Bin Shi <binshi.bing@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue Jul 5, 2023
ref #6747, ref #6748, ref #6749

Signed-off-by: Ryan Leung <rleungx@gmail.com>
rleungx pushed a commit to rleungx/pd that referenced this issue Aug 2, 2023
…APIContext (tikv#6749)

close tikv#6748

After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately

Signed-off-by: Bin Shi <binshi.bing@gmail.com>
rleungx added a commit to rleungx/pd that referenced this issue Aug 2, 2023
rleungx pushed a commit to rleungx/pd that referenced this issue Aug 2, 2023
…APIContext (tikv#6749)

close tikv#6748

After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately

Signed-off-by: Bin Shi <binshi.bing@gmail.com>
rleungx added a commit to rleungx/pd that referenced this issue Aug 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant