mcs, tso, client: After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately. #6748

binshi-bing · 2023-07-04T23:09:12Z

Enhancement Task

Summary

It's a pure client issue during tso service discovery at the first time. During the incident, we restarted PD pods and TSO pods, but they won't help. The real action which mitigated the issue was we moved group2's primary to the same location of group 0's primary around 13:55, which I'll explain later.

What's the problem

After TiDB calls NewClientWithAPIContextV2 and returns, the first GetTS always uses the default keyspace group (failed) instead of the keyspace group discovered by the keyspace name passed from the NewClientWithAPIContextV2 API.
Although PD (TSO) client almost immediately uses the right keyspace id, queried by keyspace name, to discovery the right keyspace group that the keyspace belongs to, TiDB didn't have retry and failed before PD (TSO) client discovered the right keyspace group service info.
I can consistently repo the issue in staging env with -keyspacename option in pd-tso-bench.

What's the main scenearios hitting the issue?

The standby TiDB pods are provisioned on demand. It seems that in this case the tidb doesn't retry GetTS and surface the problem. We didn't hit this issue in Dev/QA/Unittest, and the reason seems to be the caller of GetTS have retries.

What's the impact

Only impacted the non-default keyspace groups in staging, so it doesn't impact the single-timeline version in eu-central prod.

Details

What's the problem

After TiDB calls NewClientWithAPIContextV2 and it returns, the first GetTS always uses the default keyspace group instead of the keyspace group discovered by the keyspace name passed from the NewClientWithAPIContextV2 API. Although PD (TSO) client almost immediately uses the right keyspace id, queried by keyspace name, to discovery the right keyspace group that the keyspace belongs to, TiDB finished the retry within 1ms and failed before PD (TSO) client discovers the right keyspace group service info

[2023/07/04 07:09:07.035 +00:00] [INFO] [tso_service_discovery.go:495] ["[tso] updated keyspace group service discovery info"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] 
    [keyspace-group-service="id:2 user_kind:\"basic\" members:<address:\"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379\" is_primary:true > members:<address:\"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379\" > "]
[2023/07/04 07:09:07.035 +00:00] [ERROR] [tso_dispatcher.go:459] ["[tso] getTS error"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [dc-location=global] [stream-addr=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, after processing requests"]
[2023/07/04 07:09:07.030 +00:00] [INFO] [client.go:522] ["[pd] create pd client with endpoints and keyspace"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [pd-address="[serverless-cluster-pd.tidb-serverless.svc:2379]"] [keyspace-name=2oVzWDxP4sSxu9f] [keyspace-id=18076]
[2023/07/04 07:09:07.024 +00:00] [INFO] [tso_service_discovery.go:495] ["[tso] updated keyspace group service discovery info"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] 
    [keyspace-group-service="user_kind:\"basic\" members:<address:\"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379\" > members:<address:\"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379\" is_primary:true > "]
[2023/07/04 07:09:07.019 +00:00] [INFO] [tso_service_discovery.go:584] ["update tso server addresses"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [addrs="[http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379,http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]"]
[2023/07/04 07:09:07.019 +00:00] [INFO] [tso_service_discovery.go:195] ["initializing tso service discovery"] [keyspaceName=2oVzWDxP4sSxu9f] [keyspaceID=18076] [max-retry-times=100] [retry-interval=1s]

Root Cause Analysis

newClientWithKeyspaceName

We should synchronously discover keyspace group service with the updated keyspace ID at line 516 in the above picture.

The text was updated successfully, but these errors were encountered:

binshi-bing · 2023-07-04T23:42:50Z

It can be consistently reproed by pd-tso-bench with -keyspace-name option.

root@tso-bench:/# ./pd-tso-bench -v -duration 250000s -pd "http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379" -client 1 -c 1 -interval 30s -keyspace-name "2oVzWDxP4sSxu9f"

Start benchmark #0, duration: 250000s
Create 1 client(s) for benchmark
[2023/07/04 23:39:38.981 +00:00] [INFO] [pd_service_discovery.go:590] ["[pd] update member urls"] [old-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"] [new-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-1.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-2.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"]
[2023/07/04 23:39:38.981 +00:00] [INFO] [pd_service_discovery.go:616] ["[pd] switch leader"] [new-leader=http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379] [old-leader=]
[2023/07/04 23:39:38.981 +00:00] [INFO] [pd_service_discovery.go:193] ["[pd] init cluster id"] [cluster-id=7158110369020695519]
[2023/07/04 23:39:38.981 +00:00] [INFO] [client.go:590] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=API_SVC_MODE]
[2023/07/04 23:39:38.981 +00:00] [INFO] [tso_service_discovery.go:185] ["created tso service discovery"] [cluster-id=7158110369020695519] [keyspace-id=0] [default-discovery-key=/ms/7158110369020695519/tso/00000/primary]
[2023/07/04 23:39:38.981 +00:00] [INFO] [tso_service_discovery.go:195] ["initializing tso service discovery"] [max-retry-times=100] [retry-interval=1s]
[2023/07/04 23:39:38.982 +00:00] [INFO] [tso_service_discovery.go:594] ["update tso server addresses"] [addrs="[http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379,http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]"]
[2023/07/04 23:39:38.985 +00:00] [INFO] [tso_service_discovery.go:503] ["[tso] updated keyspace group service discovery info"] [keyspace-id-in-request=0] [tso-server-addr=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379] [keyspace-group-service="user_kind:"basic" members:<address:"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379" is_primary:true > members:<address:"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379" > "]
[2023/07/04 23:39:38.985 +00:00] [INFO] [tso_client.go:230] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379]
[2023/07/04 23:39:38.985 +00:00] [INFO] [tso_service_discovery.go:413] ["[tso] switch primary"] [new-primary=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379] [old-primary=]
[2023/07/04 23:39:38.994 +00:00] [FATAL] [main.go:120] ["get first time tso failed"] [client-number=0] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*tsoTSOStream).processRequests\n\t/go/src/github.com/tikv/pd/client/tso_stream.go:204\ngithub.com/tikv/pd/client.(*tsoClient).processRequests\n\t/go/src/github.com/tikv/pd/client/tso_dispatcher.go:716\ngithub.com/tikv/pd/client.(*tsoClient).handleDispatcher\n\t/go/src/github.com/tikv/pd/client/tso_dispatcher.go:449\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\t/go/src/github.com/tikv/pd/client/tso_dispatcher.go:98\ngithub.com/tikv/pd/client.(*client).GetLocalTS\n\t/go/src/github.com/tikv/pd/client/client.go:843\nmain.bench\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:118\nmain.main\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:97\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172"] [stack="main.bench\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:120\nmain.main\n\t/go/src/github.com/tikv/pd/tools/pd-tso-bench/main.go:97\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

binshi-bing · 2023-07-05T03:14:09Z

Fix is verified

root@tso-bench:/# ./pd-tso-bench -v -duration 250000s -pd "http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379" -client 1 -c 1 -interval 10s -keyspace-name "2oVzWDxP4sSxu9f"

Start benchmark #0, duration: 250000s
Create 1 client(s) for benchmark
[2023/07/05 03:16:47.148 +00:00] [INFO] [pd_service_discovery.go:600] ["[pd] update member urls"] [old-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"] [new-urls="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-1.serverless-cluster-pd-peer.tidb-serverless.svc:2379,http://serverless-cluster-pd-2.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"]
[2023/07/05 03:16:47.148 +00:00] [INFO] [pd_service_discovery.go:626] ["[pd] switch leader"] [new-leader=http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379] [old-leader=]
[2023/07/05 03:16:47.148 +00:00] [INFO] [pd_service_discovery.go:195] ["[pd] init cluster id"] [cluster-id=7158110369020695519]
[2023/07/05 03:16:47.152 +00:00] [INFO] [client.go:586] ["[pd] changing service mode"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=API_SVC_MODE]
[2023/07/05 03:16:47.152 +00:00] [INFO] [tso_service_discovery.go:185] ["created tso service discovery"] [cluster-id=7158110369020695519] [keyspace-id=18076] [default-discovery-key=/ms/7158110369020695519/tso/00000/primary]
[2023/07/05 03:16:47.152 +00:00] [INFO] [tso_service_discovery.go:195] ["initializing tso service discovery"] [max-retry-times=100] [retry-interval=1s]
[2023/07/05 03:16:47.152 +00:00] [INFO] [tso_service_discovery.go:589] ["update tso server addresses"] [addrs="[http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379,http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]"]
[2023/07/05 03:16:47.155 +00:00] [INFO] [tso_service_discovery.go:498] ["[tso] updated keyspace group service discovery info"] [keyspace-id-in-request=18076] [tso-server-addr=http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379] [keyspace-group-service="id:2 user_kind:"basic" members:<address:"http://pd-tso-server-0.tso-service.tidb-serverless.svc:2379" > members:<address:"http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379" is_primary:true > "]
[2023/07/05 03:16:47.155 +00:00] [INFO] [tso_client.go:230] ["[tso] switch dc tso global allocator serving address"] [dc-location=global] [new-address=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379]
[2023/07/05 03:16:47.155 +00:00] [INFO] [tso_service_discovery.go:408] ["[tso] switch primary"] [new-primary=http://pd-tso-server-1.tso-service.tidb-serverless.svc:2379] [old-primary=]
[2023/07/05 03:16:47.158 +00:00] [INFO] [tso_dispatcher.go:295] ["[tso] tso dispatcher created"] [dc-location=global]
[2023/07/05 03:16:47.158 +00:00] [INFO] [client.go:634] ["[pd] service mode changed"] [old-mode=UNKNOWN_SVC_MODE] [new-mode=API_SVC_MODE]
[2023/07/05 03:16:47.158 +00:00] [INFO] [client.go:511] ["[pd] create pd client with endpoints and keyspace"] [pd-address="[http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379]"] [keyspace-name=2oVzWDxP4sSxu9f] [keyspace-id=18076]

count: 30787, max: 20.4116ms, min: 0.2205ms, avg: 0.3239ms
<1ms: 30747, >1ms: 22, >2ms: 13, >5ms: 3, >10ms: 2, >30ms: 0, >50ms: 0, >100ms: 0, >200ms: 0, >400ms: 0, >800ms: 0, >1s: 0

…APIContext (#6749) close #6748 After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately Signed-off-by: Bin Shi <binshi.bing@gmail.com>

ref #6747, ref #6748, ref #6749 Signed-off-by: Ryan Leung <rleungx@gmail.com>

…APIContext (tikv#6749) close tikv#6748 After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately Signed-off-by: Bin Shi <binshi.bing@gmail.com>

) ref tikv#6747, ref tikv#6748, ref tikv#6749 Signed-off-by: Ryan Leung <rleungx@gmail.com>

…APIContext (tikv#6749) close tikv#6748 After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately Signed-off-by: Bin Shi <binshi.bing@gmail.com>

) ref tikv#6747, ref tikv#6748, ref tikv#6749 Signed-off-by: Ryan Leung <rleungx@gmail.com>

binshi-bing added the type/enhancement The issue or PR belongs to an enhancement. label Jul 4, 2023

binshi-bing mentioned this issue Jul 5, 2023

client: fix tso service discovery at the first time for NewClientWithAPIContext #6749

Merged

binshi-bing self-assigned this Jul 5, 2023

ti-chi-bot bot closed this as completed in #6749 Jul 5, 2023

rleungx mentioned this issue Jul 5, 2023

*: add test for misusing keyspace ID when creating the client #6754

Merged

ti-chi-bot bot pushed a commit that referenced this issue Jul 5, 2023

*: add test for misusing keyspace ID when creating the client (#6754)

dc180ca

ref #6747, ref #6748, ref #6749 Signed-off-by: Ryan Leung <rleungx@gmail.com>

rleungx added a commit to rleungx/pd that referenced this issue Aug 2, 2023

*: add test for misusing keyspace ID when creating the client (tikv#6754

ff1e38e

) ref tikv#6747, ref tikv#6748, ref tikv#6749 Signed-off-by: Ryan Leung <rleungx@gmail.com>

rleungx added a commit to rleungx/pd that referenced this issue Aug 2, 2023

*: add test for misusing keyspace ID when creating the client (tikv#6754

90b1316

) ref tikv#6747, ref tikv#6748, ref tikv#6749 Signed-off-by: Ryan Leung <rleungx@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mcs, tso, client: After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately. #6748

mcs, tso, client: After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately. #6748

binshi-bing commented Jul 4, 2023 •

edited

Loading

binshi-bing commented Jul 4, 2023

binshi-bing commented Jul 5, 2023 •

edited

Loading

mcs, tso, client: After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately. #6748

mcs, tso, client: After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately. #6748

Comments

binshi-bing commented Jul 4, 2023 • edited Loading

Enhancement Task

Summary

What's the problem

What's the main scenearios hitting the issue?

What's the impact

Details

What's the problem

Root Cause Analysis

binshi-bing commented Jul 4, 2023

binshi-bing commented Jul 5, 2023 • edited Loading

binshi-bing commented Jul 4, 2023 •

edited

Loading

binshi-bing commented Jul 5, 2023 •

edited

Loading