mcs, tso, client: After NewClientWithAPIContextV2 returns, the keyspace group should be discovered by the passed keyspace name immediately. #6748
Comments
It can be consistently reproduced with pd-tso-bench using the -keyspace-name option:
root@tso-bench:/# ./pd-tso-bench -v -duration 250000s -pd "http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379" -client 1 -c 1 -interval 30s -keyspace-name "2oVzWDxP4sSxu9f"
Start benchmark #0, duration: 250000s
The fix is verified:
root@tso-bench:/# ./pd-tso-bench -v -duration 250000s -pd "http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379" -client 1 -c 1 -interval 10s -keyspace-name "2oVzWDxP4sSxu9f"
Start benchmark #0, duration: 250000s
count: 30787, max: 20.4116ms, min: 0.2205ms, avg: 0.3239ms
Enhancement Task
Summary
This is purely a client-side issue that occurs during the first TSO service discovery. During the incident we restarted the PD pods and the TSO pods, but that did not help. What actually mitigated the issue was moving group 2's primary to the same location as group 0's primary around 13:55, which I'll explain later.
What's the problem
After TiDB calls NewClientWithAPIContextV2 and it returns, the first GetTS always uses the default keyspace group (and fails) instead of the keyspace group discovered from the keyspace name passed to NewClientWithAPIContextV2.
Although the PD (TSO) client almost immediately uses the correct keyspace ID, queried by keyspace name, to discover the keyspace group that the keyspace belongs to, TiDB did not retry and failed before the PD (TSO) client had discovered the right keyspace group's service info.
I can consistently reproduce the issue in the staging environment with the -keyspace-name option of pd-tso-bench.
What are the main scenarios hitting the issue?
The standby TiDB pods are provisioned on demand. It seems that in this case TiDB doesn't retry GetTS, which surfaces the problem. We didn't hit this issue in Dev/QA/unit tests, apparently because the callers of GetTS there have retries.
What's the impact
It only impacted the non-default keyspace groups in staging, so it doesn't impact the single-timeline version in eu-central prod.
Details
What's the problem
After TiDB calls NewClientWithAPIContextV2 and it returns, the first GetTS always uses the default keyspace group instead of the keyspace group discovered from the keyspace name passed to NewClientWithAPIContextV2. Although the PD (TSO) client almost immediately uses the correct keyspace ID, queried by keyspace name, to discover the keyspace group that the keyspace belongs to, TiDB finished the retry within 1ms and failed before the PD (TSO) client had discovered the right keyspace group's service info.
Root Cause Analysis
newClientWithKeyspaceName
We should synchronously discover the keyspace group service with the updated keyspace ID at line 516 in the above picture.
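The idea of discovering synchronously before the constructor returns can be sketched as below. This is a toy model rather than the PD client code: the client type, the discover function type, newClientSync, and staticLookup are all hypothetical names introduced for illustration, and the keyspace name and group ID are made up.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// client is a toy stand-in for a TSO client; it is NOT the real PD client type.
type client struct {
	groupID atomic.Uint32 // keyspace group that timestamp requests are routed to
}

// discover is a hypothetical lookup from keyspace name to keyspace group ID.
type discover func(keyspaceName string) (uint32, error)

// newClientSync resolves the keyspace group before returning, so the very
// first request made by the caller is already routed to the right group.
func newClientSync(keyspaceName string, d discover) (*client, error) {
	id, err := d(keyspaceName)
	if err != nil {
		return nil, fmt.Errorf("discover keyspace group for %q: %w", keyspaceName, err)
	}
	c := &client{}
	c.groupID.Store(id)
	return c, nil
}

// staticLookup is a fake discovery backend with one known keyspace (assumed data).
func staticLookup(keyspaceName string) (uint32, error) {
	if keyspaceName == "ks-a" {
		return 2, nil
	}
	return 0, errors.New("unknown keyspace")
}

func main() {
	c, err := newClientSync("ks-a", staticLookup)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("first request is routed to keyspace group", c.groupID.Load())
}
```

The trade-off in this sketch is that construction blocks on discovery and can fail, in exchange for never exposing the default group to the first request.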