The api leader got stuck at tso requests forwarding #6549

Closed
binshi-bing opened this issue Jun 1, 2023 · 0 comments · Fixed by #6572 · May be fixed by #6565

binshi-bing commented Jun 1, 2023

Enhancement Task

What did you do?

It happened over time while running in the staging cluster.

What did you expect to see?

No TSO forwarding getting stuck.

What did you see instead?

The API leader got stuck in dispatchTSORequest when forwarding TSO requests to the TSO servers.

What version of PD are you using (pd-server -V)?

tidbcloud/pd-cse release-6.6-keyspace 9e1e2de

Root Cause Analysis

In short, it's caused by incorrect use of an unbuffered channel shared between multiple goroutines.

The problem is in the TSO forwarding & dispatching framework on the server side, where many streaming process goroutines, each serving a Tso() gRPC call, share the same handleDispatcher() goroutine per forwarded host.
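
As an illustration only, the shape looks roughly like the minimal sketch below (type names, channel sizes, and wiring are assumptions made for this reproduction, not the actual PD code). Running it trips Go's runtime deadlock detector, because the dispatcher blocks on the unbuffered error channel while the only goroutine that could receive from it is itself blocked:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real PD types.
type tsoRequest struct {
	respCh chan struct{} // where the dispatcher would hand back the response
}

type tsoDispatcher struct {
	requestCh chan *tsoRequest // buffered queue shared by all streams of one forwarded host
	errCh     chan error       // unbuffered: a send blocks until a stream goroutine receives
}

// One handleDispatcher() goroutine per forwarded host: it drains requestCh
// and, on a forwarding failure, reports the error over the unbuffered errCh.
func (d *tsoDispatcher) handleDispatcher(forward func(*tsoRequest) error) {
	for req := range d.requestCh {
		if err := forward(req); err != nil {
			d.errCh <- err // blocks forever if no stream goroutine is currently receiving
			return
		}
		close(req.respCh)
	}
}

func main() {
	d := &tsoDispatcher{
		requestCh: make(chan *tsoRequest, 10000),
		errCh:     make(chan error),
	}
	// The forwarded call fails, so the dispatcher will try to report on errCh.
	go d.handleDispatcher(func(*tsoRequest) error { return errors.New("forward failed") })

	// The streaming process goroutine (main, here) dispatches a request, then
	// blocks on something like stream.Recv() before it ever reads errCh.
	req := &tsoRequest{respCh: make(chan struct{})}
	d.requestCh <- req
	<-req.respCh // stands in for the blocking stream.Recv(); never returns

	fmt.Println("never reached: both goroutines are stuck")
}
```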

Below is the event sequence describing what happened:

  1. The first streaming process goroutine, which created the handleDispatcher() goroutine, dispatched a request, then invoked the blocking call Recv() on the client gRPC stream and waited for a long time without receiving anything.
  2. The handleDispatcher() goroutine failed to process the request dispatched above, then tried to pass the error through an unbuffered channel (errCh <- err) and blocked on the send (a non-blocking alternative is sketched after this list).
  3. Because of step 1, the streaming process goroutine couldn't reach the point where it consumes the error channel, so both goroutines were blocked.
  4. As new Tso streaming requests came in, more streaming process goroutines were created, and they enqueued requests into the requests channel (a buffered channel with capacity 10000). Since the handleDispatcher() goroutine, the consumer of the requests channel, was blocked at step 2, no one drained the channel and it eventually hit its maximum capacity. From then on, more and more streaming process goroutines blocked on enqueuing requests into the requests channel.
  5. Eventually, the gRPC server couldn't spawn more streaming process goroutines.
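
Steps 2 and 3 are the classic unbuffered-channel hand-off problem: the sender and the only potential receiver wait on each other forever. A hedged sketch of one generic way to avoid it (not necessarily what PD ended up doing) is to make the error delivery cancellable, for example by bounding the send with the stream's context and/or giving the error channel a capacity of 1; reportError, errCh, and streamCtx below are hypothetical names:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// reportError tries to hand an error to a streaming process goroutine without
// ever blocking the dispatcher forever: the send is bounded by the stream's
// context, and errCh can additionally be buffered (capacity 1) so a single
// failure is not lost even when nobody is receiving yet.
func reportError(streamCtx context.Context, errCh chan<- error, err error) bool {
	select {
	case errCh <- err:
		return true // a streaming process goroutine picked the error up (or it was buffered)
	case <-streamCtx.Done():
		return false // the stream is gone or not listening; give up instead of blocking
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	errCh := make(chan error, 1) // capacity 1: the send below succeeds without a receiver
	delivered := reportError(ctx, errCh, errors.New("forward failed"))
	fmt.Println("delivered:", delivered) // prints: delivered: true
}
```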

Besides the above issue in the TSO forwarding & dispatching framework, there might be several other issues:

  1. All gRPC client streams share the same handleDispatcher() goroutine for the same forwarded host, so if Send() on one gRPC client stream has a problem, the handleDispatcher() goroutine exits and the requests dispatched by all of the gRPC client streams become orphan requests.
  2. When an error happens in the handleDispatcher() goroutine, error handling is slowed down by the blocking stream.Recv() call in the corresponding streaming process goroutine.
  3. After getting the response to the batch request from the TSO microservice, it sends the responses back to the client streams sequentially, which largely negates the benefit of batching (a concurrent alternative is sketched after this list).
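
For the third point, one generic alternative, sketched below with hypothetical names (clientStream, tsoResponse, fanOut) rather than the actual PD code, is to deliver each client's share of the batch result on its own goroutine, so one slow client stream doesn't serialize delivery to all the others:

```go
package main

import (
	"fmt"
	"sync"
)

// clientStream is a hypothetical stand-in for a per-client Tso() server stream.
type clientStream interface {
	Send(resp *tsoResponse) error
}

// tsoResponse is a hypothetical per-client slice of the batched TSO result.
type tsoResponse struct {
	physical, logical int64
}

// fanOut sends each response on its own goroutine; different gRPC server
// streams may be written to concurrently, it is only concurrent Sends on the
// same stream that are unsafe.
func fanOut(streams []clientStream, resps []*tsoResponse) {
	var wg sync.WaitGroup
	for i, s := range streams {
		wg.Add(1)
		go func(s clientStream, resp *tsoResponse) {
			defer wg.Done()
			_ = s.Send(resp) // per-stream error handling elided for brevity
		}(s, resps[i])
	}
	wg.Wait()
}

// printStream is a trivial clientStream implementation to make the sketch runnable.
type printStream struct{ id int }

func (p *printStream) Send(resp *tsoResponse) error {
	fmt.Printf("stream %d got ts %d.%d\n", p.id, resp.physical, resp.logical)
	return nil
}

func main() {
	streams := []clientStream{&printStream{id: 1}, &printStream{id: 2}}
	resps := []*tsoResponse{{physical: 100, logical: 1}, {physical: 100, logical: 2}}
	fanOut(streams, resps)
}
```
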
binshi-bing added the type/enhancement label Jun 1, 2023
binshi-bing self-assigned this Jun 2, 2023
ti-chi-bot closed this as completed in #6572 Jun 9, 2023
ti-chi-bot bot pushed a commit that referenced this issue Jun 9, 2023
… gRPC stream (#6572)

close #6549, ref #6565

Simplify tso proxy implementation by using one forward stream for one grpc.ServerStream.
#6565 is a longer-term solution for both follower batching and the TSO microservice.
It's well implemented but just needs more time to bake, and we need a workable short-term solution for now.

Signed-off-by: Bin Shi <binshi.bing@gmail.com>
rleungx pushed a commit to rleungx/pd that referenced this issue Aug 2, 2023
… gRPC stream (tikv#6572)

close tikv#6549, ref tikv#6565

Simplify tso proxy implementation by using one forward stream for one grpc.ServerStream.
tikv#6565 is a longer-term solution for both follower batching and the TSO microservice.
It's well implemented but just needs more time to bake, and we need a workable short-term solution for now.

Signed-off-by: Bin Shi <binshi.bing@gmail.com>
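
For reference, a hedged sketch of the "one forward stream for one grpc.ServerStream" shape the commit message describes; every type and helper here is a hypothetical simplification, not the actual #6572 diff:

```go
package tsoproxy

// serverStream stands in for the client-facing Tso() gRPC server stream.
type serverStream interface {
	Recv() (*tsoRequest, error)
	Send(*tsoResponse) error
}

// forwardStream stands in for a dedicated client stream to the TSO server.
type forwardStream interface {
	Send(*tsoRequest) error
	Recv() (*tsoResponse, error)
	CloseSend() error
}

// Hypothetical request/response placeholders.
type tsoRequest struct{ count uint32 }
type tsoResponse struct{ physical, logical int64 }

// proxyOneStream serves exactly one client stream over its own forward stream:
// no dispatcher goroutine and no error channel are shared across client
// streams, so a failure or a stuck Recv() only ever affects this one stream.
func proxyOneStream(client serverStream, forward forwardStream) error {
	defer func() { _ = forward.CloseSend() }()
	for {
		req, err := client.Recv() // next request from this client only
		if err != nil {
			return err // io.EOF or a real error: tear down just this stream
		}
		if err := forward.Send(req); err != nil {
			return err
		}
		resp, err := forward.Recv()
		if err != nil {
			return err
		}
		if err := client.Send(resp); err != nil {
			return err
		}
	}
}
```

In the merged change, per the commit message, this per-stream loop effectively lives in the server's Tso() handler, which owns its own forward stream instead of handing requests to a dispatcher shared across streams.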