RawKV cannot recover after one of the tikv-servers restarts #522
Comments
We have encountered a similar issue: after a network jitter, the client timed out when accessing specific key ranges. The client returns the error "loadRegion from PD failed" and the logs show "no available connections". Restarting the client resolves the issue. This problem is particularly frustrating because there is no server-side monitoring to track client-side issues. The client's log output is also very messy, which makes it difficult to analyze the cause. Client-side monitoring of the relevant metrics would improve our ability to identify and troubleshoot such problems. I found that #170 and #555 mention client-side monitoring, but they received no response. I think this is worth doing in order to troubleshoot client anomalies. Do you have plans to move this forward? We can help contribute. |
You can work around this issue by disabling batching:
import "github.com/tikv/client-go/v2/config"

config.UpdateGlobal(func(conf *config.Config) {
	conf.TiKVClient.MaxBatchSize = 0
}) |
And you are always welcome to contribute to client-side monitoring and other components. |
Thanks for your help! |
I don't know why this happens. But after inspection, I believe it is a bug in the internal batch mechanism. I described how to reproduce this bug in the issue description; please check it. p.s. As |
Perhaps the scenario in which we triggered this error was different: in our scenario there was only a network jitter, and the tikv process was not killed. It seems that the root cause of this problem has not been fixed. We are also using chaosmesh for fault injection testing, and we will update this issue if we make progress. |
That's great! |
Version
master, 2807409
Environment
RawKV cluster deployed with TiUP, v6.1.0-alpha
Cluster setup
4 x TiKV + 3 x PD (deployed together with TiKV), 4 x 32C64GB + 500GB cloud SSD. Kingsoft cloud.
Cluster config
Steps to reproduce
What happened
10.2.103.99 is recovered in less than 30 seconds.
But no request is sent to it for minutes.
Even though there are external retries (rawkv_ha.go)
The error log of one of the threads (collected by go.uber.org/multierr across all 10 retries).
Cluster metrics & logs
Clinic: https://clinic.pingcap.com.cn/portal/#/orgs/117/clusters/7106749932299553217
Time: 2022-06-08T15:20:10+08:00 ~ 2022-06-08T15:40:10+08:00
Client log: rawkv_ha.log
Others
This may be a similar issue to #511.