Terminate tikv server instead of killing it when evict leader timeout or is skipped #1432

Closed
BusyJay opened this issue Jun 17, 2021 · 6 comments · Fixed by #1434
Labels: type/feature-request (Categorizes issue as related to a new feature.)

@BusyJay
Contributor

BusyJay commented Jun 17, 2021

Feature Request

Is your feature request related to a problem? Please describe:
Due to tikv/tikv#10353, if a TiKV node still holds leaders when it is shut down gracefully, there is a chance that transactions get corrupted.

Describe the feature you'd like:
TiKV is working on a patch to fix the problem, but for older versions, TiKV must be terminated immediately with SIGKILL if evicting leaders times out or is skipped, to avoid the potential risk. So there are two TODOs:

  1. Set a larger timeout for evicting leaders;
  2. If evicting leaders times out, use systemctl kill -s KILL service to stop TiKV.

Why the feature is needed:
See above.

Describe alternatives you've considered:
Ignore the problem, and accept the risk that upgrading a cluster, especially a large one, can corrupt data.

Teachability, Documentation, Adoption, Migration Strategy:

@BusyJay BusyJay added the type/feature-request Categorizes issue as related to a new feature. label Jun 17, 2021
@AstroProfundis AstroProfundis self-assigned this Jun 18, 2021
@AstroProfundis
Contributor

@BusyJay Could you confirm killing TiKV with SIGKILL is safe?

@AstroProfundis AstroProfundis added this to the v1.5.2 milestone Jun 18, 2021
@BusyJay
Contributor Author

BusyJay commented Jun 18, 2021

Yes, we have tests that randomly kill TiKV. Alternatively, we can make the evict leader timeout infinite, which can be considered safer.

@BusyJay
Contributor Author

BusyJay commented Jun 18, 2021

Due to tikv/tikv#9624, kill -9 may not be safe for all versions of TiKV, especially when Lightning is used. So the best option seems to be setting an infinite timeout for evicting leaders.

@AstroProfundis
Contributor

An infinite timeout is not good in practice: it could block the upgrade process, and the user would have to interrupt the command by hand, leaving the cluster in a mixed state with both old and new versions of components running. The cluster could not recover from that state, because retrying the upgrade command would still block at leader eviction.

We can indeed increase the default timeout for evicting leaders, but that doesn't really solve the problem.

@BusyJay
Contributor Author

BusyJay commented Jun 18, 2021

> We can indeed increase the default timeout for evicting leaders, but that doesn't really solve the problem.

Better than nothing.

@AstroProfundis
Contributor

AstroProfundis commented Jun 18, 2021

OK
