refine tiflash shutdown logic #4291
Signed-off-by: bestwoody <bestwoody@163.com>
Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>
grpc::Status status(static_cast<grpc::StatusCode>(GRPC_STATUS_UNKNOWN), "Consumer exits unexpected, grpc writes failed.");
responderFinish(status);
responder.Finish(status, this);
Is it safe to call responder.Finish multiple times?
I think Finish will not be called multiple times
@@ -308,6 +312,9 @@ void MPPTunnelBase<Writer>::consumerFinish(const String & err_msg, bool need_loc
    send_queue.finish();

    auto rest_work = [this, &err_msg] {
        // it's safe to call it multiple times
        if (finished && consumer_state.errHasSet())
In which case is finished true while consumer_state.errHasSet() is false?
When shutdown is set to true and consumerFinish in sendjob is called, a subsequent writeDone will call consumerFinish again. So making consumerFinish idempotent is safer.
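To illustrate the idempotency argument, here is a minimal sketch, not the actual MPPTunnel code. Only the names consumerFinish and finished come from the diff; the class TunnelSketch, its mutex, and the accessors are hypothetical and exist only to show that a second call becomes a no-op.

```cpp
#include <mutex>
#include <string>

// Hypothetical stand-in for the tunnel. Only `consumerFinish` and `finished`
// are names taken from the PR; everything else is illustrative.
class TunnelSketch
{
public:
    // Safe to call any number of times: only the first call does the work.
    void consumerFinish(const std::string & err_msg)
    {
        std::lock_guard<std::mutex> lock(mu);
        if (finished)
            return; // a later call (for example from writeDone) is a no-op
        finished = true;
        first_error = err_msg;
        ++finish_count;
    }

    int finishCount() const
    {
        std::lock_guard<std::mutex> lock(mu);
        return finish_count;
    }

    std::string firstError() const
    {
        std::lock_guard<std::mutex> lock(mu);
        return first_error;
    }

private:
    mutable std::mutex mu;
    bool finished = false;
    std::string first_error;
    int finish_count = 0; // counts how many times the real work ran
};
```

With this shape, the shutdown path and the writeDone path can both call consumerFinish without coordinating, and the first error message is the one that is kept.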
dbms/src/Server/Server.cpp
Outdated
*is_shutdown = true;
// Wait all existed MPPTunnels done to prevent crash.
// If all existed MPPTunnels are done, almost in all cases it means all existed MPPTasks and ExchangeReceivers are also done.
while (GET_METRIC(tiflash_object_count, type_count_of_mpptunnel).Value() >= 1)
Will it block the shutdown forever if some MPPTunnels leaked?
Mostly no. Normally our cluster tools (such as tiup) will run kill -9 after a timeout (a few seconds).
hard limit added.
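A wait loop with a hard limit could look roughly like the sketch below. It is an assumption about the shape of the fix, not the code merged in Server.cpp: the function waitForTunnels, the live_tunnels counter (standing in for the tiflash_object_count metric), and the polling interval are all hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper: polls a live-object counter until it reaches zero or
// the hard limit expires. Returns true if all tunnels finished in time.
inline bool waitForTunnels(
    const std::atomic<int> & live_tunnels,
    std::chrono::milliseconds hard_limit,
    std::chrono::milliseconds poll_interval = std::chrono::milliseconds(10))
{
    const auto deadline = std::chrono::steady_clock::now() + hard_limit;
    while (live_tunnels.load() >= 1)
    {
        // Give up once the hard limit is reached, so a leaked tunnel
        // cannot block shutdown forever.
        if (std::chrono::steady_clock::now() >= deadline)
            return false;
        std::this_thread::sleep_for(poll_interval);
    }
    return true;
}
```

The return value lets the caller log whether shutdown proceeded because every tunnel finished or because the limit was hit.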
LOG_FMT_INFO(log, "Begin to shut down flash grpc server");
flash_grpc_server->Shutdown(deadline);
flash_grpc_server->Shutdown();
*is_shutdown = true;
Compared to the previous version, why set is_shutdown after flash_grpc_server->Shutdown() now?
If flash_grpc_server->Shutdown() were called later, the canceled queries would cause clients (such as TiDB and TiFlash) to send a lot of retry queries, and those retried queries would be accepted as long as flash_grpc_server->Shutdown() had not been called.
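The ordering argument can be sketched with a toy model. FakeServer below is a hypothetical mock, not the gRPC server API: it only models "stop accepting new calls", which is the one property the ordering relies on. shutdownInOrder is likewise an illustrative name.

```cpp
#include <atomic>

// Hypothetical mock of a server front end; it is NOT grpc::Server.
struct FakeServer
{
    std::atomic<bool> accepting{true};
    std::atomic<int> accepted{0};

    void Shutdown() { accepting = false; }

    // Returns true if the (retried) query was admitted.
    bool tryAccept()
    {
        if (!accepting.load())
            return false;
        ++accepted;
        return true;
    }
};

// Stop accepting new calls first, then raise the flag that cancels
// in-flight work. Retries triggered by the cancellation are rejected.
inline void shutdownInOrder(FakeServer & server, std::atomic<bool> & is_shutdown)
{
    server.Shutdown();
    is_shutdown = true;
}
```

If the two statements in shutdownInOrder were swapped, a retry arriving between them would still be admitted, which is the situation the reply describes.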
    }

private:
    std::promise<String> promise;
    std::shared_future<String> future;
    std::atomic<bool> err_has_set{false};
I think err_has_set is always accessed under the lock's protection, so why does it still need to be an atomic variable?
For future usage, so that we won't make the mistake of forgetting to take a lock when checking the flag externally. Since it's an independent subclass, it needs to protect itself.
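A simplified sketch of a state object that protects itself is shown below. It is an assumption-laden illustration: the class name ConsumerStateSketch, setError, and getError are hypothetical, and the promise/future pair from the diff is replaced with a mutex-guarded string to keep the example short. Only err_has_set and errHasSet mirror names from the PR.

```cpp
#include <atomic>
#include <mutex>
#include <string>

// Hypothetical, simplified version of the consumer state.
class ConsumerStateSketch
{
public:
    // Records the message at most once; returns false on later calls.
    bool setError(const std::string & msg)
    {
        std::lock_guard<std::mutex> lock(mu);
        if (err_has_set.load())
            return false;
        err_msg = msg;
        err_has_set.store(true); // published only after the message is written
        return true;
    }

    // Safe to call without any external lock because the flag is atomic.
    bool errHasSet() const { return err_has_set.load(); }

    std::string getError() const
    {
        std::lock_guard<std::mutex> lock(mu);
        return err_msg;
    }

private:
    mutable std::mutex mu;
    std::string err_msg;
    std::atomic<bool> err_has_set{false};
};
```

Callers that already hold the tunnel's lock pay almost nothing for the atomic, while a future caller that forgets the lock still reads a well-defined value.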
LGTM
/merge
This pull request has been accepted and is ready to merge. Commit hash: c1bb01f
/run-all-tests
* 1. add metrics of calldata & mpptunnel 2. refine shutdown logic
  Signed-off-by: bestwoody <bestwoody@163.com>
* update
* Apply suggestions from code review
  Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>
* Update dbms/src/Flash/EstablishCall.cpp
  Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>
* add hard limit to wait
  Signed-off-by: bestwoody <bestwoody@163.com>
* fix
  Signed-off-by: bestwoody <bestwoody@163.com>
  Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>
Signed-off-by: bestwoody <bestwoody@163.com>
What problem does this PR solve?
Issue Number: close #4262
Problem Summary:
What is changed and how it works?
1. Refine the TiFlash shutdown logic to prevent a coredump when a query and a shutdown occur at the same time.
2. Add metrics for calldata and mpptunnel.
The purpose is to let existing RPCs and tasks finish before the CQ shuts down.
To achieve that:
is_shut_down=true, to cancel existing RPCs and tasks and wait for them to finish
Check List
Tests
Side effects
Documentation
Release note