Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refine tiflash shutdown logic #4291

Merged
merged 9 commits into from
Mar 16, 2022
Merged

refine tiflash shutdown logic #4291

merged 9 commits into from
Mar 16, 2022

Conversation

bestwoody
Copy link
Contributor

@bestwoody bestwoody commented Mar 15, 2022

Signed-off-by: bestwoody bestwoody@163.com

What problem does this PR solve?

Issue Number: close #4262

Problem Summary:

What is changed and how it works?

1.refine tiflash shutdown logic to prevent coredump when query and shutdown both occur
2.add metrics of calldata and mpptunnel

the purpose is to let existed rpc and tasks done before CQ shutdown.
To achieve that:

  1. shutdown grpc service so that no new rpc request will come.
  2. set is_shut_down=true, to cancel existed rpcs and tasks and wait them done
  3. shut down CQ

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Signed-off-by: bestwoody <bestwoody@163.com>
@ti-chi-bot
Copy link
Member

ti-chi-bot commented Mar 15, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • fuzhe1989
  • windtalker

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 15, 2022
dbms/src/Flash/EstablishCall.cpp Outdated Show resolved Hide resolved
dbms/src/Flash/EstablishCall.cpp Outdated Show resolved Hide resolved
dbms/src/Server/Server.cpp Outdated Show resolved Hide resolved
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 16, 2022
bestwoody and others added 3 commits March 16, 2022 10:51
Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>
Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>
dbms/src/Flash/EstablishCall.cpp Show resolved Hide resolved
grpc::Status status(static_cast<grpc::StatusCode>(GRPC_STATUS_UNKNOWN), "Consumer exits unexpected, grpc writes failed.");
responderFinish(status);
responder.Finish(status, this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it save to call responder.Finish multiple times?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Finish will not be called multiple times

dbms/src/Flash/Mpp/MPPTunnel.cpp Outdated Show resolved Hide resolved
@@ -308,6 +312,9 @@ void MPPTunnelBase<Writer>::consumerFinish(const String & err_msg, bool need_loc
send_queue.finish();

auto rest_work = [this, &err_msg] {
// it's safe to call it multiple times
if (finished && consumer_state.errHasSet())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is the case that finished is true, while consumer_state.errHasSet() is false?

Copy link
Contributor Author

@bestwoody bestwoody Mar 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when shutdown is set true and consumerFinish in sendjob is called, subsequent writeDone will call consumerFinish again. So make consumerFinish idempotenta will be more safe.

*is_shutdown = true;
// Wait all existed MPPTunnels done to prevent crash.
// If all existed MPPTunnels are done, almost in all cases it means all existed MPPTasks and ExchangeReceivers are also done.
while (GET_METRIC(tiflash_object_count, type_count_of_mpptunnel).Value() >= 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it block the shutdown forever if some MPPTunnels leaked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly no. Normally our cluster tools(such as tiup) will kill -9 when time out( a few seconds).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard limit added.

LOG_FMT_INFO(log, "Begin to shut down flash grpc server");
flash_grpc_server->Shutdown(deadline);
flash_grpc_server->Shutdown();
*is_shutdown = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compared to the previous version, why set is_shutdown after flash_grpc_server->Shutdown() now?

Copy link
Contributor Author

@bestwoody bestwoody Mar 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if flash_grpc_server->Shutdown(); is called later, then canceled query will cause client(such as TiDB、TiFlash) send a lot of retry queries, those retried queries will be accept if flash_grpc_server->Shutdown(); is not called

Signed-off-by: bestwoody <bestwoody@163.com>
Signed-off-by: bestwoody <bestwoody@163.com>
}

private:
std::promise<String> promise;
std::shared_future<String> future;
std::atomic<bool> err_has_set{false};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think err_has_set is always be accessed under lock's protection, why still need to be atomic variable?

Copy link
Contributor Author

@bestwoody bestwoody Mar 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for furture usage. So that we won't make a mistake that forget to check the flag externel with a lock. Since it's a indepentdent subclass, it need protect himself.

@ti-chi-bot
Copy link
Member

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.

Copy link
Contributor

@windtalker windtalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 16, 2022
@bestwoody
Copy link
Contributor Author

/merge

@ti-chi-bot
Copy link
Member

@bestwoody: It seems you want to merge this PR, I will help you trigger all the tests:

/run-all-tests

You only need to trigger /merge once, and if the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

If you have any questions about the PR merge process, please refer to pr process.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: c1bb01f

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 16, 2022
@sre-bot
Copy link
Collaborator

sre-bot commented Mar 16, 2022

Coverage for changed files

Filename                                    Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Common/TiFlashMetrics.h                          18                 0   100.00%          11                 0   100.00%          47                 1    97.87%           8                 0   100.00%
Flash/Coprocessor/DAGUtils.cpp                  290               233    19.66%          36                24    33.33%         505               404    20.00%         324               230    29.01%
Flash/EstablishCall.h                             2                 2     0.00%           2                 2     0.00%           2                 2     0.00%           0                 0         -
Flash/Mpp/MPPTunnel.cpp                         185               185     0.00%          16                16     0.00%         241               241     0.00%          94                94     0.00%
Flash/Mpp/MPPTunnel.h                             9                 9     0.00%           9                 9     0.00%          18                18     0.00%           0                 0         -
Functions/FunctionsDateTime.cpp                  27                19    29.63%           4                 2    50.00%          99                25    74.75%          16                13    18.75%
Functions/FunctionsDateTime.h                  1073               638    40.54%         322               209    35.09%        2018              1226    39.25%         534               340    36.33%
Functions/tests/gtest_dayofweekyear.cpp         258                74    71.32%           2                 0   100.00%         135                 0   100.00%          80                42    47.50%
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                          1862              1160    37.70%         402               262    34.83%        3065              1917    37.46%        1056               719    31.91%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
16930      9506             43.85%    190844  96505        49.43%

full coverage report (for internal network access only)

@bestwoody
Copy link
Contributor Author

/run-all-tests

@sre-bot
Copy link
Collaborator

sre-bot commented Mar 16, 2022

Coverage for changed files

Filename                                    Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Common/TiFlashMetrics.h                          18                 0   100.00%          11                 0   100.00%          47                 1    97.87%           8                 0   100.00%
Flash/Coprocessor/DAGUtils.cpp                  290               233    19.66%          36                24    33.33%         505               404    20.00%         324               230    29.01%
Flash/EstablishCall.h                             2                 2     0.00%           2                 2     0.00%           2                 2     0.00%           0                 0         -
Flash/Mpp/MPPTunnel.cpp                         185               185     0.00%          16                16     0.00%         241               241     0.00%          94                94     0.00%
Flash/Mpp/MPPTunnel.h                             9                 9     0.00%           9                 9     0.00%          18                18     0.00%           0                 0         -
Functions/FunctionsDateTime.cpp                  27                19    29.63%           4                 2    50.00%          99                25    74.75%          16                13    18.75%
Functions/FunctionsDateTime.h                  1073               638    40.54%         322               209    35.09%        2018              1226    39.25%         534               340    36.33%
Functions/tests/gtest_dayofweekyear.cpp         258                74    71.32%           2                 0   100.00%         135                 0   100.00%          80                42    47.50%
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                          1862              1160    37.70%         402               262    34.83%        3065              1917    37.46%        1056               719    31.91%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
16930      9506             43.85%    190844  96490        49.44%

full coverage report (for internal network access only)

@bestwoody
Copy link
Contributor Author

/run-all-tests

@sre-bot
Copy link
Collaborator

sre-bot commented Mar 16, 2022

Coverage for changed files

Filename                                    Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Common/TiFlashMetrics.h                          18                 0   100.00%          11                 0   100.00%          47                 1    97.87%           8                 0   100.00%
Flash/Coprocessor/DAGUtils.cpp                  290               233    19.66%          36                24    33.33%         505               404    20.00%         324               230    29.01%
Flash/EstablishCall.h                             2                 2     0.00%           2                 2     0.00%           2                 2     0.00%           0                 0         -
Flash/Mpp/MPPTunnel.cpp                         185               185     0.00%          16                16     0.00%         241               241     0.00%          94                94     0.00%
Flash/Mpp/MPPTunnel.h                             9                 9     0.00%           9                 9     0.00%          18                18     0.00%           0                 0         -
Functions/FunctionsDateTime.cpp                  27                19    29.63%           4                 2    50.00%          99                25    74.75%          16                13    18.75%
Functions/FunctionsDateTime.h                  1073               638    40.54%         322               209    35.09%        2018              1226    39.25%         534               340    36.33%
Functions/tests/gtest_dayofweekyear.cpp         258                74    71.32%           2                 0   100.00%         135                 0   100.00%          80                42    47.50%
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                                          1862              1160    37.70%         402               262    34.83%        3065              1917    37.46%        1056               719    31.91%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
16930      9506             43.85%    190844  96495        49.44%

full coverage report (for internal network access only)

@bestwoody bestwoody merged commit dfed057 into pingcap:master Mar 16, 2022
@JaySon-Huang JaySon-Huang deleted the grace_shutdown branch March 16, 2022 07:21
JaySon-Huang pushed a commit to JaySon-Huang/tiflash that referenced this pull request Mar 17, 2022
* 1.add metrics of calldata&mpptunnel 2.refine shutdown logic

Signed-off-by: bestwoody <bestwoody@163.com>

* update

* Apply suggestions from code review

Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>

* Update dbms/src/Flash/EstablishCall.cpp

Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>

* add harm limit to wait

Signed-off-by: bestwoody <bestwoody@163.com>

* fix

Signed-off-by: bestwoody <bestwoody@163.com>

Co-authored-by: Fu Zhe <fuzhe1989@gmail.com>
@sre-bot
Copy link
Collaborator

sre-bot commented Mar 18, 2022

Coverage for changed files

Filename                      Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Common/TiFlashMetrics.h            18                 0   100.00%          11                 0   100.00%          47                 1    97.87%           8                 0   100.00%
Flash/EstablishCall.h               2                 2     0.00%           2                 2     0.00%           2                 2     0.00%           0                 0         -
Flash/Mpp/MPPTunnel.cpp           185               185     0.00%          16                16     0.00%         241               241     0.00%          94                94     0.00%
Flash/Mpp/MPPTunnel.h               9                 9     0.00%           9                 9     0.00%          18                18     0.00%           0                 0         -
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                             214               196     8.41%          38                27    28.95%         308               262    14.94%         102                94     7.84%

Coverage summary

Functions  MissedFunctions  Executed  Lines   MissedLines  Cover
16930      9506             43.85%    190844  96504        49.43%

full coverage report (for internal network access only)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tiflash produce core file when tiup restart tiflash node
5 participants