disttask: refine scheduler error handling #47313

Merged on Oct 12, 2023 (35 commits)

@ywqzzy ywqzzy commented Sep 27, 2023

What problem does this PR solve?

Issue Number: ref #46258

Problem Summary:
When TiDB shuts down gracefully, the dist task fails.
When there is a network partition between TiDB and TiKV, the dist task fails because of a retryable error.

What is changed and how it works?

  1. Use backoff for the TiKV-related functions in the scheduler.
  2. Check whether an error is retryable; if it is, don't mark the subtask as failed.
  3. Use cancelCauseFunc to cancel subtasks, so that when TiDB shuts down gracefully the subtask is not marked as canceled.

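// markTaskCancelOrFailed persists the scheduler's recorded error, if any, to the subtask.
// It returns true when an error was present and has been handled.
func (s *BaseScheduler) markTaskCancelOrFailed(ctx context.Context, subtask *proto.Subtask) bool {
	if err := s.getError(); err != nil {
		if ctx.Err() != nil && context.Cause(ctx).Error() == "cancel subtasks" {
			// Explicit cancellation (cancel cause is "cancel subtasks"): mark the subtask as canceled.
			logutil.Logger(s.logCtx).Warn("subtask canceled", zap.Error(err))
			s.updateSubtaskStateAndError(subtask, proto.TaskStateCanceled, nil)
		} else if common.IsRetryableError(err) {
			// Retryable error: only log it, so the subtask is not marked as failed.
			logutil.Logger(s.logCtx).Warn("met retryable error", zap.Error(err))
		} else if errors.Cause(err) != context.Canceled {
			// Any other non-cancellation error: mark the subtask as failed.
			logutil.Logger(s.logCtx).Warn("subtask failed", zap.Error(err))
			s.updateSubtaskStateAndError(subtask, proto.TaskStateFailed, err)
		} else {
			// context.Canceled without the explicit cause: graceful shutdown, leave the state unchanged.
			logutil.Logger(s.logCtx).Warn("met context canceled for gracefully shutdown", zap.Error(err))
		}
		s.markErrorHandled()
		return true
	}
	return false
}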

We use the function above to persist the error to the subtask.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.
(image) HA test "tidb to all network partition" passed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Sep 27, 2023

ti-chi-bot bot commented Sep 27, 2023

Hi @ywqzzy. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 27, 2023

tiprow bot commented Sep 27, 2023

Hi @ywqzzy. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.


ywqzzy commented Sep 27, 2023

/label ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test Indicates a PR is ready to be tested. label Sep 27, 2023

ywqzzy commented Sep 27, 2023

/test all

@ti-chi-bot ti-chi-bot bot removed the needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. label Sep 27, 2023

codecov bot commented Sep 27, 2023

Codecov Report

Merging #47313 (4d5501a) into master (0fd232f) will increase coverage by 0.7698%.
The diff coverage is 82.3008%.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #47313        +/-   ##
================================================
+ Coverage   71.9864%   72.7562%   +0.7698%     
================================================
  Files          1353       1375        +22     
  Lines        401002     407418      +6416     
================================================
+ Hits         288667     296422      +7755     
+ Misses        92965      92204       -761     
+ Partials      19370      18792       -578     
Flag          Coverage Δ
integration   40.1958% <0.8849%> (?)
unit          72.0036% <82.3008%> (+0.0172%) ⬆️

Flags with carried forward coverage won't be shown.

Components    Coverage Δ
dumpling      53.9913% <ø> (ø)
parser        84.7472% <ø> (+0.0107%) ⬆️
br            49.0100% <ø> (-4.3363%) ⬇️

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 8, 2023
@ywqzzy ywqzzy requested a review from D3Hunter October 10, 2023 09:29

ywqzzy commented Oct 10, 2023

/retest

@@ -168,6 +186,10 @@ func (s *BaseScheduler) run(ctx context.Context, task *proto.Task) error {
proto.TaskStatePending, proto.TaskStateRunning)
if err != nil {
s.onError(err)
Member

Do we need to record the error if it is retryable?

Contributor Author

> Do we need to record the error if it is retryable?

No. Recording it would make the dispatcher logic much more complicated, because the dispatcher would need to distinguish the error type from a string.
I think we can just check whether the error is retryable in the scheduler.

@@ -67,7 +66,7 @@ type Manager struct {
sync.RWMutex
// taskID -> cancelFunc.
// cancelFunc is used to fast cancel the scheduler.Run.
Collaborator

Is the comment out of date?

Contributor Author

Yes, I will fix it.

Collaborator

@Benjamin2037 Benjamin2037 left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Oct 11, 2023
Contributor

@D3Hunter D3Hunter left a comment

rest lgtm

subtask.State = proto.TaskStateSucceed
metrics.IncDistTaskSubTaskCnt(subtask)
}

func (s *BaseScheduler) markTaskCancelOrFailed(ctx context.Context, subtask *proto.Subtask) bool {
Contributor

Suggested change
- func (s *BaseScheduler) markTaskCancelOrFailed(ctx context.Context, subtask *proto.Subtask) bool {
+ func (s *BaseScheduler) markSubtaskCancelOrFailed(ctx context.Context, subtask *proto.Subtask) bool {

@D3Hunter
Contributor

/lgtm

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Oct 12, 2023

ti-chi-bot bot commented Oct 12, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-10-11 08:03:33.247311794 +0000 UTC m=+1212210.834421925: ☑️ agreed by Benjamin2037.
  • 2023-10-12 04:37:54.838636497 +0000 UTC m=+1286272.425746654: ☑️ agreed by D3Hunter.


ti-chi-bot bot commented Oct 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Benjamin2037, D3Hunter

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Oct 12, 2023

tiprow bot commented Oct 12, 2023

@ywqzzy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name          Commit    Details   Required   Rerun command
tiprow_fast_test   4d5501a   link      true       /test tiprow_fast_test


Labels
approved lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
4 participants