-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs: retry non-cancelable running and all reverting jobs #69300
Conversation
8b90ca6
to
7a4a414
Compare
7a4a414
to
6fb6bd7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pay attention to the tests that I have modified. I have tried to make minimal changes and not beat the purpose of the tests, but I might have overlooked some cases.
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this change. It seems to have changed the behavior from always retrying non-cancelable jobs to always retrying revertible jobs. Those are not the same thing. We want to always retry non-cancelable or reverting jobs. One way to unify them is to say that jobs which are reverting are implicitly non-cancelable.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sajjadrizvi)
pkg/jobs/jobs_test.go, line 3211 at r1 (raw file):
// TestNonCancelableJobsRetry tests that a non-cancelable job is retried when // failed with a non-retryable error. func TestNonCancelableJobsRetry(t *testing.T) {
I thought that our previous intention was that non-cancelable jobs will retry unless they fail with an error which is marked as permanent. This first error does not seem to be marked as permanent. Why are we not retrying it? Was that logic only for the OnFailOrCancel
error? That's a bug. I'm also confused by the renaming here. Isn't the point to exercise NonCancelable
? If NonCancelable
only affect retry behavior of OnFailOrCancel
then it doesn't affect the retry behavior of anything anymore.
pkg/jobs/registry.go, line 1294 at r1 (raw file):
} // A non-cancelable job is always retried while reverting unless the error is marked as permanent. if job.Payload().Noncancelable && !IsPermanentJobError(err) {
why is this behavior gone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to always retry non-cancelable or reverting jobs.
I think I missed a point here. I thought we want to retry non-cancelable jobs forever only when they are reverting
. So I have not modified anything for the running state. So, is this what we want: In running state, retry non-cancelable jobs if their error is not marked as a permanent error. In reverting state, retry all jobs regardless of the error type and their non-cancelable status
. Is this correct?
One way to unify them is to say that jobs which are reverting are implicitly non-cancelable.
This is what I intended to do in this PR. Instead of always retrying only non-cancelable jobs, we retry all jobs that fail in their reverting state.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/jobs/jobs_test.go, line 3211 at r1 (raw file):
I thought that our previous intention was that non-cancelable jobs will retry unless they fail with an error which is marked as permanent.
Does it include running
state as well? I thought we want this only for reverting state.
This first error does not seem to be marked as permanent. Why are we not retrying it?
Are you talking about the running
state? I thought we don't want to modify the behaviour in running
state. I think it is retried in the reverting
state. Otherwise, the test should not pass.
pkg/jobs/registry.go, line 1294 at r1 (raw file):
Previously, ajwerner wrote…
why is this behavior gone?
Because we don't have to check error type or non-cancelable status to decide to retry. Don't we want to retry all jobs that are reverting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I missed a point here. I thought we want to retry non-cancelable jobs forever only when they are reverting. So I have not modified anything for the running state. So, is this what we want: In running state, retry non-cancelable jobs if their error is not marked as a permanent error. In reverting state, retry all jobs regardless of the error type and their non-cancelable status. Is this correct?
Yes, this is correct.
This is what I intended to do in this PR. Instead of always retrying only non-cancelable jobs, we retry all jobs that fail in their reverting state.
Instead of always retrying only non-cancelable jobs, we retry all jobs that fail in their reverting state.
and One way to unify them is to say that jobs which are reverting are implicitly non-cancelable.
sounds to me like they are in conflict. One says the union of non-cancelable and reverting, the other says, just reverting.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @sajjadrizvi)
pkg/jobs/jobs_test.go, line 3211 at r1 (raw file):
Previously, sajjadrizvi (Sajjad Rizvi) wrote…
I thought that our previous intention was that non-cancelable jobs will retry unless they fail with an error which is marked as permanent.
Does it include
running
state as well? I thought we want this only for reverting state.This first error does not seem to be marked as permanent. Why are we not retrying it?
Are you talking about the
running
state? I thought we don't want to modify the behaviour inrunning
state. I think it is retried in thereverting
state. Otherwise, the test should not pass.
We want to retry all non-cancelable jobs even if they are in the running state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is correct.
OK, let me update it.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/jobs/jobs_test.go, line 3211 at r1 (raw file):
Previously, ajwerner wrote…
We want to retry all non-cancelable jobs even if they are in the running state.
I am now adding it.
2933faa
to
ac1a9b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r2, 4 of 5 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @sajjadrizvi)
pkg/jobs/registry.go, line 1218 at r2 (raw file):
Quoted 6 lines of code…
// TODO(spaskob): enforce a limit on retries. if errors.Is(err, retryJobErrorSentinel) { jm.ResumeRetryError.Inc(1) return errors.Errorf("job %d: %s: restarting in background", job.ID(), err) } if job.Payload().Noncancelable && !IsPermanentJobError(err) {
Let's do this in a single conditional:
if nonCancelableRetry := job.Payload().Noncancelable && !IsPermanentJobError(err); nonCancelableRetry ||
errors.Is(err, retryJobErrorSentinel) {
if nonCancelableRetry {
err = errors.Wrap(err, "non-cancelable")
}
jm.ResumeRetryError.Inc(1)
// ...
}
pkg/jobs/testing_knobs.go, line 73 at r3 (raw file):
// ModifyErrorAfterOnFailOrCancel captures the error returned from OnFailOorCancel // and sets the error to the returned value of this function. ModifyErrorAfterOnFailOrCancel func(jobspb.JobID, error) error
I'm skeptical of this knob mostly because all the places it seems to be used it is doing something to an error that was already injected by that test. That has a code smell. Please instead adjust those tests to make sense in light of the behavior change. That might mean that instead of looking for a reverting failed state or something like that that instead they just confirm that retries happen a few times and then just finish. Jobs don't need to all have exited before the test concludes.
pkg/sql/schema_changer_test.go, line 6377 at r3 (raw file):
// back a schema change causes the job to fail, and that the appropriate error // is displayed in the jobs table. func TestPermanentErrorDuringRollback(t *testing.T) {
It seems like this test should be renamed to something like
pkg/sql/schema_changer_test.go, line 6383 at r3 (raw file):
var sqlDB *gosql.DB validateError := func(jobID jobspb.JobID, err error) error {
this does not feel well motivated. It seems to change the control flow of the test, which is already responsible for injecting these errors. Layering another level of filtering here when the error came from the test in the first place makes this feel crufty. If you feel like you must do this, then please explain it with ample commentary. It is important that you digest what this test was exercising and modify it in light of the behavior change in the jobs layer.
pkg/sql/schema_changer_test.go, line 6449 at r3 (raw file):
}, }, // Decrease the adopt-loop interval so that retries happen quickly.
I don't have strong feelings here but this feels odd to me.
ac1a9b9
to
2cc0e2b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/jobs/registry.go, line 1218 at r2 (raw file):
Previously, ajwerner wrote…
// TODO(spaskob): enforce a limit on retries. if errors.Is(err, retryJobErrorSentinel) { jm.ResumeRetryError.Inc(1) return errors.Errorf("job %d: %s: restarting in background", job.ID(), err) } if job.Payload().Noncancelable && !IsPermanentJobError(err) {
Let's do this in a single conditional:
if nonCancelableRetry := job.Payload().Noncancelable && !IsPermanentJobError(err); nonCancelableRetry || errors.Is(err, retryJobErrorSentinel) { if nonCancelableRetry { err = errors.Wrap(err, "non-cancelable") } jm.ResumeRetryError.Inc(1) // ... } </blockquote></details> I have updated the code. ___ *[pkg/jobs/testing_knobs.go, line 73 at r3](https://reviewable.io/reviews/cockroachdb/cockroach/69300#-Mi1g7FQ-q0cH04lozDO:-Mi7wVzb4_KjpuWkZPgG:bz6pk6l) ([raw file](https://github.com/cockroachdb/cockroach/blob/ac1a9b92691ae944863a9224c715a0ea4fc67580/pkg/jobs/testing_knobs.go#L73)):* <details><summary><i>Previously, ajwerner wrote…</i></summary><blockquote> I'm skeptical of this knob mostly because all the places it seems to be used it is doing something to an error that was already injected by that test. That has a code smell. Please instead adjust those tests to make sense in light of the behavior change. That might mean that instead of looking for a reverting failed state or something like that that instead they just confirm that retries happen a few times and then just finish. Jobs don't need to all have exited before the test concludes. </blockquote></details> Thanks for the suggestion. My intention was to use the knob only to terminate the test and not retry jobs completely. ___ *[pkg/sql/schema_changer_test.go, line 6377 at r3](https://reviewable.io/reviews/cockroachdb/cockroach/69300#-Mi1eVrS6BcGWYJihEXf:-Mi7wrdJ9s-eUH8mAKlS:b-ij0b03) ([raw file](https://github.com/cockroachdb/cockroach/blob/ac1a9b92691ae944863a9224c715a0ea4fc67580/pkg/sql/schema_changer_test.go#L6377)):* <details><summary><i>Previously, ajwerner wrote…</i></summary><blockquote> It seems like this test should be renamed to something like </blockquote></details> Like?? How about `TestRevertingWhileRollingBack`? ___ *[pkg/sql/schema_changer_test.go, line 6383 at r3](https://reviewable.io/reviews/cockroachdb/cockroach/69300#-Mi1ffVB8RYTr4iVoq10:-Mi7vsya8Bn9ahNzI7-4:bstpbs9) ([raw file](https://github.com/cockroachdb/cockroach/blob/ac1a9b92691ae944863a9224c715a0ea4fc67580/pkg/sql/schema_changer_test.go#L6383)):* <details><summary><i>Previously, ajwerner wrote…</i></summary><blockquote> this does not feel well motivated. It seems to change the control flow of the test, which is already responsible for injecting these errors. Layering another level of filtering here when the error came from the test in the first place makes this feel crufty. If you feel like you *must* do this, then please explain it with ample commentary. It is important that you digest what this test was exercising and modify it in light of the behavior change in the jobs layer. </blockquote></details> Thanks for pointing it out. I have removed the testing knob and related code. <!-- Sent from Reviewable.io -->
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/jobs/testing_knobs.go, line 73 at r3 (raw file):
Previously, sajjadrizvi (Sajjad Rizvi) wrote…
Thanks for the suggestion. My intention was to use the knob only to terminate the test and not retry jobs completely.
I mean, not retry jobs forever.
a03cf31
to
c8477c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestRegistryLifecycle/dump_traces_on_cancel
is getting failed in stress tests because multiple resumers enter OnFailOrCancel
. It will be fixed once the PR solving this problem is merged in the main branch.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
4df154b
to
df96610
Compare
df96610
to
e8dac3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @sajjadrizvi)
pkg/jobs/jobs_test.go, line 315 at r7 (raw file):
// TODO(sajjad): To discuss: This success function is not called from Registry // when the resumer completes without error. Instead, it is being called from // Resume itself. Is this on purpose or is it a bug?
I think you could call it a bug. It has to do with the fact that there used to be a "best effort" Success hook but it was fundamentally a broken idea. When somebody refactored that hook away, they didn't want to delete these tests. Their approach was to call this function when Resume returns with no error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/jobs/jobs_test.go, line 315 at r7 (raw file):
Previously, ajwerner wrote…
I think you could call it a bug. It has to do with the fact that there used to be a "best effort" Success hook but it was fundamentally a broken idea. When somebody refactored that hook away, they didn't want to delete these tests. Their approach was to call this function when Resume returns with no error.
OK, I will file an issue to remove this. This is a little confusing.
e4b1238
to
0abbca2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajwerner PHAL! Thanks.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
0abbca2
to
ad89af0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite close.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @sajjadrizvi)
pkg/jobs/jobs_test.go, line 804 at r10 (raw file):
// The job will stay in reverting state unless it succeeds reverting. So we don't have // to exit clean.
uber-nit: cleanly
.
Also, why not just have it successfully revert and then move to failed as a clean ending of the test?
pkg/jobs/jobs_test.go, line 893 at r10 (raw file):
rts.check(t, jobs.StatusReverting) require.True(t, triedToMarkFailed.Load().(bool)) // The job is now in state reverting and will never resume again.
I don't think repeating this comment is helpful.
pkg/jobs/jobs_test.go, line 897 at r10 (raw file):
rts.failOrCancelCh <- nil rts.mu.e.OnFailOrCancelExit = true // The job will be retried again, However, we don't have to continue the test.
why not just let the thing succeed in reverting here after one failure?
pkg/jobs/testing_knobs.go, line 73 at r10 (raw file):
// ModifyErrorAfterOnFailOrCancel captures the error returned from OnFailOorCancel // and sets the error to the returned value of this function. ModifyErrorAfterOnFailOrCancel func(jobspb.JobID, error) error
is this being used anymore?
pkg/sql/schema_changer_test.go, line 6377 at r3 (raw file):
Previously, sajjadrizvi (Sajjad Rizvi) wrote…
Like?? How about
TestRevertingWhileRollingBack
?
TestRetryOnAllErrorsWhenReverting
?
pkg/sql/schema_changer_test.go, line 6979 at r10 (raw file):
go func() { _, _ = db.Exec(tc.scStmt)
note that this will not return until server shutdown
pkg/sql/table.go, line 156 at r10 (raw file):
// The job should be cancelable when we are adding a table that doesn't // have mutations, e.g., in CREATE TABLE AS VALUES. NonCancelable: mutationID == descpb.InvalidMutationID && !tableDesc.Adding(),
this change is important and deserves its own test. That test should create a CREATE TABLE ... AS
job and validate that it is not marked as NonCancelable
despite having an InvalidMutationID
. You can leverage a hook to do this. Ideally this change and test will be a separate commit.
pkg/sql/type_change_test.go, line 228 at r10 (raw file):
//tdb := sqlutils.MakeSQLRunner(sqlDB) //tdb.CheckQueryResultsRetry(t)
what is this about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/jobs/jobs_test.go, line 804 at r10 (raw file):
Previously, ajwerner wrote…
uber-nit:
cleanly
.Also, why not just have it successfully revert and then move to failed as a clean ending of the test?
Yeah, I should do that. I have updated accordingly.
pkg/jobs/jobs_test.go, line 893 at r10 (raw file):
Previously, ajwerner wrote…
I don't think repeating this comment is helpful.
OK, I have removed the comment.
pkg/jobs/jobs_test.go, line 897 at r10 (raw file):
Previously, ajwerner wrote…
why not just let the thing succeed in reverting here after one failure?
yeah, it should be like that! I have updated the test accordingly.
pkg/jobs/testing_knobs.go, line 73 at r10 (raw file):
Previously, ajwerner wrote…
is this being used anymore?
No, I have now removed this knob.
pkg/sql/schema_changer_test.go, line 6377 at r3 (raw file):
Previously, ajwerner wrote…
TestRetryOnAllErrorsWhenReverting
?
Done.
pkg/sql/table.go, line 156 at r10 (raw file):
Previously, ajwerner wrote…
this change is important and deserves its own test. That test should create a
CREATE TABLE ... AS
job and validate that it is not marked asNonCancelable
despite having anInvalidMutationID
. You can leverage a hook to do this. Ideally this change and test will be a separate commit.
I have added a test. The change is already in another commit.
ad89af0
to
94e4779
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 10 files at r9, 4 of 8 files at r12, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @sajjadrizvi)
pkg/sql/schema_changer_test.go, line 2154 at r12 (raw file):
const chunkSize = 200 attempts := 0 //syncChan := make(chan struct{})
?
This is a small fix to make the jobs cancelable that are created when we add a table without mutations, e.g., in `CREATE TABLE foo (x PRIMARY KEY) AS VALUES (1)`. Release justification: bug fixes and low-risk updates Release note: None
Previously, non-cancelable jobs were retried in running state only if their errors were marked as retryable. Moreover, only non-cancelable reverting jobs were retried by default. This commit makes non-cancelable jobs always retry in running state unless their error is marked as a permanent error. In addition, this commit makes all reverting jobs to retry when they fail. As a result, non-cancelable running jobs and all reverting jobs do not fail due to transient errors. Release justification: low-risk updates to new functionality. Release note (general change): Non-cancelable jobs, such as schema-change GC jobs, now do not fail unless they fail with a permanent error. They retry with exponential-backoff if they fail due to a transient error. Furthermore, Jobs that perform reverting tasks do not fail. Instead, they are retried with exponential-backoff if an error is encountered while reverting. As a result, transient errors do not impact jobs that are reverting. Fixes: cockroachdb#66685
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks much for the reviews. I appreciate your thorough review and feedback.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/sql/schema_changer_test.go, line 2154 at r12 (raw file):
Previously, ajwerner wrote…
?
Sorry, I forgot to remove it.
94e4779
to
d7ab27e
Compare
TFTR! bors r=ajwerner |
Build succeeded: |
Previously, non-cancelable jobs were retried in running state
only if their errors were marked as retryable. Moreover, only
non-cancelable reverting jobs were retried by default. This
commit makes non-cancelable jobs always retry in running
state unless their error is marked as a permanent error. In
addition, this commit makes all reverting jobs to retry when
they fail. As a result, non-cancelable running jobs and all
reverting jobs do not fail due to transient errors.
Release justification: low-risk updates to new functionality.
Release note (general change): Non-cancelable jobs, such as
schema-change GC jobs, now do not fail unless they fail with
a permanent error. They retry with exponential-backoff if
they fail due to a transient error. Furthermore, Jobs that
perform reverting tasks do not fail. Instead, they are retried
with exponential-backoff if an error is encountered while
reverting. As a result, transient errors do not impact the jobs
that are reverting.
Fixes: #66685