backupccl: validate descriptors during restore #91250
Conversation
Force-pushed from ba284a2 to 87de130, then from 87de130 to 2bda3d5.
Force-pushed from c9a5a1f to b642383.
Nice to see this moving along!
Force-pushed from b642383 to 49ae84a.
@fqazi a few questions for you:
Force-pushed from ec39b62 to cf4a7d3.
Once the latest TC run finishes (I added a commit that doesn't filter dropped tables), you'll see a flurry of failures. Note that some fail when the restore rolls back, leading to an unclear error msg, e.g. "table [104] not found".
So, what's happening here is that you shouldn't have a namespace entry for dropped descriptors. Just adjust the backuputils validation logic to not add one.
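For concreteness, a minimal Go sketch of that suggestion; the `desc` and `namespaceEntry` types here are illustrative stand-ins, not the actual backuputils API:

```go
package main

import "fmt"

// desc is an illustrative stand-in for a catalog descriptor.
type desc struct {
	ID      int64
	Name    string
	Dropped bool
}

// namespaceEntry mirrors the name->ID rows the test utils synthesize.
type namespaceEntry struct {
	Name string
	ID   int64
}

func main() {
	descs := []desc{
		{ID: 104, Name: "t", Dropped: true},  // dropped: must be skipped
		{ID: 105, Name: "u", Dropped: false}, // live: gets an entry
	}
	var namespace []namespaceEntry
	for _, d := range descs {
		if d.Dropped {
			continue // dropped descriptors must not have namespace entries
		}
		namespace = append(namespace, namespaceEntry{Name: d.Name, ID: d.ID})
	}
	fmt.Println(namespace) // only the live descriptor appears: [{u 105}]
}
```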
This patch checks that the restored descriptors are valid in three places during the job.

First, during planning, to ensure no existing descriptor in the cluster is invalid.

Second, as soon as the descriptors are written in an offline state before the restoration flow begins. This primarily ensures that the descriptors in the backup were in a valid state, allowing the restore to fail fast.

Third, right after the descriptors return online and before the job begins, which ensures the restore created and modified the descriptors correctly.

Note that this validation step does not check job references in the descriptors for 2 reasons: 1) it would require loading the whole jobs table into memory; 2) during the second check, the offline written descriptors will have invalid job references during a cluster restore, as the jobs table isn't properly populated yet.

If the customer needs to proceed with the restore, they can prevent the restore from failing by passing the skip_validation_check flag.

Fixes cockroachdb#84757

Release note: this patch introduces checks during restore which cause the restore to fail if there exist invalid descriptors in the restoring cluster. The user can prevent the restore from failing by passing the skip_validation_check flag.
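To make the three checkpoints concrete, here is a minimal, self-contained Go sketch; `Descriptor`, `validate`, the phase names, and the plumbing of the skip_validation_check flag are all illustrative assumptions, not the actual backupccl code:

```go
package main

import "fmt"

// Descriptor is an illustrative stand-in for catalog.Descriptor.
type Descriptor struct {
	ID    int64
	Valid bool
}

func (d Descriptor) Validate() error {
	if !d.Valid {
		return fmt.Errorf("descriptor %d is invalid", d.ID)
	}
	return nil
}

// validate fails the restore at a named checkpoint unless the
// (proposed) skip_validation_check flag was passed.
func validate(phase string, descs []Descriptor, skipValidationCheck bool) error {
	if skipValidationCheck {
		return nil
	}
	for _, d := range descs {
		if err := d.Validate(); err != nil {
			return fmt.Errorf("%s: %w", phase, err)
		}
	}
	return nil
}

func main() {
	existing := []Descriptor{{ID: 52, Valid: true}}
	restored := []Descriptor{{ID: 104, Valid: false}}
	skip := false

	// 1. Planning: existing cluster descriptors must be valid.
	if err := validate("planning", existing, skip); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	// 2. After writing backup descriptors offline: fail fast on bad backups.
	if err := validate("offline write", restored, skip); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	// 3. After descriptors come back online: the restore mutated them correctly.
	if err := validate("online", restored, skip); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	fmt.Println("restore validated")
}
```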
Force-pushed from cf4a7d3 to ea0188b.
Quoted datadriven test hunk:

```
job tag=a wait-for-state=revert-failed
----
```
this test currently fails. After 2 minutes of waiting for the revert-failed state, the job is left in a reverting state. Why? The rollback should fail, because the evil descriptor I created causes dropDescriptors to fail in OnFailOrCancel, which means the job should terminate in the revert-failed state. Instead, the job system seems to retry OnFailOrCancel indefinitely when it fails, which seems like bad behavior:
https://github.com/cockroachdb/cockroach/blob/master/pkg/jobs/registry.go#L1401
I'll file a separate issue.
This behaviour is intentional: #69300
The idea is that, for many jobs, OnFailOrCancel running to completion is required to put the system back into a usable state.
but should this job retry forever? I guess with exponential backoff, the time between retries will become sufficiently large, preventing the exhaustion of jobs resources.
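For intuition, a toy Go sketch of that retry behavior; `onFailOrCancel`, the delays, and the cap are illustrative assumptions, not the actual jobs registry code:

```go
package main

import (
	"fmt"
	"time"
)

// onFailOrCancel stands in for a job's rollback hook; here it always
// fails, like the dropDescriptors failure described above.
func onFailOrCancel() error { return fmt.Errorf("dropDescriptors failed") }

func main() {
	const initialDelay, maxDelay = time.Second, 10 * time.Minute
	delay := initialDelay
	// Retries forever: a permanently failing hook never reaches a
	// terminal state, leaving the job stuck in reverting.
	for attempt := 1; ; attempt++ {
		if err := onFailOrCancel(); err == nil {
			return
		}
		fmt.Printf("attempt %d failed; retrying in %s\n", attempt, delay)
		time.Sleep(delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay // the cap keeps retries sparse but never stops them
		}
	}
}
```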
hm, I'm also starting to think this PR is not really necessary. It seems the SQL schema code validates all descriptors anyway. If I run the unit test I created with …
I've done a bit more digging, and I don't think there's significant additional descriptor validation coverage that this PR adds via …
@stevendanna @fqazi does my analysis above seem sound? TL;DR: I don't think we need all this extra validation. Perhaps I'll keep the test to ensure that the automatic validation does indeed catch invalid descriptor bugs.
@msbutler Yes, that makes it sound unnecessary, unless we can catch things earlier.
thanks for clarifying! closing this PR; I'll open a separate, test-only PR that checks that restore validation does indeed occur automatically.
This patch checks that the restored descriptors are valid in two
places during the job. First, as soon as the descriptors are written in an
offline state before the restoration flow begins. This primarily ensures that
the descriptors in the backup were in a valid state, allowing the restore to
fail fast. Second, right after the descriptors return online and before the job
begins, which ensures the restore created and modified the descriptors
correctly.
Fixes #84757
Release note: None