
backupccl: validate descriptors during restore #91250

Closed

Conversation

@msbutler (Collaborator) commented Nov 3, 2022

This patch checks that the restored descriptors are valid in two
places during the job. First, as soon as the descriptors are written in an
offline state before the restoration flow begins. This primarily ensures that
the descriptors in the backup were in a valid state, allowing the restore to
fail fast. Second, right after the descriptors return online and before the job
begins, which ensures the restore created and modified the descriptors
correctly.

Fixes #84757
Release note: None
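
For orientation, here is a rough sketch of what invoking such a check could look like. doctor.ExamineDescriptors and the doctor table types live in pkg/sql/doctor, but the exact signature and the wrapper below are assumptions for illustration, not this PR's actual code.

import (
	"bytes"
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/sql/doctor"
	"github.com/cockroachdb/errors"
)

// validateWithDoctor is an illustrative wrapper: it runs a doctor-style
// examination over rows the caller has scanned from system.descriptor,
// system.namespace, and system.jobs, and returns an error if any
// descriptor fails validation.
func validateWithDoctor(
	ctx context.Context,
	version clusterversion.ClusterVersion,
	descTable doctor.DescriptorTable,
	namespaceTable doctor.NamespaceTable,
	jobsTable doctor.JobsTable,
) error {
	var report bytes.Buffer
	ok, err := doctor.ExamineDescriptors(
		ctx, version, descTable, namespaceTable, jobsTable,
		false /* validateJobs */, false /* verbose */, &report,
	)
	if err != nil {
		return err
	}
	if !ok {
		return errors.Newf("descriptor validation check failed: %s", report.String())
	}
	return nil
}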

@cockroach-teamcity (Member):

This change is Reviewable

@msbutler msbutler changed the title to backupccl: validate descriptors during restore Nov 4, 2022
@msbutler: This comment was marked as outdated.

@msbutler: This comment was marked as outdated.

@msbutler msbutler force-pushed the butler-desc-valid branch 4 times, most recently from c9a5a1f to b642383 on November 14, 2022 14:52
@msbutler msbutler marked this pull request as ready for review November 14, 2022 14:53
@msbutler msbutler requested review from a team as code owners November 14, 2022 14:53
@stevendanna (Collaborator) left a comment


Nice to see this moving along!

Review threads (outdated, resolved):
pkg/ccl/backupccl/backuputils/utils.go
pkg/ccl/backupccl/backuputils/utils.go
pkg/ccl/backupccl/restore_job.go
pkg/ccl/backupccl/restore_job.go
pkg/ccl/backupccl/restore_planning.go
@msbutler (Collaborator, Author) commented Nov 14, 2022

@fqazi a few questions for you:

  • could you confirm that I'm not misusing doctor.Examine?
  • could you explain why the validation checker would fail on dropped descriptors?
  • any ideas for how to create a backup with an invalid descriptor that would fail these checks?

@fqazi (Collaborator) commented Nov 14, 2022

> @fqazi a few questions for you:
>
>   • could you confirm that I'm not misusing doctor.Examine?
>   • could you explain why the validation checker would fail on dropped descriptors?
>   • any ideas for how to create a backup with an invalid descriptor that would fail these checks?
  1. Assuming that you are able to get a closure of all the descriptors involved in the restore, in the case of partial restores involving a subset of tables. Looking at your code, you're getting the full descriptor set, so the usage looks correct here.
  2. Can you give the exact error we are seeing? Is it linked to a missing namespace entry or something else?
  3. One option is to exploit a bug in an older release when creating the backup. You could also use the JSON protobuf builtins to manipulate a descriptor and add corruption, but you need to disable validation (see the sketch below). I did make a few manifests for the SHOW BACKUP command, not sure if those can be reused.
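
For the pb-to-JSON route, here is a hypothetical sketch of what such a test helper might look like. crdb_internal.pb_to_json, crdb_internal.json_to_pb, and crdb_internal.unsafe_upsert_descriptor are real builtins, but the harness, the '{table,parentId}' path, and the bogus parent ID are illustrative assumptions, not code from this PR.

// corruptDescriptor round-trips a table descriptor through JSON, points it
// at a nonexistent parent database, and force-writes it back past
// validation. sqlDB is a *sqlutils.SQLRunner from the usual test harness.
func corruptDescriptor(t *testing.T, sqlDB *sqlutils.SQLRunner, id int) {
	sqlDB.Exec(t, `
SELECT crdb_internal.unsafe_upsert_descriptor(
    id,
    crdb_internal.json_to_pb(
        'cockroach.sql.sqlbase.Descriptor',
        jsonb_set(
            crdb_internal.pb_to_json('cockroach.sql.sqlbase.Descriptor', descriptor),
            '{table,parentId}',
            '999999'::JSONB
        )
    ),
    true -- force: skip validation so the corrupted bytes can be written
)
FROM system.descriptor WHERE id = $1`, id)
}

A backup taken after calling a helper like this should then carry the invalid descriptor into the restore-side checks.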

@msbutler (Collaborator, Author):

  1. sounds good!

> Can you give the exact error we are seeing? Is it linked to a missing namespace entry or something else?

  2. Once the latest TC run finishes (I added a commit that doesn't filter dropped tables), you'll see a flurry of failures. As an example, running TestBackupAndRestoreJobDescription returns:
pq: the restoring cluster contains invalid descriptors, %s: descriptor validation check failed: Examining 47 descriptors and 47 namespace entries... ParentID 104, ParentSchemaID 105: namespace entry "bank" (106): no matching name info in draining names of dropped relation

Note that some tests fail when the restore rolls back, leading to an unclear error message, e.g. "table [104] not found".

  3. Ideally, the test would generate the corrupt backup, so the backup is from the same release as the restore; I'll go with the pb-to-JSON manipulation route.

@fqazi (Collaborator) commented Nov 14, 2022

> Once the latest TC run finishes (I added a commit that doesn't filter dropped tables), you'll see a flurry of failures. As an example, running TestBackupAndRestoreJobDescription returns: pq: the restoring cluster contains invalid descriptors, %s: descriptor validation check failed: Examining 47 descriptors and 47 namespace entries... ParentID 104, ParentSchemaID 105: namespace entry "bank" (106): no matching name info in draining names of dropped relation
>
> Note that some tests fail when the restore rolls back, leading to an unclear error message, e.g. "table [104] not found".

So, what's happening here is that you shouldn't have a namespace entry for dropped descriptors. Just adjust the backuputils validation logic to not add one.
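
A minimal sketch of that adjustment, assuming the backuputils helper builds a doctor.NamespaceTable from the restored descriptors (the loop shape and row construction are illustrative, not the PR's actual code):

// namespaceRows builds the namespace rows handed to doctor validation,
// skipping dropped descriptors, which should have no namespace entries.
func namespaceRows(descs []catalog.Descriptor) doctor.NamespaceTable {
	var rows doctor.NamespaceTable
	for _, desc := range descs {
		if desc.Dropped() {
			continue // a dropped descriptor has no namespace entry
		}
		rows = append(rows, doctor.NamespaceTableRow{
			ID: int64(desc.GetID()),
			NameInfo: descpb.NameInfo{
				ParentID:       desc.GetParentID(),
				ParentSchemaID: desc.GetParentSchemaID(),
				Name:           desc.GetName(),
			},
		})
	}
	return rows
}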

This patch checks that the restored descriptors are valid in three
places during the job. First, during planning to ensure no existing descriptor
in the cluster is invalid. Second, as soon as the descriptors are written in an
offline state before the restoration flow begins. This primarily ensures that
the descriptors in the backup were in a valid state, allowing the restore to
fail fast. Third, right after the descriptors return online and before the job
begins, which ensures the restore created and modified the descriptors
correctly.

Note that this validation step does not check job references in the
descriptors, for two reasons: 1) it would require loading the whole jobs
table into memory; 2) during the second check, the offline-written
descriptors will have invalid job references during a cluster restore, as
the jobs table isn't properly populated yet.

If the customer needs to proceed with the restore, they can prevent the restore
from failing by passing the skip_validation_check flag.

Fixes cockroachdb#84757

Release note: this patch introduces checks during restore which cause the
restore to fail if invalid descriptors exist in the restoring cluster. The
user can prevent the restore from failing by passing the
skip_validation_check flag.
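
As proposed, opting out might have looked like the following in a test; the option name comes from the commit message above, the database and URI are placeholders, and this syntax was never merged.

// Hypothetical usage of the proposed escape hatch.
sqlDB.Exec(t, `RESTORE DATABASE data FROM 'nodelocal://1/backup' WITH skip_validation_check`)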
Review thread on a datadriven test snippet:

job tag=a wait-for-state=revert-failed
----
@msbutler (Collaborator, Author) commented Nov 15, 2022


this test currently fails: after 2 minutes of waiting for the revert-failed state, the job is left in a reverting state. Why? The rollback should fail, because the evil descriptor I created causes dropDescriptors to fail in OnFailOrCancel, which means the job should terminate in the revert-failed state.

Instead, the job system seems to infinitely retry OnFailOrCancel if it fails, which seems like bad behavior:
https://github.com/cockroachdb/cockroach/blob/master/pkg/jobs/registry.go#L1401

I'll file a separate issue.

Collaborator:

This behaviour is intentional: #69300

The idea being that for many jobs OnFailOrCancel running to completion is required to put the system back into a usable state.

@msbutler (Collaborator, Author):

but should this job retry forever? I guess with exponential backoff, the time between retries will become sufficiently large, preventing the exhaustion of job-system resources.
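
For reference, a rough sketch of that backoff shape using pkg/util/retry; the durations are assumptions, not the registry's actual settings, and runOnFailOrCancel stands in for the job's cleanup attempt.

// Each failed cleanup attempt waits twice as long as the previous one, up
// to a cap, so an endlessly retrying job consumes progressively fewer
// resources.
opts := retry.Options{
	InitialBackoff: 30 * time.Second, // assumed starting delay
	MaxBackoff:     24 * time.Hour,   // assumed cap on the delay between attempts
	Multiplier:     2,
}
for r := retry.Start(opts); r.Next(); {
	if err := runOnFailOrCancel(ctx); err == nil {
		break // cleanup succeeded; the job can leave the reverting state
	}
}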

@msbutler (Collaborator, Author):

hm, I'm also starting to think this PR is not really necessary. It seems the SQL schema code validates all descriptors anyway. If I run the unit test I created with skip_descriptor_check, the restore still fails with the following stack trace, which implies that the collection.GetAllDescriptors call does a bunch of validation.

github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.(*validationErrorAccumulator).decorate
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:238
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.(*validationErrorAccumulator).Report
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:173
github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc.(*wrapper).ValidateBackReferences
	github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/validate.go:251
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.Validate.func3
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:101
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.(*validationErrorAccumulator).validateDescriptorsAtLevel
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:189
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.Validate
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:96
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).Validate
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/validate.go:38
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).finalizeDescriptors
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:568
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.getDescriptorsByID
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:203
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getDescriptorsByID
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:104
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetAllDescriptors
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:434
github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backupresolver.LoadAllDescs.func1

@msbutler (Collaborator, Author):

I've done a bit more digging, and I don't think this PR, via doctor.ExamineDescriptors, provides significant additional descriptor validation coverage over the automatic validation conducted whenever a descriptor is read from or written to disk.

  • doctor.ExamineDescriptors calls catalog.Validate at the validate.Write level, the highest level of validation. Every time a descriptor is written to disk, it is also validated at the validate.Write level. In other words, this PR currently runs the doctor check right after we've already validated all the written descriptors, which is redundant (see the schematic below).
  • doctor.ExamineDescriptors also calls ValidateDescriptorReferencesInJob, which does not get called during automatic descriptor validation, but neither does this PR. The jobs table should not get checked until the very end of the restore, so do we really want to fail the whole restore, after hours of work, just because a descriptor has a bad reference to the jobs table?
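
To make the overlap concrete, a rough schematic of the two call paths, simplified from the stack trace and the names above (not literal code):

// Automatic path, taken on every descriptor read/write through the
// descs.Collection (see the stack trace in the previous comment):
//
//   (*Collection).finalizeDescriptors
//     -> (*Collection).Validate
//       -> validate.Validate(..., at the validate.Write level)
//
// This PR's explicit path, run immediately after those same writes:
//
//   doctor.ExamineDescriptors
//     -> catalog.Validate(..., at the validate.Write level)
//
// Both paths validate the same descriptors at the same level, so the
// explicit pass adds no coverage beyond the automatic one.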

@msbutler (Collaborator, Author):

@stevendanna @fqazi does my analysis above seem sound? TL;DR: I don't think we need all this extra validation. Perhaps I'll keep the test to ensure that the automatic validation does indeed catch invalid descriptor bugs.

@fqazi (Collaborator) commented Nov 17, 2022

@msbutler Yes, that makes it sound unnecessary, unless we can catch things earlier.

@msbutler (Collaborator, Author):

Thanks for clarifying! Closing this PR; I'll open a separate, test-only PR that checks that restore validation does indeed occur automatically.

Linked issue: importccl,backupccl: use crdb_internal.invalid_objects to check for invalid descriptors