
backupccl: validate descriptors during restore #91250

Closed

Conversation

@msbutler (Collaborator) commented Nov 3, 2022

This patch checks that the restored descriptors are valid in two
places during the job. First, as soon as the descriptors are written in an
offline state before the restoration flow begins. This primarily ensures that
the descriptors in the backup were in a valid state, allowing the restore to
fail fast. Second, right after the descriptors return online and before the job
begins, which ensures the restore created and modified the descriptors
correctly.

Fixes #84757
Release note: None
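
For orientation, here is a rough sketch of what invoking such a check could look like. doctor.ExamineDescriptors and the doctor table types live in pkg/sql/doctor, but the exact signature and the wrapper below are assumptions for illustration, not this PR's actual code.

import (
	"bytes"
	"context"

	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/sql/doctor"
	"github.com/cockroachdb/errors"
)

// validateWithDoctor is an illustrative wrapper: it runs a doctor-style
// examination over rows the caller has scanned from system.descriptor,
// system.namespace, and system.jobs, and returns an error if any
// descriptor fails validation.
func validateWithDoctor(
	ctx context.Context,
	version clusterversion.ClusterVersion,
	descTable doctor.DescriptorTable,
	namespaceTable doctor.NamespaceTable,
	jobsTable doctor.JobsTable,
) error {
	var report bytes.Buffer
	ok, err := doctor.ExamineDescriptors(
		ctx, version, descTable, namespaceTable, jobsTable,
		false /* validateJobs */, false /* verbose */, &report,
	)
	if err != nil {
		return err
	}
	if !ok {
		return errors.Newf("descriptor validation check failed: %s", report.String())
	}
	return nil
}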

@cockroach-teamcity (Member):

This change is Reviewable

@msbutler msbutler changed the title to backupccl: validate descriptors during restore Nov 4, 2022
@msbutler: This comment was marked as outdated.

@msbutler: This comment was marked as outdated.

@msbutler msbutler force-pushed the butler-desc-valid branch 4 times, most recently from c9a5a1f to b642383 on November 14, 2022 14:52
@msbutler msbutler marked this pull request as ready for review November 14, 2022 14:53
@msbutler msbutler requested review from a team as code owners November 14, 2022 14:53
@stevendanna (Collaborator) left a comment


Nice to see this moving along!

Review threads (outdated, resolved):
pkg/ccl/backupccl/backuputils/utils.go
pkg/ccl/backupccl/backuputils/utils.go
pkg/ccl/backupccl/restore_job.go
pkg/ccl/backupccl/restore_job.go
pkg/ccl/backupccl/restore_planning.go
@msbutler (Collaborator, Author) commented Nov 14, 2022

@fqazi a few questions for you:

  • could you confirm that I'm not misusing doctor.Examine?
  • could you explain why the validation checker would fail on dropped descriptors?
  • any ideas for how to create a backup with an invalid descriptor that would fail these checks?

@fqazi (Collaborator) commented Nov 14, 2022

> @fqazi a few questions for you:
>
>   • could you confirm that I'm not misusing doctor.Examine?
>   • could you explain why the validation checker would fail on dropped descriptors?
>   • any ideas for how to create a backup with an invalid descriptor that would fail these checks?
  1. Assuming that you are able to get a closure of all the descriptors involved in the restore, in the case of partial restores involving a subset of tables. Looking at your code, you're getting the full descriptor set, so the usage looks correct here.
  2. Can you give the exact error we are seeing? Is it linked to a missing namespace entry or something else?
  3. One option is to exploit a bug in an older release when creating the backup. You could also use the JSON protobuf builtins to manipulate a descriptor and add corruption, but you need to disable validation (see the sketch below). I did make a few manifests for the SHOW BACKUP command, not sure if those can be reused.
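
For the pb-to-JSON route, here is a hypothetical sketch of what such a test helper might look like. crdb_internal.pb_to_json, crdb_internal.json_to_pb, and crdb_internal.unsafe_upsert_descriptor are real builtins, but the harness, the '{table,parentId}' path, and the bogus parent ID are illustrative assumptions, not code from this PR.

// corruptDescriptor round-trips a table descriptor through JSON, points it
// at a nonexistent parent database, and force-writes it back past
// validation. sqlDB is a *sqlutils.SQLRunner from the usual test harness.
func corruptDescriptor(t *testing.T, sqlDB *sqlutils.SQLRunner, id int) {
	sqlDB.Exec(t, `
SELECT crdb_internal.unsafe_upsert_descriptor(
    id,
    crdb_internal.json_to_pb(
        'cockroach.sql.sqlbase.Descriptor',
        jsonb_set(
            crdb_internal.pb_to_json('cockroach.sql.sqlbase.Descriptor', descriptor),
            '{table,parentId}',
            '999999'::JSONB
        )
    ),
    true -- force: skip validation so the corrupted bytes can be written
)
FROM system.descriptor WHERE id = $1`, id)
}

A backup taken after calling a helper like this should then carry the invalid descriptor into the restore-side checks.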

@msbutler (Collaborator, Author):

  1. sounds good!

> Can you give the exact error we are seeing? Is it linked to a missing namespace entry or something else?

  2. Once the latest TC run finishes (I added a commit that doesn't filter dropped tables), you'll see a flurry of failures. As an example, running TestBackupAndRestoreJobDescription returns:
pq: the restoring cluster contains invalid descriptors, %s: descriptor validation check failed: Examining 47 descriptors and 47 namespace entries... ParentID 104, ParentSchemaID 105: namespace entry "bank" (106): no matching name info in draining names of dropped relation

Note that some tests fail when the restore rolls back, leading to an unclear error message, e.g. "table [104] not found".

  3. Ideally, the test would generate the corrupt backup, so the backup is from the same release as the restore; I'll go with the pb-to-JSON manipulation route.

@fqazi (Collaborator) commented Nov 14, 2022

> Once the latest TC run finishes (I added a commit that doesn't filter dropped tables), you'll see a flurry of failures. As an example, running TestBackupAndRestoreJobDescription returns: pq: the restoring cluster contains invalid descriptors, %s: descriptor validation check failed: Examining 47 descriptors and 47 namespace entries... ParentID 104, ParentSchemaID 105: namespace entry "bank" (106): no matching name info in draining names of dropped relation
>
> Note that some tests fail when the restore rolls back, leading to an unclear error message, e.g. "table [104] not found".

So, what's happening here is that you shouldn't have a namespace entry for dropped descriptors. Just adjust the backuputils validation logic to not add one.
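
A minimal sketch of that adjustment, assuming the backuputils helper builds a doctor.NamespaceTable from the restored descriptors (the loop shape and row construction are illustrative, not the PR's actual code):

// namespaceRows builds the namespace rows handed to doctor validation,
// skipping dropped descriptors, which should have no namespace entries.
func namespaceRows(descs []catalog.Descriptor) doctor.NamespaceTable {
	var rows doctor.NamespaceTable
	for _, desc := range descs {
		if desc.Dropped() {
			continue // a dropped descriptor has no namespace entry
		}
		rows = append(rows, doctor.NamespaceTableRow{
			ID: int64(desc.GetID()),
			NameInfo: descpb.NameInfo{
				ParentID:       desc.GetParentID(),
				ParentSchemaID: desc.GetParentSchemaID(),
				Name:           desc.GetName(),
			},
		})
	}
	return rows
}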

This patch checks that the restored descriptors are valid in three
places during the job. First, during planning to ensure no existing descriptor
in the cluster is invalid. Second, as soon as the descriptors are written in an
offline state before the restoration flow begins. This primarily ensures that
the descriptors in the backup were in a valid state, allowing the restore to
fail fast. Third, right after the descriptors return online and before the job
begins, which ensures the restore created and modified the descriptors
correctly.

Note that this validation step does not check job references in the
descriptors, for two reasons: 1) it would require loading the whole jobs
table into memory; 2) during the second check, the offline-written
descriptors will have invalid job references during a cluster restore, as
the jobs table isn't properly populated yet.

If the customer needs to proceed with the restore, they can prevent the restore
from failing by passing the skip_validation_check flag.

Fixes cockroachdb#84757

Release note: this patch introduces checks during restore which cause the
restore to fail if invalid descriptors exist in the restoring cluster. The
user can prevent the restore from failing by passing the
skip_validation_check flag.
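
As proposed, opting out might have looked like the following in a test; the option name comes from the commit message above, the database and URI are placeholders, and this syntax was never merged.

// Hypothetical usage of the proposed escape hatch.
sqlDB.Exec(t, `RESTORE DATABASE data FROM 'nodelocal://1/backup' WITH skip_validation_check`)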
Review thread on a datadriven test snippet:

job tag=a wait-for-state=revert-failed
----
@msbutler (Collaborator, Author) commented Nov 15, 2022


this test currently fails: after 2 minutes of waiting for the revert-failed state, the job is left in a reverting state. Why? The rollback should fail, because the evil descriptor I created causes dropDescriptors to fail in OnFailOrCancel, which means the job should terminate in the revert-failed state.

Instead, the job system seems to infinitely retry OnFailOrCancel if it fails, which seems like bad behavior:
https://github.com/cockroachdb/cockroach/blob/master/pkg/jobs/registry.go#L1401

I'll file a separate issue.

Collaborator:

This behaviour is intentional: #69300

The idea being that for many jobs OnFailOrCancel running to completion is required to put the system back into a usable state.

@msbutler (Collaborator, Author):

but should this job retry forever? I guess with exponential backoff, the time between retries will become sufficiently large, preventing the exhaustion of job-system resources.
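
For reference, a rough sketch of that backoff shape using pkg/util/retry; the durations are assumptions, not the registry's actual settings, and runOnFailOrCancel stands in for the job's cleanup attempt.

// Each failed cleanup attempt waits twice as long as the previous one, up
// to a cap, so an endlessly retrying job consumes progressively fewer
// resources.
opts := retry.Options{
	InitialBackoff: 30 * time.Second, // assumed starting delay
	MaxBackoff:     24 * time.Hour,   // assumed cap on the delay between attempts
	Multiplier:     2,
}
for r := retry.Start(opts); r.Next(); {
	if err := runOnFailOrCancel(ctx); err == nil {
		break // cleanup succeeded; the job can leave the reverting state
	}
}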

@msbutler (Collaborator, Author):

hm, I'm also starting to think this PR is not really necessary. It seems the SQL schema code validates all descriptors anyway. If I run the unit test I created with skip_descriptor_check, the restore still fails with the following stack trace, which implies that the collection.GetAllDescriptors call does a bunch of validation.

github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.(*validationErrorAccumulator).decorate
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:238
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.(*validationErrorAccumulator).Report
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:173
github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc.(*wrapper).ValidateBackReferences
	github.com/cockroachdb/cockroach/pkg/sql/catalog/tabledesc/validate.go:251
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.Validate.func3
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:101
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.(*validationErrorAccumulator).validateDescriptorsAtLevel
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:189
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate.Validate
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/validate/validate.go:96
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).Validate
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/validate.go:38
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).finalizeDescriptors
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:568
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.getDescriptorsByID
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:203
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getDescriptorsByID
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/descriptor.go:104
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetAllDescriptors
	github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:434
github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backupresolver.LoadAllDescs.func1

@msbutler (Collaborator, Author):

I've done a bit more digging, and I don't think this PR, via doctor.ExamineDescriptors, provides significant additional descriptor validation coverage over the automatic validation conducted whenever a descriptor is read from or written to disk.

  • doctor.ExamineDescriptors calls catalog.Validate at the validate.Write level, the highest level of validation. Every time a descriptor is written to disk, it is also validated at the validate.Write level. In other words, this PR currently runs the doctor check right after we've already validated all the written descriptors, which is redundant (see the schematic below).
  • doctor.ExamineDescriptors also calls ValidateDescriptorReferencesInJob, which does not get called during automatic descriptor validation, but neither does this PR. The jobs table should not get checked until the very end of the restore, so do we really want to fail the whole restore, after hours of work, just because a descriptor has a bad reference to the jobs table?
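
To make the overlap concrete, a rough schematic of the two call paths, simplified from the stack trace and the names above (not literal code):

// Automatic path, taken on every descriptor read/write through the
// descs.Collection (see the stack trace in the previous comment):
//
//   (*Collection).finalizeDescriptors
//     -> (*Collection).Validate
//       -> validate.Validate(..., at the validate.Write level)
//
// This PR's explicit path, run immediately after those same writes:
//
//   doctor.ExamineDescriptors
//     -> catalog.Validate(..., at the validate.Write level)
//
// Both paths validate the same descriptors at the same level, so the
// explicit pass adds no coverage beyond the automatic one.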

@msbutler (Collaborator, Author):

@stevendanna @fqazi does my analysis above seem sound? TL;DR: I don't think we need all this extra validation. Perhaps I'll keep the test to ensure that the automatic validation does indeed catch invalid descriptor bugs.

@fqazi (Collaborator) commented Nov 17, 2022

@msbutler Yes, that makes it sound unnecessary, unless we can catch things earlier.

@msbutler (Collaborator, Author):

Thanks for clarifying! Closing this PR; I'll open a separate, test-only PR that checks that restore validation does indeed occur automatically.

Linked issue: importccl,backupccl: use crdb_internal.invalid_objects to check for invalid descriptors