br: restore checksum shouldn't rely on backup checksum #56712
Conversation
Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #56712 +/- ##
=================================================
- Coverage 73.3472% 59.8768% -13.4705%
=================================================
Files 1635 1801 +166
Lines 452734 674211 +221477
=================================================
+ Hits 332068 403696 +71628
- Misses 100306 246164 +145858
- Partials 20360 24351 +3991
mv "$filename" "$filename_bak"
done

# need to drop db otherwise restore will fail because of cluster not fresh but not the expected issue
The previous br_file_corruption test was not testing what it should: restore failed because the cluster was not fresh, not because of corruption.
echo "corruption" > $filename_temp
cat $filename >> $filename_temp
# Replace the single file manipulation with a loop over all .sst files
for filename in $(find $TEST_DIR/$DB -name "*.sst"); do
We need to corrupt every SST file because not all SST files are used during restore: backup backs up a bunch of tables, but some of them are filtered out during restore.
br/tests/br_file_corruption/run.sh
Outdated
truncate --size=-11 $filename
for filename in $(find $TEST_DIR/$DB -name "*.sst_temp"); do
mv "$filename" "${filename%_temp}"
truncate -s 11 "${filename%_temp}"
--size=-11 doesn't work on macOS.
Suggested change:
- truncate -s 11 "${filename%_temp}"
+ truncate -s -11 "${filename%_temp}"
if tbl.OldTable.NoChecksum() {
logger.Warn("table has no checksum, skipping checksum")
expectedChecksumStats := metautil.CalculateChecksumStatsOnFiles(tbl.OldTable.Files)
if !expectedChecksumStats.ChecksumExists() {
Not sure we need this check here. If an empty table is backed up, shouldn't we check after restore that it is still empty?
Yes, the error log is misleading.
Should I remove the check here entirely?
rest LGTM
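To make the empty-table concern concrete, here is a sketch of how file-level checksum stats might be folded into table-level stats. The field names follow the log fields that appear in this PR (Crc64Xor, TotalKvs, TotalBytes); the File type, the XOR-and-sum aggregation, and the body of ChecksumExists are assumptions for illustration, not the code in metautil.

```go
package main

import "fmt"

// File is a stand-in for a backed-up file entry (assumed shape).
type File struct {
	Crc64Xor   uint64
	TotalKvs   uint64
	TotalBytes uint64
}

// ChecksumStats holds table-level checksum information (assumed shape).
type ChecksumStats struct {
	Crc64Xor   uint64
	TotalKvs   uint64
	TotalBytes uint64
}

// CalculateChecksumStatsOnFiles combines per-file stats: CRC64 values are
// XORed together, key/value counts and byte counts are summed.
func CalculateChecksumStatsOnFiles(files []File) ChecksumStats {
	var stats ChecksumStats
	for _, f := range files {
		stats.Crc64Xor ^= f.Crc64Xor
		stats.TotalKvs += f.TotalKvs
		stats.TotalBytes += f.TotalBytes
	}
	return stats
}

// ChecksumExists reports whether any field is non-zero (assumed definition).
func (s ChecksumStats) ChecksumExists() bool {
	return s.Crc64Xor != 0 || s.TotalKvs != 0 || s.TotalBytes != 0
}

func main() {
	// An empty table has no files, so every field stays zero.
	empty := CalculateChecksumStatsOnFiles(nil)
	fmt.Println(empty.ChecksumExists()) // false

	nonEmpty := CalculateChecksumStatsOnFiles([]File{
		{Crc64Xor: 0x0f, TotalKvs: 2, TotalBytes: 10},
		{Crc64Xor: 0xf0, TotalKvs: 3, TotalBytes: 20},
	})
	fmt.Println(nonEmpty.ChecksumExists()) // true
}
```

Under these assumptions an empty table and a table with no recorded checksum both produce all-zero stats, so skipping validation when ChecksumExists is false would also skip verifying that a restored empty table is really empty.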
br/pkg/task/backup.go
Outdated
@@ -800,6 +800,13 @@ func DefaultBackupConfig() BackupConfig {
if err != nil {
log.Panic("infallible operation failed.", zap.Error(err))
}

// Check if the checksum flag was set by the user
if !fs.Changed("checksum") {
Use flagChecksum instead.
Oh right... my mistake.
It seems that in this context fs.Changed(anything) is always false, since the flag set is newly created here.
BTW, can we override the default value of --checksum in DefineBackupFlags?
Overriding it in DefineBackupFlags feels a bit weird to me, since that's technically not a "define"; it might be a bit confusing.
// pipeline checksum
if cfg.Checksum {
// pipeline checksum only when enabled and is not incremental snapshot repair mode cuz incremental doesn't have
// enough information in backup meta to validate checksum
We don't use the collection of file-level checksums to match the table-level checksum in the incremental backup. However, we do use the table-level checksum from the incremental backup to match the table-level checksum in the incremental restore.
Could you point me to the code? I thought the table-level checksum for incremental backups is not calculated during backup:
skipChecksum := !cfg.Checksum || isIncrementalBackup
Got it.
rest lgtm
br/pkg/backup/schema.go
Outdated
@@ -145,7 +145,7 @@ func (ss *Schemas) BackupSchemas(
zap.Uint64("Crc64Xor", schema.crc64xor),
zap.Uint64("TotalKvs", schema.totalKvs),
zap.Uint64("TotalBytes", schema.totalBytes),
zap.Duration("calculate-take", calculateCost))
zap.Duration("Time taken", calculateCost))
Keep this consistent with the other fields?
Suggested change:
- zap.Duration("Time taken", calculateCost))
+ zap.Duration("TimeTaken", calculateCost))
br/pkg/metautil/metafile.go
Outdated
func (stats *ChecksumStats) ChecksumExists() bool {
if stats == nil {
return false
}
It seems a pointer receiver isn't needed here, as I cannot find anywhere that a *ChecksumStats is needed instead of a ChecksumStats.
Suggested change:
- func (stats *ChecksumStats) ChecksumExists() bool {
- if stats == nil {
- return false
- }
+ func (stats ChecksumStats) ChecksumExists() bool {
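For readers weighing the two receiver forms, a small sketch of the difference follows. The struct fields and method bodies are assumed for illustration (only the type name and the nil guard come from the diff); the method names ExistsViaPointer and ExistsViaValue are hypothetical.

```go
package main

import "fmt"

// ChecksumStats is an assumed shape for illustration.
type ChecksumStats struct {
	Crc64Xor   uint64
	TotalKvs   uint64
	TotalBytes uint64
}

// ExistsViaPointer uses a pointer receiver, so it can be called on a nil
// pointer and therefore needs the nil guard.
func (stats *ChecksumStats) ExistsViaPointer() bool {
	if stats == nil {
		return false
	}
	return stats.Crc64Xor != 0 || stats.TotalKvs != 0 || stats.TotalBytes != 0
}

// ExistsViaValue uses a value receiver. A value can never be nil, so the
// guard disappears; the zero value simply reports false.
func (stats ChecksumStats) ExistsViaValue() bool {
	return stats.Crc64Xor != 0 || stats.TotalKvs != 0 || stats.TotalBytes != 0
}

func main() {
	var nilStats *ChecksumStats
	fmt.Println(nilStats.ExistsViaPointer()) // false, thanks to the nil guard

	var zero ChecksumStats
	fmt.Println(zero.ExistsViaValue()) // false

	filled := ChecksumStats{Crc64Xor: 1, TotalKvs: 1, TotalBytes: 8}
	fmt.Println(filled.ExistsViaValue()) // true
}
```

If no caller ever holds a *ChecksumStats, the value receiver is the simpler choice, since the struct is small and the nil case cannot arise.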
br/pkg/restore/snap_client/client.go
Outdated
logger.Warn("table has no checksum, skipping checksum")
expectedChecksumStats := metautil.CalculateChecksumStatsOnFiles(tbl.OldTable.Files)
if !expectedChecksumStats.ChecksumExists() {
logger.Error("table has no checksum, skipping checksum")
Would you also print the table and database name here? Also, I think this can be a warning instead of an error.
Right. I think the line above already adds the db and table as fields for all subsequent logging:
logger := log.L().With(
zap.String("db", tbl.OldTable.DB.Name.O),
zap.String("table", tbl.OldTable.Info.Name.O),
)
I'm a bit skeptical about this logic; should we just remove it? If an empty table is restored (would there be such a case?), we should still validate the checksum to make sure it's empty. What do you think?
zap.Uint64("calculated total bytes", item.TotalBytes),
)
return errors.Annotate(berrors.ErrRestoreChecksumMismatch, "failed to validate checksum")
}
logger.Info("success in validate checksum")
logger.Info("success in validating checksum")
Would you also add the table / database name to this log?
Same as above.
br/pkg/task/backup_test.go
Outdated
// Test with checksum flag set
os.Args = []string{"cmd", "--checksum=true"}
cfg = DefaultBackupConfig()
require.True(t, cfg.Checksum)
I don't think DefaultBackupConfig parses the command line from os.Args... TBH I'm not even sure how this case passes...
Yeah, you are absolutely right; I should have revisited this before marking the PR ready for review. It did pass locally and on CI, which is weird.
run_br --pd $PD_ADDR restore full -s "local://$TEST_DIR/$DB" --checksum=true || restore_fail=1
export GO_FAILPOINTS=""
if [ $restore_fail -ne 1 ]; then
echo 'expect restore to fail on checksum mismatch but succeed'
nit: I guess that with a corrupted SST file, the restoration fails even with --checksum=false... Perhaps a better (but harder) way is to delete a file entry in the backupmeta.
Oh, this test isn't testing a corrupted SST file; the valid files are moved back after the previous corruption test. This test just verifies that the checksum process runs even when the backup disabled checksum, by injecting a failpoint into the checksum routine. Right after this test there is a sanity test that restores successfully, verifying that all files are valid.
Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
In response to a cherrypick label: new pull request created to branch |
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
close pingcap#56373 (cherry picked from commit 4f047be)
What problem does this PR solve?
Issue Number: close #56373
Problem Summary:
What changed and how does it work?
Check List
Tests
Side effects
Documentation
Release note