Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

br: restore checksum shouldn't rely on backup checksum #56712

Merged
merged 13 commits into from
Nov 18, 2024

Conversation

Tristan1900
Copy link
Contributor

@Tristan1900 Tristan1900 commented Oct 18, 2024

What problem does this PR solve?

Issue Number: close #56373

Problem Summary:

What changed and how does it work?

  1. Disable checksum by default only for snapshot backup process.
  2. Keep checksum enabled for the rest
  3. During snapshot restore, verify admin checksum by doing xor on all meta file checksums, so it doesn't rely on backup generating the CRC for the database.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

The default behavior of doing checksum during full backup has changed from enabled to disabled. 
The meaning of checksum flag is kind of ambiguous here, it doesn't mean we don't do the checksum of files when doing backup, instead we don't do checksum to compare from coprocessor side to the br side to improve the performance

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
Copy link

ti-chi-bot bot commented Oct 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 18, 2024
Copy link

tiprow bot commented Oct 18, 2024

Hi @Tristan1900. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/needs-tests-checked size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 21, 2024
@Tristan1900 Tristan1900 marked this pull request as ready for review October 21, 2024 22:11
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2024
Copy link

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 22 lines in your changes missing coverage. Please review.

Project coverage is 59.8768%. Comparing base (255ea42) to head (95059bc).
Report is 401 commits behind head on master.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #56712         +/-   ##
=================================================
- Coverage   73.3472%   59.8768%   -13.4705%     
=================================================
  Files          1635       1801        +166     
  Lines        452734     674211     +221477     
=================================================
+ Hits         332068     403696      +71628     
- Misses       100306     246164     +145858     
- Partials      20360      24351       +3991     
Flag Coverage Δ
integration 40.6202% <75.0000%> (?)
unit 75.0713% <32.9545%> (+2.5971%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 55.2602% <ø> (+2.3124%) ⬆️
parser ∅ <ø> (∅)
br 64.1702% <72.5000%> (+18.1375%) ⬆️

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
mv "$filename" "$filename_bak"
done

# need to drop db otherwise restore will fail because of cluster not fresh but not the expected issue
Copy link
Contributor Author

@Tristan1900 Tristan1900 Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous br_file_corruption is not testing what it should be testing, restore fails because of cluster not fresh but not because of corruption.

echo "corruption" > $filename_temp
cat $filename >> $filename_temp
# Replace the single file manipulation with a loop over all .sst files
for filename in $(find $TEST_DIR/$DB -name "*.sst"); do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to corrupt every sst file cuz during restore not all sst are going to be used, since backup backs up a bunch of tables but some of them are going to be filtered out during restore

truncate --size=-11 $filename
for filename in $(find $TEST_DIR/$DB -name "*.sst_temp"); do
mv "$filename" "${filename%_temp}"
truncate -s 11 "${filename%_temp}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--size=-11 doesn't work on macOS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
truncate -s 11 "${filename%_temp}"
truncate -s -11 "${filename%_temp}"

This comment was marked as resolved.

if tbl.OldTable.NoChecksum() {
logger.Warn("table has no checksum, skipping checksum")
expectedChecksumStats := metautil.CalculateChecksumStatsOnFiles(tbl.OldTable.Files)
if !expectedChecksumStats.ChecksumExists() {
Copy link
Contributor Author

@Tristan1900 Tristan1900 Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if we need to do the check here, if an empty table is backed up, should we check after restore if it's still empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the error log is misleading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I remove the checking here at all?

@Tristan1900
Copy link
Contributor Author

/assign @Leavrth @YuJuncen

Copy link
Contributor

@Leavrth Leavrth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

@@ -800,6 +800,13 @@ func DefaultBackupConfig() BackupConfig {
if err != nil {
log.Panic("infallible operation failed.", zap.Error(err))
}

// Check if the checksum flag was set by the user
if !fs.Changed("checksum") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use flagChecksum instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right... my mistake

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems in this context, fs.Changed(anything) is always false... As the flagset is newly created at here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, can we override the default value of --checksum in DefineBackupFlags...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like having override in DefineBackupFlags is a bit weird since it's technically not define, might be a bit confusing

// pipeline checksum
if cfg.Checksum {
// pipeline checksum only when enabled and is not incremental snapshot repair mode cuz incremental doesn't have
// enough information in backup meta to validate checksum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't use the collection of file-level checksums to match the table-level checksum in the incremental backup. However, we use the table-level checksum in the incremental backup to match the table-level checksum in the incremental restore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you point me to the code, I thought table level checksum for incremental is not calculated during backup

skipChecksum := !cfg.Checksum || isIncrementalBackup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get it.

Copy link
Contributor

@YuJuncen YuJuncen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

@@ -145,7 +145,7 @@ func (ss *Schemas) BackupSchemas(
zap.Uint64("Crc64Xor", schema.crc64xor),
zap.Uint64("TotalKvs", schema.totalKvs),
zap.Uint64("TotalBytes", schema.totalBytes),
zap.Duration("calculate-take", calculateCost))
zap.Duration("Time taken", calculateCost))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep consistency with other fields?

Suggested change
zap.Duration("Time taken", calculateCost))
zap.Duration("TimeTaken", calculateCost))

Comment on lines 247 to 250
func (stats *ChecksumStats) ChecksumExists() bool {
if stats == nil {
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems pointer receiver isn't needed here? As I cannot find where a *ChecksumStats instead of ChecksumState is needed.

Suggested change
func (stats *ChecksumStats) ChecksumExists() bool {
if stats == nil {
return false
}
func (stats ChecksumStats) ChecksumExists() bool {

logger.Warn("table has no checksum, skipping checksum")
expectedChecksumStats := metautil.CalculateChecksumStatsOnFiles(tbl.OldTable.Files)
if !expectedChecksumStats.ChecksumExists() {
logger.Error("table has no checksum, skipping checksum")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you also print the table and database name here? Also I think this can be a warning instead of an error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I think this line above adds the db and table as fields for all subsequent logging

	logger := log.L().With(
		zap.String("db", tbl.OldTable.DB.Name.O),
		zap.String("table", tbl.OldTable.Info.Name.O),
	)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit skeptical about this logic, should we just remove it? If there is an empty table restored(would there be a case?), we should still check the checksum make sure it's empty. What do you think?

zap.Uint64("calculated total bytes", item.TotalBytes),
)
return errors.Annotate(berrors.ErrRestoreChecksumMismatch, "failed to validate checksum")
}
logger.Info("success in validate checksum")
logger.Info("success in validating checksum")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you also add the table / database name to this log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -800,6 +800,13 @@ func DefaultBackupConfig() BackupConfig {
if err != nil {
log.Panic("infallible operation failed.", zap.Error(err))
}

// Check if the checksum flag was set by the user
if !fs.Changed("checksum") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems in this context, fs.Changed(anything) is always false... As the flagset is newly created at here

@@ -800,6 +800,13 @@ func DefaultBackupConfig() BackupConfig {
if err != nil {
log.Panic("infallible operation failed.", zap.Error(err))
}

// Check if the checksum flag was set by the user
if !fs.Changed("checksum") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, can we override the default value of --checksum in DefineBackupFlags...?

// Test with checksum flag set
os.Args = []string{"cmd", "--checksum=true"}
cfg = DefaultBackupConfig()
require.True(t, cfg.Checksum)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the DefaultBackupConfig will parse command line from os.Args... TBH I'm even not sure how this case passes...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you are absolutely right, should revisit before making PR read for review. It did pass locally and on CI which is weird.

run_br --pd $PD_ADDR restore full -s "local://$TEST_DIR/$DB" --checksum=true || restore_fail=1
export GO_FAILPOINTS=""
if [ $restore_fail -ne 1 ]; then
echo 'expect restore to fail on checksum mismatch but succeed'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I guess if we have a corrupted SST file, the restoration fails though with --checksum=false... Perhaps a better (also harder) way is to delete an file entry in the backupmeta.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh it's not testing corrupted SST file in this test, the valid files are moved back after the previous corruption test, This test is just to verify the checksum process is running even backup disables the checksum, by injecting the fail point into the checksum routine. Right after this test there is a sanity test that can restore successfully, verifying all files are valid.

Signed-off-by: Wenqi Mou <wenqimou@gmail.com>
Copy link

tiprow bot commented Nov 18, 2024

@Tristan1900: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
tidb_parser_test 95059bc link true /test tidb_parser_test
fast_test_tiprow 95059bc link true /test fast_test_tiprow

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot merged commit 4f047be into pingcap:master Nov 18, 2024
31 of 33 checks passed
@Tristan1900 Tristan1900 deleted the checksum branch November 18, 2024 16:49
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Nov 19, 2024
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #57503.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 20, 2024
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Dec 2, 2024
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Dec 2, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #57887.

@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. label Dec 4, 2024
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Dec 4, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #57984.

@ti-chi-bot ti-chi-bot bot removed the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Dec 12, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 12, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 12, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
Tristan1900 added a commit to Tristan1900/tidb that referenced this pull request Dec 13, 2024
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. label Dec 17, 2024
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #58335.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Dec 17, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BR-restore should still perform checksum even when the schema checksum is missing
4 participants