Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider disabling --checksum for BR backup #43007

Open
kennytm opened this issue Apr 12, 2023 · 4 comments
Open

Consider disabling --checksum for BR backup #43007

kennytm opened this issue Apr 12, 2023 · 4 comments
Labels
component/br This issue is related to BR of TiDB. type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@kennytm
Copy link
Contributor

kennytm commented Apr 12, 2023

Feature Request

Is your feature request related to a problem? Please describe:

Currently BR backup performs checksum twice, once during SST generation in TiKV, and once again after a table is backed up using TiKV Coprocessor. These two checksums are compared to verify correctness.

However, these checksums are calculated from the exact same source: the online TiKV database. The actual uploaded SST file could be corrupt but this won't be detected by the checksum.

The only way they differ is if the RocksDB snapshot mechanism break down or the checksum algorithm is changed. These are not something the users really care about. That is to say, running with --checksum=1 only wasted CPU on TiKV to prove something that can never be false.

Describe the feature you'd like:

  1. Given that TiKV always recorded the checksum into backupmeta regardless of the --checksum flag, we can safely change the flag's default value to false.
  2. Enhance the br debug checksum command to actually read the SST files to calculate that file's CRC64XOR etc. Currently it only compares the SHA256 hash.

In the documentation recommend users to run br debug checksum after performing br backup to ensure the integrity of the uploaded backup archive.

We are only talking about BR backup here. The default --checksum setting for BR restore should still and always be true.

Describe alternatives you've considered:

A. Change the --checksum flag to mean actually performing br debug checksum. Note that br debug checksum will need to download the SST files again so it will double the I/O cost of the target cloud storage.

B. Don't calculate CRC64XOR, keeping the current implementation of br debug checksum. Checking SHA256 should be sufficient to ensure the file content is intact.

Teachability, Documentation, Adoption, Migration Strategy:

For the public API since we are only changing the default value of a flag this is fully compatible.

The br debug command is currently hidden and br debug checksum doubly so. They need to be revealed and documented.

@kennytm kennytm added type/feature-request Categorizes issue or PR as related to a new feature. component/br This issue is related to BR of TiDB. labels Apr 12, 2023
@YuJuncen
Copy link
Contributor

YuJuncen commented Sep 11, 2024

The only way they differ is if the RocksDB snapshot mechanism break down or the checksum algorithm is changed. These are not something the users really care about. That is to say, running with --checksum=1 only wasted CPU on TiKV to prove something that can never be false.

Also notice that the client implementation may make them differ. Say, some regions were omitted during backup.

(BR uses a range tree and batched style to query TiKV, while checksumming queries TiKV region by region).

@kennytm
Copy link
Contributor Author

kennytm commented Sep 11, 2024

@YuJuncen

Also notice that the client implementation may make them differ. Say, some regions were omitted during backup.

Well if the local checksum and remote checksum are operating on different set of ranges, it is an implementation bug of BR which again the user cannot control besides upgrading BR to a fixed version.

@3pointer
Copy link
Contributor

In the early design, checksum were used to verify:

  • No Backup logic bugs
  • No physical imports (Lightning) during the backup
  • Double check for bit flips and other extreme cases.

In practice, backup checksum mismatches have rarely occurred. To my recollection, there were only a few times issues were caused by physical imports during the backup.

Therefore, we plan to gradually remove this logic and disable the checksum after the table is backed up by default. we already done it on Cloud. I think we can remove it on Premise for next release.

@kennytm
Copy link
Contributor Author

kennytm commented Sep 25, 2024

Currently (v8.1) BR backup will only write the schema checksum ("TXNKV_CHECKSUM_ORIGIN") when we enable backup --checksum=1, causing restore to ignore the E2E checksum validation even if we restore --checksum=1.

Will we change the restore checksum to consider the sum of file checksum ("TXNKV_CHECKSUM_SST") when the schema checksum is missing from the backupmeta? The two should be equivalent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/br This issue is related to BR of TiDB. type/feature-request Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

3 participants