PITR: Run PITR for multiple times could lead to tiflash crash #52628

Closed
JaySon-Huang opened this issue Apr 16, 2024 · 8 comments · Fixed by #53658
Labels
affects-8.1 · component/br (This issue is related to BR of TiDB.) · impact/crash (crash/fatal) · severity/critical · type/bug (The issue is confirmed as a bug.)

Comments

JaySon-Huang (Contributor) commented Apr 16, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Take a snapshot backup and a log backup for PITR.
  2. Restore the data up to tso1 into a new cluster with TiFlash instances using br restore point.
  3. Add TiFlash replica(s) for the restored table(s). (If the backup data already contains TiFlash replicas, the replicas are added automatically after step 2.)
  4. Restore the data within the range tso1...tso2 into the same cluster using br restore point.

2. What did you expect to see? (Required)

The restore succeeds and all instances run normally.

3. What did you see instead (Required)

When running step 4, the TiFlash instances crash with a backtrace like:

[FATAL] [Exception.cpp:106] ["Code: 9008, e.displayText() = DB::Exception: Raw TiDB PK: C80000000CC265A2, Prewrite ts: 449057189415092426 can not found in default cf for key: 7480000000003580FFD75F72C80000000CFFC265A20000000000FAF9C4A0DD8D13FEE1, region_id: 43498, applied_index: 31: (applied_term: 7) ...

4. What is your TiDB version? (Required)

v7.5.1

JaySon-Huang added the type/bug (The issue is confirmed as a bug.) label on Apr 16, 2024
JaySon-Huang (Contributor, Author) commented Apr 16, 2024

This happens because PITR restores the logs into the cluster without caring about the order between the default cf and the write cf (to speed up the restore). TiFlash, however, relies on the invariant that when it applies a write cf key, the corresponding default cf key must already exist; otherwise it cannot decode the key-value pairs into column data correctly. When TiFlash sees a write cf entry without its corresponding default cf entry, it panics. A minimal sketch of the invariant follows below.
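The following is an illustrative sketch of that invariant, not the actual TiFlash code (which is C++ and far more involved); all type and function names here are made up. A write-cf record that does not inline a short value must find its row data in the default cf under (key, start_ts), otherwise decoding fails with the kind of error shown in the backtrace above.

```go
package main

import "fmt"

// cfKey identifies a default-cf entry: the row key plus its prewrite (start) ts.
type cfKey struct {
	key     string
	startTS uint64
}

// writeRecord is a simplified write-cf entry. shortValue is nil when the row
// data was written to the default cf instead of being inlined.
type writeRecord struct {
	key        string
	commitTS   uint64
	startTS    uint64
	shortValue []byte
}

// regionStore holds the default-cf data TiFlash has applied for one region.
type regionStore struct {
	defaultCF map[cfKey][]byte
}

// applyWrite mirrors the check that fails in the reported crash: without a
// short value, the row data must already be present in the default cf.
func (s *regionStore) applyWrite(w writeRecord) ([]byte, error) {
	if w.shortValue != nil {
		return w.shortValue, nil
	}
	v, ok := s.defaultCF[cfKey{w.key, w.startTS}]
	if !ok {
		return nil, fmt.Errorf("key %q with prewrite ts %d not found in default cf",
			w.key, w.startTS)
	}
	return v, nil
}

func main() {
	s := &regionStore{defaultCF: map[cfKey][]byte{}}
	// PITR replayed the write-cf entry before (or without) its default-cf entry:
	_, err := s.applyWrite(writeRecord{key: "row-key", commitTS: 101, startTS: 100})
	fmt.Println(err) // this is the situation in which TiFlash raises the fatal error
}
```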

AkiraXie commented

/component br
/severity critical

BornChanger (Contributor) commented

It's a compatibility issue. We don't have a solution to resolve it yet, so we have to document the limitation.

BornChanger (Contributor) commented

@JaySon-Huang can TiFlash lift the restriction instead?

JaySon-Huang (Contributor, Author) commented

@BornChanger During step 4 (running PITR restore point again), TiFlash cannot tell whether a write cf key without its default cf key comes from a corrupted Raft log that was accepted in violation of the transaction model, or from a Raft log recovered by PITR, so TiFlash cannot lift the restriction only for PITR.
Can PITR guarantee that, within a single region, all KVs in the default_cf are written before any KVs in the write_cf are restored? I think this would resolve the problem (see the sketch below).
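A hedged sketch of the guarantee asked for above (this is not the actual BR implementation nor the fix in #53658; `kvEntry`, `ingestFunc`, and `restoreRegionBatch` are illustrative names): within one region batch, ingest all default-cf entries before any write-cf entry, so a write record never lands without its row data.

```go
package main

import "fmt"

// kvEntry is a restored key-value pair tagged with its column family.
type kvEntry struct {
	cf    string // "default" or "write"
	key   []byte
	value []byte
}

// ingestFunc stands in for whatever API actually writes a batch of KVs into a region.
type ingestFunc func(cf string, kvs []kvEntry) error

// restoreRegionBatch flushes every default-cf entry of the batch before any
// write-cf entry, preserving the invariant TiFlash relies on.
func restoreRegionBatch(batch []kvEntry, ingest ingestFunc) error {
	var defaults, writes []kvEntry
	for _, kv := range batch {
		if kv.cf == "default" {
			defaults = append(defaults, kv)
		} else {
			writes = append(writes, kv)
		}
	}
	if err := ingest("default", defaults); err != nil {
		return err
	}
	// Only after the default cf is durable may the write cf be restored.
	return ingest("write", writes)
}

func main() {
	batch := []kvEntry{
		{cf: "write", key: []byte("k1"), value: []byte("write-record")},
		{cf: "default", key: []byte("k1"), value: []byte("row-data")},
	}
	_ = restoreRegionBatch(batch, func(cf string, kvs []kvEntry) error {
		fmt.Println("ingesting", len(kvs), "kvs into", cf)
		return nil
	})
}
```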

YuJuncen (Contributor) commented Jun 5, 2024

I guess we need further discussion to decide whether to bring this to the release branches. For now, just fix it in master.

ti-chi-bot pushed a commit that referenced this issue on Jun 5, 2024
seiya-annie commented

/found customer

ti-chi-bot added the report/customer (Customers have encountered this bug.) label on Jun 19, 2024
seiya-annie commented

/remove-found customer

ti-chi-bot removed the report/customer (Customers have encountered this bug.) label on Jun 19, 2024