
Bug Report: xtrabackup on replica + promotion to primary causes outage #12754

Closed
makmanalp opened this issue Mar 29, 2023 · 5 comments · Fixed by #12856

Comments

@makmanalp

Overview of the Issue

If a backup using the xtrabackup engine is started on a replica and that replica is then promoted to primary, the backup continues to run and, when it completes, demotes the new primary back to replica, causing an outage:

I0329 03:58:18.672269       6 xtrabackupengine.go:198] Backup completed
I0329 03:58:18.672292       6 xtrabackupengine.go:117] Closing backup file MANIFEST
I0329 03:58:20.080713       6 rpc_server.go:84] TabletManager.Backup(concurrency:6)(on vt_green-xxxx902 from ): <nil>
I0329 03:58:20.096945       6 rpc_replication.go:659] SetReplicationSource: parent: cell:"vt_green" uid:xxxx902  position:  force: false semiSync: false
>>>>>>    I0329 03:58:20.097009       6 tm_state.go:177] Changing Tablet Type: REPLICA
I0329 03:58:20.097223       6 replmanager.go:88] Replication Manager: starting
I0329 03:58:20.097425       6 replmanager.go:121] Replication is stopped, reconnecting to primary.
>>>>>>    I0329 03:58:20.097974       6 replmanager.go:128] Failed to reconnect to primary: shard AutomationPlatform/f8- record claims tablet vt_green-xxxx902 is primary, but its type is REPLICA, will keep retrying.
I0329 03:58:20.098078       6 engine.go:312] VReplication Engine: closed
I0329 03:58:20.098086       6 engine.go:219] VDiff Engine: closed
I0329 03:58:20.098092       6 state_manager.go:220] Starting transition to REPLICA Serving, timestamp: 0001-01-01 00:00:00 +0000 UTC

The key bit seems to be the two lines I marked, where the tablet turns itself back into a replica and is then confused that the primary has type REPLICA 🥲

A manual emergency reparent got us out of this situation; there was no orchestrator intervention, since the MySQL topology was technically fine. I'm not sure how this would behave with vtorc (i.e. whether it also watches the Vitess topo); we're still a short bit away from testing that.

Reproduction Steps

  1. On a shard where backups take more than a few minutes, take an xtrabackup on a replica tablet.
  2. Promote the tablet to primary via vtctl plannedreparentshard while the backup is ongoing.
  3. Wait for backup to complete.

Binary Version

Version: 14.0.4-SNAPSHOT (Git revision 02da1bfa3d10cd0c279c0aaf641324609d8dfe0c branch 'HEAD') built on Thu Mar  9 19:20:30 UTC 2023 by root@buildkitsandbox using go1.18.7 linux/amd64

(this is our own build of v14 with some custom patches that are probably not relevant)

Operating System and Environment details

NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Stream 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"

Linux 5.4.141-hs22.el8.x86_64

Log Fragments

And some logs of the backup replica being promoted earlier on, for good measure:


I0329 01:55:37.527285       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:37 >> log scanned up to (22406742329571)
I0329 01:55:38.527720       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:38 >> log scanned up to (22406742342294)
>>>>>>> I0329 01:55:39.322095       6 rpc_replication.go:659] SetReplicationSource: parent: cell:"vt_green" uid:xxxx901  position: MySQL56/..... force: true semiSync: false
E0329 01:55:39.322455       6 rpc_replication.go:1017] invalid configuration - enabling semi sync even though not specified by durability policies. Possibly in the process of upgrading.
I0329 01:55:39.322469       6 replication.go:581] Setting semi-sync mode: primary=false, replica=true
I0329 01:55:39.322552       6 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
I0329 01:55:39.323695       6 hook.go:207] preflight_stop_slave hook doesn't exist
I0329 01:55:39.323836       6 query.go:81] exec STOP SLAVE
I0329 01:55:39.328095       6 query.go:81] exec START SLAVE
....
I0329 01:55:40.342657       6 rpc_replication.go:928] PromoteReplica
I0329 01:55:40.342890       6 query.go:81] exec STOP SLAVE
I0329 01:55:40.428270       6 query.go:81] exec RESET SLAVE ALL
I0329 01:55:40.437930       6 query.go:81] exec FLUSH BINARY LOGS
E0329 01:55:40.520002       6 rpc_replication.go:1017] invalid configuration - enabling semi sync even though not specified by durability policies. Possibly in the process of upgrading.
I0329 01:55:40.520024       6 replication.go:581] Setting semi-sync mode: primary=true, replica=true
....
I0329 01:55:41.527992       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:41 >> log scanned up to (22406742352406)
I0329 01:55:42.528134       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:42 >> log scanned up to (22406742352912)
I0329 01:55:43.528435       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:43 >> log scanned up to (22406742353368)
I0329 01:55:43.711073       6 tablegc.go:257] TableGC: transition into primary
I0329 01:55:44.528467       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:44 >> log scanned up to (22406742359605)
I0329 01:55:45.528686       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:45 >> log scanned up to (22406742360077)
@makmanalp makmanalp added Type: Bug Needs Triage This issue needs to be correctly labelled and triaged labels Mar 29, 2023
@rohit-nayak-ps rohit-nayak-ps added Component: Cluster management and removed Needs Triage This issue needs to be correctly labelled and triaged labels Mar 29, 2023
@makmanalp
Author

Thanks for assigning this quickly. I think the root cause might be this line: the assumption is that originalType is what the tablet should be set back to, which, in the case of a newly promoted primary, results in a demotion to replica.

if err := tm.changeTypeLocked(bgCtx, originalType, DBActionNone, SemiSyncActionNone); err != nil {
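
To make the suspected ordering problem concrete, here's a rough stand-alone sketch of that pattern (placeholder names and types, not the real TabletManager code): the tablet type is captured before the backup and restored unconditionally afterwards, so a tablet promoted mid-backup gets demoted.

package sketch

import "context"

// TabletType stands in for topodatapb.TabletType; the values are
// illustrative placeholders, not the real proto enum.
type TabletType string

const (
	typeReplica TabletType = "REPLICA"
	typePrimary TabletType = "PRIMARY"
)

// tabletManager is a toy stand-in for Vitess's TabletManager, just enough
// to show the suspected ordering problem.
type tabletManager struct {
	tabletType TabletType
}

func (tm *tabletManager) changeType(_ context.Context, t TabletType) error {
	tm.tabletType = t
	return nil
}

// backup captures the tablet type up front and unconditionally restores it
// once the backup finishes. If a PlannedReparentShard promotes the tablet
// to PRIMARY while the (long-running) backup is still in flight, the
// restore step demotes it back to REPLICA.
func (tm *tabletManager) backup(ctx context.Context, runBackup func(context.Context) error) error {
	originalType := tm.tabletType // REPLICA at backup start

	if err := runBackup(ctx); err != nil {
		return err
	}

	// Suspected problem: by now the tablet may be PRIMARY, but it still
	// gets reset to whatever type it had when the backup started.
	return tm.changeType(ctx, originalType)
}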

@makmanalp
Author

makmanalp commented Mar 29, 2023

I'm not entirely sure what the "correct" logic here should be. Perhaps we keep storing the previous tablet type for non-serving backup engines (which I think change the tablet type to BACKUP for the duration), since the restore logic is still needed there; but for online backup engines, where the tablet remains serving, we should not change the type at all.
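
Roughly what I mean, reusing the placeholder types from the sketch above (again, not the real API):

// Only restore the saved type when the backup engine actually changed it
// (non-serving engines flip the tablet to BACKUP for the duration), and
// never demote a tablet that has since become PRIMARY.
func restoreTypeAfterBackup(ctx context.Context, tm *tabletManager, engineIsServing bool, originalType TabletType) error {
	if engineIsServing {
		// Online engines such as xtrabackup never changed the type,
		// so there is nothing to restore.
		return nil
	}
	if tm.tabletType == typePrimary {
		// The tablet was promoted mid-backup; leave it alone.
		return nil
	}
	return tm.changeType(ctx, originalType)
}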

Other thoughts:

  • Should a tablet running a backup even be promotable? I think the answer is yes. Part of the point of xtrabackup is that it can run without taking a tablet offline, without reducing the redundancy of the cluster in case of failover.
    • For context: from our POV, the reason we'd ever take a backup on a replica meant for redundancy is because our dedicated backup pods (which spin up and down for backups to save cash / resources) cannot catch up due to e.g. very high write traffic since the last successful backup.
  • Should the ongoing backup process survive the promotion as it does now?
    • In the most common / general case, probably no: xtrabackup has a nontrivial memory and performance impact, and serving queries should take priority over backups on a newly promoted primary (a rough sketch of this option follows this list).
    • In the case of a disaster recovery scenario, perhaps yes: if we have e.g. only 2 tablets remaining in the shard, I probably desperately want the backup no matter what so I can grow the replica pool ASAP. Admittedly this is more niche.
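
For the "probably no" case, a hypothetical shape for this (none of these names exist in Vitess; it's just a sketch of the idea) would be to run the backup under a cancellable context and have the promotion path cancel it:

package sketch

import (
	"context"
	"sync"
)

// backupState is a hypothetical holder for an in-flight backup's cancel
// function; nothing here is an existing Vitess API.
type backupState struct {
	mu     sync.Mutex
	cancel context.CancelFunc // non-nil while a backup is running
}

// start derives a cancellable context for the backup and remembers its
// cancel function.
func (b *backupState) start(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	b.mu.Lock()
	b.cancel = cancel
	b.mu.Unlock()
	return ctx
}

// cancelOnPromotion would be called from the promotion path so that the new
// primary's memory and I/O go to serving queries rather than the backup.
func (b *backupState) cancelOnPromotion() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.cancel != nil {
		b.cancel()
		b.cancel = nil
	}
}

The disaster-recovery case could then be an explicit opt-in that skips the cancellation.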

@makmanalp
Author

Final note: I noticed that this PR (this issue) is also about to change that same bit of code, though I think the logic from the POV of this bug remains the same.

@olyazavr
Contributor

olyazavr commented Apr 6, 2023

After more investigation, it's not the line we thought: xtrabackup does not drain the tablet, and therefore never sets the tablet back to its original type. It's this line:

return reparentutil.SetReplicationSource(ctx, s.ts, s.tmc, tabletInfo.Tablet)

which sets the replication source after a backup completes, and that is what turns the tablet back into a replica.
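
To make the shape of the bug concrete, here's a condensed sketch of that flow; backupServer and runBackupEngine are placeholders, and only the final call is the line quoted above:

package sketch

import (
	"context"

	"vitess.io/vitess/go/vt/topo"
	"vitess.io/vitess/go/vt/vtctl/reparentutil"
	"vitess.io/vitess/go/vt/vttablet/tmclient"
)

// backupServer is a stand-in for the server-side state that owns the topo
// server and tablet manager client.
type backupServer struct {
	ts  *topo.Server
	tmc tmclient.TabletManagerClient
}

// backupThenReconnect: run the (long) backup, then unconditionally re-point
// replication at the shard primary. If the tablet being backed up has
// itself been promoted to PRIMARY in the meantime, this last step
// effectively demotes it.
func (s *backupServer) backupThenReconnect(ctx context.Context, tabletInfo *topo.TabletInfo, runBackupEngine func(context.Context) error) error {
	if err := runBackupEngine(ctx); err != nil {
		return err
	}

	// The problematic step from the comment above.
	return reparentutil.SetReplicationSource(ctx, s.ts, s.tmc, tabletInfo.Tablet)
}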

@olyazavr
Contributor

olyazavr commented Apr 6, 2023

I patched this on our end by adding

if tabletInfo.Type == topodatapb.TabletType_PRIMARY {
	return nil
}

I can upstream a fix
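
In context, the guard sits right before the SetReplicationSource call. Continuing the placeholder sketch from the earlier comment, and assuming tabletInfo reflects the current topo record (i.e. it is re-read after the backup), something like:

package sketch

import (
	"context"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo"
	"vitess.io/vitess/go/vt/vtctl/reparentutil"
	"vitess.io/vitess/go/vt/vttablet/tmclient"
)

// Same placeholder server as in the earlier sketch.
type backupServer struct {
	ts  *topo.Server
	tmc tmclient.TabletManagerClient
}

// backupThenReconnectFixed applies the guard from the patch above: a tablet
// that is PRIMARY by the time the backup finishes is left alone instead of
// being re-pointed at the shard primary (which would demote it to REPLICA).
func (s *backupServer) backupThenReconnectFixed(ctx context.Context, tabletInfo *topo.TabletInfo, runBackupEngine func(context.Context) error) error {
	if err := runBackupEngine(ctx); err != nil {
		return err
	}

	// The patch: skip SetReplicationSource for a promoted tablet, so the
	// new primary keeps serving instead of being demoted.
	if tabletInfo.Type == topodatapb.TabletType_PRIMARY {
		return nil
	}
	return reparentutil.SetReplicationSource(ctx, s.ts, s.tmc, tabletInfo.Tablet)
}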
