
Bug Report: xtrabackup on replica + promotion to primary causes outage #12754

Closed
makmanalp opened this issue Mar 29, 2023 · 5 comments · Fixed by #12856

Comments

@makmanalp

Overview of the Issue

If a backup using the xtrabackup engine is started on a replica and that replica is then promoted to primary, the backup continues to run and, when it completes, demotes the new primary back to replica, causing an outage:

I0329 03:58:18.672269       6 xtrabackupengine.go:198] Backup completed
I0329 03:58:18.672292       6 xtrabackupengine.go:117] Closing backup file MANIFEST
I0329 03:58:20.080713       6 rpc_server.go:84] TabletManager.Backup(concurrency:6)(on vt_green-xxxx902 from ): <nil>
I0329 03:58:20.096945       6 rpc_replication.go:659] SetReplicationSource: parent: cell:"vt_green" uid:xxxx902  position:  force: false semiSync: false
>>>>>>    I0329 03:58:20.097009       6 tm_state.go:177] Changing Tablet Type: REPLICA
I0329 03:58:20.097223       6 replmanager.go:88] Replication Manager: starting
I0329 03:58:20.097425       6 replmanager.go:121] Replication is stopped, reconnecting to primary.
>>>>>>    I0329 03:58:20.097974       6 replmanager.go:128] Failed to reconnect to primary: shard AutomationPlatform/f8- record claims tablet vt_green-xxxx902 is primary, but its type is REPLICA, will keep retrying.
I0329 03:58:20.098078       6 engine.go:312] VReplication Engine: closed
I0329 03:58:20.098086       6 engine.go:219] VDiff Engine: closed
I0329 03:58:20.098092       6 state_manager.go:220] Starting transition to REPLICA Serving, timestamp: 0001-01-01 00:00:00 +0000 UTC

The key bit seems to be the two lines I marked, where the tablet turns itself back into a replica and is then confused that the primary has type REPLICA 🥲

A manual emergency reparent got us out of this situation; there was no orchestrator intervention, since the MySQL topology was technically fine. I'm not sure how this would behave with vtorc (i.e. whether it also watches the Vitess topo); we're still a short bit away from testing that.

Reproduction Steps

  1. On a shard where backups take more than a few minutes, take an xtrabackup on a replica tablet.
  2. Promote the tablet to primary via vtctl plannedreparentshard while the backup is ongoing.
  3. Wait for backup to complete.

Binary Version

Version: 14.0.4-SNAPSHOT (Git revision 02da1bfa3d10cd0c279c0aaf641324609d8dfe0c branch 'HEAD') built on Thu Mar  9 19:20:30 UTC 2023 by root@buildkitsandbox using go1.18.7 linux/amd64

(this is our own build of v14 with some custom patches that are probably not relevant)

Operating System and Environment details

NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Stream 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"

Linux 5.4.141-hs22.el8.x86_64

Log Fragments

And some logs of the backup replica being promoted earlier on, for good measure:


I0329 01:55:37.527285       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:37 >> log scanned up to (22406742329571)
I0329 01:55:38.527720       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:38 >> log scanned up to (22406742342294)
>>>>>>> I0329 01:55:39.322095       6 rpc_replication.go:659] SetReplicationSource: parent: cell:"vt_green" uid:xxxx901  position: MySQL56/..... force: true semiSync: false
E0329 01:55:39.322455       6 rpc_replication.go:1017] invalid configuration - enabling semi sync even though not specified by durability policies. Possibly in the process of upgrading.
I0329 01:55:39.322469       6 replication.go:581] Setting semi-sync mode: primary=false, replica=true
I0329 01:55:39.322552       6 query.go:81] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
I0329 01:55:39.323695       6 hook.go:207] preflight_stop_slave hook doesn't exist
I0329 01:55:39.323836       6 query.go:81] exec STOP SLAVE
I0329 01:55:39.328095       6 query.go:81] exec START SLAVE
....
I0329 01:55:40.342657       6 rpc_replication.go:928] PromoteReplica
I0329 01:55:40.342890       6 query.go:81] exec STOP SLAVE
I0329 01:55:40.428270       6 query.go:81] exec RESET SLAVE ALL
I0329 01:55:40.437930       6 query.go:81] exec FLUSH BINARY LOGS
E0329 01:55:40.520002       6 rpc_replication.go:1017] invalid configuration - enabling semi sync even though not specified by durability policies. Possibly in the process of upgrading.
I0329 01:55:40.520024       6 replication.go:581] Setting semi-sync mode: primary=true, replica=true
....
I0329 01:55:41.527992       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:41 >> log scanned up to (22406742352406)
I0329 01:55:42.528134       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:42 >> log scanned up to (22406742352912)
I0329 01:55:43.528435       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:43 >> log scanned up to (22406742353368)
I0329 01:55:43.711073       6 tablegc.go:257] TableGC: transition into primary
I0329 01:55:44.528467       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:44 >> log scanned up to (22406742359605)
I0329 01:55:45.528686       6 xtrabackupengine.go:311] xtrabackup stderr: 230329 01:55:45 >> log scanned up to (22406742360077)
@makmanalp makmanalp added Type: Bug Needs Triage This issue needs to be correctly labelled and triaged labels Mar 29, 2023
@rohit-nayak-ps rohit-nayak-ps added Component: Cluster management and removed Needs Triage This issue needs to be correctly labelled and triaged labels Mar 29, 2023
@makmanalp
Author

Thanks for assigning this quickly. I think the root cause might be this line: the assumption is that originalType is what the tablet should be set back to, which, in the case of a newly promoted primary, results in a demotion to replica.

if err := tm.changeTypeLocked(bgCtx, originalType, DBActionNone, SemiSyncActionNone); err != nil {
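
To make the suspected ordering problem concrete, here's a rough stand-alone sketch of that pattern (placeholder names and types, not the real TabletManager code): the tablet type is captured before the backup and restored unconditionally afterwards, so a tablet promoted mid-backup gets demoted.

package sketch

import "context"

// TabletType stands in for topodatapb.TabletType; the values are
// illustrative placeholders, not the real proto enum.
type TabletType string

const (
	typeReplica TabletType = "REPLICA"
	typePrimary TabletType = "PRIMARY"
)

// tabletManager is a toy stand-in for Vitess's TabletManager, just enough
// to show the suspected ordering problem.
type tabletManager struct {
	tabletType TabletType
}

func (tm *tabletManager) changeType(_ context.Context, t TabletType) error {
	tm.tabletType = t
	return nil
}

// backup captures the tablet type up front and unconditionally restores it
// once the backup finishes. If a PlannedReparentShard promotes the tablet
// to PRIMARY while the (long-running) backup is still in flight, the
// restore step demotes it back to REPLICA.
func (tm *tabletManager) backup(ctx context.Context, runBackup func(context.Context) error) error {
	originalType := tm.tabletType // REPLICA at backup start

	if err := runBackup(ctx); err != nil {
		return err
	}

	// Suspected problem: by now the tablet may be PRIMARY, but it still
	// gets reset to whatever type it had when the backup started.
	return tm.changeType(ctx, originalType)
}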

@makmanalp
Author

makmanalp commented Mar 29, 2023

I'm not entirely sure what the "correct" logic here should be. Perhaps we keep storing the previous tablet type for non-serving backup engines (which I think change the tablet type to BACKUP for the duration), since the restore logic is still needed there; but for online backup engines, where the tablet remains serving, we should not change the type at all.
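
Roughly what I mean, reusing the placeholder types from the sketch above (again, not the real API):

// Only restore the saved type when the backup engine actually changed it
// (non-serving engines flip the tablet to BACKUP for the duration), and
// never demote a tablet that has since become PRIMARY.
func restoreTypeAfterBackup(ctx context.Context, tm *tabletManager, engineIsServing bool, originalType TabletType) error {
	if engineIsServing {
		// Online engines such as xtrabackup never changed the type,
		// so there is nothing to restore.
		return nil
	}
	if tm.tabletType == typePrimary {
		// The tablet was promoted mid-backup; leave it alone.
		return nil
	}
	return tm.changeType(ctx, originalType)
}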

Other thoughts:

  • Should a tablet running a backup even be promotable? I think the answer is yes. Part of the point of xtrabackup is that it can run without taking a tablet offline, without reducing the redundancy of the cluster in case of failover.
    • For context: from our POV, the reason we'd ever take a backup on a replica meant for redundancy is because our dedicated backup pods (which spin up and down for backups to save cash / resources) cannot catch up due to e.g. very high write traffic since the last successful backup.
  • Should the ongoing backup process survive the promotion as it does now?
    • In the most common / general case, probably no: xtrabackup has a nontrivial memory and performance impact, and serving queries should take priority over backups on a newly promoted primary (a rough sketch of this option follows this list).
    • In the case of a disaster recovery scenario, perhaps yes: if we have e.g. only 2 tablets remaining in the shard, I probably desperately want the backup no matter what so I can grow the replica pool ASAP. Admittedly this is more niche.
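
For the "probably no" case, a hypothetical shape for this (none of these names exist in Vitess; it's just a sketch of the idea) would be to run the backup under a cancellable context and have the promotion path cancel it:

package sketch

import (
	"context"
	"sync"
)

// backupState is a hypothetical holder for an in-flight backup's cancel
// function; nothing here is an existing Vitess API.
type backupState struct {
	mu     sync.Mutex
	cancel context.CancelFunc // non-nil while a backup is running
}

// start derives a cancellable context for the backup and remembers its
// cancel function.
func (b *backupState) start(parent context.Context) context.Context {
	ctx, cancel := context.WithCancel(parent)
	b.mu.Lock()
	b.cancel = cancel
	b.mu.Unlock()
	return ctx
}

// cancelOnPromotion would be called from the promotion path so that the new
// primary's memory and I/O go to serving queries rather than the backup.
func (b *backupState) cancelOnPromotion() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.cancel != nil {
		b.cancel()
		b.cancel = nil
	}
}

The disaster-recovery case could then be an explicit opt-in that skips the cancellation.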

@makmanalp
Author

Final note: I noticed that this PR (this issue) is also about to change that same bit of code, though I think the logic from the POV of this bug remains the same.

@olyazavr
Contributor

olyazavr commented Apr 6, 2023

After more investigation, it's not the line we thought: xtrabackup does not drain the tablet, and therefore never sets the tablet back to its original type. It's this line:

return reparentutil.SetReplicationSource(ctx, s.ts, s.tmc, tabletInfo.Tablet)

which sets the replication source after a backup completes, and that is what turns the tablet back into a replica.
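
To make the shape of the bug concrete, here's a condensed sketch of that flow; backupServer and runBackupEngine are placeholders, and only the final call is the line quoted above:

package sketch

import (
	"context"

	"vitess.io/vitess/go/vt/topo"
	"vitess.io/vitess/go/vt/vtctl/reparentutil"
	"vitess.io/vitess/go/vt/vttablet/tmclient"
)

// backupServer is a stand-in for the server-side state that owns the topo
// server and tablet manager client.
type backupServer struct {
	ts  *topo.Server
	tmc tmclient.TabletManagerClient
}

// backupThenReconnect: run the (long) backup, then unconditionally re-point
// replication at the shard primary. If the tablet being backed up has
// itself been promoted to PRIMARY in the meantime, this last step
// effectively demotes it.
func (s *backupServer) backupThenReconnect(ctx context.Context, tabletInfo *topo.TabletInfo, runBackupEngine func(context.Context) error) error {
	if err := runBackupEngine(ctx); err != nil {
		return err
	}

	// The problematic step from the comment above.
	return reparentutil.SetReplicationSource(ctx, s.ts, s.tmc, tabletInfo.Tablet)
}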

@olyazavr
Contributor

olyazavr commented Apr 6, 2023

I patched this on our end by adding

if tabletInfo.Type == topodatapb.TabletType_PRIMARY {
	return nil
}

I can upstream a fix
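
In context, the guard sits right before the SetReplicationSource call. Continuing the placeholder sketch from the earlier comment, and assuming tabletInfo reflects the current topo record (i.e. it is re-read after the backup), something like:

package sketch

import (
	"context"

	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
	"vitess.io/vitess/go/vt/topo"
	"vitess.io/vitess/go/vt/vtctl/reparentutil"
	"vitess.io/vitess/go/vt/vttablet/tmclient"
)

// Same placeholder server as in the earlier sketch.
type backupServer struct {
	ts  *topo.Server
	tmc tmclient.TabletManagerClient
}

// backupThenReconnectFixed applies the guard from the patch above: a tablet
// that is PRIMARY by the time the backup finishes is left alone instead of
// being re-pointed at the shard primary (which would demote it to REPLICA).
func (s *backupServer) backupThenReconnectFixed(ctx context.Context, tabletInfo *topo.TabletInfo, runBackupEngine func(context.Context) error) error {
	if err := runBackupEngine(ctx); err != nil {
		return err
	}

	// The patch: skip SetReplicationSource for a promoted tablet, so the
	// new primary keeps serving instead of being demoted.
	if tabletInfo.Type == topodatapb.TabletType_PRIMARY {
		return nil
	}
	return reparentutil.SetReplicationSource(ctx, s.ts, s.tmc, tabletInfo.Tablet)
}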
