[release-16.0] Upgrade-Downgrade Fix: Schema-initialization stuck on semi-sync ACKs while upgrading (#13411) #13441
Description
This PR is a backport of #13411.
When we introduced the schema-init-db code, we failed to realize that it would start doing writes as part of the code that changes the tablet type. As part of the PRS process, we used to call `PromoteReplica` first, followed by calls to `SetReplicationSource`.
When a user upgrades from v16 (/v15) to v17 (/v16), as part of the `PromoteReplica` call, schema-init realizes that there are schema diffs to apply and ends up writing to the database. The problem is that if semi-sync is enabled, all of these writes get blocked indefinitely, because no replica has yet reparented to the new primary to ACK them. Eventually, `PromoteReplica` fails, and this fails the entire `PRS` call.
In this PR we fix this issue by altering the PRS flow slightly: we call `SetReplicationSource` on all the replicas and `PromoteReplica` on the new primary in parallel. This allows `PromoteReplica` to be unblocked as soon as any semi-sync capable replica reparents to it.
As part of this PR, the upgrade-downgrade tests for manual backups have also been augmented to start using semi-sync and to follow the correct steps to upgrade the cluster, instead of just shutting down all the tablets and restarting them. Also, the test now calls `PlannedReparentShard` instead of `InitShardPrimary`, which has long been deprecated.
Related Issue(s)
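For readers who want to see why the ordering matters, the reordered PRS flow from the Description can be sketched as a small simulation. This is not the Vitess implementation: `fakeShard`, `promoteReplica`, `setReplicationSource`, `reparentSequential`, `reparentParallel`, and `runWithTimeout` are hypothetical names, and semi-sync is modeled as a channel that is closed once the first replica has reparented.

```go
package main

import (
	"context"
	"errors"
	"sync"
	"time"
)

// fakeShard is a hypothetical stand-in for a shard with semi-sync enabled.
// ackReady is closed once the first replica points at the new primary,
// which is the moment a blocked write could receive a semi-sync ACK.
type fakeShard struct {
	once     sync.Once
	ackReady chan struct{}
}

func newFakeShard() *fakeShard {
	return &fakeShard{ackReady: make(chan struct{})}
}

// setReplicationSource models a replica reparenting to the new primary.
func (s *fakeShard) setReplicationSource(ctx context.Context, replica string) error {
	s.once.Do(func() { close(s.ackReady) })
	return nil
}

// promoteReplica models a promotion whose schema-init writes block until
// some replica can ACK them, or until the context expires.
func (s *fakeShard) promoteReplica(ctx context.Context) error {
	select {
	case <-s.ackReady:
		return nil
	case <-ctx.Done():
		return errors.New("promote: write blocked waiting for semi-sync ACK")
	}
}

// reparentSequential models the old order: promote first, then reparent.
func reparentSequential(ctx context.Context, s *fakeShard, replicas []string) error {
	if err := s.promoteReplica(ctx); err != nil {
		return err
	}
	for _, r := range replicas {
		if err := s.setReplicationSource(ctx, r); err != nil {
			return err
		}
	}
	return nil
}

// reparentParallel models the new order: promotion and reparenting run
// concurrently, so the first reparented replica unblocks the promotion.
func reparentParallel(ctx context.Context, s *fakeShard, replicas []string) error {
	errs := make(chan error, len(replicas)+1)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		errs <- s.promoteReplica(ctx)
	}()
	for _, r := range replicas {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			errs <- s.setReplicationSource(ctx, r)
		}(r)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// runWithTimeout runs one of the flows against a fresh fakeShard with a
// short deadline, so a blocked promotion surfaces as an error.
func runWithTimeout(flow func(context.Context, *fakeShard, []string) error, replicas []string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	return flow(ctx, newFakeShard(), replicas)
}
```

In this model, `runWithTimeout(reparentSequential, replicas)` returns an error because the promotion waits for an ACK that can only arrive after it has finished, while `runWithTimeout(reparentParallel, replicas)` returns nil as soon as one replica reparents.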
Checklist
Deployment Notes