-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
do not demote a new primary after backup completion #12856
do not demote a new primary after backup completion #12856
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
if tabletInfo.Type == topodatapb.TabletType_PRIMARY { | ||
return nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the right fix is to actually not call SetReplicationSource
here at all. The ChangeTabletType
code path already effectively does that here
https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/tabletmanager/rpc_actions.go#L100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though.. I'm guessing that this line is present to handle the case where the identity of the primary changed while the replica was taking a backup.
Even then, it doesn't belong at this outer layer, it should all be handled inside the call to tmc.Backup
because that is where we are making the decisions of whether the tablet type needs to change or not, whether we stop MySQL or not etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it will be good if you can create a unit test to validate the final change. Either in https://github.com/vitessio/vitess/blob/main/go/vt/vtctl/grpcvtctldserver/server_test.go#L510 or in the tabletmanager
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello Olga!,
I looked at the PR today and realized that the bug you are running into came to be because of a PR I did. #9725 was the PR that added the SetReplicationSource
RPC at the end of the Backup call. The motivation for the change was to fix the semi-sync settings on the vttablets after the Backup process has ended. If a vttablet becomes a REPLICA type from BACKUP, it might need to start sending semi-sync ACKs (If the primary or the durability changed).
The addition of this however broke backups from xtrabackup since the tablet itself can become a PRIMARY which the code doesn't take into account.
As @deepthi said, it doesn't look like a good idea to keep adding code here in the vtctld layer. Instead, we should remove the SetReplicationSource
call entirely from here and instead, fix the semi-sync and replication settings on the vttablet directly.
I think the best place for these changes would be https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/tabletmanager/rpc_backup.go#L101. Here we can read the shard record and keyspace record to find the current primary and make necessary actions.
Regarding testing, I think tests can be added in https://github.com/vitessio/vitess/blob/main/go/vt/wrangler/testlib/backup_test.go as the original PR does.
I've moved the SetReplicationSource call out of vtctl and into vttablet, now trying to wrangle the tests |
I'm having a hard time with the test- there doesn't seem to be a way to set xtrabackup, and for it to be a real test of the issue, there would have to be a reparent to make the replica tablet primary to ensure that we re-check. I'm trying to do this in an async go func to make it simultaneous with the backup |
@deepthi @GuptaManan100 could you give this another look? I tested this with our setup and didn't see the bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good to me! Does the newly added test fail without these changes? I tried running it without the changes in the PR and it still passed.
Also, the description needs to be changed to reflect the code changes that have happened |
@GuptaManan100 I'm looking for help with the test- there doesn't seem to be a way to set xtrabackup, and for it to be a real test of the issue, there would have to be a reparent to make the replica tablet primary to ensure that we re-check. I tried to do this in an async go func to make it simultaneous with the backup, but it didn't work and anyhow seems flakey |
@olyazavr I see! I agree with you that a unit test isn't sufficient for testing these changes. I think we should add an end-to-end test, especially because we have the ability to run xtrabackup there. You can take a look at https://github.com/vitessio/vitess/blob/main/go/test/endtoend/backup/xtrabackup/xtrabackup_test.go and add a test here which runs a xtrabackup test and reparents the primary before the backup completes. |
@GuptaManan100 I added an e2e test, though it also passes in main. The way I tried to simultaneously reparent and backup on that tablet is by launching two go functions, but I can't get it to print any output locally, so not sure how it's passing (perhaps it's a race here), could you give me some pointers? |
@olyazavr I looked at the end to end test that you added. I noticed something right off the bat, you aren't setting up the xtrabackup flags in the vttablets. So they're probably running normal builtin backups. You can look at vitess/go/test/endtoend/backup/vtctlbackup/backup_utils.go Lines 140 to 158 in fa0761e
clusterInstance.VttabletExtraArgs before starting the vttablets.
Also, if you setup your |
@GuptaManan100 added in the backup args to the test |
clusterInstance.VtTabletExtraArgs = []string{ | ||
"--backup_engine_implementation", "xtrabackup", | ||
"--xtrabackup_stream_mode=xbstream", | ||
"--xtrabackup_user=vt_dba", | ||
"--xtrabackup_backup_flags", | ||
"--password=VtDbaPass", | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work because the cluster has already been setup. The vttablets are already running by the time this code runs. You should add a parameter to the function you're using and then put these configurations before the vttablets are launched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok with enough finagling, I finally got this test to work 🎉
668817b
to
4d876dc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything LGTM now!
I have also tested the added test locally to make sure it isn't flaky! |
aa26106
to
242560d
Compare
this is an interesting failure when doing GetTablet: output of GetTablet is
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
json2
is a package in Vitess "vitess.io/vitess/go/json2"
. Using that unmarshaller fixes the problem. 😄
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> Signed-off-by: Olga Shestopalova <olgash@mit.edu> Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> Signed-off-by: <oshestopalova@hubspot.com> Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com> Signed-off-by: Olga Shestopalova <olgash@mit.edu>
Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com> Signed-off-by: Olga Shestopalova <olgash@mit.edu>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
5c7e5a1
to
3802300
Compare
Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com>
I've rebased and fixed tests, this is good to go! |
And just in time for RC-1. ❤️ it! |
* do not demote a new primary after backup completion Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * move set replication source out of vtctld and into tablet Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * remove set replication source in vtctld Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * fix ctx, and wip test Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * add e2e test instead of unit test Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * use built in methods Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * omit this in test Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * Update go/vt/vttablet/tabletmanager/rpc_backup.go Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> Signed-off-by: Olga Shestopalova <olgash@mit.edu> Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * move wg done to top of func Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * fix method change Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> Signed-off-by: <oshestopalova@hubspot.com> Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * try running this in isolation Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * try running this in isolation, dont run with others Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * add debug logging Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * add debug logging Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * Update go/test/endtoend/backup/vtctlbackup/backup_utils.go Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com> Signed-off-by: Olga Shestopalova <olgash@mit.edu> * Update go/test/endtoend/backup/vtctlbackup/backup_utils.go Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com> Signed-off-by: Olga Shestopalova <olgash@mit.edu> * run goimports Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * remove dupe Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> * remove generated file Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> --------- Signed-off-by: Olga Shestopalova <oshestopalova@hubspot.com> Signed-off-by: Olga Shestopalova <olgash@mit.edu> Signed-off-by: <oshestopalova@hubspot.com> Co-authored-by: Olga Shestopalova <oshestopalova@hubspot.com> Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> Co-authored-by: Manan Gupta <35839558+GuptaManan100@users.noreply.github.com>
Description
We have had several outages when the following occurred:
I tracked this down to this line in vtctld where we do
SetReplicationSource
after a backup completes. This moves that into the backup itself, and only after it is done andengine.ShouldDrainForBackup()
is true, we then refetch the tablet status, and then perform the equivalent toSetReplicationSource
. Here we also check if the tablet itself has been promoted to a primary or not. More context here at #12856 (comment)Related Issue(s)
Fixes #12754
Checklist
Deployment Notes