-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SwitchTraffic: use separate context while canceling a migration #17340
SwitchTraffic: use separate context while canceling a migration #17340
Conversation
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17340 +/- ##
==========================================
+ Coverage 67.41% 67.46% +0.04%
==========================================
Files 1576 1581 +5
Lines 253417 253944 +527
==========================================
+ Hits 170846 171328 +482
- Misses 82571 82616 +45 ☔ View full report in Codecov by Sentry. |
// We create a new context while canceling the migration, so that we are independent of the original | ||
// context being cancelled prior to or during the cancel operation. | ||
cmTimeout := 60 * time.Second | ||
cmCtx, cmCancel := context.WithTimeout(context.Background(), cmTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably does not matter logically, but I'm often wondering whether semantically context.WithoutCancel
would be more accurate in this situation?
cmCtx, cmCancel := context.WithTimeout(context.Background(), cmTimeout) | |
cmCtx, cmCancel := context.WithTimeout(context.WithoutCancel(ctx), cmTimeout) |
It probably makes no difference as we're not using values stored on the context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible that we get here when the ctx
is already cancelled. In the specific case which lead us to this issue the client had a short timeout and original context's deadline had already expired.
I think I've run into this before, where switching traffic back and forth could lead to one side of the workflow (either the reverse or the original one) being deleted if a timeout is hit. Does this sound familiar? 🤔 |
If you reverse traffic then the source should not have any reverse workflow running since we are still doing forward vreplication (no traffic is switched). So the reverse workflow will get deleted. |
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ IMO it's worth adding a unit test [case] where we call cancel with a canceled context but it's such a simple and obvious change that it's not required.
} else { | ||
err = ts.changeShardsAccess(ctx, ts.SourceKeyspaceName(), ts.SourceShards(), allowWrites) | ||
err = ts.changeShardsAccess(cmCtx, ts.SourceKeyspaceName(), ts.SourceShards(), allowWrites) | ||
} | ||
if err != nil { | ||
ts.Logger().Errorf("Cancel migration failed: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should take this opportunity to improve the log message:
ts.Logger().Errorf("Cancel migration failed: could not revert denied tables / shared access: %v", err)
I also think that we should accumulate these and return them to the caller. But we could defer that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the log message. Let's make the changes to pass them to the caller separately since we are not returning the error at the moment.
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Description
Use a fresh context in
cancelMigration()
so that the rollback steps will happen independent of any timeouts on the client side.We should backport this to supported versions since failures in SwitchTraffic can cause cluster unavailability.
Related Issue(s)
#17326
Checklist