Fix problematic watch cancellation due to context cancellation #11170
Conversation
Right now we pass in the context when starting a Watch that is also used for the request context. This means that the Watch ends up being cancelled when the original request that started it as a side effect completes and cancels the context to clean up. This is of course not as intended.

Before the refactor in vitessio#10906 this wasn't causing a practical issue yet. We'd still have the expired context internally in the watcher and it would be passed through when updating entries, but there were no calls that ended up validating the context expiry, avoiding any immediate issue. This is bound to fail at some point though, if something is added that does care about the context.

What is needed is that the watcher we start sets up its own context based on the background context, since it is detached from the original request that might trigger starting the watcher as a side effect.

Additionally, it means that the tracked context for an error isn't really useful. It would often be an already cancelled context from a mostly unrelated request, which doesn't provide useful information. Even more so, it would keep a reference to that context, so it could potentially never be garbage collected and would keep more request data alive than necessary. With the fix, the tracked context would always be the background context with a cancel on top for that watcher, which isn't very useful either. We also don't use this context tracking for any error messaging or reporting anywhere, so I believe it's better to clean up this tracking.

By cleaning up that tracking, we also avoid the need to pass down the context in entry updates, and that is all cleaned up here as well.

Lastly, a failing test is introduced that verifies the original issue. It retrieves serving keyspace information, cancels the original request that triggered that, and then validates the watcher is still running by updating the value again within the timeout window. This failed before this fix, as the watcher would be cancelled and the cached old value was returned before the TTL expired.

The main problem of this bug is not one of correctness, but of a serious performance degradation in the vtgate. If we ever had a failure on the path triggered by regular queries, we'd restart the watch context setup every second; the system would not recover from this situation and would heavily query the topo server, making things very expensive.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
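As an illustration of the approach described above, here is a minimal sketch (the names watchsketch, startWatch and watcher are made up, this is not the actual srvtopo code) of a watcher that derives its own context from the background context, so that completing or cancelling the triggering request no longer tears down the watch:

package watchsketch

import "context"

type watcher struct {
	cancel context.CancelFunc
}

// startWatch is triggered as a side effect of a regular request. The request
// context (reqCtx) is deliberately not used for the watch itself.
func startWatch(reqCtx context.Context) *watcher {
	// Detach from the caller: the watch lives until stop() is called,
	// not until the triggering request finishes.
	ctx, cancel := context.WithCancel(context.Background())
	w := &watcher{cancel: cancel}
	go w.run(ctx)
	return w
}

func (w *watcher) run(ctx context.Context) {
	// Receive and apply watch updates here until the watcher is stopped.
	<-ctx.Done()
}

func (w *watcher) stop() {
	w.cancel()
}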
The timer here can stay around if other events fire first, so we want to use an explicit timer so we can stop it immediately when we know it has completed. Additionally, because of binding issues, watchCancel() would not rebind if we start a new inner watcher. Therefore this adds back an outer context that we can cancel in a defer, so we know for sure we cancel things properly when stopping the watcher. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
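A rough sketch of that retry loop shape, assuming a hypothetical watchOnce helper for a single watch attempt (illustrative only, not the actual srvtopo code):

package watchsketch

import (
	"context"
	"time"
)

// runWatcher shows the outer-context-plus-explicit-timer pattern: one outer
// context that a single defer cancels, with inner watch contexts derived
// from it, so replacing the inner cancel func on retry cannot leak anything.
func runWatcher(watchOnce func(context.Context) error) {
	outerCtx, outerCancel := context.WithCancel(context.Background())
	defer outerCancel()

	for {
		watchCtx, watchCancel := context.WithCancel(outerCtx)
		err := watchOnce(watchCtx)
		watchCancel()
		if err == nil {
			return
		}

		// Use an explicit timer and stop it as soon as it is no longer
		// needed, instead of leaving it around until it fires.
		retry := time.NewTimer(time.Second)
		select {
		case <-outerCtx.Done():
			retry.Stop()
			return
		case <-retry.C:
		}
	}
}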
We never closed the `cli` instance here so it would linger until the process completes. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
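A small hedged example of the cleanup pattern this refers to; newTestClient is a hypothetical stand-in for however the test constructs its etcd client:

package etcdtopotest

import (
	"io"
	"testing"
)

// newTestClient is a hypothetical helper standing in for the real client
// construction; the only point here is that the client must be closed.
func newTestClient(t *testing.T) io.Closer {
	t.Helper()
	// ... build and return the real client here ...
	return io.NopCloser(nil)
}

func TestClientIsClosed(t *testing.T) {
	cli := newTestClient(t)
	// Close the client when the test finishes instead of letting it
	// linger until the whole test process exits.
	t.Cleanup(func() { _ = cli.Close() })

	// ... test body ...
}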
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
General
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
-	*srvTopoCacheTTL = time.Duration(100 * time.Millisecond)
-	*srvTopoCacheRefresh = time.Duration(40 * time.Millisecond)
+	*srvTopoCacheTTL = time.Duration(200 * time.Millisecond)
+	*srvTopoCacheRefresh = time.Duration(80 * time.Millisecond)
Changed the timing to give some extra room here, so it's possible to verify more reliably below that things also work as expected within this timeout window.
LGTM
@@ -64,7 +64,6 @@ func (q *SrvKeyspaceNamesQuery) srvKeyspaceNamesCacheStatus() (result []*SrvKeys
 			ExpirationTime: entry.insertionTime.Add(q.rq.cacheTTL),
 			LastQueryTime:  entry.lastQueryTime,
 			LastError:      entry.lastError,
-			LastErrorCtx:   entry.lastErrorCtx,
This looks right to me. This field is never used anywhere, so it's unclear why we even have it.
Let's wait for @mattlord to review before we merge this.
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
The changes make sense to me; I only had one question about the inner watchCancel() usage.
I also had one minor request about the test, as flakiness is a persistent problem in the CI.
Thank you!
	require.NoError(t, err, "UpdateSrvKeyspace(test_cell, test_ks, %s) failed", want)

	// Wait a bit to give the watcher enough time to update the value.
	time.Sleep(10 * time.Millisecond)
This will likely introduce some level of flakiness. I think that we should use something like this:
func waitForSrvKeyspaceResult(t *testing.T, rs *ResilientServer, cell string, keyspace string, want *topodatapb.SrvKeyspace) {
	// Allow for several attempts to deal with any ephemeral issues
	// and prevent flaky tests.
	timeout := *srvTopoCacheRefresh * 5
	tick := *srvTopoCacheRefresh
	tmr := time.NewTimer(timeout)
	defer tmr.Stop()
	var got *topodatapb.SrvKeyspace
	var err error
	ctx := context.Background()
	for {
		select {
		case <-tmr.C:
			t.Fatalf("Did not get expected GetSrvKeyspace() result of %+v before the timeout of %s. Last seen value: %+v",
				want, timeout, got)
		default:
			// Assign to the outer got/err so the timeout message above can
			// report the last seen value.
			got, err = rs.GetSrvKeyspace(ctx, cell, keyspace)
			if err != nil {
				t.Fatalf("GetSrvKeyspace() call had unexpected error: %v", err)
			}
			if proto.Equal(want, got) {
				return
			}
		}
		time.Sleep(tick)
	}
}
		return nil, nil, vterrors.Errorf(vtrpc.Code_INVALID_ARGUMENT, "Watch failed")
	}

	// Create the notifications channel, send updates to it.
	notifications := make(chan *topo.WatchData, 10)
	go func() {
		defer close(notifications)
		defer watchCancel()
Curious why we removed this line but call watchCancel() before outerCancel() in other places, like line #192?
@mattlord So this one should be the outerCancel(). This is because values get bound when the defer is defined. That means that when we replace watchCancel() later on, this defer would not be updated and we'd never defer the new watchCancel().
That's also why I added back the outerCancel() here, since that makes it easier to reason about. In line 192 it's because the linter otherwise complains about a leaked context (which in practice doesn't happen, since the enclosing one is cancelled), but it seemed easiest to do it that way.
That's also the logic we had before this refactoring around these two contexts.
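To illustrate the binding behaviour being described (a standalone sketch, not code from this PR): in Go, defer watchCancel() evaluates the function value at the point the defer statement runs, so reassigning watchCancel afterwards does not change what eventually gets called.

package defersketch

import "context"

func example() {
	_, watchCancel := context.WithCancel(context.Background())

	// The current value of watchCancel is captured right here.
	defer watchCancel()

	// The watcher restarts and watchCancel is replaced...
	_, watchCancel = context.WithCancel(context.Background())

	// ...but the defer above still points at the first cancel func, so the
	// replacement would never be called by it. Deferring a closure (or
	// cancelling an outer context, as this PR does) picks up the latest value:
	defer func() { watchCancel() }()
}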
We discussed using a waitForSrvKeyspaceResult() function, but this case is not as straightforward as the typical waitForX case. The new test modifications also line up pretty well with the rest of the test, so we can put off any potential test de-flaking until the test actually turns out to be flaky. 🙂
For future reference, I think that something like this may be what we'd want to use if we do need to start adding waits with a timeout:
func waitForSrvKeyspaceResult(t *testing.T, rs *ResilientServer, cell string, keyspace string, want *topodatapb.SrvKeyspace) {
	// Allow for several attempts to deal with any ephemeral issues
	// and prevent flaky tests.
	timeout := *srvTopoCacheRefresh * 5
	tick := *srvTopoCacheRefresh
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	var got *topodatapb.SrvKeyspace
	var err error
	ctx := context.Background()
	for {
		select {
		case <-timer.C:
			t.Fatalf("Did not get expected GetSrvKeyspace() result of %+v before the timeout of %s. Last seen value: %+v",
				want, timeout, got)
		default:
			status := rs.CacheStatus()
			for _, sk := range status.SrvKeyspaces {
				if sk.Cell == cell && sk.Keyspace == keyspace && sk.LastError != nil {
					t.Fatalf("SrvKeyspace cache entry for cell %s and keyspace %s recorded the following error, which should not happen: %v",
						cell, keyspace, sk.LastError)
				}
			}
			// Assign to the outer got/err so the timeout message above can
			// report the last seen value.
			got, err = rs.GetSrvKeyspace(ctx, cell, keyspace)
			if err != nil {
				t.Fatalf("GetSrvKeyspace() call had unexpected error: %v", err)
			}
			if proto.Equal(want, got) {
				return
			}
		}
		time.Sleep(tick)
	}
}
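For example, with such a helper in place the test could replace the fixed time.Sleep(10 * time.Millisecond) after the update with a call along the lines of waitForSrvKeyspaceResult(t, rs, "test_cell", "test_ks", want) (names taken from the snippets above; the exact variables would depend on the test).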
LGTM
…sio#1025) * Revert "Revert WatchRecursive Topo Feature (vitessio#1023)". This reverts commit ba735af. * Fix problematic watch cancellation due to context cancellation (vitessio#11170). * Improve handling of retries and timer wait. * Fix leak in etcd2topo tests. * Remove unused context. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Right now we pass in the context when starting a Watch that is also used for the request context. This means that the Watch ends up being cancelled when the original request that started it as a side effect completes and cancels the context to clean up.
This is of course not as intended. Before the refactor in #10906 this wasn't causing a practical issue yet. We'd still have the expired context internally in the watcher and it would be passed through when updating entries, but there were no calls that ended up validating the context expiry, avoiding any immediate issue.
This is bound to fail at some point though, if something is added that does care about the context. What is needed is that the watcher we start sets up its own context based on the background context, since it is detached from the original request that might trigger starting the watcher as a side effect.
Additionally, it means that the tracked context for an error isn't really useful. It would often be an already cancelled context from a mostly unrelated request, which doesn't provide useful information. Even more so, it would keep a reference to that context, so it could potentially never be garbage collected and would keep more request data alive than necessary.
With the fix, the tracked context would always be the background context with a cancel on top for that watcher, which isn't very useful either. We also don't use this context tracking for any error messaging or reporting anywhere, so I believe it's better to clean up this tracking.
By cleaning up that tracking, we also avoid the need to pass down the context in entry updates, and that is all cleaned up here as well.
Lastly, a failing test is introduced that verifies the original issue. It retrieves serving keyspace information, cancels the original request that triggered that, and then validates the watcher is still running by updating the value again within the timeout window. This failed before this fix, as the watcher would be cancelled and the cached old value was returned before the TTL expired.
The main problem of this bug is not one of correctness, but of a serious performance degradation in the vtgate. If we ever had a failure on the path triggered by regular queries, we'd restart the watch context setup every second; the system would not recover from this situation and would heavily query the topo server, making things very expensive.
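A sketch of that test flow under assumed names (newTestResilientServer and updateSrvKeyspace are hypothetical helpers, and the assertion shape is illustrative rather than the exact test code added in this PR):

package srvtoposketch

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"google.golang.org/protobuf/proto"
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// srvKeyspaceGetter is the small slice of the resilient server API this
// sketch needs; in the real test rs is the srvtopo resilient server itself.
type srvKeyspaceGetter interface {
	GetSrvKeyspace(ctx context.Context, cell, keyspace string) (*topodatapb.SrvKeyspace, error)
}

// Hypothetical helpers standing in for the real test setup in the PR.
var (
	newTestResilientServer func(t *testing.T) srvKeyspaceGetter
	updateSrvKeyspace      func(t *testing.T, cell, keyspace string) *topodatapb.SrvKeyspace
)

func TestWatchSurvivesRequestCancellation(t *testing.T) {
	rs := newTestResilientServer(t)

	// A regular request triggers the watcher as a side effect.
	reqCtx, reqCancel := context.WithCancel(context.Background())
	_, err := rs.GetSrvKeyspace(reqCtx, "test_cell", "test_ks")
	require.NoError(t, err)

	// The original request completes and its context is cancelled.
	reqCancel()

	// Update the value in the topo server within the cache TTL window and
	// give the watcher a moment to apply the update.
	want := updateSrvKeyspace(t, "test_cell", "test_ks")
	time.Sleep(10 * time.Millisecond)

	// Before the fix, the watcher had been cancelled together with reqCtx,
	// so this would still return the stale cached value until the TTL expired.
	got, err := rs.GetSrvKeyspace(context.Background(), "test_cell", "test_ks")
	require.NoError(t, err)
	require.True(t, proto.Equal(want, got), "watcher should have picked up the update, got %+v", got)
}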
Related Issue(s)
Follow up for #10906
Checklist