
Fix problematic watch cancellation due to context cancellation #11170

Merged
merged 4 commits into vitessio:main from fix-topo-watch-cancellation on Sep 6, 2022

Conversation

@dbussink (Contributor) commented Sep 2, 2022

Right now we pass in the request's context when starting a Watch. This means that the Watch ends up being cancelled when the original request that started it as a side effect completes and cancels its context to clean up.

This is of course not as intended. Before the refactor in #10906 this wasn't causing a practical issue yet: we'd still have the expired context internally in the watcher and it would be passed along when updating entries, but no calls ended up validating the context expiry, so there was no immediate issue.

This is bound to fail at some point though, if something is added that does care about the context. What is needed is for the watcher we start to set up its own context based on the background context, since its lifetime is detached from the original request that might have triggered starting it as a side effect.
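As a sketch of the intended pattern (startWatch and runWatch here are illustrative stand-ins, not the actual srvtopo code):

// Sketch only: detach the watcher's lifetime from the request that triggered it.
func startWatch(requestCtx context.Context) {
	// Wrong: the watch stops as soon as the triggering request completes,
	// because the request's context is cancelled during cleanup.
	//   go runWatch(requestCtx)

	// Intended: derive the watcher's context from the background context so it
	// survives the request; the cancel func is what stops the watcher later.
	watchCtx, watchCancel := context.WithCancel(context.Background())
	_ = watchCancel // kept by the watcher for an explicit shutdown
	go runWatch(watchCtx)
}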

Additionally, this means that the context we track for an error isn't really useful. It would often be an already cancelled context from a mostly unrelated request, which doesn't provide useful information. Worse, tracking it keeps a reference to that context, so it potentially can never be garbage collected and keeps more request data alive than necessary.

With the fix, the tracked context would always be the background context with a cancel on top for that watcher, which isn't very useful either. Since we don't use this context tracking for any error messaging or reporting anywhere, I believe it's better to remove the tracking entirely.

By removing that tracking, we also avoid the need to pass the context down in entry updates; that is all cleaned up here as well.

Lastly, a test is introduced that verifies the original issue. It retrieves serving keyspace information, cancels the original request that triggered the watch, and then validates that the watcher is still running by updating the value again within the timeout window. This failed before the fix, as the watcher would be cancelled and the cached old value would be returned until the TTL expired.
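Roughly, the flow the test exercises looks like the sketch below (an illustrative outline only, not the test code added in this PR; the function name is made up and the exact vitess signatures may differ):

func testWatcherSurvivesRequestCancellation(t *testing.T, ts *topo.Server, rs *srvtopo.ResilientServer) {
	// A request with its own context triggers the watcher as a side effect.
	reqCtx, reqCancel := context.WithCancel(context.Background())
	_, err := rs.GetSrvKeyspace(reqCtx, "test_cell", "test_ks")
	require.NoError(t, err)

	// The request completes and its context is cancelled.
	reqCancel()

	// Update the value in the topo server; a still-running watcher should pick
	// it up well before the cache TTL expires.
	want := &topodatapb.SrvKeyspace{} // some value different from the initial one
	err = ts.UpdateSrvKeyspace(context.Background(), "test_cell", "test_ks", want)
	require.NoError(t, err)
	time.Sleep(10 * time.Millisecond)

	got, err := rs.GetSrvKeyspace(context.Background(), "test_cell", "test_ks")
	require.NoError(t, err)
	assert.True(t, proto.Equal(want, got), "watcher was cancelled together with the request")
}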

The main problem with this bug is not correctness, but a serious performance degradation in the vtgate. If we ever had a failure on the path triggered by regular queries, we'd redo the watch setup every second; the system would not recover from this situation and would heavily query the topo server, making things very expensive.

Related Issue(s)

Follow up for #10906

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Fix problematic watch cancellation due to context cancellation

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Improve handling of retries and timer wait

The timer here can stay around if other events fire first, so we want to
use an explicit timer that we stop immediately when we know it's done.

Additionally, because of binding issues, the deferred watchCancel() would
not rebind if we start a new inner watcher. Therefore this adds back an
outer context that we can cancel in a defer, so we know for sure we cancel
things properly when stopping the watcher.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
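In sketch form, the difference the commit describes (interval and notifications are placeholders, not the actual watcher variables):

timer := time.NewTimer(interval)
select {
case <-notifications:
	// Another event fired first: stop the timer right away instead of leaving
	// it to linger until it expires, which a bare time.After would do.
	timer.Stop()
case <-timer.C:
	// The wait interval elapsed.
}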
Fix leak in etc2topo tests

We never closed the `cli` instance here, so it would linger until the
process completed.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
@vitess-bot (bot) commented Sep 2, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive. Additionally, flag names should use dashes (-) as word separators rather than underscores (_).
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@dbussink dbussink requested a review from deepthi as a code owner September 2, 2022 11:58
@dbussink dbussink requested a review from mattlord September 2, 2022 11:58
-*srvTopoCacheTTL = time.Duration(100 * time.Millisecond)
-*srvTopoCacheRefresh = time.Duration(40 * time.Millisecond)
+*srvTopoCacheTTL = time.Duration(200 * time.Millisecond)
+*srvTopoCacheRefresh = time.Duration(80 * time.Millisecond)
@dbussink (Contributor, Author) commented:
Changed the timing to have some extra room here, so it's possible to verify more reliably below that things work as expected within this timeout window.

@deepthi (Member) left a comment

LGTM

@@ -64,7 +64,6 @@ func (q *SrvKeyspaceNamesQuery) srvKeyspaceNamesCacheStatus() (result []*SrvKeys
 ExpirationTime: entry.insertionTime.Add(q.rq.cacheTTL),
 LastQueryTime: entry.lastQueryTime,
 LastError: entry.lastError,
-LastErrorCtx: entry.lastErrorCtx,
@deepthi (Member) commented:
This looks right to me. This field is never used anywhere, so it's unclear why we even have it.

@deepthi (Member) commented Sep 2, 2022

LGTM

Let's wait for @mattlord to review before we merge this.

Remove unused context

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
@mattlord (Contributor) left a comment
The changes make sense to me; I only had one question about the inner watchCancel() usage.

I also had one minor request about the test, as flakiness is a persistent problem in CI.

Thank you!

require.NoError(t, err, "UpdateSrvKeyspace(test_cell, test_ks, %s) failed", want)

// Wait a bit to give the watcher enough time to update the value.
time.Sleep(10 * time.Millisecond)
@mattlord (Contributor) commented:
This will likely introduce some level of flakiness. I think that we should use something like this:

func waitForSrvKeyspaceResult(t *testing.T, rs *ResilientServer, cell string, keyspace string, want *topodatapb.SrvKeyspace) {
	// Allow for several attempts to deal with any ephemeral issues
	// and prevent flaky tests.
	timeout := *srvTopoCacheRefresh * 5
	tick := *srvTopoCacheRefresh
	tmr := time.NewTimer(timeout)
	defer tmr.Stop()
	var got *topodatapb.SrvKeyspace
	var err error
	ctx := context.Background()

	for {
		select {
		case <-tmr.C:
			t.Fatalf("Did not get expected GetSrvKeyspace() result of %+v before the timeout of %s. Last seen value: %+v",
				want, timeout, got)
		default:
			// Assign to the outer got/err so the timeout message above can
			// report the last seen value.
			got, err = rs.GetSrvKeyspace(ctx, cell, keyspace)
			if err != nil {
				t.Fatalf("GetSrvKeyspace() call had unexpected error: %v", err)
			}
			if proto.Equal(want, got) {
				return
			}
		}
		time.Sleep(tick)
	}
}

return nil, nil, vterrors.Errorf(vtrpc.Code_INVALID_ARGUMENT, "Watch failed")
}

// Create the notifications channel, send updates to it.
notifications := make(chan *topo.WatchData, 10)
go func() {
	defer close(notifications)
	defer watchCancel()
@mattlord (Contributor) commented Sep 6, 2022
Curious why we removed this line but call watchCancel() before outerCancel() in other places, like line 192?

@dbussink (Contributor, Author) replied:

@mattlord So this one should be outerCancel(). This is because the function value is bound when the defer is set up. That means that when we replace watchCancel() later on, this defer would not be updated and we'd never call the new watchCancel().

That's also why I added back the outerCancel() here, since it makes this easier to reason about. On line 192, the call is there because the linter otherwise complains about a leaked context (which in practice doesn't happen, since the enclosing one is cancelled), but it seemed easiest to keep it.

That's also the logic we had around these two contexts before this refactoring.
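For reference, the binding behaviour in question is just how defer works in Go; a tiny standalone illustration:

package main

import "fmt"

func main() {
	cancel := func() { fmt.Println("old cancel") }
	// The function value of cancel is captured here, at the point of the defer.
	defer cancel()

	// Rebinding the variable later does not change the already-deferred call.
	cancel = func() { fmt.Println("new cancel") }
	cancel() // prints "new cancel"
} // at return, the deferred call still prints "old cancel"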

@mattlord mattlord self-requested a review September 6, 2022 13:55
@mattlord (Contributor) left a comment

We discussed using a waitForSrvKeyspaceResult() function, but this case is not as straightforward as the typical waitForX case. The new test modifications also line up pretty well with the rest of the test, so we can put off any potential de-flaking until the test actually turns out to be flaky. 🙂

For any future reference, I think that something like this may be what we'd want to use if we do need to start adding waits with a timeout:

func waitForSrvKeyspaceResult(t *testing.T, rs *ResilientServer, cell string, keyspace string, want *topodatapb.SrvKeyspace) {
	// Allow for several attempts to deal with any ephemeral issues
	// and prevent flaky tests.
	timeout := *srvTopoCacheRefresh * 5
	tick := *srvTopoCacheRefresh
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	var got *topodatapb.SrvKeyspace
	var err error
	ctx := context.Background()

	for {
		select {
		case <-timer.C:
			t.Fatalf("Did not get expected GetSrvKeyspace() result of %+v before the timeout of %s. Last seen value: %+v",
				want, timeout, got)
		default:
			status := rs.CacheStatus()
			for _, sk := range status.SrvKeyspaces {
				if sk.Cell == cell && sk.Keyspace == keyspace && sk.LastError != nil {
					t.Fatalf("SrvKeyspace cache entry for cell %s and keyspace %s recorded the following error, which should not happen: %v",
						cell, keyspace, sk.LastError)
				}
			}
			// Assign to the outer got/err so the timeout message above can
			// report the last seen value.
			got, err = rs.GetSrvKeyspace(ctx, cell, keyspace)
			if err != nil {
				t.Fatalf("GetSrvKeyspace() call had unexpected error: %v", err)
			}
			if proto.Equal(want, got) {
				return
			}
		}
		time.Sleep(tick)
	}
}

@mattlord mattlord merged commit f95f652 into vitessio:main Sep 6, 2022
@dbussink dbussink deleted the fix-topo-watch-cancellation branch September 6, 2022 14:56
@rsajwani (Contributor) left a comment

LGTM

notfelineit pushed a commit to planetscale/vitess that referenced this pull request Sep 21, 2022
…sio#1025)

* Revert "Revert WatchRecursive Topo Feature (vitessio#1023)"

This reverts commit ba735af.

* Fix problematic watch cancellation due to context cancellation (vitessio#11170)

* Fix problematic watch cancellation due to context cancellation

Right now we pass in the context when starting a Watch that is also
used for the request context. This means that the Watch ends up being
cancelled when the original request that started it as a side effects
ends up completing and cancels the context to clean up.

This is of course not as intended. Before the refactor in
vitessio#10906 this wasn't causing a
practical issue yet. We'd still have the expired context internally in
the watcher and it would be passed through with updating entries, but
there were no calls that ended up validating the context expiry,
avoiding any immediate issue.

This is bound to fail though at some point if something would be
added that does care about the context. What is needed is that the
watcher we start sets up it's own context based on the background
context since it is detached from the original request that might
trigger starting the watcher as a side effect.

Additionally, it means that the tracked context for an error isn't
really useful. It would often be an already cancelled context from a
mostly unrelated request which doesn't provide useful information. Even
more so, it would keep a reference to that context so it would never be
garbage collection potentially and would keep more request data alive
than necessary.

With the fix, the context is always from the background context with a
cancel on top for that watcher. This isn't very useful either. Also we
don't use this context tracking for any error messaging or reporting
anywhere, so I believe it's better to clean up this tracking.

By cleaning up that tracking, we also avoid the need to pass down the
context in entry updates and that is all cleaned up here as well.

Lastly, a failing test is introduced that verifies the original issue.
It retrieves serving keyspace information, cancels the original request
that triggered that and then validates the watcher is still running by
updating the value again within the timeout window. This failed before
this fix as the watcher would be cancelled and the cached old value was
returned before the TTL expired.

The main problem of this bug is not an issue of correctness, but of a
serious performance degration in the vtgate. Each second we'd restart
context setup if we ever had a failure on the path triggered by regular
queries and the system would not recover from this situation and heavily
query the topo server and make things very expensive.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Improve handling of retries and timer wait

The timer here can stay around if other events fire first, so we want to
use an explicit timer to stop it immediately when we know it completes.

Additionally, because of binding issues, watchCancel() would not rebind
if we start a new inner watcher. Therefore this adds back an outer
context that we can cancel in a defer to we know for sure we cancel
things properly when stopping the watcher.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Fix leak in etc2topo tests

We never closed the `cli` instance here so it would linger until the
process completes.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Remove unused context

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>