
Fix problematic watch cancellation due to context cancellation #11170

Merged
merged 4 commits into vitessio:main from fix-topo-watch-cancellation on Sep 6, 2022

Conversation

@dbussink (Contributor) commented Sep 2, 2022

Right now we pass in the request's context when starting a Watch. This means that the Watch ends up being cancelled when the original request that started it as a side effect completes and cancels its context to clean up.

This is of course not as intended. Before the refactor in #10906 this wasn't causing a practical issue yet: we'd still have the expired context internally in the watcher and it would be passed along when updating entries, but no calls ended up validating the context expiry, so there was no immediate issue.

This is bound to fail at some point though, if something is added that does care about the context. What is needed is for the watcher we start to set up its own context based on the background context, since its lifetime is detached from the original request that might have triggered starting it as a side effect.
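As a sketch of the intended pattern (startWatch and runWatch here are illustrative stand-ins, not the actual srvtopo code):

// Sketch only: detach the watcher's lifetime from the request that triggered it.
func startWatch(requestCtx context.Context) {
	// Wrong: the watch stops as soon as the triggering request completes,
	// because the request's context is cancelled during cleanup.
	//   go runWatch(requestCtx)

	// Intended: derive the watcher's context from the background context so it
	// survives the request; the cancel func is what stops the watcher later.
	watchCtx, watchCancel := context.WithCancel(context.Background())
	_ = watchCancel // kept by the watcher for an explicit shutdown
	go runWatch(watchCtx)
}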

Additionally, this means that the context we track for an error isn't really useful. It would often be an already cancelled context from a mostly unrelated request, which doesn't provide useful information. Worse, tracking it keeps a reference to that context, so it potentially can never be garbage collected and keeps more request data alive than necessary.

With the fix, the tracked context would always be the background context with a cancel on top for that watcher, which isn't very useful either. Since we don't use this context tracking for any error messaging or reporting anywhere, I believe it's better to remove the tracking entirely.

By removing that tracking, we also avoid the need to pass the context down in entry updates; that is all cleaned up here as well.

Lastly, a test is introduced that verifies the original issue. It retrieves serving keyspace information, cancels the original request that triggered the watch, and then validates that the watcher is still running by updating the value again within the timeout window. This failed before the fix, as the watcher would be cancelled and the cached old value would be returned until the TTL expired.
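Roughly, the flow the test exercises looks like the sketch below (an illustrative outline only, not the test code added in this PR; the function name is made up and the exact vitess signatures may differ):

func testWatcherSurvivesRequestCancellation(t *testing.T, ts *topo.Server, rs *srvtopo.ResilientServer) {
	// A request with its own context triggers the watcher as a side effect.
	reqCtx, reqCancel := context.WithCancel(context.Background())
	_, err := rs.GetSrvKeyspace(reqCtx, "test_cell", "test_ks")
	require.NoError(t, err)

	// The request completes and its context is cancelled.
	reqCancel()

	// Update the value in the topo server; a still-running watcher should pick
	// it up well before the cache TTL expires.
	want := &topodatapb.SrvKeyspace{} // some value different from the initial one
	err = ts.UpdateSrvKeyspace(context.Background(), "test_cell", "test_ks", want)
	require.NoError(t, err)
	time.Sleep(10 * time.Millisecond)

	got, err := rs.GetSrvKeyspace(context.Background(), "test_cell", "test_ks")
	require.NoError(t, err)
	assert.True(t, proto.Equal(want, got), "watcher was cancelled together with the request")
}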

The main problem with this bug is not correctness, but a serious performance degradation in the vtgate. If we ever had a failure on the path triggered by regular queries, we'd redo the watch setup every second; the system would not recover from this situation and would heavily query the topo server, making things very expensive.

Related Issue(s)

Follow up for #10906

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Fix problematic watch cancellation due to context cancellation

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Improve handling of retries and timer wait

The timer here can stay around if other events fire first, so we want to
use an explicit timer that we stop immediately when we know it's done.

Additionally, because of binding issues, the deferred watchCancel() would
not rebind if we start a new inner watcher. Therefore this adds back an
outer context that we can cancel in a defer, so we know for sure we cancel
things properly when stopping the watcher.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
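In sketch form, the difference the commit describes (interval and notifications are placeholders, not the actual watcher variables):

timer := time.NewTimer(interval)
select {
case <-notifications:
	// Another event fired first: stop the timer right away instead of leaving
	// it to linger until it expires, which a bare time.After would do.
	timer.Stop()
case <-timer.C:
	// The wait interval elapsed.
}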
Fix leak in etc2topo tests

We never closed the `cli` instance here, so it would linger until the
process completed.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
@vitess-bot (bot) commented Sep 2, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive. Additionally, flag names should use dashes (-) as word separators rather than underscores (_).
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@dbussink dbussink requested a review from deepthi as a code owner September 2, 2022 11:58
@dbussink dbussink requested a review from mattlord September 2, 2022 11:58
-*srvTopoCacheTTL = time.Duration(100 * time.Millisecond)
-*srvTopoCacheRefresh = time.Duration(40 * time.Millisecond)
+*srvTopoCacheTTL = time.Duration(200 * time.Millisecond)
+*srvTopoCacheRefresh = time.Duration(80 * time.Millisecond)
@dbussink (Contributor, Author) commented:
Changed the timing to have some extra room here, so it's possible to verify more reliably below that things work as expected within this timeout window.

@deepthi (Member) left a comment

LGTM

@@ -64,7 +64,6 @@ func (q *SrvKeyspaceNamesQuery) srvKeyspaceNamesCacheStatus() (result []*SrvKeys
 ExpirationTime: entry.insertionTime.Add(q.rq.cacheTTL),
 LastQueryTime: entry.lastQueryTime,
 LastError: entry.lastError,
-LastErrorCtx: entry.lastErrorCtx,
@deepthi (Member) commented:
This looks right to me. This field is never used anywhere, so it's unclear why we even have it.

@deepthi (Member) commented Sep 2, 2022

LGTM

Let's wait for @mattlord to review before we merge this.

Remove unused context

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
@mattlord (Contributor) left a comment
The changes make sense to me; I only had one question about the inner watchCancel() usage.

I also had one minor request about the test, as flakiness is a persistent problem in CI.

Thank you!

require.NoError(t, err, "UpdateSrvKeyspace(test_cell, test_ks, %s) failed", want)

// Wait a bit to give the watcher enough time to update the value.
time.Sleep(10 * time.Millisecond)
@mattlord (Contributor) commented:
This will likely introduce some level of flakiness. I think that we should use something like this:

func waitForSrvKeyspaceResult(t *testing.T, rs *ResilientServer, cell string, keyspace string, want *topodatapb.SrvKeyspace) {
	// Allow for several attempts to deal with any ephemeral issues
	// and prevent flaky tests.
	timeout := *srvTopoCacheRefresh * 5
	tick := *srvTopoCacheRefresh
	tmr := time.NewTimer(timeout)
	defer tmr.Stop()
	var got *topodatapb.SrvKeyspace
	var err error
	ctx := context.Background()

	for {
		select {
		case <-tmr.C:
			t.Fatalf("Did not get expected GetSrvKeyspace() result of %+v before the timeout of %s. Last seen value: %+v",
				want, timeout, got)
		default:
			// Assign to the outer got/err so the timeout message above can
			// report the last seen value.
			got, err = rs.GetSrvKeyspace(ctx, cell, keyspace)
			if err != nil {
				t.Fatalf("GetSrvKeyspace() call had unexpected error: %v", err)
			}
			if proto.Equal(want, got) {
				return
			}
		}
		time.Sleep(tick)
	}
}

return nil, nil, vterrors.Errorf(vtrpc.Code_INVALID_ARGUMENT, "Watch failed")
}

// Create the notifications channel, send updates to it.
notifications := make(chan *topo.WatchData, 10)
go func() {
	defer close(notifications)
	defer watchCancel()
@mattlord (Contributor) commented Sep 6, 2022
Curious why we removed this line but call watchCancel() before outerCancel() in other places, like line 192?

@dbussink (Contributor, Author) replied:

@mattlord So this one should be outerCancel(). This is because the function value is bound when the defer is set up. That means that when we replace watchCancel() later on, this defer would not be updated and we'd never call the new watchCancel().

That's also why I added back the outerCancel() here, since it makes this easier to reason about. On line 192, the call is there because the linter otherwise complains about a leaked context (which in practice doesn't happen, since the enclosing one is cancelled), but it seemed easiest to keep it.

That's also the logic we had around these two contexts before this refactoring.
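For reference, the binding behaviour in question is just how defer works in Go; a tiny standalone illustration:

package main

import "fmt"

func main() {
	cancel := func() { fmt.Println("old cancel") }
	// The function value of cancel is captured here, at the point of the defer.
	defer cancel()

	// Rebinding the variable later does not change the already-deferred call.
	cancel = func() { fmt.Println("new cancel") }
	cancel() // prints "new cancel"
} // at return, the deferred call still prints "old cancel"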

@mattlord mattlord self-requested a review September 6, 2022 13:55
@mattlord (Contributor) left a comment

We discussed using a waitForSrvKeyspaceResult() function, but this case is not as straightforward as the typical waitForX case. The new test modifications also line up pretty well with the rest of the test, so we can put off any potential de-flaking until the test actually turns out to be flaky. 🙂

For any future reference, I think that something like this may be what we'd want to use if we do need to start adding waits with a timeout:

func waitForSrvKeyspaceResult(t *testing.T, rs *ResilientServer, cell string, keyspace string, want *topodatapb.SrvKeyspace) {
	// Allow for several attempts to deal with any ephemeral issues
	// and prevent flaky tests.
	timeout := *srvTopoCacheRefresh * 5
	tick := *srvTopoCacheRefresh
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	var got *topodatapb.SrvKeyspace
	var err error
	ctx := context.Background()

	for {
		select {
		case <-timer.C:
			t.Fatalf("Did not get expected GetSrvKeyspace() result of %+v before the timeout of %s. Last seen value: %+v",
				want, timeout, got)
		default:
			status := rs.CacheStatus()
			for _, sk := range status.SrvKeyspaces {
				if sk.Cell == cell && sk.Keyspace == keyspace && sk.LastError != nil {
					t.Fatalf("SrvKeyspace cache entry for cell %s and keyspace %s recorded the following error, which should not happen: %v",
						cell, keyspace, sk.LastError)
				}
			}
			// Assign to the outer got/err so the timeout message above can
			// report the last seen value.
			got, err = rs.GetSrvKeyspace(ctx, cell, keyspace)
			if err != nil {
				t.Fatalf("GetSrvKeyspace() call had unexpected error: %v", err)
			}
			if proto.Equal(want, got) {
				return
			}
		}
		time.Sleep(tick)
	}
}

@mattlord mattlord merged commit f95f652 into vitessio:main Sep 6, 2022
@dbussink dbussink deleted the fix-topo-watch-cancellation branch September 6, 2022 14:56
@rsajwani (Contributor) left a comment

LGTM

notfelineit pushed a commit to planetscale/vitess that referenced this pull request Sep 21, 2022
…sio#1025)

* Revert "Revert WatchRecursive Topo Feature (vitessio#1023)"

This reverts commit ba735af.

* Fix problematic watch cancellation due to context cancellation (vitessio#11170)

* Fix problematic watch cancellation due to context cancellation

Right now we pass in the context when starting a Watch that is also
used for the request context. This means that the Watch ends up being
cancelled when the original request that started it as a side effects
ends up completing and cancels the context to clean up.

This is of course not as intended. Before the refactor in
vitessio#10906 this wasn't causing a
practical issue yet. We'd still have the expired context internally in
the watcher and it would be passed through with updating entries, but
there were no calls that ended up validating the context expiry,
avoiding any immediate issue.

This is bound to fail though at some point if something would be
added that does care about the context. What is needed is that the
watcher we start sets up it's own context based on the background
context since it is detached from the original request that might
trigger starting the watcher as a side effect.

Additionally, it means that the tracked context for an error isn't
really useful. It would often be an already cancelled context from a
mostly unrelated request which doesn't provide useful information. Even
more so, it would keep a reference to that context so it would never be
garbage collection potentially and would keep more request data alive
than necessary.

With the fix, the context is always from the background context with a
cancel on top for that watcher. This isn't very useful either. Also we
don't use this context tracking for any error messaging or reporting
anywhere, so I believe it's better to clean up this tracking.

By cleaning up that tracking, we also avoid the need to pass down the
context in entry updates and that is all cleaned up here as well.

Lastly, a failing test is introduced that verifies the original issue.
It retrieves serving keyspace information, cancels the original request
that triggered that and then validates the watcher is still running by
updating the value again within the timeout window. This failed before
this fix as the watcher would be cancelled and the cached old value was
returned before the TTL expired.

The main problem of this bug is not an issue of correctness, but of a
serious performance degration in the vtgate. Each second we'd restart
context setup if we ever had a failure on the path triggered by regular
queries and the system would not recover from this situation and heavily
query the topo server and make things very expensive.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Improve handling of retries and timer wait

The timer here can stay around if other events fire first, so we want to
use an explicit timer to stop it immediately when we know it completes.

Additionally, because of binding issues, watchCancel() would not rebind
if we start a new inner watcher. Therefore this adds back an outer
context that we can cancel in a defer to we know for sure we cancel
things properly when stopping the watcher.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Fix leak in etc2topo tests

We never closed the `cli` instance here so it would linger until the
process completes.

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

* Remove unused context

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>