Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable ordered responses for ADS delta watches #752

Merged
merged 17 commits into from
Oct 10, 2023

Conversation

zhaohuabing
Copy link
Member

@zhaohuabing zhaohuabing commented Aug 14, 2023

Fix #705

Why we need this?

go-control-plane doesn't guarantee the orders of responses for delta ADS. If there are two requests, the response to the second request may be sent before the first request. This causes problems for us since our application generates xDS resources for clusters and listeners in sequence. Sometimes, the the listener resources are sent to Envoy before the cluster resources. As a result, Envoy complains that it can't find referenced clusters in the Listener configuration and fail to apply the configuration.

How this PR fix this issue?

This PR follow an approach similar to #544 to enable ordered responses for ADS delta watches.

  • Sort the watches to ensure that the responses are handled in the correct order(Endpoint, Cluster, Route, Listener ...)
  • Send all the responses to the shared deltaMuxedResponses channel to guarantee that the orders won't be changed
  • Since responses and requests are handled in a single go routine, to avoid deadlocks, the response channel is purged before processing a request, and the deltaMuxedResponses channel is created with a buffer size of 2x the number of types.

@zhaohuabing zhaohuabing force-pushed the ordered-delta-watch branch 2 times, most recently from 5c21349 to ef293f8 Compare August 14, 2023 10:44
@zhaohuabing zhaohuabing marked this pull request as draft August 14, 2023 10:49
@zhaohuabing zhaohuabing changed the title Enable ordered responses for delta watches Enable ordered responses for ADS delta watches Aug 14, 2023
@zhaohuabing zhaohuabing marked this pull request as ready for review August 15, 2023 07:13
@zhaohuabing zhaohuabing marked this pull request as draft August 15, 2023 07:13
@zhaohuabing zhaohuabing force-pushed the ordered-delta-watch branch 4 times, most recently from 44d23ba to 487dc31 Compare August 15, 2023 11:25
@zhaohuabing zhaohuabing marked this pull request as ready for review August 15, 2023 11:32
@zhaohuabing zhaohuabing marked this pull request as draft August 16, 2023 04:22
@zhaohuabing zhaohuabing force-pushed the ordered-delta-watch branch 3 times, most recently from d66fefa to 6adf01b Compare August 16, 2023 04:54
@zhaohuabing zhaohuabing marked this pull request as ready for review August 16, 2023 05:20
@arkodg
Copy link
Contributor

arkodg commented Aug 17, 2023

ptal @valerian-roche @alecholmez

Copy link
Contributor

@valerian-roche valerian-roche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only an initial review of the PR. I need to spend more time on the impact of this new shared channel.
My initial analysis is that there is a potential deadlock case as a watch can have queued a response prior to being cancelled. In some edge cases when the client has sent a lot of requests in a short sequence, I believe this channel could end up overflowing the muxed one and then itself. The previous model guarantees that the muxed channel being filled does not block other processes (as they each dequeue within their own goroutine), but here it is no longer the case

pkg/server/delta/v3/server.go Outdated Show resolved Hide resolved
pkg/server/delta/v3/server.go Outdated Show resolved Hide resolved
pkg/server/delta/v3/watches.go Outdated Show resolved Hide resolved
pkg/cache/v3/simple.go Show resolved Hide resolved
pkg/server/delta/v3/server.go Outdated Show resolved Hide resolved
pkg/server/delta/v3/server.go Outdated Show resolved Hide resolved
@zhaohuabing zhaohuabing force-pushed the ordered-delta-watch branch 2 times, most recently from f45c59c to 52998ad Compare August 21, 2023 02:58
@zhaohuabing zhaohuabing force-pushed the ordered-delta-watch branch 2 times, most recently from 3895960 to 1fd388b Compare August 22, 2023 04:01
watch.responses = make(chan cache.DeltaResponse, 1)
if ordered {
// Use the shared channel to keep the order of responses.
watch.UseSharedResponseChan(watches.deltaMuxedResponses)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I globally like the state of this PR, but I think that there is an issue with how the channel is buffered.
This channel will take potentially multiple responses for each type (e.g. if a cache update triggers while a new request is retrieved and tries to enqueue in the same goroutine). As the channel is only buffered for the nb of types this will deadlock
The issue is that theoretically there could be more than 2 responses per type, as the request/response channels are in a select (and therefore are not ordered). In this case even buffering for two times would not guarantee to not deadlock
I think this can be solved if it is guaranteed that enqueued responses would be processed prior to new requests, but I'm not sure that'd be simple
Another possibility is to, like for the sotw, purge the response queue prior to processing a request. In this case I believe buffering with 2x would be enough to guarantee there is no deadlock. I'm honestly not a huge fan as I think it makes the state machine more complex, but it might be okay
@alecholmez do you think this would cover this issue?

Copy link
Member Author

@zhaohuabing zhaohuabing Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valerian-roche If we handle the responses and requests in two different go routines, will that solve the deadlock problem?

Copy link
Member Author

@zhaohuabing zhaohuabing Sep 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valerian-roche I tried to follow your suggestion to guarantee that the responses ar processed prior to new requests. Could you please take another look?

pkg/server/delta/v3/server.go Outdated Show resolved Hide resolved
@zhaohuabing
Copy link
Member Author

Some tests failed because of this new change. If this approach is acceptable, I'll go on and fix them.

Signed-off-by: huabing zhao <zhaohuabing@gmail.com>
Copy link
Contributor

@valerian-roche valerian-roche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks a lot for your patience on this!

Were you able to test this on a running environments? I might be able to test it on some of our test systems next week if you cannot easily run it

Can you expand in the description describing a bit the implementation and the reasoning behind if people need to take a look later on?

@zhaohuabing
Copy link
Member Author

LGTM, thanks a lot for your patience on this!

Were you able to test this on a running environments? I might be able to test it on some of our test systems next week if you cannot easily run it

Can you expand in the description describing a bit the implementation and the reasoning behind if people need to take a look later on?

@valerian-roche Thanks. I can test it since I have already encountered this issue in my local development environment. But it may take a few days since I'll be attending KubeCon CN next week and I'm preparing on my talk.

Description updated with reasons and hows.

@zhaohuabing
Copy link
Member Author

@valerian-roche Tested in my local development environment.

@arkodg
Copy link
Contributor

arkodg commented Oct 9, 2023

thanks @zhaohuabing for fixing this, can we get this merged ? Envoy Gateway (based on delta xDS) needs this

@valerian-roche valerian-roche merged commit b652489 into envoyproxy:main Oct 10, 2023
3 checks passed
arkodg added a commit to arkodg/gateway that referenced this pull request Oct 10, 2023
This import brings in envoyproxy/go-control-plane#752
which should ensure resources pushed via delta ADS follow a specific
order to ensure the traffic chain doesnt transiently break

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
@zhaohuabing zhaohuabing deleted the ordered-delta-watch branch October 10, 2023 23:30
arkodg added a commit to envoyproxy/gateway that referenced this pull request Oct 10, 2023
* Bring in go-control-plane fixes

This import brings in envoyproxy/go-control-plane#752
which should ensure resources pushed via delta ADS follow a specific
order to ensure the traffic chain doesnt transiently break

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* rerun go generate

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Order resources for guaranteed eventual consistency in Delta + ADS mode
3 participants