Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for the SDS update failure #615

Merged
merged 1 commit into from
Dec 19, 2022

Conversation

AmitKatyal-Sophos
Copy link
Contributor

@AmitKatyal-Sophos AmitKatyal-Sophos commented Dec 12, 2022

Summary

This PR includes the changes for the SDS update failure

Problem

We started using Delta ADS for LDS, CDS, RDS and SDS resources. We have three instances of envoy proxy client connected to the management server. During testing, we found that some of the envoy proxy's were not receiving the SDS updates.

Root cause analysis

We investigated the issue by adding logs to the go-control-plane and figured out that control plane is managing the list of the subscribed resources for each client and before sending the updates it compares the hash of the resource set in the snapshot with the hash of the same resource in the subscription list. If the specified resource is not specified in the client's subscription list, then updates are not sent to the respective client. And we found that somehow the entry from the subscription list was getting removed and as a result the updates were not getting pushed to the envoy proxy.

Why resources were getting removed from the subscription list ?
processDelta handler responsible for handling the client requests and responses updates the "watch.state.resourceVersions" at two places,

[1] As part of the subscription request from the envoy "s.subscribe(req.GetResourceNamesSubscribe(), watch.state.GetResourceVersions())"

[2] After sending the response to the envoy proxy "watch.state.SetResourceVersions(resp.GetNextVersionMap())"

Here, the problem is in the update of the resourceversion map after sending the response. After sending the response, we are directly replacing the resourceversion map with the map specified in the response. This is problem for the non wildcard resource type.

Please find below the detailed use-case to understand why replacing the resourceversion map inside the response handling is problematic.

Envoy requests for the resource (resource-1)

In the request handler, stream state resourceversion map is updated to [resource-1: empty hash] & response-1 is created with newresouceversion map [resource-1: hash]

Before the response is sent to the envoy, envoy sends the new request for the resource (resource-2). In the request handler, stream state resourceversion map is updated to [resource-1: empty hash, resource-2 empty hash] and response-2 is created with newresouceversion map [resource-1: hash, resource-2: hash].

Response-1 is sent to the envoy proxy and stream state version map is updated with the newresouceversion map [resource-1: hash]. After sending the response-1, the stream state resourceversion map will be updated to [resource-1: hash]

Envoy sends another request for the resource-3. In the request handler, stream state resourceversion map is updated to [resource-1: hash, resource-3 empty hash] and response-3 is created with newresouceversion map [resource-1: hash, resource-3: hash].

Response-2 is sent to the envoy proxy and stream state version map is updated with the newresouceversion map [resource-1: hash, resource-2: hash]. After sending the response-2, the stream state resourceversion map will be [resource-1: hash, resource-2: hash]

Response-3 is sent to the envoy proxy and stream state version map is updated with the newresouceversion map [resource-1: hash, resource-3: hash]. After sending the response-3, the stream state resourceversion map will be [resource-1: hash, resource-3: hash]

Final resourceversion map will be [resource-1: hash, resource-3: hash]. This is incorrect as we lost the subscription to the resource-2. Hence, any config updates to the resource-2 will not be sent to the envoy proxy.

@AmitKatyal1980 AmitKatyal1980 force-pushed the bugfix/SDSUpdateFailure branch from 60e490f to 6089e8e Compare December 12, 2022 16:47
@AmitKatyal-Sophos
Copy link
Contributor Author

Can someone please review the changes ? Thanks.

@alecholmez alecholmez self-requested a review December 15, 2022 21:40
@alecholmez alecholmez self-assigned this Dec 15, 2022
Signed-off-by: Amit Katyal <amit.katyal@sophos.com>
@AmitKatyal1980 AmitKatyal1980 force-pushed the bugfix/SDSUpdateFailure branch from 6089e8e to cee46b4 Compare December 16, 2022 01:11
@alecholmez alecholmez merged commit 0e0f25d into envoyproxy:main Dec 19, 2022
valerian-roche added a commit to valerian-roche/go-control-plane that referenced this pull request Mar 13, 2023
This reverts commit 0e0f25d.

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
valerian-roche added a commit to valerian-roche/go-control-plane that referenced this pull request Mar 14, 2023
This reverts commit 0e0f25d.

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
alecholmez pushed a commit that referenced this pull request Mar 20, 2023
… top of #559 (#657)

* Set safe directory in CI script to fix build issue related to VCS stamping in binaries

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>

* Revert "Fix for the SDS update failure (#615)"

This reverts commit 0e0f25d.

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants