Delta ADS with simple cache not working for ECDS #612

Closed
haoruolei opened this issue Dec 6, 2022 · 8 comments

@haoruolei

Hi envoy go-control-plane,

We recently migrated our project from SOTW ADS to Delta ADS, and we also added support for ECDS. In production we observed some abnormal behaviors, and I'm posting here to see if anyone has information or can help investigate. We have several instances with the same setup for the management server and envoy, and the management server provides the same resource in the snapshot. The only difference we noticed is the start-up time (about several minutes apart). We saw the following:

  1. In one of the instances, for ECDS, when the management server had the subscribed resource, a response was sent but a watch was never opened for the same resource again. This leads to envoy not getting updates after receiving the initial version of the resource. No errors were seen in either the envoy or the management server logs.
  2. For the other instances, we were able to see that watches were opened for ECDS after initialization, and whenever the resources were updated, responses were sent to envoy and new watches were opened. Very strangely though, at the same time point, many watches were open for ECDS (up to 100). From the control-plane code, it looks like there should only be one watch per typeUrl? The same goes for the delta responses, where several identical responses were sent at the same time point for ECDS.

We were not able to get more info from the logs, but our guess is that there may be some concurrency issue. Any help will be appreciated. Thanks.
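
For reference, the management-server side of our setup is roughly the following sketch (the node ID, resource name, and version handling are placeholders, and it assumes the map-based cache.NewSnapshot signature from recent go-control-plane releases):

package example

import (
	"context"

	core "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cache "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resource "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// setECDSSnapshot pushes one ECDS (TypedExtensionConfig) resource into the
// simple cache; delta ADS streams watching this node should then receive a
// delta response for the subscribed resource.
func setECDSSnapshot(ctx context.Context, c cache.SnapshotCache, nodeID, version string) error {
	ext := &core.TypedExtensionConfig{
		Name: "my-ecds-filter", // placeholder name; TypedConfig omitted here
	}
	snap, err := cache.NewSnapshot(version, map[resource.Type][]types.Resource{
		resource.ExtensionConfigType: {ext},
	})
	if err != nil {
		return err
	}
	return c.SetSnapshot(ctx, nodeID, snap)
}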

@AmitKatyal-Sophos
Contributor

@haoruolei It seems your issue is related to #613

@haoruolei
Author

@haoruolei It seems your issue is related to #613

Thanks @AmitKatyal-Sophos, I took a look at your issue and I think it's the same issue as mine.

@AmitKatyal-Sophos
Contributor

AmitKatyal-Sophos commented Dec 13, 2022

Could you please try out the fix and see if it solves your problem?
https://github.com/envoyproxy/go-control-plane/pull/615/files

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jan 12, 2023
@github-actions

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as not planned (stale) Jan 19, 2023
@alecholmez alecholmez reopened this Jan 31, 2023
@haoruolei
Author

haoruolei commented Mar 8, 2023

Hi, I finally got a chance to reproduce and debug this issue. By adding and analyzing logs in go-control-plane, I can confirm that the root cause is the same as #613: in short, the resourceVersions map in the server stream state was not updated correctly due to a race condition, and subscriptions to resources could be lost. However, my testing showed that the fix in #615 is problematic. Please see below:

  1. The fix (Fix for the SDS update failure #615) was based on release 0.10.3, where StreamState only had the resourceVersions map. That map functioned both as a) the list of all actively subscribed resource names and b) the up-to-date resource versions on this stream. In that case, fix #615 seemed applicable.
    However, the new release 0.11.0 includes delta: [#558] Support dynamic wildcard subscription in delta-xds #559: an explicit subscribedResourceNames map is introduced in the stream state alongside the resourceVersions map. This addition, together with the changes to the createDeltaResponse logic, addresses the race condition above (as confirmed by testing). So fix #615 is not really needed, and separating the subscription list from resourceVersions also seems a better approach than patching the resourceVersions update logic as fix #615 does.

  2. The changes in fix #615 don't cover all possible cases.

if !watch.state.IsWildcard() {
	for k, hash := range resp.GetNextVersionMap() {
		if currHash, found := watch.state.GetResourceVersions()[k]; found {
			if currHash != hash {
				watch.state.GetResourceVersions()[k] = hash
			}
		}
	}
} else {
	watch.state.SetResourceVersions(resp.GetNextVersionMap())
}

The above logic in fix #615 only accounts for cases where the version of an already-tracked resource is updated. At least two scenarios are missing based on my testing (see the small replay after this list):
a) When a response is sent for a newly subscribed resource, the version for that resource is never stored in the resourceVersions map.
b) When a resource is unsubscribed, its entry is never removed from the resourceVersions map.
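
To make the two scenarios concrete, here is a tiny standalone replay of the merge logic quoted above (identifiers are illustrative, not the actual go-control-plane types):

package main

import "fmt"

// mergeFix615 mirrors the fix-615 logic quoted above: it only updates
// versions for resources that are already present in resourceVersions.
func mergeFix615(resourceVersions, nextVersionMap map[string]string) {
	for k, hash := range nextVersionMap {
		if curr, found := resourceVersions[k]; found && curr != hash {
			resourceVersions[k] = hash
		}
	}
}

func main() {
	resourceVersions := map[string]string{"filter-a": "v1", "filter-b": "v1"}

	// Scenario a): "filter-c" was newly subscribed and returned in the
	// response, but it never makes it into resourceVersions.
	// Scenario b): "filter-b" was unsubscribed (absent from the next
	// version map), yet its stale entry is never removed.
	nextVersionMap := map[string]string{"filter-a": "v2", "filter-c": "v1"}
	mergeFix615(resourceVersions, nextVersionMap)

	fmt.Println(resourceVersions) // map[filter-a:v2 filter-b:v1]: no filter-c, stale filter-b remains
}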

Given the above analysis, I propose that we revert #615. With #559 in place, updating the resourceVersions map by simply overriding it is not an issue, because any possible info loss is already taken care of when creating the delta response.
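
In other words, with #559 the subscription list no longer piggybacks on the version map, so overriding the version map is lossless for subscriptions. A rough standalone illustration (the field names here are mine, not the actual StreamState API):

package main

import "fmt"

// streamState loosely mirrors the 0.11.0 stream state, where the
// subscription list and the returned resource versions live in two
// separate maps (field names are illustrative).
type streamState struct {
	subscribedResourceNames map[string]struct{}
	resourceVersions        map[string]string
}

func main() {
	s := streamState{
		subscribedResourceNames: map[string]struct{}{"filter-a": {}, "filter-b": {}},
		resourceVersions:        map[string]string{},
	}

	// A delta response only contained "filter-a" (say "filter-b" is not in
	// the snapshot yet). Overriding resourceVersions with the response's
	// next version map loses nothing: the subscription to "filter-b" is
	// still recorded in subscribedResourceNames, so a later snapshot that
	// adds it will still be matched against the watch.
	s.resourceVersions = map[string]string{"filter-a": "hash-1"}

	fmt.Println(len(s.subscribedResourceNames), len(s.resourceVersions)) // 2 1
}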

FYI @alecholmez @AmitKatyal-Sophos

@valerian-roche
Contributor

Hi @haoruolei

Thank you for your update. I am looking at reverting #615, as I believe it breaks the logic in some cases by leaving the version map no longer matching what was actually returned. Following #559 this workaround is no longer needed, and it is even harmful, as it no longer reflects additions and removals of resources within the version map.

You mention that you have tests that are failing with it. Can you confirm if the same tests are working on 0.11.0 or if there are other cases I need to address?

@haoruolei
Author

Hi @haoruolei

Thank you for your update. I am looking at reverting #615, as I believe it breaks the logic in some cases by leaving the version map no longer matching what was actually returned. Following #559 this workaround is no longer needed, and it is even harmful, as it no longer reflects additions and removals of resources within the version map.

You mention that you have tests that are failing with it. Can you confirm if the same tests are working on 0.11.0 or if there are other cases I need to address?

Hi @valerian-roche Thanks for taking this up. Per my testing, 0.11.0 alone works fine and I don't see any issues.
