
Deadlock in Delta XDS if number of resources is more than 10 #875

Closed
lobkovilya opened this issue Feb 7, 2024 · 13 comments

@lobkovilya

lobkovilya commented Feb 7, 2024

In Kuma we're implementing the ResourceSnapshot interface and our snapshot has more than 20 distinct resource types.

After upgrading to v0.12.0 we see deadlocks on the server, and they seem to be caused by #752. AFAICT the potential deadlock was mitigated by increasing the channel capacity, but that doesn't help us, since our snapshot has more resource types than the channel can hold.

I'm looking for help or advice on the best way to fix this. It doesn't seem right that the server's correct operation depends on the number of resource types in the snapshot.


github-actions bot commented Mar 8, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Mar 8, 2024
@slonka
Member

slonka commented Mar 8, 2024

There is a PR open so I wouldn't call it stale.

@lobkovilya
Author

Some additional context.

Why did the deadlock appear after #752?

The PR replaced the response-channel-per-watch model with a single response channel shared by all watches. This was important to support ADS ordering, but plain xDS was affected by the change as well.

Why is there no deadlock in SotW?

SotW has two separate implementations, one for xDS and one for ADS. ADS uses a single response channel shared by all watches, while xDS still has a response channel per watch.
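
To make the difference concrete, here is a minimal sketch of the two channel layouts (the type and field names are stand-ins, not the actual go-control-plane structures):

```go
package sketch

// DeltaResponse is a stand-in for the cache's delta response type.
type DeltaResponse interface{}

// Per-watch model (old delta, SotW xDS): each watch owns its own buffered
// channel, so a response for one type is never queued behind another type's
// unconsumed responses.
type perWatch struct {
	response chan DeltaResponse // capacity 1: at most one pending response per watch
}

// Shared model (delta after #752, SotW ADS): every watch funnels into one
// channel with a fixed capacity, so the producer (e.g. SetSnapshot) blocks as
// soon as more responses are queued than the buffer can hold.
type sharedMux struct {
	responses chan DeltaResponse // fixed capacity shared by all watches
}
```

In the shared model the producer and the consumer are coupled through a single fixed-size buffer, which is exactly why the channel capacity becomes load-bearing.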

How can the deadlock be solved in Delta?

  1. Introduce a proper xDS implementation, so that the Delta server has both xds.go and ads.go, similar to the SotW server.
  2. Let the capacity of the response channel be configurable (see my PR Option for the delta server to override the channel capacity and avoid deadlocks #876).
  3. Build some kind of "unbounded channel" abstraction where the writer never blocks and the internal buffer is grown and shrunk dynamically (I know it's a slightly crazy idea 😅; a rough sketch follows after this list).
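
To show option 3 isn't completely unreasonable, here is a rough sketch of such an abstraction (nothing here exists in go-control-plane, all names are made up):

```go
// Rough sketch of option 3: an "unbounded channel" whose writer never blocks.
package unbounded

import "sync"

// Queue buffers items in a slice that grows as needed and exposes a signal
// channel that can be used as a case in the server's select statement.
type Queue[T any] struct {
	mu     sync.Mutex
	items  []T           // dynamically grown (and shrunk on drain) buffer
	signal chan struct{} // wakes the reader when items are available
}

func New[T any]() *Queue[T] {
	return &Queue[T]{signal: make(chan struct{}, 1)}
}

// Push never blocks; it only appends to the internal buffer.
func (q *Queue[T]) Push(v T) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.mu.Unlock()
	select {
	case q.signal <- struct{}{}:
	default: // a wake-up is already pending
	}
}

// Ready returns the channel to include in a select; it fires when at least
// one item may be waiting.
func (q *Queue[T]) Ready() <-chan struct{} { return q.signal }

// Pop removes and returns one item if present.
func (q *Queue[T]) Pop() (T, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	var zero T
	if len(q.items) == 0 {
		return zero, false
	}
	v := q.items[0]
	q.items = q.items[1:]
	if len(q.items) > 0 {
		select {
		case q.signal <- struct{}{}: // keep the reader draining
		default:
		}
	}
	return v, true
}
```

The server loop would then select on Ready() and call Pop() until it returns false, so the writer (e.g. SetSnapshot) never blocks regardless of how many resource types the snapshot contains.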

There are probably other options I haven't thought of. @valerian-roche @jpeach @alecholmez please take a look; I'd appreciate any input here, and I'd be happy to contribute the solution.

@valerian-roche
Contributor

Hey, sorry for the delay, I have not had much capacity to work on this recently.
In my opinion the first proposal is not going in the right direction. The SotW server implementation without ADS ordering will also go away, as it is quite inefficient in some respects and requires additional maintenance effort.
The second one, as discussed, would in my opinion push dangerous complexity back onto the user. The risk of deadlock could be very hard to assess if, for instance, the capacity is off by one.
My main proposal was to reimplement part of the previous model, only for types not explicitly defined in the xDS protocol as requiring ordering. This would allow the "unbounded behavior" of the previous implementation for users specifically using non-classic xDS resources, while keeping ordering, and a lower goroutine footprint, for users only needing the basic types.

I plan on spending some time on the control-plane in Q2, though likely not in the coming weeks. If you want to take a stab at an implementation I can review it.

@lobkovilya
Author

Thanks for the reply @valerian-roche!

My main proposal was to reimplement part of the previous model, only for types not explicitly defined in the xDS protocol as requiring ordering. This would allow the "unbounded behavior" of the previous implementation for users specifically using non-classic xDS resources, while keeping ordering, and a lower goroutine footprint, for users only needing the basic types.

Just checking my understanding: does it mean the select statement in Delta has to be re-implemented with reflect.Select, the way it is in SotW?

index, value, ok := reflect.Select(sw.watches.cases)

So that each "not explicitly defined" type adds a SelectCase with its own watch.response channel?
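
If that reading is right, a minimal standalone sketch of the idea could look like this (the types, channel names, and type URLs are only illustrative, not the real server internals):

```go
// Hypothetical sketch of a dynamic select over per-watch response channels
// built with reflect.Select.
package main

import (
	"fmt"
	"reflect"
)

type deltaRequest struct{ typeURL string }
type deltaResponse struct{ typeURL string }

func main() {
	reqCh := make(chan deltaRequest, 1)
	// One response channel per watch on a "not explicitly defined" type.
	watchChans := map[string]chan deltaResponse{
		"kuma.io/v1alpha1/Mesh":         make(chan deltaResponse, 1),
		"kuma.io/v1alpha1/TrafficRoute": make(chan deltaResponse, 1),
	}

	// Case 0 is always the request channel; each watch adds one more case.
	cases := []reflect.SelectCase{
		{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(reqCh)},
	}
	for _, ch := range watchChans {
		cases = append(cases, reflect.SelectCase{
			Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ch),
		})
	}

	// Simulate a response becoming available for one of the watches.
	watchChans["kuma.io/v1alpha1/Mesh"] <- deltaResponse{typeURL: "kuma.io/v1alpha1/Mesh"}

	index, value, ok := reflect.Select(cases)
	switch {
	case !ok:
		fmt.Println("selected channel closed, watch cancelled")
	case index == 0:
		fmt.Println("received request:", value.Interface().(deltaRequest).typeURL)
	default:
		fmt.Println("received response:", value.Interface().(deltaResponse).typeURL)
	}
}
```

Adding or removing a watch would just rebuild the cases slice before the next reflect.Select call.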

@valerian-roche
Contributor

valerian-roche commented Apr 3, 2024

Hey, I'd like to not go in this direction. The old model of SotW is fundamentally different from the model of delta (even before the ADS change), and aligning with it would require maintaining two entirely different loops.

Prior to the ADS change, delta used to fork a goroutine for each watch. I believe this can be reused when the watch's resource is not a standard xDS resource accounted for in the main channel's buffering.

This would likely require having another muxed channel for those watches, as queuing into the existing one could lead to a deadlock, but adding an entry to the select should not create an issue. Overall I expect the change to be:

  • re-add support for a channel in the watch that is closed on cancel. It will be nil in the general case, which is not an issue
  • add a new case in the select with a channel buffered to a "default" capacity (the exact value matters little in this context) which runs the send code without the unneeded "process others" step, as ordering is irrelevant here (a rough sketch follows after this list)
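
A rough sketch of what that extra case could look like (all type, channel, and function names here are hypothetical, not the actual delta server loop):

```go
// Hypothetical sketch of the proposed extra select case for non-standard types.
package sketch

type deltaRequest struct{}
type deltaResponse struct{ typeURL string }

func processLoop(
	reqCh <-chan deltaRequest, // incoming DeltaDiscoveryRequests
	orderedCh <-chan deltaResponse, // shared, ordered channel for standard xDS types
	unorderedCh <-chan deltaResponse, // new channel (buffered to a "default" capacity) for non-standard types
	send func(deltaResponse) error,
) error {
	for {
		select {
		case _, ok := <-reqCh:
			if !ok {
				return nil
			}
			// ... existing request handling (CreateDeltaWatch, watch.Cancel, ...) ...
		case resp := <-orderedCh:
			// ... existing handling for standard xDS types, preserving ADS ordering ...
			if err := send(resp); err != nil {
				return err
			}
		case resp := <-unorderedCh:
			// Non-standard types: ordering is irrelevant, so send immediately
			// without the "process others" pass.
			if err := send(resp); err != nil {
				return err
			}
		}
	}
}
```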

@lobkovilya
Author

If my understanding is correct, having another muxed channel for those watches with the "default" capacity can still lead to deadlocks. Imagine two things happening simultaneously:

  1. a goroutine on the server calls SetSnapshot
  2. at the same time, the server receives a DeltaDiscoveryRequest

If the Snapshot contains more resource types than the "default" capacity, the SetSnapshot function can block when attempting to write to the new muxed channel, and while blocked it still holds the cache mutex. Meanwhile, the server receives a request from <-reqCh in its select statement and attempts to call either watch.Cancel() or s.cache.CreateDeltaWatch(...). That call also blocks because the cache mutex is already held by the first goroutine. That's exactly the deadlock we hit today.
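
Here is a minimal standalone program that reproduces that interleaving (the mutex, channels, and type names are stand-ins for the cache mutex, the muxed response channel, and the server's request channel):

```go
// Minimal reproduction of the described deadlock pattern.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var cacheMu sync.Mutex
	muxed := make(chan string, 1) // "default" capacity, smaller than the number of resource types
	reqCh := make(chan string, 1)

	reqCh <- "DeltaDiscoveryRequest" // a request arrives from the client

	// Goroutine 1: the SetSnapshot-like path. It holds the cache mutex while
	// pushing one response per resource type into the shared channel and
	// blocks on the second send because the buffer is full.
	go func() {
		cacheMu.Lock()
		defer cacheMu.Unlock()
		for _, typeURL := range []string{"typeA", "typeB"} { // more types than capacity
			muxed <- typeURL
		}
	}()

	// Goroutine 2: the server-loop-like path. It handles the request first,
	// which needs the cache mutex (CreateDeltaWatch / watch.Cancel), so it
	// never reaches the loop that would drain the muxed channel.
	go func() {
		time.Sleep(50 * time.Millisecond) // crude way to let goroutine 1 grab the mutex first
		<-reqCh
		cacheMu.Lock() // blocks forever: goroutine 1 never releases the mutex
		defer cacheMu.Unlock()
		for resp := range muxed {
			_ = resp
		}
	}()

	time.Sleep(500 * time.Millisecond)
	fmt.Println("goroutine 1 waits for the channel to drain, goroutine 2 waits for the mutex")
}
```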

That's why I brought up reflect.Select. The ability to build the select cases dynamically eliminates the potential deadlock: regardless of the number of types in the Snapshot, SetSnapshot never blocks when each watch writes to its own channel and the server reads them via reflect.Select. So it's not really about aligning the code with SotW; I just don't see how else we can avoid the deadlock.

@lobkovilya
Author

@valerian-roche what do you think? Does it make sense to go with reflect.Select in Delta for the reasons I mentioned above?

github-actions bot commented May 24, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label May 24, 2024
@lobkovilya
Copy link
Author

Not stale, still the case

github-actions bot removed the stale label May 24, 2024
github-actions bot commented Jun 23, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Jun 23, 2024

github-actions bot commented Jul 1, 2024

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions bot closed this as not planned Jul 1, 2024
@lobkovilya
Author

Not stale, still the case
