Deadlock in Delta XDS if number of resources is more than 10 #875

lobkovilya · 2024-02-07T11:13:51Z

In Kuma we're implementing the ResourceSnapshot interface and our snapshot has more than 20 distinct resource types.

After upgrading to v0.12.0 we see deadlocks on the server and it seems like it was caused by #752. AFAICT the potential deadlock was fixed by increasing the channel capacity, but it's not fixed for us since our snapshot has more resource types.

I'm looking for help or advice on what'd be the best way to fix it? It doesn't seem right that server depends on the number of resource types in the snapshot to work properly.

The text was updated successfully, but these errors were encountered:

github-actions · 2024-03-08T12:08:18Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

slonka · 2024-03-08T12:18:09Z

There is a PR open so I wouldn't call it stale.

lobkovilya · 2024-03-29T20:29:25Z

Some additional context.

Why the deadlock appeared after #752?

The PR replaces a response channel per watch model with a response channel per all watches. This was important to support ADS ordering, but at the same time, xDS was affected by the change.

Why there is no deadlock in SotW?

SotW has 2 different implementations for xds and ads. ADS has a response channel per all watches while xDS still has a response channel per watch.

How the deadlock can be solved in Delta?

Introduce proper xDS implementation, so that the Delta server has both xds.go and ads.go similar to the SotW server.
Let the capacity of a response channel be configurable (see my PR Option for the delta server to override the channel capacity and avoid deadlocks #876)
Build some kind of an "unbounded channel" abstraction when the writer never blocks and the internal buffer is reallocated and shrunk dynamically (I know it's a tiny bit crazy idea 😅).

Probably there are some other options I haven't thought of. @valerian-roche @jpeach @alecholmez please take a look, I'd appreciate any input here, and I'd be happy to contribute the solution.

valerian-roche · 2024-04-02T00:13:21Z

Hey, sorry for the delay, I have not had much capacity to work on this recently.
In my opinion the first proposal is not going in the correct direction. The sotw server implementation without ads ordering will also go away, as it is really inefficient on some aspects, and require additional maintenance effort.
The second one as discussed would imo be pushing back dangerous complexity on the user. The risk of deadlock might be very hard to assess if an error is made by 1 for instance.
My main proposal was to reimplement a part of the previous model, only for types not explicitly defined in the xds protocol as requiring ordering. This would allow the "unbounded behavior" of the previous implementation for users specifically using non-classic xds resources, while keeping ordering, and lower footprint of goroutines, for users only needing the basic types.

I plan on spending some time on the control-plane in Q2, though likely not in the coming weeks. If you want to take a stab at an implementation I can review it.

lobkovilya · 2024-04-02T08:15:10Z

Thanks for the reply @valerian-roche!

My main proposal was to reimplement a part of the previous model, only for types not explicitly defined in the xds protocol as requiring ordering. This would allow the "unbounded behavior" of the previous implementation for users specifically using non-classic xds resources, while keeping ordering, and lower footprint of goroutines, for users only needing the basic types.

Just checking my understanding, does it mean the select statement for Delta has to be re-implemented as reflect.Select in SotW?

go-control-plane/pkg/server/sotw/v3/xds.go

Line 53 in f246847

index, value, ok := reflect.Select(sw.watches.cases)

So that each "not explicitly defined" type adds a SelectCase with its own watch.response channel?

valerian-roche · 2024-04-03T02:53:44Z

Hey, I'd like to not go in this direction. The old model of sotw is fundamentally different from the model of delta (even before the ads change), and aligning with it will require maintaining two entirely different loops.

Prior to the ads change delta used to fork a goroutine for each watch. I believe this can be reused if the resource of the watch is not a standard xds resource considered within the main channel buffering.

This would likely require having another muxed channel for those watches, as queuing in the other one could lead to deadlock, but adding an entry on the select should not create an issue. Overall I expect the change to be:

re-add support for a channel in watch to be closed on cancel. It will be nil in the general case, which is not an issue
add a new case in the select with a channel buffered with a "default" capacity (not fully mattering in this context) which will run the send code without the unneeded "process other" here as order is irrelevant

lobkovilya · 2024-04-03T17:24:56Z

If my understanding is correct, having another muxed channel for those watchers with the "default" capacity can still lead to deadlocks. Imagine two things happening simultaneously:

a goroutine on the server calls SetSnapshot
at the same time, the server receives a DeltaDiscoveryRequest

If the Snapshot contains more resource types than the "default" capacity, the SetSnapshot function might block when attempting to write to the new muxed channel, and during this block, it holds the cache mutex. At the same time, the server gets a req from <-reqCh in a select statement, it attempts to call either watch.Cancel() or s.cache.CreateDeltaWatch(...). However, this action also gets blocked because the cache mutex is already held by the first goroutine. That's how we get a deadlock today.

That's why I brought up the reflect.Select. The ability to dynamically change select cases eliminates potential deadlocks. Regardless of the number of types in the Snapshot, the SetSnapshot never blocks with reflect.Select. So it's not really about aligning the code with SotW, I just don't see how else we can avoid the deadlock.

lobkovilya · 2024-04-24T15:27:15Z

@valerian-roche what do you think? Does it make sense to go with reflect.Select in Delta because of the reasons I mentioned above?

github-actions · 2024-05-24T16:06:19Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

lobkovilya · 2024-05-24T16:13:09Z

Not stale, still the case

github-actions · 2024-06-23T20:06:37Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions · 2024-07-01T00:29:12Z

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

lobkovilya · 2024-08-14T13:42:31Z

Not stale, still the case

This was referenced Feb 7, 2024

chore(deps): downgrade go-control-plane to v0.11.2-0.20231010133108-1dfbe83bcebc kumahq/kuma#9163

Merged

Option for the delta server to override the channel capacity and avoid deadlocks #876

Closed

github-actions bot added the stale label Mar 8, 2024

github-actions bot removed the stale label Mar 8, 2024

lobkovilya mentioned this issue Mar 12, 2024

Upgrade envoyproxy/go-control-plane when deadlock is resolved kumahq/kuma#9580

Open

github-actions bot added the stale label May 24, 2024

github-actions bot removed the stale label May 24, 2024

github-actions bot added the stale label Jun 23, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 1, 2024

Icarus9913 mentioned this issue Aug 23, 2024

Update references to deprecated protobuf lib kumahq/kuma#3499

Open

michaelbeaumont mentioned this issue Sep 9, 2024

Drop the replace usage in go.mod kumahq/kuma#11335

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deadlock in Delta XDS if number of resources is more than 10 #875

Deadlock in Delta XDS if number of resources is more than 10 #875

lobkovilya commented Feb 7, 2024 •

edited

Loading

github-actions bot commented Mar 8, 2024

slonka commented Mar 8, 2024

lobkovilya commented Mar 29, 2024

valerian-roche commented Apr 2, 2024

lobkovilya commented Apr 2, 2024

valerian-roche commented Apr 3, 2024 •

edited

Loading

lobkovilya commented Apr 3, 2024

lobkovilya commented Apr 24, 2024

github-actions bot commented May 24, 2024

lobkovilya commented May 24, 2024

github-actions bot commented Jun 23, 2024

github-actions bot commented Jul 1, 2024

lobkovilya commented Aug 14, 2024

Deadlock in Delta XDS if number of resources is more than 10 #875

Deadlock in Delta XDS if number of resources is more than 10 #875

Comments

lobkovilya commented Feb 7, 2024 • edited Loading

github-actions bot commented Mar 8, 2024

slonka commented Mar 8, 2024

lobkovilya commented Mar 29, 2024

Why the deadlock appeared after #752?

Why there is no deadlock in SotW?

How the deadlock can be solved in Delta?

valerian-roche commented Apr 2, 2024

lobkovilya commented Apr 2, 2024

valerian-roche commented Apr 3, 2024 • edited Loading

lobkovilya commented Apr 3, 2024

lobkovilya commented Apr 24, 2024

github-actions bot commented May 24, 2024

lobkovilya commented May 24, 2024

github-actions bot commented Jun 23, 2024

github-actions bot commented Jul 1, 2024

lobkovilya commented Aug 14, 2024

lobkovilya commented Feb 7, 2024 •

edited

Loading

valerian-roche commented Apr 3, 2024 •

edited

Loading