When a disconnection happens on a multi-node setup, there are cases where the dp gets pushed an empty SotW (state of the world), which causes all listeners to be removed. Shortly after, a correct snapshot is pushed, which fixes the problem.
Looking at the code, we can see that we push an empty snapshot instead of clearing the snapshot cache (code ref). PS: I can't seem to find the origin of this code in the history; it seems to predate the CNCF donation.
The comment claims that the fake value will be removed from the cache on Disconnect. However, I see no evidence that this is the case: SnapshotCache.ClearSnapshot(nodeId) seems to never be called.
This is likely to cause 2 issues:
1. A growing snapshot cache (we never evict old nodeIds). The size of this leak depends on how much churn there is in the dp count.
2. The issue explained in this ticket.
More explanation on 2:
As mentioned, once we disconnect, we push an empty snapshot to the cache.
The watchdog waits an entire refreshInterval before generating a new snapshot and populating the cache.
So what I believe is happening:
1. We connect to a first node; everything works well.
2. We disconnect and push an empty snapshot to the cache.
3. We connect back, and the empty snapshot is pushed to Envoy.
4. Envoy removes all listeners.
5. The watchdog ticks and creates a correct snapshot.
6. Envoy receives the correct snapshot and adds the listeners back.
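The sequence above can be replayed with a toy model. This is only a sketch for illustration: `toyCache`, `onDisconnectPushEmpty`, `onConnect`, `watchdogTick`, the node ID, and the listener names are all hypothetical and are not taken from the actual control plane or go-control-plane code.

```go
package main

import "fmt"

// Snapshot is a toy stand-in for an xDS snapshot: just a list of listener names.
type Snapshot struct {
	Listeners []string
}

// toyCache is a hypothetical, simplified snapshot cache keyed by node ID.
type toyCache struct {
	snapshots map[string]Snapshot
}

func newToyCache() *toyCache {
	return &toyCache{snapshots: map[string]Snapshot{}}
}

// onDisconnectPushEmpty models the suspected behaviour: on disconnect an
// empty snapshot is stored instead of the entry being removed.
func (c *toyCache) onDisconnectPushEmpty(nodeID string) {
	c.snapshots[nodeID] = Snapshot{}
}

// onConnect returns whatever is currently cached for the node, which is
// what a (re)connecting proxy would be served immediately.
func (c *toyCache) onConnect(nodeID string) (Snapshot, bool) {
	s, ok := c.snapshots[nodeID]
	return s, ok
}

// watchdogTick models the periodic refresh that stores a correct snapshot.
func (c *toyCache) watchdogTick(nodeID string, listeners []string) {
	c.snapshots[nodeID] = Snapshot{Listeners: listeners}
}

// replay walks through the six steps and records how many listeners the
// proxy sees after each connection or refresh.
func replay() []int {
	const node = "dp-1"
	want := []string{"inbound:80", "outbound:8080"}
	cache := newToyCache()
	var seen []int

	// Step 1: first connection, the watchdog has produced a snapshot.
	cache.watchdogTick(node, want)
	s, _ := cache.onConnect(node)
	seen = append(seen, len(s.Listeners))

	// Step 2: disconnect stores an empty snapshot.
	cache.onDisconnectPushEmpty(node)

	// Steps 3-4: reconnect before the watchdog ticks; the empty snapshot is served.
	s, _ = cache.onConnect(node)
	seen = append(seen, len(s.Listeners))

	// Steps 5-6: the watchdog ticks and the correct snapshot is served again.
	cache.watchdogTick(node, want)
	s, _ = cache.onConnect(node)
	seen = append(seen, len(s.Listeners))

	return seen
}

func main() {
	fmt.Println(replay()) // prints [2 0 2]
}
```

The middle `0` is the window in which the proxy would have no listeners, lasting up to one refreshInterval in this model.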
Proposed fix
reconciler.Clear() should clear the snapshot and not push an empty snapshot.
Note: Will we need to change the way the watchdog refreshes entities?
I don't think it's necessary for these reasons:
a) It works for the first request on a freshly booted CP
b) The ADS request is simply parked until the watchdog runs once.
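A minimal sketch of the proposed behaviour, under the assumption that a missing cache entry means the request stays parked until the watchdog stores a snapshot. `nodeCache`, `set`, `clear`, and `respond` are hypothetical names for illustration only; in the real code the removal would presumably go through SnapshotCache.ClearSnapshot(nodeId).

```go
package main

import "fmt"

// snapshot is a toy stand-in for an xDS snapshot.
type snapshot struct {
	listeners []string
}

// nodeCache is a hypothetical stand-in for the per-node snapshot cache.
type nodeCache struct {
	byNode map[string]snapshot
}

// set models the watchdog storing a freshly generated snapshot.
func (c *nodeCache) set(nodeID string, s snapshot) {
	c.byNode[nodeID] = s
}

// clear models the proposed fix: remove the entry entirely on disconnect
// instead of overwriting it with an empty snapshot.
func (c *nodeCache) clear(nodeID string) {
	delete(c.byNode, nodeID)
}

// respond reports what a request for nodeID would get right now.
// ok == false stands for "nothing to send yet, the request stays parked".
func (c *nodeCache) respond(nodeID string) (snapshot, bool) {
	s, ok := c.byNode[nodeID]
	return s, ok
}

func main() {
	c := &nodeCache{byNode: map[string]snapshot{}}

	c.set("dp-1", snapshot{listeners: []string{"inbound:80"}})
	c.clear("dp-1") // disconnect

	// Reconnect before the watchdog has run: nothing is served, and the
	// cache holds no stale entry for the node.
	_, ok := c.respond("dp-1")
	fmt.Println("served on reconnect:", ok, "cached nodes:", len(c.byNode))

	// Once the watchdog runs, the correct snapshot becomes available.
	c.set("dp-1", snapshot{listeners: []string{"inbound:80"}})
	s, ok := c.respond("dp-1")
	fmt.Println("served after watchdog:", ok, "listeners:", len(s.listeners))
}
```

In this model, clearing on disconnect addresses both issues at once: the reconnecting proxy is never handed an empty snapshot, and entries for departed nodes do not accumulate.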