This issue aims to collect all the knowledge and known cases regarding problems with statuses.
NOTE 1: Wrong statuses affect only how information is displayed in the Kuma GUI. An incorrect status doesn't affect real connectivity between peers.
NOTE 2: The problem is the same for Dataplane, Zone and ZoneIngress statuses because internally they all use the same Insight/Subscription mechanism. All examples describe the problem for Dataplanes (DPP), but the same applies to Zones and ZoneIngresses.
Scenario 2 (less likely to happen)
1. Kuma CP goes down ungracefully and doesn't set DisconnectTime.
2. The DPP goes down while Kuma CP is down.
3. Kuma CP restarts.
4. In the GUI the DPP is shown as Online although it is effectively Offline.
Fixed: not fixed.
There are 2 possible solutions that allow fixing Scenario 2.
Solution 1 (more complicated to implement)
The main idea of this solution is to implement a passive health check of every Kuma CP instance. A dedicated job will start periodically on the leader, iterate over all DataplaneInsights and set DisconnectTime on those whose Kuma CP instance is dead.
Downsides:
- doesn't work if Kuma CP instances have clock skew
- if a Kuma CP instance is busy it could be marked as Unhealthy, and all DPPs connected to it will then be marked as Unhealthy for some period of time
- relatively complex to implement
Upsides:
- status changes are detected quickly; the delay is bounded by how often the recurring job runs (could be about every 30s to 1m)
Solution 2 (less complicated to implement)
The main idea of this solution is to drop the xDS connection with each Envoy every N minutes (N could be ~10m).
Downsides:
- if we hit Scenario 2, the DPP will still be displayed as Online until its xDS connection is dropped (no more than 10m); all other scenarios will work as expected
Upsides:
- easy to implement
- we can probably benefit from it later, when implementing token rotation or something similar
@kumahq/kuma-maintainers we have to choose between these two solutions. Personally I'd vote for Solution 2. WDYT?
Summary
Related issues: #1605 #1737
Scenario 1 (more likely to happen)
DisconnectTime
Fixed: #2246
Scenario 2 (less likely to happen)
Fixed: not fixed.