
Dataplane, Zone and ZoneIngress status problem #2320

Closed
lobkovilya opened this issue Jul 6, 2021 · 2 comments · Fixed by #2526


lobkovilya commented Jul 6, 2021

Summary

This issue aims to collect all the knowledge and known cases regarding the problem with statuses.

NOTE 1: Wrong statuses affect only how information is displayed in the Kuma GUI. An incorrect status doesn't affect real connectivity between peers.

NOTE 2: The problem is the same for Dataplane, Zone and ZoneIngress statuses because internally we use the same Insight/Subscription mechanism. All examples describe the problem for Dataplanes (DPP), but the same applies to both Zones and ZoneIngresses.

Related issues: #1605 #1737

Scenario 1 (more likely to happen)

  1. Kuma CP goes down ungracefully and doesn't set DisconnectTime
  2. DPP connects to another instance of Kuma CP
  3. DPP goes down
  4. In the GUI the DPP is seen as Online while effectively it is Offline

Fixed: #2246

Scenario 2 (less likely to happen)

  1. Kuma CP goes down ungracefully and doesn't set DisconnectTime
  2. DPP goes down while Kuma CP is down
  3. Kuma CP restarts
  4. In the GUI the DPP is seen as Online while effectively it is Offline

Fixed: not fixed yet.

There are 2 possible solutions for fixing Scenario 2.

Solution 1 (more complicated to implement)

The main idea of this solution is to implement a passive health check of every Kuma CP instance. A dedicated job will periodically run on the leader, iterate over all DataplaneInsights, and set DisconnectTime if the corresponding Kuma CP instance is dead.

Downsides:

  • doesn't work if Kuma CP instances have clock skew
  • if a Kuma CP instance is busy it could be marked as Unhealthy, and thus all connected DPPs will be marked as Unhealthy for some period of time
  • relatively complex to implement

Upsides:

  • detection of status changes depends on how often we run the recurring job (could be about every 30s–1m)

Solution 2 (less complicated to implement)

The main idea of this solution is to drop the xDS connection with Envoys every N minutes (could be ~10m).

Downsides:

  • if we come across Scenario 2, the DPP will be displayed as Online until the xDS connection is dropped (no more than 10m); all other scenarios will work as expected

Upsides:

  • easy to implement
  • we can probably benefit from it later when implementing token rotation or something else

@kumahq/kuma-maintainers we have to choose between these two solutions. Personally I'd vote for Solution 2. WDYT?

@jpeach added the bug label Jul 6, 2021
@subnetmarco

Seems like there is some consensus to go with solution 2, which I am fine with.

@jakubdyszkiewicz

+1 for Solution 2
