Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apid loses connectivity to trustd if master IP changes or master goes down (workers) #3068

Closed
smira opened this issue Jan 29, 2021 · 0 comments · Fixed by #3105
Closed

apid loses connectivity to trustd if master IP changes or master goes down (workers) #3068

smira opened this issue Jan 29, 2021 · 0 comments · Fixed by #3105
Assignees
Milestone

Comments

@smira
Copy link
Member

smira commented Jan 29, 2021

Bug Report

apid on workers discovers master endpoints once on startup, finds first working node and sticks to it for cert issue/renewal

if master goes down or gets replaced, apid can't renew certs anymore, so eventually certs expire and connectivity is lost

Description

Logs

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
  • Kubernetes version: [kubectl version --short]
  • Platform:
@smira smira added this to the v0.9 milestone Jan 29, 2021
@smira smira self-assigned this Jan 29, 2021
smira added a commit to smira/talos that referenced this issue Jan 29, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: siderolabs#3068

Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
smira added a commit to smira/talos that referenced this issue Jan 29, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: siderolabs#3068

Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
talos-bot pushed a commit to smira/talos that referenced this issue Jan 30, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: siderolabs#3068

Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
smira added a commit to smira/talos that referenced this issue Feb 1, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: siderolabs#3068

Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
talos-bot pushed a commit to smira/talos that referenced this issue Feb 1, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: siderolabs#3068

Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
talos-bot pushed a commit to smira/talos that referenced this issue Feb 1, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: siderolabs#3068

Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
talos-bot pushed a commit to smira/talos that referenced this issue Feb 1, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: siderolabs#3068

Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
talos-bot pushed a commit that referenced this issue Feb 2, 2021
Instead of doing our homegrown "try all the endpoints" method,
use gRPC load-balancing across configured endpoints.

Generalize load-balancer via gRPC resolver we had in Talos API client,
use it in remote certificate generator code. Generalized resolver is
still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we
can't depend on Talos code from `machinery/`.

Related to: #3068

Full fix for #3068 requires dynamic updates to control plane endpoints
while apid is running, this is coming in the next PR.

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
smira added a commit to smira/talos that referenced this issue Feb 3, 2021
This moves endpoint refresh from the context of the service `apid` in
`machined` into `apid` service itself for the workers. `apid` does
initial poll for the endpoints when it boots, but also periodically
polls for new endpoints to make sure it has accurate list of `trustd`
endpoints to talk to, this handles cases when control plane endpoints
change (e.g. rolling replace of control plane nodes with new IPs).

Related to siderolabs#3069

Fixes siderolabs#3068

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
smira added a commit to smira/talos that referenced this issue Feb 3, 2021
This moves endpoint refresh from the context of the service `apid` in
`machined` into `apid` service itself for the workers. `apid` does
initial poll for the endpoints when it boots, but also periodically
polls for new endpoints to make sure it has accurate list of `trustd`
endpoints to talk to, this handles cases when control plane endpoints
change (e.g. rolling replace of control plane nodes with new IPs).

Related to siderolabs#3069

Fixes siderolabs#3068

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
talos-bot pushed a commit that referenced this issue Feb 3, 2021
This moves endpoint refresh from the context of the service `apid` in
`machined` into `apid` service itself for the workers. `apid` does
initial poll for the endpoints when it boots, but also periodically
polls for new endpoints to make sure it has accurate list of `trustd`
endpoints to talk to, this handles cases when control plane endpoints
change (e.g. rolling replace of control plane nodes with new IPs).

Related to #3069

Fixes #3068

Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 27, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant