apid loses connectivity to trustd if master IP changes or master goes down (workers) #3068
smira added a commit to smira/talos that referenced this issue on Jan 29, 2021
Instead of doing our homegrown "try all the endpoints" method, use gRPC load-balancing across configured endpoints. Generalize load-balancer via gRPC resolver we had in Talos API client, use it in remote certificate generator code. Generalized resolver is still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we can't depend on Talos code from `machinery/`.
Related to: siderolabs#3068
Full fix for siderolabs#3068 requires dynamic updates to control plane endpoints while apid is running, this is coming in the next PR.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
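To make the behavior change in this commit concrete, here is a minimal sketch of rotating calls across several endpoints and skipping ones that fail, instead of pinning to the first endpoint that worked. It is a standard-library illustration only: it does not use gRPC's resolver or balancer APIs and is not the code from the commit. `Balancer`, `NewBalancer`, `Do`, and `ErrNoEndpoints` are hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNoEndpoints is returned when the balancer has nothing to call.
var ErrNoEndpoints = errors.New("no endpoints configured")

// Balancer is a hypothetical helper for illustration; it is not a Talos or gRPC type.
type Balancer struct {
	endpoints []string // fixed at construction time

	mu   sync.Mutex
	next int // index of the endpoint to try first on the next call
}

// NewBalancer copies the endpoint list so later changes by the caller have no effect.
func NewBalancer(endpoints []string) *Balancer {
	return &Balancer{endpoints: append([]string(nil), endpoints...)}
}

// Do invokes call against the endpoints in round-robin order, starting with the
// one after the last successful endpoint, and returns on the first success.
func (b *Balancer) Do(call func(endpoint string) error) error {
	n := len(b.endpoints)
	if n == 0 {
		return ErrNoEndpoints
	}

	b.mu.Lock()
	start := b.next
	b.mu.Unlock()

	var lastErr error
	for i := 0; i < n; i++ {
		idx := (start + i) % n
		if err := call(b.endpoints[idx]); err != nil {
			lastErr = err
			continue // this endpoint failed, move on to the next one
		}

		// Remember where to start next time so load is spread across endpoints.
		b.mu.Lock()
		b.next = (idx + 1) % n
		b.mu.Unlock()

		return nil
	}

	return fmt.Errorf("all %d endpoints failed, last error: %w", n, lastErr)
}
```

With three endpoints where the first one is down, successive calls to `Do` land on the second, the third, and then the second again, so a single unreachable master does not block certificate requests. Note that the endpoint list here is fixed; as the commit message says, handling endpoints that change at runtime needs a separate mechanism.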
talos-bot pushed a commit that referenced this issue on Feb 2, 2021
Instead of doing our homegrown "try all the endpoints" method, use gRPC load-balancing across configured endpoints. Generalize load-balancer via gRPC resolver we had in Talos API client, use it in remote certificate generator code. Generalized resolver is still under `machinery/`, as `pkg/grpc` is not in `machinery/`, and we can't depend on Talos code from `machinery/`.
Related to: #3068
Full fix for #3068 requires dynamic updates to control plane endpoints while apid is running, this is coming in the next PR.
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
smira added a commit to smira/talos that referenced this issue on Feb 3, 2021
This moves endpoint refresh from the context of the service `apid` in `machined` into `apid` service itself for the workers. `apid` does initial poll for the endpoints when it boots, but also periodically polls for new endpoints to make sure it has accurate list of `trustd` endpoints to talk to, this handles cases when control plane endpoints change (e.g. rolling replace of control plane nodes with new IPs).
Related to siderolabs#3069
Fixes siderolabs#3068
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
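The "initial poll, then periodic poll" pattern described in this commit message could look roughly like the sketch below. It uses only the Go standard library and is not the Talos implementation: `EndpointSet`, `RefreshOnce`, `Run`, `ErrEmptyDiscovery`, and the `discover` callback are hypothetical names, and the choice to keep the previous list when discovery fails or returns nothing is an assumption of this example.

```go
package main

import (
	"context"
	"errors"
	"sync"
	"time"
)

// ErrEmptyDiscovery reports a discovery round that returned no endpoints.
var ErrEmptyDiscovery = errors.New("discovery returned no endpoints")

// EndpointSet holds the most recently discovered endpoints.
// It is a hypothetical type for illustration, not part of Talos.
type EndpointSet struct {
	mu        sync.RWMutex
	endpoints []string
}

// Get returns a copy of the current endpoint list.
func (s *EndpointSet) Get() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()

	return append([]string(nil), s.endpoints...)
}

// RefreshOnce runs discover and replaces the stored list on success.
// On error or an empty result the previous list is kept (an assumption of this sketch).
func (s *EndpointSet) RefreshOnce(discover func() ([]string, error)) error {
	eps, err := discover()
	if err != nil {
		return err
	}
	if len(eps) == 0 {
		return ErrEmptyDiscovery
	}

	s.mu.Lock()
	s.endpoints = append([]string(nil), eps...)
	s.mu.Unlock()

	return nil
}

// Run polls once immediately, then once per interval, until ctx is cancelled.
// Discovery errors are ignored here; a real service would likely log them.
func (s *EndpointSet) Run(ctx context.Context, interval time.Duration, discover func() ([]string, error)) {
	_ = s.RefreshOnce(discover)

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = s.RefreshOnce(discover)
		}
	}
}
```

A caller would start `Run` in a goroutine at boot and read `Get()` whenever it needs to contact `trustd`, so a rolling replacement of control plane nodes is picked up on the next successful poll rather than requiring a restart.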
talos-bot pushed a commit that referenced this issue on Feb 3, 2021
This moves endpoint refresh from the context of the service `apid` in `machined` into `apid` service itself for the workers. `apid` does initial poll for the endpoints when it boots, but also periodically polls for new endpoints to make sure it has accurate list of `trustd` endpoints to talk to, this handles cases when control plane endpoints change (e.g. rolling replace of control plane nodes with new IPs).
Related to #3069
Fixes #3068
Signed-off-by: Andrey Smirnov <smirnov.andrey@gmail.com>
Bug Report
Description
apid on workers discovers master endpoints once on startup, finds the first working node, and sticks to it for certificate issuance and renewal.
If that master goes down or is replaced, apid can no longer renew its certificates, so they eventually expire and connectivity is lost.
Logs
Environment
talosctl version --nodes <problematic nodes>
kubectl version --short