kube-aws: decrypt-tls-assets.service failed in a controller node #675
Regardless of how we improve decrypt-tls-assets, I think we should make starting the kubelet more robust by using the workaround from the upstream systemd issue for failed dependencies, which @dghubble used here: https://github.com/coreos-inc/tectonic/pull/250. This would mean … Alternatively, we could ExecStartPre the decrypt-tls-assets script directly and drop the oneshot service. This would probably mean adding a check to avoid re-decrypting the assets if the kubelet restarts.
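For concreteness, a rough sketch of that second option, ExecStartPre'ing the decryption directly from the kubelet unit. The script path, asset path, and drop-in name here are placeholders for illustration, not the actual kube-aws layout:

```sh
# Sketch of the "ExecStartPre the decrypt script directly" alternative.
# /opt/bin/decrypt-tls-assets and the .pem path are placeholders; the real
# script and asset locations in kube-aws may differ.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/10-decrypt-tls.conf
[Service]
# Decrypt the TLS assets before every kubelet start, but skip the work if a
# decrypted file already exists so plain kubelet restarts stay cheap.
ExecStartPre=/bin/sh -c 'test -f /etc/kubernetes/ssl/apiserver-key.pem || /opt/bin/decrypt-tls-assets'
EOF
sudo systemctl daemon-reload
```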
I have attempted to test out that ExecStartPre without success, but I am far from a systemd expert.
Is that actually a problem?
@cgag @robszumski Thanks for your response. So the problems are: (1) decrypt-tls-assets.service doesn't restart/retry on failure, and/or (2) kubelet.service fails when (1) has failed one or more times and has no way to recover from it, right?
If so, for (1), I'd … so that … If going with the latter option, I'd also add … Then for (2), I'd add … Do these changes look good to you? @cgag @robszumski
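As one illustration of what a retry for (1) could look like, here is a hedged sketch of wrapping the decryption in a retry loop; the decrypt command and path are placeholders, and this is not necessarily the change actually proposed here:

```sh
#!/bin/sh
# Illustrative retry wrapper for the TLS-asset decryption step, so a transient
# failure (e.g. a KMS or network hiccup right after boot) doesn't leave the
# oneshot unit permanently failed. /opt/bin/decrypt-tls-assets is a placeholder.
for attempt in 1 2 3 4 5; do
  if /opt/bin/decrypt-tls-assets; then
    exit 0
  fi
  echo "decrypt attempt ${attempt} failed; retrying in 10s" >&2
  sleep 10
done
echo "giving up after ${attempt} attempts" >&2
exit 1
```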
@robszumski Thanks, but as I've reported earlier in this issue, I'm using … As it is not …
@robszumski I actually only tested the alternative I proposed of just directly ExecStartPre'ing the decryption of the TLS assets. I'm validating the is-active version now and realizing it might not actually solve things, which doesn't make much sense to me. Perhaps it's something related to it being a oneshot. I'll need to play with it more.

@mumoshu: Say service B depends on service A. If you attempt to start B, it will then try to start A. If A fails, B will be marked as "failed due to dependency", and systemd will never attempt to restart it. Things marked as failing due to a dependency are dead for good. If A has restart logic, systemd will try to restart it, but it won't go back and start B once A has come up. The idea behind the … Another, smaller problem is that …
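To make the is-active idea concrete, here is a minimal sketch of that workaround as a kubelet drop-in; the unit names, drop-in path, and timings are assumptions based on this thread, not the exact kube-aws units:

```sh
# Sketch only: make kubelet fail (and retry) on its own instead of being marked
# "failed due to dependency" when decrypt-tls-assets.service has failed.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/10-wait-for-tls.conf
[Unit]
# Pull the decryption unit in, but avoid a hard dependency that can leave
# kubelet permanently "failed due to dependency".
Wants=decrypt-tls-assets.service
After=decrypt-tls-assets.service

[Service]
# If the assets aren't decrypted yet, this check fails the start attempt,
# and Restart= makes systemd keep retrying until it succeeds.
ExecStartPre=/usr/bin/systemctl is-active decrypt-tls-assets.service
Restart=always
RestartSec=10
EOF
sudo systemctl daemon-reload
```

The point is that a failed ExecStartPre check fails kubelet itself rather than putting it in the dead failed-due-to-dependency state, so Restart= keeps retrying (and re-pulling in the wanted unit) until the assets are decrypted.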
@cgag Thanks for your detailed explanation. It was very helpful to me.
Ack. That's also why I've proposed (1) in #675 (comment)
Thanks. That's what I wasn't sure about at the time of writing (1) above.
Sounds nice. I've seen someone on the internet come up with that as well.
Thanks for sharing the great findings. So:
I'd also like the ExecStartPre option in that regard if it works.
Yep, I use the https://github.com/coreos/coreos-baremetal/blob/master/examples/ignition/k8s-controller.yaml#L68 … For AWS configurations, on the other hand, I've been seeing the same … Adding …
@dghubble Thanks for sharing your valuable insights! May I ask if #697 looks good to you in that regard? I started to believe the content of the PR is the way to go (for now), according to what you and @cgag said. However, I'm not completely sure, and I'd greatly appreciate feedback from experts like you to move towards fixing this long-standing issue 🙇
Thanks for your documentation here, @mumoshu / all. I am trying to get kube-aws to deploy correctly inside an existing VPC and am perhaps having related issues due to this requirement and my configuration in cluster.yaml, etc. (kube-aws seems to currently have issues with existing subnets, RE: #538). Unfortunately, I'm stuck with the … Here's what I see when I try to restart the systemd services on the controller/worker nodes:
Excerpt from `journalctl -xe` output showing flannel issues:
Thanks
@cmcconnell1 Hi, thanks for your feedback here! Does restarting … help? If you'd already been using the external etcd, … If restarting … doesn't work for you, it may be related to an ACL or something I'm not aware of for now.
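For example, assuming flanneld and docker are the units in question (as the flannel log excerpt above suggests), the restart and a quick check could look like this:

```sh
# Assuming flanneld/docker are the failing units mentioned above.
sudo systemctl restart flanneld.service
sudo systemctl restart docker.service
systemctl status flanneld.service docker.service --no-pager
# If either unit fails again, look at the most recent log lines.
journalctl -u flanneld.service -u docker.service --no-pager | tail -n 50
```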
Thanks @mumoshu for the quick response. Regarding ACLs (at least from the EC2 side): I added both the kube-aws controller and worker nodes to another (development) security group which allows all traffic in/out (all protocols and port ranges). Sorry, I should have included this info last night. The terminal output below is from the controller node of a fresh kube-aws cluster deploy (with the above-mentioned controller/worker also included in an "Allow All" security group in addition to their respective controller/worker security groups).
I've also tried posting on the #kubernetes-users Slack channel, but haven't got any useful responses. Is there another, more appropriate channel someone could suggest for kube-aws related issues? Thanks for your help.
@cmcconnell1 Thanks for the info 👍 The part of the log messages you've provided:
…
seems to indicate that your node doesn't have a connection to the internet. Excuse me if I'm just revisiting what you've already tried, but please let me blindly cover all the possibilities I have in mind:
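As generic examples of the kind of checks meant here (these commands are not kube-aws specific, and some tools may need to be run from a toolbox container on Container Linux):

```sh
# Basic outbound-connectivity checks from the node (generic examples).
ip route show                                             # is there a default route via the NAT gateway / IGW?
curl -sI https://quay.io | head -n 1                      # DNS resolution + HTTPS reachability to the image registry
curl -sI https://kms.us-west-2.amazonaws.com | head -n 1  # sample AWS endpoint used by decryption; adjust the region
```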
Hey @mumoshu, in all deployments I was using the default route table for the VPC (and specified it in the cluster.yaml file just to be sure). There is access to an internet gateway via NAT on that subnet, and both the controller and worker nodes were able to access the internet, ping google and yahoo, had DNS resolution, etc. AFAIK, we decided not to use ACLs for this reason and just use security groups, as it's a bit tough to troubleshoot if you try to use both. I tried downloading your latest RC-1 candidate earlier today and seemed to get a bit further: only the docker service would not start, and I had no issues with the previously failing services. I still couldn't connect remotely with kubectl, etc. So I've jumped over to trying to get going with kops, but am able/willing to work on both kube-aws and kops if/as needed. Thanks for your responses.
Closing, as this is already fixed in https://github.com/coreos/kube-aws/
@cmcconnell1 Did you have any luck with kops, and/or did you get to the root cause of your issue? I'd appreciate it if you could share your experience 🙏
Hey @mumoshu, thanks for reaching out... It seems that, so far, we have issues with both kube-aws and kops when trying to deploy into an existing VPC. I've tried specifying internal NAT subnets/routing tables and external subnets/routing tables with routes to an internet gateway. With both approaches, the controller and workers are able to ping and resolve external sites, etc., but the services aren't happy, and currently, on the latest RC candidate, I have issues with the docker service, install-kube-system.service, etc. failing. Noting that with kops and an existing VPC, I had connection-refused errors/issues with the kubelet service:
I've tried posting in both the kubernetes-users and sig-aws Slack channels with no responses, other than one person simply stating that "kops works in an existing VPC"; after I posted the errors I get with kubelet, I didn't get any other responses. I was asking if anyone has seen them working in a VPC.
Hey @cmcconnell1! Excuse me if I'm missing the point again, but could you ensure that the etcd cluster is up? If I recall correctly, docker.service and flanneld.service on the controller and worker nodes rely on an etcd cluster being up beforehand. If your etcd cluster isn't up, would you mind checking out kubernetes-retired/kube-aws#62?
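For example, a quick health check from a node could look like the sketch below; the unit name, endpoint, and TLS flags are assumptions, so adjust them for your on-host or external etcd:

```sh
# Check that etcd is reachable from the node before blaming docker/flanneld.
systemctl status etcd2.service --no-pager       # on-host etcd, if used
etcdctl cluster-health                          # etcd v2 health check against the default endpoints
# For an external etcd cluster, point at its endpoints explicitly
# (add --ca-file/--cert-file/--key-file if it requires client TLS):
etcdctl --endpoints "https://etcd.example.internal:2379" cluster-health
curl -sk https://etcd.example.internal:2379/health
```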
Sorry, I submitted my comment too early by mistake! Edited.
With the latest master (066d888caa85c00112d891f3c69a8416a59aaee4), it failed and kubelet.service didn't come up.

`systemctl status decrypt-tls-assets` showed:

…

`journalctl -u decrypt-tls-assets` showed:

…

A work-around for me was to run … and then wait several minutes until `curl localhost:8080` succeeds.

I'm not really sure, but did the fix for #666 cause another issue?
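For illustration, a manual recovery along those lines could look like the following sketch, assuming the units named in this thread:

```sh
# Manual recovery sketch: re-run the failed decryption unit, start kubelet,
# then wait for the local API server to answer on :8080.
sudo systemctl restart decrypt-tls-assets.service
sudo systemctl start kubelet.service
until curl -sf http://localhost:8080/ > /dev/null; do
  echo "waiting for the apiserver on localhost:8080 ..."
  sleep 10
done
echo "apiserver is up"
```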