Some nodes fail to join the cluster because kubelet determines node name to be "" (empty string). #635

Closed
raonitimo opened this issue Aug 4, 2023 · 18 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@raonitimo

raonitimo commented Aug 4, 2023

What happened:

Some nodes fail to join the cluster. The kubelet logs show:

Aug 03 19:00:55 ip-10-145-250-1.ec2.internal systemd[1]: Starting Kubernetes Kubelet...
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal systemd[1]: Started Kubernetes Kubelet.
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: Flag --pod-infra-container-image has been deprecated, will be removed in 1.27. Image garbage collector will get sandbox image informat
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: Flag --cloud-provider has been deprecated, will be removed in 1.25 or later, in favor of removing cloud provider code from Kubelet.
...
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.621434    3112 server.go:413] "Kubelet version" kubeletVersion="v1.25.11-eks-a5565ad"
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.621457    3112 server.go:415] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.621525    3112 feature_gate.go:245] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.621634    3112 feature_gate.go:245] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: W0803 19:00:55.621813    3112 plugins.go:132] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated a
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.632010    3112 aws.go:1268] Get AWS region from metadata client
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.632440    3112 aws.go:1313] Zone not specified in configuration file; querying AWS metadata service
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.640929    3112 aws.go:1353] Building AWS cloudprovider
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.857410    3112 tags.go:80] AWS cloud filtering on ClusterID: release-xds-0
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.857446    3112 server.go:555] "Successfully initialized cloud provider" cloudProvider="aws" cloudConfigFile=""
Aug 03 19:00:55 ip-10-145-250-1.ec2.internal kubelet[3112]: I0803 19:00:55.857465    3112 server.go:993] "Cloud provider determined current node" nodeName=""

Node events show

Aug  4 01:19:02 ip-10-145-250-1 kubelet: E0804 01:19:02.163174    3112 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes \"Unknown\" is forbidden: node \"ip-10-145-250-1.ec2.internal\" is not allowed to modify node \"\"" node=""

No other errors logged.
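
To spot this symptom on other instances, the kubelet journal can be scanned for the empty `nodeName`. The following is only a minimal sketch: the regular expression is based on the message text in the excerpt above, and the function name `find_empty_node_names` is hypothetical.

```python
import re

# Matches the kubelet message from the excerpt above and captures the
# quoted node name (which is empty on affected instances).
NODE_NAME_RE = re.compile(
    r'"Cloud provider determined current node" nodeName="([^"]*)"'
)


def find_empty_node_names(lines):
    """Return the log lines where the cloud provider reported nodeName=""."""
    hits = []
    for line in lines:
        match = NODE_NAME_RE.search(line)
        if match and match.group(1) == "":
            hits.append(line)
    return hits
```

The function takes any iterable of log lines, so it could be fed from `sys.stdin` with the kubelet journal piped in (assuming the systemd unit is named `kubelet`).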

What you expected to happen:

Kubelet would correctly figure out the node name.

How to reproduce it (as minimally and precisely as possible):

It doesn't happen every time, and I can't correlate it with anything. It has happened on different EKS clusters in different AWS accounts.

I can see two DescribeInstances API calls in the CloudTrail event history within the same second, at "2023-08-03T19:00:56Z". That is the same pattern I see for an instance that successfully joined the cluster.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.25.11
  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release): EKS AL2
  • Kernel (e.g. uname -a): 5.10.184-175.731.amzn2.x86_64
  • Install tools: EKS
  • Others:

Happy to provide more context and logs. The instance is still around.

/kind bug

@k8s-ci-robot added the kind/bug and needs-triage labels on Aug 4, 2023
@raonitimo changed the title from 'Some nodes fail to join the cluster because kubelet determine node name to be "" (empty string).' to 'Some nodes fail to join the cluster because kubelet determines node name to be "" (empty string).' on Aug 4, 2023
@raonitimo
Author

After restarting kubelet, it shows:

Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598146   23137 server.go:413] "Kubelet version" kubeletVersion="v1.25.11-eks-a5565ad"
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598158   23137 server.go:415] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598213   23137 feature_gate.go:245] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCer
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598295   23137 feature_gate.go:245] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCer
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: W0804 04:54:52.598391   23137 plugins.go:132] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is depr
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598411   23137 aws.go:1268] Get AWS region from metadata client
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.598739   23137 aws.go:1313] Zone not specified in configuration file; querying AWS metadata service
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.600302   23137 aws.go:1353] Building AWS cloudprovider
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.811489   23137 tags.go:80] AWS cloud filtering on ClusterID: release-xds-0
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.811514   23137 server.go:555] "Successfully initialized cloud provider" cloudProvider="aws" cloudConfigFile=""
Aug 04 04:54:52 ip-10-145-250-1.ec2.internal kubelet[23137]: I0804 04:54:52.811526   23137 server.go:993] "Cloud provider determined current node" nodeName="ip-10-145-250-1.ec2.internal"

@raonitimo
Author

It still fails to register the node. I see the error:

Aug 04 05:15:59 ip-10-145-250-1.ec2.internal kubelet[23137]: E0804 05:15:59.270406   23137 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes is forbidden: User \"system:node:\" cannot create resource \"nodes\" in API group \"\" at the cluster scope: unknown node for user \"system:node:\"" node="ip-10-145-250-1.ec2.internal"

It looks like the kubelet is not authenticating as the correct user, which should be system:node:{{EC2PrivateDNSName}}.
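
One plausible reading of the `system:node:` user in the error above is that the username template was rendered while the instance's private DNS name was still empty. The sketch below only illustrates that string substitution; it is not the authenticator's actual code, and `render_username` is a hypothetical helper.

```python
def render_username(template: str, private_dns_name: str) -> str:
    """Substitute the placeholder with the private DNS name (illustration only)."""
    return template.replace("{{EC2PrivateDNSName}}", private_dns_name)


TEMPLATE = "system:node:{{EC2PrivateDNSName}}"

# With a populated name, the user matches the node name the kubelet registers.
healthy = render_username(TEMPLATE, "ip-10-145-250-1.ec2.internal")

# With an empty name, the user collapses to the bare prefix seen in the error.
broken = render_username(TEMPLATE, "")
```

Under this assumption, `broken` is the same `system:node:` string that the API server rejects with "unknown node for user".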

@raonitimo
Author

I ran the bootstrap.sh script again and restarted the kubelet again, then the node joined the cluster.

@raonitimo
Author

An AWS support engineer pointed me to kubernetes/kubernetes#118421. So, I guess the in-tree code is still used in 1.25?

Anyway, is the code being kept in-sync with the fixes?

@kmala
Member

kmala commented Aug 4, 2023

> So, I guess the in-tree code is still used in 1.25?

Yes, that's true. The switch happens with 1.27.

> Anyway, is the code being kept in-sync with the fixes?

Can you explain what you meant by it?

@raonitimo
Author

Hey @kmala , thanks for responding.

> > Anyway, is the code being kept in-sync with the fixes?
>
> Can you explain what you meant by it?

Yes, sure. I meant to ask: if this PR is merged into the in-tree code in 1.26, should we expect the same fix to be applied to this plugin and to work on version 1.27?

@olemarkus
Member

Looks legit.

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Aug 7, 2023
@kmala
Member

kmala commented Aug 8, 2023

> should we expect to have the same fix applied to this plugin and working on version 1.27 ?

Yes, it would be merged in this repo as far as I can tell. @cartermckinnon can comment otherwise.

@cartermckinnon
Contributor

cartermckinnon commented Aug 8, 2023

That PR wouldn’t help here, because kubelet doesn’t use this code. It has to be merged to the legacy in-tree AWS cloud provider in versions prior to 1.27. I haven’t gotten much traction on that PR, so please bump it if this is a blocker for you. 😌 I’ll go ahead and get this patched in the EKS kubelet builds, at least, because we’ll be supporting 1.26 for a while.

@cartermckinnon
Contributor

cartermckinnon commented Aug 8, 2023

We’re handling the PrivateDnsName quirks in 1.27+ with a hostname override: https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh#L536-537

But we still need to address the eventual consistency issue. I’ll put up a PR for that.

The proper fix will be in the aws-iam-authenticator, I think.

@raonitimo
Author

Hey @cartermckinnon, thanks for responding.

> We’re handling the PrivateDnsName quirks in 1.27+ with a hostname override: awslabs/amazon-eks-ami@master/files/bootstrap.sh#L536-537

IIUC, this override won't fix this particular issue because the DescribeInstances call doesn't fail. It just returns an empty string.

> But we still need to address the eventual consistency issue. I’ll put up a PR for that.
>
> The proper fix will be in the aws-iam-authenticator, I think.

Nice. I don't understand how the aws-iam-authenticator is related. Keen to see the PR and understand it. Please link it here.

Thanks!

@cartermckinnon
Contributor

> IIUC, this override won't fix this particular issue because the DescribeInstances call doesn't fail. It just returns an empty string.

Correct -- I just meant to point out how we're achieving the behavior (Node name == PrivateDnsName) on kubelets that no longer use the in-tree AWS cloud provider. We still need to address the eventual consistency problem; I intended to do so when there was some consensus on the issue upstream.
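
For readers following along, handling an eventually consistent field generally means polling until it is populated or a deadline passes. The sketch below illustrates that idea only; it is not the code from either PR, the name `wait_for_private_dns_name` is hypothetical, and the default timeout and interval are placeholders rather than values anyone in this thread has chosen.

```python
import time
from typing import Callable, Optional


def wait_for_private_dns_name(
    fetch: Callable[[], Optional[str]],
    timeout_s: float = 30.0,   # placeholder value, not a recommendation
    interval_s: float = 1.0,   # placeholder value, not a recommendation
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `fetch` until it returns a non-empty name or the timeout expires.

    `fetch` stands in for whatever lookup returns the instance's
    PrivateDnsName; it may return "" or None while the value is not
    yet visible.
    """
    deadline = clock() + timeout_s
    while True:
        name = fetch()
        if name:
            return name
        if clock() >= deadline:
            raise TimeoutError(
                "PrivateDnsName still empty after %.0fs" % timeout_s
            )
        sleep(interval_s)
```

In practice `fetch` would wrap a DescribeInstances lookup for the current instance. `sleep` and `clock` are injectable only so the loop can be exercised without real delays.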

> I don't understand how the aws-iam-authenticator is related.

On EKS, the aws-iam-authenticator is where the PrivateDnsName requirement comes from, i.e. entries in your configmap/aws-auth like system:node:{{EC2PrivateDNSName}}.

@raonitimo
Author

raonitimo commented Aug 24, 2023

Hey @cartermckinnon, this problem appeared again. Do you have any rough timeline for the fix? Or any pointers on how this should be fixed so someone can contribute?

@cartermckinnon
Contributor

cartermckinnon commented Aug 24, 2023

  1. I have a PR up to handle this in EKS AMIs for Kubernetes 1.27+: "Handle eventually-consistent PrivateDnsName on 1.26+" (awslabs/amazon-eks-ami#1383). I expect that to be merged in the next few days, and it'll land in the following EKS AMI release.
  2. I'm applying the patch from "Handle eventually-consistent EC2 PrivateDnsName" (kubernetes/kubernetes#118421) to EKS kubelet builds for Kubernetes 1.23-1.26. Those patches will appear in EKS-D in the next couple of weeks: https://github.com/aws/eks-distro/tree/main/projects/kubernetes/kubernetes. The patched kubelets will land in an upcoming EKS AMI release, but I can't guarantee it'll be the same release as item 1.

I'll reach out to the Bottlerocket folks to see what a fix looks like on their end for 1.27+.

Edit: looks like we'll need some handling here: https://github.com/bottlerocket-os/bottlerocket/blob/dea2c11949a95e914b3c72be6456606e945e0e16/sources/api/pluto/src/main.rs#L316-L332

@cartermckinnon
Contributor

cartermckinnon commented Aug 29, 2023

@raonitimo I want to make sure we choose the right timeout value, so I need to track down a recent occurrence of this issue in the EC2 backend. Can you share some instance IDs? If you want to open a case with AWS Support, I can track it down 👍 .

@raonitimo
Author

> @raonitimo I want to make sure we choose the right timeout value, so I need to track down a recent occurrence of this issue in the EC2 backend. Can you share some instance ID's? If you want to open a case with AWS Support, I can track it down 👍 .

Sorry @cartermckinnon, I don't have a recent instance ID. When I get one, I'll raise a case with support and ping you.

@cartermckinnon
Contributor

We've patched in handling for this in the EKS kubelet builds, so I'm going to close this. I think a proper fix is to remove usage of the PrivateDnsName altogether, which I'm scoping for a future EKS release.

/close

@k8s-ci-robot
Contributor

@cartermckinnon: Closing this issue.

In response to this:

> We've patched in handling for this in the EKS kubelet builds, so going to close this. I think a proper fix is to remove usage of the PrivateDnsName altogether, which I'm scoping for a future EKS release.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
