Some nodes fail to join the cluster because kubelet determines node name to be "" (empty string). #635
Comments
After restarting kubelet, it shows:
It still fails to register the node. I see the error:
It looks like the kubelet is not able to use the correct user that should be |
I ran the bootstrap.sh script again and restarted the kubelet, and then the node joined the cluster.
An AWS support engineer linked me to this: kubernetes/kubernetes#118421. So I guess the in-tree code is still used in 1.25? Anyway, is the code being kept in sync with the fixes?
Yes, that's true. The switch happens with 1.27.
Can you explain what you meant by it?
Looks legit.
/triage accepted
Yes, it would be merged in this repo as far as I can tell. @cartermckinnon can comment otherwise.
That PR wouldn’t help here, because kubelet doesn’t use this code. It has to be merged into the legacy in-tree AWS cloud provider in versions prior to 1.27. I haven’t gotten much traction on that PR, so please bump it if this is a blocker for you. 😌
I’ll go ahead and get this patched in the EKS kubelet builds, at least, because we’ll be supporting 1.26 for a while.
We’re handling the PrivateDnsName quirks in 1.27+ with a hostname override: https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh#L536-537
But we still need to address the eventual consistency issue. I’ll put up a PR for that. The proper fix will be in the aws-iam-authenticator, I think.
Hey @cartermckinnon, thanks for responding.
IIUC, this override won't fix this particular issue because the DescribeInstances call doesn't fail. It just returns an empty string.
Nice. I don't understand how the aws-iam-authenticator is related. Keen to see the PR and understand it. Please link it here. Thanks!
Correct -- I just meant to point out how we're achieving the behavior (
On EKS, the |
Hey @cartermckinnon, this problem appeared again. Do you have any rough timeline for the fix? Or any pointers on how this should be fixed so someone can contribute?
I'll reach out to the Bottlerocket folks to see what a fix looks like on their end for 1.27+.
Edit: looks like we'll need some handling here: https://github.com/bottlerocket-os/bottlerocket/blob/dea2c11949a95e914b3c72be6456606e945e0e16/sources/api/pluto/src/main.rs#L316-L332
@raonitimo I want to make sure we choose the right timeout value, so I need to track down a recent occurrence of this issue in the EC2 backend. Can you share some instance IDs? If you want to open a case with AWS Support, I can track it down 👍.
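For readers following along, the kind of handling discussed above (a `DescribeInstances` call that succeeds but returns an empty `PrivateDnsName`, retried until a timeout) can be sketched roughly as follows. This is only an illustration, not the code that went into the EKS kubelet builds: the function name `wait_for_private_dns_name`, the injected `fetch_private_dns_name` callable, and the 30-second default are all assumptions made for the example, and the thread does not state the actual timeout.

```python
import time
from typing import Callable, Optional


def wait_for_private_dns_name(
    fetch_private_dns_name: Callable[[], Optional[str]],
    timeout_seconds: float = 30.0,   # placeholder value, not from the thread
    interval_seconds: float = 1.0,   # placeholder value, not from the thread
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the fetcher returns a non-empty name or the timeout elapses.

    An empty string (or None) is treated as "not yet consistent" rather than
    as a valid node name, which is the failure mode described in this issue.
    """
    deadline = clock() + timeout_seconds
    while True:
        name = fetch_private_dns_name()
        if name:
            return name
        if clock() >= deadline:
            raise TimeoutError(
                f"PrivateDnsName still empty after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)
```

With boto3, the fetcher could be a small wrapper around `describe_instances(InstanceIds=[...])` that reads `PrivateDnsName` from the first instance in the response. Injecting the fetcher, sleep, and clock keeps the retry logic testable without calling EC2.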
Sorry @cartermckinnon, I haven't got a recent instance ID. When I get one, I'll raise a case with support and ping you.
We've patched in handling for this in the EKS kubelet builds, so going to close this. I think a proper fix is to remove usage of the
/close
@cartermckinnon: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
Some nodes fail to join the cluster; the kubelet logs have
Node events show
No other errors logged.
What you expected to happen:
Kubelet would correctly figure out the node name.
How to reproduce it (as minimally and precisely as possible):
It doesn't happen all the time and I can't correlate with anything. It's happened across different EKS clusters across different AWS accounts.
I can see two DescribeInstances API calls in the CloudTrail event history within the same second, at "2023-08-03T19:00:56Z", just like an instance that successfully joined the cluster.
Anything else we need to know?:
Environment:
kubectl version: 1.25.11
uname -a: 5.10.184-175.731.amzn2.x86_64
Happy to provide more context and logs. The instance is still around.
/kind bug