
✨ Add separate eks kubeconfig secret keys for the cluster-autoscaler #4648

Merged

Conversation

cnmcavoy
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it: Cluster Autoscaler cannot mount and consume the Cluster API kubeconfig because the secret contents are refreshed every ten minutes, and no API machinery exists to safely reload a kubeconfig.

Initially, I attempted to solve this in the Cluster Autoscaler: kubernetes/autoscaler#4784 - However, after meeting with SIG API Machinery on Nov 1 2023, the SIG cautioned against this approach and advised splitting the token out of the kubeconfig, since existing machinery can reload a token-file auth. With this approach, no change to the Cluster Autoscaler is needed; users only need to update their Cluster Autoscaler configuration to use the correct secret file from their secret volume mount.
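For illustration, the resulting split roughly looks like the sketch below. This is a hypothetical example, not the provider's exact output; the cluster name, server URL, and CA data are placeholders:

```yaml
# Sketch of the cluster-autoscaler-value kubeconfig: instead of embedding
# the token, it references a token file by relative path (placeholder names).
apiVersion: v1
kind: Config
clusters:
- name: my-eks-cluster            # placeholder
  cluster:
    server: https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com  # placeholder
    certificate-authority-data: <base64 CA>                  # placeholder
users:
- name: my-eks-cluster-user       # placeholder
  user:
    # Relative path: assumes the token-file key from the same Secret is
    # mounted alongside this kubeconfig in the same directory.
    tokenFile: ./cluster-autoscaler-token-file
contexts:
- name: my-eks-cluster
  context:
    cluster: my-eks-cluster
    user: my-eks-cluster-user
current-context: my-eks-cluster
```

Because the credential lives in its own file referenced via `tokenFile`, the existing client-go token-file reload path picks up rotated tokens without restarting the client.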

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #4607

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Add separate eks kubeconfig secret keys for the cluster-autoscaler to support refreshing the token automatically. See the eks kubeconfig documentation for more info.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 21, 2023
@cnmcavoy cnmcavoy marked this pull request as draft November 21, 2023 19:54
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 21, 2023
@cnmcavoy
Contributor Author

cnmcavoy commented Nov 21, 2023

This is ready to be reviewed, but I haven't had an opportunity to test the full setup with this change plus cluster autoscaler end to end yet, so I've marked it as a draft until I verify it.

I have deployed this change in Indeed's clusters and in my testing, the official release of v1.27 cluster autoscaler was able to refresh the bearer token file in EKS clusters.
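The refresh works because the client re-reads the bearer token from disk instead of caching an embedded credential. A minimal Python sketch of that idea (client-go's actual implementation caches and refreshes the token on a period; this only illustrates the re-read-on-use pattern, with placeholder token values):

```python
import tempfile

def make_token_loader(path):
    """Return a function that re-reads the bearer token from disk on each
    call, mimicking (in simplified form) client-go's token-file reload."""
    def load():
        with open(path) as f:
            return f.read().strip()
    return load

# Simulate the mounted secret key on disk.
tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix="-token")
tmp.write("token-v1")
tmp.close()

get_token = make_token_loader(tmp.name)
print(get_token())  # token-v1

# Simulate the controller rotating the secret contents.
with open(tmp.name, "w") as f:
    f.write("token-v2")

print(get_token())  # token-v2 -- picked up without restarting the client
```

An embedded token in a kubeconfig has no such re-read point, which is why the separate token file is needed.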

@cnmcavoy cnmcavoy force-pushed the eks-cluster-autoscaler-secret branch 2 times, most recently from 1533ae0 to c43f7cc Compare November 21, 2023 20:33
@cnmcavoy cnmcavoy marked this pull request as ready for review November 27, 2023 22:16
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 27, 2023
@cnmcavoy cnmcavoy force-pushed the eks-cluster-autoscaler-secret branch from c43f7cc to c4a814a Compare November 27, 2023 22:26
@cnmcavoy cnmcavoy changed the title Add separate eks kubeconfig secret keys for the cluster-autoscaler ✨ Add separate eks kubeconfig secret keys for the cluster-autoscaler Nov 27, 2023
Comment on lines 41 to 43
| value | contains a complete kubeconfig with the cluster admin user and token embedded |
| cluster-autoscaler-value | contains a kubeconfig with the cluster admin user, referencing the token file by a relative path; assumes you mount all the secret keys in the same directory |
| cluster-autoscaler-token-file | contains the same token embedded in the complete kubeconfig, separated into its own file so that existing API machinery can reload the token file when the secret is updated |
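To consume these keys, a cluster-autoscaler Deployment can mount the whole Secret into one directory and point at the relative kubeconfig. A hypothetical excerpt (the secret name, mount path, and image tag are placeholders; `--kubeconfig` is an existing cluster-autoscaler flag):

```yaml
# Hypothetical excerpt from a cluster-autoscaler Deployment spec
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0  # placeholder tag
  command:
  - /cluster-autoscaler
  # Use the kubeconfig that references the token file by relative path.
  - --kubeconfig=/etc/kubeconfig/cluster-autoscaler-value
  volumeMounts:
  - name: kubeconfig
    mountPath: /etc/kubeconfig   # all secret keys land in the same directory
    readOnly: true
volumes:
- name: kubeconfig
  secret:
    secretName: my-cluster-kubeconfig   # placeholder name
```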


Since there are more use cases than just cluster-autoscaler, wdyt about using descriptive names like embedded, relative, and single-file?

Contributor Author


These names are good, but changing the name of value means we have a different Secret shape in EKS clusters than self-managed clusters. I'll update the other two, but I am not sure changing the name of value is worth it, as it will be surprising to users.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 2, 2024
@vincepri
Member

vincepri commented Jan 2, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 2, 2024
@cnmcavoy cnmcavoy force-pushed the eks-cluster-autoscaler-secret branch 2 times, most recently from 2a13c53 to f57a8ae Compare January 12, 2024 19:58
@cnmcavoy cnmcavoy force-pushed the eks-cluster-autoscaler-secret branch from f57a8ae to 643f8eb Compare April 17, 2024 21:57
@cnmcavoy cnmcavoy requested a review from mloiseleur April 17, 2024 22:23
@mloiseleur

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2024
@richardcase
Member

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@richardcase
Member

It would be good to get cluster autoscaler added to our e2e tests. Let's create an issue to follow up on this.

From my side this looks good:

/approve

When the e2e passes we can unhold and merge:

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 19, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardcase

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 19, 2024
@cnmcavoy cnmcavoy force-pushed the eks-cluster-autoscaler-secret branch from 643f8eb to c50ba26 Compare April 19, 2024 16:09
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@cnmcavoy
Contributor Author

/retest-required

@cnmcavoy
Contributor Author

New changes are detected. LGTM label has been removed.

Only a rebase.

@mloiseleur

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 19, 2024
@mloiseleur

You will have to fix the test first.
/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 20, 2024
@cnmcavoy
Contributor Author

/test pull-cluster-api-provider-aws-e2e-eks

@cnmcavoy
Contributor Author

@mloiseleur it seems like the test failure is a flake? I pushed a commit that simply exposes the error being suppressed in CloudFormation, and now it passes. My new commit should not have fixed any tests.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2024
@mloiseleur

mloiseleur commented Jul 22, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 22, 2024
@mloiseleur

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2024
@mloiseleur

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2024
@k8s-ci-robot k8s-ci-robot merged commit c3c5a53 into kubernetes-sigs:main Jul 22, 2024
20 checks passed
@mloiseleur

mloiseleur commented Aug 7, 2024

@cnmcavoy I updated to v2.6.1 this morning and I saw an error in the logs about ownership of the kubeconfig secret for all control planes.

E0807 08:02:32.550249       1 controller.go:329] "Reconciler error" 
err="failed to reconcile control plane for AWSManagedControlPlane flux-system/xxxxxxxxx-cp: 
 failed reconciling kubeconfig: updating kubeconfig secret: 
  EKS kubeconfig flux-system/xxxxxxxxx-kubeconfig missing expected AWSManagedControlPlane ownership" 
controller="awsmanagedcontrolplane"
controllerGroup="controlplane.cluster.x-k8s.io"
controllerKind="AWSManagedControlPlane" 
AWSManagedControlPlane="flux-system/xxxxxxxxx-cp"
namespace="flux-system"
name="xxxxxxxxx-cp"
reconcileID="c634fd01-5c99-4947-9ff7-3297fcaff97c"

Did you encounter this issue when you tested on your side?

EDIT: On my side, I fixed it by deleting the secret. The new secret was created with the expected ownership and it got back on its feet.

@cnmcavoy
Contributor Author

cnmcavoy commented Aug 7, 2024

@cnmcavoy I updated to v2.6.1 this morning and I see in the logs an error about ownership of the kubeconfig secret in the log for all control plane.

The secret was created with an owner reference in the previous releases as well. Previously the shape of the secret was assumed; the new behavior you encountered is that the controller now checks whether it owns the secret before operating on it. Deleting and recreating it was also going to be my suggestion, but the owner reference should already exist unless the secret was not created by the controller. My suspicion is that Flux or some other system created the secret, so CAPA didn't set the owner reference.
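For reference, the ownership check expects an owner reference like the following on the kubeconfig Secret (a sketch based on the error message above; the apiVersion may differ by CAPA release, and the name and uid are placeholders):

```yaml
# Hypothetical metadata excerpt from the kubeconfig Secret
metadata:
  name: xxxxxxxxx-kubeconfig
  namespace: flux-system
  ownerReferences:
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta2   # may vary by release
    kind: AWSManagedControlPlane
    name: xxxxxxxxx-cp
    uid: 00000000-0000-0000-0000-000000000000           # placeholder
```

If the Secret was created by another system, this reference is absent and the controller refuses to update it, producing the "missing expected AWSManagedControlPlane ownership" error.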
