
Updating MachinePool, AWSMachinePool, and KubeadmConfig resources does not trigger an ASG instanceRefresh #4071

Closed
wmgroot opened this issue Feb 15, 2023 · 17 comments
Labels
kind/bug, needs-priority, needs-triage

Comments

@wmgroot

wmgroot commented Feb 15, 2023

/kind bug

What steps did you take and what happened:
Updating the MachinePool, AWSMachinePool, or KubeadmConfig resources does not trigger an instanceRefresh on the AWS ASG. Currently, I must manually start instance refreshes using the AWS UI or CLI in order for instances to be replaced when my specs change.

What did you expect to happen:
With awsmachinepool.spec.refreshPreferences.disable left at its default value of false, I expect that changes to the MachinePool, AWSMachinePool, and KubeadmConfig would automatically trigger an instance refresh and rotate the nodes in the pool onto the updated settings.

These are the MachinePool, AWSMachinePool, and KubeadmConfig I'm working with.

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  annotations:
    cluster.x-k8s.io/replicas-managed-by: external-autoscaler
  name: worker-private-efficient
spec:
  clusterName: awscmhdev2
  replicas: 2
  template:
    metadata: {}
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: worker-private-efficient
      clusterName: awscmhdev2
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachinePool
        name: worker-private-efficient
      version: v1.24.7
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachinePool
metadata:
  annotations:
    cluster-api-provider-aws: "true"
  name: worker-private-efficient
spec:
  additionalTags:
    k8s.io/cluster-autoscaler/awscmhdev2: owned
    k8s.io/cluster-autoscaler/enabled: "true"
  awsLaunchTemplate:
    additionalSecurityGroups:
    - filters:
      - name: tag:Name
        values:
        - capi-hostports
      - name: tag:network-zone
        values:
        - qa
      - name: tag:region
        values:
        - us-east-2
    ami:
      id: ami-0f6b6efcd422b9d85
    iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
    instanceType: c5a.2xlarge
    rootVolume:
      deviceName: /dev/sda1
      encrypted: true
      iops: 16000
      size: 100
      throughput: 1000
      type: gp3
    sshKeyName: ""
  capacityRebalance: true
  defaultCoolDown: 5m0s
  maxSize: 5
  minSize: 0
  mixedInstancesPolicy:
    overrides:
    - instanceType: c5a.2xlarge
    - instanceType: m5a.2xlarge
  subnets:
  - filters:
    - name: availability-zone-id
      values:
      - use2-az1
    - name: tag:type
      values:
      - nodes
  - filters:
    - name: availability-zone-id
      values:
      - use2-az2
    - name: tag:type
      values:
      - nodes
  - filters:
    - name: availability-zone-id
      values:
      - use2-az3
    - name: tag:type
      values:
      - nodes
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
metadata:
  name: worker-private-efficient
spec:
  files:
  - content: |
      vm.max_map_count=262144
    path: /etc/sysctl.d/90-vm-max-map-count.conf
  - content: |
      fs.inotify.max_user_instances=256
    path: /etc/sysctl.d/91-fs-inotify.conf
  format: cloud-config
  joinConfiguration:
    nodeRegistration:
      kubeletExtraArgs:
        cloud-provider: aws
        eviction-hard: memory.available<500Mi,nodefs.available<10%
        kube-reserved: cpu=500m,memory=2Gi,ephemeral-storage=1Gi
        node-labels: role.node.kubernetes.io/worker=true
        protect-kernel-defaults: "true"
        system-reserved: cpu=500m,memory=1Gi,ephemeral-storage=1Gi
      name: '{{ ds.meta_data.local_hostname }}'
  preKubeadmCommands:
  - sudo systemctl restart systemd-sysctl

I have not set disable: true in my refreshPreferences in the AWSMachinePool spec.

$ kubectl explain awsmachinepool.spec.refreshPreferences.disable
KIND:     AWSMachinePool
VERSION:  infrastructure.cluster.x-k8s.io/v1beta2

FIELD:    disable <boolean>

DESCRIPTION:
     Disable, if true, disables instance refresh from triggering when new launch
     templates are detected. This is useful in scenarios where ASG nodes are
     externally managed.

This is the current state of the runcmd in the LaunchTemplate in AWS, version 1566.

runcmd:
  - "sudo systemctl restart systemd-sysctl"
  - kubeadm join --config /run/kubeadm/kubeadm-join-config.yaml  && echo success > /run/cluster-api/bootstrap-success.complete

I apply a change to add a command to the KubeadmConfig, such as this.

  preKubeadmCommands:
  - sudo systemctl restart systemd-sysctl
  - echo "hello world"

I see that the LaunchTemplate has a new version, and wait for 10 minutes.

I notice that there is no active instance refresh started for my ASG in the instance refresh tab, and my instances are still using the old LaunchTemplate version.
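
In the meantime, the only way to roll the nodes is to start the refresh manually (AWS console, CLI, or a small script). For reference, a minimal sketch of that manual workaround using the AWS SDK for Go v2 – the ASG name and refresh preferences below are only illustrative, the real ASG name is generated by CAPA:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

func main() {
	ctx := context.Background()

	// Uses the usual credential/region chain (env vars, shared config, IAM role).
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	asgClient := autoscaling.NewFromConfig(cfg)

	// The ASG name below is illustrative; CAPA derives the real name from the AWSMachinePool.
	out, err := asgClient.StartInstanceRefresh(ctx, &autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String("worker-private-efficient"),
		Strategy:             types.RefreshStrategyRolling,
		Preferences: &types.RefreshPreferences{
			MinHealthyPercentage: aws.Int32(90),
			InstanceWarmup:       aws.Int32(300),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("started instance refresh:", aws.ToString(out.InstanceRefreshId))
}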

Environment:

  • Cluster-api-provider-aws version: v2.0.2
  • Kubernetes version: (use kubectl version):
$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.25.1
Kustomize Version: v4.5.7
Server Version: v1.24.7
@k8s-ci-robot added the kind/bug, needs-priority, and needs-triage labels on Feb 15, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fiunchinho
Contributor

We have experienced the same behavior using CAPA v1.5.2.

@AndiDog
Contributor

AndiDog commented Apr 13, 2023

In the current code, this seems to be on purpose. There are several comments, even in the machine pool controller, explicitly saying that changed user data should lead to a new launch template version but not an instance refresh. I confirmed that it currently behaves this way.

#2354 explicitly turned off instance refresh when only the user data changed. That fix ensures the bootstrap token gets updated by automatically creating a new launch template version.

If we change the controller to trigger instance refresh, we would most likely see that nodes get rolled over every DefaultTokenTTL = 15 * time.Minute (for some reason it's 10 minutes in my case 🤷) because whenever the bootstrap token gets refreshed, the user data changes to include the new token value:

-   path: /run/kubeadm/kubeadm-join-config.yaml
    owner: root:root
    permissions: '0640'
    content: |
      ---
      apiVersion: kubeadm.k8s.io/v1beta3
      discovery:
        bootstrapToken:
          apiServerEndpoint: blabla-apiserver.example.com:6443
          caCertHashes:
          - sha256:45db4ee54dc2ea64cd30b4066fb4ccd17e5d871ac2cffc43932f984f3e76b22a
          token: 1234u7.abcd4lsoz1rujklo    # <======================= changes frequently

# [...]

What we could do is diff the user data with the token value excluded, and trigger an instance refresh only if there is an effective change.
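
A minimal sketch of that idea (the helper names and the exact shape of the rendered user data are assumptions, not the final implementation):

package userdata

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
)

// tokenRe matches the bootstrap token line in the rendered cloud-config
// (e.g. "token: 1234u7.abcd4lsoz1rujklo").
var tokenRe = regexp.MustCompile(`(?m)^(\s*token:\s*)\S+`)

// hashIgnoringToken hashes the user data with the bootstrap token blanked out,
// so that two payloads differing only in the token compare as equal.
func hashIgnoringToken(userData []byte) string {
	normalized := tokenRe.ReplaceAll(userData, []byte("${1}REDACTED"))
	sum := sha256.Sum256(normalized)
	return hex.EncodeToString(sum[:])
}

// needsInstanceRefresh reports whether the user data changed in a way that is
// more than just a rotated bootstrap token.
func needsInstanceRefresh(oldUserData, newUserData []byte) bool {
	return hashIgnoringToken(oldUserData) != hashIgnoringToken(newUserData)
}

The controller could then still bump the launch template version on every user data change, but only call StartInstanceRefresh when the diff is more than a rotated token.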

@AndiDog
Contributor

AndiDog commented Apr 24, 2023

Since we agreed in chat to try the above suggestion, I gave it a shot. A draft implementation is ready, and I will open a PR once it has been tested in a real scenario.

@Skarlso
Contributor

Skarlso commented Apr 25, 2023

Sounds good.

@AndiDog
Contributor

AndiDog commented Apr 25, 2023

My fix is working, but any change to the KubeadmConfig (such as adding something to spec.files) still only gets reconciled once the bootstrap token needs to be refreshed. From what I see, only then does func (r *KubeadmConfigReconciler) rotateMachinePoolBootstrapToken update the user data secret, which in turn leads to machine pool nodes getting rolled out (on a relevant change, i.e. not only the bootstrap token).

I will invest some more time on this, since users expect that any changes (or rollbacks in the heat of an incident!) get applied immediately and not only once the bootstrap token gets refreshed (default: 15/2 = 7.5 minutes). We may need a patch for CAPI. And even if the secret gets updated immediately, I'm not sure whether the AWSMachinePool controller reconciliation gets triggered by the secret change – so there might be another delay to address once the first one is fixed.
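
If the controller indeed doesn't react to the secret change, one option would be an extra watch on the bootstrap data Secrets – a rough sketch assuming controller-runtime >= v0.15, with a hypothetical stand-in reconciler and a placeholder mapping function (the real CAPA controller setup may differ):

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	expinfrav1 "sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2"
)

// secretWatchingReconciler is a stand-in for the AWSMachinePool reconciler,
// shown only to illustrate the additional watch.
type secretWatchingReconciler struct {
	client.Client
}

func (r *secretWatchingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The real reconciliation logic lives in the AWSMachinePool controller.
	return ctrl.Result{}, nil
}

// SetupWithManager additionally watches bootstrap data Secrets so that an updated
// secret enqueues the owning AWSMachinePool right away instead of waiting for a resync.
func (r *secretWatchingReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&expinfrav1.AWSMachinePool{}).
		Watches(&corev1.Secret{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				// Hypothetical mapping: resolve the owning AWSMachinePool from the
				// cluster.x-k8s.io labels that CAPI sets on bootstrap data secrets.
				return nil
			})).
		Complete(r)
}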

@Skarlso
Contributor

Skarlso commented Apr 27, 2023

I wonder what the backstory around this is and why it's working the way it does. I'm pretty sure there are some valid reasons somewhere in there... :D

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 19, 2024
@AndiDog
Contributor

AndiDog commented Jan 19, 2024

/remove-lifecycle stale

PR will be merged soon to fix this.

@k8s-ci-robot removed the lifecycle/stale label on Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned on Jun 17, 2024
@AndiDog
Contributor

AndiDog commented Sep 4, 2024

/reopen
/remove-lifecycle rotten

@k8s-ci-robot reopened this on Sep 4, 2024
@k8s-ci-robot
Contributor

@AndiDog: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot removed the lifecycle/rotten label on Sep 4, 2024
@AndiDog
Contributor

AndiDog commented Sep 4, 2024

This was actually fixed via #4619.

/close

@k8s-ci-robot
Contributor

@AndiDog: Closing this issue.

In response to this:

This was actually fixed via #4619.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
