
Updating MachinePool, AWSMachinePool, and KubeadmConfig resources does not trigger an ASG instanceRefresh #4071

Closed
wmgroot opened this issue Feb 15, 2023 · 17 comments
Labels
kind/bug, needs-priority, needs-triage

Comments

@wmgroot

wmgroot commented Feb 15, 2023

/kind bug

What steps did you take and what happened:
Updating the MachinePool, AWSMachinePool, or KubeadmConfig resources does not trigger an instanceRefresh on the AWS ASG. Currently, I must manually start instance refreshes using the AWS UI or CLI in order for instances to be replaced when my specs change.

What did you expect to happen:
With awsmachinepool.spec.refreshPreferences.disable left at its default value of false, I expect that changes to the MachinePool, AWSMachinePool, and KubeadmConfig would automatically trigger an instance refresh and rotate the nodes in the pool onto the updated settings.

These are the MachinePool, AWSMachinePool, and KubeadmConfig I'm working with.

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  annotations:
    cluster.x-k8s.io/replicas-managed-by: external-autoscaler
  name: worker-private-efficient
spec:
  clusterName: awscmhdev2
  replicas: 2
  template:
    metadata: {}
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: worker-private-efficient
      clusterName: awscmhdev2
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachinePool
        name: worker-private-efficient
      version: v1.24.7
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachinePool
metadata:
  annotations:
    cluster-api-provider-aws: "true"
  name: worker-private-efficient
spec:
  additionalTags:
    k8s.io/cluster-autoscaler/awscmhdev2: owned
    k8s.io/cluster-autoscaler/enabled: "true"
  awsLaunchTemplate:
    additionalSecurityGroups:
    - filters:
      - name: tag:Name
        values:
        - capi-hostports
      - name: tag:network-zone
        values:
        - qa
      - name: tag:region
        values:
        - us-east-2
    ami:
      id: ami-0f6b6efcd422b9d85
    iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
    instanceType: c5a.2xlarge
    rootVolume:
      deviceName: /dev/sda1
      encrypted: true
      iops: 16000
      size: 100
      throughput: 1000
      type: gp3
    sshKeyName: ""
  capacityRebalance: true
  defaultCoolDown: 5m0s
  maxSize: 5
  minSize: 0
  mixedInstancesPolicy:
    overrides:
    - instanceType: c5a.2xlarge
    - instanceType: m5a.2xlarge
  subnets:
  - filters:
    - name: availability-zone-id
      values:
      - use2-az1
    - name: tag:type
      values:
      - nodes
  - filters:
    - name: availability-zone-id
      values:
      - use2-az2
    - name: tag:type
      values:
      - nodes
  - filters:
    - name: availability-zone-id
      values:
      - use2-az3
    - name: tag:type
      values:
      - nodes
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfig
metadata:
  name: worker-private-efficient
spec:
  files:
  - content: |
      vm.max_map_count=262144
    path: /etc/sysctl.d/90-vm-max-map-count.conf
  - content: |
      fs.inotify.max_user_instances=256
    path: /etc/sysctl.d/91-fs-inotify.conf
  format: cloud-config
  joinConfiguration:
    nodeRegistration:
      kubeletExtraArgs:
        cloud-provider: aws
        eviction-hard: memory.available<500Mi,nodefs.available<10%
        kube-reserved: cpu=500m,memory=2Gi,ephemeral-storage=1Gi
        node-labels: role.node.kubernetes.io/worker=true
        protect-kernel-defaults: "true"
        system-reserved: cpu=500m,memory=1Gi,ephemeral-storage=1Gi
      name: '{{ ds.meta_data.local_hostname }}'
  preKubeadmCommands:
  - sudo systemctl restart systemd-sysctl

I have not set disable: true in my refreshPreferences in the AWSMachinePool spec.

$ kubectl explain awsmachinepool.spec.refreshPreferences.disable
KIND:     AWSMachinePool
VERSION:  infrastructure.cluster.x-k8s.io/v1beta2

FIELD:    disable <boolean>

DESCRIPTION:
     Disable, if true, disables instance refresh from triggering when new launch
     templates are detected. This is useful in scenarios where ASG nodes are
     externally managed.

This is the current state of the runcmd in the LaunchTemplate in AWS, version 1566.

runcmd:
  - "sudo systemctl restart systemd-sysctl"
  - kubeadm join --config /run/kubeadm/kubeadm-join-config.yaml  && echo success > /run/cluster-api/bootstrap-success.complete

I apply a change to add a command to the KubeadmConfig, such as this.

  preKubeadmCommands:
  - sudo systemctl restart systemd-sysctl
  - echo "hello world"

I see that the LaunchTemplate has a new version, and wait for 10 minutes.

I notice that there is no active instance refresh started for my ASG in the instance refresh tab, and my instances are still using the old LaunchTemplate version.
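
In the meantime, the only way to roll the nodes is to start the refresh manually (AWS console, CLI, or a small script). For reference, a minimal sketch of that manual workaround using the AWS SDK for Go v2 – the ASG name and refresh preferences below are only illustrative, the real ASG name is generated by CAPA:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling/types"
)

func main() {
	ctx := context.Background()

	// Uses the usual credential/region chain (env vars, shared config, IAM role).
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	asgClient := autoscaling.NewFromConfig(cfg)

	// The ASG name below is illustrative; CAPA derives the real name from the AWSMachinePool.
	out, err := asgClient.StartInstanceRefresh(ctx, &autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String("worker-private-efficient"),
		Strategy:             types.RefreshStrategyRolling,
		Preferences: &types.RefreshPreferences{
			MinHealthyPercentage: aws.Int32(90),
			InstanceWarmup:       aws.Int32(300),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("started instance refresh:", aws.ToString(out.InstanceRefreshId))
}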

Environment:

  • Cluster-api-provider-aws version: v2.0.2
  • Kubernetes version: (use kubectl version):
$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.25.1
Kustomize Version: v4.5.7
Server Version: v1.24.7
@k8s-ci-robot added the kind/bug, needs-priority, and needs-triage labels on Feb 15, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fiunchinho
Contributor

We have experienced the same behavior using CAPA v1.5.2.

@AndiDog
Contributor

AndiDog commented Apr 13, 2023

In the current code, this seems to be on purpose. There are several comments, even in the machine pool controller, explicitly saying that changed user data should lead to a new launch template version but not an instance refresh. I confirmed that it currently behaves this way.

#2354 explicitly turned off instance refresh when only the user data changed. That fix ensures the bootstrap token gets updated by automatically creating a new launch template version.

If we change the controller to trigger instance refresh, we would most likely see that nodes get rolled over every DefaultTokenTTL = 15 * time.Minute (for some reason it's 10 minutes in my case 🤷) because whenever the bootstrap token gets refreshed, the user data changes to include the new token value:

-   path: /run/kubeadm/kubeadm-join-config.yaml
    owner: root:root
    permissions: '0640'
    content: |
      ---
      apiVersion: kubeadm.k8s.io/v1beta3
      discovery:
        bootstrapToken:
          apiServerEndpoint: blabla-apiserver.example.com:6443
          caCertHashes:
          - sha256:45db4ee54dc2ea64cd30b4066fb4ccd17e5d871ac2cffc43932f984f3e76b22a
          token: 1234u7.abcd4lsoz1rujklo    # <======================= changes frequently

# [...]

What we could do is diff the user data with the token value excluded, and trigger an instance refresh only if there is an effective change.
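
A minimal sketch of that idea (the helper names and the exact shape of the rendered user data are assumptions, not the final implementation):

package userdata

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
)

// tokenRe matches the bootstrap token line in the rendered cloud-config
// (e.g. "token: 1234u7.abcd4lsoz1rujklo").
var tokenRe = regexp.MustCompile(`(?m)^(\s*token:\s*)\S+`)

// hashIgnoringToken hashes the user data with the bootstrap token blanked out,
// so that two payloads differing only in the token compare as equal.
func hashIgnoringToken(userData []byte) string {
	normalized := tokenRe.ReplaceAll(userData, []byte("${1}REDACTED"))
	sum := sha256.Sum256(normalized)
	return hex.EncodeToString(sum[:])
}

// needsInstanceRefresh reports whether the user data changed in a way that is
// more than just a rotated bootstrap token.
func needsInstanceRefresh(oldUserData, newUserData []byte) bool {
	return hashIgnoringToken(oldUserData) != hashIgnoringToken(newUserData)
}

The controller could then still bump the launch template version on every user data change, but only call StartInstanceRefresh when the diff is more than a rotated token.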

@AndiDog
Contributor

AndiDog commented Apr 24, 2023

Since we agreed in chat to try the above suggestion, I gave it a shot. A draft implementation is ready, and I will open a PR once it has been tested in a real scenario.

@Skarlso
Contributor

Skarlso commented Apr 25, 2023

Sounds good.

@AndiDog
Contributor

AndiDog commented Apr 25, 2023

My fix is working, but any change to the KubeadmConfig (such as adding something to spec.files) still only gets reconciled once the bootstrap token needs to be refreshed. From what I see, only then does func (r *KubeadmConfigReconciler) rotateMachinePoolBootstrapToken update the user data secret, which in turn leads to machine pool nodes getting rolled out (on a relevant change, i.e. not only the bootstrap token).

I will invest some more time on this, since users expect that any changes (or rollbacks in the heat of an incident!) get applied immediately and not only once the bootstrap token gets refreshed (default: 15/2 = 7.5 minutes). We may need a patch for CAPI. And even if the secret gets updated immediately, I'm not sure whether the AWSMachinePool controller reconciliation gets triggered by the secret change – so there might be another delay to address once the first one is fixed.
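
If the controller indeed doesn't react to the secret change, one option would be an extra watch on the bootstrap data Secrets – a rough sketch assuming controller-runtime >= v0.15, with a hypothetical stand-in reconciler and a placeholder mapping function (the real CAPA controller setup may differ):

package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	expinfrav1 "sigs.k8s.io/cluster-api-provider-aws/v2/exp/api/v1beta2"
)

// secretWatchingReconciler is a stand-in for the AWSMachinePool reconciler,
// shown only to illustrate the additional watch.
type secretWatchingReconciler struct {
	client.Client
}

func (r *secretWatchingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The real reconciliation logic lives in the AWSMachinePool controller.
	return ctrl.Result{}, nil
}

// SetupWithManager additionally watches bootstrap data Secrets so that an updated
// secret enqueues the owning AWSMachinePool right away instead of waiting for a resync.
func (r *secretWatchingReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&expinfrav1.AWSMachinePool{}).
		Watches(&corev1.Secret{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, obj client.Object) []reconcile.Request {
				// Hypothetical mapping: resolve the owning AWSMachinePool from the
				// cluster.x-k8s.io labels that CAPI sets on bootstrap data secrets.
				return nil
			})).
		Complete(r)
}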

@Skarlso
Contributor

Skarlso commented Apr 27, 2023

I wonder what the backstory around this is and why it's working the way it does. I'm pretty sure there are some valid reasons somewhere in there... :D

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 19, 2024
@AndiDog
Contributor

AndiDog commented Jan 19, 2024

/remove-lifecycle stale

PR will be merged soon to fix this.

@k8s-ci-robot removed the lifecycle/stale label on Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned on Jun 17, 2024
@AndiDog
Contributor

AndiDog commented Sep 4, 2024

/reopen
/remove-lifecycle rotten

@k8s-ci-robot reopened this on Sep 4, 2024
@k8s-ci-robot
Contributor

@AndiDog: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot removed the lifecycle/rotten label on Sep 4, 2024
@AndiDog
Contributor

AndiDog commented Sep 4, 2024

This was actually fixed via #4619.

/close

@k8s-ci-robot
Contributor

@AndiDog: Closing this issue.

In response to this:

This was actually fixed via #4619.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
