
EKSA bare metal cluster scale-in doesn't honor new hardware.csv file #8190

Closed
ygao-armada opened this issue May 21, 2024 · 5 comments
Labels: ack Issue has been acknowledged

ygao-armada commented May 21, 2024

What happened:
In an EKS-A bare metal cluster, I tried to scale the cluster down by one worker node: I removed a specific worker node from the hardware.csv file and ran the following command:
eksctl anywhere upgrade cluster -f eksa-new.yaml --hardware-csv hardware-new.csv
However, the node that actually gets removed is not necessarily the one I deleted from the CSV.

What you expected to happen:
The intended worker node is removed.

How to reproduce it (as minimally and precisely as possible):
Create an EKS-A bare metal cluster with 2 worker nodes, then scale in by removing one worker node from hardware.csv, reducing the worker node count in the cluster spec by 1, and running the upgrade command above.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.18.7
  • EKS Distro Release:
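
For anyone debugging this, a quick way to see which Cluster API machine each hardware entry is backing before the scale-in (a sketch, assuming the management cluster uses EKS Anywhere's default eksa-system namespace):

# List the Tinkerbell Hardware objects that were registered from hardware.csv
kubectl get hardware.tinkerbell.org -n eksa-system

# List the Cluster API machines and the nodes they back
kubectl get machines.cluster.x-k8s.io -n eksa-system -o wide

Comparing the two outputs shows which machine, and therefore which node, corresponds to the hardware row being removed.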
jiayiwang7 (Member) commented

eksctl anywhere upgrade -f hardware-new.csv

This is not the right upgrade command

https://anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/#upgrade-cluster-command

eksctl anywhere upgrade cluster -f cluster.yaml \
  --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig
# add --hardware-csv <hardware.csv> when adding more hardware to the cluster

You should pass the cluster spec to -f, not the hardware.csv file.
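
For the scale-in described in this issue, that would look roughly like the following (a sketch reusing the file names from the issue description; the kubeconfig path is a placeholder):

# 1. Reduce the worker node group count in the cluster spec (eksa-new.yaml)
# 2. Delete the row for the node to be removed from hardware-new.csv
# 3. Pass the cluster spec to -f and the trimmed CSV to --hardware-csv
eksctl anywhere upgrade cluster \
  -f eksa-new.yaml \
  --hardware-csv hardware-new.csv \
  --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig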

ygao-armada (Author) commented

@jiayiwang7 sorry, my bad. I have already updated the description.

Here was my original command:
eksctl anywhere upgrade cluster -f eksa-mgmt05-cluster-cp1-worker1.yaml --hardware-csv hardware-mgmt05-2-new.csv --no-timeouts -v 9 --skip-validations=pod-disruption

@jiayiwang7 jiayiwang7 added the ack Issue has been acknowledged label May 28, 2024
@drewvanstone drewvanstone added this to the v0.20.0 milestone Jun 10, 2024
pokearu (Member) commented Jun 12, 2024

Hi @ygao-armada
Thanks for creating the issue. I believe we should have this resolved in our upcoming release.

sp1999 (Member) commented Jun 14, 2024

This issue has been resolved in our latest patch release, v0.19.7.

@sp1999 sp1999 closed this as completed Jun 14, 2024
thecloudgarage commented Jul 30, 2024

I am still seeing this issue in our bare metal setup.

EKS-A version:

eksctl anywhere version
Version: v0.20.1
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/69/manifest.yaml

Before starting the scale-in, I have 2 worker nodes:

kubectl get nodes -o wide
NAME           STATUS   ROLES           AGE     VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
instance-528   Ready    control-plane   4h24m   v1.29.5-eks-1109419   10.103.15.163   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-529   Ready    <none>          31m     v1.29.5-eks-1109419   10.103.15.165   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-530   Ready    <none>          3h34m   v1.29.5-eks-1109419   10.103.15.182   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-531   Ready    control-plane   4h7m    v1.29.5-eks-1109419   10.103.15.184   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-532   Ready    control-plane   3h47m   v1.29.5-eks-1109419   10.103.15.186   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4

Then I edited my hardware CSV file to remove the instance-530 worker node:

cat hardware-targeted-scale-down.csv
hostname,bmc_ip,bmc_username,bmc_password,mac,ip_address,netmask,gateway,nameservers,labels,disk
instance-531,10.204.196.126,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.184,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-532,10.204.196.127,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.186,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-528,10.204.196.125,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.163,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=cp,/dev/sda
instance-529,10.204.196.129,root,xxxxxx,XX:XX:XX:XX:XX:XX,10.103.15.165,255.255.252.0,10.103.12.1,10.103.8.12|10.103.12.12,type=worker,/dev/nvme0n1

I also adjusted my cluster config file to scale the worker node count to 1 and then ran the command:

eksctl anywhere upgrade cluster -f cluster-config-upgrade-20240730094824-scale-to-1.yaml --hardware-csv hardware-targeted-scale-down.csv --kubeconfig /home/ubuntu/eksanywhere/eksa-xxxx-cluster2n/eksa-xxxx-cluster2n-eks-a-cluster.kubeconfig --skip-validations=pod-disruption
Performing setup and validations
✅ Tinkerbell provider validation
✅ SSH Keys present
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-Anywhere Version
✅ Validate eksa controller is not paused
✅ Validate eksaVersion skew is one minor version
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing GitOps cluster resources reconcile
Upgrading core components
Backing up management cluster's resources before upgrading
Upgrading management cluster
Updating Git Repo with new EKS-A cluster spec
Finalized commit and committed to local repository      {"hash": "2d209dbf9ebd2a0f45ff88c8fe1a793f4d11348a"}
Forcing reconcile Git repo with latest commit
Resuming GitOps cluster resources kustomization
Writing cluster config file
🎉 Cluster upgraded!
Cleaning up backup resources

However, EKS Anywhere still does not delete the node that I removed from the hardware CSV. Instead, it starts draining the other worker node, instance-529:

kubectl get nodes -o wide
NAME           STATUS                     ROLES           AGE     VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
instance-528   Ready                      control-plane   4h26m   v1.29.5-eks-1109419   10.103.15.163   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-529   Ready,SchedulingDisabled   <none>          33m     v1.29.5-eks-1109419   10.103.15.165   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-530   Ready                      <none>          3h36m   v1.29.5-eks-1109419   10.103.15.182   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-531   Ready                      control-plane   4h9m    v1.29.5-eks-1109419   10.103.15.184   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-532   Ready                      control-plane   3h49m   v1.29.5-eks-1109419   10.103.15.186   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4

Per my understanding of the fix, instance-530 should have been deleted, since it is the node that was removed from the hardware CSV. However, after the scale-in upgrade, the other worker node (instance-529) has been deleted instead:

kubectl get nodes -o wide
NAME           STATUS   ROLES           AGE     VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
instance-528   Ready    control-plane   4h32m   v1.29.5-eks-1109419   10.103.15.163   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-530   Ready    <none>          3h43m   v1.29.5-eks-1109419   10.103.15.182   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-531   Ready    control-plane   4h15m   v1.29.5-eks-1109419   10.103.15.184   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4
instance-532   Ready    control-plane   3h55m   v1.29.5-eks-1109419   10.103.15.186   <none>        Ubuntu 22.04.4 LTS   5.15.0-113-generic   containerd://1.7.18-0-gae71819c4

Can someone help?
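
One way to see which machine the scale-in actually selects, and possibly to influence that choice, is to work with the Cluster API objects on the management cluster. This is a sketch, assuming the default eksa-system namespace; the machine name is a placeholder:

# Watch which Cluster API machine enters the Deleting phase during the scale-in
kubectl get machines.cluster.x-k8s.io -n eksa-system -o wide -w

# Cluster API supports marking a specific machine for removal on the next
# scale-down via the cluster.x-k8s.io/delete-machine annotation; whether
# EKS Anywhere honors this during a hardware.csv-driven scale-in has not
# been confirmed in this thread.
kubectl annotate machines.cluster.x-k8s.io <machine-backing-instance-530> \
  cluster.x-k8s.io/delete-machine="yes" -n eksa-system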
