EKS Managed Node Group Module - policies being detached #3202
Comments
From the logs, it looks like the creation of the new policy attachment happens, then the deletion of the old one. The internal TF IDs don't appear to clash (they are derived from timestamps). So something is triggering create_before_destroy for these, I think, and the result is that they are replaced and then the replacements are deleted.
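A minimal sketch of why create_before_destroy misbehaves for this particular resource type may help here (all names and ARNs below are placeholders, not the module's actual code): an `aws_iam_role_policy_attachment` is identified in AWS purely by its (role, policy ARN) pair, so the "replacement" attachment and the "old" one are the same object server-side.

```hcl
# Hypothetical standalone reproduction (names/ARNs are placeholders).
resource "aws_iam_role" "node" {
  name = "example-node-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "worker" {
  role       = aws_iam_role.node.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"

  # With create_before_destroy in effect (set here, or inherited because a
  # dependent resource sets it), a replacement runs as:
  #   1. create: attach the policy -- a no-op, since the identical
  #      (role, policy_arn) pair is already attached
  #   2. destroy: detach the policy
  # Net result: the policy ends up detached from the role.
  lifecycle {
    create_before_destroy = true
  }
}
```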
I've just tried this manually: attaching the policy to the role can be run repeatedly (it's idempotent), and the deletion just drops the policy attachment. So I think the issue here is that create_before_destroy is being triggered by something, and the TF provider itself doesn't know any better. I don't know whether supplying an explicit lifecycle block that opts the policy attachments out of this would sort out the issue (but I can definitely attempt it).
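For reference, the opt-out being suggested would look roughly like this on the attachment resource (a sketch that only mirrors the module's resource shape; `aws_iam_role.this` is assumed to exist alongside it). One caveat: Terraform propagates create_before_destroy onto the dependencies of any resource that sets it, so an explicit `false` here may be silently overridden.

```hcl
# Sketch only -- not the module's actual code.
resource "aws_iam_role_policy_attachment" "this" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly",
  ])

  role       = aws_iam_role.this.name # assumes a role defined in the same module
  policy_arn = each.value

  lifecycle {
    # Ask for plain destroy-then-create ordering. Caveat: create_before_destroy
    # propagates from dependent resources, so this opt-out may not take effect.
    create_before_destroy = false
  }
}
```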
This is incorrect; this is not a module issue.
Without this, the two node pools (the one defined directly on the cluster, and the other) become eligible to roll at the same time. Is there a way to achieve a one-after-the-other behaviour here?
This feels like an X-Y problem - why are you creating nodes this way?
The two pools host distinct parts of an application. I'd like to ensure that one is rehomed before proceeding to the second.
Incidentally, I'd expect that setting depends_on shouldn't break the module behaviour. And indeed, TF looks to plan to replace the policy attachments - that might be a spurious action, but it shouldn't cause the policy to end up deleted. What I can't work out here is why those policy attachments are handled with create_before_destroy, which seems to be the cause of the misbehaviour I'm witnessing.
This sounds like you have not architected for resiliency - Kubernetes applications should be resilient to node replacements. In addition, it sounds like your app is closely coupled to a node pool, which is also not ideal - it treats nodes more like pets than cattle. Correcting these will alleviate the need to venture down this path entirely.
Please familiarize yourself with Terraform and its behaviors: hashicorp/terraform#30340 (comment)
Description
I'm using terraform-aws-modules/eks/aws//modules/eks-managed-node-group directly.
During planning, a delete & recreate of the node group's policy attachments is planned; after execution, however, no policies are attached to the node pool's role.
Versions
Module version [Required]: 2.28.0, but older versions like 2.20.0 are affected.
Terraform version: ~> 1.7.0
Provider version(s): aws at 5.74.0
Reproduction Code [Required]
The desired behaviour is to have node updates affect the util-node group first, then the second node group.
Here:
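A minimal sketch of the kind of configuration involved (the module source is the real one under discussion; the names, references, and sizes are placeholders):

```hcl
module "util_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"

  name         = "util"
  cluster_name = module.eks.cluster_name
  subnet_ids   = module.vpc.private_subnets

  min_size     = 1
  max_size     = 3
  desired_size = 2
}
```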
and then this, managed afterward:
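Again a sketch with placeholder names; the key element, per the discussion above, is the module-level depends_on:

```hcl
module "second_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"

  name         = "second"
  cluster_name = module.eks.cluster_name
  subnet_ids   = module.vpc.private_subnets

  min_size     = 1
  max_size     = 3
  desired_size = 2

  # The ordering constraint: make this node group wait for the util group.
  # A module-level depends_on applies to everything inside the module,
  # which makes the plan treat its resources more conservatively.
  depends_on = [module.util_node_group]
}
```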
The issue is that there's a plan to replace (delete & recreate) these whenever the cluster sees an upgrade (e.g., a VPC-CNI update):
```
module.second_node_group.aws_iam_role_policy_attachment.this["AmazonEC2ContainerRegistryReadOnly"]
module.second_node_group.aws_iam_role_policy_attachment.this["AmazonEKS_CNI_Policy"]
module.second_node_group.aws_iam_role_policy_attachment.this["AmazonEKSWorkerNodePolicy"]
```
and that in itself would not be a problem - TF apply (I'm using TFC) tells me these were all created & replaced.
However, that's a lie; they aren't replaced, they're only removed. Running a second plan & apply will recreate the missing policy attachments.
I'm using TF cloud.
Yes, although this seems unrelated.
Many plans will not cause this if they result in no changes. However, an update to the main cluster (e.g., a plugin upgrade) will lead to the second node group being reapplied. It's this that causes the detach/attach behaviour. That in and of itself wouldn't be an issue; however, the reattachment doesn't actually happen. I don't know why - possibly some kind of ordering issue?
Expected behavior
The second node pool would continue to operate correctly.
Actual behavior
The second node pool loses the ability to pull images, etc., because it no longer has the pertinent policies attached.
Terminal Output Screenshot(s)
Additional context
If there's some other way to get the module to roll node groups in order, I'd be happy to use that.