
fix: Sub Module Prefix Separator #1792

Closed
wants to merge 2 commits into from

Conversation

joshua-giumelli-deltatre

@joshua-giumelli-deltatre joshua-giumelli-deltatre commented Jan 19, 2022

Description

This PR addresses issue #1791, where different prefix separators were used prior to v18.0.

  • Add an optional prefix_separator variable (defaulting to "-") to the eks-managed-node-group, fargate-profile and self-managed-node-group submodules
  • Replace occurrences of the hardcoded trailing "-" in name prefix fields with the prefix_separator variable (see the sketch below)
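
A minimal sketch of the intended change inside a submodule is shown below; the prefix_separator variable is from this PR, while the surrounding resource and var.name are illustrative assumptions rather than verbatim module code:

variable "prefix_separator" {
  description = "Separator between the name and the generated suffix in name_prefix fields"
  type        = string
  default     = "-"
}

resource "aws_launch_template" "this" {
  # previously hardcoded as "${var.name}-"
  name_prefix = "${var.name}${var.prefix_separator}"
}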

Motivation and Context

This PR addresses issue #1791. When upgrading from v17.X to v18.X, many resources are replaced because the previous default separator, the empty string (""), was not used in all locations by the submodules. This PR adds support for passing a custom prefix separator to each submodule to address this.

Breaking Changes

N/A

How Has This Been Tested?

  • I have tested and validated these changes using one or more of the provided examples/* projects
    • ran terraform init and terraform plan for the eks_managed_node_group, fargate_profile and self_managed_node_group examples. All successfully planned.

@joshua-giumelli-deltatre joshua-giumelli-deltatre changed the title Prefix separator Sub Module Prefix Separator Jan 19, 2022
@joshua-giumelli-deltatre joshua-giumelli-deltatre changed the title Sub Module Prefix Separator fix: Sub Module Prefix Separator Jan 19, 2022
@bryantbiggs
Member

The prefix-separator was added to prevent the cluster resources from being re-created, and I believe we discussed not adding it to node groups. Since node groups should, in theory, be ephemeral, I am leaning towards excluding this change to avoid any additional v17.x creep back into this version.

@LukaszRacon

"prefix-separator was added to prevent the cluster resources from being re-created"
If you do not propagate prefix-separator to node groups, it forces existing node groups to be re-created, which kills the idea of a zero-downtime migration from 17.x. If that is the case, you would be better off completely removing prefix-separator from the code and forcing everyone to create a separate cluster at version 18.x and then switch traffic over.

@bryantbiggs
Member

No, like I said - we added it to the cluster resources because the cluster control plane holds the etcd data, etc. You can do targeted applies and bring up new node groups before destroying the existing node groups.

Also, we've made no guarantees about zero-downtime upgrades. It's marked as a breaking change for a reason.

@bryantbiggs
Member

Ref #1745 (comment)

@jurgen-weber-deltatre

jurgen-weber-deltatre commented Jan 20, 2022

No, like I said - we added it to the cluster resources because the cluster control plane holds the etcd data, etc. You can do targeted applies and bring up new node groups before destroying the existing node groups.

Also, we've made no guarantees about zero-downtime upgrades. It's marked as a breaking change for a reason.

You cannot ask users of this module to simply destroy their production node groups and not expect them to push back. Targeted applies are a solution, but they are as much of a hack, and just as unpleasant, as allowing the separator where it is needed. It is not an unreasonable expectation to want to avoid recreating your node groups, ephemeral or not. Users are going to take the path of least resistance.

This is a simple addition, regardless of where it lives. It is still in the module.

@bryantbiggs
Member

@jurgen-weber-deltatre I don't expect anything from users other than compassion and understanding - users are able to stay at the last v17.x version until they are ready to plan an upgrade path. Making demands of free OSS that was provided by individuals on their own time is not a good look.

You can check out other threads; people who are savvy with Terraform will be the most successful in navigating an upgrade path based on their current configurations. #1744

@bryantbiggs
Member

nice touch with all the 👍🏽 👎🏽

@joshua-giumelli-deltatre
Author

@bryantbiggs I believe this discussion essentially boils down to "should the node groups or the nodes be ephemeral?". Seeing as the use of spot/replaceable instances is an extremely common cost-saving measure, I am inclined to argue that the nodes are ephemeral while their state/configuration is held in a parent node group/ASG/etc. object. I'm happy to discuss this further outside of this PR; however, the point of this PR is more to give users the choice of where they sit on that question.

From an ease-of-migration point of view, for all users of Terraform, an upgrade path that requires the fewest resources to be destroyed or replaced is the path of least resistance, which is what this PR hopes to achieve. Not forcing the replacement of resources is, in my opinion, the safest upgrade approach. I understand that this module is designed to have breaking changes compared to v17.X; this is more about reducing the barrier to entry when upgrading.

@LukaszRacon

Ideally you would provide a migration path for users:
Release an 18.0.x version where the prefix can be configured per node group, allowing users to run two node groups side by side (the old one without the prefix and a new one with it). Once the cluster has both node groups, workloads can be migrated and the old node group destroyed. Document that the node group prefix is deprecated and will be removed in a future version, then remove it in a later 18.x.0 release.
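
As an illustration only, a transition phase like that might look roughly like the sketch below in a consumer configuration; the per-node-group prefix_separator key is hypothetical (it is what is being proposed, not an existing input), and the remaining inputs are abbreviated:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "example"
  cluster_version = "1.21"
  # vpc_id, subnet_ids and other required inputs omitted for brevity

  eks_managed_node_groups = {
    # existing group kept alive under the pre-v18 naming (hypothetical per-group key)
    legacy = {
      prefix_separator = ""
    }
    # new group created with the v18 default naming
    default = {}
  }
}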

@bryantbiggs
Member

Ideally you would provide a migration path for users: release an 18.0.x version where the prefix can be configured per node group, allowing users to run two node groups side by side (the old one without the prefix and a new one with it). Once the cluster has both node groups, workloads can be migrated and the old node group destroyed. Document that the node group prefix is deprecated and will be removed in a future version, then remove it in a later 18.x.0 release.

You just described a breaking change to support a breaking change

@LukaszRacon

The breaking change would be done over two releases, which gives users a path to migrate their workloads.

@bryantbiggs
Member

You can achieve this today by doing targeted applies (a rough command sketch follows the list):

  • Migrate your cluster to v18.x; perform any terraform state mv ... commands that are necessary
  • terraform apply --target <new node group defined in module config>
  • Wait for new node group to come up as healthy, do any sanity checks
  • Taint old node group instances and drain pods
  • terraform apply which will remove the old node group
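
A rough sketch of that sequence, with hypothetical resource addresses (list your own with terraform state list), is:

# hypothetical addresses - adjust to match your own state
terraform state mv 'module.eks.aws_iam_role.cluster[0]' 'module.eks.aws_iam_role.this[0]'

# bring up only the new node group first
terraform apply --target='module.eks.module.eks_managed_node_group["new"]'

# after the new group is healthy and the old nodes are drained,
# a plain apply removes the old node group
terraform apply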

@LukaszRacon

Thanks, I tested it a couple of hours ago. The trick was that you have to create the security groups before running terraform apply on the node group, otherwise the nodes cannot talk to the cluster. I am guessing there are some missing dependencies in the Terraform configuration.

terraform apply --target=module.eks.aws_security_group_rule.cluster
terraform apply --target=module.eks.aws_security_group_rule.node
terraform apply --target="module.eks.module.eks_managed_node_group[\"default\"]"

@daroga0002
Contributor

You can achieve this today by doing targeted applies.

A workaround which I often use is to remove the existing node groups from Terraform state via:

terraform state rm ....

and then rerun the new Terraform code. After the new node groups are created, I just remove the existing ones (which are now outside Terraform) via the web UI.
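
For example (the address below is hypothetical; check yours with terraform state list):

terraform state list | grep node_group
terraform state rm 'module.eks.module.node_groups.aws_eks_node_group.workers["default"]'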

@sidewinder12s

Couldn't you also use the new sub-modules outside the parent module?

Spin up some new node groups that match the old ones, migrate workloads over to them, then upgrade the cluster and destroy the old node groups (while using the existing prefix-separator variable in the main cluster).
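
Roughly, that could look like the sketch below; the submodule source path follows the repository layout, but the input names are assumptions from memory and should be checked against the submodule's variables.tf:

module "migration_node_group" {
  source = "terraform-aws-modules/eks/aws//modules/eks-managed-node-group"

  # input names assumed, verify against the submodule
  name         = "migration"
  cluster_name = module.eks.cluster_id
  subnet_ids   = var.private_subnet_ids

  min_size     = 1
  max_size     = 3
  desired_size = 2

  # make sure the cluster and its security group rules exist before the nodes try to join
  depends_on = [module.eks]
}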

@ivankovnatsky
Contributor

I don't like the hack/fix of manipulating the state. We do have a wrapper for Terraform that does some GitOps magic, so running these import/state rm commands would be excessive for us; of course we can do it, but why?

@bryantbiggs @daroga0002 would you be open to mitigating the state manipulation by just renaming some resources back? Like I do here: master...ivankovnatsky:master

@bryantbiggs
Member

I don't like the hack/fix of manipulating the state. We do have a wrapper for Terraform that does some GitOps magic, so running these import/state rm commands would be excessive for us; of course we can do it, but why?

@bryantbiggs @daroga0002 would you be open to mitigating the state manipulation by just renaming some resources back? Like I do here: master...ivankovnatsky:master

No, we are not going to revert those changes - they were made intentionally as part of a planned breaking change.

If you don't want to run state move commands at the CLI, you can use the new moved block syntax: https://www.terraform.io/language/modules/develop/refactoring

@joshua-giumelli-deltatre
Author

Chiming in here as I have been away for a while.

When using the moved block, the block needs to be inside this module rather than in any codebase that references it.

This is because Terraform requires that moved blocks live inside the module containing the resources to be moved; Terraform is unable to move a resource from outside its module.

Testing locally by editing my .terraform directory, it looks like these moved blocks would be a good starting point:

moved {
  from = aws_iam_role.cluster
  to   = aws_iam_role.this
}

moved {
  from = aws_launch_template.workers_launch_template[X]
  to   = module.self_managed_node_group[XXXX].aws_launch_template.this[0]
}

Implementing these moved blocks inside this module could be another avenue for reducing the interventions required by users for the 17.X to 18.X update.

@igaskin

igaskin commented Feb 11, 2022

Chiming in here as I have been away for a while.

When using the moved block, the block needs to be inside this module rather than in any codebase that references it.

This is because Terraform requires that moved blocks live inside the module containing the resources to be moved; Terraform is unable to move a resource from outside its module.

Testing locally by editing my .terraform directory, it looks like these moved blocks would be a good starting point:

moved {
  from = aws_iam_role.cluster
  to   = aws_iam_role.this
}

moved {
  from = aws_launch_template.workers_launch_template[X]
  to   = module.self_managed_node_group[XXXX].aws_launch_template.this[0]
}

Implementing these moved blocks inside this module could be another avenue for reducing the interventions required by users for the 17.X to 18.X update.

A word of caution for others going through the upgrade process: even after adding the moved blocks, changes to IAM role names may also force cluster recreation, as mentioned in the "List of backwards incompatible changes".

https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/main.tf#L173
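
One possible mitigation, assuming the iam_role_name and iam_role_use_name_prefix inputs in v18.x behave as their names suggest, is to pin the cluster role to its existing (v17-era) name so the role, and the cluster that references its ARN, is not replaced; the name below is hypothetical:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  # keep the pre-existing role name so the role is not recreated (hypothetical value)
  iam_role_name            = "my-cluster-20220101"
  iam_role_use_name_prefix = false

  # ... other inputs unchanged ...
}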

@ivankovnatsky
Contributor

We decided not to upgrade at all, and in the future we will just migrate out of the module.

@joshua-giumelli-deltatre
Author

@bryantbiggs is there any update on this? Based on the feedback here, it sounds like others have been running into this issue and that it has been a pain point when upgrading. I understand the issue has been flagged as wontfix, but if this PR/idea is not likely to be adopted, please close this PR.

@bryantbiggs
Member

while I appreciate the PR, I don't think this is a change we will be implementing

@github-actions

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2022