feat: Fix custom AMI bootstrap #1580

stevehipwell · 2021-09-10T10:28:29Z

PR o'clock

Description

This PR fixes the issues introduced in PR #1473 which would have made it impossible to start a fully customised image without requiring a bootstrap.sh script.

Checklist

README.md has been updated after any changes to variables and outputs. See https://github.com/terraform-aws-modules/terraform-aws-eks/#doc-generation

ArchiFleKs · 2021-09-10T18:36:26Z

Looks good. I’m off until Tuesday, so I don’t have time to test it.

My bad, but the “role” was a custom label added by me and not something added by AWS. I think it can be removed.

stevehipwell · 2021-09-10T19:55:38Z

Thanks @ArchiFleKs, I did wonder about the role...

ArchiFleKs · 2021-09-14T11:22:21Z

I think topology.ebs.csi... can be omitted as well, as it was also added by me for another reason. What about these EKS labels?

eks.amazonaws.com/sourceLaunchTemplateVersion
eks.amazonaws.com/nodegroup
eks.amazonaws.com/sourceLaunchTemplateId

stevehipwell · 2021-09-14T12:51:12Z

@ArchiFleKs I did wonder when they started adding the CSI labels, we add them ourselves until the CSI driver starts using the correct topology labels. The other labels you list would be impossible to inject as they're added to the calculated launch template that's created after the MNG is created.

@andreyBar your PR only supported custom AMIs built from the EKS optimised image or that have added a /etc/eks/bootstrap.sh script to mimic the behaviour of one. The call to /etc/eks/bootstrap.sh wouldn't work correctly without providing custom user data, and any labels set would be ignored. This PR corrects this behaviour so that if you just want to use a fixed AMI you can do so without needing to understand how /etc/eks/bootstrap.sh works.
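As a rough illustration of the kind of call the user data makes on an EKS-style AMI, the sketch below builds (but does not run) a bootstrap command line. The `--apiserver-endpoint`, `--b64-cluster-ca` and `--kubelet-extra-args` options belong to the EKS AMI's bootstrap.sh and `--node-labels` is a kubelet flag; the function name and every value here are hypothetical placeholders, not what the module actually renders.

```shell
#!/bin/sh
set -eu

# Hypothetical placeholder values; in the module these would come from
# template variables rather than being hard-coded.
API_SERVER_URL="https://example.invalid"
B64_CLUSTER_CA="ZHVtbXk="
CLUSTER_NAME="example-cluster"
NODE_LABELS="team=platform,lifecycle=on-demand"

# Print the bootstrap command instead of running it, so the sketch is safe
# to execute anywhere (bootstrap.sh only exists on EKS-style AMIs).
build_bootstrap_cmd() {
  printf '%s %s --apiserver-endpoint %s --b64-cluster-ca %s --kubelet-extra-args %s\n' \
    "/etc/eks/bootstrap.sh" "$CLUSTER_NAME" "$API_SERVER_URL" "$B64_CLUSTER_CA" \
    "'--node-labels=${NODE_LABELS}'"
}

BOOTSTRAP_CMD="$(build_bootstrap_cmd)"
echo "$BOOTSTRAP_CMD"
```

An AMI without /etc/eks/bootstrap.sh has nothing to receive these arguments, which is why a fully customised image needs a different code path.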

modules/node_groups/templates/userdata.sh.tpl

stevehipwell · 2021-09-17T14:51:01Z

@antonbabenko could you take a look at this PR?

antonbabenko · 2021-09-17T14:58:03Z

Sorry, I don't have the capacity to take a look at this one for real (run tests, think about edge cases, compatibility, etc).

Maybe @daroga0002 can put this in his queue for reviews?

daroga0002 · 2021-09-17T14:59:38Z

Maybe @daroga0002 can put this in his queue for reviews?

Yup, I have it on TODO list

daroga0002 · 2021-09-24T15:55:42Z

@stevehipwell sorry for the delay, I was sick the last few days. I will find some time during the weekend for this, but could you rebase this PR to make it easier to test?

stevehipwell · 2021-09-24T17:40:06Z

@daroga0002 I hope you're feeling better now? I've rebased.

daroga0002 · 2021-09-28T08:01:29Z

@stevehipwell thank you for your contribution 🎉, I reviewed the PR and I think we can merge it, as this should solve multiple issues with custom AMI managed groups.

daroga0002 · 2021-09-28T08:05:21Z

@antonbabenko let's merge this and hold off on releasing

antonbabenko

There is a minor comment about shell script. The rest looks good to me.

antonbabenko · 2021-09-28T08:23:13Z

modules/node_groups/templates/userdata.sh.tpl

-# Allow user supplied pre userdata code
+sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" ${kubelet_extra_args}"' /etc/eks/bootstrap.sh
+%{endif ~}
+%{if length(pre_userdata) > 0 ~}


I don't think we need to have this if/endif block but just include ${pre_userdata} always as it was before.

It would work without the if block, but it's tidier to use it so the resulting script has consistent white space. It also makes it clear in the template that it's an optional variable.

Please remove this if. It will be tidy without it. I will merge it right away.
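For readers unfamiliar with the sed line in the hunk above, here is a small sketch of what it does to a script. It runs against a temporary stand-in file rather than the real /etc/eks/bootstrap.sh; the file contents and the `extra` value (standing in for the rendered `${kubelet_extra_args}`) are assumptions made for the demo.

```shell
#!/bin/sh
set -eu

# Stand-in for /etc/eks/bootstrap.sh: a temp file containing one line that
# the sed address matches. The contents are an assumption for this demo.
tmp="$(mktemp)"
trap 'rm -f "$tmp"' EXIT
printf '%s\n' \
  '#!/usr/bin/env bash' \
  'KUBELET_EXTRA_ARGS=""' \
  'echo "kubelet args: $KUBELET_EXTRA_ARGS"' > "$tmp"

# Hypothetical value that Terraform would substitute for ${kubelet_extra_args}.
extra='--node-labels=example=true'

# Same shape as the sed line in the hunk: after the line starting with
# KUBELET_EXTRA_ARGS=, append a new line that extends the variable.
# Note: the one-line form of the `a` command is a GNU sed extension.
sed -i "/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=\" ${extra}\"" "$tmp"

cat "$tmp"
patched_line="$(sed -n '3p' "$tmp")"
```

The appended line ends up directly below the original assignment, so the extra arguments are added to whatever value the script already had at that point.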

daroga0002 · 2021-09-29T07:44:25Z

modules/node_groups/templates/userdata.sh.tpl

+# Set variables
+API_SERVER_URL=${cluster_endpoint}
+B64_CLUSTER_CA=${cluster_ca}
+K8S_CLUSTER_DNS_IP=172.20.0.10


I found one issue: this should be a variable

terraform-aws-eks/main.tf

Line 28 in 253f927

service_ipv4_cidr = var.cluster_service_ipv4_cidr

@daroga0002 I think you're right. I'm currently on vacation, so can't make this change until I'm back next week. If you can modify my PR that's fine. Otherwise setting the optimised variable to false would let this be set by providing the whole user data; I could then update the other PR to make this customisable, as it covers this type of behaviour?

np, we can wait, have a great vacation 😄

@daroga0002 this should be ready to merge.

@stevehipwell are you sure you added the change, as I still see

@daroga0002 sorry, manic day in the office. I only rebased; I'll add the changes first thing in the morning.

@daroga0002 I've updated the userdata template to be simpler and so that all custom AMIs use the same initial logic for setting the variables (this is designed ready to support #1577). After checking the bootstrap.sh code I've removed the manual setting of K8S_CLUSTER_DNS_IP as bootstrap.sh will handle this correctly; #1577 will enable this to be fully customisable.

daroga0002

@antonbabenko it looks good, from my side we can merge it. This will not be a breaking or impacting change, as it will silently update the launch template and the autoscaling group (users will need to roll instances to use the new features).

If we merge this then I think we can make a release (there is also #1584, which was tested and can be included in the release).

antonbabenko · 2021-10-08T14:15:12Z

modules/node_groups/templates/userdata.sh.tpl

-# Allow user supplied pre userdata code
+sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" ${kubelet_extra_args}"' /etc/eks/bootstrap.sh
+%{endif ~}
+%{if length(pre_userdata) > 0 ~}


Please remove this if. It will be tidy without it. I will merge it right away.

stevehipwell · 2021-10-08T14:20:16Z

@antonbabenko I've removed the if.

antonbabenko · 2021-10-08T14:42:42Z

Thanks @stevehipwell for the contribution!

I think it will be released at the beginning of next week. Waiting for the last PR by @daroga0002 before the release.

JordyWTC · 2021-10-11T08:25:31Z

Hi all,

This change is now breaking my automatic construction of environments:
│ Error: error reading EKS Cluster (ew1-uat01-eks-cluster): couldn't find resource
│
│ on .terraform/modules/eks/modules/node_groups/locals.tf line 1, in data "aws_eks_cluster" "default":
│ 1: data "aws_eks_cluster" "default" {

This is due to the data block that was introduced (without a dependency) in the node_groups module in locals.tf.

stevehipwell · 2021-10-11T08:43:57Z

@JordyWTC what do you mean about the data resource being added without a dependency? Also I assume that you're running this off a branch as this has not been released yet?

JordyWTC · 2021-10-11T10:45:31Z

Hi @stevehipwell ,
I was calling in the master ref; I have fixed this problem now by calling in the 17.20.0 ref.

In locals.tf there's a data block added on aws_eks_cluster, but if the EKS cluster still needs to be created, then you will get the error that the resource cannot be found. So I would actually expect the data block to reference module.aws_eks_cluster.id rather than var.cluster_name.

stevehipwell · 2021-10-11T10:58:37Z

@JordyWTC that's not how I read the behaviour. The cluster_name variable is coming from a locals expression which is dependent on the cluster resource. Do you happen to be setting create_eks = false?

JordyWTC · 2021-10-11T11:20:27Z

Hi @stevehipwell, yes you are right, I missed that cluster_name is derived in locals.tf.
But then it almost seems like the var.cluster_name value is being picked up from the value I send to the eks module.
Because what I see happening is that the eks module is not even being created; it immediately falls over with this error:

│ Error: error reading EKS Cluster (ew1-uat01-eks-cluster): couldn't find resource
│
│ on .terraform/modules/eks/modules/node_groups/locals.tf line 1, in data "aws_eks_cluster" "default":
│ 1: data "aws_eks_cluster" "default" {

But it does have the cluster name stated (which I pass in via the module), which is strange as there is no cluster created yet and thus the locals value should be empty.

module "eks" {
  source                        = "git::https://github.com/terraform-aws-modules/terraform-aws-eks.git?ref=v17.20.0"
  cluster_name                  = var.name
  cluster_version               = var.cluster_version
  subnets                       = var.subnets
  vpc_id                        = var.vpc_id
  manage_worker_iam_resources   = false
  #config_output_path            = var.config_output_path
  write_kubeconfig              = var.write_kubeconfig
  manage_aws_auth               = var.manage_aws_auth
  map_users                     = var.map_users
  map_roles                     = var.map_roles
  manage_cluster_iam_resources  = var.manage_cluster_iam_resources
  cluster_iam_role_name         = var.cluster_iam_role_name
  cluster_enabled_log_types     = var.cluster_enabled_log_types
  cluster_log_retention_in_days = var.cluster_log_retention_in_days
  tags                          = var.tags
}

I am not setting the create_eks value as it is true by default.

stevehipwell · 2021-10-11T11:34:36Z

@JordyWTC I'm not sure how you can be seeing the behaviour you are, not that I'm saying you're not. The module includes a custom depends on hack which should stop the MNGs being created before the cluster is ready. Do you have any more details such as your TF version etc?

JordyWTC · 2021-10-11T12:37:14Z

Hi @stevehipwell,
I am also using terragrunt :
Terraform version :
'0.15.0'
Terragrunt version :
'0.28.18'

I agree with you that I should not be seeing this behavior. I just did another test, and the var.cluster_name in the node_groups module is indeed getting its value from what I am sending via terragrunt -> terraform.

Running terragrunt apply from the eks part.

bisschopj@wtcjordy:~/Git/makro/terraform-infrastructure/makrotest/eu-west-1/int01/eks$ tgt apply
╷
│ Error: error reading EKS Cluster (ew1-int01-eks-cluster): couldn't find resource
│ 
│   with module.eks.module.node_groups.data.aws_eks_cluster.default,
│   on .terraform/modules/eks/modules/node_groups/locals.tf line 1, in data "aws_eks_cluster" "default":
│    1: data "aws_eks_cluster" "default" {
│ 
╵
Releasing state lock. This may take a few moments...
ERRO[0013] Hit multiple errors:
Hit multiple errors:
exit status 1

Terragrunt code in eks folder :

terraform {
  #source = "git::https://wtcnl@dev.azure.com/wtcnl/makro/_git/terraform-modules//aws/resource/eks?ref=develop"
  source = "../../../../../terraform-modules//aws/resource/eks"
}

include {
  path = find_in_parent_folders()
}

dependency "vpc" {
  config_path = "../vpc"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    vpc_id          = "fake-vpc-id",
    private_subnets = "fake-pvsubnet-id"
  }
}

dependency "eks_iam" {
  config_path = "../eks_iam"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    eks_cluster_iam_role_name = "fake-eks_cluster_iam_role_name",
  }
}

dependency "sg_alb_http" {
  config_path = "../sg_alb_http"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    security_group_id = "fake-security-group-id"
  }
}

dependency "sg_alb_https" {
  config_path = "../sg_alb_https"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    security_group_id = "fake-security-group-id"
  }
}

inputs = {
  aws_region  = local.region
  aws_account = local.account

  name                          = "${local.region_abbr}-${local.environment}-eks-cluster"
  cluster_version               = local.eks_cluster_version
  write_kubeconfig              = false
  subnets                       = dependency.vpc.outputs.private_subnets
  vpc_id                        = dependency.vpc.outputs.vpc_id
  manage_cluster_iam_resources  = false
  cluster_iam_role_name         = dependency.eks_iam.outputs.eks_cluster_iam_role_name
  map_roles                     = local.roles_map
  map_users                     = local.admin_users_map
  cluster_enabled_log_types     = local.cluster_enabled_log_types
  cluster_log_retention_in_days = local.cluster_log_retention_in_days
}

The terraform module that terragrunt is calling then does the following, as also described in my earlier comment:

module "eks" {
  source                        = "git::https://github.com/terraform-aws-modules/terraform-aws-eks.git?ref=v17.20.0"
  cluster_name                  = var.name
  cluster_version               = var.cluster_version
  subnets                       = var.subnets
  vpc_id                        = var.vpc_id
  manage_worker_iam_resources   = false
  #config_output_path            = var.config_output_path
  write_kubeconfig              = var.write_kubeconfig
  manage_aws_auth               = var.manage_aws_auth
  map_users                     = var.map_users
  map_roles                     = var.map_roles
  manage_cluster_iam_resources  = var.manage_cluster_iam_resources
  cluster_iam_role_name         = var.cluster_iam_role_name
  cluster_enabled_log_types     = var.cluster_enabled_log_types
  cluster_log_retention_in_days = var.cluster_log_retention_in_days
  tags                          = var.tags
}

stevehipwell · 2021-10-11T13:02:31Z

@daroga0002 does this look like anything you've seen before?

@JordyWTC do you get the same behaviour if you call Terraform directly?

endrec · 2021-10-11T13:10:14Z

@JordyWTC This PR was merged after v17.20.0, so the issue you are seeing is probably from an earlier change and should therefore go in its own issue.

(The last commit on v17.20.0 was 17 Sep, this PR was merged on 8 Oct.)

daroga0002 · 2021-10-11T13:43:43Z

@daroga0002 does this look like anything you've seen before?

I suspect there is some magic on the user's side, as even in the eks module call he is using neither node groups nor worker groups.

JordyWTC · 2021-10-11T14:54:06Z

@endrec, you are right, I only put in the 17.20.0 ref today to make it work, sorry for the confusion. The problem happens when I am doing it via ref master.

@stevehipwell @daroga0002 from terraform itself it is indeed working, so this seems like a terragrunt problem, or a problem with the way we are using terragrunt in this specific scenario, so I am not sure whether other terragrunt users will experience the same issue.

Might a fix be to use module.aws_eks_cluster.id in the data block instead of var.cluster_name?

stevehipwell · 2021-10-11T14:59:12Z

@JordyWTC that is exactly how it's currently working.

JordyWTC · 2021-10-11T15:37:24Z

@stevehipwell I reworked my terragrunt setup (latest version) to source the module directly from this GitHub repository, and I am running into the same problem. That means that everyone who is using terragrunt and sourcing this module will run into the same problem.
From what I see here, terragrunt also initializes the fargate and node_groups modules, as it also looks in subfolders:

Initializing modules...
- fargate in modules/fargate
- node_groups in modules/node_groups

I have no idea why terragrunt in this case is already trying to access the data block while there is a dependency in node_groups.tf; I will raise this with terragrunt as well.
I do understand that the locals expression is refreshing the cluster_name variable, but as terragrunt is not picking this up, a workaround would be to use the module input instead of the variable, as this would make it terragrunt friendly.

For me the workaround for now is just to call in the ref from an earlier version.
Thanks for your help!

stevehipwell · 2021-10-11T15:40:42Z

@JordyWTC the implementation of local.cluster_name is a safe version of module.aws_eks_cluster.id. I suspect that terragrunt is doing odd things with the nested modules, so this sounds like a terragrunt issue.

JordyWTC · 2021-10-11T15:42:19Z

Thanks @stevehipwell , i will raise it with TG.

endrec · 2021-10-11T16:23:40Z

@JordyWTC, I happen to use terragrunt in my setup, so for a quick sanity check I updated the module source to master and ran a terragrunt plan successfully.

➜ terragrunt --version
terragrunt version v0.32.4

➜ terragrunt version
Terraform v1.0.7
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.62.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.2.0
+ provider registry.terraform.io/hashicorp/helm v2.3.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.5.0
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/hashicorp/template v2.2.0
+ provider registry.terraform.io/terraform-aws-modules/http v2.4.1

Your version of Terraform is out of date! The latest version
is 1.0.8. You can update by downloading from https://www.terraform.io/downloads.html

JordyWTC · 2021-10-11T18:53:11Z

Thanks for the info @endrec, I'll be diving further into my code :)

daroga0002 · 2021-10-12T09:12:31Z

I found an issue with a missing conditional. I opened #1632 and am working on a fix.

… is EKS optimized and has a custom ipv4 CIDR
The PR (terraform-aws-modules#1580) is passing the "apiserver-endpoint" and "b64-cluster-ca", which causes SERVICE_IPV4_CIDR to be empty (https://github.com/awslabs/amazon-eks-ami/blob/v20211206/files/bootstrap.sh#L366). Because of that, the script always falls back to 10.100.0.10 or 172.20.0.10. Defining the ipv4 cidr ensures that the bootstrap script configures the DNS server correctly on the kubelet service, allowing pods to resolve DNS names.

…raform-aws-modules#1717)
The PR (terraform-aws-modules#1580) is passing the "apiserver-endpoint" and "b64-cluster-ca", which causes SERVICE_IPV4_CIDR to be empty (https://github.com/awslabs/amazon-eks-ami/blob/v20211206/files/bootstrap.sh#L366). Because of that, the script always falls back to 10.100.0.10 or 172.20.0.10. Defining the ipv4 cidr ensures that the bootstrap script configures the DNS server correctly on the kubelet service, allowing pods to resolve DNS names.
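The fallback described in these commit messages can be sketched roughly as follows. This is a simplified approximation, not a copy of bootstrap.sh: the function name is hypothetical, the second argument stands in for the instance metadata lookup the real script performs, and the rule for choosing between the two fallback addresses is an assumption based on a reading of bootstrap.sh rather than something stated in this thread.

```shell
#!/bin/sh
set -eu

# derive_dns_cluster_ip SERVICE_IPV4_CIDR VPC_HAS_10_RANGE
# Simplified approximation of the behaviour described above:
#   - if a service CIDR is known, use the .10 address of that range;
#   - otherwise fall back to 172.20.0.10 when the VPC uses a 10.x range
#     (assumed rule), else 10.100.0.10.
derive_dns_cluster_ip() {
  service_cidr="$1"
  vpc_has_10_range="$2"   # "true" or "false"; hypothetical stand-in for a metadata lookup
  if [ -n "$service_cidr" ]; then
    # Strip from the last dot onward (10.200.0.0/16 -> 10.200.0), then add .10
    echo "${service_cidr%.*}.10"
  elif [ "$vpc_has_10_range" = "true" ]; then
    echo "172.20.0.10"
  else
    echo "10.100.0.10"
  fi
}

derive_dns_cluster_ip "10.200.0.0/16" "false"   # prints 10.200.0.10
derive_dns_cluster_ip "" "true"                 # prints 172.20.0.10
```

With an empty service CIDR the result never reflects a custom range, which matches the symptom reported: pods on clusters with a custom service CIDR cannot resolve DNS names.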

github-actions · 2022-11-12T02:30:38Z

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

endrec reviewed Sep 14, 2021

modules/node_groups/templates/userdata.sh.tpl

stevehipwell mentioned this pull request Sep 24, 2021

feat: Improve managed node group bootstrap revisited #1577

Merged

1 task

daroga0002 approved these changes Sep 28, 2021

antonbabenko reviewed Sep 28, 2021

daroga0002 requested a review from antonbabenko September 28, 2021 15:11

daroga0002 reviewed Sep 29, 2021

daroga0002 approved these changes Oct 7, 2021

antonbabenko requested changes Oct 8, 2021

feat: Fix custom AMI bootstrap

aac08b7

antonbabenko merged commit f198efd into terraform-aws-modules:master Oct 8, 2021

stevehipwell deleted the mng-custom-ami branch October 8, 2021 14:41

daroga0002 mentioned this pull request Oct 12, 2021

missing conditional for datasource #1632

Closed

lisfo4ka pushed a commit to lisfo4ka/terraform-aws-eks that referenced this pull request Oct 12, 2021

feat: Fix custom AMI bootstrap (terraform-aws-modules#1580)

2cd7f8a

stevehipwell mentioned this pull request Oct 12, 2021

Cannot Create a New Cluster w/ 17.21.0 #1635

Closed

pjrm mentioned this pull request Dec 12, 2021

fix: Correct DNS Server of kubelet service when using custom AMI #1717

Closed

github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022
