feat: Fix custom AMI bootstrap #1580

Merged
merged 1 commit into from
Oct 8, 2021

Conversation

stevehipwell
Contributor

PR o'clock

Description

This PR fixes the issues introduced in PR #1473 which would have made it impossible to start a fully customised image without requiring a bootstrap.sh script.

Checklist

@ArchiFleKs
Contributor

Looks good. I’m off until Tuesday, so I don’t have time to test it.

My bad, but the “role” label was a custom label added by me and not something added by AWS. I think it can be removed.

@stevehipwell
Contributor Author

Thanks @ArchiFleKs, I did wonder about the role...

@ArchiFleKs
Contributor

I think topology.ebs.csi... can also be omitted, as it was added by me for another reason. What about these EKS labels?

  • eks.amazonaws.com/sourceLaunchTemplateVersion
  • eks.amazonaws.com/nodegroup
  • eks.amazonaws.com/sourceLaunchTemplateId

@stevehipwell
Contributor Author

@ArchiFleKs I did wonder when they started adding the CSI labels; we add them ourselves until the CSI driver starts using the correct topology labels. The other labels you list would be impossible to inject, as they're added to the calculated launch template that's created after the MNG is created.

@andreyBar your PR only supported custom AMIs that are built from the EKS optimised image or that have added a /etc/eks/bootstrap.sh script to mimic its behaviour. The call to /etc/eks/bootstrap.sh wouldn't work correctly without providing custom user data, and any labels set would be ignored. This PR corrects this behaviour, so if you just want to use a fixed AMI you can do so without needing to understand how /etc/eks/bootstrap.sh works.
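
For readers following along, the kind of call a custom AMI's user data would have to make can be sketched in shell as below. This is only an illustration under assumptions: the function name `render_bootstrap_command`, the cluster name, the endpoint, the CA value and the `example.com/role=worker` label are all hypothetical, and the function prints the command rather than running it. The `--apiserver-endpoint`, `--b64-cluster-ca` and `--kubelet-extra-args` options are ones that bootstrap.sh accepts.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints (does not execute) a bootstrap.sh invocation.
# Arguments: cluster name, API server endpoint, base64 cluster CA, node labels.
render_bootstrap_command() {
  local cluster_name="$1" endpoint="$2" cluster_ca="$3" node_labels="$4"
  printf '%s %s --apiserver-endpoint %s --b64-cluster-ca %s --kubelet-extra-args %s\n' \
    /etc/eks/bootstrap.sh "$cluster_name" "$endpoint" "$cluster_ca" \
    "'--node-labels=${node_labels}'"
}

# Placeholder values only; none of these come from a real cluster.
render_bootstrap_command demo-cluster https://example.invalid BASE64CA example.com/role=worker
```

Without a call along these lines in the user data, labels configured on the node group never reach the kubelet, which is the gap described above.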

@stevehipwell
Contributor Author

@antonbabenko could you take a look at this PR?

@antonbabenko
Member

Sorry, I don't have the capacity to take a look at this one for real (run tests, think about edge cases, compatibility, etc).

Maybe @daroga0002 can put this in his queue for reviews?

@daroga0002
Contributor

Maybe @daroga0002 can put this in his queue for reviews?

Yup, I have it on my TODO list.

@daroga0002
Contributor

@stevehipwell sorry for the delay, I have been sick for the last few days. I will find some time for this during the weekend, but could you rebase this PR to make it easier to test?

@stevehipwell
Contributor Author

@daroga0002 I hope you're feeling better now? I've rebased.

@daroga0002
Contributor

@stevehipwell thank you for your contribution 🎉. I reviewed the PR and I think we can merge it, as this should solve multiple issues with custom AMI managed node groups.

@daroga0002
Contributor

@antonbabenko let's merge this and hold off on releasing.

Member

@antonbabenko antonbabenko left a comment

There is a minor comment about the shell script. The rest looks good to me.

# Allow user supplied pre userdata code
sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" ${kubelet_extra_args}"' /etc/eks/bootstrap.sh
%{endif ~}
%{if length(pre_userdata) > 0 ~}
Member

I don't think we need to have this if/endif block but just include ${pre_userdata} always as it was before.

Contributor Author

It would work without the if block, but it's tidier to use it so the resulting script has consistent white space. It also makes it clear in the template that it's an optional variable.

Member

Please remove this if. It will be tidy without it. I will merge it right away.
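
As an aside for readers unfamiliar with the `sed` line in the hunk above, its effect can be reproduced on a throwaway file. This is a minimal sketch, assuming GNU sed: the temporary file is only a stand-in for /etc/eks/bootstrap.sh (the real script is much longer), and the `--node-labels=example=true` value is a hypothetical rendering of `${kubelet_extra_args}`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for /etc/eks/bootstrap.sh, reduced to the one line the sed targets.
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
KUBELET_EXTRA_ARGS=""
echo "kubelet args:${KUBELET_EXTRA_ARGS}"
EOF

# Hypothetical value that Terraform would render for ${kubelet_extra_args}.
extra_args='--node-labels=example=true'

# Same expression as in the template: append a new line after the line that
# starts with KUBELET_EXTRA_ARGS=, extending the variable with the extra args.
sed -i "/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=\" ${extra_args}\"" "$demo_script"

# Run the patched stand-in to show that the extra arguments are now included.
patched_output="$(bash "$demo_script")"
echo "$patched_output"
rm -f "$demo_script"
```

The patch is applied to the script on disk before it runs, so the extra arguments are picked up without the caller having to pass them on the command line.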

# Set variables
API_SERVER_URL=${cluster_endpoint}
B64_CLUSTER_CA=${cluster_ca}
K8S_CLUSTER_DNS_IP=172.20.0.10
Contributor

I found one issue: this should be a variable.

service_ipv4_cidr = var.cluster_service_ipv4_cidr

Contributor Author

@daroga0002 I think you're right. I'm currently on vacation, so I can't make this change until I'm back next week. If you can modify my PR, that's fine. Otherwise, setting the optimised variable to false would let this be set by providing the whole user data; I could then update the other PR to make this customisable, as it covers this type of behaviour?

Contributor

np, we can wait, have a great vacation 😄

Contributor Author

@daroga0002 this should be ready to merge.

Contributor

@stevehipwell are you sure you added the change? I still see:
[image]

Contributor Author

@daroga0002 sorry, manic day in the office. I only rebased; I'll add the changes first thing in the morning.

Contributor Author

@daroga0002 I've updated the userdata template to be simpler and so that all custom AMIs use the same initial logic for setting the variables (this is designed ready to support #1577). After checking the bootstrap.sh code I've removed the manual setting of K8S_CLUSTER_DNS_IP as bootstrap.sh will handle this correctly; #1577 will enable this to be fully customisable.

Contributor

@daroga0002 daroga0002 left a comment

@antonbabenko it looks good; as far as I'm concerned we can merge it. This will not be a breaking or impactful change, as it will silently update the launch template and the autoscaling group (users will need to roll instances to use the new features).

If we merge this then I think we can make a release (there is also #1584, which was tested and can be included in the release).

@stevehipwell
Contributor Author

@antonbabenko I've removed the if.

@antonbabenko antonbabenko merged commit f198efd into terraform-aws-modules:master Oct 8, 2021
@stevehipwell stevehipwell deleted the mng-custom-ami branch October 8, 2021 14:41
@antonbabenko
Member

Thanks @stevehipwell for the contribution!

I think it will be released at the beginning of next week. Waiting for the last PR by @daroga0002 before the release.

@JordyWTC

Hi all,

This change is now breaking my automated construction of environments:
│ Error: error reading EKS Cluster (ew1-uat01-eks-cluster): couldn't find resource

│ on .terraform/modules/eks/modules/node_groups/locals.tf line 1, in data "aws_eks_cluster" "default":
│ 1: data "aws_eks_cluster" "default" {

This is due to the data block that was introduced (without a dependency) in locals.tf in the node_groups module.

@stevehipwell
Contributor Author

@JordyWTC what do you mean about the data resource being added without a dependency? Also I assume that you're running this off a branch as this has not been released yet?

@JordyWTC

Hi @stevehipwell,
I was pulling in the master ref; I have fixed the problem for now by pinning to the 17.20.0 ref.

In locals.tf a data block was added for aws_eks_cluster, but if the EKS cluster still needs to be created, then you will get the error that the resource cannot be found. So I would actually expect the data block to reference module.aws_eks_cluster.id rather than var.cluster_name.

@stevehipwell
Contributor Author

@JordyWTC that's not how I read the behaviour. The cluster_name variable is coming from a locals expression which is dependent on the cluster resource. Do you happen to be setting create_eks = false?

@JordyWTC

JordyWTC commented Oct 11, 2021

Hi @stevehipwell, yes, you are right; I missed that cluster_name is derived in locals.tf.
But then it almost seems as if the var.cluster_name value is being taken from the value I pass in to the eks module.
What I see happening is that the EKS cluster is not even being created; it immediately fails with this error:

│ Error: error reading EKS Cluster (ew1-uat01-eks-cluster): couldn't find resource
│
│ on .terraform/modules/eks/modules/node_groups/locals.tf line 1, in data "aws_eks_cluster" "default":
│ 1: data "aws_eks_cluster" "default" {

But it does have the cluster name set (which I pass in via the module), which is strange, as no cluster has been created yet and so the local value should be empty.

module "eks" {
  source                        = "git::https://github.com/terraform-aws-modules/terraform-aws-eks.git?ref=v17.20.0"
  cluster_name                  = var.name
  cluster_version               = var.cluster_version
  subnets                       = var.subnets
  vpc_id                        = var.vpc_id
  manage_worker_iam_resources   = false
  #config_output_path            = var.config_output_path
  write_kubeconfig              = var.write_kubeconfig
  manage_aws_auth               = var.manage_aws_auth
  map_users                     = var.map_users
  map_roles                     = var.map_roles
  manage_cluster_iam_resources  = var.manage_cluster_iam_resources
  cluster_iam_role_name         = var.cluster_iam_role_name
  cluster_enabled_log_types     = var.cluster_enabled_log_types
  cluster_log_retention_in_days = var.cluster_log_retention_in_days
  tags                          = var.tags
}

I am not setting the create_eks value, as it is true by default.

@stevehipwell
Contributor Author

@JordyWTC I'm not sure how you can be seeing the behaviour you are, not that I'm saying you're not. The module includes a custom depends on hack which should stop the MNGs being created before the cluster is ready. Do you have any more details such as your TF version etc?

@JordyWTC

Hi @stevehipwell,
I am also using terragrunt :
Terraform version :
'0.15.0'
Terragrunt version :
'0.28.18'

I agree with you that I should not be seeing this behaviour. I just ran another test, and it does appear that var.cluster_name in the node_groups module is getting its value from what I am sending via terragrunt -> terraform.

Running terragrunt apply from the eks part.

bisschopj@wtcjordy:~/Git/makro/terraform-infrastructure/makrotest/eu-west-1/int01/eks$ tgt apply
╷
│ Error: error reading EKS Cluster (ew1-int01-eks-cluster): couldn't find resource
│ 
│   with module.eks.module.node_groups.data.aws_eks_cluster.default,
│   on .terraform/modules/eks/modules/node_groups/locals.tf line 1, in data "aws_eks_cluster" "default":
│    1: data "aws_eks_cluster" "default" {
│ 
╵
Releasing state lock. This may take a few moments...
ERRO[0013] Hit multiple errors:
Hit multiple errors:
exit status 1 

Terragrunt code in eks folder :

terraform {
  #source = "git::https://wtcnl@dev.azure.com/wtcnl/makro/_git/terraform-modules//aws/resource/eks?ref=develop"
  source = "../../../../../terraform-modules//aws/resource/eks"
}

include {
  path = find_in_parent_folders()
}

dependency "vpc" {
  config_path = "../vpc"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    vpc_id          = "fake-vpc-id",
    private_subnets = "fake-pvsubnet-id"
  }
}

dependency "eks_iam" {
  config_path = "../eks_iam"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    eks_cluster_iam_role_name = "fake-eks_cluster_iam_role_name",
  }
}

dependency "sg_alb_http" {
  config_path = "../sg_alb_http"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    security_group_id = "fake-security-group-id"
  }
}

dependency "sg_alb_https" {
  config_path = "../sg_alb_https"

  mock_outputs_allowed_terraform_commands = ["validate", "plan", "destroy"]
  mock_outputs = {
    security_group_id = "fake-security-group-id"
  }
}

inputs = {
  aws_region  = local.region
  aws_account = local.account

  name                          = "${local.region_abbr}-${local.environment}-eks-cluster"
  cluster_version               = local.eks_cluster_version
  write_kubeconfig              = false
  subnets                       = dependency.vpc.outputs.private_subnets
  vpc_id                        = dependency.vpc.outputs.vpc_id
  manage_cluster_iam_resources  = false
  cluster_iam_role_name         = dependency.eks_iam.outputs.eks_cluster_iam_role_name
  map_roles                     = local.roles_map
  map_users                     = local.admin_users_map
  cluster_enabled_log_types     = local.cluster_enabled_log_types
  cluster_log_retention_in_days = local.cluster_log_retention_in_days
}

The Terraform module that terragrunt calls then does the following, as also described in my earlier comment:

module "eks" {
  source                        = "git::https://github.com/terraform-aws-modules/terraform-aws-eks.git?ref=v17.20.0"
  cluster_name                  = var.name
  cluster_version               = var.cluster_version
  subnets                       = var.subnets
  vpc_id                        = var.vpc_id
  manage_worker_iam_resources   = false
  #config_output_path            = var.config_output_path
  write_kubeconfig              = var.write_kubeconfig
  manage_aws_auth               = var.manage_aws_auth
  map_users                     = var.map_users
  map_roles                     = var.map_roles
  manage_cluster_iam_resources  = var.manage_cluster_iam_resources
  cluster_iam_role_name         = var.cluster_iam_role_name
  cluster_enabled_log_types     = var.cluster_enabled_log_types
  cluster_log_retention_in_days = var.cluster_log_retention_in_days
  tags                          = var.tags
}

@stevehipwell
Contributor Author

@daroga0002 does this look like anything you've seen before?

@JordyWTC do you get the same behaviour if you call Terraform directly?

@endrec

endrec commented Oct 11, 2021

@JordyWTC This PR was merged after v17.20.0, so the issue you are seeing is probably from an earlier change, therefore should go in its own issue.

(The last commit on v17.20.0 was 17 Sep; this PR was merged on 8 Oct.)

@daroga0002
Contributor

@daroga0002 does this look like anything you've seen before?

I suspect there is some magic on the user's side, as in the eks module call he is not even using node groups or worker groups.

@JordyWTC

@endrec, you are right; I put in the 17.20.0 ref today to make it work, sorry for the confusion. The problem happens when I use the master ref.

@stevehipwell @daroga0002 with Terraform itself it does indeed work, so this seems to be a terragrunt problem, or a problem with the way we are using terragrunt in this specific scenario, so I am not sure whether other terragrunt users will experience the same issue.

Might a fix be to use module.aws_eks_cluster.id in the data block instead of var.cluster_name?

@stevehipwell
Contributor Author

@JordyWTC that is exactly how it's currently working.

@JordyWTC

JordyWTC commented Oct 11, 2021

@stevehipwell I reworked my terragrunt setup (latest version) to source the module directly from this GitHub repository, and I am running into the same problem. That means that everyone who is using terragrunt and sourcing this module will run into the same problem.
From what I see here, terragrunt also initializes the fargate and node_groups modules, as it also looks in subfolders:

Initializing modules...
- fargate in modules/fargate
- node_groups in modules/node_groups

I have no idea why terragrunt in this case is already trying to read the data block while there is a dependency in node_groups.tf; I will raise this with terragrunt as well.
I do understand that the local is what sets the cluster_name variable, but as terragrunt is not picking this up, a workaround would be to use the module input instead of the variable, as this would make it terragrunt friendly.

For me the workaround for now is just to pin the ref to an earlier version.
Thanks for your help!

@stevehipwell
Contributor Author

@JordyWTC the implementation of local.cluster_name is a safe version of module.aws_eks_cluster.id. I suspect that terragrunt is doing odd things with the nested modules, so this sounds like a terragrunt issue.

@JordyWTC

Thanks @stevehipwell , i will raise it with TG.

@endrec

endrec commented Oct 11, 2021

@JordyWTC, I happen to use terragrunt in my setup, so for a quick sanity check I updated the module source to master and ran a terragrunt plan successfully.

➜ terragrunt --version
terragrunt version v0.32.4

➜ terragrunt version
Terraform v1.0.7
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v3.62.0
+ provider registry.terraform.io/hashicorp/cloudinit v2.2.0
+ provider registry.terraform.io/hashicorp/helm v2.3.0
+ provider registry.terraform.io/hashicorp/kubernetes v2.5.0
+ provider registry.terraform.io/hashicorp/local v2.1.0
+ provider registry.terraform.io/hashicorp/null v3.1.0
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/hashicorp/template v2.2.0
+ provider registry.terraform.io/terraform-aws-modules/http v2.4.1

Your version of Terraform is out of date! The latest version
is 1.0.8. You can update by downloading from https://www.terraform.io/downloads.html

@JordyWTC

Thanks for the info @endrec, I'll be diving further into my code :)

@daroga0002
Contributor

I found an issue with a missing conditional; I opened #1632 and am working on a fix.

lisfo4ka pushed a commit to lisfo4ka/terraform-aws-eks that referenced this pull request Oct 12, 2021
pjrm added a commit to pjrm/terraform-aws-eks that referenced this pull request Dec 12, 2021
… is EKS optimized and has a custom ipv4 CIDR

The PR (terraform-aws-modules#1580) is passing the "apiserver-endpoint" and "b64-cluster-ca", which leaves SERVICE_IPV4_CIDR empty (https://github.com/awslabs/amazon-eks-ami/blob/v20211206/files/bootstrap.sh#L366). Because of that, the script always falls back to 10.100.0.10 or 172.20.0.10.

Defining the ipv4 cidr ensures that the bootstrap script configures the DNS server correctly on the kubelet service, allowing pods to resolve DNS names.
pjrm added a commit to pjrm/terraform-aws-eks that referenced this pull request Dec 12, 2021
…raform-aws-modules#1717)

The PR (terraform-aws-modules#1580) is passing the "apiserver-endpoint" and "b64-cluster-ca", which leaves SERVICE_IPV4_CIDR empty (https://github.com/awslabs/amazon-eks-ami/blob/v20211206/files/bootstrap.sh#L366). Because of that, the script always falls back to 10.100.0.10 or 172.20.0.10.

Defining the ipv4 cidr ensures that the bootstrap script configures the DNS server correctly on the kubelet service, allowing pods to resolve DNS names.
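
The fallback described in this commit message can be modelled roughly as below. This is a simplified sketch, not the actual bootstrap.sh code: the function name `derive_dns_cluster_ip` and its two arguments are hypothetical, and choosing between 10.100.0.10 and 172.20.0.10 by checking whether the VPC uses a 10.x range is an assumption about how the script decides.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical model of the DNS IP selection.
# $1: service IPv4 CIDR (may be empty), $2: space-separated VPC CIDR blocks.
derive_dns_cluster_ip() {
  local service_ipv4_cidr="${1:-}"
  local vpc_cidrs="${2:-}"
  local cidr

  if [[ -n "$service_ipv4_cidr" ]]; then
    # Drop the last octet and the mask (10.200.0.0/16 -> 10.200.0), then use .10.
    echo "${service_ipv4_cidr%.*}.10"
    return
  fi

  # No service CIDR supplied: fall back to one of the two defaults (assumed rule).
  for cidr in $vpc_cidrs; do
    if [[ "$cidr" == 10.* ]]; then
      echo "172.20.0.10"
      return
    fi
  done
  echo "10.100.0.10"
}

derive_dns_cluster_ip "10.200.0.0/16" "10.0.0.0/16"
derive_dns_cluster_ip "" "10.0.0.0/16"
```

With a custom service CIDR such as the hypothetical 10.200.0.0/16, neither default matches the cluster DNS service address, which is why pods could not resolve names until the CIDR was passed through.
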
@github-actions

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022