
Timeout on initial apply on wait_for_cluster null_resource, then inconsistent final plan on re-apply #990

Closed
1 of 4 tasks
glueric opened this issue Aug 27, 2020 · 11 comments



glueric commented Aug 27, 2020

I have issues

module.dev-eks-module.null_resource.wait_for_cluster[0] (local-exec): TIMEOUT


Error: Error running command 'for i in `seq 1 60`; do wget --no-check-certificate -O - -q $ENDPOINT/healthz >/dev/null && exit 0 || true; sleep 5; done; echo TIMEOUT && exit 1': exit status 1. Output: TIMEOUT

I'm submitting a...

  • bug report
  • feature request
  • support request - read the FAQ first!
  • kudos, thank you, warm fuzzy

What is the current behavior?

When I apply the module, seemingly everything gets created, but the wait_for_cluster null resource times out. Checking the cluster in EKS afterwards, I saw that the node group hadn't been created. When I re-apply the module, I get an 'Inconsistent final plan' error:

Terraform will perform the following actions:

  # module.dev-eks-module.kubernetes_config_map.aws_auth[0] will be created
  + resource "kubernetes_config_map" "aws_auth" {
      + data = {
          + "mapAccounts" = jsonencode([])
          + "mapRoles"    = <<~EOT
                - "groups":
                  - "system:bootstrappers"
                  - "system:nodes"
                  "rolearn": "arn:aws:iam::519026510774:role/dev-eks-module20200827205656039700000007"
                  "username": "system:node:{{EC2PrivateDNSName}}"
            EOT
          + "mapUsers"    = jsonencode([])
        }
      + id   = (known after apply)

      + metadata {
          + generation       = (known after apply)
          + name             = "aws-auth"
          + namespace        = "kube-system"
          + resource_version = (known after apply)
          + self_link        = (known after apply)
          + uid              = (known after apply)
        }
    }

  # module.dev-eks-module.null_resource.wait_for_cluster[0] is tainted, so must be replaced
+/- resource "null_resource" "wait_for_cluster" {
      ~ id = "145238471061234140" -> (known after apply)
    }

Plan: 2 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes


Error: Provider produced inconsistent final plan

When expanding the plan for
module.dev-eks-module.null_resource.wait_for_cluster[0] to include new values
learned so far during apply, provider "registry.terraform.io/hashicorp/null"
changed the planned action from CreateThenDelete to DeleteThenCreate.

If this is a bug, how to reproduce? Please include a code sample if relevant.

I just copied the sample code from the readme but updated it with the version locking from the latest basic sample.

terraform {
  required_version = ">= 0.12.0"
}

provider "aws" {
  version = ">= 2.28.1"
  region  = var.aws_region
}

provider "random" {
  version = "~> 2.1"
}

provider "local" {
  version = "~> 1.2"
}

provider "null" {
  version = "~> 2.1"
}

provider "template" {
  version = "~> 2.1"
}

data "aws_eks_cluster" "cluster" {
  name = module.dev-eks-module.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.dev-eks-module.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.cluster.token
  load_config_file       = false
  version                = "~> 1.11"
}

module "dev-eks-module" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "v12.2.0"
  cluster_name    = "dev-eks-module"
  cluster_version = "1.17"
  subnets         = ["subnet-4abbbf66", "subnet-9599a0cf", "subnet-c977b382"]
  vpc_id          = "vpc-b6a006d0"

  worker_groups = [
    {
      instance_type = "t3.medium"
      asg_max_size  = 5
    }
  ]
}

What's the expected behavior?

The cluster should get created properly with a node group and terraform should apply successfully.

Are you able to fix this problem and submit a PR? Link here if you have already.

Environment details

  • Affected module version: v12.2.0
  • OS: MacOS Mojave 10.14.6
  • Terraform version: v0.13.0

Any other relevant info

@dpiddockcmp (Contributor)

There's a Terraform bug here that we're triggering in a few ways. This is the same bug as #939, but that one had a logical fix. Terraform should not be marking this resource as create-before-destroy. Not sure if it's related to #984, as that's also happening on TF 0.12.29.

In your particular case, does dropping the existing null resource allow the plan to finish applying?
terraform state rm module.dev-eks-module.null_resource.wait_for_cluster[0]


glueric commented Aug 28, 2020

I deleted the existing null resource and ran an apply again:

module.dev-eks-module.null_resource.wait_for_cluster[0]: Still creating... [5m0s elapsed]
module.dev-eks-module.null_resource.wait_for_cluster[0] (local-exec): TIMEOUT


Error: Error running command 'for i in `seq 1 60`; do wget --no-check-certificate -O - -q $ENDPOINT/healthz >/dev/null && exit 0 || true; sleep 5; done; echo TIMEOUT && exit 1': exit status 1. Output: TIMEOUT

I wonder if there is some problem with my configuration? I specified only private subnets in the subnets variable, as that is what I saw the example doing. Should I provide my public subnets as well?

@ayush-sharma-devops

I have the same problem:

module.my-cluster.null_resource.wait_for_cluster[0]: Still creating... [27m50s elapsed]
^CInterrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
Stopping operation...


Error: Error running command 'for i in `seq 1 60`; do wget --no-check-certificate -O - -q $ENDPOINT/healthz >/dev/null && exit 0 || true; sleep 5; done; echo TIMEOUT && exit 1': signal: interrupt. Output: 

@dpiddockcmp (Contributor)

What happens when you run the wget --no-check-certificate -O - -q $ENDPOINT/healthz command manually from your deployment environment?


glueric commented Aug 31, 2020

Hmm it doesn't print anything. It seems like there's nothing in the $ENDPOINT var.

to-m-mbperic:dev-eks-module eric.rosendale$ wget --no-check-certificate -O - -q $ENDPOINT/healthz
to-m-mbperic:dev-eks-module eric.rosendale$ echo $ENDPOINT

to-m-mbperic:dev-eks-module eric.rosendale$ 

@dpiddockcmp (Contributor)

You need to set ENDPOINT to your Kubernetes API endpoint's address when testing manually.
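For example, something like this (rough sketch; the cluster name is taken from your config above and it assumes the AWS CLI is configured for the right account and region):

# Look up the cluster's API endpoint, then retry the health check without -q so the response is visible
ENDPOINT=$(aws eks describe-cluster --name dev-eks-module --query 'cluster.endpoint' --output text)
wget --no-check-certificate -O - $ENDPOINT/healthz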


glueric commented Aug 31, 2020

Ok, after setting the endpoint var I also realized the -q flag suppresses output, so I wouldn't have seen the response anyway. I ran it again and it seems like it's failing to establish an SSL connection:

OpenSSL: error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
Unable to establish SSL connection.

@dpiddockcmp (Contributor)

Your wget is too old. Either update the environment you're running in, or switch to curl by changing wait_for_cluster_cmd.
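Something along these lines in the module block, for example (untested sketch; $ENDPOINT is substituted by the module's wait_for_cluster resource, so keep that literal, and adjust the curl flags to taste):

module "dev-eks-module" {
  # ... existing arguments from your config above ...

  # Use curl instead of wget for the health-check loop
  wait_for_cluster_cmd = "for i in `seq 1 60`; do curl -s -f -k $ENDPOINT/healthz >/dev/null && exit 0 || true; sleep 5; done; echo TIMEOUT && exit 1"
}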

@ayush-sharma-devops

My issue was resolved. It turned out to be a networking problem on my side: my execution environment couldn't reach the API server on the CIDRs I had configured. As a test, I opened the cluster endpoint to public access and everything worked as expected. I then sorted out the networking between my execution environment and the cluster, and everything works.
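For anyone hitting the same thing, the relevant knobs were roughly these (sketch only; variable names as documented for the module, values are placeholders):

module "my-cluster" {
  # ... existing arguments ...

  # Keep private access on and open the public endpoint so the machine running
  # Terraform can reach the API server; optionally restrict the source CIDRs.
  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = true
  cluster_endpoint_public_access_cidrs = ["203.0.113.0/24"] # example CIDR, replace with your own
}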


glueric commented Sep 1, 2020

That was it! Thank you! I updated wget to v1.20.3 and the terraform apply was able to finish.

@glueric glueric closed this as completed Sep 1, 2020
@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 25, 2022