
Cluster does not recreate or reattach node pools if a change requires cluster respin. #9220

Closed
huang-jy opened this issue May 24, 2021 · 5 comments · Fixed by GoogleCloudPlatform/magic-modules#4842, hashicorp/terraform-provider-google-beta#3314 or #9309

huang-jy commented May 24, 2021


Terraform Version

Terraform v0.15.4
on linux_amd64
+ provider registry.terraform.io/hashicorp/google v3.68.0
+ provider registry.terraform.io/hashicorp/google-beta v3.68.0

Affected Resource(s)

  • google_container_cluster

Terraform Configuration Files

(Please see the notes below.)

resource "google_service_account" "terraform-sa" {
  account_id   = "terraform-gke"
  display_name = "Service Account For Terraform To Make GKE Cluster"
}

resource "google_container_cluster" "cluster" {
  name               = "bug-report"
  location           = var.zone
  min_master_version = var.cluster_version
  project            = var.project

  node_config {
    preemptible  = true
    machine_type = "e2-micro"

    # Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
    service_account = google_service_account.terraform-sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }

  }


  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  addons_config {
    horizontal_pod_autoscaling {
      disabled = false
    }
    http_load_balancing {
      disabled = false
    }
  }

  maintenance_policy {
    daily_maintenance_window {
      start_time = "00:00"
    }
  }

  release_channel {
    channel = "UNSPECIFIED"
  }

  vertical_pod_autoscaling {
    enabled = true
  }


  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }
}
resource "google_container_node_pool" "nodes" {
  name       = "general-purpose"
  location   = var.zone
  project    = var.project
  cluster    = google_container_cluster.cluster.name
  node_count = 1
  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }
  management {
    ## Both must be true on STABLE channel
    auto_repair  = false
    auto_upgrade = false
  }

  # version = var.cluster_version

  node_config {
    preemptible  = false
    machine_type = "e2-standard-4"

    # Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
    service_account = google_service_account.terraform-sa.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]

    metadata = {
      disable-legacy-endpoints = "true"
    }

  }

  upgrade_settings {
    max_surge       = 1 ## Only allow one extra node during upgrade
    max_unavailable = 0 ## Don't take out any current nodes until a new one is healthy
  }

  lifecycle {
    ignore_changes = [
      # Ignore changes to initial_node_count; otherwise the node pool will be
      # recreated whenever initial_node_count drifts between the environment
      # and the Terraform configuration
      initial_node_count,
      node_count,
      version
    ]
  }


}
provider "google" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}

provider "google-beta" {
  project     = var.project
  region      = var.region
  zone        = var.zone
  credentials = file("credentials.json")
}
variable "project" {
  default = "REPLACE_ME"
}

variable "region" {
  default = "europe-west2"
}

variable "zone" {
  default = "europe-west2-a"
}

variable "cluster_version" {
  default = "1.19.9-gke.1900"
}
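
A note on the respin mechanism: the repeated recreation in the steps below appears to come from defining node_config on the google_container_cluster itself while also setting remove_default_node_pool = true. The cluster-level node_config only applies to the default node pool, so once that pool is deleted the provider sees a permanent diff on the ForceNew node_config block, and every subsequent plan proposes recreating the cluster.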

Expected Behavior

Terraform should recreate the node pool, or reattach the existing one, when the cluster is recreated

Actual Behavior

Terraform deleted the cluster and node pool and respun the cluster, but did not (re)create the node pool

Steps to Reproduce

  1. Save the files above
  2. Run terraform plan -out tfplan, then terraform apply tfplan
  3. Check in the console; you will have a cluster with nodes. All good so far ✔️
  4. Run terraform plan -out tfplan again; you will see that the cluster will be respun (the .tf files are structured to induce a cluster respin on every plan in order to demonstrate this issue)
  5. Run terraform apply tfplan. Terraform will delete the cluster, but does not indicate that the node pools are being deleted (they are)
  6. The cluster respins, but no node pools are attached to it, leaving a zero-node cluster
    (screenshot: cluster with zero nodes)
  7. Run terraform plan -out tfplan and terraform apply tfplan again
  8. Terraform will delete and recreate the cluster again, and this time it will create and attach a new node pool to it
    (screenshot: cluster with the recreated node pool)

Important Factoids

This problem appears to stem from the fact that there is no link between the cluster and the node pools. A change that forces the cluster to be deleted and recreated should either force the node pools to be deleted and recreated as well, or the node pools should automatically be reattached to the new cluster after it is created.
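
As an aside, Terraform 1.2 and later (which postdates this report) can express that missing link explicitly with replace_triggered_by. A minimal sketch, assuming the same resources as in the configuration above (in practice this lifecycle block would be merged with the existing ignore_changes one):

resource "google_container_node_pool" "nodes" {
  # ... arguments as in the original configuration ...

  lifecycle {
    # Replace this node pool whenever the cluster's id changes, i.e. whenever
    # the cluster itself is destroyed and recreated. Requires Terraform >= 1.2,
    # which was not yet available when this issue was filed.
    replace_triggered_by = [
      google_container_cluster.cluster.id
    ]
  }
}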

melinath (Collaborator) commented Jun 3, 2021

I think this is a consequence of the behavior described in hashicorp/terraform#24663. We could allow a workaround here by supporting cluster = google_container_cluster.cluster.id (which currently doesn't work).
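
A minimal sketch of what that workaround would look like, assuming the provider is changed to accept the cluster's full resource id (projects/{project}/locations/{location}/clusters/{name}) in the cluster argument, which is what the linked fixes later implemented:

resource "google_container_node_pool" "nodes" {
  name = "general-purpose"

  # Reference the cluster by its computed id instead of its configured name.
  # When the cluster is planned for replacement, id shows as "(known after
  # apply)", which registers as a change to this ForceNew argument and forces
  # the node pool to be replaced along with the cluster.
  cluster = google_container_cluster.cluster.id

  # ... remaining arguments as in the original configuration ...
}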

huang-jy (Author) commented Jun 3, 2021

Yes, it does indeed sound similar to the example referenced in that ticket. Does this sound like a Terraform core issue, or something in the Google provider?

melinath (Collaborator) commented Jun 3, 2021

From the provider side, I think we can make it so that people can supply google_container_cluster.cluster.id to the node pool to get the behavior you're looking for. Being able to get that behavior with google_container_cluster.cluster.name would definitely be a core issue.
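
For context on why the two behave differently: name is set in configuration and keeps the same value across a replacement, so Terraform core sees no diff on the node pool, while id is a computed attribute that becomes unknown while the cluster is being replaced, so the node pool's ForceNew cluster argument registers a change and the node pool is replaced too. Making the name-based reference propagate replacement would require the core change tracked in hashicorp/terraform#24663.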

huang-jy (Author) commented Jun 3, 2021

Thank you

github-actions bot commented Jul 5, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jul 5, 2021