Roll instance template changes to worker managed instance groups
* When a worker managed instance group's (MIG) instance template
changes (including machine type, disk size, or Butane snippets
but excluding new OS images), use Google Cloud's rolling update features
to ensure instances match declared state
* Ignore new OS images since Fedora CoreOS and Flatcar Linux nodes
already auto-update and reboot themselves
* Rolling updates will create surge instances, wait for health
checks, then delete old instances (0 unavailable instances)
* Instances are replaced to ensure new Ignition/Butane snippets
are respected
* Add managed instance group autohealing (i.e. health checks) to
ensure new instances' Kubelet is running

Renames

* Name apiserver and kubelet health checks consistently
* Rename MIG from `${var.name}-worker-group` to `${var.name}-worker`

Rel: https://cloud.google.com/compute/docs/instance-groups/rolling-out-updates-to-managed-instance-groups
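In Terraform terms, the mechanism (abridged from the worker diffs below) is an `update_policy` plus an `auto_healing_policies` block on the worker managed instance group:

resource "google_compute_region_instance_group_manager" "workers" {
  # ...existing MIG arguments (name, instance_template, target_size, etc.)...

  # Proactively replace instances when the instance template changes: create
  # up to 3 surge instances, keep 0 unavailable, and REPLACE (rather than
  # restart) so new Ignition user-data takes effect on first boot
  update_policy {
    type                  = "PROACTIVE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
    minimal_action        = "REPLACE"
  }

  # Recreate workers whose Kubelet fails SSL health checks on port 10250
  # (probed every 30s, unhealthy after 6 consecutive failures, i.e. 3 min)
  auto_healing_policies {
    health_check      = google_compute_health_check.worker.id
    initial_delay_sec = 120
  }
}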
dghubble committed Aug 14, 2022
1 parent 6facfca commit 20b76d6
Showing 7 changed files with 138 additions and 23 deletions.
CHANGES.md (16 changes: 14 additions & 2 deletions)
@@ -32,11 +32,23 @@ version: 1.0.0
 
 ### AWS
 
-* Rename worker autoscaling group `${cluster_name}-worker`
-* Rename launch configuration `${cluster_name}-worker`
+* Rename worker autoscaling group `${cluster_name}-worker` ([#1202](https://github.com/poseidon/typhoon/pull/1202))
+* Rename launch configuration to `${cluster_name}-worker` (instead of a random id)
 
 ### Google
 
+* [Roll](https://cloud.google.com/compute/docs/instance-groups/rolling-out-updates-to-managed-instance-groups) instance template changes to worker managed instance groups ([#1207](https://github.com/poseidon/typhoon/pull/1207)) (**important**)
+  * Changes to worker instance templates roll out by gradually replacing instances
+  * Automatic rollouts create surge instances, wait for Kubelet health checks, then delete old instances (0 unavailable instances)
+  * Changing `worker_type`, `disk_size`, `preemptible`, or Butane `worker_snippets` on existing worker nodes will replace instances
+  * New OS images or changing `os_stream` will be ignored, to allow Fedora CoreOS or Flatcar Linux to keep themselves updated
+  * Previously, new instance templates were made in the same way, but not applied to instances unless manually replaced
+* Add health checks to worker managed instance groups (i.e. "autohealing") ([#1207](https://github.com/poseidon/typhoon/pull/1207))
+  * Use SSL health checks to probe the Kubelet every 30s
+  * Replace worker nodes that fail the health check 6 times (3 min)
+* Name `kube-apiserver` and `kubelet` health checks consistently ([#1207](https://github.com/poseidon/typhoon/pull/1207))
+  * Use names `${cluster_name}-apiserver-health` and `${cluster_name}-kubelet-health`
+* Rename managed instance group from `${cluster_name}-worker-group` to `${cluster_name}-worker` ([#1207](https://github.com/poseidon/typhoon/pull/1207))
 * Fix bug provisioning clusters with multiple controller nodes ([#1195](https://github.com/poseidon/typhoon/pull/1195))
 
 ### Addons
google-cloud/fedora-coreos/kubernetes/apiserver.tf (8 changes: 4 additions & 4 deletions)
@@ -75,18 +75,18 @@ resource "google_compute_instance_group" "controllers" {
   )
 }
 
-# TCP health check for apiserver
+# Health check for kube-apiserver
 resource "google_compute_health_check" "apiserver" {
-  name        = "${var.cluster_name}-apiserver-tcp-health"
-  description = "TCP health check for kube-apiserver"
+  name        = "${var.cluster_name}-apiserver-health"
+  description = "Health check for kube-apiserver"
 
   timeout_sec        = 5
   check_interval_sec = 5
 
   healthy_threshold   = 1
   unhealthy_threshold = 3
 
-  tcp_health_check {
+  ssl_health_check {
     port = "6443"
   }
 }
google-cloud/fedora-coreos/kubernetes/network.tf (18 changes: 18 additions & 0 deletions)
@@ -196,6 +196,24 @@ resource "google_compute_firewall" "allow-ingress" {
   target_tags = ["${var.cluster_name}-worker"]
 }
 
+resource "google_compute_firewall" "google-kubelet-health-checks" {
+  name    = "${var.cluster_name}-kubelet-health"
+  network = google_compute_network.network.name
+
+  allow {
+    protocol = "tcp"
+    ports    = [10250]
+  }
+
+  # https://cloud.google.com/compute/docs/instance-groups/autohealing-instances-in-migs
+  source_ranges = [
+    "35.191.0.0/16",
+    "130.211.0.0/22",
+  ]
+
+  target_tags = ["${var.cluster_name}-worker"]
+}
+
 resource "google_compute_firewall" "google-ingress-health-checks" {
   name    = "${var.cluster_name}-ingress-health"
   network = google_compute_network.network.name
google-cloud/fedora-coreos/kubernetes/workers/workers.tf (46 changes: 39 additions & 7 deletions)
@@ -1,6 +1,6 @@
 # Managed instance group of workers
 resource "google_compute_region_instance_group_manager" "workers" {
-  name        = "${var.name}-worker-group"
+  name        = "${var.name}-worker"
   description = "Compute instance group of ${var.name} workers"
 
   # instance name prefix for instances in the group
@@ -11,6 +11,16 @@ resource "google_compute_region_instance_group_manager" "workers" {
     instance_template = google_compute_instance_template.worker.self_link
   }
 
+  # Roll out MIG instance template changes by replacing instances.
+  # - Surge to create new instances, then delete old instances.
+  # - Replace ensures new Ignition is picked up
+  update_policy {
+    type                  = "PROACTIVE"
+    max_surge_fixed       = 3
+    max_unavailable_fixed = 0
+    minimal_action        = "REPLACE"
+  }
+
   target_size  = var.worker_count
   target_pools = [google_compute_target_pool.workers.self_link]
 
@@ -23,21 +23,45 @@ resource "google_compute_region_instance_group_manager" "workers" {
     name = "https"
     port = "443"
   }
+
+  auto_healing_policies {
+    health_check      = google_compute_health_check.worker.id
+    initial_delay_sec = 120
+  }
 }
 
+# Health check for worker Kubelet
+resource "google_compute_health_check" "worker" {
+  name        = "${var.name}-kubelet-health"
+  description = "Health check for worker Kubelet"
+
+  timeout_sec        = 20
+  check_interval_sec = 30
+
+  healthy_threshold   = 1
+  unhealthy_threshold = 6
+
+  ssl_health_check {
+    port = "10250"
+  }
+}
+
 # Worker instance template
 resource "google_compute_instance_template" "worker" {
   name_prefix  = "${var.name}-worker-"
-  description  = "Worker Instance template"
+  description  = "${var.name} worker instance template"
   machine_type = var.machine_type
 
   metadata = {
     user-data = data.ct_config.worker.rendered
   }
 
   scheduling {
-    automatic_restart = var.preemptible ? false : true
-    preemptible       = var.preemptible
+    provisioning_model = var.preemptible ? "SPOT" : "STANDARD"
+    preemptible        = var.preemptible
+    automatic_restart  = var.preemptible ? false : true
+    # Spot instances with termination action DELETE cannot be used with MIGs
+    instance_termination_action = var.preemptible ? "STOP" : null
   }
 
   disk {
@@ -49,10 +49,8 @@ resource "google_compute_instance_template" "worker" {
 
   network_interface {
     network = var.network
-
-    # Ephemeral external IP
-    access_config {
-    }
+    access_config {}
   }
 
   can_ip_forward = true
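For a sense of what now triggers a rollout, a hypothetical worker pool configuration is sketched below. The values are illustrative, though `worker_count`, `machine_type`, `disk_size`, and `preemptible` all correspond to inputs referenced in this diff or in the CHANGES.md entry above:

# terraform.tfvars (hypothetical): editing any template-affecting value now
# rolls the managed instance group via surge-and-replace on the next apply
worker_count = 3
machine_type = "n2-standard-2" # instance template change: replaces instances
disk_size    = 40              # instance template change: replaces instances
preemptible  = true            # switches replacement instances to Spot provisioning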
google-cloud/flatcar-linux/kubernetes/apiserver.tf (8 changes: 4 additions & 4 deletions)
@@ -75,18 +75,18 @@ resource "google_compute_instance_group" "controllers" {
   )
 }
 
-# TCP health check for apiserver
+# Health check for kube-apiserver
 resource "google_compute_health_check" "apiserver" {
-  name        = "${var.cluster_name}-apiserver-tcp-health"
-  description = "TCP health check for kube-apiserver"
+  name        = "${var.cluster_name}-apiserver-health"
+  description = "Health check for kube-apiserver"
 
   timeout_sec        = 5
   check_interval_sec = 5
 
   healthy_threshold   = 1
   unhealthy_threshold = 3
 
-  tcp_health_check {
+  ssl_health_check {
     port = "6443"
   }
 }
google-cloud/flatcar-linux/kubernetes/network.tf (18 changes: 18 additions & 0 deletions)
@@ -196,6 +196,24 @@ resource "google_compute_firewall" "allow-ingress" {
   target_tags = ["${var.cluster_name}-worker"]
 }
 
+resource "google_compute_firewall" "google-kubelet-health-checks" {
+  name    = "${var.cluster_name}-kubelet-health"
+  network = google_compute_network.network.name
+
+  allow {
+    protocol = "tcp"
+    ports    = [10250]
+  }
+
+  # https://cloud.google.com/compute/docs/instance-groups/autohealing-instances-in-migs
+  source_ranges = [
+    "35.191.0.0/16",
+    "130.211.0.0/22",
+  ]
+
+  target_tags = ["${var.cluster_name}-worker"]
+}
+
 resource "google_compute_firewall" "google-ingress-health-checks" {
   name    = "${var.cluster_name}-ingress-health"
   network = google_compute_network.network.name
google-cloud/flatcar-linux/kubernetes/workers/workers.tf (47 changes: 41 additions & 6 deletions)
@@ -1,6 +1,6 @@
 # Managed instance group of workers
 resource "google_compute_region_instance_group_manager" "workers" {
-  name        = "${var.name}-worker-group"
+  name        = "${var.name}-worker"
   description = "Compute instance group of ${var.name} workers"
 
   # instance name prefix for instances in the group
@@ -11,6 +11,16 @@ resource "google_compute_region_instance_group_manager" "workers" {
     instance_template = google_compute_instance_template.worker.self_link
   }
 
+  # Roll out MIG instance template changes by replacing instances.
+  # - Surge to create new instances, then delete old instances.
+  # - Replace ensures new Ignition is picked up
+  update_policy {
+    type                  = "PROACTIVE"
+    max_surge_fixed       = 3
+    max_unavailable_fixed = 0
+    minimal_action        = "REPLACE"
+  }
+
   target_size  = var.worker_count
   target_pools = [google_compute_target_pool.workers.self_link]
 
@@ -23,6 +23,27 @@ resource "google_compute_region_instance_group_manager" "workers" {
     name = "https"
     port = "443"
   }
+
+  auto_healing_policies {
+    health_check      = google_compute_health_check.worker.id
+    initial_delay_sec = 120
+  }
 }
+
+# Health check for worker Kubelet
+resource "google_compute_health_check" "worker" {
+  name        = "${var.name}-kubelet-health"
+  description = "Health check for worker Kubelet"
+
+  timeout_sec        = 20
+  check_interval_sec = 30
+
+  healthy_threshold   = 1
+  unhealthy_threshold = 6
+
+  ssl_health_check {
+    port = "10250"
+  }
+}
 
 # Worker instance template
@@ -36,8 +36,11 @@ resource "google_compute_instance_template" "worker" {
   }
 
   scheduling {
-    automatic_restart = var.preemptible ? false : true
-    preemptible       = var.preemptible
+    provisioning_model = var.preemptible ? "SPOT" : "STANDARD"
+    preemptible        = var.preemptible
+    automatic_restart  = var.preemptible ? false : true
+    # Spot instances with termination action DELETE cannot be used with MIGs
+    instance_termination_action = var.preemptible ? "STOP" : null
   }
 
   disk {
@@ -49,10 +49,8 @@ resource "google_compute_instance_template" "worker" {
 
   network_interface {
     network = var.network
-
-    # Ephemeral external IP
-    access_config {
-    }
+    access_config {}
   }
 
   can_ip_forward = true
@@ -64,6 +64,9 @@
   }
 
   lifecycle {
+    ignore_changes = [
+      disk[0].source_image
+    ]
     # To update an Instance Template, Terraform should replace the existing resource
     create_before_destroy = true
   }
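The two lifecycle settings above work together: `ignore_changes` keeps routine OS image releases from diffing the template (Fedora CoreOS and Flatcar Linux nodes update themselves in place), while `create_before_destroy` handles genuine template changes, since instance templates are immutable and must be replaced rather than edited. A minimal sketch of the pattern, with illustrative names and an assumed Fedora CoreOS image family:

resource "google_compute_instance_template" "example" {
  # name_prefix (vs. a fixed name) lets old and new templates coexist
  # during create_before_destroy replacement
  name_prefix  = "example-worker-"
  machine_type = "n1-standard-1"

  disk {
    source_image = "projects/fedora-coreos-cloud/global/images/family/fedora-coreos-stable"
  }

  network_interface {
    network = "default"
  }

  lifecycle {
    # new images in the family don't force template replacement;
    # auto-updating nodes pick them up on their own schedule
    ignore_changes = [
      disk[0].source_image
    ]
    # templates are immutable, so create the successor before destroying
    create_before_destroy = true
  }
}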
