State drift due to upgrade_settings -> max_surge #522

Closed
Israphel opened this issue Mar 11, 2024 · 3 comments
Labels: enhancement (New feature or request)

@Israphel

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

1.5.5

Module Version

8.0.0

AzureRM Provider Version

3.93.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

All defaults.

tfvars variables values

Kubernetes version set to 1.28.x

Debug Output/Panic Output

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.aks-us-east["use-1"].azurerm_kubernetes_cluster.main will be updated in-place
  ~ resource "azurerm_kubernetes_cluster" "main" {
        id                                  = "/subscriptions/xxx/resourceGroups/prod-us-east/providers/Microsoft.ContainerService/managedClusters/use-prod-1-aks"
        name                                = "use-prod-1-aks"
        tags                                = {
            "environment" = "prod"
            "managed_by"  = "terraform"
            "region"      = "eastus"
        }
        # (31 unchanged attributes hidden)

      ~ default_node_pool {
            name                         = "default"
            tags                         = {
                "environment" = "prod"
                "managed_by"  = "terraform"
                "region"      = "eastus"
            }
            # (25 unchanged attributes hidden)

          - upgrade_settings {
              - max_surge = "10%" -> null
            }
        }

        # (6 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Expected Behaviour

We're not setting max_surge at all; it should respect the default (10%).

Actual Behaviour

Every plan wants to remove the block again.

Steps to Reproduce

terraform plan

Important Factoids

No response

References

Related to: hashicorp/terraform-provider-azurerm#24020

@Israphel Israphel added the bug label Mar 11, 2024
@lonegunmanb
Member

@Israphel thanks for reporting this issue to us. As the AKS release notes state:

  • The default max surge value during upgrades will be changed from 1 to 10% for AKS 1.28+ on new clusters to improve upgrade latency.

This is expected behavior if you've set kubernetes_version to 1.28+. I can reproduce this issue with the following code:

resource "random_id" "prefix" {
  byte_length = 8
}

resource "random_id" "name" {
  byte_length = 8
}

resource "azurerm_resource_group" "main" {
  location = "eastus"
  name     = "${random_id.prefix.hex}-rg"
}

module "aks" {
  source = "../.."

  kubernetes_version  = "1.29.0"
  prefix              = random_id.prefix.hex
  resource_group_name = azurerm_resource_group.main.name
  rbac_aad            = false
}

This is not this module's issue; the root cause is the service team's design. You can bypass this drift by setting max_surge to "10%" explicitly. I'd like to keep this issue open and add a check block to warn users, but I don't think we can do anything to eliminate this configuration drift.
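For reference, a minimal sketch of that workaround applied to the reproduction above, assuming the module's agents_pool_max_surge input (referenced later in this thread) is what populates default_node_pool.upgrade_settings.max_surge:

module "aks" {
  source = "../.."

  kubernetes_version  = "1.29.0"
  prefix              = random_id.prefix.hex
  resource_group_name = azurerm_resource_group.main.name
  rbac_aad            = false

  # Pin the surge value so the plan matches what AKS 1.28+ applies by default.
  # agents_pool_max_surge is assumed here to map to upgrade_settings.max_surge.
  agents_pool_max_surge = "10%"
}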

@lonegunmanb lonegunmanb added the enhancement label and removed the bug label Mar 12, 2024
@Israphel
Author

Israphel commented Mar 12, 2024

What's the simplest way to send that 10% to the module, considering we use it with (mostly) default values?

Is it with https://github.com/Azure/terraform-azurerm-aks?tab=readme-ov-file#input_agents_pool_max_surge ?

# AKS clusters (US East)
module "aks-us-east" {
  source   = "Azure/aks/azurerm"
  version  = "8.0.0"
  for_each = local.config[local.environment]["aks"]["us-east"]

  prefix                            = each.value.name
  resource_group_name               = module.resource-group-us-east["default"].name
  node_resource_group               = "${each.value.name}-nodes"
  kubernetes_version                = each.value.kubernetes_version.control_plane
  orchestrator_version              = each.value.kubernetes_version.node_pool
  oidc_issuer_enabled               = true
  workload_identity_enabled         = true
  agents_pool_name                  = "default"
  agents_availability_zones         = ["1", "2", "3"]
  agents_type                       = "VirtualMachineScaleSets"
  agents_size                       = try(each.value.agents_size, "Standard_D2s_v3")
  temporary_name_for_rotation       = "tmp"
  enable_auto_scaling               = true
  agents_count                      = null
  agents_min_count                  = try(each.value.agents_min_count, 1)
  agents_max_count                  = try(each.value.agents_max_count, 3)
  azure_policy_enabled              = true
  log_analytics_workspace_enabled   = try(each.value.log_analytics_workspace_enabled, true)
  log_retention_in_days             = try(each.value.log_retention_in_days, 30)
  network_plugin                    = "azure"
  load_balancer_sku                 = "standard"
  ebpf_data_plane                   = "cilium"
  os_disk_size_gb                   = try(each.value.os_disk_size_gb, 30)
  rbac_aad                          = true
  rbac_aad_managed                  = true
  rbac_aad_azure_rbac_enabled       = true
  role_based_access_control_enabled = true
  rbac_aad_admin_group_object_ids   = [local.inputs["groups"]["infra"]]
  sku_tier                          = "Standard"
  vnet_subnet_id                    = module.virtual-network-us-east["default"].vnet_subnets_name_id["nodes"]
  pod_subnet_id                     = module.virtual-network-us-east["default"].vnet_subnets_name_id["pods"]
  agents_labels                     = try(each.value.agents_labels, {})
  agents_tags                       = try(each.value.agents_tags, {})

  tags = {
    environment = local.environment
    region      = module.resource-group-us-east["default"].location
    managed_by  = "terraform"
  }

  providers = {
    azurerm = azurerm.us-east
  }
}
aks:
  us-east:
    use-1:
      name: use-prod-1
      kubernetes_version:
        control_plane: 1.28.5
        node_pool: 1.28.5
      log_analytics_workspace_enabled: false
      agents_size: Standard_D4as_v5
      agents_min_count: 1
      agents_max_count: 16
      os_disk_size_gb: 60
      agents_labels:
        node.kubernetes.io/node-type: default

The funny thing is that we have two Terraform workspaces (stage and prod) sharing the same code but not the same state. Both are now on 1.28.x, but only one has the drift. And if I do a state pull, the 10% is not there.
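A minimal sketch of how that input could be threaded through the module block above, assuming agents_pool_max_surge is indeed the right variable; the each.value.agents_pool_max_surge key is hypothetical and only shows an optional per-cluster override from the YAML config:

  # Hypothetical addition to the module "aks-us-east" block:
  agents_pool_max_surge = try(each.value.agents_pool_max_surge, "10%")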

@lonegunmanb lonegunmanb self-assigned this May 17, 2024