State drift due to upgrade_settings -> max_surge #522

Closed
Israphel opened this issue Mar 11, 2024 · 3 comments
Labels: enhancement (New feature or request)

@Israphel

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

1.5.5

Module Version

8.0.0

AzureRM Provider Version

3.93.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

All defaults.

tfvars variables values

Kubernetes version set to 1.28.x

Debug Output/Panic Output

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # module.aks-us-east["use-1"].azurerm_kubernetes_cluster.main will be updated in-place
  ~ resource "azurerm_kubernetes_cluster" "main" {
        id                                  = "/subscriptions/xxx/resourceGroups/prod-us-east/providers/Microsoft.ContainerService/managedClusters/use-prod-1-aks"
        name                                = "use-prod-1-aks"
        tags                                = {
            "environment" = "prod"
            "managed_by"  = "terraform"
            "region"      = "eastus"
        }
        # (31 unchanged attributes hidden)

      ~ default_node_pool {
            name                         = "default"
            tags                         = {
                "environment" = "prod"
                "managed_by"  = "terraform"
                "region"      = "eastus"
            }
            # (25 unchanged attributes hidden)

          - upgrade_settings {
              - max_surge = "10%" -> null
            }
        }

        # (6 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Expected Behaviour

We're not setting max_surge at all; it should respect the default (10%).

Actual Behaviour

Every plan wants to remove the block again.

Steps to Reproduce

terraform plan

Important Factoids

No response

References

Related to: hashicorp/terraform-provider-azurerm#24020

@Israphel Israphel added the bug label Mar 11, 2024
@lonegunmanb
Member

@Israphel thanks for reporting this issue to us. As the AKS release notes state:

  • The default max surge value during upgrades will be changed from 1 to 10% for AKS 1.28+ on new clusters to improve upgrade latency.

This is expected behavior if you've set kubernetes_version to 1.28+. I can reproduce this issue with the following code:

resource "random_id" "prefix" {
  byte_length = 8
}

resource "random_id" "name" {
  byte_length = 8
}

resource "azurerm_resource_group" "main" {
  location = "eastus"
  name     = "${random_id.prefix.hex}-rg"
}

module "aks" {
  source = "../.."

  kubernetes_version  = "1.29.0"
  prefix              = random_id.prefix.hex
  resource_group_name = azurerm_resource_group.main.name
  rbac_aad            = false
}

This is not this module's issue; the root cause is the service team's design. You can bypass this drift by setting max_surge to "10%" explicitly. I'd like to keep this issue open and add a check block to warn users, but I don't think we can do anything to eliminate this configuration drift.
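For reference, a minimal sketch of that workaround applied to the reproduction above, assuming the module's agents_pool_max_surge input (referenced later in this thread) is what populates default_node_pool.upgrade_settings.max_surge:

module "aks" {
  source = "../.."

  kubernetes_version  = "1.29.0"
  prefix              = random_id.prefix.hex
  resource_group_name = azurerm_resource_group.main.name
  rbac_aad            = false

  # Pin the surge value so the plan matches what AKS 1.28+ applies by default.
  # agents_pool_max_surge is assumed here to map to upgrade_settings.max_surge.
  agents_pool_max_surge = "10%"
}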

@lonegunmanb lonegunmanb added the enhancement label and removed the bug label Mar 12, 2024
@Israphel
Author

Israphel commented Mar 12, 2024

What's the simplest way to send that 10% to the module, considering we use it with (mostly) default values?

Is it with https://github.com/Azure/terraform-azurerm-aks?tab=readme-ov-file#input_agents_pool_max_surge ?

# AKS clusters (US East)
module "aks-us-east" {
  source   = "Azure/aks/azurerm"
  version  = "8.0.0"
  for_each = local.config[local.environment]["aks"]["us-east"]

  prefix                            = each.value.name
  resource_group_name               = module.resource-group-us-east["default"].name
  node_resource_group               = "${each.value.name}-nodes"
  kubernetes_version                = each.value.kubernetes_version.control_plane
  orchestrator_version              = each.value.kubernetes_version.node_pool
  oidc_issuer_enabled               = true
  workload_identity_enabled         = true
  agents_pool_name                  = "default"
  agents_availability_zones         = ["1", "2", "3"]
  agents_type                       = "VirtualMachineScaleSets"
  agents_size                       = try(each.value.agents_size, "Standard_D2s_v3")
  temporary_name_for_rotation       = "tmp"
  enable_auto_scaling               = true
  agents_count                      = null
  agents_min_count                  = try(each.value.agents_min_count, 1)
  agents_max_count                  = try(each.value.agents_max_count, 3)
  azure_policy_enabled              = true
  log_analytics_workspace_enabled   = try(each.value.log_analytics_workspace_enabled, true)
  log_retention_in_days             = try(each.value.log_retention_in_days, 30)
  network_plugin                    = "azure"
  load_balancer_sku                 = "standard"
  ebpf_data_plane                   = "cilium"
  os_disk_size_gb                   = try(each.value.os_disk_size_gb, 30)
  rbac_aad                          = true
  rbac_aad_managed                  = true
  rbac_aad_azure_rbac_enabled       = true
  role_based_access_control_enabled = true
  rbac_aad_admin_group_object_ids   = [local.inputs["groups"]["infra"]]
  sku_tier                          = "Standard"
  vnet_subnet_id                    = module.virtual-network-us-east["default"].vnet_subnets_name_id["nodes"]
  pod_subnet_id                     = module.virtual-network-us-east["default"].vnet_subnets_name_id["pods"]
  agents_labels                     = try(each.value.agents_labels, {})
  agents_tags                       = try(each.value.agents_tags, {})

  tags = {
    environment = local.environment
    region      = module.resource-group-us-east["default"].location
    managed_by  = "terraform"
  }

  providers = {
    azurerm = azurerm.us-east
  }
}
aks:
  us-east:
    use-1:
      name: use-prod-1
      kubernetes_version:
        control_plane: 1.28.5
        node_pool: 1.28.5
      log_analytics_workspace_enabled: false
      agents_size: Standard_D4as_v5
      agents_min_count: 1
      agents_max_count: 16
      os_disk_size_gb: 60
      agents_labels:
        node.kubernetes.io/node-type: default

The funny thing is that we have two Terraform workspaces (stage and prod) sharing the same code but not the same state. Both are now on 1.28.x, but only one has the drift. And if I do a state pull, the 10% is not there.
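A minimal sketch of how that input could be threaded through the module block above, assuming agents_pool_max_surge is indeed the right variable; the each.value.agents_pool_max_surge key is hypothetical and only shows an optional per-cluster override from the YAML config:

  # Hypothetical addition to the module "aks-us-east" block:
  agents_pool_max_surge = try(each.value.agents_pool_max_surge, "10%")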

@lonegunmanb lonegunmanb self-assigned this May 17, 2024