kubernetes_cluster_node_pool: Fix race condition with virtual network status when creating node pool #25888

Merged
merged 2 commits into from
May 16, 2024

Contributor

@c4milo c4milo commented May 7, 2024

Related to #13105. We are able to consistently trigger it in our testing.

❯  make acctests SERVICE='containers' TESTARGS='-run=TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition' TESTTIMEOUT='60m'
==> Checking that code complies with gofmt requirements...
==> Checking that Custom Timeouts are used...
==> Checking that acceptance test packages are used...
TF_ACC=1 go test -v ./internal/services/containers -run=TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition -timeout 60m -ldflags="-X=github.com/hashicorp/terraform-provider-azurerm/version.ProviderVersion=acc"
=== RUN   TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== PAUSE TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== CONT  TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
--- PASS: TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition (1395.32s)
PASS
ok  	github.com/hashicorp/terraform-provider-azurerm/internal/services/containers	1398.094s

Error

    Error: creating Agent Pool (Subscription: "312d56f6-697b-4377-9a9a-83257ce33066"
        Resource Group Name: "acctestRG-aks-240506121623880995"
        Managed Cluster Name: "acctestaks240506121623880995"
        Agent Pool Name: "internal3"): polling after CreateOrUpdate: polling failed: the Azure API returned the following error:

        Status: "SetVNetOwnershipFailed"
        Code: ""
        Message: "Set virtual network ownership failed. Subscription: 312d56f6-697b-4377-9a9a-83257ce33066; resource group: acctestRG-aks-240506121623880995; virtual network name: acctestvirtnet240506121623880995. autorest/azure: Service returned an error. Status=400 Code=\"VirtualNetworkNotInSucceededState\" Message=\"Virtual network /subscriptions/312d56f6-697b-4377-9a9a-83257ce33066/resourceGroups/acctestRG-aks-240506121623880995/providers/Microsoft.Network/virtualNetworks/acctestvirtnet240506121623880995 is in Updating state. It needs to be in Succeeded state in order to set resource ownership.\" Details=[]\nVirtual network /subscriptions/312d56f6-697b-4377-9a9a-83257ce33066/resourceGroups/acctestRG-aks-240506121623880995/providers/Microsoft.Network/virtualNetworks/acctestvirtnet240506121623880995 is in Updating state. It needs to be in Succeeded state in order to set resource ownership."
        Activity Id: ""

        ---

        API Response:

        ----[start]----
        {
          "name": "da6f9f40-6f18-4380-b252-d516cd3659f0",
          "status": "Failed",
          "startTime": "2024-05-06T16:21:01.6132767Z",
          "endTime": "2024-05-06T16:21:09.0898922Z",
          "error": {
           "code": "SetVNetOwnershipFailed",
           "message": "Set virtual network ownership failed. Subscription: 312d56f6-697b-4377-9a9a-83257ce33066; resource group: acctestRG-aks-240506121623880995; virtual network name: acctestvirtnet240506121623880995. autorest/azure: Service returned an error. Status=400 Code=\"VirtualNetworkNotInSucceededState\" Message=\"Virtual network /subscriptions/312d56f6-697b-4377-9a9a-83257ce33066/resourceGroups/acctestRG-aks-240506121623880995/providers/Microsoft.Network/virtualNetworks/acctestvirtnet240506121623880995 is in Updating state. It needs to be in Succeeded state in order to set resource ownership.\" Details=[]",
           "details": [
            {
             "code": "",
             "message": "Virtual network /subscriptions/312d56f6-697b-4377-9a9a-83257ce33066/resourceGroups/acctestRG-aks-240506121623880995/providers/Microsoft.Network/virtualNetworks/acctestvirtnet240506121623880995 is in Updating state. It needs to be in Succeeded state in order to set resource ownership."
            }
           ]
          }
         }
        -----[end]-----


          with azurerm_kubernetes_cluster_node_pool.test3,
          on terraform_plugin_test.tf line 97, in resource "azurerm_kubernetes_cluster_node_pool" "test3":
          97: resource "azurerm_kubernetes_cluster_node_pool" "test3" {

Community Note

  • Please vote on this PR by adding a 👍 reaction to the original PR to help the community and maintainers prioritize for review
  • Please do not leave "+1" or "me too" comments, they generate extra noise for PR followers and do not help prioritize for review

Description

PR Checklist

  • I have followed the guidelines in our Contributing Documentation.
  • I have checked to ensure there aren't other open Pull Requests for the same update/change.
  • I have checked if my changes close any open issues. If so please include appropriate closing keywords below.
  • I have used a meaningful PR title to help maintainers and other users understand this change and help prevent duplicate work.
    For example: “resource_name_here - description of change, e.g. adding property new_property_name_here”

Changes to existing Resource / Data Source

  • I have added an explanation of what my changes do and why I'd like you to include them (This may be covered by linking to an issue above, but may benefit from additional explanation).
  • I have written new tests for my resource or datasource changes & updated any relevant documentation.
  • I have successfully run tests with my changes locally. If not, please provide details on testing challenges that prevented you running the tests.

Testing

  • My submission includes Test coverage as described in the Contribution Guide and the tests pass. (if this is not possible for any reason, please include details of why you did or could not add test coverage)

Change Log

Below please provide what should go into the changelog (if anything) conforming to the Changelog Format documented here.

  • azurerm_kubernetes_cluster_node_pool - Fix race condition between virtual network status and node pool creation. The virtual network must be in Succeeded state for an AKS node pool to be correctly created.
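
The changelog entry above describes waiting for the virtual network to reach the Succeeded state before creating the node pool. The following is a minimal sketch of that idea, not the code merged in this PR: `stateFunc`, `waitForSucceeded`, and `fakeVNet` are hypothetical names, the state lookup is a stand-in for an Azure API call, and the attempt limit is an assumption chosen to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stateFunc is a hypothetical lookup that returns the virtual network's
// current provisioning state (for example "Updating" or "Succeeded").
type stateFunc func() (string, error)

// errNotSucceeded is returned when the state never reached "Succeeded".
var errNotSucceeded = errors.New("virtual network did not reach Succeeded state")

// waitForSucceeded polls get until it reports "Succeeded", a lookup fails,
// or maxAttempts polls have been made, sleeping interval between polls.
func waitForSucceeded(get stateFunc, interval time.Duration, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		state, err := get()
		if err != nil {
			return fmt.Errorf("reading virtual network state: %w", err)
		}
		if state == "Succeeded" {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return errNotSucceeded
}

// fakeVNet replays a fixed sequence of states, standing in for the Azure API.
// Once the sequence is exhausted it keeps returning the last state.
type fakeVNet struct {
	states []string
	calls  int
}

func (f *fakeVNet) state() (string, error) {
	if len(f.states) == 0 {
		return "", errors.New("no states configured")
	}
	i := f.calls
	if i >= len(f.states) {
		i = len(f.states) - 1
	}
	f.calls++
	return f.states[i], nil
}

func main() {
	vnet := &fakeVNet{states: []string{"Updating", "Updating", "Succeeded"}}
	if err := waitForSucceeded(vnet.state, 10*time.Millisecond, 5); err != nil {
		fmt.Println("not ready:", err)
		return
	}
	fmt.Printf("virtual network ready after %d polls; safe to create the node pool\n", vnet.calls)
}
```

In a real provider the wait would more likely be bounded by the resource's create timeout than by a fixed attempt count; the count is used here only so the sketch runs without any Azure dependency.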

This is a (please select all that apply):

  • Bug Fix

Related Issue(s)

#13105

Note

If this PR changes meaningfully during the course of review please update the title and description as required.

@github-actions github-actions bot added the size/L label May 7, 2024
@c4milo c4milo changed the title Fix race condition when creating multiple node pools. kubernetes_cluster_node_pool: Fix race condition when creating multiple node pools. May 7, 2024
@c4milo c4milo changed the title kubernetes_cluster_node_pool: Fix race condition when creating multiple node pools. kubernetes_cluster_node_pool: Fix race condition when creating multiple node pools May 7, 2024
@c4milo c4milo changed the title kubernetes_cluster_node_pool: Fix race condition when creating multiple node pools kubernetes_cluster_node_pool: Fix race condition with virtual network status when creating multiple node pools May 7, 2024
@c4milo c4milo changed the title kubernetes_cluster_node_pool: Fix race condition with virtual network status when creating multiple node pools kubernetes_cluster_node_pool: Fix race condition with virtual network status when creating node pool May 7, 2024
Member

@stephybun stephybun left a comment

Thanks for this PR @c4milo.

I left some review comments and suggestions in-line. Could you take a look and fix those up? Once that's done we can run the tests and give this another review.

@c4milo
Contributor Author

c4milo commented May 8, 2024

@stephybun, I've made the changes you kindly requested. Thanks for reviewing!

Related to hashicorp#13105

```
❯  make acctests SERVICE='containers' TESTARGS='-run=TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition' TESTTIMEOUT='60m'
==> Checking that code complies with gofmt requirements...
==> Checking that Custom Timeouts are used...
==> Checking that acceptance test packages are used...
TF_ACC=1 go test -v ./internal/services/containers -run=TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition -timeout 60m -ldflags="-X=github.com/hashicorp/terraform-provider-azurerm/version.ProviderVersion=acc"
=== RUN   TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== PAUSE TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== CONT  TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
--- PASS: TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition (1395.32s)
PASS
ok  	github.com/hashicorp/terraform-provider-azurerm/internal/services/containers	1398.094s
```

blah
Contributor

@tombuildsstuff tombuildsstuff left a comment

hey @c4milo

Thanks for pushing those changes, I've taken a look through and left a few comments inline, but this otherwise LGTM 👍

Thanks!

@c4milo
Contributor Author

c4milo commented May 13, 2024

Thanks for taking the time to review, @tombuildsstuff. I've addressed the comments from the last review. Please let me know if there is anything else you would like me to correct.

Member

@stephybun stephybun left a comment

We have a test failure:

------- Stdout: -------
=== RUN   TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== PAUSE TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== CONT  TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
    testcase.go:113: Step 1/1 error: Error running pre-apply refresh: exit status 1
        Error: Invalid index
          on terraform_plugin_test.tf line 67, in resource "azurerm_kubernetes_cluster" "test":
          67:     vnet_subnet_id = azurerm_subnet.test["19"].id
            ├────────────────
            │ azurerm_subnet.test is tuple with 8 elements
        The given key does not identify an element in this collection value.
        Error: Invalid index
          on terraform_plugin_test.tf line 68, in resource "azurerm_kubernetes_cluster" "test":
          68:     pod_subnet_id  = azurerm_subnet.test["18"].id
            ├────────────────
            │ azurerm_subnet.test is tuple with 8 elements
        The given key does not identify an element in this collection value.
--- FAIL: TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition (12.73s)
FAIL

@c4milo
Contributor Author

c4milo commented May 16, 2024

@stephybun, thank you. I think it is due to the review request to lower the number of subnets. I will fix the tests.
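
The "Invalid index" failure above comes from hard-coded subnet indices (19 and 18) that no longer exist once the subnet count drops to 8. One way to avoid that class of error, sketched here under assumptions, is to derive the indices from the subnet count when the test configuration is generated. `subnetRefs` is a hypothetical helper and is not taken from the provider's test code; picking the last two subnets is also an assumption made for illustration.

```go
package main

import "fmt"

// subnetRefs returns the HCL lines that point the cluster at the last two
// subnets, deriving both indices from subnetCount instead of hard-coding them.
// The function name and layout are hypothetical, not the provider's test code.
func subnetRefs(subnetCount int) (string, error) {
	if subnetCount < 2 {
		return "", fmt.Errorf("need at least 2 subnets, got %d", subnetCount)
	}
	return fmt.Sprintf(
		"vnet_subnet_id = azurerm_subnet.test[%d].id\npod_subnet_id  = azurerm_subnet.test[%d].id\n",
		subnetCount-1, subnetCount-2,
	), nil
}

func main() {
	// With 8 subnets, the generated references stay inside the tuple.
	refs, err := subnetRefs(8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(refs)
}
```

Because both indices are computed from the same count used to create the subnets, changing that count during review cannot leave a stale index behind.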

Co-authored-by: Tom Harvey <tombuildsstuff@users.noreply.github.com>
Co-authored-by: stephybun <stephybun@users.noreply.github.com>
@c4milo c4milo requested a review from stephybun May 16, 2024 14:23
@c4milo
Contributor Author

c4milo commented May 16, 2024

@stephybun, I fixed the tests. Thank you!

make acctests SERVICE='containers' TESTARGS='-run=TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition' TESTTIMEOUT='60m'
==> Checking that code complies with gofmt requirements...
==> Checking that Custom Timeouts are used...
==> Checking that acceptance test packages are used...
TF_ACC=1 go test -v ./internal/services/containers -run=TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition -timeout 60m -ldflags="-X=github.com/hashicorp/terraform-provider-azurerm/version.ProviderVersion=acc"
=== RUN   TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== PAUSE TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
=== CONT  TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition
--- PASS: TestAccKubernetesClusterNodePool_virtualNetworkOwnershipRaceCondition (1163.83s)
PASS
ok  	github.com/hashicorp/terraform-provider-azurerm/internal/services/containers	1166.633s

Member

@stephybun stephybun left a comment

Thanks @c4milo LGTM 🍍

@stephybun stephybun merged commit 9857f2e into hashicorp:main May 16, 2024
30 checks passed
@github-actions github-actions bot added this to the v3.104.0 milestone May 16, 2024
stephybun added a commit that referenced this pull request May 16, 2024
dduportal pushed a commit to jenkins-infra/azure that referenced this pull request May 20, 2024
<Actions>
<action
id="f410411e63aff4bb73a81c2aec1d373cf8a903e63b30dee2006b0030d8a94cc8">
        <h3>Bump Terraform `azurerm` provider version</h3>
<details
id="1d9343c012f5434ac9fe8a98135bae3667b399259be16d9b14302ea3bd424a24">
            <summary>Update Terraform lock file</summary>
<p>changes detected:&#xA;&#x9;&#34;hashicorp/azurerm&#34; updated from
&#34;3.103.1&#34; to &#34;3.104.0&#34; in file
&#34;.terraform.lock.hcl&#34;</p>
            <details>
                <summary>3.104.0</summary>
<pre>Changelog retrieved
from:&#xA;&#x9;https://github.com/hashicorp/terraform-provider-azurerm/releases/tag/v3.104.0&#xA;FEATURES:&#xA;&#xA;*
New Data Source: `azurerm_elastic_san`
([#25719](https://github.com/hashicorp/terraform-provider-azurerm/issues/25719))&#xA;&#xA;ENHANCEMENTS:&#xA;&#xA;*
New Resource - `azurerm_key_vault_managed_hardware_security_module_key`
([#25935](hashicorp/terraform-provider-azurerm#25935
Data Source - `azurerm_kubernetes_service_version` - support for the
`default_version` property
([#25953](hashicorp/terraform-provider-azurerm#25953
`network/applicationgateways` - update to use `hashicorp/go-azure-sdk`
([#25844](hashicorp/terraform-provider-azurerm#25844
`dataprotection` - update API version to `2024-04-01`
([#25882](hashicorp/terraform-provider-azurerm#25882
`databasemigration` - update API version to `2021-06-30`
([#25997](hashicorp/terraform-provider-azurerm#25997
`network/ips` - update to use `hashicorp/go-azure-sdk`
([#25905](hashicorp/terraform-provider-azurerm#25905
`network/localnetworkgateway` - update to use `hashicorp/go-azure-sdk`
([#25905](hashicorp/terraform-provider-azurerm#25905
`network/natgateway` - update to use `hashicorp/go-azure-sdk`
([#25905](hashicorp/terraform-provider-azurerm#25905
`network/networksecuritygroup` - update to use `hashicorp/go-azure-sdk`
([#25971](hashicorp/terraform-provider-azurerm#25971
`network/publicips` - update to use `hashicorp/go-azure-sdk`
([#25971](hashicorp/terraform-provider-azurerm#25971
`network/virtualwan` - update to use `hashicorp/go-azure-sdk`
([#25971](hashicorp/terraform-provider-azurerm#25971
`network/vpn` - update to use `hashicorp/go-azure-sdk`
([#25971](hashicorp/terraform-provider-azurerm#25971
`azurerm_databricks_workspace` - support for the
`default_storage_firewall_enabled` property
([#25919](hashicorp/terraform-provider-azurerm#25919
`azurerm_key_vault` - allow previously existing key vaults to continue
to manage the `contact` field prior to the `v3.93.0` conditional polling
change
([#25777](hashicorp/terraform-provider-azurerm#25777
`azurerm_linux_function_app` - support for the PowerShell `7.4`
([#25980](hashicorp/terraform-provider-azurerm#25980
`azurerm_log_analytics_cluster` - support for the value `UserAssigned`
in the `identity.type` property
([#25940](hashicorp/terraform-provider-azurerm#25940
`azurerm_pim_active_role_assignment` - remove hard dependency on the
`roleAssignmentScheduleRequests` API, so that role assignments will not
become unmanageable over time
([#25956](hashicorp/terraform-provider-azurerm#25956
`azurerm_pim_eligible_role_assignment` - remove hard dependency on the
`roleEligibilityScheduleRequests` API, so that role assignments will not
become unmanageable over time
([#25956](hashicorp/terraform-provider-azurerm#25956
`azurerm_windows_function_app` - support for the PowerShell `7.4`
([#25980](https://github.com/hashicorp/terraform-provider-azurerm/issues/25980))&#xA;&#xA;BUG
FIXES:&#xA;&#xA;* `azurerm_container_app_job` - Allow
`event_trigger_config.scale.min_executions` to be `0`
([#25931](hashicorp/terraform-provider-azurerm#25931
`azurerm_container_app_job` - update validation to allow the
`replica_retry_limit` property to be set to `0`
([#25984](hashicorp/terraform-provider-azurerm#25984
`azurerm_data_factory_trigger_custom_event` - one of
`subject_begins_with` and `subject_ends_with` no longer need to be set
([#25932](hashicorp/terraform-provider-azurerm#25932
`azurerm_kubernetes_cluster_node_pool` - prevent race condition by
checking the virtual network status when creating a node pool with a
subnet ID
([#25888](hashicorp/terraform-provider-azurerm#25888
`azurerm_postgresql_flexible_server` - fix for default `storage_tier`
value when `storage_mb` field has been changed
([#25947](hashicorp/terraform-provider-azurerm#25947
`azurerm_pim_active_role_assignment` - resolve a number of potential
crashes
([#25956](hashicorp/terraform-provider-azurerm#25956
`azurerm_pim_eligible_role_assignment` - resolve a number of potential
crashes
([#25956](hashicorp/terraform-provider-azurerm#25956
`azurerm_redis_enterprise_cluster_location_zone_support` - add `Central
India` zones support
([#26000](hashicorp/terraform-provider-azurerm#26000
`azurerm_sentinel_alert_rule_scheduled` - the
`alert_rule_template_version` property is no longer `ForceNew`
([#25688](hashicorp/terraform-provider-azurerm#25688
`azurerm_storage_sync_server_endpoint` - preventing a crashed due to
`initial_upload_policy`
([#25968](https://github.com/hashicorp/terraform-provider-azurerm/issues/25968))&#xA;&#xA;&#xA;</pre>
            </details>
        </details>
<a
href="https://infra.ci.jenkins.io/job/updatecli/job/azure/job/main/185/">Jenkins
pipeline link</a>
    </action>
</Actions>

Co-authored-by: Jenkins Infra Bot (updatecli) <60776566+jenkins-infra-bot@users.noreply.github.com>

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 16, 2024
3 participants