[Network] Eventual Consistency Bug in the Virtual Network Gateway API #1233

Closed
tombuildsstuff opened this issue May 16, 2017 · 34 comments
Labels: Network, Service Attention

Comments

@tombuildsstuff
Contributor

tombuildsstuff commented May 16, 2017

👋 Hey y'all

@pmcatominey and @dominik-lekse have been investigating adding support for Virtual Network Gateways using the Azure SDK for Go - which has led to the PRs hashicorp/terraform#9255 and hashicorp/terraform#13886

Whilst adding support, they've identified an eventual consistency bug in the API where deleting a Virtual Network Gateway completes successfully (and returns a 404 when attempting to retrieve it), but the resource still exists, which prevents the Subnet from being deleted.

Whilst there are several workarounds we could implement, none of them is ideal, for a couple of reasons:

  • At present this only affects Subnets, so we could potentially implement a workaround there, but there's no guarantee this doesn't affect other resources - and we'd almost certainly end up working around this bug in random subnet-dependent resources.
  • Whilst we could pause for a period of time (say 15 minutes) - there's no guarantee the Virtual Network Gateway will actually be deleted within that time period - so we'd be opening ourselves up to random failures.

Would it be possible for someone from the Networking team to look into this issue?

Thanks!

cc @DeepakRajendranMsft / @Nilambari

@veronicagg
Contributor

veronicagg commented May 16, 2017

@DeepakRajendranMsft could you please take a look and help re-route the issue as appropriate? thanks!

@tombuildsstuff changed the title from "[Network] Eventual Consistency Bus in the Virtual Network Gateway API" to "[Network] Eventual Consistency BuG in the Virtual Network Gateway API" on May 17, 2017
@tombuildsstuff changed the title from "[Network] Eventual Consistency BuG in the Virtual Network Gateway API" to "[Network] Eventual Consistency Bug in the Virtual Network Gateway API" on May 17, 2017
@tombuildsstuff
Contributor Author

👋 hey @DeepakRajendranMsft @veronicagg - is there any update on this? :)

@DeepakRajendranMsft
Contributor

@Nilambari @lamchester please take a look

@tombuildsstuff
Contributor Author

Hey @DeepakRajendranMsft / @Nilambari / @lamchester

Just an FYI that I've opened a bug this morning about a similar issue with Redis on the Internal Network - which seems like it might be related: #1347

Is there any update on this? :)

Thanks!

@liumichelle

@DeepakRajendranMsft to assign to @Nilambari @lamchester

@tombuildsstuff
Contributor Author

👋 hey @Nilambari @lamchester

Is there any update on this issue, or a rough timeframe for when this issue would be fixed?

Thanks!

@MohitGargVpn

Hey,

We acknowledge the issue and we will work on getting the estimates required to fix this issue. We will get back with the release dates for the fix.

@tombuildsstuff
Contributor Author

Awesome, thanks @MohitGargVpn!

@dominik-lekse

Many thanks @MohitGargVpn

@tombuildsstuff
Contributor Author

Hey @MohitGargVpn

Did you manage to work out a rough timeframe for this issue in the end? :)

Thanks!

@MohitGargVpn

We will fix and release the patch by 9/15/17. Thanks for reporting the issue.

@MohitGargVpn

The fix is already checked in and still on-track for 9/15

@tombuildsstuff
Contributor Author

Hey @MohitGargVpn

@dominik-lekse has run the acceptance tests (which create and tear down a Virtual Network Gateway) and it appears we're still seeing issues on deletion. As such, I just wanted to confirm whether this fix has been released yet?

Thanks!

@MohitGargVpn

The fix is rolling out and has reached many regions, but not all. Please try again in a week or so. BTW, which region did you try, and on which date? I can check whether that region had the fix. If you could also give me the subscriptionId and VNET URI, I can investigate why the vnet/subnet deletion failed for you.

@tombuildsstuff
Contributor Author

tombuildsstuff commented Sep 27, 2017

@MohitGargVpn has this finished rolling out yet? I've just tried running this in West Europe and we're still seeing this issue:

azurerm_subnet.test: network.SubnetsClient#Delete: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="InternalServerError" Message="An error occurred."

Thanks!

@dominik-lekse

@MohitGargVpn I also tried this in the region West Europe on Monday without success. Could you reference a region in which the fix has been rolled out already?

@tombuildsstuff
Contributor Author

ping @MohitGargVpn :)

@bulletprooffool

Is this good to go?

@tombuildsstuff
Contributor Author

@veronicagg would you mind seeing if there's an update here? Thanks! :)

@MohitGargVpn

The first fix was rolled out on 9/15 and fixed the original race condition. However, we identified one more race condition which can lead to the same behavior. The fix for that is planned to be rolled out by 11/15.

@tombuildsstuff
Contributor Author

@MohitGargVpn thanks for the update. Once this has been rolled out - would you be able to tell us a region for us to test and confirm against? Thanks!

@MohitGargVpn

Sure. I will update here with the region info.

@gloverc

gloverc commented Nov 3, 2017

@MohitGargVpn, any update on this?

@dominik-lekse

As a side remark on this issue: the Azure portal is caught out by this as well. If the user deletes the subnet too quickly after deleting the vnet gateway, an error appears in the notifications.

@bingosummer
Member

bingosummer commented Nov 21, 2017

@MohitGargVpn When I tested the application gateway, I hit a similar issue: hashicorp/terraform-provider-azurerm#488. I was using Terraform to build a CI, and the failure rate is pretty high because of this issue.
My questions are:

  1. Will the fix also work for application gateway?
  2. I'm using westeurope as the region. Is the fix rolled out to all regions?

@kmcquade

@MohitGargVpn - updates? thanks

@tshafeev

@MohitGargVpn come on, half a year to fix this issue.

@tombuildsstuff
Contributor Author

ping @MohitGargVpn - is there any update available for this issue? :)

@genevieve

@tombuildsstuff Are there plans for this provider to delete the resource group instead of (or in addition to) the individual resources? If there is another issue/thread with that answer, would you direct me to it?

terraform destroy continues to fail for us on the first try due to this InternalServerError from trying to delete the subnet. Since we have a single resource group, we can add logic to delete the resource group with the API, but if there are plans to add a delete_by_resource_group or something similar to the provider, we would like to use that.

@kmcquade

@MohitGargVpn - updates?

@MohitGargVpn

This should now be fixed for all the regions. Can you please try it out?

@tombuildsstuff
Contributor Author

@MohitGargVpn from what I can see, the bugs in the API appear to have been fixed.

The one remaining error is an HTTP 429 (please retry in X seconds) status code, which isn't handled in v11 of the Azure SDK for Go. We're planning on upgrading to v12 this month, so I'm going to leave this open for the moment until we've confirmed that this is handled in v12 of the SDK.

Thanks!

@tombuildsstuff
Contributor Author

@MohitGargVpn I've spent a little while validating this and can confirm that this appears to be fixed :)

As such we've just merged support for Virtual Network Gateways into Terraform - and I'm going to close this issue.

Thanks!

@genevieve

Hey @tombuildsstuff. We tried to bring back the tests for our Azure templates, but we're back to seeing this error every time.

azurerm_subnet.cf-sn: Destroying... 

Error: Error applying plan:

1 error(s) occurred:

* azurerm_subnet.cf-sn (destroy): 1 error(s) occurred:

* azurerm_subnet.cf-sn: Error waiting for completion for Subnet "bbl-ci-up-env-cf-sn" (VN "bbl-ci-up-env-bosh-vn" / Resource Group "bbl-ci-up-env-bosh"): Long running operation terminated with status 'Failed': Code="InternalServerError" Message="An error occurred."

@bsiegel added the Service Attention label on Sep 26, 2018