resource/iam_server_certificate: Increase deletion timeout #907

Merged · 1 commit merged into master on Jun 19, 2017

Conversation

radeksimko
Member

This is to address the following test failure:

=== RUN   TestAccAWSLBSSLNegotiationPolicy_missingLB
--- FAIL: TestAccAWSLBSSLNegotiationPolicy_missingLB (718.52s)
    testing.go:492: Error destroying resource! WARNING: Dangling resources
        may exist. The full state and error is shown below.
        
        Error: Error applying: 1 error(s) occurred:
        
        * aws_iam_server_certificate.test_cert (destroy): 1 error(s) occurred:
        
        * aws_iam_server_certificate.test_cert: DeleteConflict: Certificate: ASCAJXDEAGR64DZQSW564 is currently in use by arn:aws:elasticloadbalancing:us-west-2:*******:loadbalancer/tf-test-lb-wsxm9. Please remove it first before deleting it from IAM.
            status code: 409, request id: 200c2905-53f5-11e7-a79c-5921a71de6df
        
        State: aws_iam_server_certificate.test_cert:
          ID = ASCAJXDEAGR64DZQSW564
          arn = arn:aws:iam::*******:server-certificate/tf-acctest-xdb2fjbm3b
          certificate_body = 32c8d055f078a2a9fd053c0da56307b2041a5fc8
          certificate_chain = 62d98850b90a08fe0a8fa48803099cbab71913bf
          name = tf-acctest-xdb2fjbm3b
          path = /
          private_key = ed98e97354e5d9d1ebe5b2cb595c94ae6d4c3d69

@radeksimko added the "bug" label (Addresses a defect in current functionality.) on Jun 19, 2017
@radeksimko merged commit 862aa75 into master on Jun 19, 2017
@radeksimko deleted the b-iam-server-cert-timeout branch on June 19, 2017 at 14:28
@genevieve
Contributor

Would we be able to make a PR to make this timeout configurable? 15 minutes is still not enough, and we would like to support the timeouts parameter.
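For illustration, a minimal sketch of what supporting the timeouts block could look like on the provider side, assuming the helper/schema API of that era; only the Timeouts wiring is shown and the rest of the resource is omitted:

    package aws

    import (
        "time"

        "github.com/hashicorp/terraform/helper/schema"
    )

    // Sketch only: declare a Timeouts block on the resource so users can
    // override the default with timeouts { delete = "30m" } in their config.
    func resourceAwsIAMServerCertificate() *schema.Resource {
        return &schema.Resource{
            // Schema, Create, Read and Delete funcs omitted for brevity.
            Timeouts: &schema.ResourceTimeout{
                Delete: schema.DefaultTimeout(15 * time.Minute),
            },
        }
    }

The delete function would then read d.Timeout(schema.TimeoutDelete) and pass that into its retry loop instead of a hardcoded duration.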

@Ninir
Contributor

Ninir commented Nov 22, 2017

Hi @genevievelesperance

Yes we would :) Do you feel you could throw a PR for that?

@genevieve
Contributor

Yes

@radeksimko
Member Author

@genevievelesperance I've previously been hesitant to add configurable timeouts to resources that have no reason for variable creation/update/deletion times. Good reasons usually exist for EC2/DB instances with varying instance sizes and similar, where we want the default timeout to cover the most common instance sizes (so most users don't need to set a custom timeout) and let users cover edge-case scenarios (huge instances with long provisioning times) via configurable timeouts.

I don't see any reason for certificates to vary in the same way - unless the reason is varying volume of traffic on the LB at the time of deletion (which it may be).

I feel like 15 minutes is already quite high, and I think we could/should do something smarter/cleaner here, similar to what we did in ALB/ELB: #1036 and #1427

I saw some of our acceptance tests are intermittently failing, so this clearly isn't a problem unique to you @genevievelesperance. For that reason I'd rather dig into the problem and figure out why it's taking so long and what we can do to avoid increasing the timeout further.

It's possible that we won't find any better/reasonable solution and we may end up just bumping the timeout, but that should come as the last option and I'm already feeling bad for raising it here in this PR.

@genevieve
Contributor

genevieve commented Nov 28, 2017

@radeksimko Are you suggesting that the load balancer clean up its certificates?

The cleanup of the network interface has a hardcoded 10-minute timeout in that PR. If we were to do a similar function for the certificate, is that just relocating where the delete timeout is? And would that timeout be configurable, since relocating the deletion won't change the time it takes AWS to confirm the cert has been deleted?

@radeksimko
Member Author

@genevievelesperance that's a timeout for detaching a single network interface, not for retrying deletion on an error code and hoping that things will magically detach, which is a significant difference.

Although the EC2 API is eventually consistent, in that context we know what we're waiting for and the operation has a high chance of succeeding, as opposed to retrying the delete operation on DeleteConflict, which may be a genuine conflict. Informing the user early about a genuine conflict is another strong reason to avoid blind retries and high timeouts, as those lead to a poor user experience in that context (users waiting for ages to see the error).

The timeout for detaching an NI isn't configurable because the default was clearly sufficient for most users of aws_network_interface and/or elsewhere in the codebase; feel free to correct me if I'm wrong, though.

I hope the explanation makes sense. I understand this is annoying. I also want to fix this, but without causing trouble for other users in other contexts. 😉
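For context, a simplified sketch of the blind-retry pattern under discussion (not the exact provider code): deletion is retried for as long as IAM answers DeleteConflict with a "currently in use by" message, so a genuine conflict only surfaces to the user once the timeout expires.

    package main

    import (
        "strings"
        "time"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/awserr"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/iam"
        "github.com/hashicorp/terraform/helper/resource"
    )

    // deleteServerCertificate retries DeleteServerCertificate while AWS keeps
    // reporting a DeleteConflict ("... is currently in use by arn:..."),
    // up to the given timeout. Any other error aborts immediately.
    func deleteServerCertificate(conn *iam.IAM, name string, timeout time.Duration) error {
        return resource.Retry(timeout, func() *resource.RetryError {
            _, err := conn.DeleteServerCertificate(&iam.DeleteServerCertificateInput{
                ServerCertificateName: aws.String(name),
            })
            if err == nil {
                return nil
            }
            if awsErr, ok := err.(awserr.Error); ok &&
                awsErr.Code() == "DeleteConflict" &&
                strings.Contains(awsErr.Message(), "currently in use by") {
                // Possibly just IAM eventual consistency after the LB was deleted.
                return resource.RetryableError(err)
            }
            return resource.NonRetryableError(err)
        })
    }

    func main() {
        sess := session.Must(session.NewSession())
        _ = deleteServerCertificate(iam.New(sess), "tf-acctest-xdb2fjbm3b", 15*time.Minute)
    }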

@genevieve
Contributor

I agree. Rather than increasing the default timeout as this PR does, users who aren't hitting genuine conflicts but are just dealing with a slow API could configure a longer timeout for that one resource, instead of seeing regular failures to destroy it.

@genevieve
Contributor

We can try something similar to your ENI PRs for the cert and let you know how it goes.

@genevieve
Contributor

Thinking about this a bit more, I'm not sure it makes sense to couple the deletion of the certificate to the deletion of the load balancer.

There is no traffic to our load balancers in the tests that are failing. In the terraform destroy output, we see the load balancers successfully destroyed, and then it attempts to destroy the server certificate before timing out after 15 minutes. Nothing else is using the certificate in our tests.

It seems the only way it continues to retry is if it's seeing a DeleteConflict with a "currently in use by arn" message.

If we explicitly coupled the deletion of the certificate to the call to delete the load balancer, I don't see how we wouldn't run into the same issue.

Additionally, if we want to couple deleting the certificate to the load balancer, it may cause issues for others if the certificate is being used by something else.

cc @mcwumbly

@genevieve
Contributor

One more piece of information:

While our CI server was hanging, with this output:

aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m20s elapsed)
aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m30s elapsed)
aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m40s elapsed)

We went ahead and deleted the certificate manually with the AWS CLI, which did not return any error:

aws iam delete-server-certificate --server-certificate-name=bbl-ci-cre-5c33d6320171128234901664200000004

Then, the CI job immediately moved on and terraform destroy succeeded:

...
aws_iam_server_certificate.lb_cert: Still destroying... (ID: ASCAIPJPTO6UFSSDBNYBW, 7m40s elapsed)
aws_iam_server_certificate.lb_cert: Destruction complete after 7m43s

@radeksimko
Member Author

Thanks for providing more details. It may be an effect of IAM eventual consistency and your AWS CLI may be just connecting to a different data centre where the certificate deletion was already propagated, but it's hard to prove this.

I'm thinking we can put some extra logging in place, e.g. parse the ARN of the load balancer from the error message and then ask the LB API about that load balancer. That, in combination with the iam/DeleteServerCertificate request/response, may give us more evidence.
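A rough sketch of what that extra logging could look like; the helper name and log format here are illustrative only, not necessarily what later landed:

    package main

    import (
        "log"
        "strings"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/elb"
    )

    // logConflictingLoadBalancer extracts the load balancer name from a
    // DeleteConflict message such as
    // "Certificate: ... is currently in use by arn:aws:elasticloadbalancing:...:loadbalancer/tf-test-lb-wsxm9."
    // and asks the classic ELB API whether that load balancer still exists.
    func logConflictingLoadBalancer(conn *elb.ELB, msg string) {
        idx := strings.Index(msg, "arn:aws:elasticloadbalancing")
        if idx < 0 {
            return
        }
        arn := strings.TrimSuffix(strings.Fields(msg[idx:])[0], ".")
        parts := strings.SplitN(arn, "/", 2)
        if len(parts) != 2 {
            return
        }
        lbName := parts[1]

        out, err := conn.DescribeLoadBalancers(&elb.DescribeLoadBalancersInput{
            LoadBalancerNames: []*string{aws.String(lbName)},
        })
        // A LoadBalancerNotFound error here would support the eventual-consistency
        // theory: IAM still believes the certificate is attached to an LB that the
        // ELB API says is gone.
        log.Printf("[DEBUG] conflicting LB %q: result=%v err=%v", lbName, out, err)
    }

    func main() {
        sess := session.Must(session.NewSession())
        logConflictingLoadBalancer(elb.New(sess),
            "Certificate: ASCAJXDEAGR64DZQSW564 is currently in use by arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/tf-test-lb-wsxm9.")
    }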

I took some time to also investigate this by looking into debug logs produced by failed tests and I can see the sequence of API calls is as follows:

  • elasticloadbalancing/DeleteLoadBalancer (200 OK)
  • elasticloadbalancing/DescribeLoadBalancers (400 Bad Request, LoadBalancerNotFound)
  • iam/DeleteServerCertificate (409 Conflict)

which suggests that the theory about eventual consistency may be right.

Regarding your point that the only way it continues to retry is if it's seeing a DeleteConflict with a "currently in use by arn" message:

It seems that's what we may end up doing, unfortunately - but before we do that I'd like to take a moment to add the logging as mentioned above and collect some more data before jumping to the radical solution.

@genevievelesperance Are there any more details you can provide? e.g. the region of your LB, and how often you see this error? Does it occur with any LB configuration, or do you have tests with different configurations where it only occurs in some?

@genevieve
Contributor

The region is usually us-west-1. We see this on about one in every three runs. We have one LB configured with the certificate, and it looks like this: https://github.com/cloudfoundry/bosh-bootloader/blob/master/terraform/aws/templates/cf_lb.tf#L154-L191

@genevieve
Contributor

Hey Radek, would you like us to make a PR for logging to collect more data on both our ends?

@radeksimko
Member Author

@genevievelesperance PR would be greatly appreciated. Thanks.

@genevieve
Contributor

Hey @radeksimko. I added logging in #2533 to query the LB API for the load balancer causing the delete conflict.

@genevieve
Contributor

genevieve commented Dec 20, 2017

Hey @radeksimko. We are continuing to see this timeout. Both the AWS CLI and Terraform receive the DeleteConflict when trying to delete the server certificate, and both show the load balancer as gone. Would it be possible to add the configurable delete timeout (and possibly only apply that extended timeout when the load balancer endpoint reports the LB as gone, so we know we are only retrying because the API is slow to become consistent)?
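For the sake of discussion, a hedged sketch of that gating check (the function and flow are hypothetical, not an agreed design): keep retrying, up to the user-configured delete timeout, only when the conflicting load balancer is confirmed gone, and fail fast otherwise because the conflict is then genuine.

    package main

    import (
        "fmt"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/awserr"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/elb"
    )

    // loadBalancerGone reports whether the named classic load balancer no longer
    // exists. If it is gone yet IAM still returns DeleteConflict for the
    // certificate, retrying is only waiting out eventual consistency; if it still
    // exists, the conflict is genuine and should be surfaced immediately.
    func loadBalancerGone(conn *elb.ELB, name string) (bool, error) {
        _, err := conn.DescribeLoadBalancers(&elb.DescribeLoadBalancersInput{
            LoadBalancerNames: []*string{aws.String(name)},
        })
        if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "LoadBalancerNotFound" {
            return true, nil
        }
        return false, err
    }

    func main() {
        sess := session.Must(session.NewSession())
        gone, err := loadBalancerGone(elb.New(sess), "tf-test-lb-wsxm9")
        fmt.Println(gone, err)
    }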

@radeksimko
Member Author

@genevievelesperance I'm more than happy to increase the default timeout as long as we have a debug log which proves this is necessary - hopefully with the record of the API call you kindly added in #2533

Do you mind providing a log from any of the mentioned runs (minus any secret/sensitive data)?

Thanks.

@ghost

ghost commented Apr 8, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost locked and limited conversation to collaborators on Apr 8, 2020