Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terraform state gets out of sync with AWS CloudHSM v2 resources, and creates more CloudHSM v2 instances than defined in the code #8648

Closed
chamindg opened this issue May 15, 2019 · 11 comments · Fixed by #18580
Assignees
Labels
bug Addresses a defect in current functionality. service/cloudhsmv2 Issues and PRs that pertain to the cloudhsmv2 service.
Milestone

Comments

@chamindg
Copy link

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform v0.11.8

Affected Resource(s)

aws_cloudhsm_v2_hsm

Terraform Configuration Files

resource "aws_cloudhsm_v2_cluster" "main" {
  hsm_type   = "hsm1.medium"
  subnet_ids = ["${var.subnet_ids}"]

  tags {
    Name        = "cloudhsm.${var.region}.${var.env}.${var.stack}"
    environment = "${var.env}"
    stack       = "${var.stack}"
  }
}

resource "aws_cloudhsm_v2_hsm" "hsm1" {
  subnet_id  = "${element(var.subnet_ids, 0)}"
  cluster_id = "${aws_cloudhsm_v2_cluster.main.cluster_id}"
}

resource "aws_cloudhsm_v2_hsm" "hsm2" {
  subnet_id  = "${element(var.subnet_ids, 1)}"
  cluster_id = "${aws_cloudhsm_v2_cluster.main.cluster_id}"
}

Expected Behavior

Once two cloudhsm instances are created, subsequent terraform runs don't change anything. More specifically, terraform doesn't create more cloudhsm instances.

Actual Behavior

Some subsequent runs randomly create a new cloudhsm instance, in addition to existing. This happens randomly from time to time, and we end up with more than two instances. I.e.; terraform state says only two instances are there, while there are more than two in AWS.

Steps to Reproduce

Haven't been able to reproduce yet. From what we've seen, terraform can run for months before it starts showing this weird behavior. Haven't noticed any patterns.

Important Factoids

This is happening in two AWS accounts

@chamindg
Copy link
Author

Found the problem.

Please note the following excerpt from https://aws.amazon.com/cloudhsm/faqs:

Q: What happens in case of failure?

The CloudHSM service provides fully managed HSMs in the AWS cloud. The service handles all updates and failover for you. Replacements are transparent to your application, as the CloudHSM client automatically handles failover and load balancing. HSMs are replaced to the same ENI as the original HSM. You can see when an HSM has been replaced in your audit logs in CloudWatch. You will see the log stream for one HSM ID terminate, and a new HSM ID begin, when a replacement occurs. Refer to the Monitoring AWS CloudHSM Audit Logs in Amazon CloudWatch Logs documentation at https://docs.aws.amazon.com/cloudhsm/latest/userguide/get-hsm-audit-logs-using-cloudwatch.html

This means terraform state cannot use CloudHSM instance ID as the unique identifier. ENI ID should work, as it stays the same.

Hope this information is helpful to provide a fix.

@bflad bflad added service/cloudhsm service/cloudhsmv2 Issues and PRs that pertain to the cloudhsmv2 service. and removed service/cloudhsm labels May 16, 2019
@chamindg
Copy link
Author

@bflad , is this assigned to any release yet? Thanks

@bflad
Copy link
Contributor

bflad commented May 21, 2019

Hi @chamindg 👋 This is currently not a focus of the maintainers, but we would be happy to look at a pull request for a fix. We do not generally assign items more than a week or two out at the moment.

@aeschright aeschright added the needs-triage Waiting for first response or review from a maintainer. label Jun 24, 2019
blairboy362 added a commit to alphagov/gsp that referenced this issue Jul 2, 2019
Due to:
hashicorp/terraform-provider-aws#8648
and the inability to scale an HSM cluster out from terraform it makes
sense to not manage the cloudHSMs from terraform.
blairboy362 added a commit to alphagov/gsp that referenced this issue Jul 2, 2019
Due to:
hashicorp/terraform-provider-aws#8648
and the inability to scale an HSM cluster out from terraform it makes
sense to not manage the cloudHSMs from terraform.
@aeschright aeschright added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Dec 17, 2019
@mattburgess
Copy link
Collaborator

@bflad - we've been hit by this bug a couple of times. I'd like to help out by working on the PR. My initial concern is one of backwards compatibility though; as it'll require tracking resources by a different unique identifier (ENI ID, rather than HSM instance ID) I'd hate for my naive implementation to throw everyone's HSMs away when they upgrade their provider.

Are there any examples where such tracking-id-migrations (for want of a better term) have been necessary? If so I'll gladly take a look at sorting this one out. Thanks!

mattburgess added a commit to mattburgess/terraform-provider-aws that referenced this issue Mar 17, 2020
Closes hashicorp#8648

From https://aws.amazon.com/cloudhsm/faqs/:

> Amazon monitors and maintains the HSM and network for availability and error
> conditions. If an HSM fails or loses network connectivity, the HSM will be
> automatically replaced

When this happens, the replacement HSM joins the cluster with a new HSM ID
but attached to the same ENI ID as the failed HSM. So, track HSM instances
using their ENI ID rather than their HSM ID.
@keksipurkki
Copy link

I also encountered this bug and it's a good thing you're working on it. I'd like to point out that while not critical, the financial ramifications of the issue are rather significant. An extra rogue HSM in a cluster costs some $1500 per month. Yikes!

@stevecrozz
Copy link

@mattburgess I didn't yet find precedent for changing IDs, maybe I can with a bit more digging, but one idea could be to make the provider attempt to first find HSM by eni_id as suggested by @chamindg, and if that fails, then match by id:
https://github.com/terraform-providers/terraform-provider-aws/blob/v3.3.0/aws/resource_aws_cloudhsm2_hsm.go#L93

That way, whatever ID happens to live in your state file, whether it is an eni_id, or a hsm_id, it will still find the right one. Then we would need to store that eni_id in place of the hsm_id for new resources.

I'm going to try to independently confirm that when our hsm_id changes, the eni_id remains the same.

@mseelye-well
Copy link

I've encountered this today, and the ip addresses and eni stayed the same. The only change were the "hsm_id" and "id" (same value).

We're considering a number of workarounds to this, but it would be great if this were updated/fixed.

@adamjlow
Copy link

We too are experiencing this, subsequent terraform plan's attempt to create additional nodes and at approx. $1000 USD per month, it's an expensive issue ;)

@jonphilpott
Copy link

I am also experiencing this same issue, one HSM instance in our cluster was deemed unhealthy for a reason and AWS automatically rebuilt a replacement, same ENI, new HSM ID.

@bflad bflad self-assigned this Apr 6, 2021
bflad added a commit that referenced this issue Apr 6, 2021
…ionally matching on ENI identifier during lookup

Reference: #8648
Reference: #16796

Also implements the paginated function to prevent missed matches in large environments and tidies up the existing test.

Output from acceptance testing:

```
--- PASS: TestAccAWSCloudHsmV2Hsm_basic (856.31s)
```
bflad added a commit that referenced this issue Apr 6, 2021
…ionally matching on ENI identifier during lookup (#18580)

* resource/aws_cloudhsm_v2_hsm: Prevent orphaned HSM Instances by additionally matching on ENI identifier during lookup

Reference: #8648
Reference: #16796

Also implements the paginated function to prevent missed matches in large environments and tidies up the existing test.

Output from acceptance testing:

```
--- PASS: TestAccAWSCloudHsmV2Hsm_basic (856.31s)
```

* Update CHANGELOG for #18580
@github-actions github-actions bot added this to the v3.36.0 milestone Apr 6, 2021
@ghost
Copy link

ghost commented Apr 9, 2021

This has been released in version 3.36.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@ghost
Copy link

ghost commented May 6, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators May 6, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/cloudhsmv2 Issues and PRs that pertain to the cloudhsmv2 service.
Projects
None yet
9 participants