Terraform state gets out of sync with AWS CloudHSM v2 resources, and creates more CloudHSM v2 instances than defined in the code #8648

chamindg · 2019-05-15T16:59:00Z

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

Terraform v0.11.8

Affected Resource(s)

aws_cloudhsm_v2_hsm

Terraform Configuration Files

resource "aws_cloudhsm_v2_cluster" "main" {
  hsm_type   = "hsm1.medium"
  subnet_ids = ["${var.subnet_ids}"]

  tags {
    Name        = "cloudhsm.${var.region}.${var.env}.${var.stack}"
    environment = "${var.env}"
    stack       = "${var.stack}"
  }
}

resource "aws_cloudhsm_v2_hsm" "hsm1" {
  subnet_id  = "${element(var.subnet_ids, 0)}"
  cluster_id = "${aws_cloudhsm_v2_cluster.main.cluster_id}"
}

resource "aws_cloudhsm_v2_hsm" "hsm2" {
  subnet_id  = "${element(var.subnet_ids, 1)}"
  cluster_id = "${aws_cloudhsm_v2_cluster.main.cluster_id}"
}

Expected Behavior

Once the two CloudHSM instances are created, subsequent Terraform runs should not change anything. More specifically, Terraform should not create more CloudHSM instances.

Actual Behavior

Some subsequent runs create a new CloudHSM instance in addition to the existing ones. This happens unpredictably, and we end up with more than two instances; that is, the Terraform state lists only two instances while more than two exist in AWS.

Steps to Reproduce

Haven't been able to reproduce yet. From what we've seen, terraform can run for months before it starts showing this weird behavior. Haven't noticed any patterns.

Important Factoids

This is happening in two AWS accounts

chamindg · 2019-05-16T14:32:53Z

Found the problem.

Please note the following excerpt from https://aws.amazon.com/cloudhsm/faqs:

Q: What happens in case of failure?

The CloudHSM service provides fully managed HSMs in the AWS cloud. The service handles all updates and failover for you. Replacements are transparent to your application, as the CloudHSM client automatically handles failover and load balancing. HSMs are replaced to the same ENI as the original HSM. You can see when an HSM has been replaced in your audit logs in CloudWatch. You will see the log stream for one HSM ID terminate, and a new HSM ID begin, when a replacement occurs. Refer to the Monitoring AWS CloudHSM Audit Logs in Amazon CloudWatch Logs documentation at https://docs.aws.amazon.com/cloudhsm/latest/userguide/get-hsm-audit-logs-using-cloudwatch.html

This means terraform state cannot use CloudHSM instance ID as the unique identifier. ENI ID should work, as it stays the same.

Hope this information is helpful to provide a fix.
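The effect described in the FAQ excerpt can be illustrated with a minimal sketch. The `HSM` type, the helper functions, and the IDs below are hypothetical stand-ins for illustration, not the provider's code or the AWS SDK types. The sketch assumes only what the FAQ states: a replacement HSM gets a new HSM ID but keeps the original ENI.

```go
package main

import "fmt"

// HSM is a simplified, hypothetical stand-in for an HSM as reported by AWS.
type HSM struct {
	HsmID string
	EniID string
}

// findByHsmID returns the HSM whose HSM ID equals id, or nil if none matches.
func findByHsmID(hsms []HSM, id string) *HSM {
	for i := range hsms {
		if hsms[i].HsmID == id {
			return &hsms[i]
		}
	}
	return nil
}

// findByEniID returns the HSM attached to the given ENI, or nil if none matches.
func findByEniID(hsms []HSM, eni string) *HSM {
	for i := range hsms {
		if hsms[i].EniID == eni {
			return &hsms[i]
		}
	}
	return nil
}

func main() {
	// What was recorded when the HSM was first created (made-up IDs).
	tracked := HSM{HsmID: "hsm-aaaa", EniID: "eni-1111"}

	// What the cluster contains after AWS replaces the failed HSM:
	// a new HSM ID on the same ENI.
	cluster := []HSM{{HsmID: "hsm-bbbb", EniID: "eni-1111"}}

	// Lookup by HSM ID misses, so the resource looks deleted and would be recreated.
	fmt.Println("found by HSM ID:", findByHsmID(cluster, tracked.HsmID) != nil)
	// Lookup by ENI ID still finds the replacement.
	fmt.Println("found by ENI ID:", findByEniID(cluster, tracked.EniID) != nil)
}
```

A lookup keyed on the HSM ID reports the resource as missing after a replacement, which would explain why a later run plans a new HSM while the replacement keeps running.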

chamindg · 2019-05-21T14:37:38Z

@bflad , is this assigned to any release yet? Thanks

bflad · 2019-05-21T17:54:51Z

Hi @chamindg 👋 This is currently not a focus of the maintainers, but we would be happy to look at a pull request for a fix. We do not generally assign items more than a week or two out at the moment.

Due to hashicorp/terraform-provider-aws#8648 and the inability to scale an HSM cluster out from Terraform, it makes sense not to manage the CloudHSMs from Terraform.

mattburgess · 2020-03-17T17:26:18Z

@bflad - we've been hit by this bug a couple of times. I'd like to help out by working on the PR. My initial concern is one of backwards compatibility though; as it'll require tracking resources by a different unique identifier (ENI ID, rather than HSM instance ID) I'd hate for my naive implementation to throw everyone's HSMs away when they upgrade their provider.

Are there any examples where such tracking-id-migrations (for want of a better term) have been necessary? If so I'll gladly take a look at sorting this one out. Thanks!

Closes hashicorp#8648

From https://aws.amazon.com/cloudhsm/faqs/:

> Amazon monitors and maintains the HSM and network for availability and error
> conditions. If an HSM fails or loses network connectivity, the HSM will be
> automatically replaced

When this happens, the replacement HSM joins the cluster with a new HSM ID but attached to the same ENI ID as the failed HSM. So, track HSM instances using their ENI ID rather than their HSM ID.

keksipurkki · 2020-04-23T10:51:31Z

I also encountered this bug and it's a good thing you're working on it. I'd like to point out that while not critical, the financial ramifications of the issue are rather significant. An extra rogue HSM in a cluster costs some $1500 per month. Yikes!

stevecrozz · 2020-08-24T16:33:26Z

@mattburgess I didn't yet find precedent for changing IDs, maybe I can with a bit more digging, but one idea could be to make the provider attempt to first find HSM by eni_id as suggested by @chamindg, and if that fails, then match by id:
https://github.com/terraform-providers/terraform-provider-aws/blob/v3.3.0/aws/resource_aws_cloudhsm2_hsm.go#L93

That way, whatever ID happens to live in your state file, whether it is an eni_id, or a hsm_id, it will still find the right one. Then we would need to store that eni_id in place of the hsm_id for new resources.

I'm going to try to independently confirm that when our hsm_id changes, the eni_id remains the same.
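The fallback idea in this comment could look roughly like the sketch below. It is an illustration of the suggestion only, not the provider's implementation: the `HSM` type, `findHSM`, and the IDs are hypothetical, and the real lookup at the linked line works against the AWS SDK's types.

```go
package main

import "fmt"

// HSM is a simplified, hypothetical stand-in for an HSM as reported by AWS.
type HSM struct {
	HsmID string
	EniID string
}

// findHSM resolves the ID stored in state against the HSMs in the cluster.
// It first treats storedID as an ENI ID and, if nothing matches, falls back
// to treating it as an HSM ID, so state written either way still resolves.
func findHSM(hsms []HSM, storedID string) (HSM, bool) {
	for _, h := range hsms {
		if h.EniID == storedID {
			return h, true
		}
	}
	for _, h := range hsms {
		if h.HsmID == storedID {
			return h, true
		}
	}
	return HSM{}, false
}

func main() {
	cluster := []HSM{
		{HsmID: "hsm-aaaa", EniID: "eni-1111"},
		{HsmID: "hsm-bbbb", EniID: "eni-2222"},
	}

	// State written by an older provider version holds an HSM ID.
	legacy, ok := findHSM(cluster, "hsm-bbbb")
	fmt.Println(legacy.EniID, ok)

	// State written after the change holds an ENI ID.
	current, ok := findHSM(cluster, "eni-1111")
	fmt.Println(current.HsmID, ok)
}
```

Note that the fallback alone does not help state that still holds an HSM ID once that HSM has been replaced; such entries would have to be rewritten to the ENI ID while the original HSM ID is still valid.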

mseelye-well · 2021-03-15T16:20:08Z

I encountered this today, and the IP addresses and ENI stayed the same. The only changes were to "hsm_id" and "id" (same value).

We're considering a number of workarounds to this, but it would be great if this were updated/fixed.

adamjlow · 2021-03-19T15:57:11Z

We too are experiencing this: subsequent terraform plan runs attempt to create additional nodes, and at approx. $1000 USD per month, it's an expensive issue ;)

jonphilpott · 2021-03-29T20:11:07Z

I am also experiencing this same issue. One HSM instance in our cluster was deemed unhealthy for some reason and AWS automatically built a replacement: same ENI, new HSM ID.

…ionally matching on ENI identifier during lookup

Reference: #8648
Reference: #16796

Also implements the paginated function to prevent missed matches in large environments and tidies up the existing test.

Output from acceptance testing:

```
--- PASS: TestAccAWSCloudHsmV2Hsm_basic (856.31s)
```

…ionally matching on ENI identifier during lookup (#18580)

* resource/aws_cloudhsm_v2_hsm: Prevent orphaned HSM Instances by additionally matching on ENI identifier during lookup

Reference: #8648
Reference: #16796

Also implements the paginated function to prevent missed matches in large environments and tidies up the existing test.

Output from acceptance testing:

```
--- PASS: TestAccAWSCloudHsmV2Hsm_basic (856.31s)
```

* Update CHANGELOG for #18580

ghost · 2021-04-09T02:42:20Z

This has been released in version 3.36.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

ghost · 2021-05-06T17:10:12Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

bflad added service/cloudhsm service/cloudhsmv2 and removed service/cloudhsm labels May 16, 2019

aeschright added the needs-triage label Jun 24, 2019

blairboy362 mentioned this issue Jul 2, 2019

No longer use terraform to manage the state of the cloudHSMs. alphagov/gsp#253

Merged

aeschright added bug and removed needs-triage labels Dec 17, 2019

mattburgess mentioned this issue Mar 17, 2020

resource/aws_cloudhsm2_hsm: Prevent creation of extra HSMs #12435

Closed

bflad self-assigned this Apr 6, 2021

bflad mentioned this issue Apr 6, 2021

resource/aws_cloudhsm_v2_hsm: Prevent orphaned HSM Instances by additionally matching on ENI identifier during lookup #18580

Merged

bflad closed this as completed in #18580 Apr 6, 2021

github-actions bot added this to the v3.36.0 milestone Apr 6, 2021

ghost locked as resolved and limited conversation to collaborators May 6, 2021
