Terraform plan fails while AWS Elasticache Redis cluster is scaling out #18116

Closed
fromz opened this issue Mar 16, 2021 · 9 comments · Fixed by #21185
Labels
bug Addresses a defect in current functionality. service/elasticache Issues and PRs that pertain to the elasticache service.
Comments

fromz commented Mar 16, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Terraform v0.14.5

  • provider registry.terraform.io/hashicorp/aws v3.31.0
  • provider registry.terraform.io/hashicorp/local v2.1.0
  • provider registry.terraform.io/hashicorp/null v3.1.0

Affected Resource(s)

  • aws_elasticache_replication_group

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

resource "aws_elasticache_replication_group" "this" {
  count = 1

  at_rest_encryption_enabled    = true
  multi_az_enabled              = true
  automatic_failover_enabled    = true
  replication_group_id          = "users-cache"
  replication_group_description = "Users Redis cache"
  node_type                     = "cache.t3.medium"
  parameter_group_name          = "default.redis6.x.cluster.on"
  port                          = 6379

  cluster_mode {
    num_node_groups         = 1 # Number of initial shards
    replicas_per_node_group = 1 # Number of initial replicas within each shard
  }

  apply_immediately = true

  lifecycle {
    ignore_changes = [
      # Scaling in AWS will change cluster_mode.num_node_groups and cluster_mode.replicas_per_node_group;
      # disregard drift from the initial configuration.
      cluster_mode,
    ]
  }
}

Debug Output

I'm running Terraform Cloud, which doesn't allow debug output, but I get:

Error: error listing tags for resource (arn:aws:elasticache:ap-southeast-2::cluster:users-cache-0001-001): CacheClusterNotFound: users-cache-0001-001 is either not present or not available.
        status code: 404, request id: b6cfcff3-dfa7-41cf-b099-0eb0c9767990

Expected Behavior

When the cluster status is not 'available' (e.g. while shards are being added), terraform plan/apply should complete without error.

Actual Behavior

Whenever the cluster is unavailable due to online resizing, terraform plan/apply fails.

Steps to Reproduce

  1. terraform apply
  2. wait for operation to complete
  3. log into AWS UI
  4. find generated Elasticache cluster
  5. scale out the cluster (e.g. click "Add shard")
  6. note that the cluster goes into 'modifying' state
  7. run terraform plan
  8. observe failure

Important Factoids

References

  • #0000
@ghost ghost added the service/elasticache Issues and PRs that pertain to the elasticache service. label Mar 16, 2021
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Mar 16, 2021
@bill-rich bill-rich added bug Addresses a defect in current functionality. and removed needs-triage Waiting for first response or review from a maintainer. labels Mar 17, 2021
gdavison (Contributor) commented:

Implementation note: Based on the error message, this is likely related to how the resource manages tags on the individual cluster nodes. The attempt to read tags on the node has failed because the node has been removed or (possibly) is scaling.
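The tolerant handling this note points toward can be sketched roughly as follows. This is an illustrative Python/boto3 shape only, not the provider's actual implementation (the provider is written in Go); the helper names and the skip-on-not-found policy are assumptions:

```python
# Illustrative sketch: treat the transient not-found error seen in the report
# above as "tags unknown right now" rather than a hard plan failure.
# Helper names are hypothetical.

# Error code observed above while a node is mid-resize or removed.
SKIPPABLE_TAG_ERROR_CODES = {"CacheClusterNotFound"}

def is_transient_tag_error(error_code: str) -> bool:
    """Return True if a ListTagsForResource failure should be skipped
    instead of failing the plan."""
    return error_code in SKIPPABLE_TAG_ERROR_CODES

def list_tags_or_skip(client, arn: str):
    """Read tags for an ElastiCache node, returning None instead of raising
    while the node is being removed or resized."""
    try:
        return client.list_tags_for_resource(ResourceName=arn)["TagList"]
    except client.exceptions.CacheClusterNotFoundFault:
        return None  # node gone or scaling; leave tags unrefreshed
```

With this shape, a refresh would carry forward the previously known tags instead of aborting the whole plan.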

ktham (Contributor) commented Apr 16, 2021

Would it be advisable to catch this error and proceed, skipping any tag-based changes, while an ElastiCache scale-up is in progress? Otherwise we are effectively locked out of running Terraform for however long the scale-up takes 😢

rawrgulmuffins commented:

I've hit this a few times recently. I would also be interested in the answer to the catch question above.

github-actions bot commented Oct 8, 2021

This functionality has been released in v3.62.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

ktham (Contributor) commented Oct 21, 2021

Hi @gdavison, this issue is still not fixed. Terraform plans continue to fail during the "list tags" operation when the ElastiCache cluster is not "available" due to an in-progress cluster operation. Perhaps the provider could skip refreshing the tag state while ElastiCache is mid-operation.

 Error: error listing tags for resource (arn:aws:elasticache:us-east-1:xxx:xxx): timeout while waiting for state to become 'available' (last state: 'snapshotting', timeout: 40m0s)

ktham (Contributor) commented Oct 21, 2021

From https://docs.aws.amazon.com/cli/latest/reference/elasticache/list-tags-for-resource.html

If the cluster is not in the available state, ListTagsForResource returns an error.

Ideally, the AWS provider should handle this situation gracefully during the plan stage, so that plans can continue to run even while ElastiCache Redis is undergoing its routine nightly snapshot or scaling up.

@gdavison - I would propose reopening this ticket, as I don't think #21185 addresses this. (cc @ewbankkit, who reviewed the PR)
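The graceful handling proposed here could look roughly like the sketch below: gate the ListTagsForResource call on cluster status instead of failing the plan. This is a hedged illustration, not the provider's code; `should_refresh_tags` and `refresh_tags_if_available` are hypothetical names:

```python
# Sketch of the proposal above: only read tags when the cluster is
# 'available'; otherwise skip the refresh. Function names are hypothetical.

def should_refresh_tags(status: str) -> bool:
    """Per the AWS docs quoted above, ListTagsForResource only succeeds
    when the cluster is in the 'available' state."""
    return status == "available"

def refresh_tags_if_available(client, cluster_id: str, arn: str):
    """Return the tag list, or None (skip the refresh) while the cluster
    is snapshotting, modifying, etc."""
    resp = client.describe_cache_clusters(CacheClusterId=cluster_id)
    status = resp["CacheClusters"][0]["CacheClusterStatus"]
    if not should_refresh_tags(status):
        return None  # keep previously known tags in state
    return client.list_tags_for_resource(ResourceName=arn)["TagList"]
```

Skipping the refresh means tag drift goes undetected during that one plan, which seems a better trade-off than the plan failing outright.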

jeffery-jen commented:

Please reopen this issue; it is a problem for any ElastiCache cluster provisioned with Terraform that happens to be in a snapshotting state.

okelitse commented Apr 7, 2022

Hi,

I am facing a similar issue; is there a fix for this?

Error: error listing tags for ElastiCache Cluster (cache_instance_name_here-dev): CacheClusterNotFound: cache_instance_name_here-dev is either not present or not available.

Thanks.

github-actions bot commented May 8, 2022

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 8, 2022
7 participants