[Bug]: AWS RDS global cluster very long-delay in minor version upgrade #36107

Closed
YakDriver opened this issue Mar 5, 2024 · 3 comments · Fixed by #36246
Labels
bug Addresses a defect in current functionality. service/rds Issues and PRs that pertain to the rds service. service/vpc Issues and PRs that pertain to the vpc service.

@YakDriver
Member

YakDriver commented Mar 5, 2024

Originally posted by @catcharbind in #30358 (comment)

Terraform Core Version

1.7.4

AWS Provider Version

5.39.1

Affected Resource(s)

aws_rds_global_cluster

Expected Behavior

The aws_rds_global_cluster resource performs a minor version upgrade without error and without delay.

Actual Behavior

An attempt to perform a minor version upgrade of an RDS global cluster results in the error below being repeated for 90 minutes, until the operation times out.

Relevant Error/Panic Output Snippet

ModifyGlobalCluster only supports Major Version Upgrades. To patch the members of your global cluster to a newer minor version you need to call ModifyDbCluster in each one of them.
Error: updating RDS Global Cluster (aurora-globaldb): while upgrading minor version of RDS Global Cluster (aurora-globaldb): failed to update engine_version on RDS Global Cluster Cluster (aurora-us-east-2): InvalidParameterValue: Cannot upgrade DB cluster 'aurora-us-east-2' to engine version 15.4 as global replica(s) are running lower version. Please upgrade global replicas first before upgrading the primary member.
│       status code: 400, request id: f86914e5-7cdc-47c0-aae4-41db3e3a9b93
│
│   with module.aurora.aws_rds_global_cluster.globaldb[0],
│   on ../main.tf line 142, in resource "aws_rds_global_cluster" "globaldb":
│  142: resource "aws_rds_global_cluster" "globaldb" {

Terraform Configuration Files

provider "aws" {
  alias  = "primary"
  region = "us-east-2"
}

provider "aws" {
  alias  = "secondary"
  region = "us-east-1"
}

resource "aws_rds_global_cluster" "example" {
  global_cluster_identifier = "global-test"
  engine                    = "aurora-postgresql"
  engine_version            = "15.5"
  database_name             = "example_db"
}

resource "aws_rds_cluster" "primary" {
  apply_immediately         = true
  provider                  = aws.primary
  engine                    = aws_rds_global_cluster.example.engine
  engine_version            = aws_rds_global_cluster.example.engine_version
  cluster_identifier        = "test-primary-cluster"
  master_username           = "username"
  master_password           = "somepass123"
  database_name             = "example_db"
  global_cluster_identifier = aws_rds_global_cluster.example.id
  db_subnet_group_name      = aws_db_subnet_group.default_p.name
  #  depends_on = [ aws_db_subnet_group.default ]
  skip_final_snapshot = true
}

resource "aws_rds_cluster_instance" "primary" {
  apply_immediately  = true
  provider           = aws.primary
  engine             = aws_rds_global_cluster.example.engine
  engine_version     = aws_rds_global_cluster.example.engine_version
  identifier         = "test-primary-cluster-instance"
  cluster_identifier = aws_rds_cluster.primary.id
  instance_class     = "db.r5.large"
  #  db_subnet_group_name      = aws_db_subnet_group.default_name
  #  db_subnet_group_name = "default"
}

resource "aws_rds_cluster" "secondary" {
  apply_immediately         = true
  provider                  = aws.secondary
  engine                    = aws_rds_global_cluster.example.engine
  engine_version            = aws_rds_global_cluster.example.engine_version
  cluster_identifier        = "test-secondary-cluster"
  global_cluster_identifier = aws_rds_global_cluster.example.id
  #  master_username = "username"
  db_subnet_group_name = aws_db_subnet_group.default_s.name
  # db_subnet_group_name      = "default"
  skip_final_snapshot = true
  # depends_on = [
  #   aws_rds_cluster_instance.primary
  # ]
}

resource "aws_rds_cluster_instance" "secondary" {
  apply_immediately    = true
  provider             = aws.secondary
  engine               = aws_rds_global_cluster.example.engine
  engine_version       = aws_rds_global_cluster.example.engine_version
  identifier           = "test-secondary-cluster-instance"
  cluster_identifier   = aws_rds_cluster.secondary.id
  instance_class       = "db.r5.large"
  db_subnet_group_name = aws_db_subnet_group.default_s.name
}

data "aws_vpc" "default_p" {
  provider = aws.primary
  default  = true
}

data "aws_subnets" "example" {
  provider = aws.primary
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default_p.id]
  }
}

resource "aws_db_subnet_group" "default_p" {
  provider = aws.primary
  #  name       = "aaa"
  subnet_ids = data.aws_subnets.example.ids
  depends_on = [data.aws_subnets.example]
}

data "aws_vpc" "default_s" {
  provider = aws.secondary
  default  = true
}

data "aws_subnets" "example2" {
  provider = aws.secondary
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default_s.id]
  }
}

resource "aws_db_subnet_group" "default_s" {
  provider   = aws.secondary
  name       = "aaa"
  subnet_ids = data.aws_subnets.example2.ids
}

Steps to Reproduce

  1. terraform apply with version 15.4
  2. change version to 15.5
  3. terraform apply again
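
The steps above can be driven without editing the file between runs. The following is a minimal sketch, assuming a variable named engine_version, which is hypothetical and not part of the original configuration; the resource arguments are copied from the configuration above.

```hcl
# Hypothetical variable, introduced only for this illustration.
variable "engine_version" {
  type        = string
  default     = "15.4"
  description = "Aurora PostgreSQL engine version for the global cluster."
}

resource "aws_rds_global_cluster" "example" {
  global_cluster_identifier = "global-test"
  engine                    = "aurora-postgresql"
  # The clusters and instances above already reference this attribute,
  # so changing the variable changes the version everywhere.
  engine_version            = var.engine_version
  database_name             = "example_db"
}
```

With this in place, step 1 is a plain terraform apply, and steps 2 and 3 collapse into terraform apply -var="engine_version=15.5".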

Debug Output

<snip>
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m1s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m11s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m21s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m31s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Modifications complete after 1h38m40s [id=tfap-aurora-ap-globaldb]
module.aurora.aws_rds_cluster.primary: Modifying... [id=tfap-aurora-ap-us-east-2]
module.aurora.aws_rds_cluster.primary: Still modifying... [id=tfap-aurora-ap-us-east-2, 10s elapsed]
module.aurora.aws_rds_cluster.primary: Still modifying... [id=tfap-aurora-ap-us-east-2, 20s elapsed]
module.aurora.aws_rds_cluster.primary: Still modifying... [id=tfap-aurora-ap-us-east-2, 30s elapsed]
module.aurora.aws_rds_cluster.primary: Modifications complete after 31s [id=tfap-aurora-ap-us-east-2]
module.aurora.aws_rds_cluster_instance.primary[0]: Modifying... [id=tf-aurora-ap-us-east-2-1]
module.aurora.aws_rds_cluster_instance.primary[0]: Still modifying... [id=tf-aurora-ap-us-east-2-1, 10s elapsed]
module.aurora.aws_rds_cluster_instance.primary[0]: Still modifying... [id=tf-aurora-ap-us-east-2-1, 20s elapsed]
module.aurora.aws_rds_cluster_instance.primary[0]: Still modifying... [id=tf-aurora-ap-us-east-2-1, 30s elapsed]
module.aurora.aws_rds_cluster_instance.primary[0]: Modifications complete after 31s [id=tf-aurora-ap-us-east-2-1]
module.aurora.aws_rds_cluster.secondary[0]: Modifying... [id=tfap-aurora-ap-us-west-2]
module.aurora.aws_rds_cluster.secondary[0]: Still modifying... [id=tfap-aurora-ap-us-west-2, 10s elapsed]
module.aurora.aws_rds_cluster.secondary[0]: Still modifying... [id=tfap-aurora-ap-us-west-2, 20s elapsed]
module.aurora.aws_rds_cluster.secondary[0]: Still modifying... [id=tfap-aurora-ap-us-west-2, 30s elapsed]
module.aurora.aws_rds_cluster.secondary[0]: Modifications complete after 33s [id=tfap-aurora-ap-us-west-2]
module.aurora.aws_rds_cluster_instance.secondary[0]: Modifying... [id=tf-aurora-ap-us-west-2-1]
module.aurora.aws_rds_cluster_instance.secondary[0]: Still modifying... [id=tf-aurora-ap-us-west-2-1, 10s elapsed]
module.aurora.aws_rds_cluster_instance.secondary[0]: Still modifying... [id=tf-aurora-ap-us-west-2-1, 20s elapsed]
module.aurora.aws_rds_cluster_instance.secondary[0]: Still modifying... [id=tf-aurora-ap-us-west-2-1, 30s elapsed]
module.aurora.aws_rds_cluster_instance.secondary[0]: Modifications complete after 31s [id=tf-aurora-ap-us-west-2-1]

Panic Output

No response

Important Factoids

This regression was introduced in #30996, which inverted the retry behavior for errors. The provider relies on a specific AWS error to decide whether to perform a minor version upgrade or a major one. As of #30996, it keeps retrying the major upgrade even when the error indicates that a minor version upgrade is required, which explains the long delay. Only at the very end does it attempt the minor upgrade, which succeeds.

The workaround mentioned above also doesn't work for me. I upgraded the secondary cluster using the AWS console and then applied Terraform, but it is just stuck in the "Still modifying..." phase with no action on the AWS resource.

I also tried this on the latest AWS provider version, 5.30.0, but I am seeing the same issue.

Update 12/16/2023:

The minor version upgrade has now completed successfully using Terraform, but it took a very long time: it was stuck modifying the global cluster for 1 hour and 38 minutes. The upgrade should have simply upgraded the secondary cluster first and then the primary cluster. The Terraform log shows that it modified the primary first, followed by the secondary. In the AWS console, however, I see that the secondary cluster was upgraded first and then the primary. Otherwise the Aurora global database minor version upgrade won't work.

References


github-actions bot commented Mar 5, 2024

Community Note

Voting for Prioritization

  • Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
  • Please see our prioritization guide for information on how we prioritize.
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

  • If you are interested in working on this issue, please leave a comment.
  • If this would be your first contribution, please review the contribution guide.

@github-actions github-actions bot added bug Addresses a defect in current functionality. service/rds Issues and PRs that pertain to the rds service. service/vpc Issues and PRs that pertain to the vpc service. labels Mar 5, 2024
@github-actions github-actions bot added this to the v5.40.0 milestone Mar 7, 2024

github-actions bot commented Mar 7, 2024

This functionality has been released in v5.40.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
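
To pick up the fix released in v5.40.0, a configuration can require that provider version or later. This is a minimal sketch of a version constraint; the ">= 5.40.0" constraint is an assumption about how a user might pin, not something prescribed in this thread.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Assumed constraint: any release from v5.40.0 onward,
      # the version in which this issue was reported as fixed.
      version = ">= 5.40.0"
    }
  }
}
```

After changing the constraint, terraform init -upgrade selects a provider release that satisfies it.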


github-actions bot commented Apr 7, 2024

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 7, 2024