[Bug]: AWS RDS global cluster very long-delay in minor version upgrade #36107

YakDriver · 2024-03-05T21:09:40Z

Originally posted by @catcharbind in #30358 (comment)

Terraform Core Version

1.7.4

AWS Provider Version

5.39.1

Affected Resource(s)

aws_rds_global_cluster

Expected Behavior

The aws_rds_global_cluster resource performs a minor version upgrade without error and without delay.

Actual Behavior

An attempt to perform a minor version upgrade of an RDS global cluster results in the error below being retried for 90 minutes, until the operation times out.

Relevant Error/Panic Output Snippet

ModifyGlobalCluster only supports Major Version Upgrades. To patch the members of your global cluster to a newer minor version you need to call ModifyDbCluster in each one of them.

Error: updating RDS Global Cluster (aurora-globaldb): while upgrading minor version of RDS Global Cluster (aurora-globaldb): failed to update engine_version on RDS Global Cluster Cluster (aurora-us-east-2): InvalidParameterValue: Cannot upgrade DB cluster 'aurora-us-east-2' to engine version 15.4 as global replica(s) are running lower version. Please upgrade global replicas first before upgrading the primary member.
│       status code: 400, request id: f86914e5-7cdc-47c0-aae4-41db3e3a9b93
│
│   with module.aurora.aws_rds_global_cluster.globaldb[0],
│   on ../main.tf line 142, in resource "aws_rds_global_cluster" "globaldb":
│  142: resource "aws_rds_global_cluster" "globaldb" {

Terraform Configuration Files

provider "aws" {
  alias  = "primary"
  region = "us-east-2"
}

provider "aws" {
  alias  = "secondary"
  region = "us-east-1"
}

resource "aws_rds_global_cluster" "example" {
  global_cluster_identifier = "global-test"
  engine                    = "aurora-postgresql"
  engine_version            = "15.5"
  database_name             = "example_db"
}

resource "aws_rds_cluster" "primary" {
  apply_immediately         = true
  provider                  = aws.primary
  engine                    = aws_rds_global_cluster.example.engine
  engine_version            = aws_rds_global_cluster.example.engine_version
  cluster_identifier        = "test-primary-cluster"
  master_username           = "username"
  master_password           = "somepass123"
  database_name             = "example_db"
  global_cluster_identifier = aws_rds_global_cluster.example.id
  db_subnet_group_name      = aws_db_subnet_group.default_p.name
  #  depends_on = [ aws_db_subnet_group.default ]
  skip_final_snapshot = true
}

resource "aws_rds_cluster_instance" "primary" {
  apply_immediately  = true
  provider           = aws.primary
  engine             = aws_rds_global_cluster.example.engine
  engine_version     = aws_rds_global_cluster.example.engine_version
  identifier         = "test-primary-cluster-instance"
  cluster_identifier = aws_rds_cluster.primary.id
  instance_class     = "db.r5.large"
  #  db_subnet_group_name      = aws_db_subnet_group.default_name
  #  db_subnet_group_name = "default"
}

resource "aws_rds_cluster" "secondary" {
  apply_immediately         = true
  provider                  = aws.secondary
  engine                    = aws_rds_global_cluster.example.engine
  engine_version            = aws_rds_global_cluster.example.engine_version
  cluster_identifier        = "test-secondary-cluster"
  global_cluster_identifier = aws_rds_global_cluster.example.id
  #  master_username = "username"
  db_subnet_group_name = aws_db_subnet_group.default_s.name
  # db_subnet_group_name      = "default"
  skip_final_snapshot = true
  # depends_on = [
  #   aws_rds_cluster_instance.primary
  # ]
}

resource "aws_rds_cluster_instance" "secondary" {
  apply_immediately    = true
  provider             = aws.secondary
  engine               = aws_rds_global_cluster.example.engine
  engine_version       = aws_rds_global_cluster.example.engine_version
  identifier           = "test-secondary-cluster-instance"
  cluster_identifier   = aws_rds_cluster.secondary.id
  instance_class       = "db.r5.large"
  db_subnet_group_name = aws_db_subnet_group.default_s.name
}

data "aws_vpc" "default_p" {
  provider = aws.primary
  default  = true
}

data "aws_subnets" "example" {
  provider = aws.primary
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default_p.id]
  }
}

resource "aws_db_subnet_group" "default_p" {
  provider = aws.primary
  #  name       = "aaa"
  subnet_ids = data.aws_subnets.example.ids
  depends_on = [data.aws_subnets.example]
}

data "aws_vpc" "default_s" {
  provider = aws.secondary
  default  = true
}

data "aws_subnets" "example2" {
  provider = aws.secondary
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default_s.id]
  }
}

resource "aws_db_subnet_group" "default_s" {
  provider   = aws.secondary
  name       = "aaa"
  subnet_ids = data.aws_subnets.example2.ids
}

Steps to Reproduce

1. Run terraform apply with engine_version set to 15.4.
2. Change engine_version to 15.5.
3. Run terraform apply again.
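
One way to make the version change in these steps repeatable is to drive engine_version from a variable. This is only a sketch and not part of the original report: the variable name engine_version is hypothetical, and the resource mirrors the aws_rds_global_cluster.example block from the configuration above.

```hcl
# Hypothetical variable (not in the original configuration) that selects
# the Aurora PostgreSQL engine version for the global cluster.
variable "engine_version" {
  type    = string
  default = "15.4" # first apply: the starting minor version
}

resource "aws_rds_global_cluster" "example" {
  global_cluster_identifier = "global-test"
  engine                    = "aurora-postgresql"
  engine_version            = var.engine_version
  database_name             = "example_db"
}
```

With this in place, the second apply can be run as terraform apply -var="engine_version=15.5". The member clusters and instances pick up the change because they already reference aws_rds_global_cluster.example.engine_version.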

Debug Output

<snip>
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m1s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m11s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m21s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Still modifying... [id=tfap-aurora-ap-globaldb, 1h38m31s elapsed]
module.aurora.aws_rds_global_cluster.globaldb[0]: Modifications complete after 1h38m40s [id=tfap-aurora-ap-globaldb]
module.aurora.aws_rds_cluster.primary: Modifying... [id=tfap-aurora-ap-us-east-2]
module.aurora.aws_rds_cluster.primary: Still modifying... [id=tfap-aurora-ap-us-east-2, 10s elapsed]
module.aurora.aws_rds_cluster.primary: Still modifying... [id=tfap-aurora-ap-us-east-2, 20s elapsed]
module.aurora.aws_rds_cluster.primary: Still modifying... [id=tfap-aurora-ap-us-east-2, 30s elapsed]
module.aurora.aws_rds_cluster.primary: Modifications complete after 31s [id=tfap-aurora-ap-us-east-2]
module.aurora.aws_rds_cluster_instance.primary[0]: Modifying... [id=tf-aurora-ap-us-east-2-1]
module.aurora.aws_rds_cluster_instance.primary[0]: Still modifying... [id=tf-aurora-ap-us-east-2-1, 10s elapsed]
module.aurora.aws_rds_cluster_instance.primary[0]: Still modifying... [id=tf-aurora-ap-us-east-2-1, 20s elapsed]
module.aurora.aws_rds_cluster_instance.primary[0]: Still modifying... [id=tf-aurora-ap-us-east-2-1, 30s elapsed]
module.aurora.aws_rds_cluster_instance.primary[0]: Modifications complete after 31s [id=tf-aurora-ap-us-east-2-1]
module.aurora.aws_rds_cluster.secondary[0]: Modifying... [id=tfap-aurora-ap-us-west-2]
module.aurora.aws_rds_cluster.secondary[0]: Still modifying... [id=tfap-aurora-ap-us-west-2, 10s elapsed]
module.aurora.aws_rds_cluster.secondary[0]: Still modifying... [id=tfap-aurora-ap-us-west-2, 20s elapsed]
module.aurora.aws_rds_cluster.secondary[0]: Still modifying... [id=tfap-aurora-ap-us-west-2, 30s elapsed]
module.aurora.aws_rds_cluster.secondary[0]: Modifications complete after 33s [id=tfap-aurora-ap-us-west-2]
module.aurora.aws_rds_cluster_instance.secondary[0]: Modifying... [id=tf-aurora-ap-us-west-2-1]
module.aurora.aws_rds_cluster_instance.secondary[0]: Still modifying... [id=tf-aurora-ap-us-west-2-1, 10s elapsed]
module.aurora.aws_rds_cluster_instance.secondary[0]: Still modifying... [id=tf-aurora-ap-us-west-2-1, 20s elapsed]
module.aurora.aws_rds_cluster_instance.secondary[0]: Still modifying... [id=tf-aurora-ap-us-west-2-1, 30s elapsed]
module.aurora.aws_rds_cluster_instance.secondary[0]: Modifications complete after 31s [id=tf-aurora-ap-us-west-2-1]

Panic Output

No response

Important Factoids

This was introduced in #30996, when the retry logic for errors was flipped. A specific AWS error is used to determine when to do a minor version upgrade vs. a major one. As of #30996, the logic keeps retrying a major upgrade when the error indicates that a minor version upgrade is required. This is why there is such a long delay. Then, at the very end, it tries the minor upgrade, which succeeds.

The workaround mentioned above also doesn't work for me. I upgraded the secondary cluster using the AWS console and then applied Terraform, but it is just stuck in the "Still modifying..." phase with no action on the AWS resource.

I also tried this on the latest AWS provider version, 5.30.0, but I am seeing the same issue.

Update 12/16/2023:

The minor version upgrade has now completed successfully using Terraform, but it took a very long time: it was stuck modifying the global cluster for 1 hour and 38 minutes. The upgrade should have simply upgraded the secondary cluster first and then the primary cluster. The log shows that it modified the primary first, followed by the secondary. In the AWS console, however, I see that the secondary cluster was upgraded first and then the primary. Otherwise the Aurora global database minor version upgrade won't work!

References

Add validation to the global_cluster_identifier property on a global cluster #30996


github-actions · 2024-03-05T21:09:51Z

Community Note

Voting for Prioritization

Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
Please see our prioritization guide for information on how we prioritize.
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

If you are interested in working on this issue, please leave a comment.
If this would be your first contribution, please review the contribution guide.

github-actions · 2024-03-07T23:05:18Z

This functionality has been released in v5.40.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
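
To pick up the fix, the provider version constraint has to allow v5.40.0 or later. A minimal sketch, assuming the configuration installs the provider from the hashicorp/aws registry source; the constraint shown is illustrative and should be adjusted to your own pinning policy.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # v5.40.0 is the release named in the comment above as containing the fix.
      version = ">= 5.40.0"
    }
  }
}
```

After changing the constraint, running terraform init -upgrade lets Terraform select a provider release that satisfies it.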

github-actions · 2024-04-07T02:04:24Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot added the bug, service/rds, and service/vpc labels Mar 5, 2024

YakDriver mentioned this issue Mar 6, 2024

rds/global_cluster: Fix version upgrade errors #36246

Merged

YakDriver closed this as completed in #36246 Mar 7, 2024

github-actions bot added this to the v5.40.0 milestone Mar 7, 2024

github-actions bot locked as resolved and limited conversation to collaborators Apr 7, 2024
