
[Bug]: RDS Instance creation, resource already exists #1346

Closed
1 task done
mjnovice opened this issue Jun 6, 2024 · 17 comments
Labels
bug, needs:triage

Comments

@mjnovice

mjnovice commented Jun 6, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

No response

Resource MRs required to reproduce the bug

apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
  annotations:
    crossplane.io/external-create-pending: "2024-06-06T17:43:44Z"
    crossplane.io/external-create-succeeded: "2024-06-06T17:43:44Z"
  creationTimestamp: "2024-06-06T15:55:19Z"
  finalizers:
  - finalizer.managedresource.crossplane.io
  generateName: ci-aseks6062802
  generation: 3
  labels:
    clusterId: ci-aseks6062802
  name: ci-aseks6062802qrvcs
  resourceVersion: "67343"
  uid: 48919564-24f7-43fb-b1ba-9dbbd8a9b1fa
spec:
  deletionPolicy: Delete
  forProvider:
    allocatedStorage: 256
    applyImmediately: true
    autoGeneratePassword: true
    backupRetentionPeriod: 3
    copyTagsToSnapshot: true
    dbSubnetGroupName: ci-aseks6062802k28ws
    dbSubnetGroupNameRef:
      name: ci-aseks6062802k28ws
    dbSubnetGroupNameSelector:
      matchLabels:
        clusterId: ci-aseks6062802
    engine: sqlserver-se
    engineVersion: 15.00.4236.7.v1
    identifier: ci-aseks6062802
    instanceClass: db.m5.xlarge
    licenseModel: license-included
    maxAllocatedStorage: 1000
    multiAz: false
    passwordSecretRef:
      key: password
      name: ci-aseks6062802
      namespace: sfd-ci-aseks6062802
    publiclyAccessible: true
    region: us-east-1
    skipFinalSnapshot: true
    storageEncrypted: true
    storageType: gp2
    username: testadmin
    vpcSecurityGroupIdRefs:
    - name: ci-aseks6062802-redis-mssql-sg2qb9w
    vpcSecurityGroupIdSelector:
      matchLabels:
        clusterId: ci-aseks6062802
        crossplane.io/forResource: redis-mssql
    vpcSecurityGroupIds:
    - sg-0ee65bd8753e2fdb8
  initProvider: {}
  managementPolicies:
  - '*'
  providerConfigRef:
    name: service-fabric
  writeConnectionSecretToRef:
    name: mssql-creds
    namespace: sfd-ci-aseks6062802

Steps to Reproduce

  • Created an RDS Instance on a kind cluster running Crossplane.
  • It sometimes gets stuck with a "resource already exists" error, even though no such instance existed initially.

What happened?

  • Only a single RDS instance should have been created; instead, creation intermittently fails with the DBInstanceAlreadyExists error below.

Relevant Error Output Snippet

- lastTransitionTime: "2024-06-06T17:44:46Z"
  message: "async create failed: failed to create the resource: [{0 creating RDS
    DB Instance (ci-aseks6062802): DBInstanceAlreadyExists: DB instance already
    exists\n\tstatus code: 400, request id: 1be9b4f8-31de-42e5-89f5-99cdd8885597
    \ []}]"


Crossplane Version

1.16.1

Provider Version

1.2.0

Kubernetes Version

1.29

Kubernetes Distribution

Kind

Additional Info

No response
@mjnovice added the bug and needs:triage labels on Jun 6, 2024
@turkenf
Collaborator

turkenf commented Jun 7, 2024

@mjnovice, could you please try changing the field identifier: ci-aseks6062802?

@mjnovice
Author

mjnovice commented Jun 7, 2024

@turkenf that helps, but that also creates another RDS instance, no? Is there a known cause for this issue?

@turkenf
Collaborator

turkenf commented Jun 7, 2024

I could not reproduce the bug. When you get this error, are you sure that there is no other instance with the identifier ci-aseks6062802?

@mjnovice
Author

mjnovice commented Jun 7, 2024

@turkenf nope. There was none before applying the CR. This happens intermittently for me.

@PatTheSilent

PatTheSilent commented Jun 19, 2024

I've got a similar problem. I'm trying to create an RDS instance from another instance using restoreToPointInTime:

apiVersion: rds.aws.upbound.io/v1beta2
kind: Instance
metadata:
  annotations:
    crossplane.io/external-create-failed: '2024-06-19T10:42:51Z'
    crossplane.io/external-create-pending: '2024-06-19T10:42:51Z'
    crossplane.io/external-create-succeeded: '2024-06-19T09:18:22Z'
  creationTimestamp: '2024-06-19T09:18:21Z'
  finalizers:
    - finalizer.managedresource.crossplane.io
  generation: 2
  name: obfuscated-c75kw-mssql-db
  resourceVersion: '1261333276'
  uid: e04aadb8-ed95-4ec9-b09e-940f6bc4ade2
  selfLink: >-
    /apis/rds.aws.upbound.io/v1beta2/instances/obfuscated-c75kw-mssql-db
status:
  atProvider: {}
  conditions:
    - lastTransitionTime: '2024-06-19T09:18:22Z'
      reason: Creating
      status: 'False'
      type: Ready
    - lastTransitionTime: '2024-06-19T10:42:51Z'
      message: "create failed: async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d  []}]"
      reason: ReconcileError
      status: 'False'
      type: Synced
    - lastTransitionTime: '2024-06-19T10:42:51Z'
      message: "async create failed: failed to create the resource: [{0 creating RDS DB Instance (restore to point-in-time) (obfuscated-c75kw-mssql-db): DBInstanceAlreadyExists: DB instance already exists\n\tstatus code: 400, request id: 5edaec05-73f3-4a22-8a3a-cf6ebe39610d  []}]"
      reason: AsyncCreateFailure
      status: 'False'
      type: LastAsyncOperation
spec:
  deletionPolicy: Delete
  forProvider:
    autoGeneratePassword: true
    autoMinorVersionUpgrade: true
    backupRetentionPeriod: 0
    caCertIdentifier: rds-ca-rsa4096-g1
    dbSubnetGroupName: prod1
    deleteAutomatedBackups: true
    engine: sqlserver-ee
    engineVersion: ''
    identifier: obfuscated-c75kw-mssql-db
    instanceClass: db.r5.xlarge
    multiAz: false
    optionGroupName: prod1-obfuscated
    region: eu-west-1
    skipFinalSnapshot: true
    tags:
      crossplane-kind: instance.rds.aws.upbound.io
      crossplane-name: obfuscated-c75kw-mssql-db
      crossplane-providerconfig: default
    vpcSecurityGroupIds:
      - sg-xxxxxxxxxxxxxxx
  initProvider:
    restoreToPointInTime:
      - sourceDbInstanceIdentifier: prod1-obfuscated-mssql-db
        useLatestRestorableTime: true
  managementPolicies:
    - '*'
  providerConfigRef:
    name: default
  writeConnectionSecretToRef:
    name: obfuscated-c75kw-mssql-db
    namespace: production

This worked correctly when I was using v1beta1 and the RDS provider at version 0.47.0, on a K8s cluster running 1.26.

Edit: I've managed to get the controller to work correctly by updating the managed resource with the crossplane.io/external-name annotation, setting it to the same value as metadata.name and spec.forProvider.identifier. @mjnovice this might help you too.

Edit2: After the controller decided that there's actually no problem with the Instance resource, it also updated the crossplane.io/external-name annotation to something more like the ID I'd expect: crossplane.io/external-name=db-MVTTMGI6KFQ66F5X2DWJH7WNWQ
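For reference, the manual workaround described above amounts to a patch like the following on the managed resource (the name here is taken from this example; adjust to your own resource):

```yaml
metadata:
  name: obfuscated-c75kw-mssql-db
  annotations:
    # Point the provider at the existing instance so it imports it
    # instead of attempting another create.
    crossplane.io/external-name: obfuscated-c75kw-mssql-db
```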

@mbbush
Collaborator

mbbush commented Jun 20, 2024

The terraform provider version 5.0 upgrade contained a breaking change to this resource, which we mitigated, but may not have been able to fully restore the terraform provider 4.x behavior.

The details of the change are both important and confusing. In terraform provider aws v4.x, the terraform id was the user-defined 'identifier' field. In v5.0 they changed it to instead use the aws-generated database resource id. I think this code comment from the terraform aws provider does a pretty good job of explaining it:

// NOTE ON "ID", "IDENTIFIER":
// ID is overloaded and potentially confusing. Hopefully this clears it up.
// * ID, as in d.Id(), d.SetId(), is:
//    - the same as AWS calls the "dbi-resource-id" a/k/a "database instance resource ID"
//    - unchangeable/immutable
//    - called either "id" or "resource_id" in schema/state (previously was only "resource_id")
// * "identifier" is:
//    - user-defined identifier which AWS calls "identifier"
//    - can be updated
//    - called "identifier" in the schema/state (previously was also "id")

For an upjet-based crossplane provider, the format of the terraform id is very important. Ideally, we can construct it from the metadata.name of the resource, or from its spec. For resources where we have configured the provider to do this, when a managed resource is created, the provider can then (more or less) run a terraform import to determine if there's an existing resource with this id. If there is, the provider assumes control of it, and makes the necessary update calls to make it match the desired spec. If not, it creates a new resource.

In terraform provider aws v4.x, which used the identifier field (which is user-defined) as the terraform id, it was easy to construct the terraform id from the spec of the Instance.rds managed resource, and we did so. But in terraform provider v5.0, they changed the provider to instead use the database resource id (which is generated by AWS) as the terraform id. This id is not in the spec of the managed resource, nor is it possible/reasonable to expect the user to always provide it. So instead we switched to use the IdentifierFromProvider external name configuration for this resource.

IdentifierFromProvider should still work just fine. We use it for a ton of resource kinds. It's close to if not the most common external name configuration in provider-upjet-aws. But it does mean that versions 0.x of this crossplane provider could reliably detect (by making an api call) whether a resource already existed before trying to create one, and versions 1.x of this provider cannot.
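Concretely, with IdentifierFromProvider the external name is the AWS-generated dbi-resource-id rather than anything the user can put in the spec, so after a successful create or import the managed resource carries an annotation along these lines (value copied from an earlier comment in this thread):

```yaml
metadata:
  annotations:
    # AWS-generated database resource ID; not derivable from the spec
    crossplane.io/external-name: db-MVTTMGI6KFQ66F5X2DWJH7WNWQ
```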

I have not yet tried to reproduce this issue, but my working theory is that the sequence of events goes like this:

  1. The provider creates a new RDS instance with the user-supplied identifier my-db-identifier.
  2. Somehow, that instance gets created (or reaches a state that produces the reported DBInstanceAlreadyExists error without the instance actually being created), and the provider loses track of it.
  3. The provider, thinking there is no existing instance, tries to create it again.
  4. AWS returns the DBInstanceAlreadyExists error.

If my theory is correct (which I hope some answers below can help confirm or refute), then the remaining task is to figure out what's happening in step 2, and make it not happen. If we're lucky, it could be as simple as a timeout configuration needing to be tweaked, but it may end up being much more complicated.

Some questions for those of you who have experienced this issue:

  • When it happens, does the database instance with the identifier set in your spec.forProvider actually exist in RDS? Is there any noticeable difference from the state in the managed resource? Does it have the crossplane-added tags?
  • How much time elapses between creating the Instance managed resource and getting the error?
  • Some have stated that this is an intermittent issue. Roughly how often does it happen? Almost but not quite all the time? Rarely? Somewhere in between?
  • If you can run the provider-aws-rds pod with the --debug argument (using either a ControllerConfig or DeploymentRuntimeConfig), and can provide debug logs reproducing the issue, that would be very helpful.
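For anyone wanting to capture those logs, a DeploymentRuntimeConfig along the following lines enables the --debug flag. This is a sketch: the name debug-config is a placeholder, and you still need to reference it from your Provider via spec.runtimeConfigRef.

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: debug-config
spec:
  deploymentTemplate:
    spec:
      template:
        spec:
          containers:
            - name: package-runtime  # the provider's container
              args:
                - --debug
```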

@PatTheSilent

I am attaching debug logs from the RDS provider that mention this instance.

Explore-logs-2024-06-20 12_08_32.txt

To further add to that: from the events, it seems that problems start appearing after 30 minutes, and the point-in-time restore takes a fair bit longer than 30 minutes for that database. The instance itself is configured correctly; all changes specified in the manifest are there. Right now it happens each time I try to recreate that database, which is around once per day.

@PatTheSilent

Ok, one thing that actually is misconfigured compared with the manifest is the backup retention window. In the manifest I set backupRetentionPeriod to 0, but the instance has 7. It couldn't be set to 0 because, apparently, you can't disable backups when an instance is restored. I honestly don't know why that'd be the case, but I changed the retention to 1 and it passed, and now the provider is happy with the resource. I wonder if the provider's inability to apply that setting played some part in this. Will test that out once I get the chance.

@mbbush
Collaborator

mbbush commented Jun 20, 2024

Thanks for the logs @PatTheSilent, they were really helpful.

What I see confirms my working theory above. It looks like the request to create the database instance is timing out after about 40 minutes, and then the provider repeatedly makes further Create API calls, which all fail quickly due to the duplicate identifier, are retried immediately, fail again, and so on. The terraform provider configures a 40-minute timeout for Create requests for this resource. It sounds like in your case that's not long enough.

This is also consistent with your finding about backupRetentionPeriod, as the terraform aws provider applies that parameter after the initial db creation call, when restoring a db from a snapshot (because the particular RDS api command for restoring a db from a snapshot doesn't support that parameter). If the Create timed out, that would never get called.

I'm not sure, but I don't think there's a user-facing way for you to adjust the timeout, and it can only be adjusted in the provider source code. That might be a good enhancement to add to upjet. It seems like this is a resource that could potentially take a very long time to create, especially if you're restoring from a large snapshot. Can you tell how long it's taking this RDS instance to actually be created?

There's an additional problem, which is that this situation should result in the provider adding the external create incomplete annotation, raising an error, and stopping further reconciliation of this resource, because it's lost track of the resource it created. That's not happening, and I want to figure out why.

@PatTheSilent

@mbbush from what I could see, it took over an hour for it to become available. It's an MS SQL database with a few TB of data. And yeah, it would be great if we could configure timeouts like in Terraform, maybe via an annotation.
Let me know if I can help you further in any way.

@mergenci
Collaborator

Hey @PatTheSilent, I haven't read all of the past discussion above, but based on my impression from the issue description and on what we discussed off-channel with @mbbush, here's what I believe happens:

  1. We call Terraform.
  2. Terraform creates the RDS instance and receives the instance's ID.
  3. Terraform doesn't return right away, but waits for the instance to become operational by polling its state.
  4. An error occurs (operation times out, connection breaks, authentication expires, etc.) and Terraform receives an error in its next poll.
  5. Terraform returns us the error.

Here's the crucial part: upon encountering an error, Terraform doesn't report the instance ID that it received in step 2. If it did, we could have set the external-name annotation. When the external-name annotation is not set, we try to create the resource again in the next reconciliation loop. Rather fortunately, trying to recreate an existing RDS instance fails, so the user is at least aware of the problem. Recreating other types of existing resources (probably VPCs) may succeed, so we may end up creating and leaking multiple resources in case of the errors described above, without the user being aware.

The solution is to have Terraform write the instance ID to the state even in case of errors. Indeed, doing so is Terraform's policy. From their Resource Contribution Guidelines:

Implements Immediate Resource ID Set During Create: Immediately after calling the API creation function, the resource ID should be set with d.SetId() before other API operations or returning the Read function.
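The contract quoted above can be illustrated with a minimal, self-contained sketch. Note that mockResourceData and createDBInstance here are invented stand-ins for Terraform's *schema.ResourceData and the provider's create function, not the actual upstream code:

```go
package main

import (
	"errors"
	"fmt"
)

// mockResourceData stands in for Terraform's *schema.ResourceData,
// which exposes SetId() to record the resource ID in state.
type mockResourceData struct{ id string }

func (d *mockResourceData) SetId(id string) { d.id = id }

// createDBInstance sketches the fixed flow: the ID returned by the
// create API call is written to state *before* polling for readiness,
// so an error while waiting no longer loses the ID.
func createDBInstance(d *mockResourceData, waitFails bool) error {
	awsID := "db-MOCKRESOURCEID" // pretend this came from the create API response
	d.SetId(awsID)               // immediate ID set, per the contribution guideline
	if waitFails {
		// Even though the wait failed, d.id is already recorded,
		// so the next reconcile can import instead of re-creating.
		return errors.New("timeout waiting for instance to become available")
	}
	return nil
}

func main() {
	d := &mockResourceData{}
	err := createDBInstance(d, true)
	fmt.Println(d.id, err != nil) // prints the recorded ID and true
}
```

Because the ID survives the failed wait, the next reconciliation can detect the existing instance instead of issuing another create and leaking a duplicate.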

I opened an upstream issue and a fix last week:

  1. [Bug]: RDS instance shouldn't leak on error during create hashicorp/terraform-provider-aws#38251
  2. r/aws_db_instance: Fix resource leak on error during creation. hashicorp/terraform-provider-aws#38252

The fix has been merged yesterday. Next time we update our Terraform provider dependency, the fix should take effect in our provider.

@PatTheSilent

@mergenci Awesome stuff! Thank you very much for your work!

@karlderkaefer

karlderkaefer commented Aug 22, 2024

I faced the same issue. It is reproducible, but only for a specific instanceIdentifier; other claims we tried in the same configuration were created just fine. I can share some details on these questions:

When it happens, does the database instance with the identifier set in your spec.forProvider actually exist in RDS? Is there any noticeable difference from the state in the managed resource? Does it have the crossplane-added tags?

Yes, the identifier exists on the managed resource, and the instance name in the AWS console was the same. The crossplane tags (crossplane-kind, crossplane-name, crossplane-providerconfig) are set on the actual RDS instance. We see the following diff after the session token expired:

  • allocated_storage: Old="", New="50"
  • apply_immediately: Old="", New="true"
  • backup_retention_period: Old="", New="14"
  • backup_window: Old="", New="09:46-10:16"
  • db_name: Old="", New="in***"
  • engine: Old="", New="postgres"
  • engine_version: Old="", New="16.2"
  • instance_class: Old="", New="db.t4g.large"

How much time elapses between creating the Instance managed resource and getting the error?
Some have stated that this is an intermittent issue. Roughly how often does it happen? Almost but not quite all the time? Rarely? Somewhere in between?

It's reproducible, and it takes about 2 minutes from the initial successful create event to the error state:

Aug 21, 2024 @ 17:15:50.427	2024-08-21T15:15:49Z	DEBUG	provider-aws	Successfully requested creation of external resource	{"controller": "managed/rds.aws.upbound.io/v1beta2, kind=instance", "request": {"name":"redacted-db-rmhn9-pftjc"}, "uid": "596e470f-785c-4c73-ae10-2dbce233e424", "version": "54911722", "external-name": "", "external-name": ""}

Aug 21, 2024 @ 17:18:02.430	2024-08-21T15:18:01Z	DEBUG	provider-aws	Async create ended.	{"trackerUID": "596e470f-785c-4c73-ae10-2dbce233e424", "resourceName": "redacted-db-rmhn9-pftjc", "gvk": "rds.aws.upbound.io/v1beta2, Kind=Instance", "error": "async create failed: failed to create the resource: [{0 waiting for RDS DB Instance (redacted-db) create: ExpiredToken: The security token included in the request is expired\n\tstatus code: 403, request id: 42eeb4e2-13f3-4a75-92cf-63600cd87a35  []}]", "tfID": ""}

If you can run the provider-aws-rds pod with the --debug argument (using either a ControllerConfig or DeploymentRuntimeConfig), and can provide debug logs reproducing the issue, that would be very helpful.

tmp-logs.txt

Also here are cloudtrail request logs from first successful create event and first failing event
tmp-success-redacted.json
tmp-failing-reducted.json

Maybe worth mentioning: we set the identifier in our composition:

- type: FromCompositeFieldPath
  fromFieldPath: metadata.labels[crossplane.io/claim-name]
  toFieldPath: spec.forProvider.identifier
- type: FromCompositeFieldPath
  fromFieldPath: spec.resourceConfig.instanceIdentifier
  toFieldPath: spec.forProvider.identifier

@mergenci
Collaborator

mergenci commented Aug 22, 2024

@karlderkaefer, thank you for your well-prepared report. Which provider version are you using?

The upstream Terraform fix that addresses the issue was consumed in #1406, which was released in v1.10.0 — latest provider version is v1.11.0 by the way.

RDS creation errors are still possible in v1.10.0 or newer, simply because creation takes very long and something may time out, as the credentials did in your case. But such errors should now be recovered from automatically in the next reconciliation loop: because we now receive the RDS ID even when something times out, we can detect that the instance exists. In other words, we detect that the creation we thought had failed in fact succeeded.

If your provider cannot detect that the instance was created successfully, and is therefore trying to recreate it and failing, that means there is another edge case we have yet to cover.

@karlderkaefer

@mergenci, we are currently using v1.6.0. We are going to try the latest version, and I will report back if the upgrade resolves the issue.

@karlderkaefer

@mergenci I can confirm the issue is now fixed with version v1.11.0, although I had to recreate the claim. Thanks for the fix!

@mergenci
Collaborator

I think we have enough evidence that this issue has been fixed, so I'm closing it. Feel free to reopen or, preferably, to open a new one. Thank you everyone for your contributions 🙏
