
Graceful controller pod termination #158

Open
project-administrator opened this issue May 17, 2023 · 6 comments
Labels: bug, needs:triage

Comments

@project-administrator

project-administrator commented May 17, 2023

What happened?

If the provider-terraform pod is terminated in the middle of a terraform apply operation, terraform exits with the error "command terminated with exit code 137" (i.e. it is killed with SIGKILL) and does not save its state.

This is a problem for us because some resources created by terraform end up missing from the state. The next terraform run fails with the error "cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed", and even after you remove the annotation, you might end up with errors stating that some of the terraform-created resources already exist (because they were not persisted to the state when the terraform pod was terminated).
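For reference, clearing that annotation on the affected managed resource looks roughly like this (the resource kind and name are illustrative; the trailing dash tells kubectl to remove the annotation):

    kubectl annotate workspace example-workspace crossplane.io/external-create-pending-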

Also, the terraform state lock persists after the provider-terraform pod is restarted, but that's a smaller issue because generally you can just run terraform force-unlock to remove the stale lock.

I'd like to ask what the proper way would be to either ensure that terraform does not get terminated in the middle of a run, or make sure that its state is saved if it is terminated. For example, during a local terraform run you can hit CTRL+C and wait anywhere from 5 to 100 seconds for it to save its state, but that does not seem to happen when the provider-terraform pod is terminated.

Can the provider send a termination signal to the terraform process and wait for some grace period instead of killing it immediately?
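To illustrate what I mean, here is a rough conceptual sketch (a shell wrapper, not a proposal for the provider's actual implementation) of forwarding the signal and waiting out a grace period before a hard kill:

    #!/usr/bin/env bash
    # Sketch only: run terraform apply as a child process; if this wrapper
    # receives SIGTERM, forward it so terraform can persist its state, and
    # send SIGKILL only after a 90-second grace period.
    terraform apply -auto-approve -input=false &
    tf_pid=$!

    graceful_stop() {
      kill -TERM "$tf_pid" 2>/dev/null                   # let terraform save its state
      { sleep 90; kill -KILL "$tf_pid" 2>/dev/null; } &  # hard stop only after the grace period
    }
    trap graceful_stop TERM INT

    # A trapped signal interrupts `wait`, so keep waiting until terraform has
    # actually exited, then propagate its exit code.
    status=0
    while kill -0 "$tf_pid" 2>/dev/null; do
      wait "$tf_pid"
      status=$?
    done
    exit "$status"

Whatever grace period is chosen would also have to fit within the pod's terminationGracePeriodSeconds, otherwise the kubelet still SIGKILLs the container (which is where exit code 137 comes from).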

How can we reproduce it?

Terminate the provider-terraform pod during the terraform apply operation.
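For example (namespace and pod name are placeholders):

    # Delete the provider pod while a terraform apply is in progress for some Workspace.
    kubectl -n crossplane-system delete pod <provider-terraform-pod>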

What environment did it happen in?

  • Crossplane Version: v1.12.1
  • Provider Version: v0.7.0
  • Kubernetes Version: v1.26.2
  • Kubernetes Distribution: EKS
project-administrator added the bug and needs:triage labels on May 17, 2023
project-administrator changed the title from "Handle pod termination: terraform resources not saved to the state" to "Graceful controller pod termination" on Jun 6, 2023
@geowalrus4gh

We are also affected by this.

@geowalrus4gh

    Also, terraform state lock persists after the provider-terraform pod restarted, but that's a smaller issue because generally you can just run the terraform force-unlock to remove the stale lock.

Any idea how to run terraform force-unlock inside the provider pod?
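Something like the following is what I have in mind (the pod name, workspace directory, and lock ID are placeholders, and the per-Workspace directory layout inside the pod is an assumption on my part):

    # Exec into the provider pod and release the stale state lock in place.
    kubectl -n crossplane-system exec -it <provider-terraform-pod> -- \
      sh -c 'cd /tf/<workspace-uid> && terraform force-unlock -force <LOCK_ID>'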

@project-administrator
Author

project-administrator commented Aug 13, 2023 via email

@geowalrus4gh

geowalrus4gh commented Aug 13, 2023

I executed plan and force-unlock. Both worked, but after that I'm getting the error below in the Workspace status. My scenario was EKS creation.

    "update failed: cannot apply Terraform configuration: Terraform
    encountered an error. Summary: creating EKS Cluster :
    ResourceInUseException: Cluster already exists with name:"

When the lock is removed, it seems terraform starts a fresh apply. How should this be tackled?

@project-administrator
Author

That means terraform failed to save its state during the apply operation.
What we typically do is delete the conflicting resource manually (the AWS resource, in this example), and after that let provider-terraform re-create it.
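For the EKS case above, the manual cleanup could look something like this (the cluster name is a placeholder; make sure nothing else depends on the cluster before deleting it):

    # Delete the out-of-band EKS cluster so provider-terraform can re-create it cleanly.
    aws eks delete-cluster --name <cluster-name>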

@bobh66
Collaborator

bobh66 commented Aug 25, 2023

Sorry this issue slipped by - there is some discussion of the problem in crossplane-contrib/provider-terraform#46. It's a complex problem that we haven't come up with a good solution for yet. We can send the termination signal to the terraform CLI processes, but there is no way to know how long to wait for them to exit "cleanly", and in some cases, like a worker node failure, it won't matter anyway since everything just comes crashing down. I tried some manual tests using the CLI and it can take quite a while to return for complex terraform modules.

It would definitely be nice to find a way to "drain" the pod before terminating it for restarts/upgrades/etc. For example, if we could pause reconciliation on all existing Workspaces and wait for the last CLI command to finish, it would be safe to restart the pod and un-pause all of the Workspaces.
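As a manual approximation of that drain today, something along these lines should work, assuming the provider is built on a crossplane-runtime version that honors the crossplane.io/paused annotation:

    # Pause reconciliation of all Workspaces, wait for in-flight terraform runs
    # to finish, restart/upgrade the provider pod, then resume reconciliation.
    kubectl annotate workspaces --all crossplane.io/paused=true --overwrite
    # ... wait for any running terraform processes in the provider pod to exit ...
    kubectl annotate workspaces --all crossplane.io/paused=false --overwrite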
