
Graceful controller pod termination #158

Open
project-administrator opened this issue May 17, 2023 · 6 comments
Labels: bug, needs:triage

Comments

@project-administrator

project-administrator commented May 17, 2023

What happened?

If the provider-terraform pod is terminated in the middle of a terraform apply operation, terraform exits with the error "command terminated with exit code 137" (i.e. it is killed with SIGKILL) and does not save its state.

This is a problem for us because some resources created by terraform end up missing from the state. The next terraform run fails with the error "cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed", and even after you remove the annotation, you might end up with errors stating that some of the terraform-created resources already exist (because they were not persisted to the state when the terraform pod was terminated).
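For reference, clearing that annotation on the affected managed resource looks roughly like this (the resource kind and name are illustrative; the trailing dash tells kubectl to remove the annotation):

    kubectl annotate workspace example-workspace crossplane.io/external-create-pending-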

Also, the terraform state lock persists after the provider-terraform pod is restarted, but that's a smaller issue because generally you can just run terraform force-unlock to remove the stale lock.

I'd like to ask what the proper way would be to either ensure that terraform does not get terminated in the middle of a run, or make sure that its state is saved if it is terminated. For example, during a local terraform run you can hit CTRL+C and wait anywhere from 5 to 100 seconds for it to save its state, but that does not seem to happen when the provider-terraform pod is terminated.

Can the provider send a termination signal to the terraform process and wait for some grace period instead of killing it immediately?
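To illustrate what I mean, here is a rough conceptual sketch (a shell wrapper, not a proposal for the provider's actual implementation) of forwarding the signal and waiting out a grace period before a hard kill:

    #!/usr/bin/env bash
    # Sketch only: run terraform apply as a child process; if this wrapper
    # receives SIGTERM, forward it so terraform can persist its state, and
    # send SIGKILL only after a 90-second grace period.
    terraform apply -auto-approve -input=false &
    tf_pid=$!

    graceful_stop() {
      kill -TERM "$tf_pid" 2>/dev/null                   # let terraform save its state
      { sleep 90; kill -KILL "$tf_pid" 2>/dev/null; } &  # hard stop only after the grace period
    }
    trap graceful_stop TERM INT

    # A trapped signal interrupts `wait`, so keep waiting until terraform has
    # actually exited, then propagate its exit code.
    status=0
    while kill -0 "$tf_pid" 2>/dev/null; do
      wait "$tf_pid"
      status=$?
    done
    exit "$status"

Whatever grace period is chosen would also have to fit within the pod's terminationGracePeriodSeconds, otherwise the kubelet still SIGKILLs the container (which is where exit code 137 comes from).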

How can we reproduce it?

Terminate the provider-terraform pod during the terraform apply operation.
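For example (namespace and pod name are placeholders):

    # Delete the provider pod while a terraform apply is in progress for some Workspace.
    kubectl -n crossplane-system delete pod <provider-terraform-pod>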

What environment did it happen in?

  • Crossplane Version: v1.12.1
  • Provider Version: v0.7.0
  • Kubernetes Version: v1.26.2
  • Kubernetes Distribution: EKS
project-administrator added the bug and needs:triage labels on May 17, 2023
project-administrator changed the title from "Handle pod termination: terraform resources not saved to the state" to "Graceful controller pod termination" on Jun 6, 2023
@geowalrus4gh

We are also affected by this.

@geowalrus4gh

    Also, terraform state lock persists after the provider-terraform pod restarted, but that's a smaller issue because generally you can just run the terraform force-unlock to remove the stale lock.

Any idea how to run terraform force-unlock inside the provider pod?
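Something like the following is what I have in mind (the pod name, workspace directory, and lock ID are placeholders, and the per-Workspace directory layout inside the pod is an assumption on my part):

    # Exec into the provider pod and release the stale state lock in place.
    kubectl -n crossplane-system exec -it <provider-terraform-pod> -- \
      sh -c 'cd /tf/<workspace-uid> && terraform force-unlock -force <LOCK_ID>'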

@project-administrator
Author

project-administrator commented Aug 13, 2023 via email

@geowalrus4gh

geowalrus4gh commented Aug 13, 2023

I executed plan and force-unlock. Both worked, but after that I'm getting the error below in the Workspace status. My scenario was EKS creation.

    "update failed: cannot apply Terraform configuration: Terraform
    encountered an error. Summary: creating EKS Cluster :
    ResourceInUseException: Cluster already exists with name:"

When the lock is removed, it seems terraform starts a fresh apply. How should this be tackled?

@project-administrator
Author

That means terraform failed to save its state during the apply operation.
What we typically do is delete the conflicting resource manually (the AWS resource, in this example), and after that let provider-terraform re-create it.
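For the EKS case above, the manual cleanup could look something like this (the cluster name is a placeholder; make sure nothing else depends on the cluster before deleting it):

    # Delete the out-of-band EKS cluster so provider-terraform can re-create it cleanly.
    aws eks delete-cluster --name <cluster-name>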

@bobh66
Collaborator

bobh66 commented Aug 25, 2023

Sorry this issue slipped by - there is some discussion of the problem in crossplane-contrib/provider-terraform#46. It's a complex problem that we haven't come up with a good solution for yet. We can send the termination signal to the terraform CLI processes, but there is no way to know how long to wait for them to exit "cleanly", and in some cases, like a worker node failure, it won't matter anyway since everything just comes crashing down. I tried some manual tests using the CLI and it can take quite a while to return for complex terraform modules.

It would definitely be nice to find a way to "drain" the pod before terminating it for restarts/upgrades/etc. For example, if we could pause reconciliation on all existing Workspaces and wait for the last CLI command to finish, it would be safe to restart the pod and un-pause all of the Workspaces.
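As a manual approximation of that drain today, something along these lines should work, assuming the provider is built on a crossplane-runtime version that honors the crossplane.io/paused annotation:

    # Pause reconciliation of all Workspaces, wait for in-flight terraform runs
    # to finish, restart/upgrade the provider pod, then resume reconciliation.
    kubectl annotate workspaces --all crossplane.io/paused=true --overwrite
    # ... wait for any running terraform processes in the provider pod to exit ...
    kubectl annotate workspaces --all crossplane.io/paused=false --overwrite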
