Graceful controller pod termination #158
Comments
We are also affected by this.
Any idea how to run this `terraform force-unlock` inside the provider pod?
Do the `kubectl exec` into the pod, then change directory to `/tf/<UUID>` (it matches the UUID of your Workspace Kubernetes resource). Then just run `terraform plan` and `terraform force-unlock`.
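For anyone following along, the sequence looks roughly like this (namespace, pod name, Workspace name, and lock ID below are placeholders, not values from this thread):

```bash
# Look up the UID of the Workspace resource; it is the directory name under /tf
kubectl get workspace <workspace-name> -o jsonpath='{.metadata.uid}'

# Exec into the provider pod
kubectl -n crossplane-system exec -it <provider-terraform-pod> -- sh

# Inside the pod:
cd /tf/<workspace-uid>
terraform plan                     # confirm this is the right workspace
terraform force-unlock <LOCK_ID>   # LOCK_ID is printed in the state-lock error message
```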
I executed `terraform plan` and `terraform force-unlock`. Both worked, but after that I am getting the error below in the Workspace status. My scenario was EKS creation.
When the lock is removed, it seems to take a fresh apply. How do I tackle this?
That means that terraform failed to save the state during the apply operation.
Sorry this issue slipped by - there is some discussion of the problem here - crossplane-contrib/provider-terraform#46 - it's a complex problem that we haven't come up with a good solution for yet. We can send the termination signal to the terraform CLI command processes, but there is no way to know how long to wait for the processes to exit "cleanly", and in some cases, like worker node failure, it won't matter anyway since everything just comes crashing down. I tried some manual tests using the CLI and it can take quite a while to return for complex terraform modules. It would definitely be nice to find a way to "drain" the pod before terminating it for restarts/upgrades/etc. For example, if we could pause reconciliation on all existing Workspaces and wait for the last CLI command to finish, it would be safe to restart the pod and un-pause all of the Workspaces.
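For illustration, a manual drain along those lines might look like the sketch below. It assumes the provider's managed reconciler honors the standard `crossplane.io/paused` annotation, which may not be true for the release discussed here; the namespace and pod name are placeholders:

```bash
# Pause reconciliation of all Workspaces (assumes the crossplane.io/paused
# annotation is honored by this provider version)
kubectl annotate workspaces --all crossplane.io/paused=true

# Wait for any in-flight terraform CLI command to finish, then restart the pod
kubectl -n crossplane-system delete pod <provider-terraform-pod>

# Resume reconciliation by removing the annotation
kubectl annotate workspaces --all crossplane.io/paused-
```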
What happened?
If the provider-terraform pod is terminated in the middle of a `terraform apply` operation, terraform exits with the error "command terminated with exit code 137" and does not save its state. This becomes a problem for us because some resources created by terraform are not present in the state. The next terraform run fails with the error "cannot determine creation result - remove the crossplane.io/external-create-pending annotation if it is safe to proceed", but even after you remove the annotation, you might end up with errors stating that some of the terraform-created resources already exist (because they were not persisted to the state properly when the terraform pod was terminated).
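For completeness, clearing that annotation on the affected Workspace looks roughly like this (the Workspace name is a placeholder):

```bash
# Remove the crossplane.io/external-create-pending annotation so the
# reconciler can proceed. Only do this once you are sure it is safe.
kubectl annotate workspace <workspace-name> crossplane.io/external-create-pending-
```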
Also, the terraform state lock persists after the provider-terraform pod is restarted, but that's a smaller issue because generally you can just run `terraform force-unlock` to remove the stale lock.

I'd like to ask what the proper way would be to ensure that terraform either does not get terminated in the middle of a run, or that its state is saved in case it is terminated. For example, during a local terraform run you can hit CTRL+C and wait somewhere between 5 and 100 seconds for it to save its state. But that does not seem to be the case when the provider-terraform pod is terminated.
Can the provider send a termination signal to the terraform process and wait for some grace period instead of killing it immediately? ..
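Until something like that exists, one manual workaround sketch (not a provider feature, and it assumes `pgrep` and `sh` are available inside the provider image; pod name and namespace are placeholders) would be to interrupt terraform yourself before deleting the pod, mimicking the local CTRL+C behaviour:

```bash
# Send SIGINT to the running terraform process so it can write its state
kubectl -n crossplane-system exec <provider-terraform-pod> -- \
  sh -c 'kill -INT "$(pgrep -x terraform)"'

# Give terraform time to finish persisting state, then restart the pod
sleep 60
kubectl -n crossplane-system delete pod <provider-terraform-pod>
```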
How can we reproduce it?
Terminate the provider-terraform pod during a `terraform apply` operation.
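A rough recipe (the namespace and label selector are assumptions about how the provider pod is deployed by the Crossplane package manager):

```bash
# 1. Create a Workspace whose apply takes a while (e.g. an EKS module).
# 2. While `terraform apply` is still running, kill the provider pod:
kubectl -n crossplane-system delete pod \
  -l pkg.crossplane.io/provider=provider-terraform
# The Workspace then reports "command terminated with exit code 137"
# and the next reconcile hits the stale state lock.
```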
What environment did it happen in?