remote state is not incrementally updated #14487
Comments
@mitchellh: Could you please comment on whether this is Terraform design/implementation behavior or S3 behavior? Thanks.
This is a big problem for us with our recent upgrade to 0.9. We run end-to-end tests creating a full TF setup from scratch, and the initial apply can take quite a while and create hundreds of resources. If the initial apply fails (e.g., if it exceeds the timeout set by our end-to-end test suite), the tfstate is essentially empty and we've leaked a huge number of resources.
BTW, I would probably describe this as "apply only persists remote state once at the end". I don't think it's S3-specific. Looking at the code, there's a sense in which this is not a regression from pre-0.9: the older code also only called PersistState once at the end. But it does feel different to me. I think that's because the pre-0.9 code still updated the local file periodically during a long apply, which could at least be used for a follow-up destroy on the same machine. The 0.9 code does drop a terraform.tfstate.backup while using remote state, but it appears to not get updated as resources are created.
I'd be really interested in fixing this if I can get some advice about how to go about it. Based on the past few issues I've filed, it seems like @jbardin is the person to talk to? If I'm understanding the code correctly, backend_apply.go uses a StateHook to continuously update its state (via WriteState) as the apply progresses. I would argue that StateHook should also call PersistState. Looking at the implementations of State, most of them have no-ops for PersistState, so this wouldn't make much of a difference for them; only two have a non-trivial PersistState.
My one concern would be error handling: should a failure to persist from the hook abort the whole apply, or just be logged? Anyway, let me know if this is a good change and I'll make it.
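For illustration, here is a minimal sketch of that idea in Go, using simplified stand-in types rather than Terraform's real State and hook interfaces (the names and signatures below are assumptions made up for the example, not the actual code):

```go
package main

import (
	"fmt"
	"log"
)

// StateWriter is a simplified stand-in for the state manager discussed above:
// WriteState records the new state in memory (or on local disk), while
// PersistState pushes it to the remote backend (e.g. S3).
type StateWriter interface {
	WriteState(state map[string]string) error
	PersistState() error
}

// stateHook mirrors the idea of a hook invoked after every state change
// during an apply.
type stateHook struct {
	state StateWriter
}

// PostStateUpdate is called each time a resource finishes creating or
// updating. The proposal is to persist to the remote backend here as well,
// instead of only once at the very end of the apply.
func (h *stateHook) PostStateUpdate(newState map[string]string) error {
	if err := h.state.WriteState(newState); err != nil {
		return err
	}
	// Proposed new behavior: also push to the remote backend. A persist
	// failure probably should not abort the whole apply, so it is only logged.
	if err := h.state.PersistState(); err != nil {
		log.Printf("[WARN] failed to persist state to remote backend: %s", err)
	}
	return nil
}

// fakeRemote is a toy StateWriter so the example runs on its own.
type fakeRemote struct{ current map[string]string }

func (f *fakeRemote) WriteState(s map[string]string) error { f.current = s; return nil }
func (f *fakeRemote) PersistState() error {
	fmt.Printf("persisted %d resources to remote backend\n", len(f.current))
	return nil
}

func main() {
	h := &stateHook{state: &fakeRemote{}}
	// Simulate three resources finishing one by one during an apply.
	state := map[string]string{}
	for _, id := range []string{"aws_vpc.main", "aws_subnet.a", "aws_instance.web"} {
		state[id] = "created"
		_ = h.PostStateUpdate(state)
	}
}
```

With something like this, a crash part-way through an apply would leave the remote state containing every resource that had finished up to that point, at the cost of one remote write per resource.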
Actually, I might make an even stronger argument: we should just get rid of the WriteState/PersistState separation and move remote persistence into WriteState. Looking at the codebase, every single call to WriteState is immediately followed by PersistState except the one I want to change and this one: terraform/command/meta_backend.go, line 1207 (at commit 5c85c9c).
Is it possible that this is a bug and this other case should call PersistState too? I don't really understand the Hash stuff. If I understand properly, a PersistState here might be redundant, but it's also a relatively edge-case piece of code that shouldn't run every time you run a TF command, so it might be better to simplify the API even if it causes one extra network round trip in this rare case.
Hi @glasser, thanks for the feedback here. I think we want to keep the separation between Write and Persist, even though we don't make good use of it at the moment. PersistState can be a very expensive operation for some backends, and calling it on every modification may be overkill. The PR you referenced definitely helps with the "lost state" situation, but I intend to review the code paths handling the state here to see what can be improved. Now that remote state backends operate directly on the remote storage, we should probably implement a write cache similar to how the old remote state worked. I've been thinking about leveraging the BackupState for this purpose, so that it not only saves a complete copy of the previous state, but also records modifications during the run.
The code you pointed to is purely for local configuration, which just happens to use the same state data structures for storage; it is separate from the actual state that users are concerned with, and is stored in separate files.
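As a rough, self-contained sketch of that write-cache idea (this is not the actual BackupState code; the types below are invented for illustration): every in-memory write also rewrites a local backup file, while the expensive remote persist remains a separate, less frequent step.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// cachingState records every modification to a local backup file as it
// happens, while treating the remote persist as a separate, costlier step.
type cachingState struct {
	backupPath string
	current    map[string]string
}

// WriteState updates the in-memory state and immediately rewrites the local
// backup file, so a crash mid-apply still leaves a usable record on disk.
func (c *cachingState) WriteState(state map[string]string) error {
	c.current = state
	data, err := json.MarshalIndent(state, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(c.backupPath, data, 0o644)
}

// PersistState is where the remote upload (e.g. to S3) would happen; it is
// stubbed out here.
func (c *cachingState) PersistState() error {
	fmt.Printf("would upload %d resources to the remote backend\n", len(c.current))
	return nil
}

func main() {
	st := &cachingState{backupPath: "terraform.tfstate.backup.example"}
	state := map[string]string{}
	for _, id := range []string{"aws_vpc.main", "aws_instance.web"} {
		state[id] = "created"
		if err := st.WriteState(state); err != nil {
			fmt.Println("backup write failed:", err)
		}
	}
	_ = st.PersistState()
}
```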
That makes sense. So maybe the hook should persist the state, but throttled to, say, at most once every 15 seconds? Is this something you're going to work on soon, or would it be helpful for me to give it a stab?
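A minimal sketch of what that throttling could look like, again with invented stand-in types rather than Terraform's actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// persister is a stand-in for anything with an expensive PersistState, such
// as a remote backend that uploads the full state file on every call.
type persister interface {
	PersistState() error
}

// throttledPersister wraps a persister and skips calls that arrive less than
// `interval` after the previous successful persist.
type throttledPersister struct {
	wrapped  persister
	interval time.Duration
	last     time.Time
}

// PersistState persists at most once per interval; intermediate calls are
// cheap no-ops, so a hook can call it after every resource without turning a
// large apply into hundreds of uploads.
func (t *throttledPersister) PersistState() error {
	if time.Since(t.last) < t.interval {
		return nil
	}
	if err := t.wrapped.PersistState(); err != nil {
		return err
	}
	t.last = time.Now()
	return nil
}

// slowRemote simulates an expensive remote persist.
type slowRemote struct{ uploads int }

func (s *slowRemote) PersistState() error {
	s.uploads++
	fmt.Println("upload", s.uploads)
	return nil
}

func main() {
	remote := &slowRemote{}
	p := &throttledPersister{wrapped: remote, interval: 15 * time.Second}
	// 100 rapid state updates result in a single upload.
	for i := 0; i < 100; i++ {
		_ = p.PersistState()
	}
	fmt.Println("total uploads:", remote.uploads)
}
```

The design trade-off is between upload frequency and how much state could be lost on a hard kill: a 15-second window bounds the loss to whatever completed in the last interval.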
@glasser, you mentioned that your tests can time out and leak resources? If you're not signaling for Terraform to shut down gracefully, or are killing it before it completes, you're pretty much guaranteed to lose state; the only difference now is that you may lose more of the state, but there was no guarantee previously that it was remotely complete either. Another issue here is that updates to S3 are only eventually consistent, so multiple updates during a Terraform execution wouldn't be guaranteed to appear immediately, or even in order, from the view of other clients. Now, losing "more" state than before is still less than desirable, since it can require more manual intervention to clean up. I have some changes in progress to help make saving remote state more robust, but we need to understand that the previous behavior isn't something that can be relied upon.
99% of the time when an apply is timing out, it's because an individual resource inside it is hanging and will never finish (e.g., it's trying to bring up an AWS ASG with buggy code where no machine will ever come up successfully). I've found in those cases that the Terraform graceful shutdown support is basically useless, since I don't think it interrupts in-progress resource updates. (The resources in question have very long timeouts set on them because sometimes they really do legitimately take a long time to roll out.) Maybe something about graceful shutdowns has changed recently, but in the particular use cases where I want to cancel an apply, I have never seen a graceful shutdown actually finish shutting down. That said, throwing a single call to opState.PersistState into backend_apply.go when the graceful shutdown starts might help a bit. (Though literally just doing that might have a race condition.) I do understand your point that it's hard to have any guarantees with a non-graceful shutdown. But this really is a significant regression from non-backend Terraform. If you're doing a giant create-from-scratch apply before 0.9 and it dies, you are likely to be left with a local file listing the resources it created. With 0.9 backends, you are 100% guaranteed that no Terraform-created resource will have any reference to it written to any file or remote resource unless terraform.Context.Apply finishes. I'm not really worried about race conditions here; I accept that they may exist in the world. I'm talking about creating brand new resources and not writing the name down anywhere, even if the process doesn't get killed for another half an hour. Would you be likely to accept a PR that adds a throttled PersistState call to StateHook? I think that would solve this problem and not add too much overhead if it's only calling PersistState every 30 seconds or so.
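To illustrate the "persist once when a graceful shutdown starts" part of this, here is a self-contained sketch using a context for cancellation; the types and flow are simplified stand-ins, not the real backend_apply.go code:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stateManager is a stand-in for the operation's state, with an in-memory
// write and an expensive remote persist.
type stateManager struct{ resources int }

func (s *stateManager) WriteState(n int) { s.resources = n }
func (s *stateManager) PersistState() error {
	fmt.Printf("persisted state with %d resources\n", s.resources)
	return nil
}

// runApply simulates a long apply that creates resources one by one. If the
// context is cancelled part-way through (e.g. SIGINT starting a graceful
// shutdown), the state written so far is persisted once before returning, so
// already-created resources are not lost.
func runApply(ctx context.Context, st *stateManager, total int) error {
	for i := 1; i <= total; i++ {
		select {
		case <-ctx.Done():
			// Best effort: flush what we have before giving up.
			if err := st.PersistState(); err != nil {
				return fmt.Errorf("persisting partial state: %w", err)
			}
			return ctx.Err()
		case <-time.After(50 * time.Millisecond): // pretend to create a resource
			st.WriteState(i)
		}
	}
	return st.PersistState()
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	st := &stateManager{}
	if err := runApply(ctx, st, 100); err != nil {
		fmt.Println("apply interrupted:", err)
	}
}
```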
I totally agree here. The changes I mentioned actually include both a call to PersistState when the operation context is cancelled and having StateHook periodically call PersistState.
Thanks for prioritizing this! I hope my earlier comments weren't too aggressive - just wanted to be clear that I would be happy to help out if you weren't going to get to it soon.
No problem @glasser! I already had these changes in mind for the next release, we just got them merged a little faster ;)
Note that this fix was reverted in #15176 because it caused a race condition.
@jbardin: Can this issue be re-opened? Thanks.
Hi, once the 0.10 betas are underway we can revisit this to determine why the partial states adversely affected Terraform Enterprise.
Thanks @jbardin!
I have the same problem with Terraform 0.11.
This might seem like ancient history, but this is biting me hard right now. Could you please, if nothing else, add a parameter for the backend config for this?
This is still perfectly relevant for us with Terraform 1.0.3 using the S3 state backend. Our CI server may, under certain conditions, abort a running terraform apply. What's the current stance on this, and is there anything planned to address the issue?
I also recently ran into this issue when using a long-running Terraform job from a Jenkins pipeline where my deployment container was terminated abruptly. I was using S3 as a remote backend, and my lock was also stuck because Terraform did not get a chance to exit cleanly. In this sort of scenario and others that have been mentioned above, I think having remote state updated continuously makes a lot of sense. At the very least, the scope of orphaned resources would be vastly reduced, making for a far better recovery experience.
Same issue, specifically on GitHub Actions, where the job timeout seems to just SIGKILL the container and we're left with a swathe of orphaned resources and an empty state file in S3.
Can I reboot this as well? We've just had a nightmare tracking down orphaned resources, and now I'm finding it hard to justify why we would use this over CloudFormation. As mentioned above, GitHub Actions can send SIGKILL to a running CI/CD container, and we are using Azure DevOps and have experienced the same thing; it didn't even time out 🤦 If we're using remote state, then surely continuous updates are the expected behaviour? Even if a resource is still in the process of being created, the service will have given you the ID of the resource in its pending state; there just doesn't seem to be any reason why the remote state cannot be completely up to date at all times.
Thanks for the continued feedback on this issue! At this point, we understand the use cases and know that we would like to address this issue in the future, pending prioritization. For this reason I am going to lock comments; please continue to express your interest in this issue using the upvote mechanic (👍) on the original description. Thanks very much!
Terraform Version
Terraform v0.9.1
Affected Resource(s)
S3 remote state
This is a remote backend issue affecting the tfstate.
Terraform Configuration Files
The configuration has multiple modules with lots of .tf files, along with a remote backend configured as S3.
Expected Behavior
The remote backend behavior for tfstate stored in S3 should be such that, as my resources get created one by one, the tfstate in S3 is immediately updated with the details of these resources in real time.
For local tfstate files, the tfstate gets updated immediately when a resource reaches "creation complete" and I can see it in the tfstate right away. But that is not the case with the remote S3 backend.
Actual Behavior
I saw the remote tfstate file in the S3 UI only after terraform apply completed, by which point all resources had already been created. I did not see the tfstate file getting updated as resources were created one by one.
Steps to Reproduce
terraform apply