
feat: allow toggling of state refresh #998

Closed · wants to merge 1 commit

Conversation

Apollorion

I'm part of a large organization that manages GitHub with this provider for everything.

Unfortunately, our organization is so large that the GITHUB_TOKEN used to manage our org gets rate-limited by GitHub in just the plan phase of Terraform. By the time we get to the actual apply phase (if we even can), we have to wait over an hour for the token to be usable again. Additionally, plans cannot fail fast because of the refresh period (we were sometimes seeing plans take up to 45 minutes).

This change forces the provider to "trust" the state in Terraform when detect_drift = false. This allows us to make changes quickly and keeps requests to the GitHub API low, thereby avoiding rate limiting. It also lets us turn the state refresh feature back on with detect_drift = true so we can, every so often, run a plan that will actually detect any drift within our configuration.
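
A minimal sketch of how the flag would be used, assuming the detect_drift attribute name proposed in this PR (the attribute only exists on this branch; the owner and token values are placeholders):

```hcl
provider "github" {
  owner = "my-org"           # placeholder organization
  token = var.github_token   # placeholder variable holding the token

  # Proposed in this PR: when false, trust the Terraform state and skip
  # per-resource reads during refresh; when true, refresh every resource
  # against the GitHub API as the provider does today.
  detect_drift = false
}
```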

@jcudit
Contributor

jcudit commented Jan 5, 2022

I'm unsure if this solution to rate limiting will cause correctness issues elsewhere. I do agree that the provider needs to do a better job of efficiently performing resource refreshes, though. This will be a tough one to ship as-is. I recommend getting more discussion and buy-in from the community as a next step if we are to proceed.

Other workarounds observed in the wild include breaking a state file into multiple files using terraform state mv ..., as well as targeted plans / applies that only allowlist certain resources to be evaluated and changed.

@morremeyer
Contributor

Thanks for opening this and thanks @jcudit for sharing your concern.

I am open to this solution, as in my organization we're using this provider to enforce the desired state, no matter what the current one is. However, given the current performance implications, ensuring that we don't hit API limits with read requests seems like a more stable approach to me.

As we're dealing with the same issue right now (rate limiting already in the plan phase) and speed of execution is not that important to us, I'll contribute a solution that adds a configurable read_delay_ms, similar to the write_delay_ms that already exists.
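
As a rough sketch, assuming the new attribute ends up named read_delay_ms to mirror the existing write_delay_ms (the values and owner below are placeholders):

```hcl
provider "github" {
  owner = "my-org"         # placeholder organization

  write_delay_ms = 1000    # existing option: pause between write requests
  read_delay_ms  = 100     # proposed option: pause between read requests
}
```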

I will link the PR here so that you can check it out, @Apollorion.

@Apollorion
Author

> I'm unsure if this solution to rate limiting will cause correctness issues elsewhere.

FWIW, our organization has been using this modified provider for a few months and we haven't noticed any issues.

> I recommend getting more discussion and buy-in from the community as a next step if we are to proceed.

Yeah, I knew this was going to be a tough sell 😁. Is there a place for us to do this discussion? Here? Discord? Slack?

> Other workarounds observed in the wild include breaking a state file into multiple files using terraform state mv ..., as well as targeted plans / applies that only allowlist certain resources to be evaluated and changed.

We had considered doing this.

The problem with breaking up the state file was that it felt technically wrong to us. The issue arises when working strictly within the same few resource types. For example, we have hundreds of repos, so to break the state apart we would need two workspaces managing the same kind of resource. Then questions come up like: does this repo go into workspace 1 or workspace 2? In reality it made far more sense to keep all the resources together in the same workspace.

The problem with targeting is that we are using Terraform Cloud, so we can't target our resources in an easily automatable way with our current Terraform setup.

@morremeyer
Contributor

@Apollorion I'm working on #1054 in the meantime which I think is going to be an easier sell. I'm testing it against our infra right now and will mark the PR ready for review once I'm happy.

@Apollorion
Author

> @Apollorion I'm working on #1054 in the meantime which I think is going to be an easier sell. I'm testing it against our infra right now and will mark the PR ready for review once I'm happy.

This isn't a great solution for us (though I have no doubt it is useful to others), only because we want to fail fast. With our current IaC, even with zero delay, the plan phase takes nearly an hour. I can't imagine what our plan time would look like if we added a delay to every read.

This proposed solution fits us better because it essentially skips the state refresh, which allows the plan phase to complete in seconds. Then we can simply fail during the few write operations if needed.

@morremeyer
Contributor

Good to know, thanks. I'm definitely looking forward to more discussion around this.

I would definitely love to have it as a configurable option, as I want to enforce the state in the module no matter what the current state is.

@Apollorion
Author

> Good to know, thanks. I'm definitely looking forward to more discussion around this.

> I would definitely love to have it as a configurable option, as I want to enforce the state in the module no matter what the current state is.

Honestly, I think this PR is very complementary to yours. We would like to run nightly automated state refreshes with this change. That way, during the day, our developers can fail fast, with the added benefit that, at night, we can verify any drift is caught.

In some automated fashion, we will flip the detect_drift flag this PR provides back to true and use your timeout flag to run a long-lived state refresh / apply.
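
A hypothetical sketch of that workflow, assuming the detect_drift attribute from this PR and a wrapper variable (both names are illustrative and not part of any released provider):

```hcl
variable "detect_drift" {
  type    = bool
  default = false    # fast, state-trusting plans during the day
}

provider "github" {
  owner        = "my-org"          # placeholder organization
  detect_drift = var.detect_drift  # flipped to true for the nightly run
}
```

The nightly job would then run with TF_VAR_detect_drift=true (or -var="detect_drift=true") to perform a full drift-detecting plan and apply.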

@jcudit
Contributor

jcudit commented Feb 11, 2022

Thanks, everyone. This is the kind of feedback I was hoping for when hesitating earlier.

Seems safe to ship with the right documentation. 👍🏾

@Apollorion
Author

@jcudit I just added documentation for the flag to website/docs/index.html.markdown. Let me know if you need anything else to get this approved / merged.

@Apollorion
Author

Do you need anything else from me, @jcudit?

@Apollorion
Author

Sorry to keep bugging you @jcudit, I know you're probably a busy person. Is there anything we can do to move this forward?

@anthonyangel

FWIW, independent of the provider, you can set Terraform to skip the refresh on plan / apply via https://www.terraform.io/cloud-docs/run/modes-and-options#skipping-automatic-state-refresh. This can also be set in TFC by setting the env var TF_CLI_ARGS_plan="-refresh=false".

@Apollorion
Author

> FWIW, independent of the provider, you can set Terraform to skip the refresh on plan / apply via https://www.terraform.io/cloud-docs/run/modes-and-options#skipping-automatic-state-refresh. This can also be set in TFC by setting the env var TF_CLI_ARGS_plan="-refresh=false".

Oh, wow, I did not know this. Thank you!

@Apollorion
Author

Going to close this PR, since we can just use the TF_CLI_ARGS env variable to do basically the same thing.

@Apollorion closed this May 5, 2022