providercache: Ignore lock-mismatching global cache entries #32129

apparentlymart · 2022-11-01T00:17:28Z

When we originally introduced the trust-on-first-use checksum locking mechanism in v0.14, we had to make some tricky decisions about how it should interact with the pre-existing optional read-through global cache of provider packages:

The global cache conflicts with the checksum locking because if the needed provider is already in the cache then Terraform skips installing the provider from upstream and therefore misses the opportunity to capture the signed checksums published by the provider developer. We can't use the signed checksums to verify a cache entry because the origin registry protocol is still using the legacy ziphash scheme and that is only usable for the original zipped provider packages and not for the unpacked-layout cache directory. Therefore we decided to prioritize the existing cache directory behavior at the expense of the lock file behavior, making Terraform produce an incomplete lock file in that case.

Now that we've had some real-world experience with the lock file mechanism, we can see that the chosen compromise was not ideal because it causes terraform init to behave significantly differently in its lock file update behavior depending on whether or not a particular provider is already cached. By robbing Terraform of its opportunity to fetch the official checksums, Terraform must generate a lock file that is inherently non-portable, which is problematic for any team which works with the same Terraform configuration on multiple different platforms.

This change addresses that problem by essentially flipping the decision so that we'll prioritize the lock file behavior over the provider cache behavior. Now a global cache entry is eligible for use if and only if the lock file already contains a checksum that matches the cache entry. This means that the first time a particular configuration sees a new provider it will always be fetched from the configured installation source (typically the origin registry) and record the checksums from that source.

On subsequent installs of the same provider version already locked, Terraform will then consider the cache entry to be eligible and skip re-downloading the same package.

This intentionally makes the global cache mechanism subordinate to the lock file mechanism: the lock file must be populated in order for the global cache to be effective. For those who have many separate configurations which all refer to the same provider version, they will need to re-download the provider once for each configuration in order to gather the information needed to populate the lock file, whereas before they would have only downloaded it for the first configuration using that provider.

This should therefore remove the most significant cause of folks ending up with incomplete lock files that don't work for colleagues using other platforms, at the expense of bypassing the cache for the first use of each new package with each new configuration. This tradeoff seems reasonable because otherwise such users would inevitably need to run terraform providers lock separately anyway, and that command always bypasses the cache. Although this change does decrease the hit rate of the cache, if we subtract the never-cached downloads caused by "terraform providers lock" then this is a net benefit overall, and does the right thing by default without the need to run a separate command.

apparentlymart · 2022-11-01T00:23:10Z

I can see that this has broken an assumption being made by one of the end-to-end tests that didn't run in my dev environment because I was only running the "offline" subset of the tests. I'll see about updating that tomorrow so that it properly models the two cases of this new behavior, both with and without an existing lock file.

When we originally introduced the trust-on-first-use checksum locking mechanism in v0.14, we had to make some tricky decisions about how it should interact with the pre-existing optional read-through global cache of provider packages: The global cache essentially conflicts with the checksum locking because if the needed provider is already in the cache then Terraform skips installing the provider from upstream and therefore misses the opportunity to capture the signed checksums published by the provider developer. We can't use the signed checksums to verify a cache entry because the origin registry protocol is still using the legacy ziphash scheme and that is only usable for the original zipped provider packages and not for the unpacked-layout cache directory. Therefore we decided to prioritize the existing cache directory behavior at the expense of the lock file behavior, making Terraform produce an incomplete lock file in that case. Now that we've had some real-world experience with the lock file mechanism, we can see that the chosen compromise was not ideal because it causes "terraform init" to behave significantly differently in its lock file update behavior depending on whether or not a particular provider is already cached. By robbing Terraform of its opportunity to fetch the official checksums, Terraform must generate a lock file that is inherently non-portable, which is problematic for any team which works with the same Terraform configuration on multiple different platforms. This change addresses that problem by essentially flipping the decision so that we'll prioritize the lock file behavior over the provider cache behavior. Now a global cache entry is eligible for use if and only if the lock file already contains a checksum that matches the cache entry. This means that the first time a particular configuration sees a new provider it will always be fetched from the configured installation source (typically the origin registry) and record the checksums from that source. On subsequent installs of the same provider version already locked, Terraform will then consider the cache entry to be eligible and skip re-downloading the same package. This intentionally makes the global cache mechanism subordinate to the lock file mechanism: the lock file must be populated in order for the global cache to be effective. For those who have many separate configurations which all refer to the same provider version, they will need to re-download the provider once for each configuration in order to gather the information needed to populate the lock file, whereas before they would have only downloaded it for the _first_ configuration using that provider. This should therefore remove the most significant cause of folks ending up with incomplete lock files that don't work for colleagues using other platforms, and the expense of bypassing the cache for the first use of each new package with each new configuration. This tradeoff seems reasonable because otherwise such users would inevitably need to run "terraform providers lock" separately anyway, and that command _always_ bypasses the cache. Although this change does decrease the hit rate of the cache, if we subtract the never-cached downloads caused by "terraform providers lock" then this is a net benefit overall, and does the right thing by default without the need to run a separate command.

github-actions · 2022-11-04T23:18:28Z

Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

joe-a-t · 2022-11-09T18:39:58Z

I am extremely concerned about this change. It sounds like you are saying that this will result in the global provider cache being entirely ignored if people don't use checksum lockfiles in all of their root modules. We deliberately do not use those lockfiles so it sounds like we will take a huge performance hit as a result of this change since the cache would always be ignored for us. Is that understanding correct?

joe-a-t · 2022-11-09T18:47:42Z

Also, there are some custom providers we use that do not have an upstream to pull from, we have to pre-populate them in the cache. If I am reading this correctly, all of those will break if this change gets merged. I strongly believe that this would wreck so much chaos on our workflow (which I don't think is unique) that it would be entirely inappropriate to do without a major version release.

apparentlymart · 2022-11-09T18:55:41Z

Yes, it is correct that with this change using lock files will be mandatory if you want to use the global plugin cache. We have merged this early in the v1.4 development period to allow plenty of time to refine it based on feedback.

This should not break local (non-registry) modules if you use them in the documented way, by placing them into a local filesystem mirror directory instead of in the global cache. Terraform itself is the only process allowed to write into the global cache directory.

The alpha release today is one vehicle for that feedback; I'd love if you'd try the alpha and share your experiences in a new issue. Thanks!

tjstansell · 2022-11-11T17:33:39Z

So, this completely breaks our use of the cache in a terragrunt world where we're initializing hundreds of terraform repositories. We pre-populate the provider cache once and then in the terraform 1.3 world, every repo we initializes shows it's Using hashicorp/aws v4.38.0 from the shared cache directory. With 1.4.0-alpha, every repo shows

- Installing hashicorp/aws v4.38.0...
- Installed hashicorp/aws v4.38.0 (unauthenticated)

This effectively negates the use of the cache for us.

tjstansell · 2022-11-11T17:54:07Z

I'll add that even though all of this is running inside a custom docker image with a local mirror directory the providers come from (note the unauthenticated above), it's still re-copying the exact same provider every time into the cache directory. Since this process is not concurrency-safe, we end up with corrupted files or invalid checksums, not to mention the fact it adds significant delay when this is happening for hundreds of terraform repositories.

tjstansell · 2022-11-11T17:58:18Z

Sorry, I'll open a new ticket with this feedback.

github-actions · 2022-12-12T02:14:45Z

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

apparentlymart requested a review from a team November 1, 2022 00:17

apparentlymart self-assigned this Nov 1, 2022

vercel bot deployed to Preview November 1, 2022 00:18 View deployment

apparentlymart force-pushed the f-provider-global-cache-checksum-mismatch branch from 1fca13b to 581f0b9 Compare November 1, 2022 22:51

vercel bot deployed to Preview November 1, 2022 22:56 View deployment

kmoe approved these changes Nov 4, 2022

View reviewed changes

crw linked an issue Nov 4, 2022 that may be closed by this pull request

Terraform lock file irresolvable problem #32124

Closed

apparentlymart merged commit d0a35c6 into main Nov 4, 2022

apparentlymart deleted the f-provider-global-cache-checksum-mismatch branch November 4, 2022 23:18

tjstansell mentioned this pull request Nov 11, 2022

provider cache becomes ineffective with 1.4.0-alpha release #32205

Closed

github-actions bot locked as resolved and limited conversation to collaborators Dec 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

providercache: Ignore lock-mismatching global cache entries #32129

providercache: Ignore lock-mismatching global cache entries #32129

apparentlymart commented Nov 1, 2022 •

edited

Loading

apparentlymart commented Nov 1, 2022

github-actions bot commented Nov 4, 2022

joe-a-t commented Nov 9, 2022

joe-a-t commented Nov 9, 2022

apparentlymart commented Nov 9, 2022 •

edited

Loading

tjstansell commented Nov 11, 2022

tjstansell commented Nov 11, 2022

tjstansell commented Nov 11, 2022

github-actions bot commented Dec 12, 2022

providercache: Ignore lock-mismatching global cache entries #32129

providercache: Ignore lock-mismatching global cache entries #32129

Conversation

apparentlymart commented Nov 1, 2022 • edited Loading

apparentlymart commented Nov 1, 2022

github-actions bot commented Nov 4, 2022

joe-a-t commented Nov 9, 2022

joe-a-t commented Nov 9, 2022

apparentlymart commented Nov 9, 2022 • edited Loading

tjstansell commented Nov 11, 2022

tjstansell commented Nov 11, 2022

tjstansell commented Nov 11, 2022

github-actions bot commented Dec 12, 2022

apparentlymart commented Nov 1, 2022 •

edited

Loading

apparentlymart commented Nov 9, 2022 •

edited

Loading