-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't make redundant copies of the same module package #29503
Comments
Hi @YuriGal! Thanks for sharing this use-case. The important design consideration here is that existing modules unfortunately expect to be able to create new files inside We don't yet have a design that can address that need in a backward-compatible way, and thus we have so far retained the original design of having each An example of such a design might be for Terraform to analyze a module to determine whether it uses This is also related to the existing issue #21308, which discusses a possible alternative to As we currently stand, Terraform's module installer does recognize situations where two modules have a trivially equal source address (where the syntax of the address is the same, as opposed to where the addresses could both refer to the same location with different syntax) and will copy the existing local directory rather than repeatedly downloading in that case. That mechanism is intended to address the performance impact of repeatedly downloading the same package, but it doesn't address the cost of storing the module multiple times on local disk. Terraform has historically assumed that local disk storage is cheap and thus there are various behaviors that spend more disk space than they technically need to in exchange for other benefits, such as a lightweight measure of isolation between configurations/modules. We do recognize that there are situations where heavy use of local storage is problematic, such as in transient "scale to nothing" execution environments where the local filesystem is often transient and limited in size, but I do want to be clear that if you wish to use Terraform today you should design your use of it around the assumption that it will create a potentially large footprint in whatever filesystem contains your working directory, because these behaviors run deep and are unlikely to change in the near future. |
@apparentlymart TBH, I'm not sure what kind of size we are talking about .. but AFAIK, if u follow best practices for both Terraform Modules & Git repositories..u shouldn't be on scenarios where the storage can become a problem? |
@apparentlymart regarding what I put in #31422, can you please elaborate on the concerns about compatibility? I'm unclear on how having a |
To be concrete about it, it is currently possible to write something like the following in a Terraform module and have it work, and so we are bound to keep this working even though I would assert that it's not a good idea to use the local filesystem for transient storage during a Terraform operation: variable "content" {
type = string
}
resource "local_file" "example" {
content = var.content
filename = "${path.module}/temporary.txt"
}
data "local_file" "example" {
filename = local_file.example.filename
}
output "content" {
value = data.local_file.example.content
} If the above were published as a shared module called from multiple separate The above example is of course contrived just to keep it simple but this applies to any situation where a module writes to a file inside its If we can find a design compromise that remains compatible with both the above and the more-intended use of |
Ok, so sticking with the backwards compatibility on that, could we either:
Would either of those be an acceptable compromise and within the capabilities of Terraform without going too crazy? |
This (more reasonable, more common) pattern must also keep working: resource "aws_s3_bucket_object" "example" {
bucket = "example"
key = "anything"
source = "${path.module}/bucket-content/anything"
} In this case, The key design challenge here is that I'm not currently seeing any design that can allow (There are some possible designs relying on OS-level features such as overlay filesystems, but Terraform does not typically run with suitable permissions to make use of those and they are not available in a standard way across all of the platforms we target.) |
So what I was thinking of for reads is that when Terraform sees |
For example, consider that in my second case which contains an argument We know from the AWS provider documentation that the provider is going to treat that path as a filename to read from, but Terraform Core cannot know that: the provider could just send that path to a remote system and not treat it as a filename at all, or it might write something to that path. |
Ahhhh, I did not realize exactly where the boundary was there between Terraform Core and the providers. I did not realize that providers directly interact with the file system without going through Terraform Core. I thought maybe Terraform Core could handle the retry logic and implement it once for all providers, but it sounds like you're saying that all providers would need to implement the retry logic themselves (unless if maybe Terraform Core was intercepting those filesystem calls and able to handle the retry logic there? But that sounds like it would be introducing a ton of complexity that I completely understand wanting to avoid). |
Indeed... the only situation where Terraform Core is directly interacting with the filesystem is in function calls like |
Just to clarify a bit further, the racing herd problem when writing files to I actually don't see anything on https://www.terraform.io/language/expressions/references#filesystem-and-workspace-info warning about this behavior or advising against using |
Also I opened #31441 as a documentation issue to add the risks with |
Unfortunately we know that the existing uses are prevalent enough that we cannot use the lack of documentation as justification for the breaking change. The following language in our compatibility promises summarizes this compromise:
This situation fails the above test in two ways:
You are right that there are already situations where using |
Understood. Hopefully this refactor can make the list for |
@apparentlymart Is it possible to allow the user calling the module to specify the behaviour (symlink or copy) in the module block. e.g.
This way its in the users hands whether to incur a copy at extra disk space, or linked with the disadvantages with files above. In my case I am invoking the same module for every AWS account for every AWS region - thats about 800 invocations. It is adding up to quite a lot of disk space and quite slow to cleanup afterwards too, I'd love to be able to choose the behaviour |
If ^ was an option, would strongly prefer a global setting instead of needing to specify per module invocation. |
It seems to me that whether it's safe/correct to reuse the same directory for multiple instances of the same module is a property of the module itself rather than a property of the call to that module, and so if we were going to make this an opt-in sort of thing I expect it would be better for that opt-in to be inside the module being called rather than in the It would essentially be a promise from the module developer that planning and applying the module will not cause any modifications to the module's source directory. We might consider helping to reinforce that by having Terraform mark the directory and the files inside as read-only. However, I will say that one challenge with this approach which occurs to me from some initial thought experimenting is that there's a slight mismatch of concept here: the artifact Terraform is placing in the filesystem is the entire module package which contains the module being requested, and so it seems like a statement about this being read-only ought to be something that is scoped to the module package as a whole rather than to an individual module inside it. We don't have any precedent for metadata on module packages, so we'd need to figure out what is the best way to represent that. |
I'm using my own modules and I know what's inside. If this happens, I really hope there's an option like #29503 (comment) |
Current Terraform Version
Use-cases
We have a Github repository of modules with each module in its own subfolder. Module usage in a project looks something like this
A project can reference quite a few of such modules from that repo. And for every referenced module a full package with full repo content is downloaded. This results in multiple identical copies of the repository being downloaded taking extra space and possibly causing other issues.
Proposal
I understand that this is a behavior by design, but it would be great if terraform could optimize it. Basically detect that multiple module references are pointing to the same source repository and same release version and download the package into one common location. Kinda like Yarn does with workspaces and common node_modules.
-->
The text was updated successfully, but these errors were encountered: