optimize package installation for space and speed by using copy-on-write file clones ("reflinks") and storing wheel cache unpacked #11092
Symlinks are tricky because they make it impossible to know whether you can safely remove an entry from the cache. Of course, AFAIK pip doesn't actually have any policy for evicting items from the cache currently, but this would rule it out forever, and also mean that it's no longer safe for the user to blow away the cache. (Also, I suspect there are plenty of automated tools out there that do this; it's easy to imagine a disk-cleanup tool blowing the cache away.) Hardlinks avoid these issues. I dunno if they create any new ones – might want to check with the conda folks, since they have years of experience with doing this (with hardlinks).
A third option to consider might be reflinks.
Hardlinks have a different version of this issue too, which is that editing a file in place has surprising side effects. I definitely scribble on my venvs periodically because of my experience with the unreliability of Python debuggers, and scribbling on all of them at once would definitely be an unwelcome surprise.
So I thought of this and immediately discarded the thought, because reflinks are a super obscure feature that only barely works on Btrfs, right? But your comment got me to do a little research, and I discovered that they're supported on Windows, on APFS on macOS, and on Btrfs, CIFS, NFS 4.2, OCFS2, overlayfs, and XFS on Linux. Given this surprisingly wide deployment, and the relative lack of refcounting or accidental-mutability issues, maybe it would be good to implement these first?
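To illustrate how directly exposed these clones are on one of those platforms, here is a small sketch, under the assumption of macOS/APFS, that reaches the `clonefile(2)` syscall through ctypes. Nothing here is existing pip behavior, and the paths are invented for the example:

```python
import ctypes
import ctypes.util
import os

# clonefile(2) is the APFS copy-on-write primitive on macOS. Python does not
# expose it directly, so we call it via ctypes. Man-page signature:
#   int clonefile(const char *src, const char *dst, int flags);
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def apfs_clone(src: str, dst: str) -> None:
    """Create a copy-on-write clone of src at dst (APFS only)."""
    if libc.clonefile(src.encode(), dst.encode(), 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

# Hypothetical usage: clone a cached file into a virtualenv.
# apfs_clone("/path/to/cache/module.py", "/path/to/venv/module.py")
```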
As far as I understand, they are supported on ReFS, but that isn't the default filesystem on Windows (my laptop is still using NTFS). Unless ReFS presents itself as NTFS (and hence I'm using it without knowing), I suspect that the number of Windows environments where reflinks work is extremely small...
AFAIR Python's shutil copy-tree automatically tries to use reflinks when available (verification needed).
I don't see any references to reflink in 3.10's shutil.py...
@pfmoore they come in via the copy-file-range helpers used to optimize https://docs.python.org/3/library/shutil.html#shutil-platform-dependent-efficient-copy-operations since Python 3.8
@RonnyPfannschmidt Are you sure of that?
This explanation does not match reflinks. Reflinks are a feature of "some" filesystems (nice explanation here: https://blog.ram.rachum.com/post/620335081764077568/symlinks-and-hardlinks-move-over-make-room-for), and the shutil implementation in Python 3.8 just mentions a "fast-copy" operation done in the kernel rather than in user space, the optimization coming from avoiding multiple user<->kernel syscalls and user-space buffers. That is a different thing than reflinks altogether, IMHO. I believe reflinks require explicit system calls (like, for example, what https://pypi.org/project/reflink/ provides), and they are very much tied to which filesystem the files are on.
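For reference, the PyPI `reflink` package mentioned above wraps those explicit clone syscalls. If I'm reading its docs right, usage looks roughly like this (paths are hypothetical, and the clone raises if the filesystem can't do it):

```python
from reflink import reflink, supported_at

# Probe whether the target filesystem supports reflink clones at all.
if supported_at("/tmp"):
    # Clone the file: both names share the same blocks until one is modified.
    reflink("/tmp/original.bin", "/tmp/clone.bin")
```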
On Linux, I think using `os.copy_file_range` could take advantage of reflinks on supporting filesystems. No idea what it does on Windows, or if that function is even available on Windows, but it looks like reflink is only available on Windows with ReFS. It doesn't appear like shutil currently uses `copy_file_range`. We'd still get performance improvements from not having to unzip into a temporary location and copy out of that, and IIRC we're using the default temporary directory by default, which is oftentimes on another filesystem, so we'd be more likely to use fast copying at a minimum as well.
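For anyone curious, a minimal sketch of the syscall being discussed: `os.copy_file_range()` exists since Python 3.8 and, on Linux, lets the kernel move the bytes itself; filesystems such as Btrfs and XFS may satisfy it with a reflink. The function name and loop here are just an illustration, not anything shutil or pip does today:

```python
import os

def fast_copy(src: str, dst: str) -> None:
    """Copy src to dst in-kernel via copy_file_range (Linux, Python 3.8+)."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        remaining = os.fstat(fsrc.fileno()).st_size
        while remaining > 0:
            # Copies at the current file offsets and advances both of them.
            copied = os.copy_file_range(fsrc.fileno(), fdst.fileno(), remaining)
            if copied == 0:
                break  # defensive: avoid an infinite loop on short copies
            remaining -= copied
```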
Indeed, it seems like I misremembered a detail.
TIL about reflinks. BTW, reflinks are a nice feature; a pity they're only available on some "obscurish" filesystems.
I have a feeling that this may need to be solved one layer up, in the virtual environment abstraction. Node has more or less the same problem, and the way they currently address it (pnpm) is to share package installations between environments where possible. This would be made more doable by referencing files directly from the pip cache. All the same issues with soft-linking still persist, of course, although Node has never been that friendly to development environments on Windows, so probably they just don't care that much (I didn't check).
And there is actually work toward this right now (see recent comments in #2984 and other issues referenced in it), so we probably don't want to go in this particular direction, at least not without a lot of discussion.
I don't think you can solve this in the virtual environment abstraction? At least I'm not sure how you're envisioning that working; the virtual environment abstraction is largely just setting up the interpreter and its paths. I think the only reasonable path here is pretty straightforward:

1. Store wheels unpacked in the cache instead of as zipped .whl files.
2. Install by copying the unpacked files into the target environment with shutil, which can pick up platform fast-copy or copy-on-write support (see the sketch after this comment).

This has some immediate benefits:

- installs skip the unzip step entirely;
- on filesystems with reflink support, the copies become nearly free in both time and space.

With some immediate downsides:

- the unpacked cache takes more disk space than compressed wheels do.

Then it also has some longer-term benefits:

- it lays the groundwork for sharing installed files between environments straight from the cache;
- it transparently picks up reflink support as it becomes available in Python itself.
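For concreteness, here is a minimal sketch of that install step, assuming an unpacked-wheel cache. All names are hypothetical; `FICLONE` is the Linux reflink ioctl, and `shutil.copy2` is the fallback when the filesystem can't clone:

```python
import fcntl
import shutil
from pathlib import Path

FICLONE = 0x40049409  # Linux ioctl request for a reflink clone (Btrfs, XFS)

def clone_or_copy(src, dst):
    """copy_function for copytree: try a reflink first, else a real copy."""
    try:
        with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
        shutil.copystat(src, dst)
    except OSError:
        shutil.copy2(src, dst)  # plain byte copy when cloning isn't supported

def install_unpacked_wheel(cached_wheel_dir: Path, site_packages: Path) -> None:
    # The wheel is already unzipped in the cache, so installing is just a
    # (possibly copy-on-write) tree copy into the environment.
    shutil.copytree(cached_wheel_dir, site_packages,
                    copy_function=clone_or_copy, dirs_exist_ok=True)
```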
This is practically a bit like the proposal for shared storage I tried to bring forward a while back.
Here we are talking about a few things to be improved.
Soft links can point to directories or files, and are therefore much more viable. To overcome your fear about safely removing cache entries, we can create another utility for file linking (specifically for PyPI) that generates the links in soft format but keeps the original data alive as long as any links remain; only when no links point to the actual data any more will it be removed, by the last link-removal action. The approach is kinda simple: reference counting on the linked data.
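A toy sketch of that bookkeeping, purely to illustrate the idea (all names here are hypothetical): each cached blob gets a sidecar counter file, and the data is deleted only when the last symlink to it is removed.

```python
import os
from pathlib import Path

class RefCountedLinkStore:
    """Toy model: symlinks into a store, with per-blob reference counts."""

    def __init__(self, store: Path):
        self.store = store

    def add_link(self, blob_name: str, link_path: Path) -> None:
        blob = self.store / blob_name
        link_path.symlink_to(blob)
        self._adjust(blob, +1)

    def remove_link(self, link_path: Path) -> None:
        blob = Path(os.readlink(link_path))
        link_path.unlink()
        if self._adjust(blob, -1) <= 0:
            blob.unlink()                           # last reference: reclaim data
            (blob.parent / (blob.name + ".refs")).unlink()

    def _adjust(self, blob: Path, delta: int) -> int:
        counter = blob.parent / (blob.name + ".refs")
        count = int(counter.read_text()) if counter.exists() else 0
        counter.write_text(str(count + delta))
        return count + delta
```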
It's a very interesting concept, but there must be a lot of edge cases to explore. What happens with this feature if I run a system dist upgrade?
Is it correctly understood that this fits best for a CI environment where virtualenvs are created often? In that case, perhaps it could be possible to enable a behavior like this as non-default, through an opt-in switch.
With my proposed idea a few posts up, there is no semantic change for any operation as it exists today. Reflinks' COW properties are well suited here, but since we can't rely on them existing, we can let shutil handle that for us, and just use shutil to copy an unpacked cached wheel into a virtual environment. Without reflink support it will just copy the file contents, basically the same as we're doing today, except we skip unzipping the wheel (since it's already unzipped). With reflink support it will COW the files using reflinks.
Not having the cache shared by other venvs is not a big deal, compared to the current pains we have!
@benjaoming most virtualenvs break on dist upgrade because of Python changes, not because of .so changes in wheels; in particular, the wheels from PyPI will not break the shared libs.
To elaborate, @benjaoming: the proposed change tactically just changes the following – the wheels would be stored unpacked, so a fast copy/COW copy is used instead of an "unzip" to put them from the cache into the virtualenv. As such, the expected behaviour post-install will match the current mechanism 1:1.
Creating a virtual environment is a major time sink compared to actually running the pipeline in a CI system I am working with. I like @dstufft's proposal of caching unpacked wheels because it unlocks immediate improvements without having to figure out the intricacies of sharing installed packages between environments. However, I don't have a clear idea of how big the cache would become. That's not a problem on a beefy CI server, but it should probably be opt-in.
@RonnyPfannschmidt Does a cache with unzipped wheels impose new constraints on pip's cache-expiry mechanism? I take it that the consensus here is "no", but I think it's good to ask for the sake of clarity, since caching is commonly understood to be a hard problem.
Do we have any idea how much extra space this would take? Over in #11143 we're having a debate about trying to reduce the space usage of the HTTP cache; it seems inconsistent to do that and yet increase the space usage of the wheel cache without worrying about it... (Personally, my machine is big enough that cache size isn't a significant issue, but we have enough users on space-limited systems such as the Raspberry Pi that we can't assume disk space isn't an issue in general.)
We could potentially help systems like that by offering a flag that tries to issue hard links and, failing that, falls back to soft links, letting people opt into the space savings before reflink support is available to them. We could also just say that we're not super interested in this until reflink support is available in Python itself, or implement a cache cleanup mechanism with some sort of LRU or something. I don't know offhand how much compression a wheel achieves over uncompressed; zip file members are only compressed individually, so it won't be as high as it could be. Shouldn't be too hard to pull down a bunch of wheels from PyPI and look, though.
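Along those lines, a quick-and-dirty way to measure it, assuming a local directory of wheels fetched with `pip download` (the directory name is made up):

```python
import zipfile
from pathlib import Path

def compression_ratio(wheel: Path) -> float:
    """Sum of uncompressed member sizes divided by the wheel's size on disk."""
    with zipfile.ZipFile(wheel) as zf:
        unpacked = sum(info.file_size for info in zf.infolist())
    return unpacked / wheel.stat().st_size

# e.g. after running: pip download -d wheels/ numpy requests django
for whl in sorted(Path("wheels").glob("*.whl")):
    print(f"{whl.name}: {compression_ratio(whl):.2f}x unpacked/packed")
```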
Ah, yes, I misread this. That's unfortunate. Still, their general availability on APFS potentially serves a lot of Python developers, and Btrfs is coming to more and more Linux distros as the default root filesystem. Although https://en.wikipedia.org/wiki/ReFS looks very messy in terms of its development history and availability (it was available on most client versions of Windows until the Windows 10 Creators Update, and now it's reserved for Pro & Enterprise?), it still claims "the intent of becoming the 'next generation' file system after NTFS".
(I am going to try to stop falling down the rabbit hole of reading the tea leaves on Microsoft's future plans for this filesystem, but https://github.com/microsoft/CopyOnWrite does at least imply that Microsoft cares about the feature a little?)
To provide a few numbers, I measured the time it took to install a fixed set of packages into a clean virtual environment. [requirements list and timing results omitted]
The size of the unpacked wheels does not seem unreasonable to me. However, it looks like the compiled (.pyc) files would have to be cached as well to get the best performance improvements.
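On that last point: bytecode could in principle be produced once, inside the cache, with the stdlib compileall module. The cache layout below is entirely hypothetical, and note that .pyc files embed source paths, so sharing them across environments would need care:

```python
import compileall
from pathlib import Path

# Pre-compile everything under an unpacked cached wheel so the .pyc files
# could be copied/cloned into environments alongside the sources.
cache_dir = Path.home() / ".cache" / "pip" / "unpacked" / "requests"  # hypothetical layout
compileall.compile_dir(str(cache_dir), quiet=1)
```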
FWIW, if someone wants to help move this forward, a prototype of this would be very welcome, and it should be relatively straightforward to implement with the installer library. The logic you'd need to implement would be to derive from its destination class and change how the files get written to disk. Having a cross-platform prototype of this would be a major piece in helping move this forward, since I reckon it's unlikely that one of pip's existing maintainers will have the bandwidth to explore this.
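If it helps anyone pick this up, here is a minimal sketch of driving the installer library, adapted from its documented example; a reflink-aware prototype would presumably subclass the destination and change how files are written. The wheel filename and the INSTALLER string are placeholders:

```python
import sys
import sysconfig

from installer import install
from installer.destinations import SchemeDictionaryDestination
from installer.sources import WheelFile

# Map wheel categories (purelib, platlib, scripts, data, ...) to real paths.
destination = SchemeDictionaryDestination(
    sysconfig.get_paths(),
    interpreter=sys.executable,
    script_kind="posix",
)

# Point this at a real wheel on disk.
with WheelFile.open("example-1.0-py3-none-any.whl") as source:
    install(
        source=source,
        destination=destination,
        additional_metadata={"INSTALLER": b"reflink-prototype 0.0.1"},
    )
```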
I did work on a proof of concept that tries to solve this issue in a slightly different way: it uses installer to implement a basic wheel installer that installs packages to a shared location. Made a post on the Python forums here if anybody would like to join the discussion.
What's the problem this feature will solve?
Creating a new virtual environment in a modern Python project can be quite slow, sometimes on the order of tens of seconds even on very high-end hardware, once you have a lot of dependencies. It also takes up a lot of space; my `~/.virtualenvs/` is almost 3 gigabytes, and this is a relatively new machine; and that isn't even counting my `~/.local/pipx`, which is another 434M.
Describe the solution you'd like
Rather than unpacking and duplicating all the data in wheels, pip could store the cache unpacked, so all the files are already on the filesystem, and then clone them into place on copy-on-write filesystems rather than copying them. While there may be other bottlenecks, this would also reduce disk usage by an order of magnitude. (My `~/Library/Caches/pip` is only 256M, and presumably all those virtualenvs contain multiple full, uncompressed copies of it!)
Alternative Solutions
You could get a similar reduction effect by setting up an import hook, using zipimport, or doing some kind of `.pth` file shenanigans, but I feel like those all have significant drawbacks.
Additional context
Given that platforms generally use shared memory maps for shared object files, if it's done right this could additionally reduce the memory footprint of Python interpreters in different virtualenvs with large C extensions loaded.