[perf] Automatically prefer Windows Dev Drive for temp files #12055

Open
1 task done
zooba opened this issue May 26, 2023 · 10 comments
Labels
S: needs triage · type: feature request

Comments

@zooba
Contributor

zooba commented May 26, 2023

What's the problem this feature will solve?

Windows Dev Drive is a new OS feature that allows users to create a high performance disk that's designed for development activities. The performance improvement is based on new optimisations to the file system driver, and separating it out from the default OS drive reduces other overheads.

It's perfect for storing temporary files. However, it was decided that it's not safe for Windows to redirect all TEMP accesses to a Dev Drive by default. Our testing showed that the improvements get significantly better when you do redirect them, and we believe apps that can switch ought to switch. Pip's temporary files and package cache featured heavily in our testing.

I was part of the design team for this feature, so happy to add as much more context as you'd like and as I'm able to. Obviously these are public statements, so I have to be careful about things that might be interpreted as promises and not merely dreams, hopes and ambitions 😉

Describe the solution you'd like

At a high level, I'd like to see pip installs targeting an environment on a Dev Drive also use that drive for temporary files, including the build environment.

I'm in the process of adding an os.path.isdevdrive function to CPython for 3.12, so would love to hear whether this is something you'd consider using, and whether a simple test like this suits your needs. [1]

This is the kind of logic I'd expect to see (around here):

root = None
# sys.prefix feels wrong here, but you're already using site.getsitepackages() in this
# code path, so I guess either we know we're in the target runtime or there's other
# code cleanup to do?
if hasattr(os.path, 'isdevdrive'):
    root = os.path.join(''.join(os.path.splitroot(sys.prefix)[:2]), '.pip')
    if not os.path.isdevdrive(root): # negative check to clear the overridden root
        root = None
path = os.path.realpath(tempfile.mkdtemp(prefix=f"pip-{kind}-", dir=root))

isdevdrive will work on the full path, but I strip back to the drive name first because it could change which drive is actually used (e.g. the project might be in a mounted directory). There's not really an efficient way to handle this case, and on balance it makes the most sense to just ignore the optimisation anyway (access through a mounted directory is typically slow).
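
A minimal sketch of the selection logic above, for readers who want something closer to runnable. It assumes Python 3.12+ for `os.path.isdevdrive` and `os.path.splitroot`, and falls back to the default temp location everywhere else. The helper names `pick_temp_root` and `make_temp_dir` and the `.pip` directory name are illustrative only and are not part of pip. Unlike the snippet above, it creates the directory before passing it to `tempfile.mkdtemp`, which requires `dir` to exist.

```python
import os
import tempfile


def pick_temp_root(target_dir, app_dir=".pip"):
    """Return a directory at the root of target_dir's drive if that drive
    is a Dev Drive; otherwise return None (meaning: use the default temp)."""
    isdevdrive = getattr(os.path, "isdevdrive", None)
    if isdevdrive is None:
        # Older Python, or a platform whose os.path lacks the function.
        return None
    # Strip back to the drive root, as in the snippet above.
    drive, root, _ = os.path.splitroot(os.path.abspath(target_dir))
    candidate = os.path.join(drive + root, app_dir)
    try:
        if not isdevdrive(candidate):
            return None
        os.makedirs(candidate, exist_ok=True)
    except (OSError, ValueError):
        # Unrecognizable drive, or no permission to create the directory:
        # quietly give up on the optimisation.
        return None
    return candidate


def make_temp_dir(kind, target_dir):
    """Create a pip-style temp directory, preferring the target's Dev Drive."""
    root = pick_temp_root(target_dir)  # None => tempfile's default location
    return os.path.realpath(tempfile.mkdtemp(prefix=f"pip-{kind}-", dir=root))
```

On a machine without a Dev Drive this behaves exactly like a plain `tempfile.mkdtemp` call, which is the point of the negative check.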

We'd also love to see the Dev Drive used for the various caches, and the perf benefits are solid. However, that seems a bit more complex, and might be better served by suggesting users set a global environment variable. There is already an os.listdrives() API that could be used to find a Dev Drive, but in the presence of multiple drives it's not really decidable which one they intend for caching (we think some people will create separate drives for each project, while others will have one big one).
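
To illustrate why drive discovery is ambiguous, here is a rough sketch that enumerates Dev Drives with `os.listdrives()` and `os.path.isdevdrive()` (both 3.12+; `os.listdrives` exists only on Windows). The function names are hypothetical, and returning a drive only when exactly one is found is just one possible policy, not a recommendation from this thread.

```python
import os


def find_dev_drives():
    """Return drive roots that report as Dev Drives (empty if unsupported)."""
    listdrives = getattr(os, "listdrives", None)       # Windows-only, 3.12+
    isdevdrive = getattr(os.path, "isdevdrive", None)  # 3.12+
    if listdrives is None or isdevdrive is None:
        return []
    try:
        drives = listdrives()
    except OSError:
        return []
    found = []
    for drive in drives:
        try:
            if isdevdrive(drive):
                found.append(drive)
        except (OSError, ValueError):
            continue  # skip drives that cannot be queried
    return found


def unambiguous_dev_drive():
    """Return the only Dev Drive, or None when there are zero or several."""
    drives = find_dev_drives()
    return drives[0] if len(drives) == 1 else None
```

With several Dev Drives present, `unambiguous_dev_drive()` deliberately returns `None` rather than guessing which one the user intends for caching.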

Obviously this can only light up in 3.12 and not earlier (unless you want to port the code into ctypes, which I'll understand if you don't bother). Dev Drive doesn't become widely available until the end of the year anyway, so 3.12 will also be available. Just another reason for people to upgrade! 😄

Alternative Solutions

An alternative would be to detect and suggest to users that since their code is on a Dev Drive, they should also manually override settings to store temporary files on it as well. I'm not sure this would be really valuable for pip users, but it's

Another alternative would be to merely document it. I'd be disappointed by this, because we really want the fact that a Dev Drive is being used to be the signal that dev-related files should be stored there, and we don't want to encourage users to set global environment variables for something like this. It also seems unlikely to be discovered by existing users, who probably aren't going back to read this documentation regularly.

Additional context

If you prefer video demonstrations: https://build.microsoft.com/en-US/sessions/7ed9bb72-b4f4-4490-9b26-911d1ac263d1?source=/home

The docs again: https://learn.microsoft.com/en-us/windows/dev-drive/

Right now it's available on the Dev Channel of Windows Insider, which means it is public and people are trying it out. pip's test suite was one of the key scenarios we were testing during development, and the reason it isn't mentioned in the promotional material is because the perf improvement was so big that it looked like a contrived scenario, so we decided to cut it 😆

Code of Conduct

Footnotes

  1. Obviously we're past the first beta, but I got an allowance to add it before beta 2 (next Tuesday!) because of NDAs surrounding the feature.

@zooba added the S: needs triage and type: feature request labels on May 26, 2023
@pfmoore
Member

pfmoore commented May 26, 2023

A very quick initial view - I don't think I'd like pip to get too low-level with this. We'd have to find a way of testing it, and our Windows testing is already slow enough that I don't relish adding too much to it. Also, we don't have the detailed Windows knowledge to maintain something like this - at least not based on the description you provide here.

My preference would be to enhance the stdlib tempfile module (or create an enhanced 3rd-party version that pip could vendor) to handle this. Presumably, because the decision about whether to use a dev drive comes down to "it depends", the module would need to take the appropriate arguments (the values it depends on), but that seems like a fair thing to do. This approach also makes the functionality available to users other than pip, which is also a benefit for pip, because we'll see other projects exercising the code as well as us.

The same sort of logic could be used for caches - if there's non-trivial logic needed to decide "where to put the cache", I'd rather see that built in a 3rd-party module that we could consume (replacing our current platformdirs-based code). The responsibility for getting that logic right and testing it would then be handled by the maintainers of that project, who would presumably have the necessary expertise.

Finally, I don't really want pip to be an "early adopter" here. Stability is critical for us, and implementing support for something that's brand new in both Windows and Python seems to go against that. Let's allow the OS and language support time to stabilise and get adopted by other projects before pip takes the plunge.

@notatallshaw
Member

notatallshaw commented May 26, 2023

What about running a test suite? Pip currently sets up a RAM Disk for running its Windows tests because file system performance is so poor, but it's janky and doesn't work sometimes. Could this be switched to a Dev Drive?

It would also give assurance that if Pip did switch to use this feature for temp files that it is something that's reliable.

@zooba
Contributor Author

zooba commented May 26, 2023

My preference would be to enhance the stdlib tempfile module (or create an enhanced 3rd-party version that pip could vendor) to handle this

Yeah, I would love to be able to do that. But unfortunately, we can't predict whether callers are using it as a persistent store or not, and if they are, this will break them. So it'd be a new API that is something like:

def get_tempdir_for_path(path, app_distinguisher):
    if hasattr(os.path, 'isdevdrive'):
        new_tempdir = os.path.join(''.join(os.path.splitroot(sys.prefix)[:2]), '.' + app_distinguisher)
        if os.path.isdevdrive(new_tempdir):
            os.mkdir(new_tempdir)
            return new_tempdir
    return _default_tempdir

Would you really vendor that much code? And the only change likely to occur is where on the drive it gets put, which is a breaking change for all the same reasons, so the library wouldn't be able to do it even though pip could handle the change just fine.

The same sort of logic could be used for caches...

Caches are more complicated because they're persistent, but not essential. So while we can change the location, users will notice and potentially care about the extra resource usage - they certainly notice and care when the cache fails today. I think the best logic here is going to be telling users to set their own global configuration (and at most, add an info message if it's detected that they have their cache on a slow drive, but I know how popular new messages are with pip users so I get you won't want to do even that much).

What about running a test suite? Pip currently sets up a RAM Disk for running its Windows tests because file system performance is so poor, but it's janky and doesn't work sometimes. Could this be switched to a Dev Drive?

I sure hope so! Like I said, we were benchmarking the pip test suite as a representative scenario for Python devs, but the perf gains were so unreasonably good that we didn't want to announce them.

The risk isn't in the file system itself - all the technology is decades old. It's more about helping users take advantage of it, rather than only discovering it on random blogs when they eventually google why pip is so slow on Windows. But breaking user's workflows is often worse than a technology break, so we've got to approach it carefully.

@pfmoore
Member

pfmoore commented May 26, 2023

What about running a test suite? Pip currently sets up a RAM Disk for running its Windows tests because file system performance is so poor, but it's janky and doesn't work sometimes. Could this be switched to a Dev Drive?

One reason the RAM disk is janky is that we don't have much expertise - I've no idea who would be able to maintain the script that creates it, so we just sort of hope it doesn't break much... Will whatever's needed to set up a dev drive be any better?

But unfortunately, we can't predict whether callers are using it as a persistent store or not, and if they are, this will break them.

I'm not clear what that means, or why your code depends on the drive where the environment interpreter is located. How would this work, for example, with something like --target or --prefix? So how would I maintain this code, if (for example) I added a new option that used a different install location, or installed into a different Python interpreter?

Let me ask a different question. As an existing Python user, with my Python installation and development environments set up, how would I have a dev drive in the first place? What would I need to do, how would it alter my workflow, and (ignoring the benefits you are suggesting this change would add to pip) what would I gain from doing this? I ask because if I need to do a bunch of configuration to even use the feature, why would I not also simply set $env:TEMP = somewhere on the dev drive when I need it? I don't know whether I would even be "using it as a persistent store" so I don't know how to assess this statement, or what breakage we're talking about here.

Would you really vendor that much code?

I don't really mind, TBH. The size overhead of a vendored package isn't that important. Sure, there are shades of "left pad" about such a small module, but that's more a problem for whoever maintains it to struggle with 🙂 Or maybe the amount of code suggests that the os.path.isdevdrive API wasn't the right abstraction for the stdlib, and there should have been something a little higher level (tempfile.temp_location_for(path) seems like a good location for this in the stdlib).

But what I really want to vendor is the support burden.

I'm going to assume your proposed code was off-the-cuff, because in trying to understand it I found 2 bugs (the path argument is never used, and os.path has no splitroot function). I was trying to understand it to give examples of the sorts of concern I'd rather a 3rd party handles, but I think if I did that we'd just get bogged down in side issues. My point was basically that I fully expect the code to grow (unanticipated error handling needs, if nothing else) and the pip maintainers aren't in the best position to do that, or to test such changes (even triggering error conditions for something like this would be problematic).

@zooba
Contributor Author

zooba commented May 26, 2023

Will whatever's needed to set up a dev drive be any better?

Hopefully CI systems will just set it up by default. So as long as you're using the $(Build...) drives for temporary files rather than the OS drive (which you should be doing anyway), you'll get all the benefits in CI for free.

I'm not clear what that means, or why your code depends on the drive where the environment interpreter is located. How would this work, for example, with something like --target or --prefix?

The code in pip I linked to seems to assume it's already running in the correct interpreter (it uses site.getsitepackages()), but otherwise simply using the directory it's going to install into would be fine. That's the signal that the user is intentionally using a Dev Drive.

Let me ask a different question. As an existing Python user, with my Python installation and development environments set up, how would I have a dev drive in the first place?

The two links I provided show the process. And it's going to be prominent in the new Dev Home app (shown in the video), so people will be led to create it that way.

I don't know whether I would even be "using it as a persistent store" so I don't know how to assess this statement, or what breakage we're talking about here.

The persistent store is your cache. If you regularly install the same packages, you'll be used to them being in the cache, and so they aren't downloaded. But if pip keeps changing the cache location based on the target install location, users will keep getting cache misses. I presume they won't like it.

The actual breakage would be for apps that assume that temp is "stable enough" for them to rely on, or that it's entirely private to their user account, or that it's going to be exactly the same path for every arbitrary app. None of this is strictly true today, but it's a good enough approximation for many to assume it. Changing TEMP out from under these apps would break them. I don't believe pip is affected by this, it's just the explanation for why Windows doesn't redirect TEMP entirely as soon as you add a Dev Drive.

Or maybe the amount of code suggests that the os.path.isdevdrive API wasn't the right abstraction for the stdlib, and there should have been something a little higher level (tempfile.temp_location_for(path) seems like a good location for this in the stdlib).

The problem with going higher level is now it has to be well defined and logical for all platforms. We can still do that for 3.13 (it would never have made 3.12), but a platform-specific API in os can squeeze in for 3.12 and at least enable experimentation or simple checks.

I'm going to assume your proposed code was off-the-cuff, because in trying to understand it I found 2 bugs (the path argument is never used, and os.path has no splitroot function).

Apparently os.path.splitroot is new in 3.12, and yes, I copied the earlier code and forgot to replace sys.prefix with path. But it was only meant to be illustrative, not copy-pasted into the sources and immediately released 😉

@dstufft
Member

dstufft commented May 26, 2023

A very quick couple of thoughts, and taking the idea that getting things onto a dev drive is a good thing for pip users at face value:

Using a temp directory for building/unpacking co-located at least on the same drive as the target environment seems like a good thing to do regardless of whether it's a dev drive or not? I don't know windows filesystem semantics very well, but I assume that shutil can do things in a more optimized way when we're on the same drive rather than cross drives. Is there a reason to treat dev drives as special here?

I suspect the answer for storing caches on a dev drive is going to involve slowly phasing it in over time. Perhaps something like:

  1. Start by just documenting explicitly that putting your pip cache onto a dev drive is (or can be?) a good idea, and how to do it using --cache-dir.
  2. Add an opt-in flag that will automatically put the cache dir onto a dev drive if one is available [1].
    • From a practical point of view, this feels like something that would be best exposed in platformdirs, where pip currently gets the cache directory from, maybe something like platformdirs.user_cache_dir(..., devdrive=True). This means that pip doesn't have to make decisions about where the cache is located on a dev drive, so there ends up being a standard (or it would be really great if Microsoft just recommended a standard).
  3. Start detecting that a dev drive is available, and recommend trying out the flag in (2) to people.
  4. Do the typical swap the default, wait a long while, deprecate the flags dance.

A slow rollout means that pip isn't on the bleeding edge here, unless users explicitly try to be, and it provides many options for getting off or pausing the rollout as problems are discovered that can be reported back to Microsoft or platformdirs.
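
As a small illustration of step (1), this sketch builds a pip invocation that points the cache at an explicit directory through pip's existing `--cache-dir` option. The helper name `pip_install_command` and the `D:\pip-cache` path are made up for the example. The automatic placement in step (2) is not shown, because no such flag exists yet.

```python
import os
import sys


def pip_install_command(requirements, cache_dir=None):
    """Build an argv list for `pip install`, optionally overriding the cache."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if cache_dir is not None:
        # --cache-dir is pip's documented option for relocating the cache.
        cmd += ["--cache-dir", os.fspath(cache_dir)]
    return cmd + list(requirements)


# Hypothetical cache location on a Dev Drive mounted as D:
example = pip_install_command(["requests"], cache_dir=r"D:\pip-cache")
```

The same override can be made persistent with the `PIP_CACHE_DIR` environment variable or `pip config set global.cache-dir <path>`, so users can try the Dev Drive location without any change to pip.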

I dunno though, that seems reasonable to me?

pip's test suite was one of the key scenarios we were testing during development, and the reason it isn't mentioned in the promotional material is because the perf improvement was so big that it looked like a contrived scenario, so we decided to cut it

Can you share those results here?

Footnotes

  1. This could be bundled with (1), but splitting it out lets us onboard it at a slower pace, and provides an easier back out if people start coming to pip complaining of problems.

@pfmoore
Member

pfmoore commented May 26, 2023

Hopefully CI systems will just set it up by default.

GitHub Actions? I don't know what the $(Build) drive is. We don't use it in our ci.yml. And our test suite creates many virtual environments, themselves in the temporary directory, so what does that mean? You're welcome to review our test suite if you want. (But don't, it'll make you cry 😉)

The two links I provided show the process.

Sorry, I had avoided the video (I don't like watching videos for this sort of thing) and hadn't got round to looking at the docs. Having now done so, I doubt I'd use this, as I hate partitioning my main drive. The reasons may be irrational, but I've always had enormous frustration trying to have my development on a drive other than my OS/tools.

Of course, if there were an option to create a fast filesystem just for temporary files, that everything would use transparently, I'd be all for it 🙂

So with that said, I now get the process a bit better, but personally I think that this comes under the heading of "if the user has to make deliberate workflow changes to use this feature, asking them to opt into using it for temp files as well isn't that bad". For example, I'd be OK with adding a pip option (which can be configured using an environment variable or per-environment config setting) to set a non-standard temp directory. We already have an option for the cache, of course, and this is noted in the dev drive documentation.

And it's going to be prominent in the new Dev Home app (shown in the video)

Is there a non-video description of what that is? Slides, or less ideally a summary document? Describing it as an app makes me think of an IDE, or some sort of GUI development tool. I don't know how pip will fit into such an environment, except in the sense of it being a different host for a terminal.

The actual breakage would be for apps that assume that temp is "stable enough" for them to rely on, or that it's entirely private to their user account, or that it's going to be exactly the same path for every arbitrary app. None of this is strictly true today, but it's a good enough approximation for many to assume it. Changing TEMP out from under these apps would break them. I don't believe pip is affected by this, it's just the explanation for why Windows doesn't redirect TEMP entirely as soon as you add a Dev Drive.

Cool. But again, that suggests to me that many people could just redirect TEMP, they just need to review their apps first and make the decision that it's OK. Or redirect TEMP in their development shell sessions.
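
A rough sketch of the "redirect TEMP only for a development session" idea: the temp variables are overridden for one child process and the parent environment is left alone. It relies on the documented behaviour that `tempfile` consults `TMPDIR`, `TEMP` and `TMP`. The helper names are illustrative.

```python
import os
import subprocess
import sys


def env_with_temp(temp_dir, base_env=None):
    """Copy an environment and point every temp variable at temp_dir."""
    env = dict(os.environ if base_env is None else base_env)
    for name in ("TMPDIR", "TEMP", "TMP"):  # the names tempfile checks
        env[name] = os.fspath(temp_dir)
    return env


def child_tempdir(temp_dir):
    """Ask a child Python process which temp directory it would use."""
    result = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env_with_temp(temp_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing the same `env` to a `python -m pip install ...` subprocess would place pip's temporary build directories under `temp_dir`, provided that directory exists and is writable.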

The problem with going higher level is now it has to be well defined and logical for all platforms. We can still do that for 3.13 (it would never have made 3.12), but a platform-specific API in os can squeeze in for 3.12 and at least enable experimentation or simple checks.

I really don't have any problem with holding off on supporting this in pip until core Python support for it has matured (i.e., 3.13 or later), or we're getting so many users adopting it and needing pip to support it that we have a solid user base to collect actual use cases and requirements from.

Basically, it's a pretty strong -1 from me on rushing this. We're not like CPython where release cycles are a year long and people remain on older versions. From deciding to ship a feature in pip to having it in most users' hands can happen in 3 months or less, so we don't need to plan for adoption of dev drives, we can wait for it to happen and react afterwards.

@pfmoore
Member

pfmoore commented May 26, 2023

Add an opt in flag that will automatically put the cache dir onto a dev drive if one is available

One final comment and then I'll shut up. There seems to be a confusion between "if a dev drive is available" and "if the user has selected a dev drive" (or however @zooba characterised the user "opting in" - sorry, I can't find the reference). I can imagine someone deliberately creating a dev drive just for temp files. Why would we not use that if possible? And how would we find it, in any case, other than by the user setting TEMP? I don't know that deliberately ignoring the user's explicit choice of a temp drive is the right thing either (after all, it's what we do right now with the RAM disk in the test suite...)

I should also say that while I'm being cautious in my response to the Windows feature itself (as opposed to special-casing it in pip), I do think it sounds really cool. I just don't know whether it'll fit my workflow, or how I'll adopt it. Again, a good reason to wait until it's in use by developers before committing to how we support it in pip, rather than basing our approach on Microsoft's expectations [1].

By the way, this is ReFS, if I'm reading the docs right. There was an issue somewhere (found it - #11092) which suggested people could get significant benefits using reflinks, which are available in ReFS. Maybe we'd do better looking at reviving that discussion, if dev drives mean that more Windows developers will be using ReFS in the relatively near future?

Footnotes

  1. No matter how well-informed they might be thanks to @zooba's involvement 🙂

@zooba
Contributor Author

zooba commented May 27, 2023

Using a temp directory for building/unpacking co-located at least on the same drive as the target environment seems like a good thing to do regardless of whether it's a dev drive or not? I don't know windows filesystem semantics very well, but I assume that shutil can do things in a more optimized way when we're on the same drive rather than cross drives. Is there a reason to treat dev drives as special here?

Yes, it's generally a good thing. (I had a proposal somewhere to unpack wheels alongside the target location and then rename into place when it succeeds, which is even better in that case, but doesn't help with creating a temporary environment for building.) I think it's just a case of "is it okay to drop a shared directory (on this drive/in this source/env/project directory)?" My own preference (channeling generic-pip-users more than myself) is no, unless the user has explicitly signalled that it's okay.

Creating a Dev Drive is an explicit signal by the user, it's just one that pip isn't aware of. But I think it's a strong enough signal that if they're also installing to that drive, it's okay to use it for temporary files as well.
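
The "unpack alongside the target, then rename into place" idea mentioned above could look roughly like this. It is a sketch under stated assumptions, not pip's implementation: `install_via_rename` and the `populate` callback are hypothetical, the target is assumed not to exist yet, and the staging directory is created in the target's parent so the final rename stays on one volume.

```python
import os
import shutil
import tempfile


def install_via_rename(populate, target):
    """Fill a staging directory next to `target`, then rename it into place.

    `populate(staging_dir)` writes the files. If it raises, the staging
    directory is removed and `target` is never created.
    """
    target = os.path.abspath(target)
    # Staging lives in the same parent directory, hence on the same volume.
    # Note: mkdtemp creates it private to the current user on POSIX.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=os.path.dirname(target))
    try:
        populate(staging)
        os.rename(staging, target)  # a rename, not a copy
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return target
```

Because nothing appears at `target` until the rename succeeds, a failed unpack leaves no partial install behind. As noted above, this does not help with creating a temporary build environment.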

I suspect the answer for storing caches on a dev drive is going to involve slowly phasing it in over time ...

Generally agree with everything you say here, though I'm not sure whether detecting a Dev Drive will ever be a good idea. It's possible (os.listdrives() exists in 3.12), but it's possible to have many of them, and I don't think guessing which one to use is a good idea. The exception is when the user is already running on one/installing to one/etc.

GitHub Actions? I don't know what the $(Build) drive is. We don't use it in our ci.yml. And our test suite creates many virtual environments, themselves in the temporary directory, so what does that mean?

My generic $(Build) is because I don't remember what the context variables GitHub provides are called, but there'll be a variable for "directory to store build stuff in" which will be on a fast drive. Unless %TMP% is updated automatically (it might be?), it's going to be pointing at a slow drive. Considering we got much better results from your test suite when redirecting both TMP and having the sources on a Dev Drive, I expect you're probably suffering right now in CI. (And sorry, I can't get specific with the numbers. 😞 )
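
One hedged way to check whether a CI job's temp directory sits on the same volume as the checkout is to compare `st_dev` from `os.stat`. This assumes `st_dev` distinguishes volumes on the platform in question, and the helper names are illustrative.

```python
import os
import tempfile


def same_volume(path_a, path_b):
    """Return True if both paths report the same device identifier."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def temp_is_colocated(source_dir):
    """Return True if the default temp directory shares a volume with source_dir."""
    return same_volume(tempfile.gettempdir(), source_dir)
```

A `False` result from `temp_is_colocated(os.getcwd())` in CI suggests that temporary files are going to a different drive than the sources, which is the situation described above.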

Sorry, I had avoided the video (I don't like watching videos for this sort of thing)

Fair enough, I normally avoid the videos too. I only watched this one to make sure my idea of what a good demo session looks like is still calibrated - some of my mentees right now are getting into live demoing.

Basically, it's a pretty strong -1 from me on rushing this. We're not like CPython where release cycles are a year long and people remain on older versions. From deciding to ship a feature in pip to having it in most users' hands can happen in 3 months or less, so we don't need to plan for adoption of dev drives, we can wait for it to happen and react afterwards.

Agreed on not rushing it in pip. The main urgency was whether the API I'm "rushing" into CPython would work, to make sure I don't put the wrong thing in (it's not been rushed in terms of implementation, just a short period having it out in public). There's only so much value in having any support in pip prior to 3.12's release (though it would be nice to have the default pip included with 3.12 be as fast as it can be).

I can imagine someone deliberately creating a dev drive just for temp files. Why would we not use that if possible? And how would we find it, in any case, other than by the user setting TEMP?

Yeah, it's possible. Really, it's just the design/intention of the feature to be a workspace, not a specialised storage type. So the intended design really is that the user would use it, and tools being used on it will know about the files they're accessing. Searching all drives to see if they are a Dev Drive is really only for providing information, and I don't think it's appropriate for what we're discussing here. If a user wants some files to go there, but keep their workspace somewhere else, they're "misusing" the feature and can set TEMP themselves.

I just don't know whether it'll fit my workflow, or how I'll adopt it.

Out of genuine interest, how does your workflow around storing code look today? You can contact me offline if you prefer, and I'm not trying to persuade you of anything (though I might use it to help the team make the marketing of this more persuasive in general). Personally I switched to using a separate virtual disk a couple of years ago when we first started talking about this and after a bit of adjustment (from C:\Projects to X:\ because... I picked X: 😄 ) it's been fine.

We also added an "automatically mount virtual disk at startup" feature as part of this (used when you create a virtual Dev Drive rather than repartitioning an actual disk), which I'm definitely looking forward to.

By the way, this is ReFS, if I'm reading the docs right. There was an issue somewhere (found it - #11092) which suggested people could get significant benefits using reflinks, which are available in ReFS.

Yep, though we're aiming to make as many changes under the covers as possible. So hopefully the standard CopyFile APIs will be able to do it transparently, and we'll switch shutil.copy2 to use those. So if you're already copying that way, you could well get it for free, which is my preference. (Though my stronger preference is to avoid copies and extract directly into the target location (with renames for rollback/etc.).)

@samypr100

there'll be a variable for "directory to store build stuff in" which will be on a fast drive.

@zooba I recently moved uv's CI to use a VHDX + ReFS and the speed gains are quite notable. It didn't seem like GitHub Actions were providing Dev Drives out of the box yet, so I ended up creating a GitHub Action, samypr100/setup-dev-drive, that does that for you, based on my experience implementing this in uv.

Not sure if it will be any help here at this point given the date on this thread, but hopefully it can make it easier for others wanting to try it out.
