File list performance with symlinks on Windows #633
AFAIK, this is technically incorrect. See this comment here: https://github.com/BurntSushi/same-file/blob/master/src/win.rs#L19-L58
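For context on that caveat: on Windows, a file's identity (volume serial number plus file index) is only guaranteed stable while at least one handle to the file is open, so comparing cached raw IDs can go wrong once all handles are closed. A minimal sketch of the handle-based comparison same_file provides (the paths are placeholders):

```rust
use same_file::Handle;
use std::io;

fn main() -> io::Result<()> {
    // A Handle keeps the underlying file open, which pins its
    // (volume serial, file index) identity for the Handle's lifetime.
    let dir = Handle::from_path(r"C:\some\dir")?;
    let via_link = Handle::from_path(r"C:\some\link")?;
    // Handles compare equal iff they refer to the same underlying file.
    println!("same underlying directory: {}", dir == via_link);
    Ok(())
}
```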
Maybe the cache could keep the directories open to avoid the filesystem reusing the index numbers. That shouldn't be too expensive, because for each 'cursor' (not sure if it's all threads) in the traversal you'd only need to keep the parent directories open, which should be few.
I have an attempt at fixing this here: BurntSushi/walkdir@5bcc5b8. It's in the directory traversal library, so I still need to pull it into ripgrep.
Broadly speaking, this commit is an attempt to fix this issue: BurntSushi/ripgrep#633 It was reported that symlink checking was taking a long amount of time, and that one possible way to fix this was to reduce number of times a file descriptor is opened. In this commit, we amortize opening file descriptors by keeping a file handle open for each ancestor in the directory tree. We also open a handle for the candidate file path at most once, instead of once every iteration. Note that we only perform this optimization on Windows, where opening a file handle seems inordinately expensive. In particular, this now causes us to potentially open more file descriptors than the limit set by the user, which only happens when following symbolic links. We document this behavior.
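A rough sketch of the amortization described in that commit; `Ancestor` and `check_loop` are illustrative names, not walkdir's actual API:

```rust
use same_file::Handle;
use std::io;
use std::path::{Path, PathBuf};

/// One entry per ancestor of the walker's current position. Keeping the
/// handle open both avoids re-opening the directory on every loop check
/// and prevents the OS from recycling its file index in the meantime.
struct Ancestor {
    path: PathBuf,
    handle: Handle,
}

/// Report an error if `child` is the same directory as any ancestor,
/// i.e. if following the symlink would loop.
fn check_loop(ancestors: &[Ancestor], child: &Path) -> io::Result<()> {
    // Open the candidate once, instead of once per ancestor.
    let child_handle = Handle::from_path(child)?;
    for ancestor in ancestors {
        if child_handle == ancestor.handle {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!(
                    "symlink loop: {} points back to {}",
                    child.display(),
                    ancestor.path.display()
                ),
            ));
        }
    }
    Ok(())
}
```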
This should be fixed in ripgrep 0.7.0. I didn't actually test that it resolves the performance problem, but the symlink loop checking should be opening far fewer file handles. We can definitely re-open this if it's still a problem (and having an easy-to-use test corpus to try this on would be a huge help, e.g., what steps can I take with `cnpm` to reproduce this?).
Unfortunately, I had to partially revert the optimization for the parallel iterator, which is used by default in ripgrep. The problem is that hanging on to file handles in the parallel iterator can cause file descriptor limits to be exceeded quite easily, since many directories are being traversed in parallel. I say "partially" revert because we are at least keeping the same child handle open while walking up the ancestor path, but unfortunately, we are still opening a handle to each ancestor in that loop. If ripgrep is run with a single thread (`-j1`), then the full optimization applies.
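To make that trade-off concrete, here is a hypothetical sketch of the parallel walker's check after the partial revert (not ripgrep's actual code): the candidate's handle is opened once and reused, but each ancestor handle is opened and dropped inside the loop, keeping the number of simultaneously open descriptors small.

```rust
use same_file::Handle;
use std::io;
use std::path::Path;

fn check_loop_parallel(child: &Path) -> io::Result<()> {
    // Opened once and reused for every comparison.
    let child_handle = Handle::from_path(child)?;
    // `ancestors()` yields `child` itself first, so skip it.
    for ancestor in child.ancestors().skip(1) {
        // Opened afresh each iteration and dropped at the end of it.
        if child_handle == Handle::from_path(ancestor)? {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("symlink loop at {}", ancestor.display()),
            ));
        }
    }
    Ok(())
}
```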
Great, thanks @BurntSushi! When testing it I see it panicking after a few (maybe unrelated) errors:
@chrmarti Thanks! A few questions:
This is from running 1aec4b1. A simple reproducible case on my machine is:
The full output with this and
BTW, would it be an option to say that duplicate/missing results are unlikely and not fatal when you don't keep the files open while comparing ids? Maybe that could be enabled by a command line flag?
@chrmarti That thought has crossed my mind, and it's a tempting one. It's a tough balance to strike, and I'd rather not make something like this configurable.
@chrmarti It looks like your comment containing the panic isn't actually related to symlinks, but rather, the length of the path name. As an example, I tried running this in
This told me the directory couldn't be found. I then removed the
This showed me the contents of the directory, which included a
My understanding is that the
I did a little more investigation in the
And now running against
So I compared ripgrep's time to traverse the directory with parallelism enabled (because the single-threaded walker has the bug you found) with the time it takes for
So here's my summary:
cc @roblourens |
w.r.t. long file names, @retep998 claims that I can enable a switch in a manifest to turn on long file name support in Windows 10, but I don't understand how to do that. That would certainly be easier than doing path conversion manually. |
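(For what it's worth: on Windows 10 version 1607 and later, an application can declare itself long-path aware via its manifest, and the `LongPathsEnabled` registry policy must also be turned on for that to take effect. The manual alternative is prefixing absolute paths with `\\?\`; a hypothetical helper, not ripgrep's actual code:)

```rust
use std::ffi::OsString;
use std::path::{Path, PathBuf};

/// Hypothetical helper: prefix an absolute path with `\\?\` so that
/// Windows file APIs accept it beyond the legacy 260-character
/// MAX_PATH limit. Extended-length paths must be absolute, use
/// backslashes, and contain no `.`/`..` components.
fn to_extended_length(path: &Path) -> PathBuf {
    if path.as_os_str().to_string_lossy().starts_with(r"\\?\") {
        return path.to_path_buf(); // already extended-length
    }
    let mut s = OsString::from(r"\\?\");
    s.push(path.as_os_str());
    PathBuf::from(s)
}

fn main() {
    let p = Path::new(r"C:\very\deep\node_modules\tree");
    assert_eq!(
        to_extended_length(p).to_string_lossy(),
        r"\\?\C:\very\deep\node_modules\tree"
    );
}
```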
@BurntSushi Makes sense, thanks for the detailed explanation!
Wow. I'm guessing this is because of duplicates in the node_modules tree. Common dependencies are probably symlinked in multiple times, everywhere they are required. |
This fixes a bug in walkdir that happened on Windows when following symlinks. It was triggered when opening a handle to the symlink failed. In particular, this resulted in the two stacks in the walkdir iterator getting out of sync. At some point, this tripped a panic when popping from one stack would be fine but popping from the other failed because it was empty. We fix this by pushing to both stacks only when both pushes would succeed. This bug was found via ripgrep. See: BurntSushi/ripgrep#633 (comment)
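A hypothetical sketch of the invariant that commit restores; `DirList` and `read_dir_list` are stand-ins for walkdir's internal state, not its real API:

```rust
use same_file::Handle;
use std::io;
use std::path::Path;

// Stand-ins for the walker's real per-directory state.
struct DirList; // pending entries of one directory level
fn read_dir_list(_path: &Path) -> io::Result<DirList> {
    Ok(DirList)
}

/// The walker keeps two parallel stacks that must grow and shrink in
/// lockstep; the fix is to perform both fallible operations before
/// mutating either stack, so a failure can't leave them out of sync.
fn push_dir(
    work: &mut Vec<DirList>,
    handles: &mut Vec<Handle>,
    path: &Path,
) -> io::Result<()> {
    let list = read_dir_list(path)?;
    let handle = Handle::from_path(path)?;
    // Only now, with both successes in hand, push to both stacks.
    work.push(list);
    handles.push(handle);
    Ok(())
}
```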
@chrmarti All right, I've fixed the panic. I'm going to mark this issue as fixed since I think all issues that I'm aware of are either fixed (the panic is fixed and I think symlink performance is reasonable) or tracked elsewhere (e.g., #364 for long path names). I'd be happy to revisit the symlink performance issue if there's compelling evidence that we can do measurably better. A good example there might be showing the execution of another tool that can 1) successfully traverse the same directory tree while following symlinks and 2) do so measurably faster than ripgrep.
I am using VS Code and the latest version has a performance bug which seems to be related to this one. I downgraded to rg version
@warpdesign Thanks for the tip! Unfortunately, it doesn't really help; we need more details. Could you please open a new bug and fill in the issue template? Please remember that this is the ripgrep repo, and bugs should be reported against ripgrep using `rg` directly, not through VS Code.
I'm investigating a report that listing files (`--files`) on Windows when following symlinks (`--follow`) takes much more time than without symlinks.

It appears that in some cases `cnpm` (an `npm` replacement) uses symlinks to organize the `node_modules` folder, and that this then slows down ripgrep. (microsoft/vscode#35659 (comment) has an example `package.json` that, when installed using `cnpm i`, reproduces the problem.)

Looking at the trace in Process Monitor, I see many files being opened for reading metadata in `DirEntryRaw::from_link` and to detect loops in same_file's `Handle::from_path`.

One area where this might be improved is where `check_symlink_loop()` loops over the parent directories of each symlinked directory, causing `is_same_file` to open the same parent directory once for each of its symlinked subdirectories. Maybe a directory's id could be cached for the duration of its content being traversed?

/cc @roblourens
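A sketch of what that caching suggestion could look like; the names are hypothetical, and (per the first reply above) the cache holds the open `Handle` itself rather than a raw file ID, since an ID is only stable while some handle to the file stays open:

```rust
use same_file::Handle;
use std::collections::hash_map::{Entry, HashMap};
use std::io;
use std::path::{Path, PathBuf};

/// Caches one open handle per parent directory for as long as its
/// contents are being traversed, so loop checks don't re-open the same
/// parent once per symlinked subdirectory.
struct DirIdCache {
    cache: HashMap<PathBuf, Handle>,
}

impl DirIdCache {
    fn new() -> DirIdCache {
        DirIdCache { cache: HashMap::new() }
    }

    /// Returns the cached handle for `dir`, opening it on first use.
    fn handle(&mut self, dir: &Path) -> io::Result<&Handle> {
        match self.cache.entry(dir.to_path_buf()) {
            Entry::Occupied(e) => Ok(e.into_mut()),
            Entry::Vacant(e) => Ok(e.insert(Handle::from_path(dir)?)),
        }
    }

    /// Called when `dir`'s traversal finishes, releasing the handle.
    fn evict(&mut self, dir: &Path) {
        self.cache.remove(dir);
    }
}
```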