Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

match_files() is not a pure generator function, and it impacts tree_*() gravely #52

Closed
orens opened this issue Oct 19, 2021 · 1 comment

Comments

@orens
Copy link

orens commented Oct 19, 2021

Hey @cpburnz , thanks for the great lib!
In match_files() (https://github.com/cpburnz/python-path-specification/blob/c00b332b2075548ee0c0673b72d7f2570d12ffe6/pathspec/pathspec.py#L170), the line

file_map = util.normalize_files(files, separators=separators)

(L190) requires files to be completely exhausted before even the first file is matched. If files is a list-like, this is not a problem, but when calling it from the tree_*() methods it means that the whole iterator mechanics is pretty much useless.
It also means that if I have an ignored folder containing a very complex structure, which I want pathspec to ignore, pathspec will search through it although there is no way it will play a role in the results.

As an example, for an automation I'm writing on a real life repository containing a frontend application, the scan of npm generated files took about 10 minutes (before yielding the first result) and then I gave up and stopped it.

I think a possible solution is to remove this dictionary and simply doing:

for file in files:
  if util.match_file(self.patterns, util.normalize_file(file)):
    yield file

(I bypassed util.match_files() here as it, too, is not a generator and will try to convert files to list first)

@cpburnz
Copy link
Owner

cpburnz commented Aug 18, 2022

This is fixed and will be in v0.10.0 when it is released. All of the PathSpec.match_*() methods now properly use files as an iterator without exhausting it before hand, and yields results as the iterator is consumed (i.e., basically what your example does). I also improved the util.iter_tree_*() functions (used by PathSpec.match_tree_*()) to use os.scandir() which is supposed to be more efficient than os.listdir() with large directories.

Yeah, util.match_files() doesn't have a useful signature because it returns a set rather than iterator.

@cpburnz cpburnz closed this as completed Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants