-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Path.rglob performance issues in deeply nested directories compared to glob.glob(recursive=True) #102613
Comments
At a depth of 400, on my system, 50% of the time is spent in |
Add a keyword-only *follow_symlinks* parameter to `pathlib.Path.glob()` and `rglob()`, defaulting to false. When set to true, symlinks to directories are followed as if they were directories. Previously these methods followed symlinks except when evaluating "`**`" wildcards; on Windows they returned paths in filesystem casing except when evaluating non-wildcard tokens. Both these problems are solved here. This will allow us to address pythonGH-102613 and pythonGH-81079 in future commits.
I believe that the following PRs will need to land before we can properly address this issue: |
It occurs to me that we only need to de-duplicate results if two or more
Nevernevermind, I was right! |
My plan here is:
|
Stop de-duplicating results in `_RecursiveWildcardSelector`. A new `_DoubleRecursiveWildcardSelector` class is introduced which performs de-duplication, but this is used _only_ for patterns with multiple non-adjacent `**` segments, such as `path.glob('**/foo/**')`. By avoiding the use of a set, `PurePath.__hash__()` is not called, and so paths do not need to be parsed and (case-) normalised. Also merge adjacent '**' segments in patterns.
Update to my previous plan:
|
Stop de-duplicating results in `_RecursiveWildcardSelector`. A new `_DoubleRecursiveWildcardSelector` class is introduced which performs de-duplication, but this is used _only_ for patterns with multiple non-adjacent `**` segments, such as `path.glob('**/foo/**')`. By avoiding the use of a set, `PurePath.__hash__()` is not called, and so paths do not need to be stringified and case-normalised. Also merge adjacent '**' segments in patterns.
…nGH-104244) Stop de-duplicating results in `_RecursiveWildcardSelector`. A new `_DoubleRecursiveWildcardSelector` class is introduced which performs de-duplication, but this is used _only_ for patterns with multiple non-adjacent `**` segments, such as `path.glob('**/foo/**')`. By avoiding the use of a set, `PurePath.__hash__()` is not called, and so paths do not need to be stringified and case-normalised. Also merge adjacent '**' segments in patterns.
* main: (47 commits) pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188) pythonGH-104308: socket.getnameinfo should release the GIL (python#104307) pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311) pythongh-99113: A Per-Interpreter GIL! (pythongh-104210) pythonGH-104284: Fix documentation gettext build (python#104296) pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251) pythongh-104223: Fix issues with inheriting from buffer classes (python#104227) pythongh-99108: fix typo in Modules/Setup (python#104293) pythonGH-104145: Use fully-qualified cross reference types for the bisect module (python#104172) pythongh-103193: Improve `getattr_static` test coverage (python#104286) Trim trailing whitespace and test on CI (python#104275) pythongh-102500: Remove mention of bytes shorthand (python#104281) pythongh-97696: Improve and fix documentation for asyncio eager tasks (python#104256) pythongh-99108: Replace SHA3 implementation HACL* version (python#103597) pythongh-104273: Remove redundant len() calls in argparse function (python#104274) pythongh-64660: Don't hardcode Argument Clinic return converter result variable name (python#104200) pythongh-104265 Disallow instantiation of `_csv.Reader` and `_csv.Writer` (python#104266) pythonGH-102613: Improve performance of `pathlib.Path.rglob()` (pythonGH-104244) pythongh-103650: Fix perf maps address format (python#103651) pythonGH-89812: Churn `pathlib.Path` methods (pythonGH-104243) ...
Use `Path.walk()` to implement the recursive wildcard `**`. This method uses an iterative (rather than recursive) walk - see pythonGH-100282.
Use `Path.walk()` to implement the recursive wildcard `**`. This method uses an iterative (rather than recursive) walk - see GH-100282.
This commit replaces selector classes with selector functions. These generators directly yield results rather calling through to their successor. A new internal `Path._glob()` takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance.
* main: pythonGH-104510: Fix refleaks in `_io` base types (python#104516) pythongh-104539: Fix indentation error in logging.config.rst (python#104545) pythongh-104050: Don't star-import 'types' in Argument Clinic (python#104543) pythongh-104050: Add basic typing to CConverter in clinic.py (python#104538) pythongh-64595: Fix write file logic in Argument Clinic (python#104507) pythongh-104523: Inline minimal PGO rules (python#104524) pythongh-103861: Fix Zip64 extensions not being properly applied in some cases (python#103863) pythongh-69152: add method get_proxy_response_headers to HTTPConnection class (python#104248) pythongh-103763: Implement PEP 695 (python#103764) pythongh-104461: Run tkinter test_configure_screen on X11 only (pythonGH-104462) pythongh-104469: Convert _testcapi/watchers.c to use Argument Clinic (python#104503) pythongh-104482: Fix error handling bugs in ast.c (python#104483) pythongh-104341: Adjust tstate_must_exit() to Respect Interpreter Finalization (pythongh-104437) pythonGH-102613: Fix recursion error from `pathlib.Path.glob()` (pythonGH-104373)
Final PR available: As a result of these changes, for the particular repro case in the original post with 1000 nested |
This commit introduces a 'walk-and-match' strategy for handling glob patterns that include a non-terminal `**` wildcard, such as `**/*.py`. For this example, the previous implementation recursively walked directories using `os.scandir()` when it expanded the `**` component, and then **scanned those same directories again** when expanded the `*.py` component. This is wasteful. In the new implementation, any components following a `**` wildcard are used to build a `re.Pattern` object, which is used to filter the results of the recursive walk. A pattern like `**/*.py` uses half the number of `os.scandir()` calls; a pattern like `**/*/*.py` a third, etc. This new algorithm does not apply if either: 1. The *follow_symlinks* argument is set to `None` (its default), or 2. The pattern contains `..` components. In these cases we fall back to the old implementation. This commit also replaces selector classes with selector functions. These generators directly yield results rather calling through to their successors. A new internal `Path._glob()` method takes care to chain these generators together, which simplifies the lazy algorithm and slightly improves performance. It should also be easier to understand and maintain.
Tis fixed! Most changes made it into 3.12, but a couple are 3.13-only. |
…er Coverage (pythonGH-105744) (cherry picked from commit 4e80082) Co-authored-by: Łukasz Langa <lukasz@langa.pl>
Bug report
Pathlib.rglob can be orders of magnitudes slower than
glob.glob(recursive=True)
With a 1000-deep nested directory,
glob.glob
andPath.glob
both took under 1 second.Path.rglob
took close to 1.5 minutes.Linked PRs
pathlib.Path.rglob()
#104244pathlib.Path.glob()
#104373pathlib.Path.glob()
#104512The text was updated successfully, but these errors were encountered: