gix-dir: Walk #1252

Byron · 2024-01-17T18:58:18Z

Based on #1219

diff-correctness → `gix-status` → gix reset

Improve gix status to the point where it's suitable for use in reset functinoality.
Leads to a proper worktree reset implementation, eventually leading to a high-level reset similar to how git supports it.

Architecture

The reason this PR deals quite a bit with gix status is that for a safe implementation of reset() we need to be sure that the files we would want to touch don't don't carry modifications or are untracked files. In order to know what would need to be done, we have to diff the current-index with target-index. The set of files to touch can then be used to lookup information provided by git-status, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. Here is also where the reset-modes would affect the outcome, i.e. what to change and how.

This is a very modular approach which facilitates testing and understanding of what otherwise would be a very complex algorithm. Having a set of changes as output also allows to one day parallelize applying these changes.

This leaves us in a situation where the current checkout() implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).

On the way to reset() it's a valid choice to warm up more with the matter by improving on the current gix status implementation and assure correctness of what's there, which currently doesn't seem to be the case in comparison. Further, implementing gix status similarly to git status should be made possible.

Research of Walk

Integrate with the untracked-file extension and provide a way to update it.
It looks like this wants to be directory traversal with index, untracked and ignored support, as well as empty-dir and .git dir/file handling
Should also include pathspec-based filtering, and looks like this is also useful for gix add then
ignore index untracked-cache and name-hash - these are optimizations. At the current time, we don't even write the untracked-cache back so it's OK to ignore. Also Git doesn't use it by default and it can crash if enabled.
- Note that for WebKit this optimization is very important, even though we might be able to have some easy wins with parallelization of dirwalk + modification check, as mod-check takes 1.8s at best, and dirwalk takes 1.5s single-threaded.
- It turns out that WebKit traversal takes 0.5s at best (without any other processing), and 0.23s with optimal thread count. So it seems worth going for jwalk enablement.
special attention needs to be paid to case-insensitive file systems (and looking up their path in the index)
needs to deal with precompose-unicode settings
let's have statistics right away
let's do a custom traversal for most control (i.e. not an iterator for now, need correctness, can easily create threaded iter from it if needed)
symlink handling - assure we don't follow symlinks (I presume?)
assure CWD doesn't have any say in the outcome
see what it returns for submodules
see what it does with cone-mode partial checkouts, or other forms of partial checkouts

Tasks

Next PR

minimal integration of dirwalk() into gix status to learn more about it (and its performance)
Integrate untracked and ignored files with rename tracking of index-worktree diffs
status in gix crate with index-worktree
diff index with index to learn what we would want to do in the worktree, or alternatively,
diff tree with index (with reverse-diff functionality to simulate diff of index with tree), for better performance as it
would avoid having to allocate a whole index even though we are only interested in a diff.
- Must include rename tracking.
how to make diff results available from status with all transformations applied, to allow user to obtain diffs of any kind?

Benchmarks

gix can hold its own and comes in close to Git, but only if it doesn't do any additional work by means of precomposing unicode or building an ignore-case hash table (which Git always builds). Interestingly, Git doesn't get slowed at all by any of these, which is surprising as precompose unicode costs about 20% of the time in our case.
Multithreading the hashtable generation seems to be the way to go.

Data

WebKit ( main) via △
❯ hyperfine -N -r4 -w1 'git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean'
Benchmark 1: git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n
  Time (mean ± σ):     676.7 ms ±  15.2 ms    [User: 223.0 ms, System: 473.8 ms]
  Range (min … max):   656.4 ms … 689.4 ms    4 runs

Benchmark 2: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
  Time (mean ± σ):     696.1 ms ±  11.5 ms    [User: 233.2 ms, System: 481.4 ms]
  Range (min … max):   679.2 ms … 703.9 ms    4 runs

Benchmark 3: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
  Time (mean ± σ):     821.3 ms ±   7.5 ms    [User: 360.8 ms, System: 482.3 ms]
  Range (min … max):   811.4 ms … 828.7 ms    4 runs

Benchmark 4: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean
  Time (mean ± σ):     893.7 ms ±   9.1 ms    [User: 438.2 ms, System: 473.1 ms]
  Range (min … max):   885.5 ms … 905.1 ms    4 runs

Benchmark 5: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
  Time (mean ± σ):     772.1 ms ±  10.2 ms    [User: 308.0 ms, System: 485.2 ms]
  Range (min … max):   761.6 ms … 785.7 ms    4 runs

Summary
  git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n ran
    1.03 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
    1.14 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
    1.21 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
    1.32 ± 0.03 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean

WebKit ( main) via △ took 19s
❯ hyperfine -N -r4 -w1 'git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n' 'git  -c core.precomposeUnicode=1 -c core.ignorecase=0  clean -n' 'git  -c core.precomposeUnicode=1 -c core.ignorecase=1  clean -n' 'git  -c core.precomposeUnicode=0 -c core.ignorecase=1  clean -n' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean' '/Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean'
Benchmark 1: git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n
  Time (mean ± σ):     667.7 ms ±   5.2 ms    [User: 220.0 ms, System: 465.4 ms]
  Range (min … max):   663.0 ms … 673.2 ms    4 runs

Benchmark 2: git  -c core.precomposeUnicode=1 -c core.ignorecase=0  clean -n
  Time (mean ± σ):     690.0 ms ±  23.2 ms    [User: 231.7 ms, System: 476.2 ms]
  Range (min … max):   663.0 ms … 709.9 ms    4 runs

Benchmark 3: git  -c core.precomposeUnicode=1 -c core.ignorecase=1  clean -n
  Time (mean ± σ):     667.0 ms ±  15.6 ms    [User: 230.1 ms, System: 480.8 ms]
  Range (min … max):   657.1 ms … 690.0 ms    4 runs

Benchmark 4: git  -c core.precomposeUnicode=0 -c core.ignorecase=1  clean -n
  Time (mean ± σ):     657.3 ms ±  12.5 ms    [User: 225.8 ms, System: 477.7 ms]
  Range (min … max):   646.6 ms … 672.7 ms    4 runs

Benchmark 5: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
  Time (mean ± σ):     699.1 ms ±  23.6 ms    [User: 234.3 ms, System: 483.4 ms]
  Range (min … max):   678.7 ms … 730.3 ms    4 runs

Benchmark 6: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
  Time (mean ± σ):     842.9 ms ±  49.3 ms    [User: 365.1 ms, System: 491.1 ms]
  Range (min … max):   815.7 ms … 916.9 ms    4 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 7: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean
  Time (mean ± σ):     897.8 ms ±  30.3 ms    [User: 429.6 ms, System: 481.0 ms]
  Range (min … max):   870.2 ms … 924.0 ms    4 runs

Benchmark 8: /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
  Time (mean ± σ):     769.9 ms ±  24.3 ms    [User: 298.3 ms, System: 490.2 ms]
  Range (min … max):   744.0 ms … 795.8 ms    4 runs

Summary
  git  -c core.precomposeUnicode=0 -c core.ignorecase=1  clean -n ran
    1.01 ± 0.03 times faster than git  -c core.precomposeUnicode=1 -c core.ignorecase=1  clean -n
    1.02 ± 0.02 times faster than git  -c core.precomposeUnicode=0 -c core.ignorecase=0  clean -n
    1.05 ± 0.04 times faster than git  -c core.precomposeUnicode=1 -c core.ignorecase=0  clean -n
    1.06 ± 0.04 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=0 clean
    1.17 ± 0.04 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=0 -c core.ignorecase=1 clean
    1.28 ± 0.08 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=0 clean
    1.37 ± 0.05 times faster than /Users/byron/dev/github.com/Byron/gitoxide/target/release/gix -c core.precomposeUnicode=1 -c core.ignorecase=1 clean

Status Enables

Add gitoxide backend Byron/built#1
starship native dirty check
onefetch finalize integration
helix PR

Next PR: Reset

reset() that checks if it's allowed to perform a worktree modification is allowed, or if an entry should be skipped. That way we can postpone safety checks like --hard

Postponed

What follows is important for resets, but won't be needed for cargo worktree resets.

a way to expand sparse dirs (but figure out if this is truly always necessary) - probably not, unless sparse dirs can be empty, but even then no expansion is needed
- wire it up in gix index entries to optionally expand sparse entries
gix status with actual submodule support - needs status in gix (crate) effectively
gix status with actual conflict support

Research

Ignored files are considered expandable and can be overwritten on reset
How to integrate submodules - probably easy to answer once gix status can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.
How to deal with various modes like merge and keep? How to control refresh? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.
Worthwhile to make explicit the difference between git reset and git checkout in terms of HEAD modifications. With the former changing HEADs referent, and the latter changing HEAD itself.
figure out how this relates to the current checkout() method as technically that's a reset --hard with optional overwrite check. Could it be rolled into one, with pathspec support added?
- just keep them separate until it's clear that reset() performs just as well, which is unlikely as there is more overhead. But maybe it's not worth to maintain two versions over it. But if so, one should probably rename it.
for git status: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking. No, also can have worktree rename tracking, even though it's hard to imagine how this can be fast unless it's tightly integrated with untracked-files handling. This screams for a generalization of the tracking code though as the testing and implementation is complex, but should be generalisable.

Re-learn

pathspecs normalize themselves to turn from any kind of specification into repo-root relative patterns.
attribute/ignore file sources are naturally relative to the root of the repo, which remains relative (i.e. can be .. and that root will be always be used to open files like ../.gitignore, which is useful for display to the user)

pascalkuthe · 2024-01-25T08:54:36Z

For the parallelism: Another rating you can consider is a workstealking executed (basically rayon nut I believe you said you already have one in core). That will call the equivalent of rayon_core:Scope::spawn for every dir entry and update any state.

You can abstract the join function to switch between single threaded and multi threaded.

Then you can just transverse the directories with std and have full manual control.

Byron · 2024-01-25T09:10:29Z

Thanks for sharing, I am glad you think this can be parallelised!

I was thinking to have the following progression:

make it work with code that quite closely resembles the source material (even though it already diverges as it's a library)
make it work correctly and assure there are tests for every little thing that is important

The above would already satisfy as the integration into the top-level status call will then run a bunch of different status-tasks in parallel and we'll see what that does to the overall performance (after all, there are 3 different status operations needed to get what git status does, they can all run in parallel).

I already checked the potential for parallelisation on the WebKit repository, and on my machien it was close to 2X faster when having a parallel walk, merely by comparing jwalk and walkdir. Not great, but meaningful.

The simplest way, in theory, would be to see if the algorithm in all its complexity can work with an iterator, as then one could just gix-features::fs::WalkDir for it which is parallel with the right cargo feature.

Of course, it would be more optimal to use parallelise directly, and but at least for now I am so ignorant towards rayon and for that reason along wouldn't want to use it directly. Maybe that changes one day though. moonwalk of course is also a contender as it would represent such an abstraction.

Long story short, I definitely plan for future upgrades and will do my best to have a bullet-proof test-suite that assures everything still works even after massive refactors that might come with these changes.

Byron · 2024-01-25T10:13:18Z

Just now I realised that the aforementioned 3 steps have interdependencies, as the index-to-worktree comparison is required to run before the walk() can happen, because it needs the index entries to have the gix_index::entry::Flags::UPTODATE flag set or else the entry will be ignored. This is a bit sad as that comparison is the most costly.
Maybe something smart can be done in due time that involves assuming all entries are up-to-date, and then re-running the walk() on just the paths that we know changed in the meantime., Oh well, kommt Zeit kommt Rat.

pascalkuthe · 2024-01-25T12:16:27Z

Hmm I didn't match the git implementation quite as closely when I worked on the status stuff but it did seems like this was more of a superficial limitation of the implementation rathee than an actual interdependence.

Doing serial first makes lots of sense ofcourse.

Byron · 2024-01-25T13:24:36Z

Hmm I didn't match the git implementation quite as closely when I worked on the status stuff but it did seems like this was more of a superficial limitation of the implementation rathee than an actual interdependence.

It changed quite a bit since then and doesn't have mutable access to the index anymore either, instead it emits change requests (stat updates) specifically. Thus the one orchestrating these calls must know about this flag and set it before calling the dirwalk. Effectively, having this allows for an optimisation as otherwise another lstat call will be triggered.

At least that's my current understanding, thus far I think it will all be fine and optimisation comes later. Even now I am very optimistic that the implementation here will be en-par with Git, while git reset will be much faster on big trees.

That way it's easier to use them without adding own dependnecies.

This way it's possible to match partial input against a pathspec to see if this root would have a chance to actually match.

…atch`. With it the caller can learn how or why the pathspec matched, which allows to make decisions based on it that are relevant to the user interface.

…f a pathspec matches. That way it's possible to see if some paths can never match.

That way, the caller can obtain the amount of patterns more easily.

The new implementation does the same as Git, and initializes an alternative lookup location. All previous implementations with `_icase` suffix were removed without replacement.

…ns were available. These are only relevant when reading, and have to be regenerated after writing. During reading, they can speed up performance by allowing multi-threading.

That way it's possible to obtain the current working directory with which the repository was opened.

That way it's possible to perform arbitrary directory walks, useful for status, clean, and add.

Byron force-pushed the dirwalk branch 3 times, most recently from f71396e to 8432e97 Compare January 25, 2024 06:04

Byron mentioned this pull request Jan 25, 2024

Changed file picker helix-editor/helix#5645

Merged

Byron force-pushed the dirwalk branch 3 times, most recently from 2f0d55d to 8bb7d66 Compare January 25, 2024 08:27

Byron force-pushed the dirwalk branch 9 times, most recently from a9ece59 to b4691eb Compare January 28, 2024 20:04

Byron mentioned this pull request Jan 29, 2024

Add gitoxide feature lukaslueg/built#52

Closed

Byron force-pushed the dirwalk branch 8 times, most recently from 84e1b5f to 4ed507e Compare February 2, 2024 20:02

Byron force-pushed the dirwalk branch 4 times, most recently from cf311e3 to 109505a Compare February 8, 2024 16:29

Byron added 3 commits February 8, 2024 19:08

fix: re-export crates whose types are used in the API.

d0c5a0e

That way it's easier to use them without adding own dependnecies.

feat: add Search::can_match_relative_path().

e409e8d

This way it's possible to match partial input against a pathspec to see if this root would have a chance to actually match.

feat!: add search::MatchKind, which is available for any `search::M…

021d4c1

…atch`. With it the caller can learn how or why the pathspec matched, which allows to make decisions based on it that are relevant to the user interface.

Byron force-pushed the dirwalk branch from 109505a to 0da5da0 Compare February 8, 2024 19:04

Byron added 4 commits February 9, 2024 13:55

feat: Add Search::directory_matches_prefix() to see if the prefix o…

f520a51

…f a pathspec matches. That way it's possible to see if some paths can never match.

fix: Return ExactSizeIterator where applicable.

ea2e5bb

That way, the caller can obtain the amount of patterns more easily.

feat: add str::precompose_bstr() for convenience

9e3acde

feat: add try_os_str_into_bstr(), with Cow<OsStr> as input.

c8ccbe5

Byron force-pushed the dirwalk branch 2 times, most recently from 85b30d3 to f3aea6a Compare February 10, 2024 20:13

Byron added 2 commits February 11, 2024 09:02

fix!: refactor ignore-case functionality to actually work.

cbda06d

The new implementation does the same as Git, and initializes an alternative lookup location. All previous implementations with `_icase` suffix were removed without replacement.

feat: Provide information about whether or not EOIE and IEOT extensio…

ec37db4

…ns were available. These are only relevant when reading, and have to be regenerated after writing. During reading, they can speed up performance by allowing multi-threading.

Byron force-pushed the dirwalk branch 3 times, most recently from 8d413ba to fa1eda0 Compare February 11, 2024 14:25

Byron added 4 commits February 11, 2024 17:08

Implementation of the Git-style directory walk.

3252cfd

feat: add open::Options::current_dir().

d8bd45e

That way it's possible to obtain the current working directory with which the repository was opened.

feat: add Repository::dirwalk_with_delegate().

6914d1a

That way it's possible to perform arbitrary directory walks, useful for status, clean, and add.

Adjust gitignore files with precious declarations

ae86a6a

Byron force-pushed the dirwalk branch 2 times, most recently from 9a11d02 to 9f9b0d7 Compare February 11, 2024 17:24

Byron added 2 commits February 11, 2024 18:45

feat: basic gix clean

e8597f3

feat: gix free index info now lists EOIE and IEOT extensions.

a3ab5bc

Byron force-pushed the dirwalk branch from 9f9b0d7 to a3ab5bc Compare February 11, 2024 17:52

Byron merged commit face359 into main Feb 11, 2024
18 checks passed

Byron mentioned this pull request Feb 11, 2024

integrate gix-status #1285

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gix-dir: Walk #1252

gix-dir: Walk #1252

Byron commented Jan 17, 2024 •

edited

Loading

pascalkuthe commented Jan 25, 2024

Byron commented Jan 25, 2024

Byron commented Jan 25, 2024

pascalkuthe commented Jan 25, 2024

Byron commented Jan 25, 2024

gix-dir: Walk #1252

gix-dir: Walk #1252

Conversation

Byron commented Jan 17, 2024 • edited Loading

diff-correctness → gix-status → gix reset

Architecture

Research of Walk

Tasks

Next PR

Benchmarks

Status Enables

Next PR: Reset

Postponed

Research

Re-learn

pascalkuthe commented Jan 25, 2024

Byron commented Jan 25, 2024

Byron commented Jan 25, 2024

pascalkuthe commented Jan 25, 2024

Byron commented Jan 25, 2024

Byron commented Jan 17, 2024 •

edited

Loading

diff-correctness → `gix-status` → gix reset