-
-
Notifications
You must be signed in to change notification settings - Fork 314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gix-dir: Walk #1252
gix-dir: Walk #1252
Conversation
f71396e
to
8432e97
Compare
2f0d55d
to
8bb7d66
Compare
For the parallelism: Another rating you can consider is a workstealking executed (basically rayon nut I believe you said you already have one in core). That will call the equivalent of rayon_core:Scope::spawn for every dir entry and update any state. You can abstract the join function to switch between single threaded and multi threaded. Then you can just transverse the directories with std and have full manual control. |
Thanks for sharing, I am glad you think this can be parallelised! I was thinking to have the following progression:
The above would already satisfy as the integration into the top-level status call will then run a bunch of different status-tasks in parallel and we'll see what that does to the overall performance (after all, there are 3 different status operations needed to get what I already checked the potential for parallelisation on the WebKit repository, and on my machien it was close to 2X faster when having a parallel walk, merely by comparing The simplest way, in theory, would be to see if the algorithm in all its complexity can work with an iterator, as then one could just Of course, it would be more optimal to use parallelise directly, and but at least for now I am so ignorant towards Long story short, I definitely plan for future upgrades and will do my best to have a bullet-proof test-suite that assures everything still works even after massive refactors that might come with these changes. |
Just now I realised that the aforementioned 3 steps have interdependencies, as the |
Hmm I didn't match the git implementation quite as closely when I worked on the status stuff but it did seems like this was more of a superficial limitation of the implementation rathee than an actual interdependence. Doing serial first makes lots of sense ofcourse. |
It changed quite a bit since then and doesn't have mutable access to the index anymore either, instead it emits change requests (stat updates) specifically. Thus the one orchestrating these calls must know about this flag and set it before calling the dirwalk. Effectively, having this allows for an optimisation as otherwise another At least that's my current understanding, thus far I think it will all be fine and optimisation comes later. Even now I am very optimistic that the implementation here will be en-par with Git, while |
a9ece59
to
b4691eb
Compare
84e1b5f
to
4ed507e
Compare
cf311e3
to
109505a
Compare
That way it's easier to use them without adding own dependnecies.
This way it's possible to match partial input against a pathspec to see if this root would have a chance to actually match.
…atch`. With it the caller can learn how or why the pathspec matched, which allows to make decisions based on it that are relevant to the user interface.
…f a pathspec matches. That way it's possible to see if some paths can never match.
That way, the caller can obtain the amount of patterns more easily.
85b30d3
to
f3aea6a
Compare
The new implementation does the same as Git, and initializes an alternative lookup location. All previous implementations with `_icase` suffix were removed without replacement.
…ns were available. These are only relevant when reading, and have to be regenerated after writing. During reading, they can speed up performance by allowing multi-threading.
8d413ba
to
fa1eda0
Compare
That way it's possible to obtain the current working directory with which the repository was opened.
That way it's possible to perform arbitrary directory walks, useful for status, clean, and add.
9a11d02
to
9f9b0d7
Compare
Based on #1219
diff-correctness →
gix-status
→ gix resetImprove
gix status
to the point where it's suitable for use inreset
functinoality.Leads to a proper worktree reset implementation, eventually leading to a high-level reset similar to how git supports it.
Architecture
The reason this PR deals quite a bit with
gix status
is that for a safe implementation ofreset()
we need to be sure that the files we would want to touch don't don't carry modifications or are untracked files. In order to know what would need to be done, we have to diff thecurrent-index with target-index
. The set of files to touch can then be used to lookup information provided bygit-status
, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. Here is also where the reset-modes would affect the outcome, i.e. what to change and how.This is a very modular approach which facilitates testing and understanding of what otherwise would be a very complex algorithm. Having a set of changes as output also allows to one day parallelize applying these changes.
This leaves us in a situation where the current
checkout()
implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).On the way to
reset()
it's a valid choice to warm up more with the matter by improving on the currentgix status
implementation and assure correctness of what's there, which currently doesn't seem to be the case in comparison. Further, implementinggix status
similarly togit status
should be made possible.Research of Walk
index
,untracked
andignored
support, as well as empty-dir and.git
dir/file handlinggix add
thenWebKit
this optimization is very important, even though we might be able to have some easy wins with parallelization of dirwalk + modification check, as mod-check takes 1.8s at best, and dirwalk takes 1.5s single-threaded.Tasks
dir/*
). It's probably about being exact matches and what not, also works with prefix-matches likedir/
ordir
(but it knows how much matched to know what to return, files or the containing directory).also figure out how pathspecs are used when determining if recursion is allowed - can it force recursion?- postponedgix clean
clean -nxd '*.o'
should only list matching files, not collapse directories (in delete mode)generated/*.o
clean -nx d/d
matches and forces the directory to be shown even though no-d
is given:!*generated*
seems to expand to muchgix
gix clean
uses walkdir fromgix
Next PR
dirwalk()
intogix status
to learn more about it (and its performance)gix
crate with index-worktreediff tree with index (with reverse-diff functionality to simulate diff of index with tree), for better performance as it
would avoid having to allocate a whole index even though we are only interested in a diff.
Benchmarks
gix
can hold its own and comes in close toGit
, but only if it doesn't do any additional work by means of precomposing unicode or building an ignore-case hash table (which Git always builds). Interestingly, Git doesn't get slowed at all by any of these, which is surprising as precompose unicode costs about 20% of the time in our case.Multithreading the hashtable generation seems to be the way to go.
Data
Status Enables
gitoxide
backend Byron/built#1Next PR: Reset
reset()
that checks if it's allowed to perform a worktree modification is allowed, or if an entry should be skipped. That way we can postpone safety checks like --hardPostponed
What follows is important for resets, but won't be needed for
cargo
worktree resets.gix index entries
to optionally expand sparse entriesgix status
with actual submodule support - needsstatus
ingix
(crate) effectivelygix status
with actual conflict supportResearch
gix status
can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.merge
andkeep
? How to controlrefresh
? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.git reset
andgit checkout
in terms ofHEAD
modifications. With the former changingHEAD
s referent, and the latter changingHEAD
itself.checkout()
method as technically that's areset --hard
with optional overwrite check. Could it be rolled into one, with pathspec support added?reset()
performs just as well, which is unlikely as there is more overhead. But maybe it's not worth to maintain two versions over it. But if so, one should probably rename it.git status
: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking. No, also can have worktree rename tracking, even though it's hard to imagine how this can be fast unless it's tightly integrated with untracked-files handling. This screams for a generalization of the tracking code though as the testing and implementation is complex, but should be generalisable.Re-learn
pathspecs
normalize themselves to turn from any kind of specification into repo-root relative patterns...
and that root will be always be used to open files like../.gitignore
, which is useful for display to the user)