Improve matching performance. #293

kevincox · 2017-06-18T17:12:16Z

Note: This is based off of #253 and #258 .

This improves matching performance by using a trie to store the paths.
Hit-or-miss matching is then done iteratively, allowing the pruning of
subtrees and requiring less work to identify matches.

Additionally this structure usually uses much less memory as common
prefixes of paths are only stored once.

Testing on benchmarks and use cases shows a 2-10x performance
improvement for common scenarios. For some edge cases effectively
infinite speedup can be seen as huge numbers of paths can be
pruned. This is particularly common with a lot of hidden paths.

One side note is that I kept the threading feature in although in
practice it seems to provide a very small improvement and only on very
large sets (>1M). Future work will to investigate why there is little
benefit and either fix or remove the threading. (Removing provides a
small performance boost as some checks can be removed).

wincent · 2017-06-18T18:52:36Z

Can't really review this on my phone but looks pretty epic. The trie makes sense: trade a bit of overhead for its construction in exchange for cheaper search space pruning during matching. Great stuff. Thanks for contributing it.

wincent · 2017-06-22T14:50:58Z

@kevincox: would you be able to rebase this against "master"?

kevincox · 2017-06-22T15:10:51Z

If you are interested in merging I gladly will. However it is Dependant on the C Scanner Infrastructure so that will have to be merged first. If that sounds good let me know. I should have some free time this weekend.

wincent · 2017-06-22T16:27:10Z

Yeah, definitely interested. I just want to confirm your benchmark results on top of current master before going too far.

kevincox · 2017-06-22T18:59:01Z

Well I'll try to start porting to current master. Also I realized the reason I wasn't seeing much parallel speedup was because the benchmarks had no limit. Adding a limit the multithreading is quite effective.

This scanner allows low overhead reading of files. It can currently read paths from an array of strings (good for ruby-based scanners) and a terminated string (great for command executor scanners). This also converts the find scanner and the git scanner to use the string based infrastructure.

This option provides the user with ultimate flexibility. They have the choice to provide any shell command, script or pipeline in order to generate a file list. To give the command more context in the future variables can be set before calling the user command in the shell to pass more information to the external command. This allows future backwards compatible extension. The only drawback I see is that the terminator is "static" for example if a user has a script that switches the technique used based on context (for example current path) it will always have to use the same terminator. Luckily the workaround is simple by using `| tr '\n' '\0'` to convert newlines (or whatever the alternative separator is) to nulls and setting CommandT to always use nulls. This has a slight CPU and memory overhead but since it runs in parallel it should be insignificant on a multi-core machine.

This improves matching performance by using a trie to store the paths. Hit-or-miss matching is then done iteratively, allowing the pruning of subtrees and requiring less work to identify matches. Additionally this structure usually uses much less memory as common prefixes of paths are only stored once. Testing on benchmarks and use cases shows a 2-10x performance improvement for common scenarios. For some edge cases effectively infinite speedup can be seen as huge numbers of paths can be pruned. This is particularly common with a lot of hidden paths. Other notes: - A tiny cache boost might be gained by putting all strings into the same buffer Instead of strdup'ing strings separately. - I considered putting the path segments inline into the paths_t structure but it performed worse. This was surprising but probably due to good cache prediction on the strings being in order in memory. - Most of the time is still spent in calculate_match. While we call this function much less now (on every match rather then every string) it is still expensive. While this might be acceptable because its complexity provides a useful result order it is a good optimization target. - Another boost might be gained by post-processing the paths into a single array that stores the delta from the previous. This would give excellent cache-locality but would make skipping over hidden or mask-failing files difficult/impossible. This would also be difficult/impossible to do threaded.

Without a limit the time is dominated by sorting all the results in a single thread. The limit is configurable so it can be raised for testing merging.

kevincox · 2017-06-24T10:37:59Z

The rebase was simpler then expected. This is now based on master.

wincent · 2017-06-27T01:01:06Z

Thanks, @kevincox. So I had a chance to play around with this. The "big" matcher benchmark is indeed a bit faster, but I found in practice that on a large repo (1.5m files) this is significantly slower overall. How large a repo have you been testing this on? I think the approach here shows promise, but it will need to meet and improve the bar on large repos in order to be useful (because the truth is, the perf doesn't matter on small repos, because anything is fast enough there).

kevincox · 2017-06-27T06:48:12Z

That's odd. I was trying it on a set of ~4M files (my real world use case) and it was significantly faster. Can you explain exactly what your setup is.

What do the files look like. Are they a shallow set or deeply nested?
Are you using threads? And how many cores does your system have?
- What is the difference with and without threads?
Are you setting a limit on the number of results? How large is it?
What OS are you on?
What compiler and optimization level are you using?
Could you give me concrete numbers? I'm wondering it both before and after are fast or slow.
What does your query look like and how much of the set is it actually matching and how selective is the mask?

kevincox · 2017-06-27T09:27:24Z

Hmm, it actually appears like there was a regression in the cleanup. I'll look into it.

However even still the performance is better then master for me with ~2M paths and a real query. The numbers below are running the matcher benchmark as it exists in the latest commit in this branch (except master had to pass the paths differently) on a 6 physical 12 logical core linux system.

# master
                     best      avg       sd   +/- p   (best)      (avg)      (sd)   +/- p
     pathological   1.13000   1.22750 0.06418  ?     (1.12612)  (1.22599) (0.06424)  ?   
        command-t   0.89000   0.90750 0.03031  ?     (0.88601)  (0.90683) (0.03032)  ?   
chromium (subset)   3.79000   3.80500 0.01118  ?     (1.03332)  (1.07116) (0.04665)  ?   
 chromium (whole)   4.29000   4.37250 0.07361  ?     (0.58969)  (0.64701) (0.05316)  ?   
       big (400k)   6.66000   6.67750 0.01479  ?     (0.90809)  (0.96113) (0.05408)  ?   
             real 385.38000 386.36750 0.70892  ?    (46.41973) (47.30632) (0.54375)  ?   
# faster-filter
                     best      avg       sd    +/- p   (best)      (avg)      (sd)   +/- p
     pathological   1.75000   2.10500  0.27409  ?     (1.74928)  (2.10546) (0.27541)  ?   
        command-t   1.39000   1.49500  0.07921  ?     (1.38612)  (1.49320) (0.07835)  ?   
chromium (subset)   2.83000   3.15750  0.23317  ?     (2.82870)  (3.14835) (0.22487)  ?   
 chromium (whole)   4.22000   4.45000  0.33786  ?     (0.60953)  (0.67107) (0.06295)  ?   
       big (400k)   5.28000   5.73500  0.38817  ?     (0.84286)  (0.93130) (0.09734)  ?   
             real 336.87000 367.88250 30.48526  ?    (42.95834) (46.65727) (3.38012)  ?

kevincox · 2017-08-06T21:48:33Z

Sorry for the delay, life gets in the way.

I looked into it and it appears that the reason the benchmarks look very similar is because this new branch is slightly slower when everything matches. This is because because every item still gets scored then sorted, and scoring is expensive. However the old method scores right away so it is doing the minimal amount of work (not counting cache differences), however it lacks the fast path.

Once your selector has much selectivity at all the new version is significantly faster as only matching entries are scored. The benchmark.rb currently benchmarks every prefix of the query so it spends most of it's time on the short non-selective queries.

I suspect that this isn't too useful of a measure because I tend to type a number of characters before I wait for the result. So while the first couple of searches are roughtly the same performance it doesn't matter to me because I only start using the results once I have typed a number of characters and the result set is significantly reduced.

Benchmark just using the full query:

# master
                    best    avg      sd   +/- p   (best)    (avg)      (sd)   +/- p
     pathological 0.05000 0.05733 0.00573  ?    (0.05428) (0.05838) (0.00570)  ?   
        command-t 0.07000 0.07467 0.00618  ?    (0.07070) (0.07511) (0.00282)  ?   
chromium (subset) 0.14000 0.15400 0.00712  ?    (0.03793) (0.04515) (0.00383)  ?   
 chromium (whole) 0.30000 0.31333 0.00699  ?    (0.05481) (0.05879) (0.00315)  ?   
       big (400k) 0.44000 0.46200 0.01046  ?    (0.07632) (0.08399) (0.00327)  ?   
# faster-filter
Summary of cpu time and (wall-clock time):
                    best    avg      sd   +/- p   (best)    (avg)      (sd)   +/- p
     pathological 0.10000 0.18800 0.03563  ?    (0.09133) (0.18863) (0.03951)  ?   
        command-t 0.06000 0.06600 0.00490  ?    (0.06305) (0.06872) (0.00324)  ?   
chromium (subset) 0.04000 0.06333 0.00869  ?    (0.04379) (0.05980) (0.00785)  ?   
 chromium (whole) 0.02000 0.03733 0.00772  ?    (0.00724) (0.00854) (0.00077)  ?   
       big (400k) 0.06000 0.06400 0.00490  ?    (0.01074) (0.01278) (0.00100)  ?

wincent · 2017-08-18T18:48:24Z

Interesting. Thanks for the analysis.

The benchmark.rb currently benchmarks every prefix of the query

Yep, this is pretty intentional because it reflects what I think is a fairly common use pattern. (Although I totally take your point about your preference for typing in bursts, and relying on debouncing to prevent intermediate searches from being dispatched.)

A couple of thoughts:

I wonder whether it would make sense to have a hybrid strategy that uses the method that is faster for "short non-selective queries", then cuts over to the other strategy (which your benchmarks seem to show is faster for long, selective queries) at some point. Obviously, would add some complexity, and the pay-off may not sufficiently justify it.
You mention that "every item still gets scored then sorted, and scoring is expensive". There is some nuance to that. Scoring is costly, but sorting is too (got a big speed-up from reducing the sorting workload by switching to a heap-based strategy in 03aedee, and then another substantial bump when we avoided some unnecessary heap operations in 34b03d5). Commit ee97325 was another big bump because it makes scoring basically free after the first non-match (that is, if you search for "abc" and it doesn't match, when you add another character and search for "abcd" we don't bother scoring at all seeing as we know it can't possibly match, meaning we can cheaply prune the search space much like you are with the trie, at least for this case).

Thanks for you work and analysis on this. I'm not going to pull the trigger on this yet because I do think we need to solve the sluggish feel on large repos when not using g:CommandTInputDebounce before we merge it.

kevincox · 2017-08-18T19:50:12Z

The benchmark.rb currently benchmarks every prefix of the query
Yep, this is pretty intentional...

I figured it was, and I knew about it but it slipped my mind. I think in the future it would be benficial to have a couple of different benchmarks so that it is easier to identify where the performance changes, but this is probably sufficent.

I wonder whether it would make sense to have a hybrid strategy...

I'm not sure it is. I very rarely use non-selective queries. Also I suspect the less selective your query the more you are relying on scoring to suface an interesting result. Searching a large corpus with a non-selective query doesn't seem particularly useful otherwise.

There is some low-hanging fruit in this change to make the empty string only use on thread and just read off the first limit items. Currently it spawns all the threads and scans everything. However since the scoring has a short circuit for the empty needle it is still fairly fast (~30% of time is in scoring) and I didn't implement it. (It's probably enought to just set sort = false, threads = 1 and this will work).

Scoring is costly, but sorting is too.

The first two changes are preserved, but I haven't looked into the partial results one. It could probably save some cycles on the incremental case so I'll look into it. It is unclear if it will be as clear of a win in this case because we already do some pruning, however more pruning is unlikely to hurt.

the sluggish feel on large repos when not using g:CommandTInputDebounce

I haven't tried this, I'll play with the setting and see the effect.

wincent · 2017-08-18T20:33:05Z

I'm not sure it is. I very rarely use non-selective queries. Also I suspect the less selective your query the more you are relying on scoring to suface an interesting result. Searching a large corpus with a non-selective query doesn't seem particularly useful otherwise.

In order to actually find the needle in the haystack you're generally going to need something pretty "selective", as you say, but that doesn't mean that performance of "non-selective" queries doesn't matter. The benchmarks don't tell the full story; this is why I commented above "I found in practice that on a large repo (1.5m files) this is significantly slower overall" — it felt like treacle, not a subtle effect at all. I didn't need to run the benchmarks to see the regression, and I bet other people working in huge repos would have seen it too; this isn't a true edge-case like the "pathological" dataset in the benchmarks, but the really common starting-to-search case, so it's pretty important that we don't regress it.

[Edit: is → isn't]

kevincox · 2017-08-18T21:20:52Z

I found in practice that on a large repo (1.5m files) this is significantly slower overall

I wonder what makes your large repo test slow. Because I tested with 0.5M, 2M and 4M and saw only minor differences (< 20% for empty queries). If you can figure out why your use case is so slow it might be possible to optimize it.

I have been using this since before I opened the request originaly with the 2M, but later cut down to 0.5M repo and it feels significantly faster for me in both cases.

it's pretty important that we don't regress it.

For sure, I think it should be possible to make this change with no, or only very minor regressions in the worst cases.

kevincox · 2017-08-18T21:38:15Z

I'll add a little checklist for planned improvements:

Cache results of previous search. This will allow faster pruning when search is strictly more selective. Restricts to a single concurrent reader but that is probably fine.
Seperate benchmarks for empty, non-selective, selective and extending cases.

wincent · 2022-08-26T21:06:52Z

Given the big rewrite for v6.0.x, older PRs will need to be re-rolled against main and target the new version, so I'm closing this one out. Feedback issue for 6.0.x is here:

🆕 Command-T 6.0.0 release — compendium issue #393

kevincox force-pushed the faster-filter branch from 3e59e37 to 4c38277 Compare June 18, 2017 17:13

kevincox added 3 commits June 24, 2017 11:10

kevincox force-pushed the faster-filter branch from 4c38277 to f91f894 Compare June 24, 2017 10:28

Add limit to matcher benchmark.

be96c6d

Without a limit the time is dominated by sorting all the results in a single thread. The limit is configurable so it can be raised for testing merging.

kevincox force-pushed the faster-filter branch from d44b560 to be96c6d Compare June 24, 2017 10:35

wincent changed the base branch from master to main June 7, 2021 22:34

wincent closed this Aug 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve matching performance. #293

Improve matching performance. #293

kevincox commented Jun 18, 2017

wincent commented Jun 18, 2017

wincent commented Jun 22, 2017

kevincox commented Jun 22, 2017

wincent commented Jun 22, 2017

kevincox commented Jun 22, 2017

kevincox commented Jun 24, 2017

wincent commented Jun 27, 2017

kevincox commented Jun 27, 2017

kevincox commented Jun 27, 2017

kevincox commented Aug 6, 2017

wincent commented Aug 18, 2017

kevincox commented Aug 18, 2017

wincent commented Aug 18, 2017 •

edited

Loading

kevincox commented Aug 18, 2017 •

edited

Loading

kevincox commented Aug 18, 2017

wincent commented Aug 26, 2022

Improve matching performance. #293

Improve matching performance. #293

Conversation

kevincox commented Jun 18, 2017

wincent commented Jun 18, 2017

wincent commented Jun 22, 2017

kevincox commented Jun 22, 2017

wincent commented Jun 22, 2017

kevincox commented Jun 22, 2017

kevincox commented Jun 24, 2017

wincent commented Jun 27, 2017

kevincox commented Jun 27, 2017

kevincox commented Jun 27, 2017

kevincox commented Aug 6, 2017

wincent commented Aug 18, 2017

kevincox commented Aug 18, 2017

wincent commented Aug 18, 2017 • edited Loading

kevincox commented Aug 18, 2017 • edited Loading

kevincox commented Aug 18, 2017

wincent commented Aug 26, 2022

wincent commented Aug 18, 2017 •

edited

Loading

kevincox commented Aug 18, 2017 •

edited

Loading