Reduce worst-case alg complexity of MemoryExtensions.IndexOfAny #53652

Merged · 3 commits · Jun 4, 2021

Conversation

GrabYourPitchforks (Member)

Based on a comment thread at #53115 (comment).

In a nutshell, given a large enough needle set, the code pattern below may have O(n^2 * l) complexity, where n is the length of the haystack and l is the number of needles.

Span<T> haystack = GetHaystack();
Span<T> needles = GetNeedles();

while (true)
{
    int idxOfFirstNeedle = haystack.IndexOfAny(needles);
    if (idxOfFirstNeedle >= 0)
    {
        DoSomethingWith(haystack.Slice(0, idxOfFirstNeedle));
        haystack = haystack.Slice(idxOfFirstNeedle + 1);
    }
    else
    {
        DoSomethingWith(haystack); // remainder
        break;
    }
}

This PR reduces the algorithmic complexity of this slice-and-loop pattern to worst-case O(n * l), which is what most callers probably expect. The downside is a higher constant factor, which will dominate when the haystack is small or when the first needle has a much higher probability of being found than any subsequent needle.
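
To make the bound concrete, here is a minimal sketch of an O(n * l) shape for the search (a hypothetical helper for illustration, not the code this PR adds to SpanHelpers): walk the haystack once and do an O(l) membership test per position, so a single call costs O(d * l) where d is the distance to the first match, and the slice-and-loop pattern above sums to O(n * l) across the whole haystack.

```csharp
using System;

// Illustrative sketch only (hypothetical helper, not this PR's SpanHelpers change):
// haystack-outer / needles-inner search. Each call stops at the first match, so
// repeated calls from the slice-and-loop pattern visit each haystack element a
// bounded number of times, giving O(n * l) overall.
static int IndexOfAnySketch<T>(ReadOnlySpan<T> haystack, ReadOnlySpan<T> needles)
    where T : IEquatable<T>
{
    for (int i = 0; i < haystack.Length; i++)
    {
        // O(l) scan of the needle set for the current haystack element.
        if (needles.IndexOf(haystack[i]) >= 0)
        {
            return i;
        }
    }

    return -1; // no needle found in this slice
}
```

Roughly speaking, the quadratic behavior in the old pattern comes from the opposite nesting (needle-outer, full-haystack-inner), where each call pays for scanning the entire remaining haystack per needle even when the next match is nearby.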

I'm intentionally keeping this PR simple, and a future PR can introduce vectorization or any other optimization technique to help get some of this perf back. The unit tests introduced as part of this PR should help catch future regressions in this area.

I don't have perf numbers right now but should be able to share them by the end of the week if there's interest in seeing them. Per the investigation in the previously linked thread, I don't expect this to impact typical library code, as MemoryExtensions.[Last]IndexOfAny already has logic to handle the specific needle counts typical of perf-critical code paths.
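
For reference, the fixed-arity overloads the previous paragraph alludes to look like this from the caller's side; the delimiter choice below is just a hypothetical example.

```csharp
using System;

// Perf-critical callers usually search for a small, fixed set of values, which
// hits MemoryExtensions' dedicated two- and three-value overloads rather than
// the span-of-needles path changed by this PR. (Hypothetical example.)
static int FindNextDelimiter(ReadOnlySpan<char> text)
{
    // Resolves to MemoryExtensions.IndexOfAny(span, value0, value1, value2).
    return text.IndexOfAny('\r', '\n', '\t');
}
```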

@ghost commented Jun 3, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @carlossanlop
See info in area-owners.md if you want to be subscribed.

Author: GrabYourPitchforks
Assignees: -
Labels: area-System.Memory, tenet-performance
Milestone: 6.0.0

@stephentoub (Member) left a comment:

Thanks for doing this.

@GrabYourPitchforks (Member, Author) commented Jun 3, 2021

Need to investigate why the new unit tests are failing on arm / arm64 / OSX x64.

Edit: It's because BoundedMemory (on non-Windows) points to a GC array instead of native memory, and in this particular test I'm intentionally pointing the gcref at a bad address. I'll change BoundedMemory to use the native heap on Unix, just like it does for Windows, which should resolve the issue.
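
For readers unfamiliar with BoundedMemory, the general idea (sketched below for Windows only, purely as an illustration and not the actual test-helper sources) is to hand out native memory placed immediately before an inaccessible guard page, so any out-of-bounds access faults deterministically; a GC-array-backed buffer behaves differently, which is what tripped up the poisoned-reference test on those platforms.

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative, Windows-only sketch of the guard-page idea behind BoundedMemory.
internal static unsafe class GuardedBufferSketch
{
    private const uint MEM_COMMIT = 0x1000;
    private const uint MEM_RESERVE = 0x2000;
    private const uint PAGE_READWRITE = 0x04;
    private const uint PAGE_NOACCESS = 0x01;

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern void* VirtualAlloc(void* lpAddress, nuint dwSize, uint flAllocationType, uint flProtect);

    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool VirtualProtect(void* lpAddress, nuint dwSize, uint flNewProtect, out uint lpflOldProtect);

    // Allocates 'byteCount' bytes placed immediately before a PAGE_NOACCESS guard
    // page, so reading even one byte past the end faults immediately instead of
    // silently reading adjacent memory.
    public static Span<byte> AllocateWithTrailingGuardPage(int byteCount)
    {
        const int PageSize = 4096;
        int dataPages = (byteCount + PageSize - 1) / PageSize;
        nuint totalSize = (nuint)((dataPages + 1) * PageSize);

        byte* block = (byte*)VirtualAlloc(null, totalSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (block == null)
        {
            throw new OutOfMemoryException();
        }

        // Revoke access to the final page; it acts as the guard page.
        VirtualProtect(block + dataPages * PageSize, PageSize, PAGE_NOACCESS, out _);

        // Return a span whose last element sits flush against the guard page.
        return new Span<byte>(block + dataPages * PageSize - byteCount, byteCount);
    }
}
```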

@GrabYourPitchforks (Member, Author)

Benchmark results

Full benchmark app: https://github.com/GrabYourPitchforks/ConsoleApplicationBenchmark/blob/471fa6822a5c5e5fecfc14196e54dbfe82a0f20c/ConsoleAppBenchmark/IndexOfAnyRunner.cs

[Benchmark]
public int SliceInALoop()
{
    var haystack = _haystack;
    _ = haystack.Length; // allow JIT to prove not null
    ReadOnlySpan<T> haystackSpan = haystack;

    var needles = _needles;
    _ = needles.Length; // allow JIT to prove not null
    ReadOnlySpan<T> needlesSpan = needles;

    while (true)
    {
        int idx = haystackSpan.IndexOfAny(needlesSpan);
        if (idx < 0)
        {
            return haystackSpan.Length; // length of final slice
        }
        haystackSpan = haystackSpan.Slice(idx + 1);
    }
}
| Method | Toolchain | HaystackLength | LastNeedleMatches | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| SliceInALoop | idxofany | 4 | False | 18.33 ns | 28.879 ns | 1.583 ns | 0.64 | 0.06 |
| SliceInALoop | main | 4 | False | 28.79 ns | 21.655 ns | 1.187 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 4 | True | 17.12 ns | 4.562 ns | 0.250 ns | 0.38 | 0.01 |
| SliceInALoop | main | 4 | True | 44.52 ns | 12.959 ns | 0.710 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 16 | False | 65.40 ns | 16.572 ns | 0.908 ns | 0.68 | 0.02 |
| SliceInALoop | main | 16 | False | 96.56 ns | 21.013 ns | 1.152 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 16 | True | 60.07 ns | 9.261 ns | 0.508 ns | 0.46 | 0.00 |
| SliceInALoop | main | 16 | True | 131.59 ns | 10.865 ns | 0.596 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 64 | False | 236.47 ns | 38.295 ns | 2.099 ns | 0.63 | 0.03 |
| SliceInALoop | main | 64 | False | 378.68 ns | 309.722 ns | 16.977 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 64 | True | 195.13 ns | 24.816 ns | 1.360 ns | 0.22 | 0.00 |
| SliceInALoop | main | 64 | True | 878.64 ns | 9.714 ns | 0.532 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 256 | False | 901.88 ns | 1,002.251 ns | 54.937 ns | 0.62 | 0.04 |
| SliceInALoop | main | 256 | False | 1,450.75 ns | 54.688 ns | 2.998 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 256 | True | 757.16 ns | 39.841 ns | 2.184 ns | 0.07 | 0.00 |
| SliceInALoop | main | 256 | True | 11,050.08 ns | 419.823 ns | 23.012 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 1024 | False | 3,561.17 ns | 4,301.921 ns | 235.803 ns | 0.62 | 0.05 |
| SliceInALoop | main | 1024 | False | 5,781.88 ns | 2,345.776 ns | 128.580 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 1024 | True | 2,988.88 ns | 11.492 ns | 0.630 ns | 0.02 | 0.00 |
| SliceInALoop | main | 1024 | True | 167,176.09 ns | 29,612.891 ns | 1,623.182 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 4096 | False | 14,217.25 ns | 11,186.894 ns | 613.191 ns | 0.62 | 0.02 |
| SliceInALoop | main | 4096 | False | 22,928.52 ns | 4,179.188 ns | 229.075 ns | 1.00 | 0.00 |
| SliceInALoop | idxofany | 4096 | True | 11,919.86 ns | 730.079 ns | 40.018 ns | 0.005 | 0.00 |
| SliceInALoop | main | 4096 | True | 2,495,866.67 ns | 136,852.103 ns | 7,501.326 ns | 1.000 | 0.00 |

Discussion

This benchmark tests IndexOfAny called in a loop (a typical use case) for various haystack lengths, alternating whether the first or the last needle is the one discovered. In particular, it shows how the original IndexOfAny logic, when called in a loop, devolves into an O(n^2) operation if the last needle in the collection is the one that keeps being found, and how the new logic maintains an O(n) worst-case guarantee for the loop.

Don't read too deeply into the benchmark showing that the new IndexOfAny logic consistently outperforms the original logic. This is because the benchmark is specifically written to ensure the needles appear very frequently within the haystack, which results in many calls to IndexOfAny, and the overhead of setting up the SIMD loop on each entry is showing up as a large constant factor. Ignore this for now, as the really important takeaway is the worst-case algorithmic complexity as discussed in the previous paragraph.

// and when this is called in a loop, we want the entire loop to be bounded by O(n * l)
// rather than O(n^2 * l).

if (typeof(T).IsValueType)

@GrabYourPitchforks, for chars, should we do the same thing string.IndexOfAny does, using ProbabilisticMap? Then presumably string.IndexOfAny could just always delegate to MemoryExtensions.IndexOfAny rather than special-casing specific lengths?

It also occurs to me that for really large sets, we could invert the vectorization, and for each character in the haystack, vectorize the searching of the needles. Presumably that would require a relatively large set of input characters, though, e.g. at least 8.
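
For context on the ProbabilisticMap suggestion, here is a rough sketch of that style of filter (an illustration of the general technique, not the actual ProbabilisticMap used by string.IndexOfAny): build a small bitmap from the needles up front, and only fall back to the exact O(l) needle scan when the bitmap says the current haystack character might match.

```csharp
using System;

// Illustrative sketch of a probabilistic filter: a 256-bit bitmap keyed on the
// low byte of each needle. Most non-matching haystack chars are rejected by a
// single bit test; possible matches still require an exact needle search.
internal static class ProbabilisticFilterSketch
{
    public static int IndexOfAnyWithFilter(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needles)
    {
        Span<uint> bitmap = stackalloc uint[8];           // 8 * 32 = 256 bits
        foreach (char needle in needles)
        {
            int lowByte = needle & 0xFF;
            bitmap[lowByte >> 5] |= 1u << (lowByte & 31); // mark "maybe a needle"
        }

        for (int i = 0; i < haystack.Length; i++)
        {
            int lowByte = haystack[i] & 0xFF;
            bool maybeNeedle = (bitmap[lowByte >> 5] & (1u << (lowByte & 31))) != 0;

            // Only pay for the exact O(l) needle scan when the cheap filter says "maybe".
            if (maybeNeedle && needles.IndexOf(haystack[i]) >= 0)
            {
                return i;
            }
        }

        return -1;
    }
}
```

The real map is more sophisticated about how it hashes characters into the bitmap, but the shape of the win is the same: most haystack positions are rejected with a couple of bit operations regardless of the needle count.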
