
Perf: Collection Count() is Slower in Core than the CLR #11971

Closed
RealDotNetDave opened this issue Feb 4, 2019 · 25 comments
Labels: area-VM-coreclr, tenet-performance (Performance related issue)
Milestone: 3.0

Comments

@RealDotNetDave

I have been benchmarking performance of Count() for a collection. I am seeing that Core is a lot slower than the CLR.

Benchmark | VM (ns) | Laptop (ns)
--- | ---: | ---:
Count() (Clr) | 8.138 | 4.847
Count() (Core) | 18.073 | 12.355
Count() with Predicate (Clr) | 275,516.47 | 183,284.90
Count() with Predicate (Core) | 307,900.18 | 217,629.30

The VM above is an Azure VM: Size: Standard D8s v3 (8 vcpus, 32 GB memory)
Benchmarks were done with BenchmarkDotNet.

@RealDotNetDave RealDotNetDave changed the title Collection Count() is Slower in Core than the CLR Perf: Collection Count() is Slower in Core than the CLR Feb 4, 2019
@stephentoub
Member

Which collection?

Which "Count()"? LINQ's?

What OS?

What version of .NET Core?

Can you share the benchmark?

@RealDotNetDave
Author

RealDotNetDave commented Feb 4, 2019

Thanks for the quick response. Here is the info you requested:

Test was done with List<T> using Count() from LINQ. The collection has 10,000 items in it.

Tests were done with Clr 4.7.2 and Core 2.2 using BenchmarkDotNet.

The benchmark code looks like this:
_personCollection.Count() > 0;
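A fuller version of that benchmark might look like the following (a hedged sketch: only the one line above was shared, so the `Person` type, the `HasItems` name, and the setup are assumptions):

```csharp
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Person { public string Name { get; set; } }

[ClrJob]
[CoreJob]
public class CountBenchmark
{
    private List<Person> _personCollection;

    [GlobalSetup]
    public void Setup() =>
        _personCollection = Enumerable.Range(0, 10_000)
                                      .Select(i => new Person { Name = i.ToString() })
                                      .ToList();

    // LINQ's Count() on a List<T> takes the ICollection<T> fast path,
    // so this mostly measures interface dispatch, not enumeration.
    [Benchmark]
    public bool HasItems() => _personCollection.Count() > 0;

    static void Main() => BenchmarkRunner.Run<CountBenchmark>();
}
```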

Here are the computer stats (the timings above are in ns):
Laptop
OS Name: Microsoft Windows 10 Pro Version 10.0.17134 Build 17134
System Manufacturer: Microsoft Corporation
System Model: Surface Laptop
System Type: x64-based PC
Processor: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz, 2496 Mhz, 2 Core(s), 4 Logical Processor(s)
BIOS Version/Date: Microsoft Corporation 137.2307.769, 8/3/2018
SMBIOS Version: 3.1
Locale: United States
Installed Physical Memory (RAM): 16.0 GB
Total Virtual Memory: 18.3 GB

Virtual Machine
Azure VM Size: Standard D8s v3 (8 vcpus, 32 GB memory)
OS Name: Microsoft Windows 10 Pro Version 10.0.17134 Build 17134
System Manufacturer: Microsoft Corporation
System Model: Virtual Machine
System Type: x64-based PC
Processor: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, 2295 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date: American Megatrends Inc. 090007, 6/2/2017
SMBIOS Version: 2.3
Locale: United States
Installed Physical Memory (RAM): 32.0 GB
Total Virtual Memory: 36.7 GB

@EgorBo
Member

EgorBo commented Feb 4, 2019

Wrote a small benchmark (not sure it's what @RealDotNetDave meant but anyway):

using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[CoreJob(true)]
[ClrJob]
public class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<Program>();
    }

    public IEnumerable<string> _list;

    [GlobalSetup]
    public void Setup() => _list = Enumerable.Range(0, 10000).Select(i => i.ToString()).ToList();

    [Benchmark]
    public bool Count() => _list.Count() > 0;

    [Benchmark]
    public bool IsICollectionT() => _list is ICollection<string>;
}
// * Summary *

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.523 (1803/April2018Update/Redstone4)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
Frequency=3609377 Hz, Resolution=277.0561 ns, Timer=TSC
.NET Core SDK=3.0.100-preview-010184
  [Host] : .NET Core 2.2.1 (CoreCLR 4.6.27207.03, CoreFX 4.6.27207.03), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3260.0
  Core   : .NET Core 2.2.1 (CoreCLR 4.6.27207.03, CoreFX 4.6.27207.03), 64bit RyuJIT


         Method |  Job | Runtime |      Mean |     Error |    StdDev | Ratio |
--------------- |----- |-------- |----------:|----------:|----------:|------:|
          Count |  Clr |     Clr |  4.300 ns | 0.0020 ns | 0.0019 ns |  0.42 |
          Count | Core |    Core | 10.226 ns | 0.0126 ns | 0.0105 ns |  1.00 |
                |      |         |           |           |           |       |
 IsICollectionT |  Clr |     Clr |  1.598 ns | 0.0018 ns | 0.0014 ns |  0.76 |
 IsICollectionT | Core |    Core |  2.103 ns | 0.0013 ns | 0.0011 ns |  1.00 |

^ .NET Core 2.2 (3.0-master has the same numbers)
The Count<T> implementation is almost the same (at least the fast path) in .NET Core and .NET Fx.

@RealDotNetDave
Author

So it looks like you are seeing the same thing as I am.

@jkotas
Member

jkotas commented Feb 7, 2019

Here are a few diffs that I have noticed:

  • Better optimizations of the loop in Count<> slow path (less stack spilling) are regressing the fast path (more registers saved).

  • Instruction mix used for setup of interface dispatch:

Before:

mov r11,7FF7DE100020h
mov     rax,qword ptr [r11]
cmp     dword ptr [rcx],ecx
add     rsp,20h

After (seems to be slower in microbenchmark):

mov r11,7FFF92EF0030h
mov     rax,qword ptr [00007fff`92ef0030]	<- notice that this instruction takes more bytes than before, but it avoids the data-flow dependency
cmp     dword ptr [rcx],ecx
add     rsp,20h

@AndyAyersMS
Member

Wonder if the interface dispatch change was a result of dotnet/coreclr#16267?

@jkotas are you still investigating?

@jkotas
Member

jkotas commented Feb 8, 2019

I do not plan to do any more investigations on this.

@AndyAyersMS AndyAyersMS self-assigned this Feb 8, 2019
@AndyAyersMS
Member

Looks more like the JIT has trouble with the current expression of Count. I'll have to dig in and figure out exactly what is tripping it up.

Looking at some variations in implementing Count -- Count2 is similar but peels off the slow and exception cases, Count3 is similar to what desktop does, Count4 is desktop plus peeling -- Core does better on my box for all but the one we use now (not a 100% fair comparison because I can't check for IIListProvider, but we don't get that far in this test anyway).

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
  [Host] : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
  Clr    : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3740.0
  Core   : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT

 Method |  Job | Runtime |      Mean |     Error |    StdDev | Ratio | RatioSD |
------- |----- |-------- |----------:|----------:|----------:|------:|--------:|
  Count |  Clr |     Clr |  6.963 ns | 0.0359 ns | 0.0336 ns |  0.37 |    0.00 |
  Count | Core |    Core | 18.933 ns | 0.2310 ns | 0.2161 ns |  1.00 |    0.00 |
 Count2 |  Clr |     Clr |  7.528 ns | 0.1878 ns | 0.3032 ns |  1.11 |    0.05 |
 Count2 | Core |    Core |  6.927 ns | 0.1084 ns | 0.0961 ns |  1.00 |    0.00 |
 Count3 |  Clr |     Clr |  7.169 ns | 0.2585 ns | 0.2418 ns |  1.10 |    0.04 |
 Count3 | Core |    Core |  6.566 ns | 0.1467 ns | 0.1225 ns |  1.00 |    0.00 |
 Count4 |  Clr |     Clr |  7.166 ns | 0.0330 ns | 0.0293 ns |  1.11 |    0.03 |
 Count4 | Core |    Core |  6.509 ns | 0.1598 ns | 0.1709 ns |  1.00 |    0.00 |
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[CoreJob(true)]
[ClrJob]

public class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<Program>();
    }

    public IEnumerable<string> _list;

    [GlobalSetup]
    public void Setup() => _list = Enumerable.Range(0, 10000).Select(i => i.ToString()).ToList();

    [Benchmark]
    public bool Count() => _list.Count() > 0;

    [Benchmark]
    public bool Count2() => _list.Count2() > 0;

    [Benchmark]
    public bool Count3() => _list.Count3() > 0;

    [Benchmark]
    public bool Count4() => _list.Count4() > 0;
}

static class Extensions
{
    public static int Count2<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null)
        {
            CountError();
        }
        
        if (source is ICollection<TSource> collectionoft)
        {
            return collectionoft.Count;
        }
        
        if (source is ICollection collection)
        {
            return collection.Count;
        }

        return Count2Slow(source);
    }

    public static int Count2Slow<TSource>(IEnumerable<TSource> source)
    {
        int count = 0;
        using (IEnumerator<TSource> e = source.GetEnumerator())
        {
            checked
            {
                while (e.MoveNext())
                {
                    count++;
                }
            }
        }
        
        return count;
    }

    public static int Count3<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null) CountError();
        ICollection<TSource> collectionoft = source as ICollection<TSource>;
        if (collectionoft != null) return collectionoft.Count;
        ICollection collection = source as ICollection;
        if (collection != null) return collection.Count;
        int count = 0;
        using (IEnumerator<TSource> e = source.GetEnumerator())
        {
            checked
            {
                while (e.MoveNext()) count++;
            }
        }
        return count;
    }

    public static int Count4<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null) CountError();
        ICollection<TSource> collectionoft = source as ICollection<TSource>;
        if (collectionoft != null) return collectionoft.Count;
        ICollection collection = source as ICollection;
        if (collection != null) return collection.Count;

        return Count4Slow<TSource>(source);
    }

    public static int Count4Slow<TSource>(IEnumerable<TSource> source)
    {
        int count = 0;
        using (IEnumerator<TSource> e = source.GetEnumerator())
        {
            checked
            {
                while (e.MoveNext()) count++;
            }
        }
        return count;
    }

    public static void CountError()
    {
        throw new Exception("source");
    }
}

@stephentoub
Member

Count3 is similar to what desktop does

So since this basically differs in:

ICollection<TSource> collectionoft = source as ICollection<TSource>;
if (collectionoft != null) return collectionoft.Count;

vs

if (source is ICollection<TSource> collectionoft)
    return collectionoft.Count;

the JIT is stumbling over the IL pattern generated by the latter? That pattern has started to be used fairly aggressively across corefx, and coreclr in some places, so if the JIT is having more difficulties with it, it'd be great to get that fixed.
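As a side-by-side reference, the two source shapes under discussion behave identically; only the IL Roslyn emits for them differs (minimal sketch; the helper names are illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;

static class PatternShapes
{
    // Pattern-matching form: release-mode Roslyn of this era emits
    // isinst / dup / stloc / brfalse, introducing the extra dup.
    public static int CountPattern<T>(IEnumerable<T> source)
    {
        if (source is ICollection<T> c)
            return c.Count;
        return source.Count();
    }

    // Explicit as-cast form: compiles to isinst / stloc / ldloc / brfalse,
    // with no dup instruction for the JIT to clean up.
    public static int CountAsCast<T>(IEnumerable<T> source)
    {
        ICollection<T> c = source as ICollection<T>;
        if (c != null)
            return c.Count;
        return source.Count();
    }
}
```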

@jkotas
Member

jkotas commented Feb 9, 2019

the JIT is stumbling over the IL pattern generated by the latter

Right, the IL generated for this pattern looks like this:

ldarg.0
isinst     class [mscorlib]System.Collections.Generic.ICollection`1<!!0>
dup
stloc.0
brfalse.s  IL_XXX

ldloc.0
callvirt   instance int32 class [mscorlib]System.Collections.Generic.ICollection`1<!!0>::get_Count()
...

The problem is the dup instruction. The dup IL instruction is not as cheap as it looks. It causes the JIT to introduce a new local. As far as the JIT is concerned, the above IL is equivalent to:

.local temp;

ldarg.0
isinst     class [mscorlib]System.Collections.Generic.ICollection`1<!!0>
stloc temp
ldloc temp
ldloc temp
stloc.0
brfalse.s  IL_XXX

ldloc.0
callvirt   instance int32 class [mscorlib]System.Collections.Generic.ICollection`1<!!0>::get_Count()
...

The JIT optimizations have to work hard to get rid of the extra local and figure out that it is redundant with the other local that is there already. We can fix the JIT to do that, but it would be better if Roslyn avoids generating the problematic dup instruction here, i.e. generate the following IL:

ldarg.0
isinst     class [mscorlib]System.Collections.Generic.ICollection`1<!!0>
stloc.0
ldloc.0
brfalse.s  IL_XXX

ldloc.0
callvirt   instance int32 class [mscorlib]System.Collections.Generic.ICollection`1<!!0>::get_Count()
...

cc @jaredpar

@jaredpar
Member

jaredpar commented Feb 9, 2019

@agocke who has been looking at this recently.

@AndyAyersMS
Member

AndyAyersMS commented Feb 9, 2019

The generated code for Count and Count2 (which look like they should have the same IL constructs for the fast paths) shows more prolog/epilog pressure in Count, but given the path length we're testing here I'm surprised that matters so much for perf. So there may be something else going on here too.

@AndyAyersMS
Member

Ah, Count2 doesn't have dup in my case -- maybe I'm using a different CSC version.

@jaredpar
Member

jaredpar commented Feb 9, 2019

@AndyAyersMS make sure you specify /release otherwise the dup won't be emitted.

@AndyAyersMS
Member

I'm building the code above with the 3.0 preview 2 sdk, via dotnet build -c Release. It is invoking

 Microsoft (R) Visual C# Compiler version 3.0.19.6714 (359844cc)

Looks like this passes /debug- /optimize+ to csc ... is that not sufficient?

@AndyAyersMS
Member

AndyAyersMS commented Feb 10, 2019

With current master I don't see a lot of codegen or perf diffs for the dup and non-dup versions. Here Count2c is written to more or less exactly mimic Count.

Count2a and Count2b factor out the slow loop; Count2a throws via a helper; Count2 is as above.

;; 3.0 preview 2
D:\bugs\22400\ex>dotnet run -c Release
  Count: 1987.81ms
  Count2: 870.48ms
  Count2a: 1437.34ms
  Count2b: 1417.22ms
  Count2c: 1373.82ms

;; current master
  Count: 1555.89ms
  Count2: 933.90ms
  Count2a: 1385.73ms
  Count2b: 1417.63ms
  Count2c: 1532.44ms

Via perfview, the big difference seen in Count2 versus the others seemingly comes from it only testing for two interfaces instead of three -- once we add in a third interface test it seems like we start missing fast lookup and have to resolve via the JitGenericHandleCache and the cost of the runtime method handle access goes way up. It doesn't seem to matter that the first interface test succeeds and so we don't even execute the other tests.

@AndyAyersMS
Member

AndyAyersMS commented Feb 10, 2019

As far as Count vs Count2 goes, I think what happens is something like the following.

Generic methods are given 4 special dictionary slots that can be used for fast lookup. Here one of these gets used for the type parameter, and the others are used for various types and methods and so on that depend on the type parameter. These are allocated on demand during jit time as the jit runs and calls back into the runtime for various constructs.

At runtime, if a handle request isn't in one of these slots, it is looked up in a global hash table.

In Count there are 6 dependent items -- 3 interface types and the 3 methods they invoke. When we jit we selectively follow control flow in the method in some DFS manner, and in the case of Count the jit happens to import all 3 type tests before any of the associated calls, as the importer tends to follow taken branches in its DFS. Since there are 3 type tests in the method and they are reachable from entry by taken branches, they fill up the remaining 3 special slots, and when the jit circles back to generating code for the called methods, they don't get special slots. So accessing them is slower at runtime.

If we look at Count2 there are only two type tests, and so the first method also gets a preferred slot, and this is the one we end up invoking. So perf is improved.

For isinst; brfalse we might be able to tweak the importer so it looks at the fall-through code first, presuming the earliest encountered test is the most important (though in these cases there are other IL instructions in between these two). Or maybe there is some way to arrange for some generic methods to get more slots...?
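The slot-pressure scenario can be sketched in source form (hypothetical illustration: `IReadOnlyCollection<T>` stands in for the internal IIListProvider, and the lookup counts come from the explanation above, not from anything visible in C#):

```csharp
using System.Collections;
using System.Collections.Generic;

static class SlotPressure
{
    // In a shared generic instantiation, each interface type test and each
    // interface method call below needs a runtime handle lookup: six
    // dependent items competing for the four fast dictionary slots (one of
    // which already holds the type parameter). If the three type tests are
    // imported first, the calls are left to go through the slow hash lookup.
    public static int CountLike<T>(IEnumerable<T> source)
    {
        if (source is ICollection<T> c) return c.Count;          // test 1 + call 1
        if (source is IReadOnlyCollection<T> r) return r.Count;  // test 2 + call 2
        if (source is ICollection nc) return nc.Count;           // test 3 + call 3

        int count = 0;
        foreach (T item in source) count++;                      // slow path
        return count;
    }
}
```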

@jkotas
Member

jkotas commented Feb 10, 2019

Generic methods are given 4 special dictionary slots that can be used for fast lookup

This limit came from fragile NGen. With R2R, there should be nothing fundamental preventing us from making the number of dictionary slots for generic methods dynamic - give each generic method the exact number of slots it needs to run efficiently.

@jaredpar
Member

@AndyAyersMS

Looks like this passes /debug- /optimize+ to csc ... is that not sufficient?

That should indeed be sufficient here. I'm not sure why you don't see the dup ...

@AndyAyersMS
Member

I also plan to look into the perf difference between 3.0 preview 2 and the current master.

@AndyAyersMS
Member

AndyAyersMS commented Feb 14, 2019

I have a hypothesis about the 3.0 preview 2 vs current master diffs, related to the above. If correct, this can also help explain the Core vs Desktop perf difference that initiated this issue.

It appears that when we rejit a ready-to-run precompiled method, the dictionary slots used by the ready-to-run code can't be shared by jitted code. So we run out of slots "faster" in a tier1 rejit if we prejit than we would if we hadn't prejitted, and this impacts perf of the tier1 code.

My guess is that in the 3.0 preview 2 SDK we must be prejitting System.Linq and so it suffers from this effect -- the R2R version is slow, and uses up slots, so the Tier1 version is not as fast as it would be if we hadn't prejitted at all.

When I run locally I'm likely not using a prejitted System.Linq and (usually) am not prejitting my test code. So I don't see this extra cost.

We can simulate this with a simple test program.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;

class P
{
    const int N = 100_000_000;
    public IEnumerable<string> _list;

    [MethodImpl(MethodImplOptions.NoOptimization)]
    public static void Main()
    {
        var p = new P();
        p._list = Enumerable.Range(0, 10000).Select(i => i.ToString()).ToList();
        p.Count2();

        Stopwatch stopWatch = new Stopwatch();
        stopWatch.Start();        
        for (int i = 0; i < N; i++)
        {
            p.Count2();
        }
        stopWatch.Stop();
        
        TimeSpan ts = stopWatch.Elapsed;
        Console.WriteLine($"  Count2: {ts.TotalMilliseconds:F2}ms");
    }

    public bool Count2() => _list.Count2() > 0;
}

static class Extensions
{
    public static int Count2<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null)
        {
            return 0;
        }
        
        if (source is ICollection<TSource> collectionoft)
        {
            return collectionoft.Count;
        }

        int count = 0;
        using (IEnumerator<TSource> e = source.GetEnumerator())
        {
            checked
            {
                while (e.MoveNext())
                {
                    count++;
                }
            }
        }
        
        return count;
    }
}

I get the following perf numbers for local runs via corerun:

Prejitted? | Tiered? | Time (ms) | Notes
---------- | ------- | ---------:| -----
No         | No      |      1034 |
No         | Yes     |      1091 | small bit of tiering overhead
Yes        | No      |      2032 | large bit of R2R overhead
Yes        | Yes     |      1543 | R2R uses up slots and slows down Tier1

One would expect the last row to be closer to 1091.

So if we do work to adjust dictionary slot counts for R2R to make sure we don't run out, we should also look into how R2R shares slots with rejitted Tier1 code.

Desktop doesn't use R2R and doesn't rejit -- it will either be jitting or using NGEN prejitted code, so it doesn't hit this case, and (assuming NGEN has overhead here similar to jitted code) it should see "fast" perf.

Core 2.2 uses R2R and doesn't rejit by default, so it will see the slowest case.
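For anyone wanting to reproduce the four rows above locally, the toggles in this era are runtime environment variables (a sketch assuming Windows cmd syntax and the `COMPlus_` knob names; `bench.dll` is a placeholder):

```shell
:: Jit-only baseline: ignore ReadyToRun precompiled code, no tiering
set COMPlus_ReadyToRun=0
set COMPlus_TieredCompilation=0
corerun bench.dll

:: The problematic row: honor R2R code and rejit at tier1
set COMPlus_ReadyToRun=1
set COMPlus_TieredCompilation=1
corerun bench.dll
```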

cc @kouvel

@jkotas
Member

jkotas commented Feb 14, 2019

If the tiered JIT is on, we may want to modify R2R to never use the fast slots, reserving them for use by the tiered JIT code only. It should be an easy change to make (it is just a change in the runtime; it does not affect the file format or the compiler).

cc @fadimounir

@AndyAyersMS
Member

To sum up, there seem to be a few avenues for follow-up:

  • consider increasing number of fast slots for generic methods
  • look into how prejitted and rejitted share or don't share fast slots
  • consider not allowing R2R to use fast slots
  • if the number of fast slots is limited, try to maximize effective fast slot usage (e.g. dotnet/coreclr#22601)

I assume similar things apply to generic types, though perhaps less often? It might be worth some investigation. Perhaps we could monitor how often we end up checking the generic handle cache.

Since this is mostly work on the runtime side, I'm going to update the issue area.

@AndyAyersMS AndyAyersMS removed their assignment Feb 14, 2019
jkotas referenced this issue in jkotas/coreclr Feb 15, 2019
Disable use of fast dictionary slots for R2R images when tiered JITing is enabled.

Fixes #22400
jkotas referenced this issue in dotnet/coreclr Feb 15, 2019
Disable use of fast dictionary slots for R2R images when tiered JITing is enabled.

Fixes #22400
@AndyAyersMS
Member

@sergiy-k @fadimounir this is the issue I was referring to...

@RealDotNetDave
Author

Wow, I really like all the back and forth on this. Will this be fixed in .NET 3.x?

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
7 participants