Perf: Collection Count() is Slower in Core than the CLR #11971
Which collection? Which "Count()"? LINQ's? What OS? What version of .NET Core? Can you share the benchmark? |
Thanks for the quick response. Here is the info you requested: the test was done on a List<> using LINQ's Count(). The collection has 10,000 items in it. Tests were run against CLR 4.7.2 and Core 2.2 using BenchmarkDotNet. The benchmark code looks like this: Here are the timings (in ns) from the virtual machine and laptop: |
Wrote a small benchmark (not sure it's what @RealDotNetDave meant but anyway):
[CoreJob(true)]
[ClrJob]
public class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<Program>();
}
public IEnumerable<string> _list;
[GlobalSetup]
public void Setup() => _list = Enumerable.Range(0, 10000).Select(i => i.ToString()).ToList();
[Benchmark]
public bool Count() => _list.Count() > 0;
[Benchmark]
public bool IsICollectionT() => _list is ICollection<string>;
}
^ .NET Core 2.2 (3.0-master has the same numbers) |
So it looks like you are seeing the same thing as I am. |
Here are a few diffs that I have noticed:
Before:
After (seems to be slower in microbenchmark):
|
Wonder if the interface dispatch change was a result of dotnet/coreclr#16267? @jkotas are you still investigating? |
I do not plan to do any more investigations on this. |
Looks more like the JIT has trouble with the current expression of Count. Looking at some variations in implementing it:
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.18323
Intel Core i7-4770HQ CPU 2.20GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
[Host] : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.8.3740.0
Core : .NET Core 3.0.0-preview-27324-5 (CoreCLR 4.6.27322.0, CoreFX 4.7.19.7311), 64bit RyuJIT
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
[CoreJob(true)]
[ClrJob]
public class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<Program>();
}
public IEnumerable<string> _list;
[GlobalSetup]
public void Setup() => _list = Enumerable.Range(0, 10000).Select(i => i.ToString()).ToList();
[Benchmark]
public bool Count() => _list.Count() > 0;
[Benchmark]
public bool Count2() => _list.Count2() > 0;
[Benchmark]
public bool Count3() => _list.Count3() > 0;
[Benchmark]
public bool Count4() => _list.Count4() > 0;
}
static class Extensions
{
public static int Count2<TSource>(this IEnumerable<TSource> source)
{
if (source == null)
{
CountError();
}
if (source is ICollection<TSource> collectionoft)
{
return collectionoft.Count;
}
if (source is ICollection collection)
{
return collection.Count;
}
return Count2Slow(source);
}
public static int Count2Slow<TSource>(IEnumerable<TSource> source)
{
int count = 0;
using (IEnumerator<TSource> e = source.GetEnumerator())
{
checked
{
while (e.MoveNext())
{
count++;
}
}
}
return count;
}
public static int Count3<TSource>(this IEnumerable<TSource> source)
{
if (source == null) CountError();
ICollection<TSource> collectionoft = source as ICollection<TSource>;
if (collectionoft != null) return collectionoft.Count;
ICollection collection = source as ICollection;
if (collection != null) return collection.Count;
int count = 0;
using (IEnumerator<TSource> e = source.GetEnumerator())
{
checked
{
while (e.MoveNext()) count++;
}
}
return count;
}
public static int Count4<TSource>(this IEnumerable<TSource> source)
{
if (source == null) CountError();
ICollection<TSource> collectionoft = source as ICollection<TSource>;
if (collectionoft != null) return collectionoft.Count;
ICollection collection = source as ICollection;
if (collection != null) return collection.Count;
return Count4Slow<TSource>(source);
}
public static int Count4Slow<TSource>(IEnumerable<TSource> source)
{
int count = 0;
using (IEnumerator<TSource> e = source.GetEnumerator())
{
checked
{
while (e.MoveNext()) count++;
}
}
return count;
}
public static void CountError()
{
throw new ArgumentNullException(nameof(source));
}
} |
So since this basically differs in:
vs
the JIT is stumbling over the IL pattern generated by the latter? That pattern has started to be used fairly aggressively across corefx, and coreclr in some places, so if the JIT is having more difficulties with it, it'd be great to get that fixed. |
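Judging from the Count2/Count3 variants above, the elided before/after snippets presumably contrast the classic `as`-cast form with the C# 7 pattern-matching form. A self-contained sketch of the two (semantically identical; only the generated IL differs):

```csharp
using System.Collections;
using System.Collections.Generic;

static class CountVariants
{
    // Older style: 'as' cast plus explicit null check (as in Count3/Count4 above).
    public static int CountViaAs<TSource>(IEnumerable<TSource> source)
    {
        ICollection<TSource> c = source as ICollection<TSource>;
        if (c != null) return c.Count;
        int n = 0;
        foreach (var _ in source) n++; // fallback: enumerate and count
        return n;
    }

    // Newer style: pattern matching (as in Count2 above). Same behavior,
    // but Roslyn lowers it with an extra temporary and a dup instruction.
    public static int CountViaIs<TSource>(IEnumerable<TSource> source)
    {
        if (source is ICollection<TSource> c) return c.Count;
        int n = 0;
        foreach (var _ in source) n++;
        return n;
    }
}
```

Both paths return the collection's Count when the fast-path cast succeeds; the difference the thread is chasing is purely in the IL the compiler emits for that cast.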
Right, the IL generated for this pattern looks like this:
The problem is the extra dup instruction and the redundant local it introduces.
The JIT optimizations have to work hard to get rid of the extra local and figure out that it is redundant with the other local that is there already. We can fix the JIT to do that, but it would be better if Roslyn avoids generating the problematic dup instruction here, i.e. generate the following IL:
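Expressed back in C# terms, the lowering being described amounts to roughly the following (a sketch only; the exact temporaries Roslyn emits may differ):

```csharp
using System.Collections.Generic;

static class LoweringSketch
{
    // 'if (source is ICollection<TSource> c) return c.Count;' lowers roughly to:
    public static int FastCount<TSource>(IEnumerable<TSource> source)
    {
        ICollection<TSource> tmp = source as ICollection<TSource>; // isinst
        ICollection<TSource> c = tmp;                              // dup + stloc: the redundant second local
        if (c != null) return c.Count;                             // brfalse.s / callvirt get_Count
        return -1; // slow enumeration path elided for brevity
    }
}
```

The JIT has to prove `tmp` and `c` are the same value to generate code as tight as the `as`-cast form produces directly.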
cc @jaredpar |
@agocke who has been looking at this recently. |
The generated code for Count and Count2 (which look like they should have the same IL constructs for the fast paths) shows more prolog/epilog pressure in Count, but given the path length we're testing here I'm surprised that matters so much for perf. So there may be something else going on here too. |
Ah, |
@AndyAyersMS make sure you specify |
I'm building the code above with the 3.0 preview 2 sdk, via
Looks like this passes
With current master I don't see a lot of codegen or perf diffs for the
Via perfview, the big difference seen in |
Generic methods are given 4 special dictionary slots that can be used for fast lookup. Here one of these gets used for the type parameter, and the others are used for various types, methods, and so on that depend on the type parameter. These are allocated on demand as the JIT runs and calls back into the runtime for various constructs. At runtime, if a handle request isn't in one of these slots, it is looked up in a global hash table. |
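As a rough illustration (not runtime internals), each construct below that depends on the type parameter can require a runtime handle lookup when the code is shared across reference-type instantiations, and each such lookup is a candidate for one of the method's fast dictionary slots:

```csharp
using System;
using System.Collections.Generic;

static class GenericDictionaryDemo
{
    // Hypothetical example method: every T-dependent construct here may need
    // a handle fetched via the generic dictionary in shared generic code.
    public static string Describe<T>(IEnumerable<T> source)
    {
        Type t = typeof(T);                     // type handle for T
        bool isColl = source is ICollection<T>; // handle for ICollection<T>
        var list = new List<T>();               // method table for List<T>
        return $"{t.Name}: {isColl}, {list.Count}";
    }
}
```

With only 4 fast slots per generic method, a method that needs more handles than that spills the rest into the global hash-table lookup path described above.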
This limit came from fragile NGen. With R2R, there should be nothing fundamental preventing us from making the number of dictionary slots for generic methods dynamic - give each generic method the exact number of slots it needs to run efficiently. |
That should indeed be sufficient here. I'm not sure why you don't see the |
I also plan to look into the perf difference between 3.0 preview 2 and the current master. |
I have a hypothesis about the 3.0 preview 2 vs current master diffs, related to the above. If correct, this can also help explain the Core vs Desktop perf difference that initiated this issue. It appears that when we rejit a ready to run precompiled method, the dictionary slots used by ready to run can't be shared by jitted code. So we run out of slots "faster" in a tier1 rejit if we prejit than we would if we hadn't prejitted, and this impacts perf of the tier1 code. My guess is that in the 3.0 preview 2 SDK we must be prejitting System.Linq and so it suffers from this effect -- the R2R version is slow, and uses up slots, so the Tier1 version is not as fast as it would be if we hadn't prejitted at all. When I run locally I'm likely not using a prejitted System.Linq and (usually) am not prejitting my test code. So I don't see this extra cost. We can simulate this with a simple test program.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
class P
{
const int N = 100_000_000;
public IEnumerable<string> _list;
[MethodImpl(MethodImplOptions.NoOptimization)]
public static void Main()
{
var p = new P();
p._list = Enumerable.Range(0, 10000).Select(i => i.ToString()).ToList();
p.Count2();
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
for (int i = 0; i < N; i++)
{
p.Count2();
}
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
Console.WriteLine($" Count2: {ts.TotalMilliseconds:F2}ms");
}
public bool Count2() => _list.Count2() > 0;
}
static class Extensions
{
public static int Count2<TSource>(this IEnumerable<TSource> source)
{
if (source == null)
{
return 0;
}
if (source is ICollection<TSource> collectionoft)
{
return collectionoft.Count;
}
int count = 0;
using (IEnumerator<TSource> e = source.GetEnumerator())
{
checked
{
while (e.MoveNext())
{
count++;
}
}
}
return count;
}
}
I get the following perf numbers for local runs via corerun:
One would expect that last row to be closer to 1091. So if we do work to adjust dictionary slot counts for R2R to make sure we don't run out, we should also look into how R2R shares slots with rejitted Tier1 code. Desktop doesn't use R2R and doesn't rejit -- it will either be jitting or using NGEN prejitted code, so it doesn't hit this case, and (assuming NGEN has overhead here similar to jitted code) it should see "fast" perf. Core 2.2 uses R2R and doesn't rejit by default, so it will see the slowest case. cc @kouvel |
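To separate these effects in local runs, the relevant CoreCLR knobs at the time were environment variables (names as of .NET Core 3.0; exact behavior is version-dependent):

```shell
# Disable tiered compilation entirely (JIT straight to optimized code)
export COMPlus_TieredCompilation=0

# Ignore ReadyToRun precompiled code so everything is jitted from scratch
export COMPlus_ReadyToRun=0
```

Toggling these independently lets you compare "R2R then tier1 rejit" against "jit only", which is the contrast the table above is getting at.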
If the tiered JIT is on, we may want to modify the R2R to never use the fast slots. Reserve the fast slots for use by the tiered JIT code only. It should be easy change to make (it is just a change in the runtime, it does not affect the file format or the compiler). cc @fadimounir |
To sum up, there seem to be a few avenues for follow-up:
I assume similar things apply to generic types, though perhaps less often? It might be worth some investigation. Perhaps we could monitor how often we end up checking the generic handle cache. Since this is mostly work on the runtime side, I'm going to update the issue area. |
Disable use of fast dictionary slots for R2R images when tiered JITing is enabled. Fixes #22400
@sergiy-k @fadimounir this is the issue I was referring to.... |
Wow, I really like all the back and forth on this. Will this be fixed in .NET 3.x? |
I have been benchmarking performance of Count() for a collection. I am seeing that Core is a lot slower than the CLR.
Method | VM (ns) | Laptop (ns)
Count() (Clr) | 8.138 | 4.847
Count() (Core) | 18.073 | 12.355
Count() with Predicate (Clr) | 275,516.47 | 183,284.90
Count() with Predicate (Core) | 307,900.18 | 217,629.30
The VM above is an Azure VM: Size: Standard D8s v3 (8 vcpus, 32 GB memory)
Benchmarks were done with BenchmarkDotNet.