String.GetHashCode() consistently slower on dotnetcore than .Net Framework #26779
.NET Core uses randomized string hashing. For example, with this program:

```csharp
using System;

class Program
{
    static void Main()
    {
        Console.WriteLine("hello".GetHashCode());
        Console.WriteLine("hello".GetHashCode());
        Console.WriteLine("hello".GetHashCode());
        Console.WriteLine("hello".GetHashCode());
    }
}
```

By default (https://docs.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/runtime/userandomizedstringhashalgorithm-element), every run of this program on .NET 4.7.2 will produce the same value.
On .NET Core, each run of the program will also produce the same output from the four lines, but that value will differ from one program execution to the next.
That randomization incurs some additional cost, so it's effectively by-design that this is a bit slower.
@stephentoub Why does .NET Core use randomized string hashing? Won't this make porting applications from .NET Framework harder? Also, shouldn't hashing be stable for the same input?
Security, prevention against DoS attacks, etc.
It is stable for the same input, for a given execution of a program.
@stephentoub Thanks. Is UseRandomizedStringHashAlgorithm applicable to .NET Core as well?
UseRandomizedStringHashAlgorithm exists in the .NET Framework to be able to opt in to randomized string hashing. There is no mechanism provided in .NET Core to opt out of it.
Is there something left to be done here, or are we going to call it a by-design perf difference?
Depends on whether there are important enough scenarios that don't care about the possibility of DoS (and their users and callers don't either) and those scenarios are unacceptably regressed. @SteveL-MSFT in what PowerShell scenarios is the perf regression occurring?
@danmosemsft We are discussing this in PowerShell/PowerShell#7112. The scenario is that we construct PSObject dynamically (everywhere in PowerShell); this uses a type cache where GetHashCode is used. Since PowerShell is built entirely on PSObject, everything becomes slower.
What is the perf contribution of GetHashCode in real-world scenarios? (%, numbers)
In the original issue opened by @iSazonov, he noticed that |
Was the entire 2x slowdown of |
@stephentoub The randomized seed should only introduce a small one-time setup cost; I don't think it can impact GetHashCode() this much. There were, however, a few performance regressions in the Marvin hash that were attributed to the removal of a couple of AggressiveInlining attributes, and I don't recall whether all of those regressions were found and fixed.
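For readers unfamiliar with the attribute being referred to, here is a minimal illustrative sketch (not the actual Marvin code) of how AggressiveInlining is typically applied to a small hot helper:

```csharp
using System.Runtime.CompilerServices;

static class HashHelpers
{
    // Small helpers like this run once per block of input; without AggressiveInlining the
    // JIT may emit a call per invocation, which is the kind of regression mentioned above.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint RotateLeft(uint value, int shift)
        => (value << shift) | (value >> (32 - shift));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static uint MixStep(uint state, uint input)
    {
        state += input;
        return RotateLeft(state, 20);
    }
}
```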
@karelz at this time, unless we get reports of other use cases being impacted, I don't consider this blocking for the PSCore 6.1 release, which also means that unless the community picks it up, we don't have bandwidth to investigate further. What profiler do you guys use with dotnetcore?
@SteveL-MSFT we own and use https://github.com/Microsoft/perfview/blob/master/documentation/Downloading.md on Windows. I am not sure of the Linux procedures.
For Linux perf investigations, use the same: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md
Yup, both of those are good links - if you have any questions, I can help.
I'm sure there are ways to make Marvin faster, even exploring vectorizing it. That doesn't negate anything I said.
I agree.
I guess all cmdlets that generate PSObject objects currently become twice as slow. I found that the Import-Csv cmdlet has become slower when parsing and analyzing large web server logs (~10 GB). It took several hours with PowerShell Core and less than an hour with Windows PowerShell. I can say that the Select-String and ConvertTo-Csv cmdlets are slow too. (We also have a problem with the web cmdlets, PowerShell/PowerShell#6199.) This makes it impossible to use PowerShell Core to handle large files (is 1 GB large these days?).
@iSazonov do you have traces which would show that 50%+ of the time is spent in GetHashCode?
@karelz I'll prepare and attach a sample trace.
I can also attest to the fact that Marvin is close to twice as slow, but it's not specific to .NET Core as pointed out, and is reproducible on .NET Framework. In my opinion we should allow a configuration setting to disable randomized hashing, just like we do on .NET Framework. It would be beneficial in porting code that is (unfortunately) dependent on a stable GetHashCode().

To reproduce on .NET Framework, you can change the app.config and notice it is in fact close to 2X slower.

app.config

```xml
<configuration>
  <runtime>
    <UseRandomizedStringHashAlgorithm enabled="1" />
  </runtime>
</configuration>
```

Program.cs

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class Program
{
    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int a = PoorBenchmark();
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
        Console.WriteLine(a);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int PoorBenchmark()
    {
        // Hash the same string one billion times so GetHashCode dominates the measurement.
        int a = 0;
        for (int j = 0; j < 1000000000; ++j)
        {
            a = "PowerShell6HasSlowerHashCode".GetHashCode();
        }
        return a;
    }
}
```
Link to perf trace file: https://ufile.io/nkzgh

```powershell
cd c:\tmp\

# Prepare data before starting PerfView
Write-Host "Creating source data..."
"random_numbers" | Out-File source.csv
Get-Random -SetSeed 1 | Out-Null
for ($i=0; $i -le 10000; $i++) {
    (Get-Random).ToString() | Out-File source.csv -Append
}

# Run after starting PerfView
Write-Host "Measuring Import-Csv performance over time..."
"index,duration_ms,bytes_consumed" | Out-File 7112_results.csv
for ($i=0; $i -le 1000; $i++) {
    $millis = (Measure-Command { Import-Csv source.csv }).TotalMilliseconds
    $memory = [System.GC]::GetTotalMemory($false)
    $i.ToString() + "," + $millis.ToString() + "," + $memory | Out-File 7112_results.csv -Append
}
Write-Host "Done"
```
@iSazonov thanks! Just to confirm: does the trace show that 50%+ of the time is spent in GetHashCode?
@iSazonov I took a quick look at the trace. The inclusive time spent in system.private.corelib!Marvin.ComputeHash32 is 1% of the trace time. It seems to me, then, that the next step is either for the PowerShell folks to compare against a .NET Framework profile to see what changed, or to analyze the scenario top down, and come back here if improvements are needed in .NET Core itself (especially if .NET Core is slower than .NET Framework doing the same work). I do see that much of the time is in collection operations - OrderedDictionary, ConditionalWeakTable, ArrayList, List, Hashtable. @SteveL-MSFT does this seem reasonable?
It might be worth looking, e.g., at whether any of these collections can be presized, at least roughly. 6.7% of inclusive time is spent simply constructing Hashtables, and 4.1% in resizing ArrayList (both of those types back OrderedDictionary). 3.3% of the time is in resizing List (used directly by PowerShell). I'm going to close this; please feel free to reopen if I mis-analyzed.
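To illustrate the presizing suggestion, a minimal sketch, assuming the element counts can be estimated up front; the shapes and capacities here are hypothetical and not taken from the trace:

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;

static class PresizedCollections
{
    public static void Build(string[] columnNames, int expectedRows)
    {
        // Supplying an initial capacity avoids the repeated rehash/resize work that
        // showed up in the trace as Hashtable construction and ArrayList/List resizing.
        var table = new Hashtable(columnNames.Length);
        var ordered = new OrderedDictionary(columnNames.Length);
        var rows = new List<string[]>(expectedRows);

        foreach (string name in columnNames)
        {
            table[name] = name;
            ordered[name] = name;
        }
        for (int i = 0; i < expectedRows; i++)
        {
            rows.Add(new string[columnNames.Length]);
        }
    }
}
```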
@danmosemsft This issue is talking about GetHashCode() being slower on .NET Core vs. .NET Framework; the PowerShell issue has shed some light on it, but I think it is independent of this. It may very well be that the PowerShell issue is a set of issues, but I would think we want to improve GetHashCode on .NET Core, both by improving the code quality and, in my opinion, by providing the ability to quirk it off if needed. In that sense I think closing this issue is premature.
Yep, certainly it would help everyone to make it faster. Happy to keep the issue open to track possible work there. I think we should be clear, though, about whether real workloads are significantly impacted by the randomization, because that affects the priority of working on it.
@danmosemsft Thank you for your analysis. I guess the code for creating PSObject is the same in PowerShell Core and Windows PowerShell (@SteveL-MSFT can you confirm?). Therefore, I cannot explain why we see such a large difference between PowerShell Core and Windows PowerShell. Even if we add up all the percentages (6.7 + 4.1 + 3.3 = 14.1), it does not explain why PowerShell Core is more than twice as slow.
@iSazonov did a diff between PSCore6 and WinPS5.1 MSHObject.cs, and the only differences are formatting changes. I think there are some optimizations we can make in PSCore6 for PSObject creation. I also need to do more profiling of PSCore6, although the links to the dotnetcore debug symbols currently don't work, so I'm not able to get good logs from PerfView.
If the code has not changed but it (I mean the cmdlets) runs twice as slowly, then it is all the more necessary to look into .NET Core.
@iSazonov the first step should be a proper perf investigation / comparison with .NET Framework - ideally by someone familiar with the code using .NET (Core) here.
I tried to do it today but without success; I do not have much experience with PerfView.
@iSazonov - if you have taken PerfView traces of some scenario on both .NET Core and .NET Desktop that show the poorer perf, then that should be sufficient for the investigation. If you publish those two files in some way (you can copy them to \clrmain\public\writable\users\vancem if you want), I will take a look.
I have just run the benchmark @mjsabby has above on both .NET Core and Desktop. Here are the results:
Thus the .NET Core time is 1.23X the Desktop time (23% slower). I ran a PerfView trace, and it definitely shows that the difference is simply the different hash method used. This is roughly 'by design', as mentioned earlier. I looked at the trace linked above and, as @danmosemsft indicates, GetHashCode is not an important part of the execution cost (~1%). More time was spent in upper-casing strings, because the hash was case-insensitive. Over half the time is being spent creating new objects or arrays, which is where the real action is, and this seems to be dominated by the PSObject (and the hashtables that are used there). I actually think we can close this issue (because the regression in GetHashCode is really by design). However, if PowerShell runs slower on .NET Core, we do want to understand it. All you need to do is a 'PerfView collect' while running your benchmark. Collect a trace with .NET Core and another with Desktop. If .NET Core is slower, we would definitely be interested in seeing those traces.
After reading "Thus the .NET Core time is 1.23X the Desktop time (23% slower)", I believe that developers will postpone migration to .NET Core; this is too noticeable a slowdown. I expect that the .NET Core team will mobilize and make a significant investment to find the bottlenecks and remedy them.
Many thanks! This issue is for tracking GetHashCode(); I suggest we move the rest of the discussion to PowerShell/PowerShell#7112.
We have done that already. There are no major bottlenecks that we are aware of. We believe that the current implementation of GetHashCode is the right choice for .NET Core even though it is slower to compute. Note that it produces much higher quality hashcodes and it is not hard to come up with examples where it makes the end-to-end execution faster because of reduced hash collisions. If the .NET Core string hash function does not work well in some cases, our recommendation is to replace it with a custom hash code function specifically designed for the given case. For example, Roslyn has their own implementation of string hash function that works well for names used as C# identifiers. |
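As a rough illustration of that recommendation, a minimal sketch of a caller-owned, deterministic string hash. FNV-1a is chosen here purely as an example; it is not the function Roslyn uses, and it offers no DoS protection, so it is only appropriate for trusted keys:

```csharp
static class StableStringHash
{
    // Deterministic across processes and machines, unlike string.GetHashCode() on .NET Core.
    // Not collision-resistant against an attacker; use only when the inputs are trusted.
    public static int Compute(string s)
    {
        unchecked
        {
            uint hash = 2166136261; // FNV offset basis
            foreach (char c in s)
            {
                hash = (hash ^ c) * 16777619; // FNV prime
            }
            return (int)hash;
        }
    }
}
```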
I see! I see tons of PRs and blog posts saying that .NET Core is faster. That is a fantastic result! This story began when I wanted to take advantage of .NET Core's speed and tried to use the Import-Csv cmdlet on real data (a week of IIS logs, > 2 GB). It turned out that this is not acceptable, because the difference from Windows PowerShell was many hours!
I really hope it gets faster! The first comment from the PowerShell community was that we need to pay attention to the GetHashCode() method because it's twice as slow. As noted above, the problem is more likely in HashSet, Dictionary, or OrderedDictionary. I am very glad that @vancem agreed to look into this problem deeply. Then I think we will know exactly why PowerShell Core consumes so much CPU and memory, and what we need to fix in PowerShell and/or .NET Core.
Thanks for the great sample! I also found a similar PR in the repo for XML. Perhaps we could use this in the PowerShell repo if the only problem is the OrderedDictionary that we use in PSObject. I still believe that this method is the root issue and is worth making faster. I hope there is a way to do this:
First, it is naive to think that everything can be made faster. Many key perf benefits are usually trade-offs - some scenarios get faster at the cost of other scenarios. Of course, we're trying hard to focus on the most important scenarios to get faster, while not making other important scenarios (much) slower.
With the previous paragraph in mind, the best next step is to investigate the perf difference and find the root cause. Is PS using some weird pattern which can be replaced with a better one? Or did we truly accidentally hurt the perf of a scenario which is important for more than PS? (BTW: so far only PS has reported it.)
Agreed that it is good.
It would be best to support it with data by showing how much it could help end-to-end scenarios (e.g. PS ones). It will help us prioritize such improvements against other bugs and enhancements. Given the mix-up on this issue, I would recommend closing it and filing a separate issue for the potential GetHashCode improvement via vectorization (with data on how much it contributes to end-to-end scenarios like PS). We can continue root-causing the "general perf regression" either in the PS issue or a new one in CoreFX, and see what it turns into.
@karelz Thanks! I totally agree that it's impossible to make everything work fast.
I believe that most developers of end-user application systems never care about resources until customers demand it because they are experiencing a real problem. Even then, the developer will make improvements in his own code, taking the .NET platform features "as is". I think the .NET team will never find out about these problems until it looks at real projects that use .NET Core.
As mentioned in the issue description, PowerShell uses classes that depend heavily on the GetHashCode() method. One of the best features of PowerShell is that it can provide access to any .NET classes. This means that if, for example, HashSet has problems due to this method on a large data set, users can feel it in their scripts right away.
The issue description seems good. After receiving the results of the performance analysis, we can decide more precisely how to proceed.
I want to summarize my understanding of where we are.
The next step is for me to look at the traces. I will do this later today...
@vancem Thanks! I agree. I forgot to mention that the PowerShell Core test suite became noticeably faster after moving to .NET Core 2.1 (up to 25% in some scenarios!). So the overall performance of .NET is very good! Nevertheless, there are scenarios that perform very poorly.
PowerShell is using a generic, hash-heavy way of creating dynamic objects, which we use a lot when importing CSVs. To work around this in PowerShell, I think a new way of creating objects needs to be introduced, where we can amortize the cost of creating the object properties (where the hashes of the names are calculated). In theory, these hashes need to be calculated once per column in the CSV, not for every property and every line. I haven't yet looked at how much work this would involve.
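A hypothetical sketch of that amortization idea, expressed in C# (the names and shapes here are illustrative, not PowerShell's actual internals): resolve each column name to an index once per file, then address row values by index, so the name is hashed once per column rather than once per cell.

```csharp
using System;
using System.Collections.Generic;

sealed class CsvColumnMap
{
    private readonly Dictionary<string, int> _indexByName;

    public CsvColumnMap(string[] header)
    {
        // Column names are hashed here, once per column for the whole file.
        _indexByName = new Dictionary<string, int>(header.Length, StringComparer.Ordinal);
        for (int i = 0; i < header.Length; i++)
        {
            _indexByName[header[i]] = i;
        }
    }

    // Call once per column, outside the per-row loop.
    public int GetIndex(string columnName) => _indexByName[columnName];
}

static class CsvExample
{
    static void Main()
    {
        var map = new CsvColumnMap(new[] { "index", "duration_ms", "bytes_consumed" });
        int durationIndex = map.GetIndex("duration_ms"); // hashed once

        string[][] rows = { new[] { "0", "42.5", "1000" }, new[] { "1", "40.1", "1100" } };
        foreach (string[] row in rows)
        {
            Console.WriteLine(row[durationIndex]); // no string hashing per row
        }
    }
}
```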
It might be worth using the same approach for
Don't think much could be done for
First, as a matter of principle, I think both @mjsabby's idea of having an opt-out and @benaadams' idea of dynamically switching hashes under high collisions are worth having 'on the table'. However, I will mention that if there is something we do A LOT at Microsoft, it is make things more complex for what ultimately turns out to be a dubious tradeoff. I think we should be very clear on the benefits of adding such complexity before embarking. In particular, I have done some benchmarking of a V2.1-based PowerShell running the CSV importer (before they did the optimization that @powercode suggested), and the system.private.corelib!System.Marvin.ComputeHash32 method used 5.7% of the time. Thus at MOST that could be recovered. To put that in perspective, 10.4% was spent in ToUpperInvariant (it seems that the hash table is case-insensitive). Also, these hashes come from ConcurrentDictionary and HashTable, so you can see that @benaadams' idea would have to be applied in several places to have a useful effect in this case. As mentioned, I am not against either of the suggestions at this time, but I do think that the original complaint (a 2X slowdown) cannot all be attributed to the string hashing function (that is, the mitigations above will only help by maybe 4-5%), and frankly, because PowerShell fixed their code, this particular case has gone away (although I am sure that string-based Dictionary lookups are hot in some scenarios). I argue that we should accumulate more 'good reasons' to make the change before proceeding.
(security hat on) I don't like the idea of adding an application-wide opt-out switch for the randomized hash code calculation. There's a lot of code in the framework and in third-party libraries that either explicitly or implicitly depends on this behavior for its security guarantees. Consider
Making it per-collection (such as via an explicit
Having collections optimistically use a non-randomizing hash (for
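For the per-collection direction, a minimal sketch, assuming the opt-out is expressed as an explicit IEqualityComparer&lt;string&gt; passed to a single collection, so the rest of the application keeps randomized hashing; the FNV-1a hash here is illustrative and only appropriate for trusted keys:

```csharp
using System;
using System.Collections.Generic;

sealed class DeterministicOrdinalComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    // Deterministic hash, so only collections using this comparer give up randomization.
    public int GetHashCode(string obj)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in obj)
            {
                hash = (hash ^ c) * 16777619;
            }
            return (int)hash;
        }
    }
}

static class PerCollectionOptOut
{
    static void Main()
    {
        // string.GetHashCode() elsewhere (and other collections) remain randomized.
        var map = new Dictionary<string, int>(new DeterministicOrdinalComparer());
        map["hello"] = 1;
        Console.WriteLine(map["hello"]);
    }
}
```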
My understanding of things is that randomized hash codes are there to protect against a denial-of-service attack where you don't trust your user. While there are instances where this is very important (a service open to all), in many cases it is just not interesting because you trust the user (if your users already 'own' the machine, e.g. this is an 'ordinary' app, then you don't care - they are creating a DoS against themselves). Another scenario is simply compat (my software took a deep dependency, and I need something!). This case is probably where @mjsabby is coming from. Here you are simply willing to trade off security (often because it is no worse than what it is replacing). So I think there are useful scenarios for an opt-out. (This came up in the context of PowerShell, and it is an interesting question whether they should opt out. One could argue that this is what they do de facto on the desktop runtime, which is where most of their user base is today.) So far, it seems that there are better places to look for perf, and it can be mitigated in important cases like what @benaadams did. Until someone actually does the PR for the opt-out, the issue won't be forced...
It's not that you don't trust the user - it's that you don't trust the data. Consider something like Office or Visual Studio. I, as a human being driving the mouse and keyboard, am fully trusted, but the .docx / .csproj / whatever file I'm opening might be from some random source. Even within DevDiv we've spent significant resources within the past year or so trying to address reports where somebody crafted a malicious project and caused the trusted user's IDE to misbehave. It's true that these reports tend to be lower priority since it's a local DoS, but it is still some customer's workflow that's being impacted every time we need to address one of these situations when they inevitably do arise.
Sure, I get that there are many possible security scenarios. My main point is that DoS attacks are not as critical, and that allowing people to opt out is reasonable. That said, I don't like adding features unless we actually know who will use them, and it is not clear who will use this opt-out, so I am not motivated to actually implement it. I do note that even with the opt-out, we are safe by default, which I suspect is your main point.
Additionally, per PowerShell/PowerShell#7112, it looks like the original scenario which precipitated the opening of this issue has since been resolved in PowerShell Core. Since it looks like there's nothing left to do on the runtime side of things, I'm closing the issue for now. Please reopen this issue if needed. Thanks!
Using this PowerShell script on Win10 17713
Computes the hash for the string 10M times and averages the time over 10 runs.
On .NET 4.7.2 it takes 12455 ms.
On the same machine with PSCore 6.1 on .NET Core 2.1.301, it takes 17579 ms.
This is causing a perf regression in PowerShell due to use of GetHashCode() calls.