
String.GetHashCode() consistently slower on dotnetcore than .Net Framework #26779

Closed
SteveL-MSFT opened this issue Jul 11, 2018 · 49 comments
Labels: area-System.Runtime, question, tenet-performance

@SteveL-MSFT
Contributor

Using this PowerShell script on Win10 17713

$totalms = 0
1..10 | % { $totalms += (measure-command { foreach($i in 1..10000000) {$Null = "PowerShell6HasSlowerHashCode".GetHashCode()} }).totalmilliseconds }
$totalms/10

Computes the hash for the string 10M times and averages the time over 10 runs.

On .NET 4.7.2 it takes 12455ms.
On the same machine with PSCore6.1 on .NET Core 2.1.301 it takes 17579ms.

This is causing a perf regression in PowerShell due to its use of GetHashCode() calls.

@stephentoub
Member

stephentoub commented Jul 11, 2018

.NET Core uses randomized string hashing, e.g. with this program:

using System;

class Program
{
    static void Main()
    {
        Console.WriteLine("hello".GetHashCode());
        Console.WriteLine("hello".GetHashCode());
        Console.WriteLine("hello".GetHashCode());
        Console.WriteLine("hello".GetHashCode());
    }
}

By default (https://docs.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/runtime/userandomizedstringhashalgorithm-element) every run of this program on .NET 4.7.2 will produce the same value:

c:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f net472
-327419862
-327419862
-327419862
-327419862

c:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f net472
-327419862
-327419862
-327419862
-327419862

c:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f net472
-327419862
-327419862
-327419862
-327419862

On .NET Core, each run of the program will produce the same output from the four lines, but that value will differ from program execution to program execution:

c:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f netcoreapp2.1
1608943144
1608943144
1608943144
1608943144

c:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f netcoreapp2.1
219552412
219552412
219552412
219552412

c:\Users\stoub\Desktop\tmpapp>dotnet run -c Release -f netcoreapp2.1
-1769219286
-1769219286
-1769219286
-1769219286

That randomization incurs some additional cost, so it's effectively by-design that this is a bit slower.

@HFadeel

HFadeel commented Jul 11, 2018

@stephentoub Why does .NET Core utilize randomized string hashing? Won't this make porting applications from .NET Framework harder? Also, shouldn't hashing be stable for the same input?

@stephentoub
Member

Why does .NET Core utilize randomized string hashing?

Security, prevention against DoS attacks, etc.

Also, shouldn't hashing be stable for the same input?

It is stable for the same input, for a given execution of a program.

@HFadeel

HFadeel commented Jul 11, 2018

@stephentoub Thanks. Is UseRandomizedStringHashAlgorithm applicable for .NET Core as well?

@stephentoub
Member

Is UseRandomizedStringHashAlgorithm applicable for .NET Core as well?

UseRandomizedStringHashAlgorithm exists in the .NET Framework to be able to opt-in to randomized string hashing. There is no mechanism provided in .NET Core to opt-out of it.

@karelz
Member

karelz commented Jul 11, 2018

Is there something left to be done here, or are we going to call it a by-design perf difference?
cc @valenis @brianrob @adamsitnik

@danmoseley
Member

It depends on whether there are important enough scenarios that don't care about the possibility of DoS (and whose users and callers don't either), and whether those scenarios are unacceptably regressed.

@SteveL-MSFT in what Powershell scenarios is the perf regression occurring?

@iSazonov
Contributor

@danmosemsft We are discussing this in PowerShell/PowerShell#7112

The scenario is that we construct PSObject dynamically (everywhere in PowerShell), and this uses a type cache where GetHashCode is used. Since PowerShell is built entirely on PSObject, everything becomes slower.

@karelz
Member

karelz commented Jul 12, 2018

What is the perf contribution of GetHashCode on real-world scenarios? (%, numbers)
What are the PS real-world scenarios? How common are they? Is it something normal users will notice, or is it "just" chasing down a perf microbenchmark? (the microbenchmark numbers above show 41% degradation - which may be ok, given it brings security advantages for server-side scenarios)

@SteveL-MSFT
Contributor Author

In the original issue opened by @iSazonov, he noticed that Import-Csv was about 2x slower for a large (>100MB) CSV file. However, we are talking about going from ~3.7s to ~6s in absolute terms. GetHashCode() is used quite a bit in the PowerShell code base, but I can certainly see the security benefit. @iSazonov, are you ok with the perf degradation for your specific use case? If we find more common scenarios with significant impact, we should revisit this.

@karelz
Member

karelz commented Jul 12, 2018

Was the entire 2x slowdown of Import-CSV only due to GetHashCode being slower? (it would mean it has higher contribution there than the microbenchmark above - which is a bit suspicious/interesting)

@Genbox

Genbox commented Jul 12, 2018

@stephentoub The randomized seed should only introduce a small initial setup cost. I don't think it can impact GetHashCode() this much. There were, however, a few performance regressions in the Marvin hash that were attributed to the removal of a couple of AggressiveInlining attributes, but I don't recall whether all of the regressions were found and fixed.

@SteveL-MSFT
Contributor Author

@karelz at this time, unless we get reports of other use cases impacted, I don't consider this blocking for PSCore6.1 release, which also means that unless the community picks it up, we don't have the bandwidth to investigate further. What profiler do you use with .NET Core?

@danmoseley
Member

@SteveL-MSFT we own and use https://github.com/Microsoft/perfview/blob/master/documentation/Downloading.md on Windows. I am not sure of Linux procedures.

@karelz
Member

karelz commented Jul 13, 2018

For Linux perf investigations, use the same: https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md
@brianrob @valenis @vancem can provide additional guidance if necessary.

@brianrob
Member

Yup, both of those are good links - if you have any questions, I can help.

@stephentoub
Member

There were, however, a few performance regressions in the Marvin hash that were attributed to the removal of a couple of AggressiveInlining attributes, but I don't recall whether all of the regressions were found and fixed.

I'm sure there are ways to make Marvin faster, even exploring vectorizing it. That doesn't negate anything I said.

@iSazonov
Contributor

iSazonov commented Jul 13, 2018

The randomized seed should only introduce a small initial setup cost. I don't think it can impact GetHashCode() this much.

I agree.

at this time, unless we get reports of other use cases impacted, I don't consider this blocking for PSCore6.1 release

I guess all cmdlets that generate PSObject objects are currently twice as slow.

I found that the Import-Csv cmdlet has become slower when parsing and analyzing large web server logs (~10 GB). It took several hours with PowerShell Core and less than an hour with Windows PowerShell. I can say that the Select-String and ConvertTo-Csv cmdlets are slow too. (We also have a problem with the web cmdlets, PowerShell/PowerShell#6199.) This makes it impractical to use PowerShell Core to handle large files (is 1 GB large these days?).
I started looking for ways to optimize the Import-Csv cmdlet to reach LogParser (sed/awk) performance and found that the problem is deeper. I used PerfView for analysis, and then @powercode (thanks!) confirmed my fears.
I still do not have a full understanding of where the fix should go, but this is definitely a sensitive problem.

@karelz
Member

karelz commented Jul 13, 2018

@iSazonov do you have traces which would show that 50%+ time is spent in GetHashCode?
I'd like to confirm whether the end-to-end slowdown is caused ONLY by this issue.

@iSazonov
Contributor

@karelz I'll prepare and attach a sample trace.

@mjsabby
Contributor

mjsabby commented Jul 13, 2018

I can also attest that Marvin is close to twice as slow, but as pointed out it's not specific to .NET Core and is reproducible on .NET Framework.

In my opinion we should allow a configuration setting to disable randomized hashing, just like we do on .NET Framework. It would both help in porting code that is (unfortunately) dependent on a stable GetHashCode and alleviate this performance problem for some scenarios.

To reproduce on .NET Framework, you can change the app.config and notice it is in fact close to 2X slower.

app.config

<configuration>
  <runtime>
    <UseRandomizedStringHashAlgorithm enabled="1" />
  </runtime>
</configuration>

Program.cs

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class Program
{
    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        int a = PoorBenchmark();
        sw.Stop();

        Console.WriteLine(sw.ElapsedMilliseconds);
        Console.WriteLine(a);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int PoorBenchmark()
    {
        int a = 0;

        for (int j = 0; j < 1000000000; ++j)
        {
            a = "PowerShell6HasSlowerHashCode".GetHashCode();
        }

        return a;
    }
}

@iSazonov
Contributor

Link to perf trace file https://ufile.io/nkzgh
Script for test

cd c:\tmp\

# Prepare data before start perfview

Write-Host "Creating source data..."
"random_numbers" | Out-File source.csv
Get-Random -SetSeed 1 | Out-Null
for ($i=0; $i -le 10000; $i++) {
	(Get-Random).ToString() | Out-File source.csv -Append
}

# Run after start perfview

Write-Host "Measuring Import-Csv performance over time..."
"index,duration_ms,bytes_consumed" | Out-File 7112_results.csv
for ($i=0; $i -le 1000; $i++) {
	$millis = (Measure-Command { Import-Csv source.csv }).TotalMilliseconds
	$memory = [System.GC]::GetTotalMemory($false)
	$i.ToString() + "," + $millis.ToString() + "," + $memory | Out-File 7112_results.csv -Append
}
Write-Host "Done"

@karelz
Member

karelz commented Jul 13, 2018

@iSazonov thanks! Just to confirm: Does the trace show that 50%+ time is spent in GetHashCode?

@danmoseley
Member

danmoseley commented Jul 13, 2018

@iSazonov I took a quick look at the trace. The inclusive time spent in system.private.corelib!Marvin.ComputeHash32 is 1% of the trace time.

It seems to me, then, that the next step is either for the PowerShell folks to compare a profile against .NET Framework to see what changed, or to analyze the scenario top-down, and come back here if improvements are needed in .NET Core itself (especially if .NET Core is slower than .NET Framework doing the same work).

I do see that much of the time is in collection operations - OrderedDictionary, ConditionalWeakTable, ArrayList, List, Hashtable.

@SteveL-MSFT does this seem reasonable?

@danmoseley
Member

It might be worth looking, e.g., at whether any of these collections can be presized, at least roughly. 6.7% of inclusive time is spent simply constructing Hashtables and 4.1% in resizing ArrayList (both of those types back OrderedDictionary). 3.3% of the time is in resizing List (used directly by PowerShell).
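For illustration only (not PowerShell's actual code; the element count here is an assumed estimate), presizing those collections looks roughly like this in C#:

using System;
using System.Collections;
using System.Collections.Generic;

class PresizeSketch
{
    static void Main()
    {
        const int expectedCount = 10000; // assumption: a rough element count is known up front

        // With no capacity hint, each of these collections grows (rehashing or copying)
        // several times while being filled; passing a capacity avoids that.
        var hashtable = new Hashtable(expectedCount);
        var arrayList = new ArrayList(expectedCount);
        var list = new List<string>(expectedCount);

        for (int i = 0; i < expectedCount; i++)
        {
            string key = "item" + i;
            hashtable[key] = i;
            arrayList.Add(key);
            list.Add(key);
        }

        Console.WriteLine($"{hashtable.Count} {arrayList.Count} {list.Count}");
    }
}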

I'm going to close this; please feel free to reopen if I mis-analyzed.

@mjsabby
Contributor

mjsabby commented Jul 13, 2018

@danmosemsft This issue is about GetHashCode() being slower on .NET Core vs. .NET Framework; the PowerShell issue has shed some light on it, but I think it is independent of this. It may very well be that the PowerShell issue is really a set of issues, but I would think we want to improve GetHashCode on .NET Core, both by improving the code quality and, in my opinion, by providing the ability to quirk it off if needed.

In that sense I think closing this issue is premature.

@danmoseley danmoseley reopened this Jul 13, 2018
@danmoseley
Member

Yep, certainly it would help everyone to make it faster. Happy to keep the issue open to track possible work there. I think we should be clear, though, about whether real workloads are significantly impacted by the randomization, because that affects the priority of working on it.

@iSazonov
Contributor

@danmosemsft Thank you for your analysis. I guess the code for creating PSObject is the same in PowerShell Core and Windows PowerShell (@SteveL-MSFT can you confirm?). Therefore, I cannot explain why we see such a big difference between PowerShell Core and Windows PowerShell. Even if we add up all the percentages (6.7 + 4.1 + 3.3 = 14.1), it does not explain why PowerShell Core is more than twice as slow.

@SteveL-MSFT
Contributor Author

@iSazonov did a diff of MSHObject.cs between PSCore6 and WinPS5.1, and the only differences are formatting changes. I think there are some optimizations we can make in PSCore6 for PSObject creation. We also need to do more profiling of PSCore6, although the links to the .NET Core debug symbols currently don't work, so I'm not able to get good logs from PerfView.

@iSazonov
Contributor

iSazonov commented Jul 18, 2018

If the code has not changed but runs twice as slowly (I mean the cmdlets), then it is all the more necessary to look into .NET Core.

@karelz
Member

karelz commented Jul 18, 2018

@iSazonov the first step should be a proper perf investigation / comparison with .NET Framework - ideally by someone familiar with the code using .NET (Core) here.

@iSazonov
Contributor

I tried to do it today but without success - I do not have much experience with PerfView.
PerfView from the link above doesn't show me the call stack for .NET Framework.

@vancem
Contributor

vancem commented Jul 18, 2018

@iSazonov - if you have taken PerfView traces of some scenario on both .NET Core and .NET Desktop that show the poorer perf, then that should be sufficient for the investigation. If you publish those two files in some way (you can copy them to \clrmain\public\writable\users\vancem if you want), I will take a look.

@vancem
Contributor

vancem commented Jul 18, 2018

I have just run the benchmark @mjsabby posted above on both .NET Core and Desktop. Here are the results:

Desktop: elapsed time 7466 ms
.NET Core: elapsed time 9201 ms

Thus the .NET Core time is 1.23X the Desktop time (23% slower). I ran a PerfView trace, and it definitely shows that the difference is simply the different hash method used. This is roughly 'by design', as mentioned earlier.

I looked at the trace linked above, and as @danmosemsft indicates, GetHashCode is not an important part of the execution cost (~1%). More time was spent in upper-casing strings because the hash was case-insensitive.

Over half the time is being spent in creating new objects or arrays, which is where the real action is, and this seems to be dominated by PSObject (and the hashtables used there).

I actually think we can close this issue (because the regression in GetHashCode is really by design).

However, if PowerShell runs slower on .NET Core, we do want to understand it. All you need to do is run 'PerfView collect' while running your benchmark. Collect a trace with .NET Core and another with Desktop. If .NET Core is slower, we would definitely be interested in seeing those traces.

@iSazonov
Contributor

@vancem

I actually think we can close this issue (because the regression in GetHashCode is really by design).

After that "Thus .NET Core is time is 1.23X the Destkop time (23% slower)." I believe that developers will postpone migration to .Net Core - this is too noticeable a slowdown. I expect that .Net Core team will mobilize and make a significant investment to find bottlenecks and to remedy them.
Just this issue shows a two-fold slowdown in GetHashCode(). Perhaps this is not a key issue for PowerShell but it is definitely a problem for .Net Core as a whole.

@vancem

I will take a look.

Many thanks! This issue is for tracking GetHashCode() - I suggest we move the rest of the discussion to PowerShell/PowerShell#7112.

@jkotas
Member

jkotas commented Jul 19, 2018

I expect that the .NET Core team will mobilize and make a significant investment to find bottlenecks and remedy them.

We have done that already. There are no major bottlenecks that we are aware of. We believe that the current implementation of GetHashCode is the right choice for .NET Core even though it is slower to compute. Note that it produces much higher quality hashcodes and it is not hard to come up with examples where it makes the end-to-end execution faster because of reduced hash collisions.

If the .NET Core string hash function does not work well in some cases, our recommendation is to replace it with a custom hash code function specifically designed for the given case. For example, Roslyn has their own implementation of string hash function that works well for names used as C# identifiers.
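To make that concrete, here is a minimal sketch of the kind of comparer @jkotas describes (this is not Roslyn's hash; FNV-1a and the comparer name are used purely for illustration). The key point is that the deterministic hash is opted into per collection rather than app-wide:

using System;
using System.Collections.Generic;

// Illustrative deterministic (non-randomized) comparer for trusted keys where a
// cheap, stable hash matters more than DoS resistance.
sealed class Fnv1aStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string s)
    {
        unchecked
        {
            uint hash = 2166136261u;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619u;
            }
            return (int)hash;
        }
    }
}

class Demo
{
    static void Main()
    {
        // The custom hash applies only to this collection, not to string.GetHashCode globally.
        var lookup = new Dictionary<string, int>(new Fnv1aStringComparer())
        {
            ["PowerShell6HasSlowerHashCode"] = 1
        };
        Console.WriteLine(lookup["PowerShell6HasSlowerHashCode"]);
    }
}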

@iSazonov
Contributor

iSazonov commented Jul 19, 2018

We have done that already.

I see! I see tons of PRs and blog posts saying that .NET Core is faster. That is a fantastic result!
But when I run the same code on .NET Core and see that it's noticeably slower and the GC is working worse, what conclusion should I draw? Wait for a .NET Core 4.7 ten years from now? I expect a .NET Core app to work better than a .NET Framework one, and if it is worse, then by no more than a few percent.

This story began when I wanted to take advantage of .NET Core's speed and tried to use the Import-Csv cmdlet on real data (a week of IIS logs, >2 GB). It turned out that this is not acceptable because the difference from Windows PowerShell was many hours!
This despite the fact that it was the same code!
If someone asks about porting a heavily loaded web application to .NET Core, now we have to say: first of all, buy additional resources (CPU and memory) - 23% more at best and 100% if you are unlucky. For such applications, that can be a lot of money.

Note that it produces much higher quality hashcodes and it is not hard to come up with examples where it makes the end-to-end execution faster because of reduced hash collisions.

I really hope it is faster! The first comment from the PowerShell community was that we need to pay attention to the GetHashCode() method because it is twice as slow. As noted above, the problem is rather in HashSet, Dictionary, or OrderedDictionary. I am very glad that @vancem agreed to look at this problem deeply. Then I think we will know exactly why PowerShell Core consumes so much CPU and memory, and what we need to fix in PowerShell and/or .NET Core.

If the .NET Core string hash function does not work well in some cases, our recommendation is to replace it with a custom hash code function specifically designed for the given case. For example, Roslyn has their own implementation of string hash function that works well for names used as C# identifiers.

Thanks for the great example! I also found such a PR in the repo for XML. Perhaps we could use this approach in the PowerShell repo if the only problem is the OrderedDictionary that we use in PSObject.
In general, though, it is not acceptable to rewrite many .NET Core types. On the contrary, we continuously remove a lot of custom code from PowerShell in favor of the standard .NET Core code. It's fast and reliable, isn't it? :-) In particular, we plan to make extensive use of System.HashCode.

I still believe that this method is at the root of the problem and is worth making faster. I hope there is a way to do this, as @stephentoub said:

I'm sure there are ways to make Marvin faster, even exploring vectorizing it

@karelz
Member

karelz commented Jul 19, 2018

But when I run the same code on .NET Core and see that it's noticeably slower and the GC is working worse, what conclusion should I draw? Wait for a .NET Core 4.7 ten years from now? I expect a .NET Core app to work better than a .NET Framework one, and if it is worse, then by no more than a few percent.

First, it is naive to think that everything can be made faster. Many key perf benefits are usually trade-offs - some scenarios get faster at the cost of other scenarios. Of course, we're trying hard to focus on making the most important scenarios faster, while not making other important scenarios (much) slower.
Obviously, sometimes perf improvements bring unexpected perf regressions in other scenarios. Sometimes we may misjudge the importance of certain scenarios. And sometimes some products just use weird, unique patterns which we don't consider important but which hurt specific products (in which case we often recommend that the products migrate to better and more performant APIs).
Again, not everything can be made better at all times. Expecting that everything is just better, no matter what you use and how, is unrealistic in any large, mature project or platform.

It turned out that this is not acceptable because the difference from Windows PowerShell was many hours!
This despite the fact that it was the same code!
If someone asks about porting a heavily loaded web application to .NET Core, now we have to say: first of all, buy additional resources (CPU and memory) - 23% more at best and 100% if you are unlucky. For such applications, that can be a lot of money.

With the previous paragraph in mind, the best next step is to investigate the perf difference and find the root cause. Is PS using some weird pattern which can be replaced with a better one? Or did we truly accidentally hurt the perf of a scenario which is important to more than PS? (BTW: so far only PS has reported it.)

I am very glad that @vancem agreed to look at this problem deeply. Then I think we will know exactly why PowerShell Core consumes so much CPU and memory, and what we need to fix in PowerShell and/or .NET Core.

Agreed that it is good.
However, I also want to clarify the expectations in general: we often do not have the capacity to jump on every single report of allegedly slower perf or a functional regression, especially when there is lots of in-the-middle code involved (like PS) - mainly because such investigations often require understanding of the higher-level framework code base to be done efficiently and to figure out where the perf fix could/should be (which may NOT be in the .NET Core layer).
That's why we often ask for repros without in-the-middle frameworks (like PS), or we ask users/higher-level frameworks to do first-level analysis (ideally leading to an isolated repro without the higher-level framework involved at all - the original report in this issue is a great example of that).
Obviously we are more attentive to problems which have larger impact - signs of hitting multiple customers across multiple higher-level frameworks - as those are likely to be unintended regressions we caused by accident or by mistake in our layer.

I still believe that this method is at the root of the problem and is worth making faster

It would be best to support it with data by showing how much it could help end-to-end scenarios (e.g. PS ones). It will help us prioritize such improvements against other bugs and enhancements.
It is unfortunate that this issue became a mix-up of multiple problems - the fact that GetHashCode became slower (partially by design, with potential room for improvement via vectorizing) vs. the end-to-end PS scenario being significantly slower (without a clear root cause at this point).

Given the mix-up on this issue, I would recommend closing it and filing a separate issue for the potential GetHashCode improvement via vectorization (with data on how much it contributes to end-to-end scenarios like PS). We can continue root-causing the "general perf regression" either in the PS issue or in a new one in CoreFX and see what it turns into.

@iSazonov
Contributor

@karelz Thanks! I totally agree that it's impossible to make everything work fast.

Is PS using some weird pattern which can be replaced with a better one? Or did we truly accidentally hurt the perf of a scenario which is important to more than PS? (BTW: so far only PS has reported it.)

I believe that most developers of end-user application systems never care about resources until customers demand it because they are experiencing a real problem. Even then, the developer will make improvements in their own code, taking the .NET platform features "as is". I think the .NET team will never find out about these problems until it looks at real projects that use .NET Core.

It would be best to support it with data by showing how much it could help end-to-end scenarios

As mentioned in the issue description, PowerShell uses classes that depend heavily on the GetHashCode() method. One of PowerShell's best features is that it can provide access to any .NET class. This means that if, for example, HashSet has problems due to this method on a large data set, users can feel it in their scripts right away.

Given the mix-up on this issue,

The issue description seems fine to me. After receiving the results of the performance analysis, we can decide more precisely how to proceed.

@vancem
Contributor

vancem commented Jul 19, 2018

I want to summarize my understanding of where we are.

  1. We understand that .NET Core has a different String.GetHashCode, and that it is modestly (23%) slower in a microbenchmark.
  2. There is a PowerShell scenario (CSV processing) that is reported to be 40% slower = 1.4X (17579ms / 12455ms); however, looking at the profile shows that String.GetHashCode CANNOT be the issue (it only consumes 1% of the total CPU time).
  3. Thus this issue is at best poorly named, since it is about String.GetHashCode. We are likely to close it. We are leaving it open right now just to track the investigation of the 40% slowdown, but shortly we are likely to transition to another issue one way or the other.
  4. The .NET Runtime team wants to make sure that .NET Core is not noticeably slower (and preferably faster) in all interesting scenarios. This is why we are looking at the regression. However, this is almost certainly not going to force a change to String.GetHashCode (we may yet do that, but we first want to see real scenarios that are impacted by the known 23% regression in GetHashCode).

The next step is for me to look at the traces. I will do this later today...

@iSazonov
Contributor

@vancem Thanks! I agree. I forgot to mention that the PowerShell Core test suite became noticeably faster after moving to .NET Core 2.1 (up to 25% in some scenarios!). So the overall performance of .NET is very good. Nevertheless, there are scenarios with very poor numbers.

@powercode

PowerShell is using a generic, hash-heavy way of creating dynamic objects, which we use a lot when importing CSVs.
These hashes are recalculated for every line in the CSV, and if there are many properties in each line and many lines, well...

To work around this in PowerShell, I think a new way of creating objects needs to be introduced, where we can amortize the cost of creating the object properties (which is where the hashes of the names are calculated).

In theory, these hashes need to be calculated once per column in the CSV, not for every property on every line.

I haven't yet looked at how much work this would involve.
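To illustrate the amortization @powercode describes (a sketch only, with made-up type names; PSObject's real property machinery is different): resolve each column name to a slot index once per file, then address row values by index so the per-line work involves no string hashing.

using System;
using System.Collections.Generic;

// Hypothetical "row template": column-name lookups (and their string hashes) happen
// once when the template is built; every subsequent row is addressed by slot index.
sealed class RowTemplate
{
    private readonly Dictionary<string, int> _slotByName;

    public RowTemplate(string[] columnNames)
    {
        _slotByName = new Dictionary<string, int>(columnNames.Length, StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i < columnNames.Length; i++)
            _slotByName[columnNames[i]] = i; // hash computed once per column, not once per line
    }

    public int GetSlot(string columnName) => _slotByName[columnName];
}

class Demo
{
    static void Main()
    {
        var template = new RowTemplate(new[] { "date", "url", "status" }); // header row
        int statusSlot = template.GetSlot("status");                       // resolved once

        string[] lines = { "2018-07-13,/index.html,200", "2018-07-13,/missing,404" }; // stand-in CSV body
        foreach (string line in lines)
        {
            string[] fields = line.Split(',');
            Console.WriteLine(fields[statusSlot]); // per-line access by index: no hashing
        }
    }
}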

@benaadams
Member

In my opinion we should allow a configuration setting to disable randomized hashing, just like we do on .NET Framework. It would both help in porting code that is (unfortunately) dependent on a stable GetHashCode and alleviate this performance problem for some scenarios.

Dictionary<TKey, TValue> where TKey is string (for ordinal comparisons, i.e. no comparer specified) starts with non-randomized hashing for the strings, switching to randomized hashing when collisions get high.

It might be worth using the same approach for HashSet<T> when T is string; also adding the concurrent use check (as that is collision count based?).

I don't think much could be done for Hashtable, as it's not strongly typed and a mix of objects can be added, so it can't have a special-cased fast path for strings (switching hashing approach when collisions become high).
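A toy sketch of that switching idea (this is not how Dictionary<string, TValue> is actually implemented; the threshold, the check frequency, and the DJB-style starting hash are all made up for illustration): start with a cheap deterministic comparer, then rebuild with the default comparer - whose string hashing is randomized on .NET Core - once keys collide too much.

using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration of "non-randomized first, randomized on high collisions".
sealed class SwitchingStringMap<TValue>
{
    private const int CollisionThreshold = 100; // assumed threshold, purely illustrative
    private IEqualityComparer<string> _comparer = new DjbComparer();
    private Dictionary<string, TValue> _map;

    public SwitchingStringMap() => _map = new Dictionary<string, TValue>(_comparer);

    public void Add(string key, TValue value)
    {
        _map.Add(key, value);

        // Crude, occasional collision check (a real implementation counts collisions per insert).
        if (_comparer is DjbComparer && _map.Count % 256 == 0)
        {
            int worstBucket = _map.Keys.GroupBy(k => _comparer.GetHashCode(k)).Max(g => g.Count());
            if (worstBucket > CollisionThreshold)
            {
                // Rebuild with the default comparer: string.GetHashCode, randomized on .NET Core.
                _comparer = EqualityComparer<string>.Default;
                _map = new Dictionary<string, TValue>(_map, _comparer);
            }
        }
    }

    public bool TryGetValue(string key, out TValue value) => _map.TryGetValue(key, out value);

    // Cheap deterministic hash used until collisions get suspicious.
    private sealed class DjbComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

        public int GetHashCode(string s)
        {
            unchecked
            {
                int hash = 5381;
                foreach (char c in s) hash = (hash * 33) ^ c;
                return hash;
            }
        }
    }
}

class Demo
{
    static void Main()
    {
        var map = new SwitchingStringMap<int>();
        map.Add("PowerShell6HasSlowerHashCode", 1);
        Console.WriteLine(map.TryGetValue("PowerShell6HasSlowerHashCode", out int v) ? v : -1);
    }
}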

@vancem
Contributor

vancem commented Aug 10, 2018

First, as a matter of principle, I think both @mjsabby's idea of having an opt-out and @benaadams' idea of dynamically switching hashes under high collisions are worth having 'on the table'.

However, I will mention that if there is something we do A LOT at Microsoft, it is making things more complex for what ultimately turns out to be a dubious tradeoff. I think we should be very clear on the benefits of adding such complexity before embarking.

In particular, I have done some benchmarking of a V2.1-based PowerShell running the CSV importer (before they did the optimization that @powercode suggested), and the system.private.corelib!System.Marvin.ComputeHash32 method used 5.7% of the time. Thus at MOST that could be recovered. To put that in perspective, 10.4% was spent in ToUpperInvariant (it seems that the hash table is case-insensitive). Also, these hashes come from ConcurrentDictionary and Hashtable, so you can see that @benaadams' idea would have to be applied in several places to have a useful effect in this case.

As mentioned, I am not against either of the suggestions at this time, but I do think that the original problem (a 2X slowdown) cannot all be attributed to the string hashing function (that is, the mitigations above would only help by maybe 4-5%), and frankly, because PowerShell fixed their code, this particular case has gone away (although I am sure that string-based Dictionary lookups are hot in some scenarios).

I argue that we should accumulate more 'good reasons' to make the change before proceeding.

@GrabYourPitchforks
Member

(security hat on)

I don't like the idea of adding an application-wide opt-out switch for the randomized hash code calculation. There's a lot of code in the framework and in third-party libraries that either explicitly or implicitly depends on this behavior for its security guarantees. Consider Dictionary<Uri, ...> or some other TKey where an untrusted string instance is somehow worked into the hash code calculation. (In fact, we now make this very easy to do with the introduction of the HashCode type!)

Making it per-collection (such as via an explicit StringComparer instance passed in to the collection's ctor) has the benefit that the developer is saying "I know this particular usage of non-randomized hash codes is safe," and this statement is made at collection construction time. Contrast this with an application-wide switch, where the developer making that statement probably hasn't taken the time to analyze their full dependency graph and validate that each dependency is safe in the face of predictable hash codes.

Having collections optimistically use a non-randomizing hash (for TKey = string) and fall back to a randomizing hash in the face of collisions is also a viable option, as it's again a per-collection decision rather than an application-wide setting. It means that scenarios like TKey = System.Uri aren't optimized and continue to use randomizing hash code calculations, but this makes them safe by default, which is goodness.

@vancem
Contributor

vancem commented Aug 23, 2018

My understanding is that randomized hash codes exist to protect against a denial-of-service attack where you don't trust your user. While there are instances where this is very important (a service open to all), in many cases it is just not interesting because you trust the user (if your users already 'own' the machine (e.g. this is an 'ordinary' app), then you don't care; they would be creating a DoS against themselves).

Another scenario is simply compat (my software took a deep dependency, and I need something!). This is probably where @mjsabby is coming from. Here you are simply willing to trade off security (often because it is no worse than what it is replacing).

So I think there are useful scenarios for an opt-out. (This came up in the context of PowerShell, and it is an interesting question whether they should opt out.) One could argue that this is what they do de facto on the desktop runtime (which is most of their user base today).

So far, it seems that there are better places to look for perf, and it can be mitigated in important cases like what @benaadams did. Until someone actually does the PR for the opt-out, the issue won't be forced...

@GrabYourPitchforks
Member

It's not that you don't trust the user - it's that you don't trust the data. Consider something like Office or Visual Studio. I as a human being driving the mouse and keyboard am fully trusted. But the .docx / .csproj / whatever file I'm opening might be from some random source. Even within DevDiv we've spent significant resources within the past year or so trying to address reports where somebody crafted a malicious project and caused the trusted user's IDE to misbehave. It's true that these reports tend to be lower priority since it's a local DoS, but it is still some customer's workflow that's being impacted every time we need to address one of these situations when they inevitably do arise.

@vancem
Contributor

vancem commented Aug 28, 2018

Sure, I get that there are many possible security scenarios. My main point is that DoS attacks are not as critical, and that allowing people to opt out is reasonable. That said, I don't like building features unless we actually know who will use them, and it is not clear who will use this opt-out, so I am not motivated to actually implement it.

I do note that even with the opt-out, we are safe by default, which I suspect is your main point.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@GrabYourPitchforks
Member

new Dictionary<string, ...>() now has the optimization which optimistically uses the "fast" hash code calculation routine, falling back to the randomized hash code calculation routine only when collisions are encountered. We also improved the performance of Marvin32 in the .NET Core 3.0 timeframe (see dotnet/coreclr#22816).

Additionally, per PowerShell/PowerShell#7112, it looks like the original scenario which precipitated the opening of this issue has since been resolved in PowerShell Core.

Since it looks like there's nothing left to do on the runtime side of things I'm closing the issue for now. Please reopen this issue if needed. Thanks!
