Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add span-based overloads of String.GetHashCode #26924

Closed
GrabYourPitchforks opened this issue Jul 24, 2018 · 14 comments
Closed

Add span-based overloads of String.GetHashCode #26924

GrabYourPitchforks opened this issue Jul 24, 2018 · 14 comments
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Milestone

Comments

@GrabYourPitchforks
Copy link
Member

ASP.NET has specialized data structures that use span equivalents as keys in dictionaries. As such, they need an equivalent of String.GetHashCode which takes ReadOnlySpan<char> instead of string. I propose the following APIs to enable this scenario.

namespace System {
    public class String {
        // new static APIs on existing string type
        public static int GetHashCode(ReadOnlySpan<char> value);
        public static int GetHashCode(ReadOnlySpan<char> value, StringComparison comparisonType);
    }
}

namespace System.Globalization {
    public class CompareInfo {
        // new instance APIs on existing CompareInfo type
        public int GetHashCode(ReadOnlySpan<char> source, CompareOptions options);
    }
}

The new static APIs on string mimic the behaviors of the instance methods String.GetHashCode() and String.GetHashCode(StringComparison), respectively. Per previous email discussions we do not want to specialize ReadOnlySpan<T>.GetHashCode when T = char, and we cannot add this as an extension method, so the next best logical place would be to put this on the string type directly.

The new instance API on CompareInfo mimics the behavior of the existing instance method CompareInfo.GetHashCode(String, CompareOptions). Even though all the other methods on CompareInfo are virtual, per last week's API review we determined that there's no legal way for anybody to subclass this type, so new APIs on this type needn't be virtual.

/cc @rynowak

@Wraith2
Copy link
Contributor

Wraith2 commented Jul 24, 2018

Per previous email discussions we do not want to specialize ReadOnlySpan.GetHashCode when T = char,

Briefly what are the reasons for this? I was working my way towards a proposal on this topic and that was the obvious path.

Will the hashcodes for strings and equivalent spans be required to be the same? Having read through various discussions about the MArvin string hashing implementations in coreclr and corefx I seem to remember that the length and zero termination could be involved in the hash.

You also mention specialized data structures in asp.net, got a link to those? When considering this the main problem I found was that there was no resilience to hash collision if you use the hash as the key. That's not likely to be a problem for small lookups but in a larger case could cause misses.

@svick
Copy link
Contributor

svick commented Jul 24, 2018

Even though all the other methods on CompareInfo are virtual, per last week's API review we determined that there's no legal way for anybody to subclass this type, so new APIs on this type needn't be virtual.

Does this mean it would make sense to open a separate API proposal to make the methods non-virtual? This would make the API clearer, consistent with this new method and slightly more efficient. Or would that still be considered a breaking change?

@GrabYourPitchforks
Copy link
Member Author

@rynowak Can you comment on the specific use cases ASP.NET has? You pointed me earlier to this as an experimental sample, but I don't want to put words in your mouth regarding your particular motivations or use cases.

@Wraith2 The equality operator on Span<T> is a referential equality operator: do the two instances reference the same data and are they the same length? Similarly, the Span<T>GetHashCode() method throws NotSupportedException since Object.GetHashCode() is intended to support placing arbitrary objects into dictionaries as keys, but spans cannot be boxed or used as TKey for dictionaries, so there's no compelling argument for implementing it. The reason we don't want to specialize the equality operator or GetHashCode for T = char is that now sometimes equality comparison will be referential (O(1)), and sometimes it will be deep (O(n)). And sometimes GetHashCode will throw, and sometimes it won't. Worse, you cannot tell what behavior you're going to get if you write a generic method MyMethod<T>(Span<T> mySpan) and call int hash = mySpan.GetHashCode();. The least surprising alternative IMO is to put this on the string type.

Given ReadOnlySpan<char> theSpan, the return values of String.GetHashCode(theSpan) and new string(theSpan).GetHashCode() are not strictly required to compare as equal since they are fundamentally different types. But in reality I assume the string-based methods would just delegate to the span-based methods for ease of implementation, so in practice they'll probably be equal. (But again - I wouldn't take a dependency on this behavior.)

@svick Per the comment at https://github.com/dotnet/corefx/issues/28001#issuecomment-405683985, the current plan of record is simply to make newly-added APIs on the CompareInfo type non-virtual. I don't think anybody has proposed changing existing APIs on that type because there's currently no evidence that leaving them virtual harms anything, and minimizing work is always goodness. :)

@rynowak
Copy link
Member

rynowak commented Jul 24, 2018

Sure, our primary use case for this right now is inside routing. We are executing a state machine and need to make some kind of branching decision based on a section of the URL path. We'd really prefer to do this without the need to substring since most of the time we won't use that content as a string anyway.

So we're likely to end up owning our own hash table implementation that stores a set of strings as keys, but can use a ReadOnlySpan<char> to look up the value, which in our case is the next state.

The hash code part of this is something we don't want to duplicate from CoreCLR because it's pretty general-purpose.

@Wraith2
Copy link
Contributor

Wraith2 commented Jul 25, 2018

Ok, I agree with all of that reasoning. The specific case I was considering was string.GetHashCode(value.AsSpan())==value.GetHashCode() where I don't think there should be any requirement that it would return true but that it might happen to. As long as people are careful this is fine. A decision could be made to ensure that they won't ever be equal by adding a constant term in the hash method so that it doesn't ever cause an issue.

@jnm2
Copy link
Contributor

jnm2 commented Jul 28, 2018

Putting the APIs on System.String makes me more comfortable. I want to know that I'm specifically getting the String hash which would match the equivalent string's GetHashCode(), as opposed to a potentially diverging more general sequence hash.

@cdmihai
Copy link
Contributor

cdmihai commented Aug 1, 2018

Another helper method that would help MSBuild and also help StringSegment avoid allocations in its HashCode would be:

public class String {
        // new static APIs on existing string type
        public static int GetHashCode(string aString, int offset, int length, StringComparison comparisonType);
    }

public class StringComparer {
      public bool Equals(string a, int aOffset, int aLength, string b, int bOffset, int bLength);
      public int GetHashCode(string aString, int offset, int length);
}

MSBuild has such an attempt, though it's hardcoded to OrdinalIgnoreCase: https://github.com/Microsoft/msbuild/blob/master/src/Shared/MSBuildNameIgnoreCaseComparer.cs#L119

@ahsonkhan
Copy link
Member

ahsonkhan commented Aug 8, 2018

Another helper method that would help MSBuild and also help StringSegment avoid allocations in its HashCode would be

You can use the proposed span-based overloads instead (and call AsSpan on the string).

Edit: As we discussed offline, there might be a perf penalty if the user has to call AsSpan everytime the GetHashCode API gets called. We should measure if the string null check would have an impact.

@GrabYourPitchforks
Copy link
Member Author

@ahsonkhan If the JIT can elide the null check (perhaps because the string instance has already been dereferenced within the same frame), there won't be a perf penalty. For example, I've used the following pattern to elide these checks.

int dummy = theString.Length;
ReadOnlySpan<char> theSpan = theString.AsSpan();
// consume theSpan

The JIT is smart enough to know that control can't proceed past the Length property getter if the instance is null, so the null check at the beginning of the AsSpan method gets removed entirely. This results in codegen similar to the below.

; assume rax references the string instance
mov r10d, dword ptr [rax + <const_offset_to_length>] ; access Length property, raises exception if null
lea r9d, [rax + <const_offset_to_data>] ; this is the 'ref readonly char' field backing the span
mov r10d, dword ptr [rax + <const_offset_to_length>] ; this is the length field backing the span (redundant, but whatever)

It ends up being fairly efficient in the end.

@GrabYourPitchforks
Copy link
Member Author

FYI this work is already done and sitting in the feature/utf8string branch. It'll get merged in at some point in the future.

@stephentoub
Copy link
Member

stephentoub commented Sep 25, 2018

FYI this work is already done and sitting in the feature/utf8string branch. It'll get merged in at some point in the future.

@GrabYourPitchforks, could you create a PR for moving just these APIs over to master? They seem very separable from the Utf8String work, and if they're already implemented, let's move 'em along and close out this issue :)

@GrabYourPitchforks
Copy link
Member Author

@stephentoub Sure, let me ping some people in-person and see what the best path forward would be. The implementation currently depends on some other refactorings to the string and CultureInfo classes, and I'll gauge the temperature of bringing those improvements over at the same time.

@stephentoub
Copy link
Member

Thanks.

@GrabYourPitchforks
Copy link
Member Author

dotnet/coreclr#20275 is the first PR under the "depends on some other refactorings" umbrella. The helper methods introduced there and in one more subsequent PR will be used by the GetHashCode logic to give a substantial performance boost when the dust settles.

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 16, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime
Projects
None yet
Development

No branches or pull requests

9 participants