
Moving fast parts of cast FCALLs to managed code. #1068

Merged: 22 commits merged into dotnet:master from the castFcalls branch on Jan 22, 2020

Conversation

@VSadov (Member) commented Dec 19, 2019

  • Exposed the casting cache to managed code
  • Implemented a managed version of the cache lookup
  • Moved JIT_IsInstanceOfAny and JIT_ChkCastAny to managed code as the first helpers to convert.
  • Skipped managed JIT helpers in exception stack traces and in the debugger
  • Managed implementation of JIT_IsInstanceOfInterface
  • All other cast helpers are managed.

Fixes: https://github.com/dotnet/coreclr/issues/27931
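For context, here is a minimal, self-contained C# sketch of the general shape such a managed cast helper takes: null check, identity check, cache probe, then a slow path. The names, the dictionary-based cache, and the reflection-based slow path are simplifications assumed for this sketch only; the actual CoreLib code works on MethodTable pointers and its own cast cache.

```csharp
using System;
using System.Collections.Concurrent;

static class CastHelperSketch
{
    // Hypothetical cache keyed by (source type, target type); the real runtime
    // keys on MethodTable pointers and uses its own fixed-size cache.
    private static readonly ConcurrentDictionary<(Type Source, Type Target), bool> s_cache =
        new ConcurrentDictionary<(Type Source, Type Target), bool>();

    // Shape of an "IsInstanceOfAny"-style helper: return the object if the cast
    // would succeed, otherwise null (a ChkCast variant would throw instead).
    public static object IsInstanceOfAny(Type targetType, object obj)
    {
        if (obj == null)
            return null;                                  // null is compatible with any reference type

        Type sourceType = obj.GetType();
        if (sourceType == targetType)
            return obj;                                   // identity cast: compare type handles only

        if (s_cache.TryGetValue((sourceType, targetType), out bool castable))
            return castable ? obj : null;                 // fast path: cache hit

        // Slow path: full check, then remember the answer for next time.
        bool result = targetType.IsAssignableFrom(sourceType);
        s_cache[(sourceType, targetType)] = result;
        return result ? obj : null;
    }
}
```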

@VSadov force-pushed the castFcalls branch 2 times, most recently from 63cb4c5 to 9ca0b1b on December 21, 2019 21:27
}

// NOTE!!
// This is a copy of C++ implementation in CastCache.cpp
@andrew-boyarshin (Contributor) commented Dec 29, 2019

More likely castcache.h. #Resolved

@jkotas (Member) commented Jan 19, 2020

Nit: The implementation is in the .cpp file, but the VM filename is in lower case (castcache.cpp). #Resolved

@VSadov (Member, Author) commented Dec 30, 2019

@tommcdon - tried the "logical unwind" here in the last change. Does not seem to be sufficient.
VS still visualizes invalid casts as rethrows. #Resolved

@tommcdon (Member) commented Jan 1, 2020

> @tommcdon - tried the "logical unwind" here in the last change. Does not seem to be sufficient.
> VS still visualizes invalid casts as rethrows.

Proposed fix at a lower level is here: VSadov@d269b62. Tested the fix in VS and it seems to work correctly. #Resolved

@VSadov force-pushed the castFcalls branch 6 times, most recently from 0d03012 to f0a5986 on January 11, 2020 03:05
@VSadov force-pushed the castFcalls branch 2 times, most recently from ff792a3 to 0729321 on January 13, 2020 07:47
@VSadov force-pushed the castFcalls branch 2 times, most recently from 5d743cc to a0223a9 on January 18, 2020 21:03
@VSadov changed the title from "[WIP] Moving fast parts of cast FCALLs to managed code." to "Moving fast parts of cast FCALLs to managed code." on Jan 19, 2020
@VSadov (Member, Author) commented Jan 19, 2020

I think this is ready to be reviewed. #Resolved

@VSadov (Member, Author) commented Jan 19, 2020

The change moves the fast parts of the casting helpers to managed code and removes the various existing implementations - in C++, in standalone assembly, and in assembly inlined in C. Now it is all C#. #Resolved

@VSadov (Member, Author) commented Jan 19, 2020

I also hoped not to have specialized versions for the class and interface casts.

Turns out outperforming hand-tuned assembly is challenging, so the specialized methods are still there.
I think the cache lookup must get 20-30% faster before we could consider using just the cache. As it is, the best-case scenarios for classes and interfaces are noticeably faster than cache lookups, and these scenarios also seem fairly common.

Whoever wrote the AMD64 assembly helpers did a good job.

I have discussed with @davidwrighton possible ways of speeding up the cache lookup (a simplified sketch of such a lookup follows below):

  • use a singleton table instead of null for cache flush (eliminates one null check)
  • use a fixed-size table (simplifies reducing the hash code to a table index)
  • consider a simpler hash function, perhaps at the cost of a larger table
  • inline the cache lookup by hand (some code paths could be streamlined)
  • precompute the hashed value of the MethodTable pointer
  • ...something else

This may improve the cache lookup enough to switch the Interface and/or Class cases to Any, or maybe it will not.

I think perf tinkering should be a separate change though.
This PR achieves its goals - the helpers are managed and perf is roughly the same or better. #Resolved
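To make the discussion concrete, here is a simplified sketch of a fixed-size, power-of-two cast cache lookup of the kind being considered. The entry layout, the hash mixing, and the table size are illustrative assumptions, and unlike the real cache this sketch ignores concurrency between readers and writers.

```csharp
using System;

// One cache entry: the (source, target) pair and the cached answer.
struct CastCacheEntry
{
    public IntPtr Source;   // source MethodTable / type handle
    public IntPtr Target;   // target MethodTable / type handle
    public bool Result;     // cached outcome of the cast check
}

sealed class CastCacheSketch
{
    private const int Size = 1 << 12;                        // fixed, power-of-two table size
    private readonly CastCacheEntry[] _table = new CastCacheEntry[Size];

    private static int HashToIndex(IntPtr source, IntPtr target)
    {
        // Cheap mixing of the two pointers, then mask down to a table index.
        ulong h = (ulong)source.ToInt64() ^ ((ulong)target.ToInt64() * 0x9E3779B97F4A7C15UL);
        return (int)(h & (ulong)(Size - 1));
    }

    public bool TryGet(IntPtr source, IntPtr target, out bool result)
    {
        ref CastCacheEntry e = ref _table[HashToIndex(source, target)];
        if (e.Source == source && e.Target == target)
        {
            result = e.Result;
            return true;                                     // hit: skip the slow path
        }
        result = false;
        return false;                                        // miss: caller does the full check
    }

    public void TrySet(IntPtr source, IntPtr target, bool result)
    {
        // Last writer wins; a colliding pair simply evicts the previous entry.
        _table[HashToIndex(source, target)] =
            new CastCacheEntry { Source = source, Target = target, Result = result };
    }
}
```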

@@ -423,12 +423,10 @@ enum CorInfoHelpFunc
// the right helper to use

CORINFO_HELP_ISINSTANCEOFINTERFACE, // Optimized helper for interfaces
CORINFO_HELP_ISINSTANCEOFARRAY, // Optimized helper for arrays
@jkotas (Member) commented Jan 19, 2020

I would keep these for now, even if it means that the implementation is the same. #Resolved

@VSadov (Member, Author) commented:

I can get them back.



@VSadov (Member, Author) commented Jan 19, 2020

I just could not see what could be done differently in a helper if we knew we were casting to an array.
For an identity cast check we could now just compare MTs, which we do for any cast...



@VSadov (Member, Author) commented:

I am guessing the JIT could inline the identity cast check similarly to the class cast. Not sure whether those are common with arrays, but with classes inlining the identity cast helps a lot.


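A conceptual C# sketch of the inlined identity-check pattern being described (written as source code rather than JIT output); the Foo type and the slow helper are placeholders for illustration only.

```csharp
using System;

class Foo { }

static class InlinedCastSketch
{
    // What an inlined "isinst Foo" fast path looks like conceptually:
    // compare the object's exact type against the target type and only
    // fall back to the general helper when they differ.
    public static Foo TryCastToFoo(object obj)
    {
        if (obj == null || obj.GetType() == typeof(Foo))
            return (Foo)obj;                      // null or exact match: no helper call

        return SlowIsInstanceOfAny(obj);          // rare path: hierarchy walk / cache lookup
    }

    // Placeholder for the general-purpose helper (hierarchy walk, interface
    // map scan, cast cache, and so on).
    private static Foo SlowIsInstanceOfAny(object obj) => obj as Foo;
}
```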

@VSadov (Member, Author) commented Jan 19, 2020

Here is what I see on a microbenchmark that calls, in a loop, a method that performs a particular kind of cast:


```
Empty: 232ms      <-- overhead, should be _roughly_ subtracted from other measurements.
Base to Base: 417ms    <-- inlined by the JIT, nearly free
Derived to Base: 504ms    <-- IsInstanceOfClassSpecial  1 iteration
Derived2 to Base: 552ms   <-- IsInstanceOfClassSpecial  3 iterations
Derived4 to Base: 622ms   <-- IsInstanceOfClassSpecial  5 iterations
Derived6 to Base: 754ms   <-- IsInstanceOfClassSpecial  7 iterations. no-nullcheck IsInstanceAny would have similar cost. 

List<int> to IList<int>: 601ms       <-- IsInstanceOfInterface 1 iteration
List<int> to ICollection<int>: 644ms <-- IsInstanceOfInterface 2 iterations
List<int> to IEnumerable<int>: 861ms <-- IsInstanceAny   (3rd, but JIT does not like variant interfaces)
List<int> to ICollection: 589ms      <-- IsInstanceOfInterface 5 iterations
List<double> to ICollection<int>: 719ms   <-- IsInstanceOfInterface 8 iterations
Dictionary<double, double> to IDeserializationCallback: 859ms   <-- IsInstanceOfInterface 10 iterations.

List<int> to IReadOnlyCollection<int>: 1062ms  <-- IsInstanceAny, looks like a hash collision
List<string> to IReadOnlyCollection<object>: 842ms  <-- IsInstanceAny  pass
List<double> to IReadOnlyCollection<int>: 966ms     <-- IsInstanceAny  failure 
string[] to IReadOnlyCollection<object>: 858ms      <-- IsInstanceAny  pass
```

where the baseline is

```
Empty: 352ms    <-- overhead, no idea why slower than Base to Base. branch misprediction?
Base to Base: 325ms
Derived to Base: 627ms
Derived2 to Base: 599ms
Derived4 to Base: 691ms
Derived6 to Base: 789ms

List<int> to IList<int>: 550ms
List<int> to ICollection<int>: 724ms
List<int> to IEnumerable<int>: 1075ms  <-- new IsInstanceAny is a bit faster (due to cache improvements, I think).
List<int> to ICollection: 739ms
List<double> to ICollection<int>: 783ms
Dictionary<double, double> to IDeserializationCallback: 919ms

List<int> to IReadOnlyCollection<int>: 973ms
List<string> to IReadOnlyCollection<object>: 1365ms
List<double> to IReadOnlyCollection<int>: 1015ms
string[] to IReadOnlyCollection<object>: 971ms
```

#Resolved
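For reference, a minimal sketch of the kind of microbenchmark loop described here (an assumed harness, not the author's actual benchmark code): a method performing one particular kind of cast is called in a tight loop and the loop is timed.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class CastBench
{
    const int Iterations = 100_000_000;

    // Kept non-inlined so the cast is not folded away together with the loop.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static IList<int> CastToIList(object o) => (IList<int>)o;

    static void Main()
    {
        object list = new List<int>();

        // Warm up so the cast helpers (and the cast cache) are primed.
        for (int i = 0; i < 1_000; i++)
            CastToIList(list);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            CastToIList(list);
        sw.Stop();

        Console.WriteLine($"List<int> to IList<int>: {sw.ElapsedMilliseconds}ms");
    }
}
```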

@VSadov (Member, Author) commented Jan 19, 2020

The Linux results are similar, but there are a few interesting differences.

Since we did not use hand-written assembly on Linux and the C++ helpers were not as sophisticated, there is some gain from the specialized methods.

On the other hand, the improvements in the cache lookup seem to have been negated by something. We even see a 1-2% regression. I think it is acceptable, but I wonder why we have it.

Either we have some platform-specific penalty for moving to managed code (calling conventions, access to statics?), or GCC produces better code (compared to MSVC) for the native cache lookup and the JIT is not matching that.
It would be interesting to find out. Maybe we can improve something.

```
===== baseline

Empty: 338ms
Base to Base: 397ms
Derived to Base: 588ms
Derived2 to Base: 678ms
Derived4 to Base: 1141ms
Derived6 to Base: 1056ms
List<int> to IList<int>: 649ms
List<int> to ICollection<int>: 676ms
List<int> to IEnumerable<int>: 963ms
List<int> to ICollection: 1092ms
List<double> to ICollection<int>: 1302ms
Dictionary<double, double> to IDeserializationCallback: 1264ms
List<int> to IReadOnlyCollection<int>: 983ms
List<string> to IReadOnlyCollection<object>: 988ms
List<double> to IReadOnlyCollection<int>: 989ms
string[] to IReadOnlyCollection<object>: 991ms

=== after the change:

Empty: 286ms
Base to Base: 453ms
Derived to Base: 755ms
Derived2 to Base: 816ms
Derived4 to Base: 844ms
Derived6 to Base: 994ms
List<int> to IList<int>: 679ms
List<int> to ICollection<int>: 623ms
List<int> to IEnumerable<int>: 1051ms
List<int> to ICollection: 967ms
List<double> to ICollection<int>: 867ms
Dictionary<double, double> to IDeserializationCallback: 1164ms
List<int> to IReadOnlyCollection<int>: 1234ms
List<string> to IReadOnlyCollection<object>: 1166ms
List<double> to IReadOnlyCollection<int>: 1019ms
string[] to IReadOnlyCollection<object>: 1111ms
```

#Resolved

@VSadov (Member, Author) commented Jan 19, 2020

Note - all the measurements have some degree of noise.
At this level we are sensitive to the alignment of code blocks and of jumps/labels, to branch mispredictions, and the like, so sometimes results vary for seemingly no reason. #Resolved

@VSadov (Member, Author) commented Jan 22, 2020

Removed AggressiveOptimization from casting helpers.
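This refers to the MethodImplOptions.AggressiveOptimization flag. A hypothetical helper carrying the attribute that was removed would look like the sketch below; the helper name and body are placeholders for illustration.

```csharp
using System.Runtime.CompilerServices;

static class CastHelpersAttributeExample
{
    // With the attribute, the method is always compiled with full optimizations
    // and bypasses tiered compilation; removing it lets the helper be tiered
    // like ordinary managed code.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    internal static bool IsString(object obj) => obj is string;
}
```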

@VSadov (Member, Author) commented Jan 22, 2020

Thanks!!

@VSadov merged commit c3dc1fd into dotnet:master on Jan 22, 2020
@benaadams (Member) commented:

Does build --subsetCategory coreclr -configuration release R2R the System.Private.CoreLib.dll? (I think it does, as it also produces a System.Private.CoreLib.ni.pdb.)

If so, can these helpers be added to R2R? I am seeing ChkCastAny showing up as jitting during startup:

[screenshot showing ChkCastAny being jitted during startup]

@jkotas (Member) commented Feb 3, 2020

I do not see ChkCastAny JITed during startup. For the record, here is the list of methods that I see getting JITed during startup (CoreLib R2R, empty Main):

```
System.SpanHelpers:IndexOf(byref,ushort,int):int
System.Runtime.Intrinsics.Vector128`1[UInt16][System.UInt16]:get_Count():int
Internal.Runtime.CompilerServices.Unsafe:Add(byref,long):byref
System.Runtime.Intrinsics.Vector128:CreateScalarUnsafe(ushort):System.Runtime.Intrinsics.Vector128`1[UInt16]
System.Runtime.Intrinsics.Vector128:AsUInt32(System.Runtime.Intrinsics.Vector128`1[UInt16]):System.Runtime.Intrinsics.Vector128`1[UInt32]
System.Runtime.Intrinsics.Vector128:AsUInt16(System.Runtime.Intrinsics.Vector128`1[UInt32]):System.Runtime.Intrinsics.Vector128`1[UInt16]
System.Runtime.Intrinsics.Vector256`1[UInt16][System.UInt16]:get_Count():int
System.Text.Unicode.Utf16Utility:GetPointerToFirstInvalidChar(long,int,byref,byref):long
System.Runtime.Intrinsics.Vector128:CreateScalarUnsafe(short):System.Runtime.Intrinsics.Vector128`1[Int16]
System.Runtime.Intrinsics.Vector128:AsInt32(System.Runtime.Intrinsics.Vector128`1[Int16]):System.Runtime.Intrinsics.Vector128`1[Int32]
System.Runtime.Intrinsics.Vector128:AsInt16(System.Runtime.Intrinsics.Vector128`1[Int32]):System.Runtime.Intrinsics.Vector128`1[Int16]
System.Runtime.Intrinsics.X86.Sse41:get_IsSupported():bool
System.Runtime.Intrinsics.Vector128:AsUInt16(System.Runtime.Intrinsics.Vector128`1[Int16]):System.Runtime.Intrinsics.Vector128`1[UInt16]
System.Runtime.Intrinsics.Vector128:AsByte(System.Runtime.Intrinsics.Vector128`1[UInt16]):System.Runtime.Intrinsics.Vector128`1[Byte]
System.Text.ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long
X64:get_IsSupported():bool
X64:ParallelBitExtract(long,long):long
```

What is the list of methods that you see JITed during startup?

@benaadams (Member) commented:

> What is the list of methods that you see JITed during startup?

Maybe the Profiler is causing recompilation then?

[screenshot omitted]

@xiangzhai (Contributor) commented:

:mips-interest

The conversation was locked as resolved and limited to collaborators on Dec 11, 2020.