-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Mono]: Reduce Mono AOT cross compiler x64 memory footprint. #97096
[Mono]: Reduce Mono AOT cross compiler x64 memory footprint. #97096
Conversation
@@ -0,0 +1,10 @@ | |||
<linker> | |||
<assembly fullname="System.Private.CoreLib"> | |||
<type fullname="System.Runtime.Intrinsics.Vector256"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true for all codegens (e.g. interpreter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like interpreter mark all vectors as not being hardware accelerated:
} else if (in_corlib &&
(!strncmp ("System.Runtime.Intrinsics", klass_name_space, 25) &&
!strncmp ("Vector", klass_name, 6) &&
!strcmp (tm, "get_IsHardwareAccelerated"))) {
*op = MINT_LDC_I4_0;
}
so for that case it should be ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @kg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We mark v128 as hardware accelerated in interp in some cases, I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so then this should be OK since it only affects v256/v512.
Wouldn't be better to split this into smaller PRs ? |
I wonder if we can make working with symbols a little more typesafe so that we have some distinction between a mempool allocated symbol and a temporary malloc-allocated symbol. Maybe we can just pass around a |
src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems
Outdated
Show resolved
Hide resolved
src/mono/System.Private.CoreLib/src/ILLink/ILLink.Substitutions.Intrinsics.x86.xml
Show resolved
Hide resolved
src/mono/mono/mini/simd-intrinsics.c
Outdated
return NULL; | ||
|
||
if (vector_size == 256 || vector_size == 512) { | ||
if (id == SN_get_IsHardwareAccelerated ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For RyuJIT, we have a fallback that handles any unrecognized get_IsHardwareAccelerated
and get_IsSupported
APIs for System.Numerics
, System.Runtime.Intrinsics
, and System.Runtime.Intrinsics.*
to ensure they can be treated as constant false
. Each namespace is being handled a little bit differently since get_IsSupported
for some namespaces needs to fallback to user-code instead.
Would it be a good idea to similarly make this general-purpose. Notably vector_size == 64
should also be false
on x86/x64
for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have a common fallback for IsHardwareAccelerted in instrinsics.c:
/* Fallback if SIMD is disabled */
if (in_corlib &&
((!strcmp ("System.Numerics", cmethod_klass_name_space) && !strcmp ("Vector", cmethod_klass_name)) ||
!strncmp ("System.Runtime.Intrinsics", cmethod_klass_name_space, 25))) {
if (!strcmp (cmethod->name, "get_IsHardwareAccelerated")) {
EMIT_NEW_ICONST (cfg, ins, 0);
ins->type = STACK_I4;
return ins;
}
}
So I guess we should be able to drop that change (just made it explicitly for better visibility in this PR) and rely on that fallback to end up with the same result. Not sure why 64-bit vector size was not included in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted this change to rely on fallback for get_IsHardwareAccelerated.
src/mono/mono/mini/simd-intrinsics.c
Outdated
if (!strcmp (class_ns, "System.Runtime.Intrinsics")) { | ||
if (!strcmp (class_name, "Vector128")) | ||
if (!strcmp (class_name, "Vector128") || !strcmp(class_name, "Vector256") || !strcmp(class_name, "Vector512")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about Vector64
?
src/mono/mono/mini/simd-intrinsics.c
Outdated
@@ -5911,8 +5981,14 @@ static | |||
MonoInst* | |||
arch_emit_simd_intrinsics (const char *class_ns, const char *class_name, MonoCompile *cfg, MonoMethod *cmethod, MonoMethodSignature *fsig, MonoInst **args) | |||
{ | |||
if (!strcmp (class_ns, "System.Runtime.Intrinsics.X86")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a similar line for System.Runtime.Intrinsics.Arm
and System.Runtime.Intrinsics.Wasm
missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or general purpose code handling any unrecognized System.Runtime.Intrinsics.*
namespace?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handled elsewhere in that source file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will find Arm and Wasm under respective defines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do they not need a path to ensure that IsSupported
returns constant false
and the intrinsics directly generate a PNSE
exception, rather than hitting the recursive fallback, or is that handled elsewhere as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mainly focused on x86 in this PR, other code will still go through the fallbacks, but could probably be enhanced as well.
/* | ||
* If a method has been marked with aggressive inlining, check if we support | ||
* aggressive inlining of the intrinsics type, if not, ignore aggressive inlining | ||
* since it could end up inlining a large amount of code that most likely will end | ||
* up as dead code. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help me understand why this is needed on Mono a bit?
The software fallback for V128/V256/V512 is currently implemented as doing 2x operations on the lower/upper halves (except for a couple of methods like Shuffle
which operate on the full vector). So V512
is implemented as 2x V256
, V256
is implemented as 2x V128
, and V128
is implemented as 2x V64
. V64
is then implemented as a loop over the scalar elements with potentially large amounts of generic code.
So I would imagine that on any platform where V128
is supported, that the codegen for V256
/V512
should be generally nice/small, even if aggressively inlined and with no real dead code elimination required. Even in the case where you have an unsupported type like V256<Guid>
, the first V128
operation should have a PNSE
thrown (since whether a type is supported or not is typically a known constant).
So I'd only expect that this is needed for V64 on x86/x64 where there is no acceleration and it has to hit the scalar fallback with the loop over the elements. For RyuJIT this loop isn't an issue due to the generic specialization we do, allowing us to get it down to only the code path that is actually used. I believe this isn't possible for Mono today and is non-trivial to add, so the skipping of inlining does make sense in that regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So without disabling aggressive inlining I ended up with methods that where huge, and not doing aggressive inline on these types (but still honor inline size limits) made them sane again. I will need to re-iterate around that change in order to tell exactly what happened with our inliner in the aggressive case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just validated this, with disabling the aggressive inlining as above a .net9 full AOT of S.P.C takes ~1.7 GB of memory, and not doing this will consume an additional 600 MB of memory, so I believe its worth preventing aggressive inlining for these types that are not hardware accelerated on any of the Mono supported platforms. I didn't add V64 since it seems to be handled a little differently on at least ARM case. I won't have bandwidth at the moment to do more deep analysis around why aggressive inlining of these template types cause that large increase in memory so I think doing this change, at least short term is worth it, we could probably file an issue around the bloat of Vector256
1 and
Vector5121
, but maybe the better long term solution is to actually implement intrinsic support for these types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For interpreter I'm handling this in a general fashion by detecting early dead pieces of code and not inlining any of the calls there. #97514. It is possible that a similar approach for jit can produce further improvements without having to special case classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try to experiment with some dead code elimination, but that cause issues and doesn't work with our llvm codegen, so we explicitly turn that pass off for code that will be passed over to llvm. Instead I made sure we could do more in linker, but still these methods still explode and survives first simple llvm optimizations pass we now do in function manager pass, so feels like something in the inlined code prevents elimination until very late in the opt chain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to reenable branch optimizations when using llvm in this pr:
#97189
|
Extracted from #97096. Author: Johan Lorensson <lateralusx.github@gmail.com>
Extracted from #97096. Author: Johan Lorensson <lateralusx.github@gmail.com>.
Hopfully I resolved the conflicts correctly, someone should review. |
Extracted from dotnet#97096. Author: Johan Lorensson lateralusx.github@gmail.com.
4e885b1
to
65f6f9c
Compare
65f6f9c
to
aa39dff
Compare
Extracted from dotnet#97096. Author: Johan Lorensson lateralusx.github@gmail.com.
Extracted from #97096. Author: Johan Lorensson lateralusx.github@gmail.com.
Building .net8 S.P.C using Mono AOT cross compiler in full AOT consumes a large amount of memory (up to 6 GB). This is mainly due to generated LLVM module not being optimized at all while kept in memory during full module generation. Mono x64 also lacks support for several intrinsics as well as Vector 256/512 that in turn leads to massive inlining of intrinsics functions generating a very large LLVM module, where majority of this code ends up as dead code due to IsSupported/IsHardwareAccelerated returning false. The follow commit adjusts several things that will bring down the memory usage, compiling .net8/.net9 Mono S.P.C on x64 Windows from 6 GB down to ~750 MB. * Use PSNE implementations on intrinsics not supported on Mono. * Add ILLinker substitutions for intrinsics not supported on Mono. Enables ILLinker to do dead code elimination, reduce code to AOT compile. * Prevent aggressive inlining for a couple of unsupported intrinsics types making sure we don't end up with excessive inlining, exploding code size. * Run a couple of LLVM optimization passes on each generated method doing early code simplification and dead code elimination during LLVM module generation. * Explicit SN_get_IsHardwareAccelerated/SN_get_IsSupported intrinsics implementation for all unsupported Mono x64 SIMD intrinsics. * Fixed numerous memory leaks in Mono AOT cross compiler code. * Fix a couple of sequence points free after use errors. * Fix an anonymous struct build warning triggering build error for LLVM enabled cross compiler on Windows.
aa39dff
to
cfaf8d9
Compare
ea8a484
to
5ab0d94
Compare
Failures are unrelated. |
This PR looks to be responsible for ~200kB package size improvement on iOS HelloWorld (range of commits 339443b...a79c62d) Good job everyone! |
Edit. after the subsequent fix to this PR (#98515), the package size improvements on iOS HelloWorld are mostly gone (i.e., back to original values). |
Building .net8 S.P.C using Mono AOT cross compiler in full AOT consumes a large amount of memory (up to 6 GB). This is mainly due to generated LLVM module not being optimized at all while kept in memory during full module generation. Mono x64 also lacks support for several intrinsics as well as Vector 256/512 that in turn leads to massive inlining of intrinsics functions generating a very large LLVM module, where majority of this code ends up as dead code due to IsSupported/IsHardwareAccelerated returning false.
The follow commit adjusts several things that will bring down the memory usage, compiling .net8/.net9 Mono S.P.C on x64 Windows from 6 GB down to ~750 MB.