Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Numerics.Tensors.Tests cause long-running/hanging tests on Mono JIT (not interpreter) #97295

Open
akoeplinger opened this issue Jan 22, 2024 · 11 comments
Labels
area-Codegen-Interpreter-mono area-Codegen-JIT-mono disabled-test The test is disabled in source code against the issue
Milestone

Comments

@akoeplinger
Copy link
Member

After #97192 was merged we started seeing timeouts in System.Numerics.Tensors.Tests on Mono across all platforms, but seemingly only when using the JIT. Interpreter runs pass fine.

There are a large number of long-running tests like this, but disabling them one-by-one wasn't helpful since then just another test was hanging so I had to disable the whole test project

[Long Running Test] 'System.Numerics.Tensors.Tests.NFloatGenericTensorPrimitives.IndexOfMinMagnitude_ReturnsNegative1OnEmpty', Elapsed: 00:03:17
[Long Running Test] 'System.Numerics.Tensors.Tests.HalfGenericTensorPrimitives.IndexOfMax_ReturnsNegative1OnEmpty', Elapsed: 00:02:03
@nealef
Copy link
Contributor

nealef commented Jan 22, 2024

@uweigand fya

@uweigand
Copy link
Contributor

@giritrivedi @saitama951 can you see if you can recreate this on s390x

@giritrivedi
Copy link
Contributor

Sure @uweigand

@pavelsavara
Copy link
Member

I believe this test on NodeJS/WASM is interpreter.
It's very very slow in linear way.
Killed after 1h 15min

Log
Build

[03:57:19] info: [STRT] System.Numerics.Tensors.Tests.NFloatGenericTensorPrimitives.SpanScalarDestination_InPlace(tensorPrimitivesMethod: SpanScalarDestinationDelegate`3 { Method = Void Min[NFloat](System.ReadOnlySpan`1[System.Runtime.InteropServices.NFloat], System.Runtime.InteropServices.NFloat, System.Span`1[System.Runtime.InteropServices.NFloat]), Target = null }, expectedMethod: Func`3 { Method = System.Runtime.InteropServices.NFloat Min(System.Runtime.InteropServices.NFloat, System.Runtime.InteropServices.NFloat), Target = null })
[03:57:19] info: Process 2804 didn't exit within 01:15:00 and will be killed

[03:57:19] info: Killing process tree of 2804...

@ghost
Copy link

ghost commented Feb 26, 2024

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

Issue Details

After #97192 was merged we started seeing timeouts in System.Numerics.Tensors.Tests on Mono across all platforms, but seemingly only when using the JIT. Interpreter runs pass fine.

There are a large number of long-running tests like this, but disabling them one-by-one wasn't helpful since then just another test was hanging so I had to disable the whole test project

[Long Running Test] 'System.Numerics.Tensors.Tests.NFloatGenericTensorPrimitives.IndexOfMinMagnitude_ReturnsNegative1OnEmpty', Elapsed: 00:03:17
[Long Running Test] 'System.Numerics.Tensors.Tests.HalfGenericTensorPrimitives.IndexOfMax_ReturnsNegative1OnEmpty', Elapsed: 00:02:03
Author: akoeplinger
Assignees: -
Labels:

disabled-test, area-Codegen-JIT-mono, area-Codegen-Interpreter-mono

Milestone: 9.0.0

@saitama951
Copy link
Contributor

I had a look into the issue.

Here is a extracted test case

public abstract class func
{
        static void test()
        {
                var watch = System.Diagnostics.Stopwatch.StartNew();
                Console.WriteLine(TensorPrimitives.IndexOfMinMagnitude(ReadOnlySpan<int>.Empty));
                watch.Stop();
                Console.WriteLine("Time Elapsed:"+watch.ElapsedMilliseconds);

        }
        public static void Main(string []args)
        {
                 func.test();
        }
}

now when running this extracted test case in debug and release, I see the release mode takes 20x times than that of the debug mode.

now looking at the perf hotspots I see a hotspot in move_basic_block_to_end which traverses to the end of the list and that traverses through ~100k basic blocks and these basic blocks are being generated by inline_method

debugging further I can see :

IndexOfMinMaxCore generates around ~118454 basic blocks and in which System.Numerics.Tensors.TensorPrimitives:IndexOfFinalAggregate in total generates around ~70k basic blocks
for the three different variants of the vectors (~21k basic blocks for the 128 variant, ~23k basic blocks for 256 variant, ~28k basic blocks for 512 variant ).

Now in System.Numerics.Tensors.TensorPrimitives:IndexOfFinalAggregate(System.Runtime.Intrinsics.Vector128<int>,System.Runtime.Intrinsics.Vector128<int>) I see a big bump in basic blocks in TIndexOfOperator.Invoke around ~2152 basic blocks which is invoked 10 times in System.Numerics.Tensors.TensorPrimitives:IndexOfFinalAggregate which is roughly around ~21k basic blocks

Now in TIndexOfOperator.Invoke I have commented the number of basic blocks being generated for the each call

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public static void Invoke(ref Vector128<T> result, Vector128<T> current, ref Vector128<T> resultIndex, Vector128<T> currentIndex)
            {
                Vector128<T> resultMag = Vector128.Abs(result), currentMag = Vector128.Abs(current); // 144 basic blocks
                Vector128<T> useResult = Vector128.LessThan(resultMag, currentMag); // 137 basic blocks
                Vector128<T> equalMask = Vector128.Equals(resultMag, currentMag); // 137 basic blocks

                if (equalMask != Vector128<T>.Zero) //77 basic blocks
                {
                    Vector128<T> lessThanIndexMask = IndexLessThan(resultIndex, currentIndex); //556 basic blocks
                    if (typeof(T) == typeof(float) || typeof(T) == typeof(double))
                    {
                        // bool useResult = equal && ((IsNegative(result) == IsNegative(current)) ? (resultIndex < currentIndex) : IsNegative(result));
                        Vector128<T> resultNegative = IsNegative(result); // 417 basic blocks
                        Vector128<T> sameSign = Vector128.Equals(resultNegative.AsInt32(), IsNegative(current).AsInt32()).As<int, T>(); //417 + 137 basic blocks
                        useResult |= equalMask & ElementWiseSelect(sameSign, lessThanIndexMask, resultNegative); //36 blocks
                    }
                    else
                    {
                        useResult |= equalMask & lessThanIndexMask;
                    }
                }

                result = ElementWiseSelect(useResult, result, current); //36 blocks
                resultIndex = ElementWiseSelect(useResult, resultIndex, currentIndex); //36 blocks
            }

The library call System.Numerics.Tensors.TensorPrimitives:IsNegative and System.Numerics.Tensors.TensorPrimitives:IndexLessThan invokes Vector128.LessThan (~137 basic blocks) for 3 and 4 times respectively.

The Vector128.LessThan calls 2 * Vector64.LessThan and then in the end Scalar.LessThan and Scalar.get_AllBitsSetgets called which generates around ~120 basic blocks ((36+24)*2)

https://github.com/dotnet/runtime/blob/e7e3dee0a50e1780c889d63797089acc12e564da/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector64.cs#L1438-#L1449

since all of these are dead code for s390x, Is there a way that this can be eliminated way before the other optimization passes?

Another concern is that
while inlining we can see intermediate basic blocks being generated and then being merged, although I don't see
cfg->num_bblocks being updated on-the-go (looks like this is being used as a counter to assign to cfg->block_num but also used for loop calculations in other optimizations like mono_optimize_branches) . I believe we should keep an extra variable to do the job as the counter.

cc: @uweigand @akoeplinger

@akoeplinger
Copy link
Member Author

@vargaz would you mind helping with this issue?

@vargaz vargaz self-assigned this Mar 8, 2024
@vargaz
Copy link
Contributor

vargaz commented Mar 9, 2024

I can't reproduce this. This might have been fixed recently by this and related PRs:
#97096

@vargaz
Copy link
Contributor

vargaz commented Mar 9, 2024

I can reproduce, it only seems to happen on platforms without SIMD.

vargaz added a commit to vargaz/runtime that referenced this issue Mar 9, 2024
* Move the if (cfg->opt & MONO_OPT_SIMD) checks into simd-intrinsics.c
  so the IsSupported checks can be intrinsified even if SIMD is disabled.
* Implement mono_simd_unsupported_aggressive_inline_intrinsic_type () in
  the non-simd cases as well, add more types.

Re: dotnet#97295.
@vargaz
Copy link
Contributor

vargaz commented Mar 9, 2024

The PR will improve the situation somewhat (2s -> 0.7s).

vargaz added a commit to vargaz/runtime that referenced this issue Mar 9, 2024
* Move the if (cfg->opt & MONO_OPT_SIMD) checks into simd-intrinsics.c
  so the IsSupported checks can be intrinsified even if SIMD is disabled.
* Implement mono_simd_unsupported_aggressive_inline_intrinsic_type () in
  the non-simd cases as well, add more types.

Re: dotnet#97295.
@saitama951
Copy link
Contributor

@vargaz Thanks for the fix, this reduced the regression from 2k ms -> 1k ms . however, I see some vector methods still being inlined i.e

INLINE START 0x2aa31a22938 System.Numerics.Tensors.TensorPrimitives:IndexOfMinMaxCore<int, System.Numerics.Tensors.TensorPrimitives/IndexOfMinMagnitudeOperator`1<int>> (System.ReadOnlySpan`1<int>) -> System.Runtime.Intrinsics.Vector512:Create (int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16)
created temp 418 (R983) of type System.Runtime.Intrinsics.Vector512<System.Int16>
method to IR System.Runtime.Intrinsics.Vector512:Create (int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16)

however I was thinking to disable inlining for vector methods completely for platforms which don't support it

@@ -4783,6 +4785,19 @@ inline_method (MonoCompile *cfg, MonoMethod *cmethod, MonoMethodSignature *fsig,
 
        if (!fsig)
                fsig = mono_method_signature_internal (cmethod);
+      if (!(cfg->opt & MONO_OPT_SIMD))
+       {
+               if (!strcmp (m_class_get_name_space (cmethod->klass), "System.Runtime.Intrinsics"))
+               {
+                       if (!strncmp(m_class_get_name (cmethod->klass), "Vector", 6)) {
+                               const char *vector_type = m_class_get_name (cmethod->klass) + 6;
+                       if (!strcmp(vector_type, "256`1") || !strcmp(vector_type, "512`1") || !strcmp(vector_type, "256") || !strcmp(vector_type, "512"))
+                               return 0;
+                       if (!(cfg->opt & MONO_OPT_SIMD) && (!strcmp (vector_type, "128`1") || !strcmp (vector_type, "128") || !strcmp (vector_type, "64`1") || !strcmp (vector_type, "64")))
+                               return 0;
+                       }
+               }
+       }
 
        if (cfg->verbose_level > 2)
                printf ("INLINE START %p %s -> %s\n", cmethod,  mono_method_full_name (cfg->method, TRUE), mono_method_full_name (cmethod, TRUE));

before this patch

dotnet bin/Release/net9.0/tensor.dll
-1
Time Elapsed:1028

after this patch

dotnet bin/Release/net9.0/tensor.dll
-1
Time Elapsed:151

vargaz added a commit that referenced this issue Mar 11, 2024
)

* Move the if (cfg->opt & MONO_OPT_SIMD) checks into simd-intrinsics.c
  so the IsSupported checks can be intrinsified even if SIMD is disabled.
* Implement mono_simd_unsupported_aggressive_inline_intrinsic_type () in
  the non-simd cases as well, add more types.

Re: #97295.
@vargaz vargaz removed their assignment Mar 22, 2024
@steveisok steveisok modified the milestones: 9.0.0, Future Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-Codegen-Interpreter-mono area-Codegen-JIT-mono disabled-test The test is disabled in source code against the issue
Projects
None yet
Development

No branches or pull requests

9 participants