System.Numerics.Tensors.Tests cause long-running/hanging tests on Mono JIT (not interpreter) #97295

akoeplinger · 2024-01-22T11:25:45Z

After #97192 was merged we started seeing timeouts in System.Numerics.Tensors.Tests on Mono across all platforms, but seemingly only when using the JIT. Interpreter runs pass fine.

There are a large number of long-running tests like this, but disabling them one-by-one wasn't helpful since then just another test was hanging so I had to disable the whole test project

[Long Running Test] 'System.Numerics.Tensors.Tests.NFloatGenericTensorPrimitives.IndexOfMinMagnitude_ReturnsNegative1OnEmpty', Elapsed: 00:03:17
[Long Running Test] 'System.Numerics.Tensors.Tests.HalfGenericTensorPrimitives.IndexOfMax_ReturnsNegative1OnEmpty', Elapsed: 00:02:03

The text was updated successfully, but these errors were encountered:

nealef · 2024-01-22T11:36:53Z

@uweigand fya

dotnet#97297 dotnet#97295 dotnet#97296 Caused by dotnet#97192

uweigand · 2024-01-22T11:43:49Z

@giritrivedi @saitama951 can you see if you can recreate this on s390x

giritrivedi · 2024-01-22T12:28:47Z

Sure @uweigand

#97297 #97295 #97296 Caused by #97192

dotnet#97297 dotnet#97295 dotnet#97296 Caused by dotnet#97192

pavelsavara · 2024-01-25T09:01:23Z

I believe this test on NodeJS/WASM is interpreter.
It's very very slow in linear way.
Killed after 1h 15min

Log
Build

[03:57:19] info: [STRT] System.Numerics.Tensors.Tests.NFloatGenericTensorPrimitives.SpanScalarDestination_InPlace(tensorPrimitivesMethod: SpanScalarDestinationDelegate`3 { Method = Void Min[NFloat](System.ReadOnlySpan`1[System.Runtime.InteropServices.NFloat], System.Runtime.InteropServices.NFloat, System.Span`1[System.Runtime.InteropServices.NFloat]), Target = null }, expectedMethod: Func`3 { Method = System.Runtime.InteropServices.NFloat Min(System.Runtime.InteropServices.NFloat, System.Runtime.InteropServices.NFloat), Target = null })
[03:57:19] info: Process 2804 didn't exit within 01:15:00 and will be killed

[03:57:19] info: Killing process tree of 2804...

ghost · 2024-02-26T19:12:16Z

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

Issue Details

After #97192 was merged we started seeing timeouts in System.Numerics.Tensors.Tests on Mono across all platforms, but seemingly only when using the JIT. Interpreter runs pass fine.

There are a large number of long-running tests like this, but disabling them one-by-one wasn't helpful since then just another test was hanging so I had to disable the whole test project

[Long Running Test] 'System.Numerics.Tensors.Tests.NFloatGenericTensorPrimitives.IndexOfMinMagnitude_ReturnsNegative1OnEmpty', Elapsed: 00:03:17
[Long Running Test] 'System.Numerics.Tensors.Tests.HalfGenericTensorPrimitives.IndexOfMax_ReturnsNegative1OnEmpty', Elapsed: 00:02:03

Author:	akoeplinger
Assignees:	-
Labels:	`disabled-test`, `area-Codegen-JIT-mono`, `area-Codegen-Interpreter-mono`
Milestone:	9.0.0

saitama951 · 2024-03-08T11:23:50Z

I had a look into the issue.

Here is a extracted test case

public abstract class func
{
        static void test()
        {
                var watch = System.Diagnostics.Stopwatch.StartNew();
                Console.WriteLine(TensorPrimitives.IndexOfMinMagnitude(ReadOnlySpan<int>.Empty));
                watch.Stop();
                Console.WriteLine("Time Elapsed:"+watch.ElapsedMilliseconds);

        }
        public static void Main(string []args)
        {
                 func.test();
        }
}

now when running this extracted test case in debug and release, I see the release mode takes 20x times than that of the debug mode.

now looking at the perf hotspots I see a hotspot in move_basic_block_to_end which traverses to the end of the list and that traverses through ~100k basic blocks and these basic blocks are being generated by inline_method

debugging further I can see :

IndexOfMinMaxCore generates around ~118454 basic blocks and in which System.Numerics.Tensors.TensorPrimitives:IndexOfFinalAggregate in total generates around ~70k basic blocks
for the three different variants of the vectors (~21k basic blocks for the 128 variant, ~23k basic blocks for 256 variant, ~28k basic blocks for 512 variant ).

Now in System.Numerics.Tensors.TensorPrimitives:IndexOfFinalAggregate(System.Runtime.Intrinsics.Vector128<int>,System.Runtime.Intrinsics.Vector128<int>) I see a big bump in basic blocks in TIndexOfOperator.Invoke around ~2152 basic blocks which is invoked 10 times in System.Numerics.Tensors.TensorPrimitives:IndexOfFinalAggregate which is roughly around ~21k basic blocks

Now in TIndexOfOperator.Invoke I have commented the number of basic blocks being generated for the each call

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public static void Invoke(ref Vector128<T> result, Vector128<T> current, ref Vector128<T> resultIndex, Vector128<T> currentIndex)
            {
                Vector128<T> resultMag = Vector128.Abs(result), currentMag = Vector128.Abs(current); // 144 basic blocks
                Vector128<T> useResult = Vector128.LessThan(resultMag, currentMag); // 137 basic blocks
                Vector128<T> equalMask = Vector128.Equals(resultMag, currentMag); // 137 basic blocks

                if (equalMask != Vector128<T>.Zero) //77 basic blocks
                {
                    Vector128<T> lessThanIndexMask = IndexLessThan(resultIndex, currentIndex); //556 basic blocks
                    if (typeof(T) == typeof(float) || typeof(T) == typeof(double))
                    {
                        // bool useResult = equal && ((IsNegative(result) == IsNegative(current)) ? (resultIndex < currentIndex) : IsNegative(result));
                        Vector128<T> resultNegative = IsNegative(result); // 417 basic blocks
                        Vector128<T> sameSign = Vector128.Equals(resultNegative.AsInt32(), IsNegative(current).AsInt32()).As<int, T>(); //417 + 137 basic blocks
                        useResult |= equalMask & ElementWiseSelect(sameSign, lessThanIndexMask, resultNegative); //36 blocks
                    }
                    else
                    {
                        useResult |= equalMask & lessThanIndexMask;
                    }
                }

                result = ElementWiseSelect(useResult, result, current); //36 blocks
                resultIndex = ElementWiseSelect(useResult, resultIndex, currentIndex); //36 blocks
            }

The library call System.Numerics.Tensors.TensorPrimitives:IsNegative and System.Numerics.Tensors.TensorPrimitives:IndexLessThan invokes Vector128.LessThan (~137 basic blocks) for 3 and 4 times respectively.

The Vector128.LessThan calls 2 * Vector64.LessThan and then in the end Scalar.LessThan and Scalar.get_AllBitsSetgets called which generates around ~120 basic blocks ((36+24)*2)

https://github.com/dotnet/runtime/blob/e7e3dee0a50e1780c889d63797089acc12e564da/src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector64.cs#L1438-#L1449

since all of these are dead code for s390x, Is there a way that this can be eliminated way before the other optimization passes?

Another concern is that
while inlining we can see intermediate basic blocks being generated and then being merged, although I don't see
cfg->num_bblocks being updated on-the-go (looks like this is being used as a counter to assign to cfg->block_num but also used for loop calculations in other optimizations like mono_optimize_branches) . I believe we should keep an extra variable to do the job as the counter.

cc: @uweigand @akoeplinger

akoeplinger · 2024-03-08T14:22:52Z

@vargaz would you mind helping with this issue?

vargaz · 2024-03-09T00:55:07Z

I can't reproduce this. This might have been fixed recently by this and related PRs:
#97096

vargaz · 2024-03-09T00:57:01Z

I can reproduce, it only seems to happen on platforms without SIMD.

* Move the if (cfg->opt & MONO_OPT_SIMD) checks into simd-intrinsics.c so the IsSupported checks can be intrinsified even if SIMD is disabled. * Implement mono_simd_unsupported_aggressive_inline_intrinsic_type () in the non-simd cases as well, add more types. Re: dotnet#97295.

vargaz · 2024-03-09T16:31:14Z

The PR will improve the situation somewhat (2s -> 0.7s).

* Move the if (cfg->opt & MONO_OPT_SIMD) checks into simd-intrinsics.c so the IsSupported checks can be intrinsified even if SIMD is disabled. * Implement mono_simd_unsupported_aggressive_inline_intrinsic_type () in the non-simd cases as well, add more types. Re: dotnet#97295.

saitama951 · 2024-03-11T06:10:26Z

@vargaz Thanks for the fix, this reduced the regression from 2k ms -> 1k ms . however, I see some vector methods still being inlined i.e

INLINE START 0x2aa31a22938 System.Numerics.Tensors.TensorPrimitives:IndexOfMinMaxCore<int, System.Numerics.Tensors.TensorPrimitives/IndexOfMinMagnitudeOperator`1<int>> (System.ReadOnlySpan`1<int>) -> System.Runtime.Intrinsics.Vector512:Create (int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16)
created temp 418 (R983) of type System.Runtime.Intrinsics.Vector512<System.Int16>
method to IR System.Runtime.Intrinsics.Vector512:Create (int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16,int16)

however I was thinking to disable inlining for vector methods completely for platforms which don't support it

@@ -4783,6 +4785,19 @@ inline_method (MonoCompile *cfg, MonoMethod *cmethod, MonoMethodSignature *fsig,
 
        if (!fsig)
                fsig = mono_method_signature_internal (cmethod);
+      if (!(cfg->opt & MONO_OPT_SIMD))
+       {
+               if (!strcmp (m_class_get_name_space (cmethod->klass), "System.Runtime.Intrinsics"))
+               {
+                       if (!strncmp(m_class_get_name (cmethod->klass), "Vector", 6)) {
+                               const char *vector_type = m_class_get_name (cmethod->klass) + 6;
+                       if (!strcmp(vector_type, "256`1") || !strcmp(vector_type, "512`1") || !strcmp(vector_type, "256") || !strcmp(vector_type, "512"))
+                               return 0;
+                       if (!(cfg->opt & MONO_OPT_SIMD) && (!strcmp (vector_type, "128`1") || !strcmp (vector_type, "128") || !strcmp (vector_type, "64`1") || !strcmp (vector_type, "64")))
+                               return 0;
+                       }
+               }
+       }
 
        if (cfg->verbose_level > 2)
                printf ("INLINE START %p %s -> %s\n", cmethod,  mono_method_full_name (cfg->method, TRUE), mono_method_full_name (cmethod, TRUE));

before this patch

dotnet bin/Release/net9.0/tensor.dll
-1
Time Elapsed:1028

after this patch

dotnet bin/Release/net9.0/tensor.dll
-1
Time Elapsed:151

) * Move the if (cfg->opt & MONO_OPT_SIMD) checks into simd-intrinsics.c so the IsSupported checks can be intrinsified even if SIMD is disabled. * Implement mono_simd_unsupported_aggressive_inline_intrinsic_type () in the non-simd cases as well, add more types. Re: #97295.

akoeplinger added the area-Codegen-JIT-mono label Jan 22, 2024

akoeplinger added this to the 9.0.0 milestone Jan 22, 2024

akoeplinger added the disabled-test The test is disabled in source code against the issue label Jan 22, 2024

akoeplinger mentioned this issue Jan 22, 2024

System.Numerics.Tensors.Tests fails with assertion in mini-ppc.c on ppc64le #97296

Closed

akoeplinger mentioned this issue Jan 22, 2024

Disable failing System.Numerics.Tensors tests #97301

Merged

akoeplinger added a commit to akoeplinger/runtime that referenced this issue Jan 22, 2024

Disable failing System.Numerics.Tensors tests

ec22f2d

dotnet#97297 dotnet#97295 dotnet#97296 Caused by dotnet#97192

stephentoub pushed a commit that referenced this issue Jan 22, 2024

Disable failing System.Numerics.Tensors tests (#97301)

ebd5cb6

#97297 #97295 #97296 Caused by #97192

stephentoub mentioned this issue Jan 22, 2024

System.Numerics.Tensor test failures on CoreCLR #97297

Closed

tmds pushed a commit to tmds/runtime that referenced this issue Jan 23, 2024

Disable failing System.Numerics.Tensors tests (dotnet#97301)

e579c91

dotnet#97297 dotnet#97295 dotnet#97296 Caused by dotnet#97192

pavelsavara mentioned this issue Jan 25, 2024

[wasm][mt] Fix System.Threading.Tasks.Parallel build #97505

Merged

lewing added the area-Codegen-Interpreter-mono label Feb 26, 2024

vargaz self-assigned this Mar 8, 2024

vargaz mentioned this issue Mar 9, 2024

[mono][jit] Optimize JIT time on platforms without SIMD support. #99485

Merged

vargaz removed their assignment Mar 22, 2024

steveisok modified the milestones: 9.0.0, Future Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

System.Numerics.Tensors.Tests cause long-running/hanging tests on Mono JIT (not interpreter) #97295

System.Numerics.Tensors.Tests cause long-running/hanging tests on Mono JIT (not interpreter) #97295

akoeplinger commented Jan 22, 2024

nealef commented Jan 22, 2024

uweigand commented Jan 22, 2024

giritrivedi commented Jan 22, 2024

pavelsavara commented Jan 25, 2024

ghost commented Feb 26, 2024

saitama951 commented Mar 8, 2024

akoeplinger commented Mar 8, 2024

vargaz commented Mar 9, 2024

vargaz commented Mar 9, 2024

vargaz commented Mar 9, 2024

saitama951 commented Mar 11, 2024

System.Numerics.Tensors.Tests cause long-running/hanging tests on Mono JIT (not interpreter) #97295

System.Numerics.Tensors.Tests cause long-running/hanging tests on Mono JIT (not interpreter) #97295

Comments

akoeplinger commented Jan 22, 2024

nealef commented Jan 22, 2024

uweigand commented Jan 22, 2024

giritrivedi commented Jan 22, 2024

pavelsavara commented Jan 25, 2024

ghost commented Feb 26, 2024

saitama951 commented Mar 8, 2024

akoeplinger commented Mar 8, 2024

vargaz commented Mar 9, 2024

vargaz commented Mar 9, 2024

vargaz commented Mar 9, 2024

saitama951 commented Mar 11, 2024