Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VectorX<T>.ConditionalSelect doesn’t get optimized for const masks on non-AVX512 platforms #104001

Closed
ezhevita opened this issue Jun 25, 2024 · 4 comments · Fixed by #104092
Closed
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@ezhevita
Copy link
Contributor

ezhevita commented Jun 25, 2024

Description

I was comparing VectorT.ConditionalSelect (128-bit in particular) and Sse41.BlendVariable disassembly output and found out that the first method gets only optimized into the second one only if the mask is the result of Compare* intrinsic. However we can actually optimize it if the mask is constant (e.g. Vector.Create) since we can check its contents in the JIT during compilation. Here is the reproduction repo. I’ve also implemented an optimization in JIT here (worth mentioning that I’ve currently haven’t optimized VectorT.ConditionalSelect(Vector.Create(fieldOrVariable)) since I didn’t find a way to inspect the values of field/variable, GT_LCL_VAR and GT_LCL_FLD in the JIT tree).

Configuration

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X3D, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.5.24307.3
  [Host]     : .NET 9.0.0 (9.0.24.30607), X64 RyuJIT AVX2
  DefaultJob : .NET 9.0.0 (9.0.24.30607), X64 RyuJIT AVX2

Regression?

Not a regression.

Data

Reproduction, benchmarks, and disassembly: https://github.com/ezhevita/ConditionalSelectReproduce

Analysis

The current implementation only checks for Compare* intrinsics however we can actually check and prove that the mask is indeed per-element if the vector is constant.
Also another solution might be a new method in the API for the consumers which restricts the vector to be per-element mask.

@ezhevita ezhevita added the tenet-performance Performance related issue label Jun 25, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 25, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jun 25, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@EgorBo
Copy link
Member

EgorBo commented Jun 26, 2024

Thanks for the idea! I guess we can introduce a general helper to check for SIMD node to be either zeroes or allbits set in its components.

I didn’t find a way to inspect the values of field/variable, GT_LCL_VAR and GT_LCL_FLD in the JIT tree).

Lowering is a bit too conservative place to perform such transformation, it's better to do in e.g. optVNBasedFoldExpr where we can easily look through locals.

PS: similar optimizations can be done around other types of intrinsics, the best example is Vector128.ExtractMostSignificantBits on arm64

Are you interested in contributing it?

@EgorBo EgorBo added this to the Future milestone Jun 26, 2024
@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Jun 26, 2024
@ezhevita
Copy link
Contributor Author

Are you interested in contributing it?

Yes, absolutely! I just would need some more time since it’s the first time I’m working on JIT and I still don’t get a lot of the internals of it

@ezhevita
Copy link
Contributor Author

Update: I’ve realized that we can’t optimize Vector128.ConditionalSelect(Vector128.Create(_field), …) into a const since the array is mutable anyway and can be changed at any point of execution; however we can optimize storing the vector itself, example:

public class TestClass
{
	private static readonly Vector128<byte> _maskVector = Vector128.Create<byte>([255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0, 255, 0]);

	public Vector128<byte> Test()
	{
		return Vector128.ConditionalSelect(_maskVector, Vector128<byte>.One, Vector128<byte>.Zero);
	}
}

@ezhevita ezhevita changed the title VectorT.ConditionalSelect doesn’t get optimized for const masks on non-AVX512 platforms VectorX<T>.ConditionalSelect doesn’t get optimized for const masks on non-AVX512 platforms Jul 1, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants