
Initial Draft for Vector SIMD Codegen Enhancements #268

Draft · wants to merge 30 commits into main

Conversation

anthonycanino

This PR describes two proposals meant to enhance SIMD codegen in RyuJIT:

  1. Enhance the capability of Vector<T> to serve as either a template to generate multiple SIMD ISA pathways or dynamically recompile its chosen SIMD ISA at runtime via performance guided optimization: https://github.com/anthonycanino/designs/blob/main/accepted/2022/enhance-vector-codegen.md

  2. Introduce additional functionality to Vector<T> through new abstractions (VectorMask) and 512-bit vectors: https://github.com/anthonycanino/designs/blob/main/accepted/2022/enable-512-vectors.md

Looking forward to feedback and discussion on the ideas.

@anthonycanino anthonycanino marked this pull request as draft July 7, 2022 22:59
@tannergooding
Member

CC. @JulieLeeMSFT, @jeffhandley

Also CC. @davidwrighton for CG2/R2R considerations
and @AndyAyersMS for PGO considerations

CC. @BruceForstall and @dakersnar who are part of the working group

@JulieLeeMSFT
Member

cc @dotnet/jit-contrib.


In this design document, we propose to extend `Vector<T>` to serve as a vessel for frictionless SIMD adoption, both internal to .NET libraries, and to external .NET developers. As a realization of this goal, we propose the following:

1. Upgrading `Vector<T>` to serve as a sufficiently powerful interface for writing both internal hardware accelerated libraries and external developer code.
Member

Does this approach rely on the JIT to work with the "template", or on Source Generators (which sound much, much easier to implement)?

Author

I have less experience with source generators as a concept, so when I wrote this, I envisioned the JIT taking care of it.

In the spirit of the document, would a source generator approach require the developer to "regenerate" their code as future ISAs become available/implemented, or can source generators be integrated tightly enough into the pipeline that it's not a burden for the developer to do so?

Member

"source generators" in the Roslyn sense are components that plug directly into the compiler; they're handed all the information the compiler has about the source in the project and are able to add code to the compilation unit. A key limitation is they can't replace code, only add code, at least today (and the expectation is that even if some replacement is enabled in the future, it'll be very constrained). One of the most common forms of this is writing a partial class or method, often with an attribute applied to it, and the generator then fills in the implementation. You can see this, for example, with the JsonSerialization generator that shipped in .NET 6, where a developer just writes a partial class attributed in a certain way, and the generator emits into it all of the logic for serializing the relevant types, or the new LibraryImport and RegexGenerators in .NET 7, where the developer writes a partial method and the generator fills in the implementation of that method.

Member

@tannergooding commented Jul 19, 2022

ComputeSharp by @Sergio0694 does something that is conceptually similar in that it takes C# code and generates code that can run on the GPU.

I had also prototyped a very basic source generator a while back which handles directive driven vectorization.

For vectorized code we need a software fallback for when vectorization isn't supported so it's feasible for a dev to do something like:

```C#
public static partial int Sum(ReadOnlySpan<int> values);

[Vectorize(nameof(Sum), ...)]
private static int Sum_SoftwareFallback(ReadOnlySpan<int> values)
{
    // ...
}
```

The generator can then process Sum_SoftwareFallback and provide an implementation of the public Sum method powered by Vector64/128/256/512<T> or Vector<T>.

It would be quite a bit of work to enable, but would have some benefits over typical "auto-vectorization" approaches. An analyzer that looks for potentially vectorizable code and suggests a refactoring into the above shape would also help point users to where vectorization is possible.
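As a rough sketch of what a generator might emit for the Sum shape above (hypothetical output, written against today's Vector<T> APIs only):

```C#
// Hypothetical generator output; assumes: using System.Numerics;
public static partial int Sum(ReadOnlySpan<int> values)
{
    if (Vector.IsHardwareAccelerated && values.Length >= Vector<int>.Count)
    {
        Vector<int> acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= values.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            acc += new Vector<int>(values.Slice(i));
        }
        int result = Vector.Sum(acc);
        // Scalar tail for the remaining < Vector<int>.Count elements.
        for (; i < values.Length; i++)
        {
            result += values[i];
        }
        return result;
    }
    return Sum_SoftwareFallback(values);
}
```

A real generator could instead emit explicit Vector128/256/512 paths guarded by IsSupported checks; the shape would be the same.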

Author

In this sense, we are really talking about source-level auto-vectorization as opposed to JIT-level template generation, correct?

Member

@tannergooding commented Jul 19, 2022

Source level directive-driven vectorization (rather than auto-vectorization). It could still be "template driven" (rather than relying on analyzing scalar code patterns) if we felt that was the best approach and there are multiple options/possibilities (particularly when viewed in combination with the IL trimmer/linker).

That is, in all cases (involving a source generator) it requires some level of "user opt-in" (such as via a Vectorize attribute the generator recognizes). Noting that we could provide some form of assistance here in recognizing vectorizable patterns and suggesting that the user add the Vectorize attribute.

Whether we then do the vectorization based on some Vector<T> template (including generating a scalar path) or recognize a scalar algorithm and convert it to Vector64/128/256/512<T> or Vector<T> code is something we could do either way.


#### 2. PGO Codegen from `Vector<T>`

We propose to introduce a `#[Vectorize]` attribute which instructs the JIT to dynamically profile a method to select an optimal length for `Vector<T>`. Returning to the example above, we add `#[Vectorize]` to the `NarrowUtf16ToAscii` method like so:
Member

Do we really need this attribute? I think it can be just an API, e.g.:

```C#
if (RuntimeHelpers.MostlyTrue(len > 128)) // JIT will insert a probe for the argument condition
{
}
else
{
    // this block will be eliminated by jit
}
```

and teach the JIT to probe the argument of MostlyTrue in tier 0 for PGO. We don't currently have infrastructure for that, but it shouldn't be hard to implement, and it opens opportunities for other optimizations (e.g. recognizing never-negative signed types, etc.)

Author

Agreed. I have commented below.

In order to perform profile guided optimization for methods annotated with `#[Vectorize]`, the JIT must detect which method parameters to sample and what thresholds should trigger recompilation based on those sample points.
##### Detecting Instrumentation Points and Thresholds

The presence of a `#[Vectorize]` attribute instructs the JIT to perform a dependence analysis upon first encountering the method to determine what method parameters to sample.
Member

A probing scheme that requires data flow analysis is going to be a challenge.
The current implementation of PGO instruments at Tier0 where there is no ability to do any sort of data flow.

As @EgorBo noted above it might be more feasible to do this sort of analysis either by hand at the source level or within Roslyn and express the results as intrinsic calls.

Author

Agreed. I have commented below.


##### Transitive Method Codegen

As both proposals allow `Vector<T>` to be specialized per-method, we cannot simply pass `Vector<T>` as an argument to helper methods as before. To address this issue, we propose an additional `[Vectorizeable]` attribute which allows the JIT to specialize the method per selected vector width if used in a `Vectorize.If` or `[Vectorize]`.
Member

This specialization would likely have to be driven by the VM -- the JIT has no way of producing multiple method bodies per method.

This begs the question of how we would reconcile this with the current requirement that all Vector<T> within a given runtime instance are the same size.

Member

I'd rather expect Vector<T> to be lowered to Vector128, Vector256, Vector512 by some IL processing/SG - it's much simpler to do. And yes, good point about variadic Vector<> - if it escapes the current method or is exposed as an argument then we can't do it.

Author

I think these are fair points --- how to reconcile a PGO-driven approach for Vector<T> with the current requirement that Vector<T> have a fixed size for the duration of a runtime instance is an open question.

@EgorBo part of the reason I came up with the #[Vectorize] attribute was for this very reason: if a method has #[Vectorize], its use of Vector<T> is of a Vector<T> whose size is selected by PGO. Without #[Vectorize], Vector<T> will codegen to the size determined by the runtime process, as it does today.

I think @EgorBo's suggestion of a runtime probe (RuntimeHelpers.MostlyTrue) as a way to avoid the need for a dataflow analysis for PGO and the #[Vectorize] attribute can probably co-exist. The former drives where the probes happen; the latter indicates that this Vector<T>'s size is selected per-method, not per-process.

Member

variadic Vector<> - if it escapes the current method or is exposed as an argument then we can't do it

Noting that VectorMask<T> will have similar considerations, as does SVE for Arm64, where the technical description disallows usage as a field and in various other scenarios.

Vector<T> is our current "best fit" for SVE on Arm64 and it would be good if we can continue using it there and here to solve the Vector128/256/512 considerations for x64 as well.

If we can make it so that dealing with leading/trailing elements is efficient, the exact size of the Vector isn't a huge concern, as we no longer lose vectorization on "small data" when the hardware register size increases.

We then just need to consider the scenario where the "largest register" isn't the "best choice" for a given scenario and where using a smaller vector would be better. We could potentially handle this with block/loop cloning, special handling in the JIT/tiered compilation process, or one of several other ways. The biggest "pit of failure" then becomes how Vector<T> behaves when encountered as a field, pointer, or when passed/returned between methods.

In particular, SVE limits it accordingly:

Because of their unknown size at compile time, SVE types must not be used:

  • to declare or define a static or thread-local storage variable
  • as the type of an array element
  • as the operand to a new expression
  • as the type of object deleted by a delete expression
  • as the argument to sizeof and _Alignof
  • with pointer arithmetic on pointers to SVE objects (this affects the +, -, ++, and -- operators)
  • as members of unions, structures and classes
  • in standard library containers like std::vector.

Naturally, due to back-compat Vector<T> violates most/all of these but some of them don't matter as much in the context of a JIT environment, only for an AOT environment. But, if we can resolve how to handle many of these such that the JIT can enable selecting Vector128/256/512<T> as the backing "per method" (with limitations), then we can likely use an analyzer and other functionality to help drive users toward success.
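As a sketch, the kinds of patterns such an analyzer would likely need to flag if Vector<T> became per-method sized (the analyzer and the exact rules are hypothetical):

```C#
using System.Numerics;

public class Accumulator
{
    // Problematic: a field's layout must be fixed at type load time, before
    // any per-method vector width could be chosen.
    private Vector<float> _state;

    // Problematic: the chosen width escapes through the signature, so the
    // caller and callee could disagree about the size of Vector<float>.
    public Vector<float> Step(Vector<float> input) => _state += input;
}
```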

```C#
for (int i = 0; i < s1.Length; i++)
{
    if (s1[i] < 0 && s2[i])
```

Was this supposed to be if (s1[i] < s2[i])?

Author

Yes, it was meant to be if (s1[i] < s2[i])


Where `VectorMask<T>` expresses that the number of elements the condition applies to is variable-length, and determined by the JIT at runtime (though it must be compatible with `Vector<T>`'s selected length).

Lastly, we propose to create a `VectorMask` using built-in C# boolean expressions by passing a lambda to a special `MaskExpr` API:

Is MaskExpr going to be actually useful? To me, it seems to be:

  1. Fairly hard to use: How do users figure out what is a valid argument for MaskExpr?
  2. Fragile: Is it going to work when some future version of C# changes how lambdas are emitted? Is it going to work for F#, which emits different IL for lambdas than C#?
  3. More limited than the ByVectorMask methods, since it seems it can't express e.g. the v1.LessThanByVectorMask(v2) case.

Author

When I wrote this, MaskExpr represented an idea to move the masking/conditional SIMD processing into an embedded DSL that might be easier to manipulate, particularly for those less used to lower-level SIMD processing. To your points:

  1. We can document this or refine it, so I don't see it as a barrier or hard to use.
  2. As it's written it depends on Lambdas but that isn't a strict requirement if there are alternative ideas.
  3. This isn't necessarily true; for example,

```C#
v1.MaskExpr(x => v2.MaskExpr(y => x < y))
```

could encode "a mask where each element of v1 is less than each element of v2".

Now, all that being said, I am not advocating strongly for this idea, but proposing it as a consideration for the developer from a language design standpoint. Ideally, I see this work as making lower-level SIMD optimization more attainable for more developers, and I feel these kinds of ideas are at least worth thinking on (hence my response to your post below).

Member

@tannergooding commented Jul 19, 2022

I think there are a few issues with MaskExpr related to how the JIT operates today and what optimizations it can enable.

I expect we'd have something "simpler" and easier to migrate to if we simply had a .AsMask() API which effectively creates a VectorMask<T> from the most significant bits of each element.

This would then be:

```C#
Vector256<int> vmask = Vector256.GreaterThan(v1, Vector256<int>.Zero) & Vector256.NotEquals(v1, Vector256.Create(5));
VectorMask256<int> mask = vmask.AsMask();
```

This would be fairly "natural" to translate over from:

```C#
uint mask = vmask.ExtractMostSignificantBits();
```

but would work for variable length vectors and would provide the mask as an abstraction rather than strictly as an int.

Member

It is somewhat unfortunate that ==, !=, <, <=, >, and >= return bool rather than Vector###<T>, and so we can't express it as (v1 > Vector256<int>.Zero) && (v1 != Vector256.Create(5)), but that was done to follow existing .NET guidelines/conventions and since overloading by return type isn't feasible.

Author

@anthonycanino commented Jul 19, 2022

We still expect to be able to compare masks directly as well, right? Or do you see that going away with the AsMask() API, i.e., would we allow something like...

```C#
VectorMask256<int> vmask = Vector256.GreaterThan(v1, Vector256<int>.Zero).AsMask();
VectorMask256<int> vmask2 = Vector256.NotEquals(v1, Vector256.Create(5)).AsMask();
VectorMask256<int> mask = vmask & vmask2;
```

Edit: I see your comment below now.

Member

@tannergooding commented Jul 19, 2022

We still expect to be able to compare masks directly as well right

Right, I believe everything you proposed for VectorMask will still be possible; it's only a difference in how you get the mask.

Doing:

```C#
Vector256<int> vmask = Vector256.GreaterThan(v1, Vector256<int>.Zero);
Vector256<int> vmask2 = Vector256.NotEquals(v1, Vector256.Create(5));
VectorMask256<int> mask = (vmask & vmask2).AsMask();
```

-or- doing:

```C#
VectorMask256<int> mask1 = Vector256.GreaterThan(v1, Vector256<int>.Zero).AsMask();
VectorMask256<int> mask2 = Vector256.NotEquals(v1, Vector256.Create(5)).AsMask();
VectorMask256<int> mask = mask1 & mask2;
```

should be basically identical, the only difference is when you create the VectorMask256 type. I'd expect them to be "equally performant" on AVX-512. I'd expect the former to be "more performant" on AVX2 and prior (assuming the JIT handled the operations as specified without other optimizations).


Logically, the lambda passed to `MaskExpr` selects which elements of `v1` to include in the `VectorMask`, and allows developers to program conditional SIMD logic with familiar boolean condition operations.

### Leading/Trailing Element Processing with `VectorMask<T>`

This is probably outside the scope here, but what I would like to see is a simple, uniform way to process a Span<T> of any length using Vector<T>. E.g.:

```C#
int SumVector(ReadOnlySpan<int> source)
{
    Vector<int> vresult = Vector<int>.Zero;

    foreach (Vector<int> slice in source.SliceAsVector())
    {
        vresult += slice;
    }

    return vresult.Sum();
}
```

The JIT would then do whatever it needs to make this code efficient (including e.g. templated codegen).

If this is not feasible, maybe one could get close to that using VectorMask<T>?

```C#
int SumVector(ReadOnlySpan<int> source)
{
    Vector<int> vresult = Vector<int>.Zero;

    foreach ((Vector<int> slice, VectorMask<int> sliceMask) in source.SliceAsMaskedVector())
    {
        vresult = Vector<int>.Add(vresult, slice, sliceMask);
    }

    return vresult.Sum();
}
```

Author

I don't disagree with these ideas, but each represents a slight "layer above" the current way the Vector API is used. I like the idea of allowing the JIT to perform more powerful vectorization from a more declarative programming construct --- the slice, sliceMask you propose is a nice idea --- but it seems this could be a next step if others agree.

Member

@tannergooding commented Jul 19, 2022

I agree that APIs providing "micro-kernels" for key functionality like Sum is a different layer and likely "out of scope" of the primitive building block considerations here.

You should feel free to open a proposal for such APIs as they are independent of the work expressed here. LINQ does provide some acceleration today but doesn't support Span<T> or ROSpan<T>.

Comment on lines +305 to +308
| `VectorMask<T> VectorMask<T>.And(VectorMask<T>, VectorMask<T>)` |
| `VectorMask<T> VectorMask<T>.Or(VectorMask<T>, VectorMask<T>)` |
| `VectorMask<T> VectorMask<T>.Not(VectorMask<T>, VectorMask<T>)` |
| `VectorMask<T> VectorMask<T>.Xor(VectorMask<T>, VectorMask<T>)` |
Member

These should likely be operators where the "friendly names" are provided for parity with Vector<T> and Vector64/128/256<T>

Author

Agreed.

Comment on lines +316 to +317
| `VectorMask<T> VectorMask<T>.FirstIndexOf(VectorMask<T> mask, bool val)` |
| `VectorMask<T> VectorMask<T>.SetElemntCond(VectorMask<T> mask, ulong pos, bool val)` |
Member

I wonder if rather than FirstIndexOf this could be simply LeadingZeroCount, TrailingZeroCount, and PopCount. While not necessarily as "intuitive" as a named API, it is much more extensible overall and allows the most common operations that are needed.

Author

Do you envision LeadingZeroCount here returning the actual leading zero count? Because I had in mind that FirstIndexOf would return an index into the VectorMask by type, e.g., this:

```C#
VectorMask<short> mask = ...;
int idx = VectorMask<short>.FirstIndexOf(mask, true);
```

vs

```C#
VectorMask<short> mask = ...;
int idx = VectorMask<short>.LeadingZeroCount(mask) / Unsafe.SizeOf<short>();
```

Member

I think it would be the LeadingZeroCount based on type.

That is, a VectorMask128<byte> would assume a 16-bit mask, so the lzcnt would be 0-15. A VectorMask128<int>, on the other hand, would assume a 4-bit mask, so the lzcnt would be 0-3.

If we instead always returned a 32-bit based lzcnt for VectorMask128 it would be easier to use with ExtractMostSignificantBits but less usable with the variable sized mask and with the abstract mask more generally.
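For reference, the equivalent "first matching element" idiom on today's fixed-size vectors uses ExtractMostSignificantBits plus a trailing (not leading) zero count, since element 0 maps to the least significant mask bit:

```C#
using System.Numerics;
using System.Runtime.Intrinsics;

static int FirstIndexOf(Vector128<short> left, Vector128<short> right)
{
    // One bit per element, element 0 in the least significant bit.
    uint bits = Vector128.Equals(left, right).ExtractMostSignificantBits();
    return bits == 0 ? -1 : BitOperations.TrailingZeroCount(bits);
}
```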

Author

Ok, so we are roughly in agreement on the behavior; instead of checking the first index of a true/false condition, we work with the leading zero count per type.

I think that's pretty reasonable.


| Method |
| ------ |
| `VectorMask<T> Vector<T>.EqualsByVectorMask(Vector<T> v1, Vector<T> v2)` |
Member

If we went with the AsMask() proposal from above, then these just become Vector<T>.Equals(x, y).AsMask(), which, while slightly more verbose, doesn't require an "explosion" of new APIs and can be more easily integrated into existing code relying on ExtractMostSignificantBits and MoveMask.

Author

From an API standpoint I think this is fine and intuitive, so long as the JIT can optimize Vector<T>.Equals(x, y).AsMask() into something that takes advantage of the most performant masking features available, e.g., lowering it to the AVX512 mask-register compares (vpcmpX) when available.

Member

Yep, definitely. We also do and rely on this for a couple of similar cases like ToScalar today, so I don't think it will be particularly problematic.

| `VectorMask<T> Vector<T>.LoadTrailing(Span<T> v1, VectorMask<T> mask)` |
| `VectorMask<T> Vector<T>.StoreTrailing(Span<T> v1, VectorMask<T> mask)` |

### Internals Upgrades for EVEX/AVX512 Enabling
Member

This work should be coordinated with @JulieLeeMSFT. I would expect that Microsoft is providing code review and answering JIT design questions here. It would be great to confirm if the actual implementation is expected to be a collaborative effort or primarily driven by Intel.

Member

@kunalspathak is the most likely person to loop in for the register support.

There are a few people that semi-regularly work on the emitter including myself, @kunalspathak, and @EgorBo.

Someone needs to be looped in for the debugger work.

The VM work is expected to be small and likely doesn't need anyone dedicated, the exception being extending CG2 to support tracking the additional ISA flags (@davidwrighton).

Member

How AVX512-VL is tracked by the ISA flags might be of particular interest, since it's a single CPUID flag but is somewhat like the x64 flag in that it impacts multiple ISAs.

anthonycanino and others added 2 commits July 19, 2022 11:53
Please take a moment and add a bullet point list of teams and individuals you
think should be involved in the design process and ensure they are involved
(which might mean being tagged on GitHub issues, invited to meetings, or sent
early drafts).

Someone from Arm should look at this. Support for Arm SVE would require a Vector like interface, so it'd be good to ensure the design is suitable if SVE support gets implemented. I suspect most concerns will be around the mask implementation, and I see that Tanner has some comments below too. I'll see who we've got available to take a look....

@sparker-arm

Hi,

Arm person here... and I apologize now for really not knowing anything about .net, so please pardon my ignorance!

I really like the sound of focusing on the generic vector API; it seems most architectures are at least considering old-school vectors...

I've been having a look through the existing Vector APIs and it looks like Vector<T> is only partially implemented for hardware acceleration, or am I misunderstanding? If I've not misunderstood, is the long-term plan to migrate the fixed APIs into the generic one and leave this as the only public API? I'm assuming, if the idea is to program AVX512 using the generic one, it would be advantageous to have fallback paths to more narrow extensions, before scalarizing. And are there plans for implementing 'weird' stuff like shuffle, permute and gather/scatters, etc... for Vector<T>? (Sorry if I've missed them)

I'm not sure I follow why you'd need all of the *Leading and *Trailing operations; would three not suffice? Something like MaskedLoad, MaskedStore and GenerateRemainderMask? And, if the Vector<T> API is being extended to add a mask argument, are there not existing API calls which could be used for masked load/store operations?

Continuing on the theme of predication, what is the defined behavior for 'false' lanes; can it be programmed? I'm thinking in the context of SVE, where predication can enable zeroing or merging lanes. Does AVX512 enable something similar?

@sparker-arm

So, what happens if the compiler decides that sometimes we should use AVX-512 for some operations, but AVX2 for others? Is there enough type information to enable this with Vector<T>? Java has a stronger type system for their vectors; would there be any blockers for why it couldn't be done here?

I also have concerns that this approach will not be performant for SVE in an AOT setting, as everything depends on eventually being able to have a compile-time fixed-size vector. It's possible to set the SVE width at runtime, but I understand this is only possible at the kernel level, and we can't depend on all platforms providing hooks to do this. The other option would be to use predication to fix the width in the generated code, but this is likely to produce slower code.

Flexible (sizeless) vectors could be coming to WebAssembly and we've (hopefully) made the necessary changes in cranelift to enable this for both AVX and SVE. That approach is really based on the idea of using WebAssembly's 128-bit vectors with the addition of a scaling factor to produce a target-defined vector type. Is there any way that dotnet could handle a notion of a scaling factor, that could be compile- or runtime-defined?

@tannergooding
Member

tannergooding commented Oct 6, 2022

I've been having a look through the existing Vector APIs and it looks like Vector is only partially implemented for hardware acceleration, or am I misunderstanding? If I've not misunderstood, is the long-term plan to migrate the fixed APIs into the generic one and leave this as the only public API? I'm assuming, if the idea is to program AVX512 using the generic one, it would be advantageous to have fallback paths to more narrow extensions, before scalarizing. And are there plans for implementing 'weird' stuff like shuffle, permute and gather/scatters, etc... for Vector? (Sorry if I've missed them)

Vector<T> is an agnostic API that provides a software fallback for when acceleration isn't available. When Vector.IsHardwareAccelerated reports true then the majority of functions are expected to be accelerated and sufficiently (but maybe not "optimally") performant. There may still be cases, such as integer division, where the implementation has to fallback to a purely software-based approach.

I'm not sure I follow why you'd need all off the *Leading and *Trailing operations, would three not suffice? Something like, MaskedLoad, MaskedStore and GenerateRemainderMask? And, if the Vector API is being extended add a mask argument, are there not existing API calls which could be used for masked load/store operations?

I'd typically expect a MaskedLoad/MaskedStore to read a full vector from the given address and then mask off the upper bits. There can be complications and extra expense that arise from this approach including in dealing with whether the masked bytes should cause an AccessViolation if they cross a page boundary and the masked bytes aren't available for reading.

On the other hand, a Leading or Trailing API can take advantage of the data alignment and count to more trivially do something like "backtrack Size - remainder bytes", "load/store", "shift/shuffle data to the correct element position". This ends up making it easier to deal with the explicit concept of leading or trailing element processing in a way that allows the compiler to make it more efficient.
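A concrete sketch of the backtracking idiom such a Trailing API could hide, written with today's fixed-per-process Vector<T> (note the overlap trick requires an idempotent operation, as Max is here):

```C#
using System;
using System.Numerics;

static void ReluInPlace(Span<float> data)
{
    if (Vector.IsHardwareAccelerated && data.Length >= Vector<float>.Count)
    {
        int i = 0;
        for (; i <= data.Length - Vector<float>.Count; i += Vector<float>.Count)
            Vector.Max(new Vector<float>(data.Slice(i)), Vector<float>.Zero).CopyTo(data.Slice(i));

        // Trailing elements: backtrack so the final load/store covers the last
        // full vector. Overlapped elements are processed twice, which is safe
        // only because Max(x, 0) is idempotent.
        i = data.Length - Vector<float>.Count;
        Vector.Max(new Vector<float>(data.Slice(i)), Vector<float>.Zero).CopyTo(data.Slice(i));
    }
    else
    {
        for (int i = 0; i < data.Length; i++)
            data[i] = Math.Max(data[i], 0f);
    }
}
```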

Continuing on the theme of predication, what is the defined behavior for 'false' lanes, can it be programmed? I'm thinking in the context of SVE, where predication can enable zeroing or merging lanes. Does AVX512 enable something similar?

Yes. AVX512 has a concept of both "merge-masking" and "zero-masking". We would likely have functionality exposed in the generic APIs to enable the same.

So, what happens if the compiler decides that sometimes we should use AVX-512 for some operations, but AVX2 for others? Is there enough type information to enable this with Vector?

This would likely end up PGO or data driven in some other fashion. It would default to "Max" (or "ReasonableDefault") otherwise. The types by themselves could never provide enough data.

Java has a stronger type system for their vectors, would there be any blockers for why it couldn't be done here?

While we could enable something like VectorShape, we've found that such functionality makes the UX overall more complicated as compared to concrete types.

Users are able to trivially query for type support and hardware acceleration, and have it treated as a JIT-time constant, so that they can detect and choose the best fit explicitly, whether that be a specific sized vector or an agnostic vector that "grows" to the maximum capabilities of the platform.

I also have concerns that this approach will not be performant for SVE in an AOT setting, as everything depends on being able eventually have a compile-time fixed size vector. It's possible to set the width SVE at runtime, but I understand this is only possible at the kernel level, and we can't depend on all platforms providing hooks to do this. The other option would be to use predication to fix the width in the generated code, but this is likely to produce slower code.

Yes. AOT in general is a problematic consideration for such code. In C/C++, SVE puts restrictions on how the underlying "vector" type can be used. For example, it disallows its usage as a field, in sizeof expressions, and a few other scenarios. Vector<T> is a pre-existing type where devs have already used it in such scenarios, which potentially complicates a few things; we will need to consider how best to ensure that can be supported.

Flexible (sizeless) vectors could be coming to WebAssembly and we've (hopefully) made the necessary changes in cranelift to enable this for both AVX and SVE. That approach is really based on the idea of using WebAssembly's 128-bit vectors with the addition of a scaling factor to produce a target-defined vector type. Is there any way that dotnet could handle a notion of a scaling factor, that could be compile- or runtime-defined?

Simply scaling up n * V128<T> ops at a time sounds somewhat similar to how Vector256<T> is now implemented as 2x Vector128<T> ops and how Vector512<T> will be implemented as 2x Vector256<T> ops (and therefore 4x V128<T> ops). However, this comes with many considerations around usability and perf, especially where pipelining comes into play and where issuing too many of the same operation can limit CPU throughput or even hurt perf long term.

I think source generators or an approach similar to generic math with an interface + generic specialization would be better for providing performant and "size agnostic" implementations for a given platform.

@sparker-arm

Fair enough wrt masking vs leading/trailing, thanks for the clarification.

This would likely end up PGO or data driven in some other fashion. It would default to "Max" (or "ReasonableDefault") otherwise. The types by themselves could never provide enough data.

So, IIUC, that means that any path in the API explicitly needs to handle and treat any Vector<T> as a single specific type? For instance, when AVX-512 is supported but an operation isn't supported natively, then the fallback path still needs to treat Vector<T> as a 512-bit vector?

Users are able to trivially query for type support, hardware acceleration, and have it treated as a JIT time constant so that they can detect and choose the based fit explicitly

Related to my query above, it sounds like relying on user-defined logic would be error prone, whereas using stronger types should allow the compiler to detect bugs.

Simply scaling up a n * V128 ops at a time sounds somewhat similar to how Vector256 is now implemented

Just to be clear, the approach wasn't about cracking larger vectors into chunks, as that somewhat defeats the point; instead it allows a target to communicate its register width (just with the assumption that it would be a multiple of 128 bits).

@tannergooding
Member

tannergooding commented Oct 7, 2022

For instance, when AVX-512 is supported but an operation isn't supported natively, then the fallback path still needs to treat Vector as a 512-bit vector?

Correct. Vector<T> currently has a "fixed-size" for the lifetime of the process. It's possible we could change it to be "fixed-sized" per method instead, but that would likely require significantly more work and would ultimately be dependent on something like PGO to help determine what the "correct" size is.

Related to my query above, it sounds like relying on user-defined logic would be error prone, whereas using stronger types should allow the compiler to detect bugs.

Could you clarify on what you mean by "stronger" types here? There are multiple ways this could be interpreted and I'd like to ensure we're on the same page.

Some examples on what you believe might be error prone would be good as well.

Related to my query above, it sounds like relying on user-defined logic would be error prone, whereas using stronger types should allow the compiler to detect bugs.

How is this different from Vector<T>.Count, which communicates how many T are processed by the vector? If you need to know the exact width of the vector, you can simply compute Vector<T>.Count * sizeof(T) * 8 to get the bit width.
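For example, with today's APIs:

```C#
// 8 ints per Vector<int> on AVX2 hardware gives 8 * 4 * 8 = 256 bits.
int bitWidth = Vector<int>.Count * sizeof(int) * 8;
```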

@sparker-arm

Could you clarify on what you mean by "stronger" types here?

How is this different from Vector<T>.Count which communicates how many T are processed by the vector?

I mean enough information so that a compiler would pick up any type mismatches, and I'm mainly thinking in the context of using different vector sizes.

Is a user able to use Vector<T> along with fixed vectors? If so, what are the mechanisms to help prevent them from doing so incorrectly? Does the compiler use Vector<T>.Count to reason a relationship between it and a fixed type?

Narrowing/widening operations are maybe a good example of where not having a single size for Vector<T> could be useful. From what I've seen, Vector.Narrow takes two operands to combine into a full-width result and it isn't possible to support a single vector input.

To help my understanding, how would this currently get vectorized with Vector<T>? Sorry for the C...

```C
void mul(short *a, short *b, int *c, int N) {
  for (int i = 0; i < N; ++i) {
    c[i] = (int)a[i] * (int)b[i];
  }
}
```
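One plausible mapping onto today's Vector<T> surface uses Vector.Widen, which splits each Vector<short> into two widened Vector<int> halves (a sketch, assuming equal-length inputs):

```C#
using System.Numerics;

static void Mul(ReadOnlySpan<short> a, ReadOnlySpan<short> b, Span<int> c)
{
    int i = 0;
    for (; i <= a.Length - Vector<short>.Count; i += Vector<short>.Count)
    {
        // One Vector<short> widens into two Vector<int> halves.
        Vector.Widen(new Vector<short>(a.Slice(i)), out Vector<int> aLo, out Vector<int> aHi);
        Vector.Widen(new Vector<short>(b.Slice(i)), out Vector<int> bLo, out Vector<int> bHi);
        (aLo * bLo).CopyTo(c.Slice(i));
        (aHi * bHi).CopyTo(c.Slice(i + Vector<int>.Count));
    }
    // Scalar tail.
    for (; i < a.Length; i++)
        c[i] = a[i] * b[i];
}
```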

@TamarChristinaArm

I'd typically expect a MaskedLoad/MaskedStore to read a full vector from the given address and then mask off the upper bits. There can be complications and extra expense that arise from this approach, including in dealing with whether the masked bytes should cause an AccessViolation if they cross a page boundary and the masked bytes aren't available for reading.

This is not the case for SVE though. In SVE the direction is reversed: you are given a mask to use by the control-flow managing instructions such as whilelo. From this, predicated loads only load where lanes are active.

The only time you have any cross-page issues is when you have an unknown number of iterations, e.g.

```C#
while (true)
{
    ...
}
```

But for that we have first faulting loads.

On the other hand, a Leading or Trailing API can take advantage of the data alignment and count to more trivially do something like "backtrack Size - remainder bytes", "load/store", "shift/shuffle data to the correct element position". This ends up making it easier to deal with the explicit concept of leading or trailing element processing in a way that allows the compiler to make it more efficient.

The problem is that for SVE none of this is needed at all. At an instruction level, something like whilelo already does the trailing loop calculations and also produces the mask. https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-

That's why I think the comment made at #268 (comment) is actually quite a salient one.

I agree that this PR is mostly about building blocks, but I'd hope that no actual code is written directly against these blocks, only against a higher-level API. In particular, manually managing trailing loops will always be suboptimal for SVE.

and in that sense

```C#
int SumVector(ReadOnlySpan<int> source)
{
    Vector<int> vresult = Vector<int>.Zero;

    foreach (Vector<int> slice in source.SliceAsVector())
    {
        vresult += slice;
    }

    return vresult.Sum();
}
```

is a better API for both SVE and AVX I think. The expansion of the foreach can allow for more efficient codegen for both.

That said I personally hope more for something like this

```C#
unsafe int SumVector(int* source, int n)
{
    Vector<int> vresult = Vector<int>.Zero;

    foreach (var iter in new VectorSource(source, n))
    {
        Vector<int> val = Vector<int>.Load(source, iter);
        vresult.Add(val, iter);
    }

    return vresult.Sum();
}
```

where iter will have methods such as .GetCurrentMask().

@tannergooding
Member

We'll ultimately need something that works reasonably well across a range of hardware. That includes x64, Arm64, Wasm, and future platforms. Such support needs to account for both older hardware (pre SVE, pre AVX512) and modern hardware (SVE+, AVX512+). There will likely be a number of gaps, on all platforms, where the JIT needs to be smarter to generate better code and understand where operations can be simplified or alternatively represented.

For the case where developers want the utmost fine-grained control, we'll have the platform specific APIs available so that developers get raw access and can fully take advantage of the functionality available.

is a better API for both SVE and AVX I think.

In the sense of a single "micro-kernel", yes. Once you start getting into more complex algorithms, it ends up worse off.

Consider an algorithm that needs to Lerp and Sum. If you have 1 Lerp and 1 Sum API, then you get code that is generally "more efficient" while still being extremely suboptimal, particularly if you need to store intermediate results, as you start needing to access and walk n times as much memory.

Effectively any logic that ends up outside the "simple operation" path is in a similar boat, where you ultimately need/want some customized logic to better account for the sequence of operations you need.

That said I personally hope more for something like this

This ends up being "more expensive" for the "core" body of the loop because it will constantly have to check if Load is going to run past the end. The JIT could be smarter here and rewrite the loop, but that can also get quite expensive/complex.

It is likely simpler and cheaper to just have the main loop where Load reads a "full vector" and then a LoadTrailing which takes in a mask and therefore could take in the "predicate" on the SVE side to indicate how much data is left to be read. This also works well on AVX-512 where a similar mask exists and on older hardware where the mask can be used with 1-2 extra instructions to efficiently read the right amount of data without violating access boundaries.

@TamarChristinaArm

We'll ultimately need something that works reasonably well across a range of hardware. That includes x64, Arm64, Wasm, and future platforms. Such support needs to account for both older hardware (pre SVE, pre AVX512) and modern hardware (SVE+, AVX512+).

Agreed, so the expectation is that users will still write ISA-specific loops? The current Vector128 generic functions work well enough on all ISAs, yes. I'm however not entirely convinced that "well enough" can't mean the same thing on fully masked ISAs and non- or partially-masked ones.

This is particularly if you need to store intermediate results as you start needing to access and walk n times as much memory.

I'm not sure I follow you here... The only thing I was suggesting is that all APIs should have a fully masked version, or a way for the JIT to recognize the loop mask. With an iterator the loop's governing mask is clear, so at least you can mask the operations appropriately. With an explicit counter you have to pass a true predicate to every SVE call. We don't, for instance, have unmasked loads.

This ends up being "more expensive" for the "core" body of the loop because it will constantly have to check if Load is going to pass the ends. The JIT could be smarter here and rewrite the loop, but that can also get quite expensive/complex.

I'm guessing you mean here for non-fully-masked ISAs. But I don't follow why? I'm guessing this has to do with how IEnumerator is lowered? I would have expected to be able to transform the iterator into a naïve counted loop for ISAs that don't fully support predication; that's also why the example gave the number of elements. I'm guessing you can't do this because the semantics of the iterator have to be preserved in case the iterator escapes the loop? (genuine question)

In which case, you can do the same without the iterator by instead using a custom class with a while loop?

I'd have expected the same number of comparisons as you would normally have for the same for loop. I was also expecting that for these ISAs the JIT could simply peel the loop for the "remainder". But perhaps that's not something that can be done?
