Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ARM] support new udot/sdot patterns #7800

Merged
merged 1 commit into from
Aug 24, 2023
Merged

[ARM] support new udot/sdot patterns #7800

merged 1 commit into from
Aug 24, 2023

Conversation

rootjalex
Copy link
Member

udot and sdot can still be used even if we need to interleave the arguments (and is faster than a string of smull/saddws). This PR adds those patterns + tests (and a few fly-by FindIntrinsics fixes).

@steven-johnson steven-johnson added the release_notes For changes that may warrant a note in README for official releases. label Aug 23, 2023
Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending green.

We should figure out how/where to document this; the work on apps/hannk showed that using the dotprod ops on arm could be a huge win, but they were tricky to generate (ie, the IR had to be just right) -- would be good to document what to do to get these.

Additionally: maybe we should add a dot_prod() intrinsic to IROperator.h, along with widening_mul and friends? Recognizing the patterns is highly desirable of course, but having something that means "always use the best ops specifically for this regardless of architecture" seems like it could be useful in pathological cases.

@rootjalex
Copy link
Member Author

@steven-johnson I completely agree - unfortunately, it’s often quite difficult to express exactly what pattern will make it through the simplifier and trigger these patterns (that’s why I’ve included so many pattern variants). @abadams and I have discussed having a strong normalization pass to ease the job of the pattern matcher (and the programmer aiming for the dot product instruction). I think this is something we need to do (but I don’t currently have bandwidth for).

With regards to a new intrinsic - the main difficulty there is that we need a variadic dot product, and there’s not good consistency across backends (i.e. ARM has 4-way matching signed dot product, x86 has 2-way mixed-sign dot product, HVX has many). I worry that an intrinsic like that might be too hard to handle across backends.

@steven-johnson
Copy link
Contributor

Yeah, I hear you, but telling people that you have to look at the generated assembly code to verify you got it right is an unreasonably onerous burden.

@rootjalex
Copy link
Member Author

That’s completely fair! I think the best solution is a powerful normalizer, but could be convinced about the intrinsic.

@abadams
Copy link
Member

abadams commented Aug 23, 2023

Dot product instructions reduce a vector horizontally, and our front-end language doesn't have vectors, so there's no intrinsic we could add that would hit it guaranteed in the way you want. We do have vectors in the scheduling language, so the way to get a dot product instruction for sure is by using atomic().vectorize(some_rvar).

The kind of dot product AJ is targetting here is an opportunistic instruction selection trick where you save a few instructions by interleaving four different widening multiply adds and using udot instead of running them separately. Whether or not it's a win is very very architecture and type dependent.

@rootjalex
Copy link
Member Author

Failures appear unrelated

@rootjalex rootjalex merged commit 678ea32 into main Aug 24, 2023
@rootjalex rootjalex deleted the rootjalex/arm-dot branch August 24, 2023 19:50
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_notes For changes that may warrant a note in README for official releases.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants