forked from wkpark/triton
Merge to main from newest openai/triton.(2024/4/11) #5
Open

HighCWu wants to merge 377 commits into main from windows-build-sync-openai
Conversation
This PR updates SharedToDotOperandMFMA.cpp and MFMA.cpp. - SharedToDotOperandMFMA.cpp is up to date with triton-mlir as of today, which includes changes until ROCm#482 - Fixed issue with opaque pointers - Fixed API for `getMFMAElemsPerInstrForOperands` and `getMFMARepForOperands` - MFMA.cpp is synced with triton-mlir@6bb04d, which includes changes until ROCm#469 Note to @binarman: changes in other files from ROCm#469 are not included in this PR. We can bring up the support for mfma 64x4 and 4x64 later.
-Register required dialect for triton-opt CMake target -Add ROCDL dialect as a dependency for ConvertTritonAMDGPUToLLVM pass Signed-off-by: joviliast <iveselov.nn@gmail.com>
…t more than one numScanBlocks along scan axis (triton-lang#3017) (triton-lang#3283)
Recently we had triton-lang#3264 and triton-lang#3263 which left only a very small difference in the TypeConverter.cpp in the AMD backend compared to the TritonGPUToLLVM lib. This check is now safe to do in the generic location so let's clean this up a bit and remove the TypeConverter.cpp from the AMD backend. `LoadStoreOpToLLVM.cpp` and `ConvertLayoutOpToLLVM.cpp` in the AMD backend need some updates as well but that can come with a follow-on PR.
-Add WMMA conversion logic for dot operation -Fix helper functions for WMMA layout -Add lit test for WMMA dot operation conversion Signed-off-by: joviliast <iveselov.nn@gmail.com>
…triton-lang#3284) This reverts commit 9cfad37. There has been an inflight collision with some refactoring causing a build break
…#3261) The existing shared memory representation in TTGIR had semantic problems: it mixed value semantics with memory semantics. To solve that, this PR moves the shared memory representation to memory semantics only. Shared memory is now represented not as tensors but as allocations that may or may not be mutable, and a tensor can no longer have a shared encoding. Convert_layout #distributed -> #shared becomes triton_gpu.local_alloc %init tensor<#distributed> -> !tt.memdesc<#shared>; convert_layout #shared -> #distributed becomes triton_gpu.local_load %mem !tt.memdesc<#shared> -> tensor<#distributed>; insert_slice_async becomes async_copy_global_to_local.
…riton-lang#3268) When sizePerThread is larger than the tensor's shape, only the unique elements are stored in smemShape during accumulatePartialReductions. So when reading the reduced values back out, we need to take the index modulo srcShape/smemShape to obtain the replicated elements for the oversized positions.
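As a rough illustration of the read-back described above (the names and shapes are hypothetical, not the backend's actual variables), a minimal Python sketch:

```
# Minimal sketch, assuming a 1-D reduction: when sizePerThread exceeds the
# tensor shape, only the unique (already-reduced) elements live in "shared
# memory", so reads for the oversized positions wrap around with a modulo.
def read_reduced(smem, smem_shape, idx):
    # idx may exceed smem_shape; wrapping returns the replicated element.
    return smem[idx % smem_shape]

reduced = [10, 20, 30]              # 3 unique reduced values
print(read_reduced(reduced, 3, 5))  # -> 30: the replicated element for idx 5
```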
…n-lang#3278) Changes in this PR: 1. Deleted `decomposeMmaToDotOperand` and `decomposeInsertSliceAsyncOp ` from AMD backend 2. Moved `decomposeMixedModeDotOp` to `AccelerateAMDMatmul.cpp` 3. Created a new pass in `DecomposeUnsupportedConversions.cpp` to include: `decomposeFp8e4b15Convert`, `decomposeSplatToSharedLayout`, `decomposeMfmaToDotOperand`, `reduceCvtOpLDSUsage`, `decomposeBlockedToDotOperand`, and `promoteReduceOps`. 4. Deleted workaround in `initSharedMemory` to annotate `triton_gpu.shared`.
Resolves the TODO left in `matchReduxKind` from triton-lang#3263, which broke reductions on GPUs without Redux operations.
My previous commit broke lowering on the AMD target due to a missing pattern and missing handling of the stream pipeliner.
…ton-lang#3295) `StandardInstrumentations::registerCallbacks` reads the llvm registered options, so in order for it to enable printing, the `print-after-all` needs to be already set when it is being invoked. This fixes the lack of IR dumps for optimization passes.
- Remove nv-specific code from `ConvertLayoutOpToLLVM.cpp` - Remove `storeDistributedToShared` and related function from `TritonGPUToLLVMBase.h` Later, we can further move `loadSharedToDistributed` from both amd and nv path to the common lib. Note: for now amd still needs `DecomposeConversions` pass as one of the transforms at ttgir level, since some optimization will reorder local_alloc and local_load for both opA and opB in a different way.
-Add WMMA conversion logic for dot operation -Fix helper functions for WMMA layout -Add lit test for WMMA dot operation conversion See also: 10e1560 Signed-off-by: joviliast <iveselov.nn@gmail.com>
It's quite easy to unexpectedly use a match from a different CHECK-LABEL in a lit test, which is likely not what the author intended. Prevent this from happening by invoking FileCheck with variable scopes enabled (and fix the failures detected this way). This also makes the file check use in Triton consistent with upstream MLIR.
…n-lang#3290) Add an environment variable that allows us to use ieee f32 by default. This makes it simpler to check for precision problems.
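A minimal usage sketch; the variable name used below (TRITON_F32_DEFAULT) is an assumption, since the commit message does not spell it out, so check the diff for the exact name:

```
import os

# Assumed variable name: setting it to "ieee" before compiling kernels makes
# fp32 dots default to IEEE semantics instead of tf32, which makes precision
# problems easier to isolate.
os.environ["TRITON_F32_DEFAULT"] = "ieee"

import triton  # kernels compiled after this point pick up the ieee default
```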
…riton-lang#3302) triton-lang#2934 added `RedisRemoteCacheBackend` but did not add the symbol to `__init__.py`. Let's also add it there so that we can refer to it via the full path instead of having to `from ... import RedisRemoteCacheBackend`.
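A sketch of what the re-export enables; the exact module path under `triton.runtime` is an assumption based on the upstream layout, not something this commit spells out:

```
# Before: the symbol had to be imported explicitly from its defining module.
from triton.runtime.cache import RedisRemoteCacheBackend  # noqa: F401

# After the re-export from __init__.py, the fully qualified path can be used
# directly (the exact attribute location is an assumption):
import triton
backend_cls = triton.runtime.cache.RedisRemoteCacheBackend
```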
When doing a convert from mma -> blocked with a transpose, using a vector size for the load back is more efficient.
…iton-lang#3091) This commit fixes a failure in python/tutorials/03-matrix-multiplication.py for FMA cases and also fixes mixed dot for FMA cases. Tested on Navi31 --------- Signed-off-by: joviliast <iveselov.nn@gmail.com>
…iton-lang#3311) This reverts commit 375fee0. Reverting this PR as it causes errors in our internal tests. It reaches this error in `python3.11/site-packages/setuptools/_distutils/core.py`:

```
def run_commands(dist):
    """Given a Distribution object run all the commands,
    raising ``SystemExit`` errors in the case of failure.

    This function assumes that either ``sys.argv`` or ``dist.script_args``
    is already set accordingly.
    """
    try:
        dist.run_commands()
    except KeyboardInterrupt:
        raise SystemExit("interrupted")
    except OSError as exc:
        if DEBUG:
            sys.stderr.write("error: {}\n".format(exc))
            raise
        else:
            raise SystemExit("error: {}".format(exc))
    except (DistutilsError, CCompilerError) as msg:
        if DEBUG:
            raise
        else:
            raise SystemExit("error: " + str(msg))
    return dist
```
The user_begin() function cannot guarantee a definite order, so when convert_layout has multiple users, it might be sunk between two of them, resulting in a dominance error.
) Common lib pattern to include: - lowering shared to distributed - lowering shared to dotOp when dotOp parent is a Blocked encoding - lowering distributed to distributed (excluding distributed smem, mmaV1) and calls `TargetInfo` for using stmatrix whenever possible Amd backend pattern to include: - lowering shared to dotOp with dotOp parent being Mfma - lowering Mfma to dotOp Nvidia backend pattern to include: - lowering shared to dotOp with dotOp parent being Mma - lowering mma to mma (this overlaps with lowering distributed to distributed pattern from the common lib; it will have a higher `PatternBenefit` since it's an optimization) - lowering distributed to distributed (only for distributed smem and mmaV1). Note: MmaV1 requires some duplicated code which will get deleted once we stop supporting mmaV1. - lowering mma to dotOp
[Backend] Clean up MMAv3 part of MatmulLoopPipeline. The MMAv3 pipelining code needed some love and comments. There are some minor functional changes in here. For example, the old code relied on a particular order of op.users(), and I'm not convinced it was correct. Once you unravel the code, what it needs to do is a lot less complicated than what it was doing. Indeed, despite the fact that I added a ton of comments, it's roughly the same length as before. The changes to the test are cosmetic. Previously we emitted tt.dot and assumed it would eventually become ttng.dot_async in a later pass (or at least during codegen). Now we explicitly convert some tt.dot's to ttng.dot_async so we can have an explicit wait and thread the results through. There's more cleanup that can be done here (see the TODOs added in this PR), but I needed to stop somewhere.
…ng#3587) The initial motivation for this change was that tt.dot's inputPrecision value was printed in textual IR not as a string (e.g. "tf32") but as an opaque number. The trick to fixing this is to explicitly list the attr in the assemblyFormat, instead of relying on attr-dict. attr-dict prints the attr as an integer, but if we list it explicitly, it will be printed as a string. But when I fixed this, I realized that many other ops, such as load and store, also had enums which were printed as numbers rather than strings. So I had to fix those, too. I also added reasonable defaults for most of the arguments to make the textual IR more concise. I then noticed that load and store had complicated asm parsers to handle the fact that the pointer type was usually not specified but sometimes necessary. This is because given e.g. `load : tensor<128xf32>` the pointer may be either `tensor<128x!tt.ptr<f32>>` (the default) or `!tt.ptr<tensor<128xf32>>` (for block pointers). So I fixed this too. Now load and store specify the *pointer* type instead of the value type. This lets us use the built-in asm parser and delete a bunch of code. I also noticed that we had a bunch of dead ttgir tests (it seems these were never enabled); rather than attempt to update them without any way to test them, I just deleted them. Thanks to @joker-eph for suggesting the solution here -- I never would have figured this out on my own.
…n-lang#3556) While investigating weird autotuned performance of some Triton kernels, I noticed that theoretical estimates returned from `estimate_matmul_time()` were suspiciously high. It turns out that `nvidia-smi` provides the max SM clock rate in MHz (e.g., something like `1410` for A100 GPUs), but `get_max_*_tflops()` functions expect `clock_rate` to be provided in KHz (e.g., `1410000`). I don't see this expectation documented anywhere, but: * `clock_rate` is multiplied by `1e-9` [here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/testing.py#L403), which only gives you TFLOPS if `clock_rate` is in KHz * the code in tests already accounts for this by [multiplying](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/test/regression/test_performance.py#L65) the `nvidia-smi` result by `1e3` The end result of this is that `compute_ms` [here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/ops/matmul_perf_model.py#L60), in fact, becomes `compute_us` and therefore completely dominates the final expression [here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/ops/matmul_perf_model.py#L92). In other words, the loading/storing times basically become irrelevant. **Disclaimer** I actually couldn't find a test case where the fix would make a difference: large compute-bound matmuls are obviously not affected, but at least on A100, there is no meaningful performance difference even for smaller sizes I tested (which should be memory-bound). The set of pruned configs indeed changes after the fix, but it doesn't seem to matter because all configs are essentially the same. It is also possible that the fix will actually degrade the performance of kernels which were written with the old performance model in mind. However, I decided I'd post this PR just in case because I think that this is definitely not the intended behavior. Co-authored-by: Justin Lebar <justin.lebar@gmail.com>
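To make the unit mismatch concrete, a small Python sketch (the SM count and per-clock throughput below are illustrative placeholders, not the model's actual formula):

```
# nvidia-smi reports the max SM clock in MHz (e.g. ~1410 on A100), while the
# perf model's get_max_*_tflops() helpers expect KHz.
sm_clock_mhz = 1410
clock_rate_khz = sm_clock_mhz * 1e3      # what the perf model expects

# Illustrative peak-throughput product (num_sms / ops_per_clock are made up):
num_sms, ops_per_clock = 108, 1024
tflops = num_sms * ops_per_clock * clock_rate_khz * 1e-9
print(f"{tflops:.0f} TFLOPS")  # ~156; passing MHz instead gives ~0.16 (1000x off)
```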
A few small tweaks to get good perf out of the fp8 matmul tutorial. Make sure to use the 3-source dot and pre-transpose b to get good performance. Also add more configs that work well for fp8. Results on H100:

```
matmul-performance-fp8:
         M       N       K       Triton
0    256.0   256.0   256.0     0.483438
1    384.0   384.0   384.0     1.768588
2    512.0   512.0   512.0     4.390792
3    640.0   640.0   640.0     9.194164
4    768.0   768.0   768.0    15.353337
5    896.0   896.0   896.0    24.886629
6   1024.0  1024.0  1024.0    37.882509
7   1152.0  1152.0  1152.0    51.427067
8   1280.0  1280.0  1280.0    73.470854
9   1408.0  1408.0  1408.0    96.225503
10  1536.0  1536.0  1536.0   125.411082
11  1664.0  1664.0  1664.0   158.921188
12  1792.0  1792.0  1792.0   193.992213
13  1920.0  1920.0  1920.0   239.117838
14  2048.0  2048.0  2048.0   280.570111
15  2176.0  2176.0  2176.0   284.182140
16  2304.0  2304.0  2304.0   411.416537
17  2432.0  2432.0  2432.0   438.333906
18  2560.0  2560.0  2560.0   527.320119
19  2688.0  2688.0  2688.0   650.862087
20  2816.0  2816.0  2816.0   757.684372
21  2944.0  2944.0  2944.0   859.936907
22  3072.0  3072.0  3072.0   961.241005
23  3200.0  3200.0  3200.0  1040.650362
24  3328.0  3328.0  3328.0  1117.226653
25  3456.0  3456.0  3456.0  1204.430514
26  3584.0  3584.0  3584.0  1279.934470
27  3712.0  3712.0  3712.0  1076.335008
28  3840.0  3840.0  3840.0  1145.660082
29  3968.0  3968.0  3968.0  1226.371204
30  4096.0  4096.0  4096.0  1292.886034
```
…orization (triton-lang#3609) The current implementation of load vectorization uses segmented short-vectorized loads instead of a full 128-bit load. Using multiple copies of a shorter load creates a dependency on the LLVM backend (esp. the load and store vectorizer) for full vectorization. This can be fragile: in some cases I saw the vector combine pass and the jump threading pass screw it up, resulting in non-ideal vectorization. This is a backport of ROCm#445
Fix grouping of argument elements into i32-typed vectors Signed-off-by: joviliast <iveselov.nn@gmail.com> Co-authored-by: Zahi Moudallal <zahi@openai.com>
…on-lang#3611) This was not really causing any issues, but the double visit was not intended. Rely on the fact that walk also visits the op it is called on.
Merging the change "[MLIR][SCF] Add checks to verify that the pipeliner schedule is correct. #77083" by ThomasRaoux from the llvm repo. Co-authored-by: Thomas Raoux <thomasraoux@openai.com>
…g#3604) This PR is to turn on all the unit tests that passed on the AMD backend.
- Support 3d tensor when emitting offsets for mfma layouts - Support 3d tensors in Shared to dot operand conversion - Support dot3d in Dialect.cpp - Replace amd::DecomposeConversion with common::ReduceDataDuplication --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
I'd like to add a favicon too, but my image editing skills are not great.
…from ttgir to llir/ptx (triton-lang#3458) This can help us analyze the mapping from ttgir to llir/ptx. When used with performance tools, it can provide a breakdown of ttgir instructions. Added one env variable, "USE_TTGIR_LOC": when set, location information is not emitted in the dumped ttgir; instead the ttgir file is re-parsed and its line numbers are used as debug info. When running on vector-add with USE_TTGIR_LOC=1: ttgir: #loc = loc("/cache-path/add_kernel.ttgir":2:1) llir: !3 = !DIFile(filename: "add_kernel.ttgir", directory: "/cache-path") --------- Co-authored-by: Manman Ren <mren@fb.com>
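A usage sketch for the new variable; setting `USE_TTGIR_LOC=1` in the shell before launching works the same way:

```
import os

# With USE_TTGIR_LOC=1 the dumped ttgir carries no source locations; the
# compiler re-parses the dumped file and uses its line numbers as debug info,
# so llir/ptx locations point back at ttgir lines.
os.environ["USE_TTGIR_LOC"] = "1"

import triton  # kernels compiled after this point map debug info to ttgir lines
```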
This commit enables test_line_info.py. It's mostly about finding the proper disassembler for hsaco and disassembling it with the necessary command-line options.
…lang#3617) This PR - enables test_matmul by skipping fp8 input and tf32x3. - fixes test_subproc on MI300.
…lang#3627) This removes a dedicated step for operators to have better parallelism.
Merge from newest openai/triton.(2024/4/11)
HighCWu force-pushed the windows-build-sync-openai branch from 9c80aba to 5efba8e on April 11, 2024 at 08:17.
HighCWu force-pushed the windows-build-sync-openai branch from fdb563c to d44c902 on April 11, 2024 at 09:33.