forked from wkpark/triton
Merge to main from newest openai/triton.(2024/4/11) #5
Open

HighCWu wants to merge 377 commits into main from windows-build-sync-openai
Conversation
This PR updates SharedToDotOperandMFMA.cpp and MFMA.cpp. - SharedToDotOperandMFMA.cpp is up to date with triton-mlir as of today, which includes changes until ROCm#482 - Fixed issue with opaque pointers - Fixed API for `getMFMAElemsPerInstrForOperands` and `getMFMARepForOperands` - MFMA.cpp is synced with triton-mlir@6bb04d, which includes changes until ROCm#469 Note to @binarman: changes in other files from ROCm#469 are not included in this PR. We can bring up the support for mfma 64x4 and 4x64 later.
-Register required dialect for triton-opt CMake target -Add ROCDL dialect as a dependency for ConvertTritonAMDGPUToLLVM pass Signed-off-by: joviliast <iveselov.nn@gmail.com>
…t more than one numScanBlocks along scan axis (triton-lang#3017) (triton-lang#3283)
Recently we had triton-lang#3264 and triton-lang#3263 which left only a very small difference in the TypeConverter.cpp in the AMD backend compared to the TritonGPUToLLVM lib. This check is now safe to do in the generic location so let's clean this up a bit and remove the TypeConverter.cpp from the AMD backend. `LoadStoreOpToLLVM.cpp` and `ConvertLayoutOpToLLVM.cpp` in the AMD backend need some updates as well but that can come with a follow-on PR.
-Add WMMA conversion logic for dot operation -Fix helper functions for WMMA layout -Add lit test for WMMA dot operation conversion Signed-off-by: joviliast <iveselov.nn@gmail.com>
…triton-lang#3284) This reverts commit 9cfad37. There has been an inflight collision with some refactoring causing a build break
…#3261) The existing shared memory representation in TTGIR had semantic problems: it mixed value semantics with memory semantics. To solve that, this PR moves the shared memory representation to memory semantics only. Shared memory is now represented not as tensors but as allocations that may or may not be mutable, and a tensor can no longer have a shared encoding. Convert_layout #distributed -> #shared becomes triton_gpu.local_alloc %init tensor<#distributed> -> !tt.memdesc<#shared>; convert_layout #shared -> #distributed becomes triton_gpu.local_load %mem !tt.memdesc<#shared> -> tensor<#distributed>; insert_slice_async becomes async_copy_global_to_local.
…riton-lang#3268) When sizePerThread is larger than the tensor's shape, only the unique elements are stored in smemShape during accumulatePartialReductions. So when reading the reduced values back out, we need to take the index modulo srcShape/smemShape to obtain the replicated elements for the oversized positions.
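As a rough illustration of the read-back described above (the names and shapes are hypothetical, not the backend's actual variables), a minimal Python sketch:

```
# Minimal sketch, assuming a 1-D reduction: when sizePerThread exceeds the
# tensor shape, only the unique (already-reduced) elements live in "shared
# memory", so reads for the oversized positions wrap around with a modulo.
def read_reduced(smem, smem_shape, idx):
    # idx may exceed smem_shape; wrapping returns the replicated element.
    return smem[idx % smem_shape]

reduced = [10, 20, 30]              # 3 unique reduced values
print(read_reduced(reduced, 3, 5))  # -> 30: the replicated element for idx 5
```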
…n-lang#3278) Changes in this PR: 1. Deleted `decomposeMmaToDotOperand` and `decomposeInsertSliceAsyncOp ` from AMD backend 2. Moved `decomposeMixedModeDotOp` to `AccelerateAMDMatmul.cpp` 3. Created a new pass in `DecomposeUnsupportedConversions.cpp` to include: `decomposeFp8e4b15Convert`, `decomposeSplatToSharedLayout`, `decomposeMfmaToDotOperand`, `reduceCvtOpLDSUsage`, `decomposeBlockedToDotOperand`, and `promoteReduceOps`. 4. Deleted workaround in `initSharedMemory` to annotate `triton_gpu.shared`.
Resolves the TODO left in `matchReduxKind` from triton-lang#3263, which broke reductions on GPUs without Redux operations.
My previous commit broke lowering on the AMD target due to a missing pattern and missing handling of the stream pipeliner.
…ton-lang#3295) `StandardInstrumentations::registerCallbacks` reads the llvm registered options, so in order for it to enable printing, the `print-after-all` needs to be already set when it is being invoked. This fixes the lack of IR dumps for optimization passes.
- Remove nv-specific code from `ConvertLayoutOpToLLVM.cpp` - Remove `storeDistributedToShared` and related function from `TritonGPUToLLVMBase.h` Later, we can further move `loadSharedToDistributed` from both amd and nv path to the common lib. Note: for now amd still needs `DecomposeConversions` pass as one of the transforms at ttgir level, since some optimization will reorder local_alloc and local_load for both opA and opB in a different way.
-Add WMMA conversion logic for dot operation -Fix helper functions for WMMA layout -Add lit test for WMMA dot operation conversion See also: 10e1560 Signed-off-by: joviliast <iveselov.nn@gmail.com>
It's quite easy to unexpectedly use a match from a different CHECK-LABEL in a lit test, which is likely not what the author intended. Prevent this from happening by invoking FileCheck with variable scopes enabled (and fix the failures detected this way). This also makes the file check use in Triton consistent with upstream MLIR.
…n-lang#3290) Add an environment variable that allows us to use ieee f32 by default. This makes it simpler to check for precision problems.
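A minimal usage sketch; the variable name used below (TRITON_F32_DEFAULT) is an assumption, since the commit message does not spell it out, so check the diff for the exact name:

```
import os

# Assumed variable name: setting it to "ieee" before compiling kernels makes
# fp32 dots default to IEEE semantics instead of tf32, which makes precision
# problems easier to isolate.
os.environ["TRITON_F32_DEFAULT"] = "ieee"

import triton  # kernels compiled after this point pick up the ieee default
```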
…riton-lang#3302) triton-lang#2934 added `RedisRemoteCacheBackend` but did not add the symbol to `__init__.py`. Let's also add it there so that we can refer to it via the full path instead of having to `from ... import RedisRemoteCacheBackend`.
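A sketch of what the re-export enables; the exact module path under `triton.runtime` is an assumption based on the upstream layout, not something this commit spells out:

```
# Before: the symbol had to be imported explicitly from its defining module.
from triton.runtime.cache import RedisRemoteCacheBackend  # noqa: F401

# After the re-export from __init__.py, the fully qualified path can be used
# directly (the exact attribute location is an assumption):
import triton
backend_cls = triton.runtime.cache.RedisRemoteCacheBackend
```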
When doing a convert from mma -> blocked with a transpose, using a vector size for the load back is more efficient.
…iton-lang#3091) This commit fixes a failure in python/tutorials/03-matrix-multiplication.py for FMA cases and also fixes mixed dot for FMA cases. Tested on Navi31 --------- Signed-off-by: joviliast <iveselov.nn@gmail.com>
…iton-lang#3311) This reverts commit 375fee0. Reverting this PR as it causes errors in our internal tests. It reaches this error in `python3.11/site-packages/setuptools/_distutils/core.py`:

```
def run_commands(dist):
    """Given a Distribution object run all the commands,
    raising ``SystemExit`` errors in the case of failure.

    This function assumes that either ``sys.argv`` or ``dist.script_args``
    is already set accordingly.
    """
    try:
        dist.run_commands()
    except KeyboardInterrupt:
        raise SystemExit("interrupted")
    except OSError as exc:
        if DEBUG:
            sys.stderr.write("error: {}\n".format(exc))
            raise
        else:
            raise SystemExit("error: {}".format(exc))
    except (DistutilsError, CCompilerError) as msg:
        if DEBUG:
            raise
        else:
            raise SystemExit("error: " + str(msg))
    return dist
```
The user_begin() function cannot guarantee a definite order, so when convert_layout has multiple users, it might be sunk between two of them, resulting in a dominance error.
) Common lib pattern to include: - lowering shared to distributed - lowering shared to dotOp when dotOp parent is a Blocked encoding - lowering distributed to distributed (excluding distributed smem, mmaV1) and calls `TargetInfo` for using stmatrix whenever possible Amd backend pattern to include: - lowering shared to dotOp with dotOp parent being Mfma - lowering Mfma to dotOp Nvidia backend pattern to include: - lowering shared to dotOp with dotOp parent being Mma - lowering mma to mma (this overlaps with lowering distributed to distributed pattern from the common lib; it will have a higher `PatternBenefit` since it's an optimization) - lowering distributed to distributed (only for distributed smem and mmaV1). Note: MmaV1 requires some duplicated code which will get deleted once we stop supporting mmaV1. - lowering mma to dotOp
[Backend] Clean up MMAv3 part of MatmulLoopPipeline. The MMAv3 pipelining code needed some love and comments. There are some minor functional changes in here. For example, the old code relied on a particular order of op.users(), and I'm not convinced it was correct. Once you unravel the code, what it needs to do is a lot less complicated than what it was doing. Indeed, despite the fact that I added a ton of comments, it's roughly the same length as before. The changes to the test are cosmetic. Previously we emitted tt.dot and assumed it would eventually become ttng.dot_async in a later pass (or at least during codegen). Now we explicitly convert some tt.dot's to ttng.dot_async so we can have an explicit wait and thread the results through. There's more cleanup that can be done here (see the TODOs added in this PR), but I needed to stop somewhere.
…ng#3587) The initial motivation for this change was that tt.dot's inputPrecision value was printed in textual IR not as a string (e.g. "tf32") but as an opaque number. The trick to fixing this is to explicitly list the attr in the assemblyFormat, instead of relying on attr-dict. attr-dict prints the attr as an integer, but if we list it explicitly, it will be printed as a string. But when I fixed this, I realized that many other ops, such as load and store, also had enums which were printed as numbers rather than strings. So I had to fix those, too. I also added reasonable defaults for most of the arguments to make the textual IR more concise. I then noticed that load and store had complicated asm parsers to handle the fact that the pointer type was usually not specified but sometimes necessary. This is because given e.g. `load : tensor<128xf32>` the pointer may be either `tensor<128x!tt.ptr<f32>>` (the default) or `!tt.ptr<tensor<128xf32>>` (for block pointers). So I fixed this too. Now load and store specify the *pointer* type instead of the value type. This lets us use the built-in asm parser and delete a bunch of code. I also noticed that we had a bunch of dead ttgir tests (it seems these were never enabled); rather than attempt to update them without any way to test them, I just deleted them. Thanks to @joker-eph for suggesting the solution here -- I never would have figured this out on my own.
…n-lang#3556) While investigating weird autotuned performance of some Triton kernels, I noticed that theoretical estimates returned from `estimate_matmul_time()` were suspiciously high. It turns out that `nvidia-smi` provides the max SM clock rate in MHz (e.g., something like `1410` for A100 GPUs), but `get_max_*_tflops()` functions expect `clock_rate` to be provided in KHz (e.g., `1410000`). I don't see this expectation documented anywhere, but: * `clock_rate` is multiplied by `1e-9` [here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/testing.py#L403), which only gives you TFLOPS if `clock_rate` is in KHz * the code in tests already accounts for this by [multiplying](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/test/regression/test_performance.py#L65) the `nvidia-smi` result by `1e3` The end result of this is that `compute_ms` [here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/ops/matmul_perf_model.py#L60), in fact, becomes `compute_us` and therefore completely dominates the final expression [here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/ops/matmul_perf_model.py#L92). In other words, the loading/storing times basically become irrelevant. **Disclaimer** I actually couldn't find a test case where the fix would make a difference: large compute-bound matmuls are obviously not affected, but at least on A100, there is no meaningful performance difference even for smaller sizes I tested (which should be memory-bound). The set of pruned configs indeed changes after the fix, but it doesn't seem to matter because all configs are essentially the same. It is also possible that the fix will actually degrade the performance of kernels which were written with the old performance model in mind. However, I decided I'd post this PR just in case because I think that this is definitely not the intended behavior. Co-authored-by: Justin Lebar <justin.lebar@gmail.com>
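To make the unit mismatch concrete, a small Python sketch (the SM count and per-clock throughput below are illustrative placeholders, not the model's actual formula):

```
# nvidia-smi reports the max SM clock in MHz (e.g. ~1410 on A100), while the
# perf model's get_max_*_tflops() helpers expect KHz.
sm_clock_mhz = 1410
clock_rate_khz = sm_clock_mhz * 1e3      # what the perf model expects

# Illustrative peak-throughput product (num_sms / ops_per_clock are made up):
num_sms, ops_per_clock = 108, 1024
tflops = num_sms * ops_per_clock * clock_rate_khz * 1e-9
print(f"{tflops:.0f} TFLOPS")  # ~156; passing MHz instead gives ~0.16 (1000x off)
```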
A few small tweaks to get good perf out of the fp8 matmul tutorial. Make sure to use the 3-source dot and pre-transpose b to get good performance. Also add more configs that work well for fp8. Results on H100:

```
matmul-performance-fp8:
         M       N       K       Triton
0    256.0   256.0   256.0     0.483438
1    384.0   384.0   384.0     1.768588
2    512.0   512.0   512.0     4.390792
3    640.0   640.0   640.0     9.194164
4    768.0   768.0   768.0    15.353337
5    896.0   896.0   896.0    24.886629
6   1024.0  1024.0  1024.0    37.882509
7   1152.0  1152.0  1152.0    51.427067
8   1280.0  1280.0  1280.0    73.470854
9   1408.0  1408.0  1408.0    96.225503
10  1536.0  1536.0  1536.0   125.411082
11  1664.0  1664.0  1664.0   158.921188
12  1792.0  1792.0  1792.0   193.992213
13  1920.0  1920.0  1920.0   239.117838
14  2048.0  2048.0  2048.0   280.570111
15  2176.0  2176.0  2176.0   284.182140
16  2304.0  2304.0  2304.0   411.416537
17  2432.0  2432.0  2432.0   438.333906
18  2560.0  2560.0  2560.0   527.320119
19  2688.0  2688.0  2688.0   650.862087
20  2816.0  2816.0  2816.0   757.684372
21  2944.0  2944.0  2944.0   859.936907
22  3072.0  3072.0  3072.0   961.241005
23  3200.0  3200.0  3200.0  1040.650362
24  3328.0  3328.0  3328.0  1117.226653
25  3456.0  3456.0  3456.0  1204.430514
26  3584.0  3584.0  3584.0  1279.934470
27  3712.0  3712.0  3712.0  1076.335008
28  3840.0  3840.0  3840.0  1145.660082
29  3968.0  3968.0  3968.0  1226.371204
30  4096.0  4096.0  4096.0  1292.886034
```
…orization (triton-lang#3609) The current implementation of load vectorization uses segmented short-vectorized loads instead of a full 128-bit load. Using multiple copies of a shorter load creates a dependency on the LLVM backend (esp. the load and store vectorizer) for full vectorization. This can be fragile: in some cases I saw the vector combine pass and the jump threading pass screw it up, resulting in non-ideal vectorization. This is a backport of ROCm#445
Fix grouping of argument elements into i32-typed vectors Signed-off-by: joviliast <iveselov.nn@gmail.com> Co-authored-by: Zahi Moudallal <zahi@openai.com>
…on-lang#3611) This was not really causing any issues, but the double visit was not intended. Rely on the fact that walk also visits the op it is called on.
Merging the change "[MLIR][SCF] Add checks to verify that the pipeliner schedule is correct. #77083" by ThomasRaoux from the llvm repo. Co-authored-by: Thomas Raoux <thomasraoux@openai.com>
…g#3604) This PR is to turn on all the unit tests that passed on the AMD backend.
- Support 3d tensor when emitting offsets for mfma layouts - Support 3d tensors in Shared to dot operand conversion - Support dot3d in Dialect.cpp - Replace amd::DecomposeConversion with common::ReduceDataDuplication --------- Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
I'd like to add a favicon too, but my image editing skills are not great.
…from ttgir to llir/ptx (triton-lang#3458) This can help us analyze the mapping from ttgir to llir/ptx. When used with performance tools, it can provide a breakdown of ttgir instructions. Added one env variable, "USE_TTGIR_LOC": when set, location information is not emitted in the dumped ttgir; instead the ttgir file is re-parsed and its line numbers are used as debug info. When running on vector-add with USE_TTGIR_LOC=1: ttgir: #loc = loc("/cache-path/add_kernel.ttgir":2:1) llir: !3 = !DIFile(filename: "add_kernel.ttgir", directory: "/cache-path") --------- Co-authored-by: Manman Ren <mren@fb.com>
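A usage sketch for the new variable; setting `USE_TTGIR_LOC=1` in the shell before launching works the same way:

```
import os

# With USE_TTGIR_LOC=1 the dumped ttgir carries no source locations; the
# compiler re-parses the dumped file and uses its line numbers as debug info,
# so llir/ptx locations point back at ttgir lines.
os.environ["USE_TTGIR_LOC"] = "1"

import triton  # kernels compiled after this point map debug info to ttgir lines
```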
This commit enables test_line_info.py. It's mostly about finding the proper disassembler for hsaco and disassembling it with the necessary command-line options.
…lang#3617) This PR - enables test_matmul by skipping fp8 input and tf32x3. - fixes test_subproc on MI300.
…lang#3627) This removes a dedicated step for operators to have better parallelism.
Merge from newest openai/triton.(2024/4/11)
HighCWu force-pushed the windows-build-sync-openai branch from 9c80aba to 5efba8e on April 11, 2024 at 08:17.
HighCWu force-pushed the windows-build-sync-openai branch from fdb563c to d44c902 on April 11, 2024 at 09:33.