
Merge to main from newest openai/triton.(2024/4/11) #5

Open

wants to merge 377 commits into main from windows-build-sync-openai
Conversation

@HighCWu (Owner) commented Apr 11, 2024

No description provided.

zhanglx13 and others added 30 commits March 4, 2024 16:33
This PR updates SharedToDotOperandMFMA.cpp and MFMA.cpp.
- SharedToDotOperandMFMA.cpp is up to date with triton-mlir as of today, which includes changes until ROCm#482
  - Fixed issue with opaque pointers
  - Fixed API for `getMFMAElemsPerInstrForOperands` and `getMFMARepForOperands`
- MFMA.cpp is synced with triton-mlir@6bb04d, which includes changes until ROCm#469

Note to @binarman: changes in other files from
ROCm#469 are not included in this PR. We
can bring up the support for mfma 64x4 and 4x64 later.
)

Adds a Redis-based backend as an initial implementation, but it should be straightforward to extend with more implementations.
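For context, a minimal sketch of what a Redis-backed cache backend could look like, assuming the `redis` Python client is installed; the class and method names below are hypothetical, not necessarily Triton's actual remote-cache interface:

```python
# Minimal sketch of a Redis-backed cache backend (hypothetical names).
from typing import Optional

import redis


class SimpleRedisCacheBackend:
    def __init__(self, key_prefix: str, host: str = "localhost", port: int = 6379):
        self._key_prefix = key_prefix
        self._client = redis.Redis(host=host, port=port)

    def _key(self, filename: str) -> str:
        return f"{self._key_prefix}:{filename}"

    def get(self, filename: str) -> Optional[bytes]:
        # Returns None on a cache miss.
        return self._client.get(self._key(filename))

    def put(self, data: bytes, filename: str) -> None:
        self._client.set(self._key(filename), data)
```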
- Register required dialect for triton-opt CMake target
- Add ROCDL dialect as a dependency for ConvertTritonAMDGPUToLLVM pass

Signed-off-by: joviliast <iveselov.nn@gmail.com>
…t more than one numScanBlocks along scan axis (triton-lang#3017) (triton-lang#3283)

more than one numScanBlocks along scan axis (triton-lang#3017)
Recently we had triton-lang#3264 and
triton-lang#3263 which left only a very small
difference in the TypeConverter.cpp in the AMD backend compared to the
TritonGPUToLLVM lib.
This check is now safe to do in the generic location so let's clean this
up a bit and remove the TypeConverter.cpp from the AMD backend.

`LoadStoreOpToLLVM.cpp` and `ConvertLayoutOpToLLVM.cpp` in the AMD
backend need some updates as well but that can come with a follow-on PR.
- Add WMMA conversion logic for dot operation
- Fix helper functions for WMMA layout
- Add lit test for WMMA dot operation conversion

Signed-off-by: joviliast <iveselov.nn@gmail.com>
…triton-lang#3284)

This reverts commit 9cfad37.


There has been an in-flight collision with some refactoring, causing a build break.
…#3261)

The existing shared memory representation in TTGIR had some semantic problems, as it mixed value semantics with memory semantics. To solve that, this PR moves the shared memory representation to memory semantics only. This means shared memory is no longer represented as tensors but as allocations that may or may not be mutable.

A tensor can no longer have a shared encoding:
- convert_layout #distributed -> #shared becomes triton_gpu.local_alloc %init : tensor<#distributed> -> !tt.memdesc<#shared>
- convert_layout #shared -> #distributed becomes triton_gpu.local_load %mem : !tt.memdesc<#shared> -> tensor<#distributed>
- insert_slice_async becomes async_copy_global_to_local
…riton-lang#3268)

When sizePerThread is larger than the tensor's shape, only unique elements are stored in smemShape by accumulatePartialReductions, so when reading out reduced values we need to take the index modulo srcShape/smemShape to obtain the replicated elements for the oversized positions.
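As a toy illustration of the modulo read-back (plain Python with made-up shapes, not the actual lowering code):

```python
# When a thread nominally owns more elements than the tensor has along the
# reduced axis, shared memory only holds the unique reduced values
# (smem_shape of them); read-back indices wrap with a modulo so the
# oversized positions pick up the replicated value.
def read_reduced(smem, smem_shape, size_per_thread):
    return [smem[i % smem_shape] for i in range(size_per_thread)]

smem = [10, 20]                   # unique reduced values, smem_shape = 2
print(read_reduced(smem, 2, 8))   # [10, 20, 10, 20, 10, 20, 10, 20]
```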
…n-lang#3278)

Changes in this PR:

1. Deleted `decomposeMmaToDotOperand` and `decomposeInsertSliceAsyncOp` from AMD backend
2. Moved `decomposeMixedModeDotOp` to `AccelerateAMDMatmul.cpp`
3. Created a new pass in `DecomposeUnsupportedConversions.cpp` to
include: `decomposeFp8e4b15Convert`, `decomposeSplatToSharedLayout`,
`decomposeMfmaToDotOperand`, `reduceCvtOpLDSUsage`,
`decomposeBlockedToDotOperand`, and `promoteReduceOps`.
4. Deleted workaround in `initSharedMemory` to annotate
`triton_gpu.shared`.
Resolves the TODO left in `matchReduxKind` from triton-lang#3263, which broke
reductions on GPUs without Redux operations.
My previous commit broke lowering on the AMD target due to a missing pattern and missing handling of the stream pipeliner.
…ton-lang#3295)

`StandardInstrumentations::registerCallbacks` reads the registered LLVM options, so for it to enable printing, `print-after-all` needs to already be set when it is invoked.
This fixes the lack of IR dumps for optimization passes.
- Remove nv-specific code from `ConvertLayoutOpToLLVM.cpp`
- Remove `storeDistributedToShared` and related function from
`TritonGPUToLLVMBase.h`

Later, we can further move `loadSharedToDistributed` from both amd and
nv path to the common lib.
Note: for now amd still needs `DecomposeConversions` pass as one of the
transforms at ttgir level, since some optimization will reorder
local_alloc and local_load for both opA and opB in a different way.
- Add WMMA conversion logic for dot operation
- Fix helper functions for WMMA layout
- Add lit test for WMMA dot operation conversion

See also: 10e1560

Signed-off-by: joviliast <iveselov.nn@gmail.com>
It's quite easy to unexpectedly use a match from a different CHECK-LABEL
in a lit test, which is likely not what the author intended.

Prevent this from happening by invoking FileCheck with variable scopes
enabled (and fix the failures detected this way). This also makes the
file check use in Triton consistent with upstream MLIR.
…n-lang#3290)

Add an environment variable that allows us to use ieee f32 by default.
This makes it simpler to check for precision problems.
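A hedged usage sketch: the variable name below is an assumption on my part, since the summary does not spell it out; check the referenced PR for the exact name and accepted values.

```python
# Opt into IEEE f32 for dots before kernels are compiled.
# TRITON_F32_DEFAULT is an assumed name, not confirmed by this summary.
import os

os.environ["TRITON_F32_DEFAULT"] = "ieee"
```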
…riton-lang#3302)

triton-lang#2934 added
`RedisRemoteCacheBackend` but did not add the symbol to `__init__.py`.
Let's also add it there so that we can refer to it via its full path instead of having to `from ... import RedisRemoteCacheBackend`.
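For illustration, both import styles the commit alludes to; the `triton.runtime.cache` module path here is an assumption based on where the referenced PR placed the backend.

```python
# Explicit import (what the commit wants to avoid requiring):
from triton.runtime.cache import RedisRemoteCacheBackend

# Full-path reference, possible once the symbol is re-exported:
import triton.runtime.cache as cache
backend_cls = cache.RedisRemoteCacheBackend
```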
When doing a convert from mma -> blocked with a transpose, using a vector size for the load back is more efficient.
…iton-lang#3091)

This commit fixes a failure in python/tutorials/03-matrix-multiplication.py for FMA cases, and also fixes mixed dot for FMA cases.
Tested on Navi31

---------

Signed-off-by: joviliast <iveselov.nn@gmail.com>
…iton-lang#3311)

This reverts commit 375fee0.

Reverting this PR as it causes an error in our internal tests. It reaches this error in `python3.11/site-packages/setuptools/_distutils/core.py`:
```
def run_commands(dist):
    """Given a Distribution object run all the commands,
    raising ``SystemExit`` errors in the case of failure.

    This function assumes that either ``sys.argv`` or ``dist.script_args``
    is already set accordingly.
    """
    try:
        dist.run_commands()
    except KeyboardInterrupt:
        raise SystemExit("interrupted")
    except OSError as exc:
        if DEBUG: 
            sys.stderr.write("error: {}\n".format(exc))
            raise 
        else:
            raise SystemExit("error: {}".format(exc))
   
    except (DistutilsError, CCompilerError) as msg:
        if DEBUG:
            raise
        else:
            raise SystemExit("error: " + str(msg))

    return dist
```
The user_begin() function cannot guarantee a definite order, so when there are multiple users of convert_layout, it might be sunk between two of its users, resulting in a dominance error.
)

Common lib patterns to include:
- lowering shared to distributed
- lowering shared to dotOp when dotOp parent is a Blocked encoding
- lowering distributed to distributed (excluding distributed smem,
mmaV1) and calls `TargetInfo` for using stmatrix whenever possible

AMD backend patterns to include:
- lowering shared to dotOp with dotOp parent being Mfma
- lowering Mfma to dotOp

NVIDIA backend patterns to include:
- lowering shared to dotOp with dotOp parent being Mma
- lowering mma to mma (this overlaps with lowering distributed to
distributed pattern from the common lib; it will have a higher
`PatternBenefit` since it's an optimization)
- lowering distributed to distributed (only for distributed smem and
mmaV1). Note: MmaV1 requires some duplicated code which will get deleted
once we stop supporting mmaV1.
- lowering mma to dotOp
[Backend] Clean up MMAv3 part of MatmulLoopPipeline.

The MMAv3 pipelining code needed some love and comments.

There are some minor functional changes in here.  For example, the old
code relied on a particular order of op.users(), and I'm not convinced
it was correct.

Once you unravel the code, what it needs to do is a lot less complicated
than what it was doing.  Indeed, despite the fact that I added a ton of
comments, it's roughly the same length as before.

The changes to the test are cosmetic. Previously we emitted tt.dot and assumed it would eventually become ttng.dot_async in a later pass (or at least during codegen). Now we explicitly convert some tt.dot's to ttng.dot_async so we can have an explicit wait and thread the results through.

There's more cleanup that can be done here (see the TODOs added in this
PR), but I needed to stop somewhere.
jlebar and others added 25 commits April 8, 2024 22:35
…ng#3587)

The initial motivation for this change was that tt.dot's inputPrecision
value was printed in textual IR not as a string (e.g. "tf32") but as an
opaque number. The trick to fixing this is to explicitly list the attr
in the assemblyFormat, instead of relying on attr-dict. attr-dict prints
the attr as an integer, but if we list it explicitly, it will be printed
as a string.

But when I fixed this, I realized that many other ops, such as load and
store, also had enums which were printed as numbers rather than strings.
So I had to fix those, too. I also added reasonable defaults for most of
the arguments to make the textual IR more concise.

I then noticed that load and store had complicated asm parsers to handle the fact that the pointer type usually was not specified, but sometimes it was necessary. This is because given e.g. `load : tensor<128xf32>`, the pointer may be either `tensor<128x!tt.ptr<f32>>` (the default) or `!tt.ptr<tensor<128xf32>>` (for block pointers).

So I fixed this too. Now load and store specify the *pointer* type
instead of the value type. This lets us use the built-in asm parser and
delete a bunch of code.

I also noticed that we had a bunch of dead ttgir tests (actually it seems like these were never enabled); rather than attempting to update them without any way to test them, I just deleted them.

Thanks to @joker-eph for suggesting the solution here -- I never would
have figured this out on my own.
…n-lang#3556)

While investigating weird autotuned performance of some Triton kernels,
I noticed that theoretical estimates returned from
`estimate_matmul_time()` were suspiciously high. It turns out that
`nvidia-smi` provides the max SM clock rate in MHz (e.g., something like
`1410` for A100 GPUs), but `get_max_*_tflops()` functions expect
`clock_rate` to be provided in KHz (e.g., `1410000`). I don't see this expectation documented anywhere, but:
* `clock_rate` is multiplied by `1e-9`
[here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/testing.py#L403),
which only gives you TFLOPS if `clock_rate` is in KHz
* the code in tests already accounts for this by
[multiplying](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/test/regression/test_performance.py#L65)
the `nvidia-smi` result by `1e3`

The end result of this is that `compute_ms`
[here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/ops/matmul_perf_model.py#L60),
in fact, becomes `compute_us` and therefore completely dominates the
final expression
[here](https://github.com/openai/triton/blob/d988b759ef1f83632ed2f8aa99e43d6349077acf/python/triton/ops/matmul_perf_model.py#L92).
In other words, the loading/storing times basically become irrelevant.

**Disclaimer**: I actually couldn't find a test case where the fix would make a difference: large compute-bound matmuls are obviously not affected, but at least on A100 there is no meaningful performance difference even for the smaller sizes I tested (which should be memory-bound). The set of pruned configs indeed changes after the fix, but it doesn't seem to matter because all configs are essentially the same.

It is also possible that the fix will actually degrade the performance of kernels which were written with the old performance model in mind. However, I decided to post this PR just in case, because I think this is definitely not the intended behavior.

Co-authored-by: Justin Lebar <justin.lebar@gmail.com>
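To make the unit mismatch concrete, here is a small back-of-the-envelope sketch; the hardware constants are illustrative and not taken from Triton's perf model.

```python
# With clock_rate in KHz (what get_max_*_tflops() expects), the 1e-9 factor
# yields TFLOP/s; feeding the raw MHz value from nvidia-smi makes the peak
# estimate 1000x too small, which inflates the estimated compute time 1000x.
num_subcores = 108 * 4            # e.g. 108 SMs * 4 tensor cores (illustrative)
ops_per_subcore_per_cycle = 512   # illustrative per-cycle throughput

sm_clock_mhz = 1410               # what nvidia-smi reports
sm_clock_khz = sm_clock_mhz * 1e3 # what the formula actually expects

tflops_correct = num_subcores * sm_clock_khz * ops_per_subcore_per_cycle * 1e-9
tflops_wrong   = num_subcores * sm_clock_mhz * ops_per_subcore_per_cycle * 1e-9

print(tflops_correct)  # ~312 TFLOP/s, a plausible peak
print(tflops_wrong)    # 1000x smaller, so compute time is overestimated 1000x
```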
A few small tweaks to get good perf out of the fp8 matmul tutorial.
Make sure to use a 3-source dot and pre-transpose b to get good performance. Also add more configs that work well for fp8.

Results on H100:
```
matmul-performance-fp8:
         M       N       K       Triton
0    256.0   256.0   256.0     0.483438
1    384.0   384.0   384.0     1.768588
2    512.0   512.0   512.0     4.390792
3    640.0   640.0   640.0     9.194164
4    768.0   768.0   768.0    15.353337
5    896.0   896.0   896.0    24.886629
6   1024.0  1024.0  1024.0    37.882509
7   1152.0  1152.0  1152.0    51.427067
8   1280.0  1280.0  1280.0    73.470854
9   1408.0  1408.0  1408.0    96.225503
10  1536.0  1536.0  1536.0   125.411082
11  1664.0  1664.0  1664.0   158.921188
12  1792.0  1792.0  1792.0   193.992213
13  1920.0  1920.0  1920.0   239.117838
14  2048.0  2048.0  2048.0   280.570111
15  2176.0  2176.0  2176.0   284.182140
16  2304.0  2304.0  2304.0   411.416537
17  2432.0  2432.0  2432.0   438.333906
18  2560.0  2560.0  2560.0   527.320119
19  2688.0  2688.0  2688.0   650.862087
20  2816.0  2816.0  2816.0   757.684372
21  2944.0  2944.0  2944.0   859.936907
22  3072.0  3072.0  3072.0   961.241005
23  3200.0  3200.0  3200.0  1040.650362
24  3328.0  3328.0  3328.0  1117.226653
25  3456.0  3456.0  3456.0  1204.430514
26  3584.0  3584.0  3584.0  1279.934470
27  3712.0  3712.0  3712.0  1076.335008
28  3840.0  3840.0  3840.0  1145.660082
29  3968.0  3968.0  3968.0  1226.371204
30  4096.0  4096.0  4096.0  1292.886034
```
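A minimal sketch of the two tweaks described above, with assumed helper names and shapes (the actual tutorial code differs):

```python
import torch

def prepare_fp8_operands(a: torch.Tensor, b: torch.Tensor):
    # Pre-transpose B up front so the kernel consumes it in the layout that
    # performs well for fp8, as the commit message suggests.
    return a, b.T

# Inside the Triton kernel (sketch), feed the accumulator into the dot
# ("3-source" dot) instead of adding to it afterwards:
#   acc = tl.dot(a_tile, b_tile, acc)
```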
…orization (triton-lang#3609)

The current implementation of load vectorization uses segmented short-vectorized loads instead of a full 128-bit load. Using multiple copies of shorter loads creates a dependency on the LLVM backend (esp. the load and store vectorizer) for full vectorization. This can be fragile: in some cases I saw the vector combine pass and the jump threading pass screw it up, resulting in non-ideal vectorization.

This is a backport of ROCm#445
Fix grouping of argument elements into i32 typed vectors

Signed-off-by: joviliast <iveselov.nn@gmail.com>
Co-authored-by: Zahi Moudallal <zahi@openai.com>
…on-lang#3611)

This was not really causing any issues, but the double visit was not intended.
Rely on the fact that walk also visits the op it is called on.
Merging change "[MLIR][SCF] Add checks to verify that the pipeliner
schedule is correct. #77083" by ThomasRaoux from the llvm repo

co-authored-by: Thomas Raoux <thomasraoux@openai.com>
…g#3604)

This PR is to turn on all the unit tests that passed on AMD backend.
- Support 3d tensor when emitting offsets for mfma layouts
- Support 3d tensors in Shared to dot operand conversion
- Support dot3d in Dialect.cpp
- Replace amd::DecomposeConversion with common::ReduceDataDuplication

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
I'd like to add a favicon too, but my image editing skills are not
great.
…from ttgir to llir/ptx (triton-lang#3458)

This can help us analyze the mapping from ttgir to llir/ptx. When used with performance tools, it can provide a breakdown of ttgir instructions.
Added one environment variable, "USE_TTGIR_LOC": when set, location information is not emitted in the dumped ttgir; instead, the ttgir file is re-parsed and its line numbers are used as debug info.
When running vector-add with USE_TTGIR_LOC=1:
    ttgir: #loc = loc("/cache-path/add_kernel.ttgir":2:1)
    llir: !3 = !DIFile(filename: "add_kernel.ttgir", directory: "/cache-path")

---------

Co-authored-by: Manman Ren <mren@fb.com>
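A short usage sketch of the flag described above:

```python
# Enable ttgir-based locations before compiling, then run a kernel
# (e.g. the vector-add tutorial) and inspect the emitted llir/ptx for
# !DIFile entries that point at the cached ttgir file.
import os

os.environ["USE_TTGIR_LOC"] = "1"
```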
This commit enables test_line_info.py. It's mostly about finding the proper disassembler for hsaco and disassembling it with the necessary command-line options.
…lang#3617)

This PR 
- enables test_matmul by skipping fp8 input and tf32x3.
- fixes test_subproc on MI300.
…lang#3627)

This removes a dedicated step for operators to have better parallelism.
Merge from newest openai/triton.(2024/4/11)
@HighCWu force-pushed the windows-build-sync-openai branch from 9c80aba to 5efba8e on April 11, 2024 08:17
@HighCWu force-pushed the windows-build-sync-openai branch from fdb563c to d44c902 on April 11, 2024 09:33