Add a matmul test from int8, bf16 #6

Closed
wants to merge 2 commits into from

Conversation

karupayun
Member

In this PR we add a matmul test with int8 and bf16 operands. I ran into a few issues with
the test, so I refactored the file a bit.

  • First, I added two new parameters (see the sketch after this list):

    • `dot_out_dtype`: lets users of the test class specify the type used internally in
      the dot, instead of the default derived from the two operand types. There are
      several restrictions on these types anyway.
    • `c_dtype`: the return type of the matmul.
      I added a few tests for the case of a dot between two float16 operands.
  • I had to modify test_matmul to use small integers when testing with two float16
    operands: torch uses float32 internally in that case, so we were hitting precision
    issues when comparing against triton whenever dot_out_dtype was float16.

  • I also needed to add torch.int8 to the supported datatypes.
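As a rough illustration of the two new parameters, here is a minimal, hedged sketch; the
exact signature of `triton.ops.matmul` on this branch may differ, and the parameter names
simply follow the description above.

```python
import torch
import triton.ops

a = torch.randn((128, 128), device="cuda", dtype=torch.float16)
b = torch.randn((128, 128), device="cuda", dtype=torch.float16)

# dot_out_dtype: the type used internally by the dot (the accumulator).
# c_dtype: the dtype of the returned tensor.
c = triton.ops.matmul(a, b, dot_out_dtype=torch.float16, c_dtype=torch.float16)
assert c.dtype == torch.float16
```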

Finally, I tried to simplify the logic of matmul/test_matmul: after adding these two
parameters it was hard to follow why each part of the code was needed, so I introduced
a type_preference_list of the allowed dot_out_dtype values for each pair of operand
types of a and b (sketched below).
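A minimal sketch of the type_preference_list idea; the entries and the helper name here
are illustrative rather than copied from the PR:

```python
import torch

# For each (a_dtype, b_dtype) pair: the dot_out_dtypes the dot supports,
# most preferred first. Entries are illustrative, not exhaustive.
TYPE_PREFERENCE_LIST = {
    (torch.float16, torch.float16): [torch.float32, torch.float16],
    (torch.bfloat16, torch.bfloat16): [torch.float32],
    (torch.int8, torch.int8): [torch.int32],
    (torch.float32, torch.float32): [torch.float32],
}

def resolve_dot_out_dtype(a_dtype, b_dtype, dot_out_dtype=None):
    allowed = TYPE_PREFERENCE_LIST[(a_dtype, b_dtype)]
    if dot_out_dtype is None:
        return allowed[0]  # default: the most preferred option
    if dot_out_dtype not in allowed:
        raise ValueError(
            f"dot_out_dtype {dot_out_dtype} not allowed for ({a_dtype}, {b_dtype})")
    return dot_out_dtype
```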


google-cla bot commented Oct 31, 2023

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@karupayun karupayun requested a review from gflegar October 31, 2023 17:17
Member

@gflegar gflegar left a comment


Thanks! Definitely much simpler and more versatile than it was before. But I think it can be even simpler and clearer; I left some comments to that end.

Review comments:
python/test/unit/operators/test_matmul.py (5 threads)
python/triton/ops/matmul.py (5 threads)
@karupayun karupayun force-pushed the llvm-head-staging branch 2 times, most recently from 5f10733 to e81f98b on November 6, 2023 12:36
Member

@gflegar gflegar left a comment


Generally LGTM.

A few unresolved comments remain regarding scaled integers and the weird view thing; we can also discuss them offline if you want.

Review comments:
python/test/unit/operators/test_matmul.py (5 threads)
karupayun added a commit to karupayun/triton that referenced this pull request Dec 6, 2023
This PR allows manually setting acc_dtype and output_dtype
in the matmul test.

They are:
  - `acc_dtype`: lets users of the test class specify the type used
internally in the dot, instead of the default derived from the two
operand types. There are several restrictions on these types anyway.
  - `output_dtype`: the return type of the matmul. I added a few
tests for the case of a dot between two float16 operands.
- I had to modify test_matmul to use a small range of values to
prevent numerical issues. When testing with two `float16` operands and
`acc_dtype` `float16`, I can't force torch to use `float16`
internally (it uses `float32`), so I was hitting precision issues when
comparing the results with triton.

The reason for doing this for all tests, and not only those particular
ones, is simplicity, since we should not be testing precision here.
The discussion can be seen in
openxla#6 (comment)
and openxla#6 (comment),
but I do not have a strong opinion, so I am OK with only testing with
small integers when the acc_dtype is float16.
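A hedged sketch of the small-range inputs (the helper name is illustrative, not from
the test file): small integers are exactly representable in float16, and their partial
sums stay within float16's exact-integer range, so a float16 accumulator agrees with
torch's float32 reference.

```python
import torch

def small_int_input(shape, dtype=torch.float16, low=-3, high=3):
    # Values in [low, high] are exact in float16, so accumulating them in
    # float16 and in float32 gives identical results for moderate sizes.
    return torch.randint(low, high + 1, shape, device="cuda").to(dtype)
```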
karupayun added a commit to karupayun/triton that referenced this pull request Dec 12, 2023
@gflegar gflegar force-pushed the llvm-head-staging branch 2 times, most recently from 18b8839 to ee2f536 on December 13, 2023 14:27
@karupayun
Member Author

This PR was divided between triton-lang#2768, triton-lang#2769 and triton-lang#2760. All of them are already merged.

@karupayun karupayun closed this Dec 13, 2023
vwbaker pushed a commit that referenced this pull request Mar 20, 2024
There are two tests that failed under AddressSanitizer:
* test/TritonGPU/loop-pipeline.mlir
* python/test/regression/test_functional_regressions.py

with an error: 

```
==8475==ERROR: AddressSanitizer: heap-use-after-free on address 0x50c000bd0be0 at pc 0x557b03278847 bp 0x7ffd69b2c4a0 sp 0x7ffd69b2c498
READ of size 8 at 0x50c000bd0be0 thread T0
    #0 0x557b03278846 in getNextOperandUsingThisValue [third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h:43](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h?l=43&ws=aliia/3018&snapshot=215):58
    #1 0x557b03278846 in operator++ [third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h:322](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h?l=322&ws=aliia/3018&snapshot=215):39
    #2 0x557b03278846 in mlir::ResultRange::UseIterator::operator++() [third_party/llvm/llvm-project/mlir/lib/IR/OperationSupport.cpp:614](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/mlir/lib/IR/OperationSupport.cpp?l=614&ws=aliia/3018&snapshot=215):5
    #3 0x557affde38c4 in operator++ [third_party/llvm/llvm-project/llvm/include/llvm/ADT/iterator.h:281](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/include/llvm/ADT/iterator.h?l=281&ws=aliia/3018&snapshot=215):5
    #4 0x557affde38c4 in createAsyncCopy [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:117](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=117&ws=aliia/3018&snapshot=215):26
    #5 0x557affde38c4 in createAsyncLoad [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:135](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=135&ws=aliia/3018&snapshot=215):3
    #6 0x557affde38c4 in createAsynOps [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:501](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=501&ws=aliia/3018&snapshot=215):5
    #7 0x557affde38c4 in mlir::triton::preProcessLoopAndGetSchedule(mlir::scf::ForOp&, int, mlir::triton::PipeliningOption&) [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:740](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp?l=740&ws=aliia/3018&snapshot=215):7
    #8 0x557affe01c0c in pipelineLoop [third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/SoftwarePipeliner.cpp:76](https://cs.corp.google.com/piper///depot/google3/third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/SoftwarePipeliner.cpp?l=76&ws=aliia/3018&snapshot=215):19
...
```
This is most likely the iterator being invalidated by
`alloc.erase()`.
This PR moves the erasure of allocations outside of the loop, which
fixes the heap-use-after-free.

Do you know if there is an easy way to run the tests under sanitizers
upstream? It would be handy if we could automate it, so we catch this
kind of error early on.
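The fix is the usual collect-then-erase pattern. A hedged Python analogue of the C++
change (the real fix operates on MLIR allocation ops in MatmulLoopPipeline.cpp, not on
Python lists):

```python
def prune_dead_allocs(allocs, is_dead):
    # Never erase while iterating: in C++ that frees the node the use-def
    # iterator is standing on; in Python it silently skips elements.
    # Record removals during the loop, apply them afterwards.
    to_erase = [a for a in allocs if is_dead(a)]
    for a in to_erase:
        allocs.remove(a)
    return allocs
```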
karupayun pushed a commit that referenced this pull request Apr 3, 2024
gflegar added a commit that referenced this pull request Jun 24, 2024
When running
[convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924)
Triton ends up computing the rank of a matrix with 0 columns during linear
layout lowering, which trips up f2reduce and causes undefined behavior,
detectable through
[UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html).

Fix this by returning the rank (0) early in these cases, without calling
f2reduce.

<details><summary>Stack trace</summary>
<p>

```
third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long'
    #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
    #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9
    #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3
    #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7
    #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41
    #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) third_party/triton/lib/Tools/LinearLayout.cpp:654:51
    #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14
    #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8
    #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19
    #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24
    #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7
    #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5
    #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9
    #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10
    #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12
    #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16
    #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5
    #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5
    #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3
    #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26
    #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7
    #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7
    #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5
    #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22
...
UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30 
```
</p>
</details>