Add a matmul test from int8, bf16 #6

karupayun · 2023-10-31T17:17:29Z

This PR adds a matmul test with int8 and bf16 inputs. I ran into a few issues in the test, so
I refactored the file a bit.

- First, I added two new params:
  - Dot_out_dtype: lets users of the test class specify the type used internally in the dot,
    instead of the default derived from the two operand types. Several restrictions still
    apply to these types.
  - C_dtype: the return type of the matmul.

  I also added a few tests for a dot of two float16 operands.
- I had to modify test_matmul to use small integers when testing with two float16 operands:
  torch uses float32 internally in this case, so comparing its results with triton caused
  precision issues when dot_out_dtype was float16.
- I also needed to add torch.int8 to the possible datatypes.
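The precision issue described above can be reproduced without torch or triton. The following is a minimal sketch that emulates float16 rounding with the standard library `struct` format `"e"`; the helper names `to_f16`, `dot_f16_acc` and `dot_wide_acc` are hypothetical and are not part of the PR.

```python
import struct


def to_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot_f16_acc(a, b):
    """Dot product whose accumulator is rounded to float16 after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f16(acc + to_f16(x * y))
    return acc


def dot_wide_acc(a, b):
    """Dot product accumulated in double precision and rounded to float16 once."""
    return to_f16(sum(x * y for x, y in zip(a, b)))


# Small integers: every partial sum is exactly representable in float16,
# so both accumulation strategies agree.
small_a = [1.0, -2.0, 2.0, 1.0]
small_b = [2.0, 1.0, -1.0, 2.0]
assert dot_f16_acc(small_a, small_b) == dot_wide_acc(small_a, small_b)

# Larger magnitudes: 2048 + 1 is not representable in float16, so the
# narrow accumulator loses the small terms while the wide one keeps them.
big_a = [2048.0, 1.0, 1.0]
ones = [1.0, 1.0, 1.0]
print(dot_f16_acc(big_a, ones), dot_wide_acc(big_a, ones))
```

With inputs restricted to small integers, the comparison does not depend on which accumulator width each library uses internally.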

Finally, I tried to simplify the logic of matmul/test_matmul: after adding these two
parameters it was hard to follow why each part of the code was needed, so I added a
type_preference_list that gives the allowed dot_out_dtype values for the types of the
operands a and b.
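A minimal sketch of how such a preference list could work, assuming the operand types have already been promoted to a single dtype. The table entries, the name `TYPE_PREFERENCE` and the function `resolve_dot_out_dtype` are illustrative assumptions, and dtypes are written as strings so the snippet runs without torch; the actual list in the PR may differ.

```python
# Hypothetical table: operand dtype -> allowed accumulator dtypes,
# most preferred first. Entries are assumptions for illustration only.
TYPE_PREFERENCE = {
    "float16": ("float32", "float16"),
    "bfloat16": ("float32", "bfloat16"),
    "float32": ("float32",),
    "int8": ("int32",),
}


def resolve_dot_out_dtype(operand_dtype, dot_out_dtype=None):
    """Return the accumulator dtype to use for the given operand dtype.

    With no explicit request, the first (preferred) entry is used.
    An explicit request must be one of the allowed entries.
    """
    allowed = TYPE_PREFERENCE[operand_dtype]
    if dot_out_dtype is None:
        return allowed[0]
    if dot_out_dtype not in allowed:
        raise ValueError(
            f"dot_out_dtype={dot_out_dtype} is not allowed for {operand_dtype}; "
            f"expected one of {allowed}"
        )
    return dot_out_dtype


print(resolve_dot_out_dtype("float16"))             # default choice
print(resolve_dot_out_dtype("float16", "float16"))  # explicit override
```

Keeping the default and the validation in one table means a new operand type only needs one new entry.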

google-cla · 2023-10-31T17:17:35Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

gflegar

Thanks! Definitely much simpler and more versatile than it was before. But I think it can be even simpler and clearer - left some comments to do that.

python/test/unit/operators/test_matmul.py

python/triton/ops/matmul.py

gflegar

Generally LGTM.

A few unresolved comments regarding scaled integers, and the weird view thing, we can also discuss them offline if you want.

python/test/unit/operators/test_matmul.py

This PR allows `acc_dtype` and `output_dtype` to be set manually in the matmul test:

- `acc_dtype`: lets users of the test class specify the type used internally in the dot, instead of the default derived from the two operand types. Several restrictions still apply to these types.
- `output_dtype`: the return type of the matmul. I included a few tests for a dot of two float16 operands.
- I had to modify test_matmul to use a small range of values to prevent numerical issues. When testing two `float16` operands with `acc_dtype` `float16`, I can't force torch to use `float16` internally (it uses `float32`), so I was having precision issues when comparing the results with triton. We do this for all tests, and not only for those particular ones, for simplicity, since we should not be testing precision here. The discussion can be seen in openxla#6 (comment) and openxla#6 (comment), but I do not have a strong opinion, so I am ok with just testing with small integers when the acc_dtype is float16.

karupayun · 2023-12-13T16:17:49Z

This PR was divided between triton-lang#2768, triton-lang#2769 and triton-lang#2760. All of them are already merged.

There are two tests that failed under AddressSanitizer:

* test/TritonGPU/loop-pipeline.mlir
* python/test/regression/test_functional_regressions.py

with an error:

```
==8475==ERROR: AddressSanitizer: heap-use-after-free on address 0x50c000bd0be0 at pc 0x557b03278847 bp 0x7ffd69b2c4a0 sp 0x7ffd69b2c498
READ of size 8 at 0x50c000bd0be0 thread T0
    #0 0x557b03278846 in getNextOperandUsingThisValue third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h:43:58
    #1 0x557b03278846 in operator++ third_party/llvm/llvm-project/mlir/include/mlir/IR/UseDefLists.h:322:39
    #2 0x557b03278846 in mlir::ResultRange::UseIterator::operator++() third_party/llvm/llvm-project/mlir/lib/IR/OperationSupport.cpp:614:5
    #3 0x557affde38c4 in operator++ third_party/llvm/llvm-project/llvm/include/llvm/ADT/iterator.h:281:5
    #4 0x557affde38c4 in createAsyncCopy third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:117:26
    #5 0x557affde38c4 in createAsyncLoad third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:135:3
    #6 0x557affde38c4 in createAsynOps third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:501:5
    #7 0x557affde38c4 in mlir::triton::preProcessLoopAndGetSchedule(mlir::scf::ForOp&, int, mlir::triton::PipeliningOption&) third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp:740:7
    #8 0x557affe01c0c in pipelineLoop third_party/triton/lib/Dialect/TritonGPU/Transforms/Pipeliner/SoftwarePipeliner.cpp:76:19
    ...
```

This is likely happening because the iterator is invalidated after `alloc.erase()`. This PR moves the erasure of allocations outside of the loop and fixes the heap-use-after-free issue.

Do you know if there is an easy way to run the tests under sanitizers upstream? It would be handy if we could automate it, so we catch this kind of error early on.
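The fix described above follows a common pattern: collect what needs to be erased while iterating, then erase after the loop has finished. The following is only a Python analogue of that pattern, with hypothetical names (`erase_dead_allocs`, `is_dead`); the actual change is in C++ and is not shown here. Python raises an error when a dict changes size during iteration instead of reading freed memory, so this illustrates the shape of the fix rather than the original failure.

```python
def erase_dead_allocs(allocs, is_dead):
    """Remove entries for which is_dead(value) is true; return the erased keys."""
    to_erase = []
    # First pass: only read from the container while iterating over it.
    for name, value in allocs.items():
        if is_dead(value):
            to_erase.append(name)
    # Second pass: mutate once no iterator over `allocs` is live.
    for name in to_erase:
        del allocs[name]
    return to_erase


allocs = {"a": 0, "b": 3, "c": 0}
erased = erase_dead_allocs(allocs, lambda use_count: use_count == 0)
print(erased, allocs)
```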

When running [convert_blocked1d_to_slice0](https://github.com/triton-lang/triton/blob/0ba5f0c3cd029d5c3d1f01b9bf29dac32c27345e/test/Conversion/tritongpu_to_llvm.mlir#L924) Triton ends up computing a rank of a matrix with 0 columns during linear layout lowering, which trips up f2reduce, and causes undefined behavior, detectable through [UBSAN](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html).

Fix this by returning the rank (0) early in these cases, without calling f2reduce.

<details><summary>Stack trace</summary>
<p>

```
third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30: runtime error: shift exponent 18446744073709551615 is too large for 64-bit type 'unsigned long long'
    #0 0x556ee2fea3be in inplace_rref_small third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
    #1 0x556ee2fea3be in f2reduce::inplace_rref_strided(unsigned long*, unsigned long, unsigned long, unsigned long) third_party/triton/third_party/f2reduce/f2reduce.cpp:470:9
    #2 0x556ee2ea70da in getMatrixRank third_party/triton/lib/Tools/LinearLayout.cpp:125:3
    #3 0x556ee2ea70da in mlir::triton::LinearLayout::checkInvariants(bool) third_party/triton/lib/Tools/LinearLayout.cpp:299:7
    #4 0x556ee2ea656d in mlir::triton::LinearLayout::tryCreate(llvm::MapVector<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>, llvm::DenseMap<mlir::StringAttr, unsigned int, llvm::DenseMapInfo<mlir::StringAttr, void>, llvm::detail::DenseMapPair<mlir::StringAttr, unsigned int>>, llvm::SmallVector<std::__u::pair<mlir::StringAttr, std::__u::vector<std::__u::vector<int, std::__u::allocator<int>>, std::__u::allocator<std::__u::vector<int, std::__u::allocator<int>>>>>, 0u>>, llvm::ArrayRef<std::__u::pair<mlir::StringAttr, int>>, bool) third_party/triton/lib/Tools/LinearLayout.cpp:190:41
    #5 0x556ee2eb2150 in mlir::triton::LinearLayout::divideRight(mlir::triton::LinearLayout const&) third_party/triton/lib/Tools/LinearLayout.cpp:654:51
    #6 0x556ee2ee1c39 in mlir::cvtNeedsSharedMemory(mlir::RankedTensorType, mlir::RankedTensorType) third_party/triton/lib/Analysis/Utility.cpp:652:14
    #7 0x556ee2cf38fd in mlir::triton::getRepShapeForCvtLayout(mlir::triton::gpu::ConvertLayoutOp) third_party/triton/lib/Analysis/Allocation.cpp:66:8
    #8 0x556ee2cf3efa in mlir::triton::getScratchConfigForCvtLayout(mlir::triton::gpu::ConvertLayoutOp, unsigned int&, unsigned int&) third_party/triton/lib/Analysis/Allocation.cpp:95:19
    #9 0x556ee2cf6057 in mlir::triton::AllocationAnalysis::getScratchValueSize(mlir::Operation*) third_party/triton/lib/Analysis/Allocation.cpp:272:24
    #10 0x556ee2cf5499 in operator() third_party/triton/lib/Analysis/Allocation.cpp:343:7
    #11 0x556ee2cf5499 in void llvm::function_ref<void (mlir::Operation*)>::callback_fn<mlir::triton::AllocationAnalysis::getValuesAndSizes()::'lambda'(mlir::Operation*)>(long, mlir::Operation*) third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45:12
    #12 0x556edeeee7a9 in operator() third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68:12
    #13 0x556edeeee7a9 in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:174:5
    #14 0x556edeeee87c in void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:182:9
    #15 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), mlir::Operation *, void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Visitors.h:313:10
    #16 0x556ee2cf49e7 in walk<(mlir::WalkOrder)0, mlir::ForwardIterator, (lambda at third_party/triton/lib/Analysis/Allocation.cpp:341:42), void> third_party/llvm/llvm-project/mlir/include/mlir/IR/Operation.h:794:12
    #17 0x556ee2cf49e7 in mlir::triton::AllocationAnalysis::getValuesAndSizes() third_party/triton/lib/Analysis/Allocation.cpp:341:16
    #18 0x556ee2cf4852 in run third_party/triton/lib/Analysis/Allocation.cpp:182:5
    #19 0x556ee2cf4852 in AllocationAnalysis third_party/triton/lib/Analysis/Allocation.cpp:169:5
    #20 0x556ee2cf4852 in mlir::Allocation::run(llvm::DenseMap<mlir::FunctionOpInterface, mlir::Allocation, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>, llvm::detail::DenseMapPair<mlir::FunctionOpInterface, mlir::Allocation>>&) third_party/triton/lib/Analysis/Allocation.cpp:627:3
    #21 0x556ee1677402 in operator() third_party/triton/include/triton/Analysis/Allocation.h:227:26
    #22 0x556ee1677402 in void mlir::CallGraph<mlir::Allocation>::doWalk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)>(mlir::FunctionOpInterface, llvm::DenseSet<mlir::FunctionOpInterface, llvm::DenseMapInfo<mlir::FunctionOpInterface, void>>&, mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::CallOpInterface, mlir::FunctionOpInterface), mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp)::'lambda'(mlir::FunctionOpInterface)) third_party/triton/include/triton/Analysis/Utility.h:350:7
    #23 0x556ee16756b3 in walk<(mlir::WalkOrder)0, (mlir::WalkOrder)1, (lambda at third_party/triton/include/triton/Analysis/Allocation.h:222:9), (lambda at third_party/triton/include/triton/Analysis/Allocation.h:224:9)> third_party/triton/include/triton/Analysis/Utility.h:242:7
    #24 0x556ee16756b3 in mlir::ModuleAllocation::ModuleAllocation(mlir::ModuleOp) third_party/triton/include/triton/Analysis/Allocation.h:220:5
    #25 0x556ee2c2bf18 in (anonymous namespace)::AllocateSharedMemory::runOnOperation() third_party/triton/lib/Conversion/TritonGPUToLLVM/AllocateSharedMemory.cpp:26:22
    ...
UndefinedBehaviorSanitizer: invalid-shift-exponent third_party/triton/third_party/f2reduce/f2reduce.cpp:421:30
```

</p>
</details>
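The early-return guard described above can be sketched in Python as a rank computation over GF(2) with rows stored as bitmasks. This is an illustrative stand-in, not the f2reduce or `getMatrixRank` code; the name `gf2_rank` and its signature are assumptions. The reported shift exponent 18446744073709551615 equals 2^64 - 1, the value an unsigned 64-bit `0 - 1` wraps to, which is consistent with a zero-width matrix reaching the reduction, although the f2reduce source is not shown here. Python integers do not overflow on shifts, so the guard below only mirrors the control flow of the fix.

```python
def gf2_rank(rows, num_cols):
    """Rank over GF(2) of a matrix whose rows are bitmasks of width num_cols."""
    # Early exit: a matrix with no rows or no columns has rank 0,
    # so the elimination loop is never entered for degenerate shapes.
    if num_cols == 0 or not rows:
        return 0
    rows = list(rows)  # work on a copy
    rank = 0
    for col in range(num_cols - 1, -1, -1):
        mask = 1 << col
        # Find a row at or below the current rank with a 1 in this column.
        pivot = next((i for i in range(rank, len(rows)) if rows[i] & mask), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Clear this column in every other row (XOR is addition in GF(2)).
        for i in range(len(rows)):
            if i != rank and rows[i] & mask:
                rows[i] ^= rows[rank]
        rank += 1
    return rank


print(gf2_rank([0b101, 0b011, 0b110], 3))
print(gf2_rank([0, 0], 0))
```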

karupayun requested a review from gflegar October 31, 2023 17:17

gflegar requested changes Nov 2, 2023

Fixup: Addressing comments

8fce141

karupayun force-pushed the llvm-head-staging branch 2 times, most recently from 5f10733 to e81f98b Compare November 6, 2023 12:36

vwbaker force-pushed the llvm-head-staging branch from e81f98b to 3422be9 Compare November 20, 2023 13:50

gflegar approved these changes Nov 27, 2023

khasanovaa force-pushed the llvm-head-staging branch from 3cc1a5b to b5fa0a3 Compare November 28, 2023 10:28

Moerafaat force-pushed the llvm-head-staging branch from c00e936 to 2fd0fa3 Compare December 4, 2023 11:36

karupayun mentioned this pull request Dec 5, 2023

Add a matmul test from int8, bf16 triton-lang/triton#2718

Closed

karupayun mentioned this pull request Dec 6, 2023

[TESTS] Allow user defined output_dtype and acc_dtype in matmul tests triton-lang/triton#2769

Merged

gflegar force-pushed the llvm-head-staging branch 2 times, most recently from 18b8839 to ee2f536 Compare December 13, 2023 14:27

karupayun closed this Dec 13, 2023
