
Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file #8799

Merged: 8 commits merged into PaddlePaddle:develop on Jul 31, 2024

Conversation


@Hanyonggong (Contributor) commented Jul 24, 2024

PR types

New features

PR changes

Description

Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file

Usage

tune_cublaslt_gemm(M, K, N, dtype, is_test, is_read_from_file, path)
Parameters:

  • M, K, N: paddle::Tensor; specify the matrix shapes [M, K] * [K, N] = [M, N]
  • dtype: data type of the matrix elements
  • is_test: whether to generate the matmul configuration by searching
  • is_read_from_file: whether to load the matmul configuration from an offline file
  • path: path of the generated file or of the offline file
# Example
# tune_cublaslt_gemm is provided by the compiled custom-ops library.
import paddle

M_tensor = paddle.to_tensor([1024])
K_tensor = paddle.to_tensor([2048])
N_tensor = paddle.to_tensor([11008])
dtype = "int8"
path = "./search.csv"

tune_cublaslt_gemm(M_tensor, K_tensor, N_tensor, dtype, True, False, path)

Cache file format

Format (each line): M,K,N,algoId,swizzle,customOption,tile,splitK_val,reductionScheme,stages,workspaceSize,time
Example: 2048,2048,4096,21,0,0,24,8,3,15,0,0.0701338
Fields (mirrored by the struct sketch after this list):

  • M,K,N: matrix shapes [M, K] * [K, N] = [M, N]
  • algoId: ID of the algorithm to initialize
  • swizzle: whether CTA swizzling is enabled
  • customOption: custom option value
  • tile: tile ID, used to set the tile size (rows * columns)
  • splitK_val: number of splits along K
  • reductionScheme: reduction scheme used when splitK_val > 1
  • stages: stages ID, configuring the size and number of shared-memory buffers used to stage input elements
  • workspaceSize: required workspace memory size
  • time: measured run time
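
For illustration only, a minimal C++ struct that mirrors one line of the cache file (the struct name is hypothetical and not part of the operator's code):

// Hypothetical helper: one parsed line of the algo cache file.
struct GemmAlgoCacheEntry {
  int m, k, n;          // matrix shapes [M, K] * [K, N] = [M, N]
  int algoId;           // cuBLASLt algorithm ID
  int swizzle;          // CTA swizzling enabled or not
  int customOption;     // custom option value
  int tile;             // tile ID (tile size = rows * columns)
  int splitK_val;       // number of splits along K
  int reductionScheme;  // reduction scheme when splitK_val > 1
  int stages;           // stages ID (shared-memory staging buffers)
  int workspaceSize;    // required workspace size
  float time;           // measured run time
};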


paddle-bot bot commented Jul 24, 2024

Thanks for your contribution!


CLAassistant commented Jul 24, 2024

CLA assistant check
All committers have signed the CLA.


codecov bot commented Jul 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 55.50%. Comparing base (da1eb9c) to head (ca515f2).
Report is 238 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8799      +/-   ##
===========================================
+ Coverage    55.37%   55.50%   +0.13%     
===========================================
  Files          631      631              
  Lines        99707    98544    -1163     
===========================================
- Hits         55211    54701     -510     
+ Misses       44496    43843     -653     



@yuanlehome (Collaborator) left a comment


Please tidy up the LOG output at the key places so that the text logged while the operator runs is easy to read.

@@ -49,6 +49,7 @@ def get_gencode_flags():


gencode_flags = get_gencode_flags()
library_path = os.environ.get('LD_LIBRARY_PATH')
Collaborator


What if this environment variable cannot be found?

Comment on lines 22 to 28
M.extend(range(1, 4, 1))
M.extend(range(4, 16, 4))
M.extend(range(16, 64, 16))
M.extend(range(64, 256, 32))
M.extend(range(256, 512, 64))
M.extend(range(512, 1024, 128))
M.extend(range(1024, 8193, 1024))
Collaborator


It looks like M does not need to be passed in as a list; a single integer value would be enough.

Dtype = "int8"
Path = "./search.csv"

Tune_gemm(M, K, N, Dtype, Path)
Collaborator


Let's rename the operator to tune_cublaslt_gemm for now and discuss a better name later; the first letter should not be capitalized.

Comment on lines 546 to 548
cudaMalloc((void**)&A_dev, A.size() * sizeof(int8_t));
cudaMalloc((void**)&B_dev, B.size() * sizeof(int8_t));
cudaMalloc((void**)&C_dev, m * n * sizeof(int32_t));
Collaborator


For the internal device-memory management (cudaMalloc/cudaFree, etc.), please use paddle::Tensor instead, for example as sketched below.
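
A rough sketch of how the device buffers could be allocated through paddle::Tensor instead of raw cudaMalloc (assuming the custom-op extension API paddle::empty is usable here; names are illustrative):

// Sketch only: let paddle::Tensor own the device memory instead of cudaMalloc/cudaFree.
auto A_tensor = paddle::empty({static_cast<int64_t>(A.size())},
                              paddle::DataType::INT8, paddle::GPUPlace());
auto B_tensor = paddle::empty({static_cast<int64_t>(B.size())},
                              paddle::DataType::INT8, paddle::GPUPlace());
auto C_tensor = paddle::empty({static_cast<int64_t>(m), static_cast<int64_t>(n)},
                              paddle::DataType::INT32, paddle::GPUPlace());

int8_t* A_dev = A_tensor.data<int8_t>();
int8_t* B_dev = B_tensor.data<int8_t>();
int32_t* C_dev = C_tensor.data<int32_t>();
// No explicit cudaFree is needed; the tensors release the memory when they go out of scope.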

See the License for the specific language governing permissions and
limitations under the License. */

#pragma once
Collaborator


A .cu file does not need this.

Comment on lines 27 to 29
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 890)
#include <cuda_fp8.h>
#endif
Collaborator


FP8 is not supported yet; please remove this for now.

Comment on lines 744 to 748
void TestBench(const std::vector<int64_t>& M,
const std::vector<int64_t>& K,
const std::vector<int64_t>& N,
const std::string dtype,
const std::string path) {
Collaborator


Keep the function name consistent with the operator name, and change these function parameters into operator inputs rather than attributes.

Comment on lines 786 to 790
.Attrs({"M :std::vector<int64_t>",
"K :std::vector<int64_t>",
"N :std::vector<int64_t>",
"dtype: std::string",
"path: std::string",})
Collaborator


These should be operator inputs rather than attributes; M, K and N are all paddle::Tensor objects. A registration sketch follows below.
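
A rough sketch of what the registration could look like with M, K, N as inputs instead of attributes (illustrative only; the output list and the kernel function name TuneCublasltGemm are placeholders, and the final signature may differ):

// Sketch: M, K, N become operator inputs (paddle::Tensor); only scalar options stay as attributes.
PD_BUILD_OP(tune_cublaslt_gemm)
    .Inputs({"M", "K", "N"})
    .Outputs({"Out"})  // placeholder output; the real op may differ
    .Attrs({"dtype: std::string",
            "is_test: bool",
            "is_read_from_file: bool",
            "path: std::string"})
    .SetKernelFn(PD_KERNEL(TuneCublasltGemm));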

Comment on lines 629 to 652
int m_tmp, k_tmp, n_tmp;
FILE* fp;
fp = fopen(path.c_str(), "r");
if (!fp) {
using_default_config();
} else {
bool match = false;
int find_cnt = 0;
while (1) {
fscanf(fp,
"%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%f",
&m_tmp,
&k_tmp,
&n_tmp,
&algoId,
&swizzle,
&customOption,
&tile,
&splitK_val,
&reductionScheme,
&stages,
&work_space_size,
&time_ref);
if (feof(fp)) break;
Collaborator


Please rewrite this in C++ style; you can refer to the implementation in the PR I sent you earlier. See the sketch below.
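
For instance, the fscanf loop could be replaced by something along these lines, using std::ifstream and std::getline (a sketch only, not the exact implementation referenced in the comment; requires <fstream>, <sstream>, <string>, <vector>):

// Sketch: read the cache file line by line with C++ streams instead of fscanf.
std::ifstream fin(path);
if (!fin.is_open()) {
  using_default_config();
} else {
  std::string line;
  while (std::getline(fin, line)) {
    std::stringstream ss(line);
    std::string field;
    std::vector<std::string> fields;
    while (std::getline(ss, field, ',')) {
      fields.push_back(field);
    }
    if (fields.size() != 12) continue;  // skip malformed lines
    int m_tmp = std::stoi(fields[0]);
    int k_tmp = std::stoi(fields[1]);
    int n_tmp = std::stoi(fields[2]);
    // ... remaining columns: algoId, swizzle, customOption, tile, splitK_val,
    //     reductionScheme, stages, workspaceSize (std::stoi) and time (std::stof)
  }
}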

Comment on lines 251 to 267
cublasLtMatmulAlgoCapGetAttribute(&algo,
CUBLASLT_ALGO_CAP_SPLITK_SUPPORT,
&splitkSupport,
sizeof(splitkSupport),
&sizeWritten);
std::clog << "splitkSupport " << splitkSupport;
cublasLtMatmulAlgoCapGetAttribute(&algo,
CUBLASLT_ALGO_CAP_REDUCTION_SCHEME_MASK,
&redMask,
sizeof(redMask),
&sizeWritten);
cublasLtMatmulAlgoCapGetAttribute(&algo,
CUBLASLT_ALGO_CAP_CTA_SWIZZLING_SUPPORT,
&swizzlingMax,
sizeof(swizzlingMax),
&sizeWritten);
cublasLtMatmulAlgoCapGetAttribute(&algo,
Collaborator


Wrap all CUDA API calls in a macro that checks the return value, similar to CUDA_CHECK.

Collaborator


#define CUDA_CHECK(call)                           \
  do {                                             \
    const cudaError_t error_code = call;           \
    if (error_code != cudaSuccess) {               \
      std::printf("at %s:%d - %s.\n",              \
                  __FILE__,                        \
                  __LINE__,                        \
                  cudaGetErrorString(error_code)); \
      exit(1);                                     \
    }                                              \
  } while (0)
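
Since the cublasLt calls return cublasStatus_t rather than cudaError_t, a matching check macro could look like the following (a sketch along the same lines as CUDA_CHECK above, printing the raw status code):

#define CUBLAS_CHECK(call)                                \
  do {                                                    \
    const cublasStatus_t status_code = call;              \
    if (status_code != CUBLAS_STATUS_SUCCESS) {           \
      std::printf("cuBLAS error at %s:%d - status %d.\n", \
                  __FILE__,                               \
                  __LINE__,                               \
                  static_cast<int>(status_code));         \
      exit(1);                                            \
    }                                                     \
  } while (0)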

@Hanyonggong changed the title from "add tune_gemm feature" to "Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file" on Jul 26, 2024
N_tensor = paddle.to_tensor([4096, 8192])

Dtype = "int8"
Path = "./search.csv"
Contributor


Is this file only an example that is never actually run? There does not seem to be a search.csv file.

Collaborator


This is the output path; the file is generated after the run finishes.

public:
CUBLASLTContext() { CUDA_CHECK(cublasLtCreate(&handle_)); }

cublasLtHandle_t handle_;
Contributor


Is this meant to be a private member? Shouldn't it be moved down? The private section is currently empty. See the sketch below.
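
For example, the handle could be moved into the private section along the following lines (a sketch of the suggested change, with a destructor and an accessor added for illustration):

class CUBLASLTContext {
 public:
  CUBLASLTContext() { CUDA_CHECK(cublasLtCreate(&handle_)); }
  ~CUBLASLTContext() { cublasLtDestroy(handle_); }

  // Accessor so callers no longer touch the member directly.
  cublasLtHandle_t handle() const { return handle_; }

 private:
  cublasLtHandle_t handle_;
};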

Collaborator


@Hanyonggong please change this~

Contributor Author


Updated.


@DesmonDay (Contributor) left a comment


LGTM

@DesmonDay DesmonDay merged commit 99d9bcb into PaddlePaddle:develop Jul 31, 2024
11 of 12 checks passed
DrownFish19 pushed a commit to DrownFish19/PaddleNLP that referenced this pull request Aug 2, 2024
5 participants