Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file #8799
Conversation
Thanks for your contribution!
Codecov Report
All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           develop    #8799      +/-   ##
===========================================
+ Coverage    55.37%   55.50%   +0.13%
===========================================
  Files          631      631
  Lines        99707    98544    -1163
===========================================
- Hits         55211    54701     -510
+ Misses       44496    43843     -653

☔ View full report in Codecov by Sentry.
Please tidy up the LOG statements at the key points so that the log output printed while the operator runs is easy to read.
csrc/setup_cuda.py
Outdated
@@ -49,6 +49,7 @@ def get_gencode_flags():

gencode_flags = get_gencode_flags()
library_path = os.environ.get('LD_LIBRARY_PATH')
What happens if this environment variable cannot be found?
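One way to address this review comment is a defensive lookup: `os.environ.get` returns `None` when the variable is unset, so the build script should fail fast or fall back instead of passing `None` onward. A minimal sketch (the empty-string fallback is purely illustrative):

```python
import os

# os.environ.get returns None when LD_LIBRARY_PATH is unset; handle that
# case explicitly instead of letting None propagate into the build flags.
library_path = os.environ.get('LD_LIBRARY_PATH')
if library_path is None:
    # Fallback value shown here is illustrative only.
    library_path = ''
    print('Warning: LD_LIBRARY_PATH is not set; using an empty search path.')
```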
csrc/generation/test_gemm.py
Outdated
M.extend(range(1, 4, 1))
M.extend(range(4, 16, 4))
M.extend(range(16, 64, 16))
M.extend(range(64, 256, 32))
M.extend(range(256, 512, 64))
M.extend(range(512, 1024, 128))
M.extend(range(1024, 8193, 1024))
M does not look like it needs to be passed in as a list; a single integer value should be enough.
csrc/generation/test_gemm.py
Outdated
Dtype = "int8"
Path = "./search.csv"

Tune_gemm(M, K, N, Dtype, Path)
Rename the operator to tune_cublaslt_gemm for now; a better name can be discussed later. The first letter should not be capitalized.
csrc/generation/tune_gemm.cu
Outdated
cudaMalloc((void**)&A_dev, A.size() * sizeof(int8_t));
cudaMalloc((void**)&B_dev, B.size() * sizeof(int8_t));
cudaMalloc((void**)&C_dev, m * n * sizeof(int32_t));
For the internal device-memory handling (cudaMalloc/cudaFree and the like), please use paddle::Tensor instead.
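A sketch of the suggested change, assuming the Paddle custom-operator extension API (`paddle::full` allocates a device buffer that is freed automatically when the tensor goes out of scope), with shapes and names following the snippet above:

```cpp
// Let paddle::Tensor own the buffers instead of raw cudaMalloc/cudaFree.
auto A_tensor = paddle::full({static_cast<int64_t>(A.size())}, 0,
                             paddle::DataType::INT8, paddle::GPUPlace());
auto B_tensor = paddle::full({static_cast<int64_t>(B.size())}, 0,
                             paddle::DataType::INT8, paddle::GPUPlace());
auto C_tensor = paddle::full({m, n}, 0,
                             paddle::DataType::INT32, paddle::GPUPlace());
int8_t* A_dev = A_tensor.data<int8_t>();
int8_t* B_dev = B_tensor.data<int8_t>();
int32_t* C_dev = C_tensor.data<int32_t>();
```

This removes the matching cudaFree calls and any leak on early return, since the tensors release their memory when they are destroyed.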
csrc/generation/tune_gemm.cu
Outdated
See the License for the specific language governing permissions and
limitations under the License. */

#pragma once
A .cu file does not need this.
csrc/generation/tune_gemm.cu
Outdated
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 890)
#include <cuda_fp8.h>
#endif
FP8 is not supported yet; please remove this for now.
csrc/generation/tune_gemm.cu
Outdated
void TestBench(const std::vector<int64_t>& M,
               const std::vector<int64_t>& K,
               const std::vector<int64_t>& N,
               const std::string dtype,
               const std::string path) {
Keep the function name consistent with the operator name, and turn these function parameters into operator inputs rather than attributes.
csrc/generation/tune_gemm.cu
Outdated
.Attrs({"M :std::vector<int64_t>",
        "K :std::vector<int64_t>",
        "N :std::vector<int64_t>",
        "dtype: std::string",
        "path: std::string",})
These should be operator inputs rather than attributes; M, N, and K are all paddle::Tensor objects.
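A sketch of the requested registration, assuming Paddle's custom-operator macros (the kernel function name is hypothetical, and any outputs the op may need are omitted for brevity):

```cpp
PD_BUILD_OP(tune_cublaslt_gemm)
    .Inputs({"M", "K", "N"})          // paddle::Tensor inputs, not Attrs
    .Attrs({"dtype: std::string",
            "path: std::string"})
    .SetKernelFn(PD_KERNEL(TuneCublasltGemm));
```

With this shape, the kernel signature becomes something like `void TuneCublasltGemm(const paddle::Tensor& M, const paddle::Tensor& K, const paddle::Tensor& N, const std::string& dtype, const std::string& path)`.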
csrc/generation/tune_gemm.cu
Outdated
int m_tmp, k_tmp, n_tmp;
FILE* fp;
fp = fopen(path.c_str(), "r");
if (!fp) {
  using_default_config();
} else {
  bool match = false;
  int find_cnt = 0;
  while (1) {
    fscanf(fp,
           "%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%f",
           &m_tmp,
           &k_tmp,
           &n_tmp,
           &algoId,
           &swizzle,
           &customOption,
           &tile,
           &splitK_val,
           &reductionScheme,
           &stages,
           &work_space_size,
           &time_ref);
    if (feof(fp)) break;
Please rewrite this with C++ idioms; you can refer to the implementation in the PR I sent you earlier.
csrc/generation/tune_gemm.cu
Outdated
cublasLtMatmulAlgoCapGetAttribute(&algo,
                                  CUBLASLT_ALGO_CAP_SPLITK_SUPPORT,
                                  &splitkSupport,
                                  sizeof(splitkSupport),
                                  &sizeWritten);
std::clog << "splitkSupport " << splitkSupport;
cublasLtMatmulAlgoCapGetAttribute(&algo,
                                  CUBLASLT_ALGO_CAP_REDUCTION_SCHEME_MASK,
                                  &redMask,
                                  sizeof(redMask),
                                  &sizeWritten);
cublasLtMatmulAlgoCapGetAttribute(&algo,
                                  CUBLASLT_ALGO_CAP_CTA_SWIZZLING_SUPPORT,
                                  &swizzlingMax,
                                  sizeof(swizzlingMax),
                                  &sizeWritten);
cublasLtMatmulAlgoCapGetAttribute(&algo,
Wrap every CUDA API call in a macro that checks the return value, similar to CUDA_CHECK.
#define CUDA_CHECK(call) \
do { \
const cudaError_t error_code = call; \
if (error_code != cudaSuccess) { \
std::printf("at %s:%d - %s.\n", \
__FILE__, \
__LINE__, \
cudaGetErrorString(error_code)); \
exit(1); \
} \
} while (0)
N_tensor = paddle.to_tensor([4096, 8192])

Dtype = "int8"
Path = "./search.csv"
Is this file just an example that is never actually run? There does not seem to be a search.csv file.
This is the output path; the file is generated once the run finishes.
public:
  CUBLASLTContext() { CUDA_CHECK(cublasLtCreate(&handle_)); }

  cublasLtHandle_t handle_;
Is this meant to be a private member? Should it be moved down? The private section is currently empty.
@Hanyonggong please change this~
Changed.
LGTM
…te algo cache file (PaddlePaddle#8799) * Add tune_gemm feature
PR types
New features
PR changes
Description
Add tune_cublaslt_gemm operator by cublaslt gemm algorithm and generate algo cache file
Usage
tune_cublaslt_gemm(M, K, N, dtype, is_test, is_read_form_file, path)
parameter:
cache file format
format (each line): M,K,N,algoId,swizzle,customOption,tile,splitK_val,reductionScheme,stages,workspaceSize,time
example: 2048,2048,4096,21,0,0,24,8,3,15,0,0.0701338
parameter: