[Paddle Inference] refactor linear_compress #55490
Conversation
Your PR has been submitted successfully. Thank you for contributing to this open-source project!
const float* weight_scale_data = weight_scale.data<float>();
T* out_data = dev_ctx.template Alloc<T>(out);

int64_t m = 1;
Not needed?
int64_t m = 1;
int64_t n = 1;
int64_t k = 1;
This declares an int64_t and then force-casts it? That looks a bit odd.
#endif
}

template <typename T, bool Enable>
Enable -> EnableFastGelu may be better?
paddle/phi/infermeta/unary.cc (Outdated)
@@ -3103,6 +3103,48 @@ void QrInferMeta(const MetaTensor& x,
  r->set_dtype(x.dtype());
}

void QuantForCompressInferMeta(const MetaTensor& x,
Note the naming here: Quant already implies Compress.
from paddle.framework import in_dynamic_mode


def quant_for_compress(x, layout="weight_only_int8"):
Same as above; this API name is not very appropriate.
    return (out, scale)


def quantized_matmul(
Is there still a class-based API?
    weight,
    bias=None,
    weight_scale=None,
    quant_method="None",
Since there are weight and bias concepts here, the name should be Linear rather than matmul.
Sorry to inform you that 5c5a1da's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
    }
  }
#else
  LOG(ERROR) << "Please compile with cutlass to EnableUseCutlass()";
- PADDLE_THROW(phi::errors::Unimplemented())
- EnableUseCutlass: please use correct grammar in this message.
done
    Args:
        x (Tensor): The input Tensor to be quantized.
        layout (str|None): The layout the Tensor is quantized to, must be one of 'weight_only_int8',
            'weight_only_int4' and 'llm.int8', default: 'weight_only_int8'.
- The meaning of layout is inaccurate; weight_only_int8 and the others are quantization types, not layouts.
- What is the difference between this API and the separate APIs below?
The quant_for_infer op was originally meant to quantize the weight into the layout required by weight_only and llm.int8; it is only used for weights.
quant_for_infer -> weight_quantize;
layout -> algo
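A minimal sketch of how the API might look after this renaming (the paddle.nn.quant module path, the return values, and the float16 input are assumptions based on the suggestion, not the merged code):

```python
import paddle
# Hypothetical usage following the suggested names
# (quant_for_infer -> weight_quantize, layout -> algo); path and return order are assumptions.
from paddle.nn.quant import weight_quantize

weight = paddle.randn([64, 32]).astype('float16')
quant_weight, scale = weight_quantize(weight, algo="weight_only_int8")
print(quant_weight.shape, scale.shape)
```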
from paddle import _C_ops
from paddle.fluid.data_feeder import check_variable_and_dtype
from paddle.fluid.layer_helper import LayerHelper
from paddle.framework import in_dynamic_mode
Avoid importing from under fluid unless necessary.
done
@@ -0,0 +1,309 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. |
The unit test doesn't need to go under legacy_test.
Moved to the quantization directory.
LGTM for yaml
LGTM for API change
It looks like paddle.nn.LinearCompress has never been released, so there is no compatibility concern.
bias = paddle.cast(paddle.randn([32]), dtype='float16')
if paddle.device.cuda.get_device_capability()[0] >= 8:
    out = llm_int8_linear(x, weight, bias=bias, weight_scale=scale, threshold=6.0)
    print(out.shape)  # [1, 2, 32]
- These three paddle.cast calls make the example code look rather awkward.
- This requires Ampere to be usable, right? The docs say CUDA version >= 11.2 is required, while the example code checks compute capability, which is a bit confusing.
Compute capability is also required, mainly because of CI environment constraints; we will see later whether the example can be written more cleanly.
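As one possible cleanup, here is a hedged sketch that guards on both constraints mentioned in this thread (CUDA toolkit version and compute capability); the paddle.nn.quant.llm_int8_linear import path and the tensor shapes are assumptions taken from the quoted example:

```python
import paddle
from paddle.nn.quant import llm_int8_linear  # module path is an assumption

# Check both requirements discussed above: CUDA >= 11.2 and an Ampere-or-newer GPU (sm80+).
cuda_ok = paddle.version.cuda() >= "11.2"  # rough string comparison
cc_ok = paddle.device.cuda.get_device_capability()[0] >= 8

if cuda_ok and cc_ok:
    x = paddle.randn([1, 2, 64]).astype('float16')
    weight = paddle.randint(0, 127, [32, 64]).astype('int8')
    scale = paddle.rand([32], dtype='float32')
    bias = paddle.randn([32]).astype('float16')
    out = llm_int8_linear(x, weight, bias=bias, weight_scale=scale, threshold=6.0)
    print(out.shape)  # expected: [1, 2, 32]
```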
LGTM for using print
* Modify kernels to support quantized_matmul

Co-authored-by: superxf <1208713646@qq.com>
PR types
Others
PR changes
Others
Description
Refactor the linear_compress API into weight_only_linear and llm_int8_linear, plus quant_for_infer (CPU) for quantizing weights.
API design document
Pcard-74466
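To make the description above concrete, here is a hedged end-to-end sketch of the refactored flow (the paddle.nn.quant module path, argument names, and shapes are assumptions based on this PR's discussion, not necessarily the merged signatures):

```python
import paddle
from paddle.nn.quant import weight_quantize, weight_only_linear  # assumed paths/names

# Quantize the weight once (CPU-side prep op), then run the weight-only linear kernel.
x = paddle.randn([1, 2, 64]).astype('float16')
weight = paddle.randn([64, 32]).astype('float16')
bias = paddle.randn([32]).astype('float16')

# Assumed to return the packed int8 weight plus per-channel scales.
quant_weight, scale = weight_quantize(weight, algo="weight_only_int8")

out = weight_only_linear(x, quant_weight, bias=bias, weight_scale=scale,
                         weight_dtype="int8")
print(out.shape)  # expected: [1, 2, 32]
```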