
[Paddle Inference] refactor linear_compress #55490

Merged
18 commits merged into PaddlePaddle:develop on Aug 22, 2023

Conversation

lizhenyun01
Contributor

@lizhenyun01 lizhenyun01 commented Jul 17, 2023

PR types

Others

PR changes

Others

Description

Refactor the linear_compress API into weight_only_linear and llm_int8_linear, plus quant_for_infer (CPU) for quantizing the weights (a usage sketch follows this list):

  • weight_only_linear: fuses the weight-only int8/int4 GEMM/GEMV computation (int4 GEMV support pending) and automatically selects GEMM or GEMV based on the shape of x
  • llm_int8_linear: llm.int8 GEMM, with add-bias support added
  • quant_for_infer: quantizes the weight into the format required by weight_only / llm.int8
    API design document
    Pcard-74466
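
For illustration, a minimal usage sketch of the refactored APIs. It assumes the final names settled later in this review (weight_quantize with an algo argument) and the paddle.nn.quant import path; the signatures are inferred from the docstrings and examples quoted below, not taken verbatim from the diff:

import paddle
from paddle.nn.quant import weight_quantize, weight_only_linear  # assumed import path

# float16 inputs; randn results are cast afterwards, matching the docs examples in this PR
x = paddle.cast(paddle.randn([1, 2, 64]), dtype='float16')
w = paddle.cast(paddle.randn([64, 32]), dtype='float16')
bias = paddle.cast(paddle.randn([32]), dtype='float16')

# Quantize the weight (on CPU) into the format the fused kernel expects
qw, scale = weight_quantize(w, algo='weight_only_int8')

# Fused weight-only linear; GEMM vs. GEMV is picked from the shape of x
out = weight_only_linear(x, qw, bias=bias, weight_scale=scale, weight_dtype='int8')
print(out.shape)  # [1, 2, 32]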

@paddle-bot

paddle-bot bot commented Jul 17, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

const float* weight_scale_data = weight_scale.data<float>();
T* out_data = dev_ctx.template Alloc<T>(out);

int64_t m = 1;
Contributor

Not needed?


int64_t m = 1;
int64_t n = 1;
int64_t k = 1;
Contributor

@vivienfanghuagood vivienfanghuagood Jul 18, 2023

Is this declaring an int64_t and then force-casting it? That looks a bit odd.

#endif
}

template <typename T, bool Enable>
Contributor

Enable -> EnableFastGelu may be better?

@@ -3103,6 +3103,48 @@ void QrInferMeta(const MetaTensor& x,
r->set_dtype(x.dtype());
}

void QuantForCompressInferMeta(const MetaTensor& x,
Contributor

Note the naming here: "Quant" already implies "Compress".

from paddle.framework import in_dynamic_mode


def quant_for_compress(x, layout="weight_only_int8"):
Contributor

Same as above: this API name is not very appropriate.

return (out, scale)


def quantized_matmul(
Contributor

Is there still a class-based API?

weight,
bias=None,
weight_scale=None,
quant_method="None",
Contributor

Since weight and bias are involved here, the naming should be Linear rather than matmul.

@paddle-ci-bot

paddle-ci-bot bot commented Jul 27, 2023

Sorry to inform you that the CIs for 5c5a1da passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

@lizhenyun01 lizhenyun01 changed the title refactor linear_compress as quantized_matmul [Paddle Inference] refactor linear_compress Aug 9, 2023
}
}
#else
LOG(ERROR) << "Please compile with cutlass to EnableUseCutlass()";
Contributor

  1. Use PADDLE_THROW(phi::errors::Unimplemented()) instead of LOG(ERROR).
  2. "EnableUseCutlass": please fix the grammar of this message.

Contributor Author

done

Args:
x (Tensor): The input Tensor to be quantized.
layout (str|None): The layout to quantize the Tensor into; must be one of 'weight_only_int8',
'weight_only_int4' and 'llm.int8'. Default: 'weight_only_int8'.
Contributor

  1. "layout" is inaccurate here; weight_only_int8 and the others are quantization types, not layouts.
  2. What is the difference between this API and the separate APIs below?

Contributor Author

The quant_for_infer op was originally used to quantize the weight into the layout required by weight_only and llm.int8; it only processes weights.

Contributor Author

@lizhenyun01 lizhenyun01 Aug 10, 2023

quant_for_infer -> weight_quantize;
layout -> algo
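
Under these renames, a call sketch (the paddle.nn.quant import path is an assumption; the accepted values carry over from the layout docstring above):

import paddle
from paddle.nn.quant import weight_quantize  # assumed final import path

w = paddle.cast(paddle.randn([64, 32]), dtype='float16')
# 'algo' replaces the former 'layout' argument and keeps the same values:
# 'weight_only_int8', 'weight_only_int4', or 'llm.int8'
out, scale = weight_quantize(w, algo='weight_only_int8')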

from paddle import _C_ops
from paddle.fluid.data_feeder import check_variable_and_dtype
from paddle.fluid.layer_helper import LayerHelper
from paddle.framework import in_dynamic_mode
Contributor

Do not import from fluid unless necessary.

Contributor Author

done

@@ -0,0 +1,309 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
Contributor

Unit tests do not need to go under legacy_test.

Contributor Author

Moved to the quantization directory.

Contributor

@heavyrain-lzy heavyrain-lzy left a comment

LGTM for yaml

Contributor

@jzhang533 jzhang533 left a comment

LGTM for API change
It looks like paddle.nn.LinearCompress was never released, so there is no compatibility concern.

bias = paddle.cast(paddle.randn([32]), dtype='float16')
if paddle.device.cuda.get_device_capability()[0] >= 8:
    out = llm_int8_linear(x, weight, bias=bias, weight_scale=scale, threshold=6.0)
    print(out.shape)  # [1, 2, 32]
Contributor

  • These three paddle.cast calls make the example code look quite awkward.
  • This requires Ampere or newer, right? The docs say CUDA version >= 11.2 is required, but the example code checks compute capability, which is somewhat confusing.

Contributor Author

The compute capability check is also needed, mainly because of CI environment constraints; we will see whether the example can be written more cleanly later.
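
To make both documented requirements explicit, a hedged guard sketch (paddle.version.cuda() returning the build's CUDA version string is an assumption):

import paddle

def llm_int8_supported():
    # Requires CUDA >= 11.2 per the docs, and compute capability >= 8
    # (Ampere or newer) per the example code above.
    if not paddle.is_compiled_with_cuda():
        return False
    major, minor = (int(v) for v in paddle.version.cuda().split('.')[:2])
    if (major, minor) < (11, 2):
        return False
    return paddle.device.cuda.get_device_capability()[0] >= 8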

Contributor

@zhangbo9674 zhangbo9674 left a comment

LGTM for the use of print.

@heavengate heavengate merged commit ffff3da into PaddlePaddle:develop Aug 22, 2023
BeingGod pushed a commit to BeingGod/Paddle that referenced this pull request Sep 9, 2023
* Modify kernels to support quantized_matmul

---------

Co-authored-by: superxf <1208713646@qq.com>
@lizhenyun01 lizhenyun01 deleted the quantized_matmul branch July 18, 2024 09:09