
[INFER] llama&qwen2 A8W8 support skip_scale #8987

Closed
wants to merge 17 commits

Conversation

@ming1753 (Contributor) commented Aug 22, 2024

PR types

Bug fixes

PR changes

Models

Description

Support skip-layer quantization for llama and qwen2 W8A8 (selected layers can skip quantization).
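
For context on the diffs reviewed below, the PR uses a per-layer weight scale of -1 as the sentinel marking a layer whose quantization is skipped. A minimal, hypothetical illustration of that convention (plain NumPy; not the actual PaddleNLP loader code):

```python
import numpy as np

# Hypothetical per-layer scale table; -1 means "this layer was not quantized,
# keep its weights in the default floating-point dtype".
weight_scales = {
    "out_linear_weight_scale": [
        np.array([0.12, 0.08]),  # layer 0: real scales -> quantize to int8
        np.array([-1.0]),        # layer 1: sentinel -> skip quantization
    ],
}

for idx, scale in enumerate(weight_scales["out_linear_weight_scale"]):
    if np.all(scale == -1):
        print(f"layer {idx}: skip quantization, keep default dtype")
    else:
        print(f"layer {idx}: quantize weights to int8")
```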

paddle-bot bot commented Aug 22, 2024

Thanks for your contribution!

codecov bot commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 0% with 231 lines in your changes missing coverage. Please review.

Project coverage is 53.25%. Comparing base (db270d9) to head (fa56eaf).
Report is 33 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...erimental/transformers/fused_transformer_layers.py | 0.00% | 115 Missing ⚠️ |
| ...dlenlp/experimental/transformers/llama/modeling.py | 0.00% | 55 Missing ⚠️ |
| ...dlenlp/experimental/transformers/qwen2/modeling.py | 0.00% | 54 Missing ⚠️ |
| ...dlenlp/experimental/transformers/bloom/modeling.py | 0.00% | 1 Missing ⚠️ |
| ...enlp/experimental/transformers/chatglm/modeling.py | 0.00% | 1 Missing ⚠️ |
| ...p/experimental/transformers/chatglm_v2/modeling.py | 0.00% | 1 Missing ⚠️ |
| ...addlenlp/experimental/transformers/gpt/modeling.py | 0.00% | 1 Missing ⚠️ |
| ...addlenlp/experimental/transformers/opt/modeling.py | 0.00% | 1 Missing ⚠️ |
| ...ddlenlp/experimental/transformers/qwen/modeling.py | 0.00% | 1 Missing ⚠️ |
| ...lp/experimental/transformers/qwen2_moe/modeling.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8987      +/-   ##
===========================================
- Coverage    53.29%   53.25%   -0.05%     
===========================================
  Files          652      652              
  Lines       105483   105576      +93     
===========================================
  Hits         56222    56222              
- Misses       49261    49354      +93     

☔ View full report in Codecov by Sentry.

@ming1753 changed the title from "llama A8W8 support skip_scale" to "llama&qwen2 A8W8 support skip_scale" on Aug 27, 2024
@@ -700,6 +623,97 @@ def __init__(self, config: FusedMultiTransformerConfig):

self.linear = fused_linear

def init_weight(self):

Collaborator:

init_weight -> init_weights ?

Contributor (Author):

I don't think it's really necessary; init_weight_shape and get_weight_dtype are singular too.

@@ -1773,7 +1787,101 @@ def __init__(self, config: FusedMultiTransformerConfig):
self._add_parameter(ffn2_shift)
self._add_parameter(ffn2_smooth)

def get_weight_create_dype(self):
def init_weight(self):

Collaborator:

init_weight -> init_weights


def get_weight_create_dype(self, layer_name=None, layer_idx=None):
if layer_name is not None and layer_idx is not None:
if hasattr(self, "weight_scales") and np.all(self.weight_scales[layer_name][layer_idx] == -1):

Collaborator:

There are similar checks in quite a few places now. I suggest wrapping
if hasattr(self, "weight_scales") and np.all(self.weight_scales[layer_name][layer_idx] == -1)
into a function and adding a comment.

Contributor (Author):

Done, wrapped it as skip_quant.
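
A minimal sketch of what such a skip_quant helper might look like, based on the check quoted above (the name comes from the reply; the exact placement and signature in the merged code may differ):

```python
import numpy as np


class FusedMultiTransformerSketch:  # illustrative class name only
    def skip_quant(self, layer_name, layer_idx):
        """Return True when quantization is skipped for this layer.

        A weight scale of -1 is the sentinel meaning the layer was left
        unquantized, so its weights keep the default floating-point dtype.
        """
        return hasattr(self, "weight_scales") and np.all(
            self.weight_scales[layer_name][layer_idx] == -1
        )
```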

@@ -326,6 +335,10 @@ def _post_process_(outputs, top_p, temperature, step_idx_ori, model_kwargs):
)
logits = logits / temperature

# sample
if self.config.top_k is not None and self.config.top_k != 0:

Collaborator:

Please remove the top_k logic. It was a special business requirement; a generic post-processing graph doesn't do this.

Contributor (Author):

Removed.

@@ -935,10 +985,16 @@ def set_state_dict(self, state_dict):
self.transformer_block.linear_weights[idx].set_value(linear_quanted_weight_tensor)
self.transformer_block.linear_weights_scale[idx].set_value(linear_weight_scale_tensor)
elif "a8w8" in self.quant_type:
w_dtype = (
paddle.get_default_dtype()
if np.all(weight_scales_loader.scale["out_linear_weight_scale"][idx] == -1)

Collaborator:

Same here: checks like if np.all(weight_scales_loader.scale["out_linear_weight_scale"][idx] == -1) should be wrapped into a function with a comment.

Contributor (Author):

Done
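
The loader-side check could be factored the same way; a hypothetical helper (names are illustrative, not taken from the merged code):

```python
import numpy as np


def weight_skip_quant(scale_dict, key, idx):
    """True when the per-layer scale is the -1 sentinel, i.e. this layer's
    weights were left unquantized and keep the default dtype."""
    return np.all(scale_dict[key][idx] == -1)


# Sketch of the call site quoted above (the dtype for the quantized branch is assumed):
# w_dtype = (
#     paddle.get_default_dtype()
#     if weight_skip_quant(weight_scales_loader.scale, "out_linear_weight_scale", idx)
#     else "int8"
# )
```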

num_key_value_heads=self.num_key_value_heads,
mp_size=self.config.tensor_parallel_degree,
)
self.transformer_block.act_scales = act_scale_loader.scale

@ckl117 (Contributor) commented Sep 12, 2024:

This line can't be deleted; otherwise inference won't find act_scales later.

Contributor (Author):

Fixed.

Collaborator:

What about this line in the llama file?

@ZHUI changed the title from "llama&qwen2 A8W8 support skip_scale" to "[INFER] llama&qwen2 A8W8 support skip_scale" on Sep 13, 2024
@yuanlehome (Collaborator):

Merged with #9197

@yuanlehome closed this on Sep 27, 2024