
[PEFT]Add LoRA-GA #9592

Merged
merged 15 commits into PaddlePaddle:develop on Dec 18, 2024

Conversation

greycooker (Contributor) commented Dec 9, 2024

PR types

New features

PR changes

[APIs]Add a new finetuning method LoRA-GA

Description

Original PR: #9387
Reference paper: https://arxiv.org/pdf/2407.05000
Reference open-source implementation: https://github.com/Outsider565/LoRA-GA
Implements the LoRA-GA fine-tuning method (a sketch of the initialization idea follows below).
Supports the tp, dp, and sharding distributed strategies.
Supports resuming training (both unified checkpoint and pdparams).
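
For context, LoRA-GA estimates the averaged gradient of each base weight over a few warmup batches and initializes lora_A/lora_B from its SVD, so that the first low-rank update approximates the first full fine-tuning step. A minimal sketch of the idea in Paddle (the function name, the exact slice assignment, and the scaling are illustrative, not the PR's actual API):

import paddle

def loraga_init(weight, grad, rank, scaling):
    # SVD of the gradient averaged over the warmup batches.
    u, s, vh = paddle.linalg.svd(grad.astype("float32"), full_matrices=False)
    # Disjoint singular directions are assigned to A and B (which slice
    # goes where follows the paper; treat these slices as illustrative).
    init_loraA = u[:, :rank]             # [in_features, rank]
    init_loraB = vh[rank : 2 * rank, :]  # [rank, out_features]
    # "Reinit base model": subtract the non-zero product so the adapted
    # model's initial output matches the original model.
    offset = init_loraA @ init_loraB
    new_weight = weight - scaling * offset
    return new_weight, init_loraA, init_loraB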

paddle-bot bot commented Dec 9, 2024

Thanks for your contribution!

codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 5.52995% with 205 lines in your changes missing coverage. Please review.

Project coverage is 52.66%. Comparing base (2231feb) to head (275c623).
Report is 1 commit behind head on develop.

Files with missing lines                               Patch %   Lines
paddlenlp/peft/lora/loraga_utils.py                      0.00%   171 Missing ⚠️
paddlenlp/peft/lora/lora_model.py                       25.71%    26 Missing ⚠️
paddlenlp/trainer/trainer.py                            25.00%     3 Missing ⚠️
paddlenlp/trainer/unified_checkpoint/utils.py           25.00%     3 Missing ⚠️
...rainer/unified_checkpoint/load_save_single_card.py    0.00%     1 Missing ⚠️
...p/trainer/unified_checkpoint/unified_checkpoint.py    0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9592      +/-   ##
===========================================
- Coverage    52.75%   52.66%   -0.09%     
===========================================
  Files          711      712       +1     
  Lines       111483   111691     +208     
===========================================
+ Hits         58810    58826      +16     
- Misses       52673    52865     +192     


gongel (Member) left a comment:

Please also open a follow-up PR against 3.0.0b2.

return model


def get_loraga_dataloader(train_dataset, data_collator, training_args):
Member comment:

This whole block differs from Trainer only in _get_train_sampler; see whether more of Trainer can be reused.

greycooker (Author):

Done: _wrap_model is now overridden, and everything else reuses Trainer.
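
A minimal sketch of that reuse pattern (the subclass name and the simplified method body are illustrative; only Trainer and _wrap_model come from the discussion above):

from paddlenlp.trainer import Trainer

class LoRAGATrainer(Trainer):
    """Inherits sampling, dataloading, etc. from Trainer unchanged."""

    def _wrap_model(self, model, training=True):
        # Hypothetical body: gradient estimation only needs plain
        # forward/backward, so skip Trainer's full wrapping logic here.
        return model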

@@ -269,7 +271,7 @@ def from_pretrained(cls, model, lora_path, **kwargs):
tp_actions if pre_tensor_parallel_split else None,
expected_keys,
)
-error_msgs += _load_state_dict_into_model(lora_model.model, state_dict, "")
+error_msgs += _load_state_dict_into_model(lora_model, state_dict, "")
Contributor comment:

Why was lora_model.model changed to lora_model here? Was the original code wrong?

greycooker (Author):

The current form should be fine. As I understand the code, without LoRA-GA the two forms are equivalent, so the change does not affect existing behavior. With LoRA-GA, I want to unify its loading logic inside LoRAModel.set_state_dict(); if we kept the old form, from_pretrained would never reach LoRAModel.set_state_dict().
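
A minimal sketch of the routing argument (heavily simplified; preprocess_loraga_weights is a hypothetical stand-in for the LoRA-GA handling described above):

import paddle

class LoRAModel(paddle.nn.Layer):
    def __init__(self, model):
        super().__init__()
        self.model = model  # the wrapped base model

    def set_state_dict(self, state_dict, *args, **kwargs):
        # LoRA-GA hook point, e.g. splitting concatenated adapter
        # weights back apart before the normal load.
        state_dict = preprocess_loraga_weights(state_dict)
        return super().set_state_dict(state_dict, *args, **kwargs)

# Loading into lora_model.model dispatches to the inner layer's plain
# set_state_dict and bypasses the override above; loading into lora_model
# goes through LoRAModel.set_state_dict().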

base_name = name.replace("lora_A", "weight")

# Reinit base model
offset = init_loraA.cuda() @ init_loraB.cuda()
Contributor comment:

Why is .cuda() needed everywhere here? The returned tensors looked like they were already on the device. Hard-coding .cuda() seems problematic for multi-hardware support.

greycooker (Author):

The state_dict lives on CPU, and after the split the tensors stay on CPU unless paddle.to_tensor is used. Could we uniformly replace .cuda() here with to_tensor or to(target_device)?
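
A sketch of what the to(target_device) idea could look like (to_running_device is a hypothetical helper; paddle.device.get_device and the place argument of paddle.to_tensor are existing Paddle APIs):

import paddle

def to_running_device(tensor):
    # Honor whatever device the run is configured for (gpu, npu, xpu,
    # cpu, ...) instead of hard-coding .cuda().
    device = paddle.device.get_device()  # e.g. "gpu:0" or "cpu"
    return paddle.to_tensor(tensor, place=device)

offset = to_running_device(init_loraA) @ to_running_device(init_loraB)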

@@ -220,7 +220,9 @@ def get_expected_state_dict(model_to_save):
if key in state_dict:
state_dict.pop(key)
elif isinstance(model_to_save, LoRAModel):
state_dict = model_to_save.get_trainable_state_dict()
concat_additional_adapter = kwargs.get("concat_additional_adapter", False)
Contributor comment:

I don't quite understand the purpose of concat_additional_adapter; the unified checkpoint code seems to always pass concat_additional_adapter=True.

greycooker (Author) commented Dec 12, 2024:

Originally I wanted get_expected_state_dict to decide whether to concat solely from the loraga flag in the LoRA config, without passing concat_additional_adapter. But get_expected_state_dict turned out to be used in too many places, including save_unified_optimizer and check_unified_checkpoint, and concatenating there would cause problems. So I pass the concat_additional_adapter switch only at the call sites that reach get_expected_state_dict and actually need the concat, and leave everything else unchanged.
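
A condensed sketch of the opt-in switch being described (the loraga config attribute and the concat_loraga_adapters helper are hypothetical names):

def get_expected_state_dict(model_to_save, **kwargs):
    state_dict = model_to_save.get_trainable_state_dict()
    # Opt-in: only call sites that write a loadable adapter checkpoint
    # pass concat_additional_adapter=True; save_unified_optimizer,
    # check_unified_checkpoint, etc. keep the default False.
    if kwargs.get("concat_additional_adapter", False) and getattr(
        model_to_save.lora_config, "loraga", False
    ):
        state_dict = concat_loraga_adapters(state_dict)
    return state_dict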

@@ -67,7 +67,7 @@ def save_file_sync(state_dict, path):
def save_single_card_checkpoint(model_to_save, output_dir):
"""Save checkpoint for non-distributed environment."""

-state_dict = get_expected_state_dict(model_to_save)
+state_dict = get_expected_state_dict(model_to_save, concat_additional_adapter=True)
Contributor comment:

why?

gongel previously approved these changes Dec 17, 2024
gradient_dict[local_grad_name] = grad.clone() / self.loraga_init_iters
else:
if self.gradient_offload:
new_grad = gradient_dict[local_grad_name].cuda() + grad / self.loraga_init_iters
Contributor comment:

The .cuda() calls and the like can be cleaned up in a follow-up; approving this PR for now.

DesmonDay previously approved these changes Dec 17, 2024

DesmonDay (Contributor) left a comment:
LGTM


loraB_name = name.replace("lora_A", "lora_B")
concat_lora_B = state_dict[loraB_name]
final_loraB, init_loraB = process_split_and_assign(
Collaborator comment:

Unused variables like this one can simply be replaced with _.

greycooker (Author):

OK, I'll change this one.

base_name = name.replace("lora_A", "weight")
if not self.reinit_base_model:
# Reinit base model
offset = init_loraA.cuda() @ init_loraB.cuda()
Collaborator comment:

This one shouldn't need an explicit .cuda() either; paddle.matmul switches to CUDA memory based on the running device.

greycooker (Author) commented Dec 17, 2024:

A bit strange: I tried it, and matmul doesn't seem to do that.

in_sep_parallel_mode = self.args.sep_parallel_degree > 1
in_cp_parallel_mode = self.args.context_parallel_degree > 1

if in_pipeline_parallel_mode:
Collaborator comment:

What is the reason pipeline parallelism is not supported?

greycooker (Author):

Gradients under pipeline parallelism need extra handling, and there is no demand for it at the moment.

def register_gradient_hook(self):
"""Register gradient hooks for all model parameters."""
for grad_name, param in self.model.named_parameters():
param._register_backward_hook(
Collaborator comment:

When does this hook run: after every backward, or only after the last backward?

greycooker (Author):

It runs on every backward inside LoRA-GA's estimate_gradient, but it does not run during the actual training.
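
A sketch of that lifecycle (the loop structure and helper signatures are illustrative; only loraga_init_iters and the backward-hook mechanism come from the PR):

def estimate_gradient(model, dataloader, loraga_init_iters):
    gradient_dict = {}
    # Hooks populate gradient_dict during these warmup backwards only;
    # they are not registered for the real training run.
    register_gradient_hook(model, gradient_dict, loraga_init_iters)
    for step, batch in enumerate(dataloader):
        if step >= loraga_init_iters:
            break
        loss = model(**batch)[0]
        loss.backward()
        model.clear_gradients()  # the hook already folded grad into the average
    return gradient_dict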

new_grad = gradient_dict[local_grad_name].cuda() + grad / self.loraga_init_iters
gradient_dict[local_grad_name] = new_grad.cpu()
else:
gradient_dict[local_grad_name] += grad / self.loraga_init_iters
Collaborator comment:

I have a question about this logic too: is grad a gradient accumulated over multiple steps?

greycooker (Author):

Yes, grad is accumulated over multiple steps and then averaged; the number of accumulation steps is controlled by the loraga_init_iters hyperparameter.
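
Spelled out, the scheme mirrors the snippet above (illustrative standalone form):

def accumulate_grad(gradient_dict, local_grad_name, grad, loraga_init_iters, gradient_offload):
    if local_grad_name not in gradient_dict:
        # First backward: start the running average.
        gradient_dict[local_grad_name] = grad.clone() / loraga_init_iters
    elif gradient_offload:
        # The stored copy is parked on CPU to save device memory; move it
        # back, add this step's contribution, then park it again.
        new_grad = gradient_dict[local_grad_name].cuda() + grad / loraga_init_iters
        gradient_dict[local_grad_name] = new_grad.cpu()
    else:
        gradient_dict[local_grad_name] += grad / loraga_init_iters
    # After loraga_init_iters backwards, each entry equals the mean
    # gradient: sum(grad_i) / loraga_init_iters.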

greycooker dismissed stale reviews from DesmonDay and gongel via 275c623 on December 17, 2024 at 14:00
wawltor (Collaborator) left a comment:
LGTM

wawltor merged commit 407b3e6 into PaddlePaddle:develop on Dec 18, 2024
9 of 12 checks passed