
[PEFT]Add LoRA-GA #9592

Merged
merged 15 commits into PaddlePaddle:develop on Dec 18, 2024

Conversation

greycooker (Contributor) commented Dec 9, 2024

PR types

New features

PR changes

[APIs]Add a new finetuning method LoRA-GA

Description

Original PR: #9387
Reference paper: https://arxiv.org/pdf/2407.05000
Reference open-source implementation: https://github.com/Outsider565/LoRA-GA
Implements the LoRA-GA fine-tuning method (a sketch of the initialization idea follows below).
Supports the tp, dp, and sharding distributed strategies.
Supports resuming training (both unified checkpoint and pdparams).
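
For context, LoRA-GA estimates the averaged gradient of each base weight over a few warmup batches and initializes lora_A/lora_B from its SVD, so that the first low-rank update approximates the first full fine-tuning step. A minimal sketch of the idea in Paddle (the function name, the exact slice assignment, and the scaling are illustrative, not the PR's actual API):

import paddle

def loraga_init(weight, grad, rank, scaling):
    # SVD of the gradient averaged over the warmup batches.
    u, s, vh = paddle.linalg.svd(grad.astype("float32"), full_matrices=False)
    # Disjoint singular directions are assigned to A and B (which slice
    # goes where follows the paper; treat these slices as illustrative).
    init_loraA = u[:, :rank]             # [in_features, rank]
    init_loraB = vh[rank : 2 * rank, :]  # [rank, out_features]
    # "Reinit base model": subtract the non-zero product so the adapted
    # model's initial output matches the original model.
    offset = init_loraA @ init_loraB
    new_weight = weight - scaling * offset
    return new_weight, init_loraA, init_loraB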

paddle-bot bot commented Dec 9, 2024

Thanks for your contribution!

codecov bot commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 5.52995% with 205 lines in your changes missing coverage. Please review.

Project coverage is 52.66%. Comparing base (2231feb) to head (275c623).
Report is 1 commit behind head on develop.

Files with missing lines                               Patch %   Lines
paddlenlp/peft/lora/loraga_utils.py                      0.00%   171 Missing ⚠️
paddlenlp/peft/lora/lora_model.py                       25.71%    26 Missing ⚠️
paddlenlp/trainer/trainer.py                            25.00%     3 Missing ⚠️
paddlenlp/trainer/unified_checkpoint/utils.py           25.00%     3 Missing ⚠️
...rainer/unified_checkpoint/load_save_single_card.py    0.00%     1 Missing ⚠️
...p/trainer/unified_checkpoint/unified_checkpoint.py    0.00%     1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9592      +/-   ##
===========================================
- Coverage    52.75%   52.66%   -0.09%     
===========================================
  Files          711      712       +1     
  Lines       111483   111691     +208     
===========================================
+ Hits         58810    58826      +16     
- Misses       52673    52865     +192     


gongel (Member) left a comment:

Please also open a follow-up PR against 3.0.0b2.

return model


def get_loraga_dataloader(train_dataset, data_collator, training_args):
Member comment:

This whole block differs from Trainer only in _get_train_sampler; see whether more of Trainer can be reused.

greycooker (Author):

Done: _wrap_model is now overridden, and everything else reuses Trainer.
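
A minimal sketch of that reuse pattern (the subclass name and the simplified method body are illustrative; only Trainer and _wrap_model come from the discussion above):

from paddlenlp.trainer import Trainer

class LoRAGATrainer(Trainer):
    """Inherits sampling, dataloading, etc. from Trainer unchanged."""

    def _wrap_model(self, model, training=True):
        # Hypothetical body: gradient estimation only needs plain
        # forward/backward, so skip Trainer's full wrapping logic here.
        return model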

@@ -269,7 +271,7 @@ def from_pretrained(cls, model, lora_path, **kwargs):
tp_actions if pre_tensor_parallel_split else None,
expected_keys,
)
-error_msgs += _load_state_dict_into_model(lora_model.model, state_dict, "")
+error_msgs += _load_state_dict_into_model(lora_model, state_dict, "")
Contributor comment:

Why was lora_model.model changed to lora_model here? Was the original code wrong?

greycooker (Author):

The current form should be fine. As I understand the code, without LoRA-GA the two forms are equivalent, so the change does not affect existing behavior. With LoRA-GA, I want to unify its loading logic inside LoRAModel.set_state_dict(); if we kept the old form, from_pretrained would never reach LoRAModel.set_state_dict().
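
A minimal sketch of the routing argument (heavily simplified; preprocess_loraga_weights is a hypothetical stand-in for the LoRA-GA handling described above):

import paddle

class LoRAModel(paddle.nn.Layer):
    def __init__(self, model):
        super().__init__()
        self.model = model  # the wrapped base model

    def set_state_dict(self, state_dict, *args, **kwargs):
        # LoRA-GA hook point, e.g. splitting concatenated adapter
        # weights back apart before the normal load.
        state_dict = preprocess_loraga_weights(state_dict)
        return super().set_state_dict(state_dict, *args, **kwargs)

# Loading into lora_model.model dispatches to the inner layer's plain
# set_state_dict and bypasses the override above; loading into lora_model
# goes through LoRAModel.set_state_dict().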

base_name = name.replace("lora_A", "weight")

# Reinit base model
offset = init_loraA.cuda() @ init_loraB.cuda()
Contributor comment:

Why is .cuda() needed everywhere here? The returned tensors looked like they were already on the device. Hard-coding .cuda() seems problematic for multi-hardware support.

greycooker (Author):

The state_dict lives on CPU, and after the split the tensors stay on CPU unless paddle.to_tensor is used. Could we uniformly replace .cuda() here with to_tensor or to(target_device)?
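
A sketch of what the to(target_device) idea could look like (to_running_device is a hypothetical helper; paddle.device.get_device and the place argument of paddle.to_tensor are existing Paddle APIs):

import paddle

def to_running_device(tensor):
    # Honor whatever device the run is configured for (gpu, npu, xpu,
    # cpu, ...) instead of hard-coding .cuda().
    device = paddle.device.get_device()  # e.g. "gpu:0" or "cpu"
    return paddle.to_tensor(tensor, place=device)

offset = to_running_device(init_loraA) @ to_running_device(init_loraB)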

@@ -220,7 +220,9 @@ def get_expected_state_dict(model_to_save):
if key in state_dict:
state_dict.pop(key)
elif isinstance(model_to_save, LoRAModel):
state_dict = model_to_save.get_trainable_state_dict()
concat_additional_adapter = kwargs.get("concat_additional_adapter", False)
Contributor comment:

I don't quite understand the purpose of concat_additional_adapter; the unified checkpoint code seems to always pass concat_additional_adapter=True.

greycooker (Author) commented Dec 12, 2024:

Originally I wanted get_expected_state_dict to decide whether to concat solely from the loraga flag in the LoRA config, without passing concat_additional_adapter. But get_expected_state_dict turned out to be used in too many places, including save_unified_optimizer and check_unified_checkpoint, and concatenating there would cause problems. So I pass the concat_additional_adapter switch only at the call sites that reach get_expected_state_dict and actually need the concat, and leave everything else unchanged.
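
A condensed sketch of the opt-in switch being described (the loraga config attribute and the concat_loraga_adapters helper are hypothetical names):

def get_expected_state_dict(model_to_save, **kwargs):
    state_dict = model_to_save.get_trainable_state_dict()
    # Opt-in: only call sites that write a loadable adapter checkpoint
    # pass concat_additional_adapter=True; save_unified_optimizer,
    # check_unified_checkpoint, etc. keep the default False.
    if kwargs.get("concat_additional_adapter", False) and getattr(
        model_to_save.lora_config, "loraga", False
    ):
        state_dict = concat_loraga_adapters(state_dict)
    return state_dict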

@@ -67,7 +67,7 @@ def save_file_sync(state_dict, path):
def save_single_card_checkpoint(model_to_save, output_dir):
"""Save checkpoint for non-distributed environment."""

-state_dict = get_expected_state_dict(model_to_save)
+state_dict = get_expected_state_dict(model_to_save, concat_additional_adapter=True)
Contributor comment:

why?

gongel previously approved these changes Dec 17, 2024
gradient_dict[local_grad_name] = grad.clone() / self.loraga_init_iters
else:
if self.gradient_offload:
new_grad = gradient_dict[local_grad_name].cuda() + grad / self.loraga_init_iters
Contributor comment:

The .cuda() calls and the like can be cleaned up in a follow-up; approving this PR for now.

DesmonDay previously approved these changes Dec 17, 2024

DesmonDay (Contributor) left a comment:
LGTM


loraB_name = name.replace("lora_A", "lora_B")
concat_lora_B = state_dict[loraB_name]
final_loraB, init_loraB = process_split_and_assign(
Collaborator comment:

Unused variables like this one can simply be replaced with _.

greycooker (Author):

OK, I'll change this one.

base_name = name.replace("lora_A", "weight")
if not self.reinit_base_model:
# Reinit base model
offset = init_loraA.cuda() @ init_loraB.cuda()
Collaborator comment:

This one shouldn't need an explicit .cuda() either; paddle.matmul switches to CUDA memory based on the running device.

greycooker (Author) commented Dec 17, 2024:

A bit strange: I tried it, and matmul doesn't seem to do that.

in_sep_parallel_mode = self.args.sep_parallel_degree > 1
in_cp_parallel_mode = self.args.context_parallel_degree > 1

if in_pipeline_parallel_mode:
Collaborator comment:

What is the reason pipeline parallelism is not supported?

greycooker (Author):

Gradients under pipeline parallelism need extra handling, and there is no demand for it at the moment.

def register_gradient_hook(self):
"""Register gradient hooks for all model parameters."""
for grad_name, param in self.model.named_parameters():
param._register_backward_hook(
Collaborator comment:

When does this hook run: after every backward, or only after the last backward?

greycooker (Author):

It runs on every backward inside LoRA-GA's estimate_gradient, but it does not run during the actual training.
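
A sketch of that lifecycle (the loop structure and helper signatures are illustrative; only loraga_init_iters and the backward-hook mechanism come from the PR):

def estimate_gradient(model, dataloader, loraga_init_iters):
    gradient_dict = {}
    # Hooks populate gradient_dict during these warmup backwards only;
    # they are not registered for the real training run.
    register_gradient_hook(model, gradient_dict, loraga_init_iters)
    for step, batch in enumerate(dataloader):
        if step >= loraga_init_iters:
            break
        loss = model(**batch)[0]
        loss.backward()
        model.clear_gradients()  # the hook already folded grad into the average
    return gradient_dict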

new_grad = gradient_dict[local_grad_name].cuda() + grad / self.loraga_init_iters
gradient_dict[local_grad_name] = new_grad.cpu()
else:
gradient_dict[local_grad_name] += grad / self.loraga_init_iters
Collaborator comment:

I have a question about this logic too: is grad a gradient accumulated over multiple steps?

greycooker (Author):

Yes, grad is accumulated over multiple steps and then averaged; the number of accumulation steps is controlled by the loraga_init_iters hyperparameter.
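
Spelled out, the scheme mirrors the snippet above (illustrative standalone form):

def accumulate_grad(gradient_dict, local_grad_name, grad, loraga_init_iters, gradient_offload):
    if local_grad_name not in gradient_dict:
        # First backward: start the running average.
        gradient_dict[local_grad_name] = grad.clone() / loraga_init_iters
    elif gradient_offload:
        # The stored copy is parked on CPU to save device memory; move it
        # back, add this step's contribution, then park it again.
        new_grad = gradient_dict[local_grad_name].cuda() + grad / loraga_init_iters
        gradient_dict[local_grad_name] = new_grad.cpu()
    else:
        gradient_dict[local_grad_name] += grad / loraga_init_iters
    # After loraga_init_iters backwards, each entry equals the mean
    # gradient: sum(grad_i) / loraga_init_iters.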

greycooker dismissed stale reviews from DesmonDay and gongel via 275c623 on December 17, 2024 at 14:00
wawltor (Collaborator) left a comment:
LGTM

wawltor merged commit 407b3e6 into PaddlePaddle:develop on Dec 18, 2024
9 of 12 checks passed