[PEFT]Add LoRA-GA #9592
Conversation
Thanks for your contribution!
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9592      +/-   ##
===========================================
- Coverage    52.75%   52.66%    -0.09%
===========================================
  Files          711      712        +1
  Lines       111483   111691      +208
===========================================
+ Hits         58810    58826       +16
- Misses       52673    52865      +192

☔ View full report in Codecov by Sentry.
Please also open a follow-up PR for 3.0.0b2.
paddlenlp/peft/lora/loraga_utils.py
Outdated
    return model


def get_loraga_dataloader(train_dataset, data_collator, training_args):
In this whole block, only _get_train_sampler differs from the trainer — see whether more of Trainer can be reused.
Done — _wrap_model is now overridden, and everything else reuses Trainer.
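The reuse pattern from this exchange can be sketched in plain Python. These classes are illustrative stand-ins, not PaddleNLP's actual Trainer API: only the hook that differs is overridden, and the rest of the training machinery is inherited.

```python
# Minimal stand-ins illustrating "override only _wrap_model, reuse the rest".
# Class and method names are hypothetical, modeled on the discussion above.
class Trainer:
    def _wrap_model(self, model):
        # Base behavior, e.g. wrapping for distributed training.
        return ("base-wrapped", model)

    def train(self, model):
        # Shared training entry point; calls the (possibly overridden) hook.
        return self._wrap_model(model)


class LoRAGATrainer(Trainer):
    def _wrap_model(self, model):
        # LoRA-GA customizes only the wrapping step; sampler, loop, and
        # saving are inherited from Trainer unchanged.
        return ("loraga-wrapped", model)


base = Trainer().train("m")
loraga = LoRAGATrainer().train("m")
```

The subclass stays small because the shared `train` path dispatches to the overridden hook automatically.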
@@ -269,7 +271,7 @@ def from_pretrained(cls, model, lora_path, **kwargs):
                 tp_actions if pre_tensor_parallel_split else None,
                 expected_keys,
             )
-        error_msgs += _load_state_dict_into_model(lora_model.model, state_dict, "")
+        error_msgs += _load_state_dict_into_model(lora_model, state_dict, "")
Why was lora_model.model changed to lora_model here? Was the original code wrong?
The new spelling should be fine. As I understand the current code, without LoRA-GA the two spellings are equivalent, so the change does not affect existing behavior. With LoRA-GA added, I want to unify its loading logic inside LoRAModel.set_state_dict(); if we keep the old spelling, from_pretrained would never reach LoRAModel.set_state_dict().
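The dispatch issue being discussed can be illustrated with a pure-Python stand-in (these classes are simplified sketches, not PaddleNLP's real implementations): passing the inner model bypasses the wrapper's overridden set_state_dict, while passing the wrapper reaches it.

```python
# Hypothetical stand-ins showing why loading through lora_model (the
# wrapper) is needed for a custom set_state_dict override to run.
calls = []


class InnerModel:
    def set_state_dict(self, state_dict):
        calls.append("inner")


class LoRAModel:
    def __init__(self):
        self.model = InnerModel()

    def set_state_dict(self, state_dict):
        # Custom LoRA-GA loading logic would live here.
        calls.append("lora")
        self.model.set_state_dict(state_dict)


def _load_state_dict_into_model(model, state_dict, prefix):
    # Simplified: the real helper walks submodules; here it dispatches
    # straight to whichever object it was handed.
    model.set_state_dict(state_dict)


lora_model = LoRAModel()
_load_state_dict_into_model(lora_model.model, {}, "")  # skips the LoRA hook
_load_state_dict_into_model(lora_model, {}, "")        # reaches the LoRA hook
```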
paddlenlp/peft/lora/lora_model.py
Outdated
base_name = name.replace("lora_A", "weight")

# Reinit base model
offset = init_loraA.cuda() @ init_loraB.cuda()
Why does .cuda() need to be called everywhere here? It looks like the returned tensors are already on the device. Calling .cuda() directly seems problematic for multi-hardware support.
state_dict is on CPU, and after the split the tensors stay on CPU unless paddle.to_tensor is used. Could we uniformly replace .cuda() with to_tensor or to(target_device) here?
@@ -220,7 +220,9 @@ def get_expected_state_dict(model_to_save):
             if key in state_dict:
                 state_dict.pop(key)
     elif isinstance(model_to_save, LoRAModel):
-        state_dict = model_to_save.get_trainable_state_dict()
+        concat_additional_adapter = kwargs.get("concat_additional_adapter", False)
I don't quite understand the purpose of concat_additional_adapter — the use cases all seem to pass concat_additional_adapter=True by default.
Originally I wanted get_expected_state_dict to decide whether to concat purely from the loraga flag in the LoRA config, without passing concat_additional_adapter. But get_expected_state_dict is called in too many places — save_unified_optimizer, check_unified_checkpoint, and so on — and concatenating there would cause problems. So I pass the concat_additional_adapter switch only at the call sites that need the concat, and leave every other case unchanged.
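The opt-in switch described here can be sketched roughly as follows. The function body and key names are hypothetical; only the flag-gating pattern follows the discussion — callers that need the merged adapter pass the flag, everyone else gets the raw tensors.

```python
import numpy as np


def get_expected_state_dict(state_dict, loraga=True, concat_additional_adapter=False):
    """Sketch of an opt-in concat switch (hypothetical key names).

    Only call sites that need the merged adapter (e.g. saving a final
    checkpoint) pass concat_additional_adapter=True; other callers such
    as optimizer saving or checkpoint checks keep per-adapter tensors.
    """
    if loraga and concat_additional_adapter:
        merged = dict(state_dict)
        # Fold the additional (init) adapter slice into the main weight.
        merged["lora_B"] = np.concatenate(
            [state_dict["init_lora_B"], state_dict["lora_B"]], axis=0
        )
        merged.pop("init_lora_B")
        return merged
    return state_dict


sd = {"lora_B": np.ones((2, 4)), "init_lora_B": np.zeros((2, 4))}
plain = get_expected_state_dict(sd)                                   # untouched
merged = get_expected_state_dict(sd, concat_additional_adapter=True)  # folded
```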
@@ -67,7 +67,7 @@ def save_file_sync(state_dict, path):
 def save_single_card_checkpoint(model_to_save, output_dir):
     """Save checkpoint for non-distributed environment."""
-    state_dict = get_expected_state_dict(model_to_save)
+    state_dict = get_expected_state_dict(model_to_save, concat_additional_adapter=True)
why?
    gradient_dict[local_grad_name] = grad.clone() / self.loraga_init_iters
else:
    if self.gradient_offload:
        new_grad = gradient_dict[local_grad_name].cuda() + grad / self.loraga_init_iters
The .cuda()-style calls can be fixed in a follow-up; approving this PR first.
LGTM
paddlenlp/peft/lora/lora_model.py
Outdated
loraB_name = name.replace("lora_A", "lora_B")
concat_lora_B = state_dict[loraB_name]
final_loraB, init_loraB = process_split_and_assign(
Unused variables like this can simply be replaced with _.
OK, I'll change this.
base_name = name.replace("lora_A", "weight")
if not self.reinit_base_model:
    # Reinit base model
    offset = init_loraA.cuda() @ init_loraB.cuda()
This also doesn't look like it needs an explicit .cuda() — paddle.matmul switches to CUDA memory based on the running device.
A bit strange — I tried it, and matmul alone doesn't seem to work here.
in_sep_parallel_mode = self.args.sep_parallel_degree > 1
in_cp_parallel_mode = self.args.context_parallel_degree > 1

if in_pipeline_parallel_mode:
What is the reason pipeline parallelism is not supported?
Gradients under PP need extra handling, and there is no demand for it at the moment.
def register_gradient_hook(self):
    """Register gradient hooks for all model parameters."""
    for grad_name, param in self.model.named_parameters():
        param._register_backward_hook(
When is this hook executed — after every backward pass, or only after the last backward?
It runs on every backward inside LoRA-GA's estimate_gradient, but does not run during formal training.
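The per-backward hook flow can be simulated in pure Python (paddle's param._register_backward_hook is replaced by a plain callback list here; the classes are hypothetical stand-ins, and the update rule mirrors the snippet under review):

```python
import numpy as np


class FakeParam:
    """Stand-in for a paddle parameter that supports backward hooks."""

    def __init__(self, name):
        self.name = name
        self._hooks = []

    def _register_backward_hook(self, fn):
        self._hooks.append(fn)

    def backward(self, grad):
        # Simulate autograd invoking the registered hooks with this grad.
        for fn in self._hooks:
            fn(grad)


def make_grad_hook(gradient_dict, name, loraga_init_iters):
    def hook(grad):
        # Each backward contributes grad / N, so after N backwards the
        # stored value is the running mean of the observed gradients.
        if name not in gradient_dict:
            gradient_dict[name] = grad.copy() / loraga_init_iters
        else:
            gradient_dict[name] += grad / loraga_init_iters
    return hook


gradient_dict, param = {}, FakeParam("lora_A")
param._register_backward_hook(make_grad_hook(gradient_dict, "lora_A", 2))
param.backward(np.full((2, 2), 2.0))
param.backward(np.full((2, 2), 4.0))
# gradient_dict["lora_A"] now holds the mean gradient (all entries 3.0)
```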
        new_grad = gradient_dict[local_grad_name].cuda() + grad / self.loraga_init_iters
        gradient_dict[local_grad_name] = new_grad.cpu()
    else:
        gradient_dict[local_grad_name] += grad / self.loraga_init_iters
I also have a question about the logic here — is grad the gradient accumulated over multiple steps?
Yes — grad is accumulated over multiple steps and then averaged; the number of accumulation steps is controlled by the loraga_init_iters hyperparameter.
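This accumulation is a streaming mean: adding grad / loraga_init_iters on every step yields, after loraga_init_iters steps, the average gradient. A quick numpy check of that identity (illustrative only, using random stand-in gradients):

```python
import numpy as np

loraga_init_iters = 4
rng = np.random.default_rng(0)
grads = [rng.random((3, 3)) for _ in range(loraga_init_iters)]

acc = np.zeros((3, 3))
for g in grads:
    acc += g / loraga_init_iters  # same update rule as the snippet under review

# The accumulator equals the plain mean of all observed gradients.
assert np.allclose(acc, np.mean(grads, axis=0))
```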
LGTM
PR types
New features
PR changes
[APIs] Add a new fine-tuning method, LoRA-GA
Description
Original PR: #9387
Reference paper: https://arxiv.org/pdf/2407.05000
Reference implementation: https://github.com/Outsider565/LoRA-GA
Implements the LoRA-GA fine-tuning method.
Supports the TP, DP, and sharding distributed strategies.
Supports resuming training (both unified checkpoint and pdparams).
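As a rough numpy sketch of the LoRA-GA idea from the reference paper: the averaged gradient of a weight is decomposed with an SVD, leading singular vectors seed the adapters, and the resulting offset is subtracted from the base weight so the initial forward pass is unchanged. Shapes and the exact singular-vector split are simplified here — this is not PaddleNLP's implementation:

```python
import numpy as np


def loraga_init(W, grad, r):
    """Gradient-driven LoRA init (simplified sketch, not the paper's exact split).

    W:    (out, in) base weight
    grad: (out, in) averaged gradient of W (e.g. over loraga_init_iters steps)
    r:    LoRA rank
    """
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    # Seed the adapters from the gradient's leading singular directions
    # (the paper assigns left/right vectors more carefully and scales them).
    B = U[:, :r]   # (out, r)
    A = Vt[:r, :]  # (r, in)
    # Subtract the adapter offset so that W_new + B @ A == W at step 0,
    # i.e. the model's initial output is preserved.
    W_new = W - B @ A
    return W_new, A, B


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
grad = rng.standard_normal((8, 6))
W_new, A, B = loraga_init(W, grad, r=2)
assert np.allclose(W_new + B @ A, W)  # initial forward pass unchanged
```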