Fix for "leaf Variable that requires grad" Error in In-Place Operation #1372
Conversation
Thanks, the fix sounds good!
Can you propagate the fix to the other LoRA types (Conv, etc.)? Also, can you share a small reproducible snippet of the bug?
@younesbelkada I tried CAUSAL_LM with a Mixtral model. My target modules include the Embedding layer, so this code revision was made in the Embedding class of the LoRA layer; my PR only touches the Embedding class.
@younesbelkada I also changed lora/bnb.py to fix the error described below.
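For context, here is a simplified, self-contained sketch of where this kind of out-of-place update sits in a LoRA-style embedding forward pass. It is a toy module for illustration only, not PEFT's actual Embedding implementation; the rank, initialization, and scaling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyLoraEmbedding(nn.Module):
    """Toy illustration of a LoRA-augmented embedding layer (not PEFT's implementation)."""

    def __init__(self, num_embeddings: int, embedding_dim: int, r: int = 8, scaling: float = 1.0):
        super().__init__()
        self.base_layer = nn.Embedding(num_embeddings, embedding_dim)
        self.embedding_A = nn.Parameter(torch.randn(num_embeddings, r))
        self.embedding_B = nn.Parameter(torch.randn(r, embedding_dim))
        self.scaling = scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        result = self.base_layer(x)
        after_A = F.embedding(x, self.embedding_A)  # LoRA A applied as an embedding lookup
        # Out-of-place update, as in this PR, instead of `result += ...`
        result = result + (after_A @ self.embedding_B) * self.scaling
        return result
```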
Thank you @DopeorNope-Lee for the fixes, they make sense. I agree with Younes that it would be better to make this change across all LoRA layers. However, that could be part of a separate PR in case the issue is only noticed for the layers changed here.
@pacman100 I will fix all of the layers and mention it again. Thanks for your recommendation!
Thanks @DopeorNope-Lee! Let us know if you need any help.
Thanks @DopeorNope-Lee for providing this PR. Do you have a minimal code example that demonstrates the error with the in-place operation? Edit: Is this related to #1425?
Thanks @BenjaminBossan, I saw the PR you mentioned, but there is a difference. The code there relates to full fine-tuning (some call it continued pre-training), whereas the in-place error occurs in fine-tuning cases.
@younesbelkada Hi, I revised all of the operators in the LoRA layers. However, I saw 'conflicts that must be resolved'. I think the previous code used the copy method, but the recent version uses the clone method, so I updated that as well. If there are further improvements or other issues, let me know and I will follow up.
Thanks for explaining. I wrote a small test to check full fine-tuning and didn't encounter any error when training. Could you please provide a minimal example? This is also important to have as a unit test so that we can prevent regressions in the future.
@BenjaminBossan Did you add embed_tokens to the target modules? Details follow:
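As an illustration of what "adding embed_tokens to the target modules" means here, a minimal sketch of such a configuration might look like the following; the rank, alpha, and the other target module names are assumptions, not taken from the thread.

```python
from peft import LoraConfig

# Hypothetical LoRA config that includes the embedding layer among the target modules.
config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,                 # assumed rank
    lora_alpha=32,        # assumed scaling
    target_modules=["q_proj", "v_proj", "embed_tokens"],  # embed_tokens targets the embedding layer
)
```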
I added this test:

```python
@parameterized.expand(TEST_CASES)
def test_training_full_finetuning(self, test_name, model_id, config_cls, config_kwargs):
    # check if training the full model works, no error is raised; only check custom models since they are small
    # so full finetuning shouldn't take too long
    model = self.transformers_class.from_pretrained(model_id)
    config = config_cls(
        base_model_name_or_path=model_id,
        **config_kwargs,
    )
    model = get_peft_model(model, config)
    model = model.to(self.torch_device)
    model.train()
    model.requires_grad_(True)  # make all parameters trainable
    optim = torch.optim.SGD(model.parameters(), lr=0.1)
    inputs = self.prepare_inputs_for_testing()
    for _ in range(5):
        optim.zero_grad()
        output = model(**inputs)[0]
        loss = output.sum()
        loss.backward()
        optim.step()
```

These tests include examples with embedding layers in the target modules.
@BenjaminBossan How about trying the Mixtral model with LoRA layers? I think this error does not occur with full fine-tuning, only with LoRA fine-tuning.
I see, thanks. I don't have a machine available to test full Mixtral, so I couldn't test it.
@BenjaminBossan Then may I help you with testing LoRA fine-tuning of Mixtral?
Thanks, that's not necessary. My main concern is to have some kind of test to ensure that we don't have regressions in the future, but maybe that's not easily possible here? I tried a tiny Mixtral model I found on HF but that didn't trigger any error. Running full Mixtral won't work on our CI.
@BenjaminBossan Have you tried fine-tuning with LoRA? I attempted fine-tuning using LoRA; it's instruct tuning with LoRA, not full fine-tuning. But your previous code looks like full fine-tuning, I think.
Well, we have a bunch of tests for fine-tuning with LoRA. I wrote the test for full fine-tuning because you had said earlier:
Maybe I misunderstood what you meant.
@BenjaminBossan However, the issue I'm referring to here is instruct tuning (fine-tuning) using LoRA adapters with load_in_4bit. After hearing your comments, I've also started a training test in a fresh environment. It would be great if we could run this together and share related information. I'm very grateful for your contribution to this open-source project.
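For reference, a minimal sketch of the kind of 4-bit LoRA instruct-tuning setup being described; the model id, dtype, and hyperparameters are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit (the model id here is a placeholder).
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters, including the embedding layer among the targets (assumed modules).
config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "embed_tokens"],
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```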
These things happen, glad we're now on the same page.
Thanks. If you have something to share, let me know. At the end of the day, if we have confirmation that this PR fixes a real world issue, even if we cannot add it to a unit test, it's fine with me. Maybe we can upload the script somewhere and add a link to it as a comment.
@BenjaminBossan Sure! Moreover, I'm sharing a recent test result using Mixtral LoRA fine-tuning. For that, I removed the latest version of peft and installed the library with my modifications (this PR).
Now it runs really well!
@DopeorNope-Lee Could you please run make style?
@BenjaminBossan I ran make style and pushed it!
Thanks for the fixes and the fruitful discussion. LGTM.
> Moreover, I'm sharing a recent test result using Mixtral LoRA fine-tuning.
Is it possible for you to share the script?
@BenjaminBossan Sure, I usually use the Platypus code with a bash file.
I think this PR should be ready to be merged, right? @DopeorNope-Lee could you please fix the small merge conflict? @younesbelkada @pacman100 do you have further comments?
@BenjaminBossan Sure, I fixed it!
@pacman100 @younesbelkada @BenjaminBossan Hi, could you approve my PR?
Thank you @DopeorNope-Lee for the fixes, LGTM!
@DopeorNope-Lee I think the last merge with main resulted in incorrect code, with weights being merged twice, which is causing the failing CI. Could you please take a look?
@BenjaminBossan
Thanks @DopeorNope-Lee for correcting the resolution to work as intended. This LGTM now.
(huggingface#1372) Avoid in-place operations for LoRA forward and merging.
Issue:
The current implementation in the file peft/tuners/lora/layer.py encounters a runtime error due to an in-place operation on a leaf variable that requires gradient computation. Specifically, the error is triggered by the following line of code:
result += (after_A @ embedding_B) * scaling
This line uses the += operator, which modifies the result tensor in-place. When result is a leaf variable with requires_grad=True, such in-place operations are incompatible with PyTorch's autograd system, leading to a "RuntimeError: a leaf Variable that requires grad is being used in an in-place operation".
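A standalone PyTorch snippet reproducing this failure mode in isolation (not the PEFT code itself), assuming a leaf tensor that requires gradients:

```python
import torch

result = torch.zeros(4, 8, requires_grad=True)  # leaf tensor that requires grad
update = torch.randn(4, 8)

# In-place addition on a leaf tensor that requires grad raises:
# RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
result += update
```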
Solution:
To resolve this issue, I propose modifying the in-place operation to a regular operation that creates a new tensor. This change ensures that the original value of result is not altered, thereby maintaining compatibility with the autograd system. The updated line of code is as follows:
result = result + (after_A @ embedding_B) * scaling
This modification allows the computation to proceed without altering the original result tensor, thus avoiding the RuntimeError and ensuring proper gradient calculation during backpropagation.
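The same toy setup with the out-of-place form used in this fix; a new tensor is created, so autograd can track the addition and gradients still reach the original leaf:

```python
import torch

weight = torch.zeros(4, 8, requires_grad=True)  # leaf tensor that requires grad
update = torch.randn(4, 8)

result = weight + update   # out-of-place: builds a new tensor, no RuntimeError
result.sum().backward()
print(weight.grad)         # gradients flow back to the leaf as expected
```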
Therefore, before the modification I encountered
"RuntimeError: a leaf Variable that requires grad is being used in an in-place operation."
After revising it as described above, the error is resolved!
Best