FIX PiSSA & OLoRA with rank/alpha pattern, rslora #1930
Conversation
See huggingface#1929 (comment)

At the moment, when using PiSSA or OLoRA with weight conversion to restore the original base weights, there is an error when either of `rank_pattern`, `alpha_pattern`, or `rslora` is being used. This PR fixes this.

The issue is that we need to double the rank of the LoRA adapter. Right now, this is done by simply doubling `r` and `alpha`. But if `rank_pattern` and `alpha_pattern` are being used, those need to be doubled too.

Furthermore, when using `rslora`, the scaling is again different, namely `alpha / sqrt(r)`. This also needs to be adjusted.

Unfortunately, when using `rslora` with `rank_pattern` and `alpha_pattern`, this gets way more complicated. Since we don't store the scaling in the `state_dict`, we would have to resolve all the patterns here to determine the correct scaling, i.e. reimplement the whole matching and init logic. This is a lot of work for a very edgy edge case.

Therefore, I opted to raise an error instead. This is not super nice, as the error is only raised when trying to save the model, i.e. a lot of time may already have been spent to train the model. But we cannot know this earlier, so not much can be done.

Overall, this fix is ugly because it further couples unrelated code. For instance, if we add new init methods that affect the scaling, we need to remember to change the saving logic accordingly. If anyone has a better idea, LMK.
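To make the scaling bookkeeping concrete, here is a minimal sketch of the idea, not the actual PEFT implementation; `adjust_config_for_conversion` is a hypothetical helper:

```python
# Minimal sketch (hypothetical helper, not the actual PEFT code) of how the
# rank/alpha settings have to change when the converted adapter's rank doubles.
import math

def adjust_config_for_conversion(r, lora_alpha, rank_pattern, alpha_pattern, use_rslora):
    """Return rank/alpha settings that keep the LoRA scaling unchanged after doubling r."""
    if use_rslora and (rank_pattern or alpha_pattern):
        # Determining which pattern matches which layer would mean re-running the
        # whole matching/init logic, so the PR raises an error in this case instead.
        raise ValueError(
            "Weight conversion is not supported when using rslora together with "
            "rank_pattern or alpha_pattern."
        )
    # Default LoRA scaling is alpha / r: doubling both r and alpha keeps it constant.
    # With rslora, scaling is alpha / sqrt(r): doubling r requires multiplying
    # alpha by sqrt(2) instead to keep the scaling unchanged.
    alpha_factor = math.sqrt(2) if use_rslora else 2
    return {
        "r": 2 * r,
        "lora_alpha": alpha_factor * lora_alpha,
        "rank_pattern": {key: 2 * rank for key, rank in rank_pattern.items()},
        "alpha_pattern": {key: alpha_factor * alpha for key, alpha in alpha_pattern.items()},
    }
```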
Tagging @fxmeng and @tokenizer-decode as this affects the code they contributed.
Oh, so we raise the error after all the training is done. Can you tell me why we are checking …?
Yes, valid question. The issue is that this is only a problem if users pass …
Hmm. Maybe at the initialization, if …, we could warn users to expect saving not to work?
@tokenizer-decode Good suggestion about adding a warning. I did that in the last commit, though unfortunately many users don't read warnings. Still, there is not much else that can be done here.
Yeah, you are right, not much else to do here. But I agree that this is a very edgy edge case, so we should be ok.
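For concreteness, the early check discussed above could look roughly like this (a hypothetical sketch, not the exact PEFT code or warning text):

```python
# Hypothetical illustration of an early warning at config-creation time, since
# the actual error can only be raised much later, when saving the model.
import warnings

def maybe_warn(init_lora_weights, use_rslora, rank_pattern, alpha_pattern):
    uses_mutated_init = str(init_lora_weights).lower().startswith(("pissa", "olora"))
    if uses_mutated_init and use_rslora and (rank_pattern or alpha_pattern):
        warnings.warn(
            "Using rslora together with rank_pattern/alpha_pattern and PiSSA/OLoRA "
            "initialization: converting the adapter to restore the original base "
            "weights will not work when saving."
        )
```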
@sayakpaul LMK if you don't want to review this, I'll look for someone else.
@tokenizer-decode While working on this, I also discovered a small error in a test caused by your PR, which I have fixed (see the update section of the PR description).
Oh, I totally missed that. Thanks!
Thanks! Seems like a very deep edge case. I have left some comments to better understand some stuff. Once replied to, I will review one more time.
output_pissa = peft_model(data)[0]

# sanity check
tol = 1e-06
(nit) A stricter check for negations is to relax the tolerance a bit and see if the assertion still holds, IMO.
True. At the moment, we use the same tolerance for `assert not torch.allclose(...)` and `assert torch.allclose(...)` throughout the whole PEFT code base. Therefore, I wouldn't change this now only for these specific tests. I think this could be a nice use case for a code-LLM to go through all tests and rewrite them to use higher tolerance for negative checks :)
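For illustration, a toy example of the point about tolerances (the tensors here are made up):

```python
# For a *negative* check, a larger tolerance is the stricter assertion, because
# the values must differ by more than the tolerance for the test to pass.
import torch

a = torch.zeros(3)
b = torch.full((3,), 0.01)
assert not torch.allclose(a, b, atol=1e-06, rtol=1e-06)  # current style: tight tolerance
assert not torch.allclose(a, b, atol=1e-03, rtol=1e-03)  # stricter negative check
```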
assert model_loaded.peft_config["default"].r == 8
assert model_loaded.base_model.model.linear.lora_A["default"].weight.shape[0] == 32
What is the difference between `r` being 8 and `rank_pattern` being 32 for `linear` here?
target_modules=["linear"], r=8, rank_pattern={"linear": 32}
Sorry, I missed this comment earlier, so addressing it now.
For this simple test, we only have a small model with a single linear layer. Therefore, if we set `rank_pattern={"linear": 32}`, we basically ignore `r=8` because `rank_pattern` takes precedence. If we had a real model, we could, however, set all ranks to 8 this way and pick individual layers with a different rank by using `rank_pattern`.
Does that answer the question?
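To illustrate the precedence on a model with more than one layer, a hypothetical config might look like this (module names are made up):

```python
# Toy illustration of rank_pattern taking precedence over r for matching modules.
from peft import LoraConfig

config = LoraConfig(
    r=8,                                  # default rank for every targeted module
    target_modules=["q_proj", "v_proj"],  # hypothetical module names
    rank_pattern={"v_proj": 32},          # modules matching "v_proj" get rank 32 instead of 8
)
```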
Yeah it does. Maybe just add a comment explaining this in the test? Because many developers refer to tests to understand different usage patterns and they may get confused here as well.
tests/test_initialization.py (outdated)
# use rank_pattern here
config = LoraConfig(init_lora_weights="pissa", target_modules=["linear"], r=8, rank_pattern={"linear": 32})
peft_model = get_peft_model(deepcopy(model), config)
# save the initial model
peft_model.peft_config["default"].init_lora_weights = True
peft_model.save_pretrained(tmp_path / "init-model")
peft_model.peft_config["default"].init_lora_weights = "pissa"
Help me unpack this bit of code.
We are using `pissa` for `init_lora_weights` already in the `LoraConfig`. So, I don't understand what would differ between `peft_model.peft_config["default"].init_lora_weights = True` and `peft_model.peft_config["default"].init_lora_weights = "pissa"`.
That was due to a bug in the original PiSSA implementation. I fixed it when I was implementing OLoRA, but I didn't change the tests. What would differ is that `init_lora_weights` should be `True` in the config file. See here.
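To make the effect on the checkpoint concrete, here is a rough sketch of what the saved config should contain (building on the test snippet above and its `tmp_path` fixture; this check is not part of the actual test):

```python
# After saving with init_lora_weights temporarily set to True, the stored
# adapter_config.json should record plain LoRA initialization, so loading the
# checkpoint does not re-run the PiSSA decomposition.
import json

with open(tmp_path / "init-model" / "adapter_config.json") as f:
    assert json.load(f)["init_lora_weights"] is True
```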
model.linear.weight, model_converted.base_model.model.linear.base_layer.weight, atol=tol, rtol=tol
)

def test_lora_pissa_conversion_same_output_after_loading_with_alpha_pattern(self, data, tmp_path):
Same comments apply for this too.
Very minor changes requested. I think this is the best we can have right now.
@sayakpaul I added clarifying comments to the tests in my latest commit.
Update (to the PR description)
I noticed that the test `test_lora_pissa_conversion_same_output_after_loading_with_quantization` was failing locally. This is because the argument `convert_mutated_to_lora` was renamed to `path_initial_model_for_weight_conversion` in #1828. This oversight has been addressed.
However, this also means that the tests did not run in CI (nightly CI with GPU). The reason was that the PiSSA and OLoRA tests were missing the right marker, so I added them as well.
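For reference, a minimal usage sketch of the renamed argument (the paths are placeholders, continuing from the test setup above):

```python
# The keyword that used to be called convert_mutated_to_lora is now
# path_initial_model_for_weight_conversion (paths below are placeholders).
peft_model.save_pretrained(
    "pissa-converted",
    path_initial_model_for_weight_conversion="pissa-init",
)
```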