Support for different layer shapes for VeRA #1817
Thanks for the PR!
I haven't looked into the details yet, but I wondered about the reasoning for this approach. You probably also considered saving one set of weights for each unique shape (perhaps transposition-invariant). What do you think is the advantage of the proposed solution: does it save on the number of parameters on average, does it work better? And how would you handle parameters with different dimensions, like those of Conv2d layers?
Slicing creates a view of the original tensor, so in most cases it should be more memory-efficient than creating separate sets of A & B, and it better fits the current approach, which is one pair of A & B for the whole adapter. For layers like Conv2d, we could slice the large A & B first and then create a view to give it the proper dimensions. Orthogonally to that, we could have an option in the config to have separate A & B for every adapted layer, which would be the equivalent of the ablation from the paper (Table 7). I don't see any benefit in having something in between, i.e. a separate pair per different shape.
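To illustrate the point about slicing (a minimal sketch with made-up shapes, not the actual PEFT code): basic slicing of a shared pair of projection matrices returns views that share storage with the originals, so no extra memory is allocated per adapted layer.

```python
import torch

r = 4
max_in, max_out = 1024, 1024

# One shared, frozen pair of projection matrices for the whole adapter.
vera_A = torch.randn(r, max_in)
vera_B = torch.randn(max_out, r)

# A smaller adapted layer, e.g. in_features=256, out_features=512,
# just takes a slice of the shared matrices.
in_features, out_features = 256, 512
sliced_A = vera_A[:, :in_features]   # shape (r, 256)
sliced_B = vera_B[:out_features, :]  # shape (512, r)

# The slices are views: they point to the same underlying storage.
assert sliced_A.data_ptr() == vera_A.data_ptr()
assert sliced_B.data_ptr() == vera_B.data_ptr()
```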
Yes, my expectation would also be that slicing from a single big matrix should result in fewer parameters overall than having one weight for each unique shape. There could be some exceptions, e.g. say we have a shape 100x10 and a shape 10x100: the total shape would be 100x100, which is much bigger than 100x10 + 10x100. But I'm not sure if there would be any practical examples of that.

Even if this is better for the overall parameter count, one potential disadvantage of this approach is that it means we need to use the weights directly:

```python
sliced_A = vera_A[:, : self.in_features]
sliced_B = vera_B[: self.out_features, :]
dropout = self.vera_dropout[active_adapter]
x = x.to(lambda_d.dtype)
result = result + lambda_b * F.linear(lambda_d * F.linear(dropout(x), sliced_A), sliced_B)
```

I.e. we cannot go through the forward call of the adapter layers. Why would we want to go through the forward call? Because techniques like offloading with accelerate or DeepSpeed/FSDP wrap the modules and rely on hooks attached to the forward call to manage the weights, so using the weights directly might not work with them.

This to me seems to be an inferior approach compared to having one set of weights per shape (though easier to implement), as I think VeRA is all about minimizing the number of trainable parameters as much as possible. Anyway, this is all a bit hypothetical at the moment; I just wanted to write down my thoughts and hear your opinion. One step forward would be to try out VeRA in different settings that involve these types of wrapping and hooks, like offloading with accelerate or using DeepSpeed/FSDP, and check if it works.
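A toy sketch of the hook concern (not tied to any particular framework; the hook here just prints, whereas real frameworks use such hooks to gather or load sharded/offloaded weights): the hook fires when the module's forward is called, but not when the weight tensor is used directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(8, 8)

# Stand-in for the kind of hook that FSDP/DeepSpeed/accelerate offloading
# attach to make the real weights available before forward runs.
def pre_forward_hook(module, args):
    print("pre-forward hook fired")

layer.register_forward_pre_hook(pre_forward_hook)

x = torch.randn(2, 8)
layer(x)                   # goes through forward -> the hook fires
F.linear(x, layer.weight)  # uses the weight directly -> the hook is bypassed
```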
Regarding the edge case of two "long and narrow" tensors of different "orientation", e.g. 100x10 and 10x100: the resulting A & B would have shapes (r, 100) & (100, r), not 100x100, so it would be 200*r vs. 220*r parameters.
I didn't compare the two implementations, but IMO both should be equivalent in outcome, with negligible differences in memory usage.
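Spelling out the arithmetic for that edge case (just the counting, assuming A has shape (r, in_features) and B has shape (out_features, r)):

```python
r = 4
shapes = [(100, 10), (10, 100)]  # (out_features, in_features)

# One shared pair, sliced per layer: A is (r, max_in), B is (max_out, r).
max_in = max(i for _, i in shapes)
max_out = max(o for o, _ in shapes)
shared = r * max_in + max_out * r                  # 200 * r

# One pair per unique shape: A is (r, in), B is (out, r) for each shape.
per_shape = sum(r * i + o * r for o, i in shapes)  # 220 * r

print(shared, per_shape)  # 800 880 for r=4
```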
Actually, my first version of the implementation was temporarily swapping the `.data` of the tensors, so it could be changed back to that:

```python
sliced_A = vera_A.data[:, : self.in_features]
sliced_B = vera_B.data[: self.out_features, :]
init_A = vera_A.data
init_B = vera_B.data
vera_A.data = sliced_A
vera_B.data = sliced_B
dropout = self.vera_dropout[active_adapter]
x = x.to(lambda_d.dtype)
result = result + lambda_b * F.linear(lambda_d * F.linear(dropout(x), vera_A), vera_B)
vera_A.data = init_A
vera_B.data = init_B
```

But done on the weights of the adapted layers. Still, I'm not sure if it would work with stuff like FSDP.
True, it would make sense only in combination with the regeneration from a seed; then we would still store only one seed per adapted layer. The most flexible and "future proof" approach would be to give an option to have basis A & B matrices per:

- the whole adapter (the current behavior),
- each unique shape,
- each adapted layer.

However, it would require the most work and could be confusing for new users of VeRA.

On a side note (completely unrelated to this PR), A & B could potentially be quantized, as they are not trainable, leading to lower memory usage. I haven't tested it yet, though.
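For the quantization side note, a rough sketch of what quantizing the frozen A & B could look like, e.g. naive per-tensor symmetric int8 quantization (purely illustrative; PEFT does not currently do this, and the helper names are made up):

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Per-tensor symmetric quantization: int8 values plus one fp32 scale.
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

# The shared projection matrices are frozen, so only the rounding error
# matters (no gradients flow through A & B anyway).
vera_A = torch.randn(4, 1024)
q_A, scale_A = quantize_int8(vera_A)   # ~1 byte per element instead of 4
vera_A_deq = dequantize(q_A, scale_A)  # what the forward pass would use
print((vera_A - vera_A_deq).abs().max())
```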
Thanks for your replies. At the end of the day, I think it's okay to go ahead as is. If we do run into the issue that users want to use VeRA with DS or FSDP and it doesn't work, I'm sure we can figure out a way. This PR shouldn't break anything in that regard which currently works.
The implementation is super sweet, nice that only a few lines needed to be changed. I only have a few small comments, please check.
> Actually, my first version of the implementation was temporarily swapping the `.data` of the tensors, so it could be changed back to that.
That probably would also run into trouble, so let's not do it that way.
> The most flexible and "future proof" approach would be to give an option to have basis A & B matrices per: ...
>
> However, it would require the most work and could be confusing for new users of VeRA.
I agree.
> On a side note (completely unrelated to this PR), A & B could potentially be quantized, as they are not trainable, leading to lower memory usage. I haven't tested it yet, though.
Cool idea, if you have some time to test it and find that it works, that would be great. My intuition is, however, that it won't be quite trivial to implement and of course it will introduce more quantization error.
I've addressed the requested changes and fixed merging of the adapter, adding a new test to cover it.
Thanks @dkopi. Could you please run `make style`?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@BenjaminBossan Done. I just skipped two files modified by `make style` (lycoris_utils.py and router.py) to avoid potential merge conflicts.
@BenjaminBossan One of the tests failed. Is it an issue with the HF servers? Could we rerun it?
Yes, no worries, I'll get a notification and re-run those tests. |
Everything looks fantastic, thanks. Updating the notebook was a nice touch.
Some pre-configured models like mistral used not to work with VeRA because the weight shapes were not identical. However, since #1817, this is no longer a requirement. Therefore, this commented code can now be uncommented. I have tested mistral and gemma and they worked. I haven't tested btlm and mixtral but with the update, I'm pretty sure they will work too.
PR for #1816.