[Add] LayerSkip Blog Post #2459
Conversation
Thanks @ariG23498 for writing this up!
I added some minor comments
layerskip.md (Outdated)

1. [Hugging Face Paper Discussion Forum](https://huggingface.co/papers/2404.16710)
2. [LayerSkip Model Collections](https://huggingface.co/collections/facebook/layerskip-666b25c50c8ae90e1965727a)
3. LayerSkip Space
Is the LayerSkip Space the Colab notebook, or something else?
We were thinking of hosting a Hugging Face Space (https://huggingface.co/spaces) so that people can play around with the models.
I will create a space as soon as the PR is merged.
But there's also a Colab Notebook, can we link it here too?
In addition, we may want to add a couple of sentences to explain to readers what they should expect from the rest of this post (how to use it in transformers + how it works in more detail).
Here is a suggestion to add the link to the Colab notebook:
3. LayerSkip Space
4. [Colab Notebook](https://colab.research.google.com/drive/1V21LaHaZk_zjhvMLvsWgVSFm6-cn9XAl?usp=sharing)
Thanks for the Colab notebook @mostafaelhoushi!
I am adding the notebook and removing the Space section, as I do not think it will add value to the blog post. WDYT?
I am fine with either option. I am not very familiar with Spaces, but my impression is that a Space is easier for users and might get more traffic than Colab; still, since we already have a blog, and creating a Space may take a while, we can just use the Colab.
Something I want to mention about Colab is that I was only able to get speedups when using an A100 GPU and not with the free P100 GPUs, so I had to pay out of pocket to upgrade Colab to an A100 to observe decent speeds. In Spaces, what GPU type is used in the backend?
> I am not very familiar with Spaces, but my impression is that a Space is easier for users and might get more traffic than Colab; still, since we already have a blog, and creating a Space may take a while, we can just use the Colab.
Hugging Face Spaces hosts model demos. With LayerSkip, to showcase the power of the algorithm we would need to demo the generation speed of the model with and without self-speculation (as shown in the GIF, and roughly sketched below); otherwise it does not make sense for us to create the Space and let users use it.
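(For reference, a minimal sketch of that with/without comparison, assuming a LayerSkip checkpoint from the collection and the `assistant_early_exit` argument from the merged transformers PR; the checkpoint name, prompt, and exit layer below are illustrative, not necessarily what the demo would use.)

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint from the LayerSkip collection; the demo may use a different one.
checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Write a short story about a robot:", return_tensors="pt").to(model.device)

def tokens_per_second(**generate_kwargs):
    # Time greedy generation and report throughput in new tokens per second.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False, **generate_kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed

# Baseline autoregressive decoding vs. self-speculative decoding (drafting via early exit at layer 4).
print(f"autoregressive  : {tokens_per_second():.1f} tokens/s")
print(f"self-speculative: {tokens_per_second(assistant_early_exit=4):.1f} tokens/s")
```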
> In Spaces, what GPU type is used in the backend?
We have a lot of options to choose from (attached screenshot)
We could come up with something visual for the Space, but in the interest of time I'd get this done and published and we can iterate later.
Thanks @ariG23498 ! I made (hopefully) one last round of review and left minor suggestions on the wording here and there.
layerskip.md (Outdated)

* There could be different reasons for the relatively limited speedups of self-speculative decoding on Llama2 70B compared to other models, e.g., the LayerSkip checkpoint of Llama2 70B was continually pretrained with fewer tokens (328 M tokens for Llama2 70B compared to 52B tokens for Llama2 7B). But this is an area of improvement to investigate for future research. Nevertheless, self-speculative decoding for 70B is significantly faster than autoregressive decoding.
Can we indent this bullet point?
Suggested change:

- There could be different reasons for the relatively limited speedups of self-speculative decoding on Llama2 70B compared to other models, e.g., the LayerSkip checkpoint of Llama2 70B was continually pretrained with fewer tokens (328 M tokens for Llama2 70B compared to 52B tokens for Llama2 7B). But this is an area of improvement to investigate for future research. Nevertheless, self-speculative decoding for 70B is significantly faster than autoregressive decoding.
Thanks @ariG23498 ! What I had in mind was to make the 3rd bullet indent even further, i.e., the 3rd bullet point becomes a sub-bullet of the 2nd bullet point.
I have made the change in the latest commit. Thank you for the suggestion.
Co-authored-by: Mostafa Elhoushi <m.elhoushi@ieee.org>
The argument `early_exit` has changed to `assistant_early_exit` in the merged PR. So I have added suggestions to update the code snippets in the blog accordingly.
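For reference, a minimal sketch of what a snippet using the renamed argument looks like (the checkpoint, prompt, and exit layer below are illustrative and may differ from the ones in the blog):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint from the LayerSkip collection.
checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)

# `assistant_early_exit` (previously `early_exit`) makes generate() draft tokens by
# exiting the model early at the given layer, then verify them with the full model.
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```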
Super cool, very nice and informative! I made a few suggestions with the overall theme to make the post as fluid as possible, but feel free to ignore them!
🔥
layerskip.md (Outdated)

By leveraging this technique, we not only speed up text generation but also achieve significant memory savings and reduce computational latency. In order to obtain an end-to-end speedup, the output of the earlier layers need to be close enough to the last layer. This is achieved by a training recipe as described in the paper that could be applied as continual pretraining, pretraining from scratch, or finetuning on a specific domain. This makes self-speculative decoding especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering the overall hardware footprint needed for **large-scale inference**.
Suggested change:

This technique not only speeds up text generation, but it also achieves significant memory savings and reduces computational latency. In order to obtain an end-to-end speedup, the output of the earlier layers needs to be close enough to the last layer. This is achieved by a training recipe which, as described in the paper, can be applied during pretraining, and also while fine-tuning on a specific domain. Self-speculative decoding is especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering the overall hardware footprint needed for **large-scale inference**.
(Not opposed to mentioning shared KV-caching early, like in this summary)
@pcuenca do you mean I should add a line about shared KV Caching in this paragraph?
Just a mention, if you think it's useful. For example:
This technique not only speeds up text generation, but it also achieves significant memory savings (because weights and caches can be reused), and reduces computational latency. In order to obtain an end-to-end speedup, the output of the earlier layers needs to be close enough to the last layer's. This is achieved by a training recipe which, as described in the paper, can be applied during pretraining, and also while fine-tuning on a specific domain. Self-speculative decoding is especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering the overall hardware footprint needed for **large-scale inference**.
Your call!
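(As an aside on "close enough to the last layer's": one way to build intuition is to project an early layer's hidden state through the final norm and LM head and compare its next-token prediction with the full model's. The sketch below is only illustrative; it assumes a Llama-style module layout (`model.model.norm`, `model.lm_head`) and an arbitrary checkpoint and exit layer.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative LayerSkip checkpoint; other models from the collection should behave similarly.
checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output, hidden_states[i] is the output of layer i.
# Move the early hidden state to the head's device in case the model is sharded.
early_hidden = out.hidden_states[4][:, -1].to(model.lm_head.weight.device)

# LayerSkip trains earlier layers against the same LM head, so the final norm and head
# can be reused for the early exit (assumes a Llama-style layout).
early_logits = model.lm_head(model.model.norm(early_hidden))
full_logits = out.logits[:, -1]

print("early exit (layer 4):", tokenizer.decode(early_logits.argmax(-1)))
print("full model          :", tokenizer.decode(full_logits.argmax(-1)))
```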
Very cool! Left some nits, but good to merge from my side! Great job! 🔥
@pcuenca @Vaibhavs10 I have made the changes. @mostafaelhoushi The Colab notebook and the sheet now reside here: https://huggingface.co/datasets/ariG23498/layer-skip-assets
Nice work! Ready to merge in my opinion!
@ariG23498 Could you please create a README in the dataset explaining what's in there, linking to the notebook, and crediting Mostafa as the main author (I know it's already done in the notebook)? We can also transfer the dataset to your HF namespace @mostafaelhoushi, if you'd like that.
ToDos:
Note: We will have to wait for huggingface/transformers#34240 to be merged before we upload the blog post.