
[Add] LayerSkip Blog Post #2459

Merged (14 commits) on Nov 20, 2024
Conversation

@ariG23498 (Contributor) commented Nov 5, 2024:

ToDos:

  • Adding a thumbnail
  • Adding a space (@Vaibhavs10 you said you were interested, do you want to take a stab at it?)

Note: We will have to wait for huggingface/transformers#34240 to be merged before we upload the blog post.

@ariG23498 self-assigned this on Nov 5, 2024

@mostafaelhoushi (Contributor) left a comment:

Thanks @ariG23498 for writing this up!
I added some minor comments.

Review thread on layerskip.md (outdated excerpt):

1. [Hugging Face Paper Discussion Forum](https://huggingface.co/papers/2404.16710)
2. [LayerSkip Model Collections](https://huggingface.co/collections/facebook/layerskip-666b25c50c8ae90e1965727a)
3. LayerSkip Space
Contributor:

The LayerSkip Space is the Colab Notebook or something else?

Contributor (Author):

We were thinking of hosting a Hugging Face Space (https://huggingface.co/spaces) so that people can play around with the models.

Contributor (Author):

I will create a Space as soon as the PR is merged.

Member:

But there's also a Colab Notebook, can we link it here too?

Member:

In addition, we may want to add a couple of sentences explaining to readers what they should expect from the rest of this post (how to use it in transformers, plus how it works in more detail).

Contributor:

Here is a suggestion to add the link to the Colab Notebook:

Suggested change (add a fourth item after "3. LayerSkip Space"):

4. [Colab Notebook](https://colab.research.google.com/drive/1V21LaHaZk_zjhvMLvsWgVSFm6-cn9XAl?usp=sharing)

Contributor (Author):

Thanks for the Colab notebook, @mostafaelhoushi!

I am adding the notebook and removing the Space section, as I do not think it will add value to the blog post. WDYT?

Contributor:

I am fine with either option. I am not very familiar with Spaces, but my impression is that it is easier for users and might get more traffic than Colab. Since we already have the blog, and creating a Space may take a while, we can just use the Colab.

Something I want to mention about Colab is that I was only able to get speedups with an A100 GPU and not with the free P100 GPUs, so I had to pay out of pocket to upgrade to an A100 to observe decent speedups. In Spaces, what GPU type is used in the backend?

Contributor (Author):

> I am not very familiar with Spaces, but my impression is that it is easier for users and might get more traffic than Colab. Since we already have the blog, and creating a Space may take a while, we can just use the Colab.

Hugging Face Spaces hosts model demos. With LayerSkip, to showcase the power of the algorithm, we would need to demo the generation speeds of the model with and without self-speculation (as shown in the GIF below). Otherwise it does not make sense to create a Space and let users use it.

[GIF: generation with and without self-speculative decoding]

> In Spaces, what GPU type is used in the backend?

We have a lot of options to choose from (screenshot attached).

[Screenshot: available GPU options for Spaces]
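
(For illustration, here is a minimal sketch of the kind of side-by-side timing such a demo would show. It assumes the `assistant_early_exit` generation argument discussed later in this thread; the checkpoint name is a hypothetical one based on the LayerSkip collection linked above, and the exit layer value is illustrative.)

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name, based on the LayerSkip model collection linked above.
checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)

def timed_generate(**kwargs):
    """Generate and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, **kwargs)
    return time.perf_counter() - start

baseline = timed_generate()  # plain autoregressive decoding
# Draft tokens with the first 4 layers, verify with the full model (self-speculation).
speculative = timed_generate(assistant_early_exit=4)
print(f"autoregressive: {baseline:.2f}s | self-speculative: {speculative:.2f}s")
```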

Member:

We could come up with something visual for the Space, but in the interest of time I'd get this done and published, and we can iterate later.

@mostafaelhoushi (Contributor) left a comment:

Thanks @ariG23498! I made (hopefully) one last round of review with minor suggestions to the wording here and there.

Review thread on layerskip.md (outdated), commenting on lines 173 to 177:
* There could be different reasons for the relatively limited speedups of self-speculative decoding
on Llama2 70B compared to other models, e.g., the LayerSkip checkpoint of Llama2 70B was continually
pretrained with fewer tokens (328 M tokens for Llama2 70B compared to 52B tokens for Llama2 7B).
But this is an area of improvement to investigate for future research. Nevertheless,
self-speculative decoding for 70B is significantly faster than autoregressive decoding.
Contributor:

Can we indent this bullet point?

Suggested change (replace the `*` bullet with a `-` sub-bullet, same text):

- There could be different reasons for the relatively limited speedups of self-speculative decoding on Llama2 70B compared to other models, e.g., the LayerSkip checkpoint of Llama2 70B was continually pretrained with fewer tokens (328M tokens for Llama2 70B compared to 52B tokens for Llama2 7B). But this is an area of improvement to investigate for future research. Nevertheless, self-speculative decoding for 70B is significantly faster than autoregressive decoding.

Contributor (Author):

This is how the section renders at the current stage:

[Screenshot: rendered bullet list]

I also applied your changes and rendered them, which resulted in the same indentation.

Contributor:

Thanks @ariG23498! What I had in mind was to indent the third bullet even further, i.e., make the third bullet point a sub-bullet of the second bullet point.

Contributor (Author):

I have made the change in the latest commit. Thank you for the suggestion.

@mostafaelhoushi (Contributor) left a comment:

The argument `early_exit` has been changed to `assistant_early_exit` in the merged PR, so I have added suggestions to update the code snippets in the blog accordingly.
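
(For reference, a minimal sketch of what the rename means for the call, reusing the `model`, `tokenizer`, and `inputs` setup from the timing sketch earlier in this thread; the exit layer value is illustrative.)

```python
# Before the merged transformers PR (old argument name):
# outputs = model.generate(**inputs, early_exit=4, max_new_tokens=64)

# After the merged transformers PR (renamed argument):
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```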

@pcuenca (Member) left a comment:

Super cool, very nice and informative! I made a few suggestions with the overall theme of making the post as fluid as possible, but feel free to ignore them!

🔥

Review thread on layerskip.md (outdated), commenting on lines 21 to 27:
By leveraging this technique, we not only speed up text generation but also achieve significant
memory savings and reduce computational latency. In order to obtain an end-to-end speedup, the
output of the earlier layers need to be close enough to the last layer. This is achieved by a
training recipe as described in the paper that could be applied as continual pretraining,
pretraining from scratch, or finetuning on a specific domain. This makes self-speculative decoding
especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering
the overall hardware footprint needed for **large-scale inference**.
Member:

Suggested change (rewriting the paragraph above):

This technique not only speeds up text generation, but it also achieves significant
memory savings and reduces computational latency. In order to obtain an end-to-end speedup, the
output of the earlier layers needs to be close enough to the last layer. This is achieved by a
training recipe which, as described in the paper, can be applied during pretraining, and also while fine-tuning on a specific domain. Self-speculative decoding is
especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering
the overall hardware footprint needed for **large-scale inference**.

Member:

(Not opposed to mentioning shared KV-caching early, like in this summary)

Contributor (Author):

@pcuenca do you mean I should add a line about shared KV caching in this paragraph?

Member:

Just a mention, if you think it's useful. For example:

This technique not only speeds up text generation, but it also achieves significant
memory savings (because weights and caches can be reused), and reduces computational latency. In order to obtain an end-to-end speedup, the
output of the earlier layers needs to be close enough to the last layer's. This is achieved by a
training recipe which, as described in the paper, can be applied during pretraining, and also while fine-tuning on a specific domain. Self-speculative decoding is 
especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering
the overall hardware footprint needed for **large-scale inference**

Your call!

@Vaibhavs10 (Member) left a comment:

Very cool! Left some nits, but good to merge from my side! Great job! 🔥

@ariG23498 (Contributor, Author) commented:
@pcuenca @Vaibhavs10 I have made the changes.

@mostafaelhoushi The Colab notebook and the sheet now reside here: https://huggingface.co/datasets/ariG23498/layer-skip-assets

@pcuenca (Member) left a comment:

Nice work! Ready to merge in my opinion!

@pcuenca (Member) commented Nov 20, 2024:

> @mostafaelhoushi The Colab notebook and the sheet now reside here: https://huggingface.co/datasets/ariG23498/layer-skip-assets

@ariG23498 Could you please create a README in the dataset explaining what's in there, linking to the notebook, and crediting Mostafa as the main author (I know it's already done in the notebook)? We can also transfer the dataset to your HF namespace @mostafaelhoushi, if you'd like that.

@ariG23498 (Contributor, Author) commented:
@pcuenca thanks for the suggestion.
I have added a README to the HF Dataset.

@ariG23498 merged commit 09fd808 into huggingface:main on Nov 20, 2024. 1 check passed.