
[Add] LayerSkip Blog Post #2459

Merged (14 commits) on Nov 20, 2024
Conversation

@ariG23498 (Contributor) commented Nov 5, 2024:

ToDos:

  • Adding a thumbnail
  • Adding a space (@Vaibhavs10 you said you were interested, do you want to take a stab at it?)

Note: We will have to wait for huggingface/transformers#34240 to be merged before we upload the blog post.

@ariG23498 self-assigned this on Nov 5, 2024

@mostafaelhoushi (Contributor) left a comment:

Thanks @ariG23498 for writing this up!
I added some minor comments.

Review thread on layerskip.md (outdated excerpt):

1. [Hugging Face Paper Discussion Forum](https://huggingface.co/papers/2404.16710)
2. [LayerSkip Model Collections](https://huggingface.co/collections/facebook/layerskip-666b25c50c8ae90e1965727a)
3. LayerSkip Space
Contributor:

The LayerSkip Space is the Colab Notebook or something else?

Contributor (Author):

We were thinking of hosting a Hugging Face Space (https://huggingface.co/spaces) so that people can play around with the models.

Contributor (Author):

I will create a Space as soon as the PR is merged.

Member:

But there's also a Colab Notebook, can we link it here too?

Member:

In addition, we may want to add a couple of sentences explaining to readers what they should expect from the rest of this post (how to use it in transformers, plus how it works in more detail).

Contributor:

Here is a suggestion to add the link to the Colab Notebook:

Suggested change (add a fourth item after "3. LayerSkip Space"):

4. [Colab Notebook](https://colab.research.google.com/drive/1V21LaHaZk_zjhvMLvsWgVSFm6-cn9XAl?usp=sharing)

Contributor (Author):

Thanks for the Colab notebook, @mostafaelhoushi!

I am adding the notebook and removing the Space section, as I do not think it will add value to the blog post. WDYT?

Contributor:

I am fine with either option. I am not very familiar with Spaces, but my impression is that it is easier for users and might get more traffic than Colab. Since we already have the blog, and creating a Space may take a while, we can just use the Colab.

Something I want to mention about Colab is that I was only able to get speedups with an A100 GPU and not with the free P100 GPUs, so I had to pay out of pocket to upgrade to an A100 to observe decent speedups. In Spaces, what GPU type is used in the backend?

Contributor (Author):

> I am not very familiar with Spaces, but my impression is that it is easier for users and might get more traffic than Colab. Since we already have the blog, and creating a Space may take a while, we can just use the Colab.

Hugging Face Spaces hosts model demos. With LayerSkip, to showcase the power of the algorithm, we would need to demo the generation speeds of the model with and without self-speculation (as shown in the GIF below). Otherwise it does not make sense to create a Space and let users use it.

[GIF: generation with and without self-speculative decoding]

> In Spaces, what GPU type is used in the backend?

We have a lot of options to choose from (screenshot attached).

[Screenshot: available GPU options for Spaces]
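
(For illustration, here is a minimal sketch of the kind of side-by-side timing such a demo would show. It assumes the `assistant_early_exit` generation argument discussed later in this thread; the checkpoint name is a hypothetical one based on the LayerSkip collection linked above, and the exit layer value is illustrative.)

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name, based on the LayerSkip model collection linked above.
checkpoint = "facebook/layerskip-llama2-7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)

def timed_generate(**kwargs):
    """Generate and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=128, **kwargs)
    return time.perf_counter() - start

baseline = timed_generate()  # plain autoregressive decoding
# Draft tokens with the first 4 layers, verify with the full model (self-speculation).
speculative = timed_generate(assistant_early_exit=4)
print(f"autoregressive: {baseline:.2f}s | self-speculative: {speculative:.2f}s")
```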

Member:

We could come up with something visual for the Space, but in the interest of time I'd get this done and published, and we can iterate later.

@mostafaelhoushi (Contributor) left a comment:

Thanks @ariG23498! I made (hopefully) one last round of review with minor suggestions to the wording here and there.

Review thread on layerskip.md (outdated), commenting on lines 173 to 177:
* There could be different reasons for the relatively limited speedups of self-speculative decoding
on Llama2 70B compared to other models, e.g., the LayerSkip checkpoint of Llama2 70B was continually
pretrained with fewer tokens (328 M tokens for Llama2 70B compared to 52B tokens for Llama2 7B).
But this is an area of improvement to investigate for future research. Nevertheless,
self-speculative decoding for 70B is significantly faster than autoregressive decoding.
Contributor:

Can we indent this bullet point?

Suggested change (replace the `*` bullet with a `-` sub-bullet, same text):

- There could be different reasons for the relatively limited speedups of self-speculative decoding on Llama2 70B compared to other models, e.g., the LayerSkip checkpoint of Llama2 70B was continually pretrained with fewer tokens (328M tokens for Llama2 70B compared to 52B tokens for Llama2 7B). But this is an area of improvement to investigate for future research. Nevertheless, self-speculative decoding for 70B is significantly faster than autoregressive decoding.

Contributor (Author):

This is how the section renders at the current stage:

[Screenshot: rendered bullet list]

I also applied your changes and rendered them, which resulted in the same indentation.

Contributor:

Thanks @ariG23498! What I had in mind was to indent the third bullet even further, i.e., make the third bullet point a sub-bullet of the second bullet point.

Contributor (Author):

I have made the change in the latest commit. Thank you for the suggestion.

@mostafaelhoushi (Contributor) left a comment:

The argument `early_exit` has been changed to `assistant_early_exit` in the merged PR, so I have added suggestions to update the code snippets in the blog accordingly.
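
(For reference, a minimal sketch of what the rename means for the call, reusing the `model`, `tokenizer`, and `inputs` setup from the timing sketch earlier in this thread; the exit layer value is illustrative.)

```python
# Before the merged transformers PR (old argument name):
# outputs = model.generate(**inputs, early_exit=4, max_new_tokens=64)

# After the merged transformers PR (renamed argument):
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```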

@pcuenca (Member) left a comment:

Super cool, very nice and informative! I made a few suggestions with the overall theme of making the post as fluid as possible, but feel free to ignore them!

🔥

Review thread on layerskip.md (outdated), commenting on lines 21 to 27:
By leveraging this technique, we not only speed up text generation but also achieve significant
memory savings and reduce computational latency. In order to obtain an end-to-end speedup, the
output of the earlier layers need to be close enough to the last layer. This is achieved by a
training recipe as described in the paper that could be applied as continual pretraining,
pretraining from scratch, or finetuning on a specific domain. This makes self-speculative decoding
especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering
the overall hardware footprint needed for **large-scale inference**.
Member:

Suggested change (rewriting the paragraph above):

This technique not only speeds up text generation, but it also achieves significant
memory savings and reduces computational latency. In order to obtain an end-to-end speedup, the
output of the earlier layers needs to be close enough to the last layer. This is achieved by a
training recipe which, as described in the paper, can be applied during pretraining, and also while fine-tuning on a specific domain. Self-speculative decoding is
especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering
the overall hardware footprint needed for **large-scale inference**.

Member:

(Not opposed to mentioning shared KV-caching early, like in this summary)

Contributor (Author):

@pcuenca do you mean I should add a line about shared KV caching in this paragraph?

Member:

Just a mention, if you think it's useful. For example:

This technique not only speeds up text generation, but it also achieves significant
memory savings (because weights and caches can be reused), and reduces computational latency. In order to obtain an end-to-end speedup, the
output of the earlier layers needs to be close enough to the last layer's. This is achieved by a
training recipe which, as described in the paper, can be applied during pretraining, and also while fine-tuning on a specific domain. Self-speculative decoding is 
especially efficient for real-world applications, enabling deployment on smaller GPUs and lowering
the overall hardware footprint needed for **large-scale inference**

Your call!

@Vaibhavs10 (Member) left a comment:

Very cool! Left some nits, but good to merge from my side! Great job! 🔥

@ariG23498 (Contributor, Author) commented:
@pcuenca @Vaibhavs10 I have made the changes.

@mostafaelhoushi The Colab notebook and the sheet now reside here: https://huggingface.co/datasets/ariG23498/layer-skip-assets

@pcuenca (Member) left a comment:

Nice work! Ready to merge in my opinion!

@pcuenca (Member) commented Nov 20, 2024:

> @mostafaelhoushi The Colab notebook and the sheet now reside here: https://huggingface.co/datasets/ariG23498/layer-skip-assets

@ariG23498 Could you please create a README in the dataset explaining what's in there, linking to the notebook, and crediting Mostafa as the main author (I know it's already done in the notebook)? We can also transfer the dataset to your HF namespace @mostafaelhoushi, if you'd like that.

@ariG23498 (Contributor, Author) commented:
@pcuenca thanks for the suggestion.
I have added a README to the HF Dataset.

@ariG23498 merged commit 09fd808 into huggingface:main on Nov 20, 2024. 1 check passed.