
fix: set cos sin in max_seq_len #203

Merged
merged 5 commits into main from cos-sin-fix on Dec 11, 2024

Conversation

@rheasukthanker
Collaborator

Reference Issues/PRs

#194

What does this implement/fix? Explain your changes.

Moves the cos, sin initialization to the CPU at init (instead of the "meta" device used in litgpt), to avoid failures with FSDP.
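A minimal sketch of the idea (hypothetical module and names, not the actual whittle/litgpt code), assuming the usual RoPE formulation: the cos/sin buffers are created on the CPU at construction time instead of on the "meta" device, so FSDP never has to materialize them, and they simply follow the input's device in forward.

```python
import torch
import torch.nn as nn


class RopeCacheSketch(nn.Module):
    """Hypothetical sketch: cos/sin are built on CPU at init and moved later."""

    def __init__(self, max_seq_len: int, rope_n_elem: int, base: int = 10000):
        super().__init__()
        # Build the cache on CPU (not on the "meta" device), so FSDP init does not fail.
        theta = 1.0 / (base ** (torch.arange(0, rope_n_elem, 2).float() / rope_n_elem))
        idx_theta = torch.outer(torch.arange(max_seq_len).float(), theta)
        # Non-persistent buffers: .to(device) moves them together with the module.
        self.register_buffer("cos", torch.cos(idx_theta), persistent=False)
        self.register_buffer("sin", torch.sin(idx_theta), persistent=False)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Slice to the current sequence length and follow the input's device.
        T = x.size(1)
        return self.cos[:T].to(x.device), self.sin[:T].to(x.device)
```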

Minimal Example / How should this PR be tested?

Any other comments?


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the
terms of your choice.

@rheasukthanker rheasukthanker changed the title set cos sin in max_seq_len fix:set cos sin in max_seq_len Dec 6, 2024
@rheasukthanker rheasukthanker changed the title fix:set cos sin in max_seq_len fix: set cos sin in max_seq_len Dec 6, 2024
@aaronkl
Collaborator

aaronkl commented Dec 8, 2024

Thanks for the PR. Just for my understanding, what happens if we do a forward pass with inputs on the GPU? Wouldn't that lead to a clash, since we use different devices?

@rheasukthanker
Collaborator Author

rheasukthanker commented Dec 8, 2024

> Thanks for the PR. Just for my understanding, what happens if we do a forward pass with inputs on the GPU? Wouldn't that lead to a clash, since we use different devices?

We set cos and sin on the device we are training on every time, so cos and sin are always on the training device. Check:

cos, sin = self.rope_cache(
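For illustration, a standalone sketch of that pattern (rope_cache here is a hypothetical helper, not the exact whittle signature): the cache is rebuilt on whatever device the inputs live on, so GPU inputs never clash with a CPU-resident cache.

```python
import torch


def rope_cache(seq_len: int, n_elem: int, device: torch.device, base: int = 10000):
    """Hypothetical helper: build cos/sin directly on the given device every call."""
    theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, device=device).float() / n_elem))
    idx_theta = torch.outer(torch.arange(seq_len, device=device).float(), theta)
    return torch.cos(idx_theta), torch.sin(idx_theta)


# Whatever device the batch is on, cos/sin follow it.
x = torch.randn(2, 16, 64)  # (batch, seq_len, n_embd); CPU here, CUDA during training
cos, sin = rope_cache(seq_len=x.size(1), n_elem=16, device=x.device)
```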

@gabikadlecova
Collaborator

While reviewing, I discovered that the KV cache does not work at all with subnetworks because of how the rope_cache is constructed. I'll implement a fix along with setting the KV cache outside of forward (while making sure cos and sin are on the right device).

- fix correct rope_n_elem value for subnets in KV cache and forward (when input_pos is not None)
- track sub_network_rope_n_elem
- in case max_seq_length changes, use sub_network_rope_n_elem
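A rough sketch of the bookkeeping these commits describe (sub_network_rope_n_elem follows the commit messages above; the class and the remaining names are hypothetical):

```python
import torch
import torch.nn as nn


class SuperNetSketch(nn.Module):
    """Hypothetical supernet skeleton illustrating the rope_n_elem bookkeeping."""

    def __init__(self, head_size: int = 64, rotary_percentage: float = 1.0):
        super().__init__()
        self.rope_n_elem = int(rotary_percentage * head_size)
        # Also track the value for the currently active sub-network.
        self.sub_network_rope_n_elem = self.rope_n_elem

    def set_sub_network(self, sub_network_head_size: int, rotary_percentage: float = 1.0):
        # A sub-network's rope_n_elem must be recomputed from its own head size,
        # not reused from the supernet.
        self.sub_network_rope_n_elem = int(rotary_percentage * sub_network_head_size)

    def reset_cache(self, max_seq_length: int, device: torch.device):
        # If max_seq_length changes, rebuild cos/sin with the sub-network value
        # so the shapes stay consistent with the active sub-network.
        n = self.sub_network_rope_n_elem
        theta = 1.0 / (10000 ** (torch.arange(0, n, 2, device=device).float() / n))
        idx_theta = torch.outer(torch.arange(max_seq_length, device=device).float(), theta)
        self.cos, self.sin = torch.cos(idx_theta), torch.sin(idx_theta)
```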
@gabikadlecova gabikadlecova removed their request for review December 9, 2024 14:16
@gabikadlecova
Collaborator

@aaronkl @rheasukthanker what do you think? Except for KV cache calls, the output should be the same as before. Now no rope_cache calls occur in forward.

We should probably add some KV cache tests (probably in a separate issue), although I'm not sure if the KV cache is that important for subnets - for inference, you would probably extract the subnet first. We could also disable the KV cache for subnets.

- since rope_n_elem is determined in the config along with head size, we need to pass it to the config manually
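To illustrate that point, a hedged sketch of a config where rope_n_elem is derived from the head size (field names loosely modeled on litgpt's Config, but treat them as assumptions):

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical config: rope_n_elem is derived, not set directly."""

    n_embd: int = 768
    n_head: int = 12
    rotary_percentage: float = 1.0
    head_size: int | None = None
    rope_n_elem: int | None = None

    def __post_init__(self):
        if self.head_size is None:
            self.head_size = self.n_embd // self.n_head
        # Because rope_n_elem is computed from head_size, a sub-network with a
        # different head size needs it passed (or recomputed) explicitly.
        if self.rope_n_elem is None:
            self.rope_n_elem = int(self.rotary_percentage * self.head_size)


# For a sub-network config, rope_n_elem has to be set manually to match its head size:
sub_cfg = ConfigSketch(n_embd=512, n_head=8, rope_n_elem=int(1.0 * 64))
```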
@gabikadlecova gabikadlecova linked an issue Dec 9, 2024 that may be closed by this pull request
@rheasukthanker
Collaborator Author

rheasukthanker commented Dec 11, 2024

> While reviewing, I discovered that the KV cache does not work at all with subnetworks because of how the rope_cache is constructed. I'll implement a fix along with setting the KV cache outside of forward (while making sure cos and sin are on the right device).

While I agree with the fixes, there are a couple of things to keep in mind (this is also the reason I left KV caching untouched earlier). I am noting them down here for completeness; we should look into this in more detail later:

  1. Given how KV caching works (https://neptune.ai/blog/transformers-key-value-caching), the KV sizes depend on num_heads, head_size, and the number of query groups, so a KV cache can only be used for a fixed dense model (the KV sizes for subsequent tokens must match). Since the supernet is adaptive when sampling subnetworks, the KV sizes change depending on the network sampled at inference, which renders KV caching inapplicable when different networks are sampled during inference (see the sketch after this list).
  2. KV caching is only used for inference-time gains (i.e. for deployed models). Since we can convert a dense model found by whittle directly into a litgpt model and use the KV cache there, I don't think there is real utility in adapting KV caching for whittle, which is more training/finetuning focussed.
  3. KV caching also fails when num_heads, query groups, or head sizes differ per layer, since it would then have to be implemented in a layer-specific manner (https://github.com/Lightning-AI/litgpt/blob/main/litgpt/model.py#L574). Since we no longer support layer-specific dimensions, this is also not an issue currently.
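To make point 1 concrete, a small illustration of how the cached K/V shapes are tied to the sampled architecture (the shape layout is a simplification, not whittle's or litgpt's exact cache layout):

```python
import torch


def kv_cache_shape(batch: int, n_query_groups: int, max_seq_len: int, head_size: int):
    """Hypothetical illustration: the cache shape is fixed by the architecture."""
    return (batch, n_query_groups, max_seq_len, head_size)


# Supernet vs. a sampled sub-network: the cached K/V tensors have different shapes,
# so a cache allocated for one cannot be reused for the other.
super_k = torch.zeros(kv_cache_shape(1, n_query_groups=8, max_seq_len=128, head_size=64))
sub_k = torch.zeros(kv_cache_shape(1, n_query_groups=4, max_seq_len=128, head_size=32))
print(super_k.shape, sub_k.shape)  # torch.Size([1, 8, 128, 64]) torch.Size([1, 4, 128, 32])
```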

@gabikadlecova
Collaborator


Thanks for the feedback. I agree that KV caching is inapplicable if different networks are sampled. However, the speedup could still help in one particular use case - subnet evaluation:
a) sample a subnetwork and set up a clean KV cache for it
b) evaluate on a task that requires multi-token generation
c) repeat

It's probably easier to extract the subnet and convert it into a litgpt model, but it could be useful to have it here for flexibility. Also, if we don't want to use it for subnets at all, we could raise a warning/error when forward is called with input_pos not None.
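As a sketch of that evaluation loop (all names here, e.g. search_space, set_sub_network, reset_kv_cache, generate_and_score, are placeholders rather than existing whittle APIs):

```python
def evaluate_subnetworks(model, search_space, task, num_candidates: int):
    """Hypothetical loop following steps a)-c) above."""
    results = []
    for _ in range(num_candidates):
        sub_config = search_space.sample()       # a) sample a subnetwork...
        model.set_sub_network(**sub_config)
        model.reset_kv_cache()                   #    ...and start from a clean KV cache
        score = task.generate_and_score(model)   # b) evaluate with multi-token generation
        results.append((sub_config, score))      # c) repeat for the next candidate
    return results
```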

@rheasukthanker rheasukthanker merged commit 015b7fe into main Dec 11, 2024
9 checks passed
@rheasukthanker rheasukthanker deleted the cos-sin-fix branch December 11, 2024 14:36
@aaronkl
Collaborator

aaronkl commented Dec 11, 2024

Thanks for the detailed discussion. I agree it might actually be interesting to see the effect of the KV cache on sub-network evaluation. @gabikadlecova Could you open another issue for that so we can track it for a future PR?

@rheasukthanker
Collaborator Author


I agree with the last use case. Let's keep things as they are for now. We do need more thorough tests for kv_caching, though. @gabikadlecova could you create an issue for that, and perhaps summarize the points discussed here in it?

@gabikadlecova
Collaborator

Sure, I'll create it

Development

Successfully merging this pull request may close these issues.

calling reset_parameters after init leads to crash