Avoid rearranging all caches #1483
Conversation
* avoid rearranging all kv_caches
* avoid calculating the same kv_cache from cross attn
* Update decoding.py
* linter fix

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
@@ -721,8 +725,7 @@ def run(self, mel: Tensor) -> List[DecodingResult]:
                )
            ]

        # repeat the audio & text tensors by the group size, for beam search or best-of-n sampling
        audio_features = audio_features.repeat_interleave(self.n_group, dim=0)
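As a reminder of what that line does, a minimal sketch of repeat_interleave along dim 0, with illustrative values rather than the actual decoding.py tensors:

import torch

# repeat_interleave duplicates each batch row n_group times along dim 0,
# so every beam / best-of-n candidate gets its own copy of the audio features
audio_features = torch.arange(3).reshape(3, 1)   # pretend batch of 3 segments
n_group = 2                                      # e.g. beam_size or best_of
print(audio_features.repeat_interleave(n_group, dim=0))
# -> rows [0], [0], [1], [1], [2], [2]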
@wangchou I was wondering: if you remove the repeat of audio_features, where do you repeat the kv_cache for cross attention? Otherwise, during cross attention, q @ k seems to have mismatched dims, since the tokens are repeated according to the beam_size.
@jongwook, would you mind checking this please? Thanks.
@yuekaizhang The @ operator (matmul) should support broadcasting?
import torch

q = torch.ones(70, 4, 16, 4)    # queries for batch_size 7 * beam_size 10
k = torch.ones(7, 4, 4, 16)     # keys computed once per batch item
k2 = torch.ones(70, 4, 4, 16)   # keys repeated for every beam

context2 = q @ k2
print(context2.shape)           # torch.Size([70, 4, 16, 16])

context1 = q @ k                # fails: 70 vs 7 cannot broadcast at dim 0
print(context1.shape)
I ran this with torch==2.0.1 and got:
RuntimeError: The size of tensor a (70) must match the size of tensor b (7) at non-singleton dimension 0
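The batch dims of a batched matmul only broadcast when they match or one of them is 1, which is why 70 against 7 fails; with a size-1 key batch the same product goes through (a minimal sketch reusing the shapes above):

import torch

q = torch.ones(70, 4, 16, 4)
k1 = torch.ones(1, 4, 4, 16)   # a size-1 batch dim does broadcast against 70
print((q @ k1).shape)          # torch.Size([70, 4, 16, 16])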
@yuekaizhang I ran whisper with beam_size=5. It works.
python -m whisper ../samples/thatBand2ch_short.wav --language ja --model small --beam_size=5
After adding a print in qkv_attention() like
qk = q @ k
print("q.shape=", q.shape, ", k.shape=", k.shape)
it outputs:
...
q.shape= torch.Size([5, 12, 1, 64]) , k.shape= torch.Size([5, 12, 64, 6])
q.shape= torch.Size([5, 12, 1, 64]) , k.shape= torch.Size([1, 12, 64, 1500])
What arguments did you use to get a k like (7, 4, 4, 16)? Where does that 7 come from?
P.S. I only tested this on the Mac CPU backend. I guess that 7 comes from GPU-related code?
@wangchou Did you try inference with batch_size > 1? I hit this issue when both batch_size and beam_size were > 1. The snippet above uses batch_size 7 and beam_size 10.
@yuekaizhang I didn't even know there was a batch_size option, and I cannot find it with whisper --help. Sorry.
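For the batch_size > 1 case described above, one way to make the shapes line up is to duplicate the cross-attention keys per beam, which is essentially what the follow-up fix below does for the audio features. A minimal sketch, with illustrative shapes rather than the actual decoding.py variables:

import torch

n_batch, n_beam, n_heads, head_dim, n_audio = 7, 10, 4, 16, 1500
q = torch.randn(n_batch * n_beam, n_heads, 1, head_dim)  # one decoder query per beam
k = torch.randn(n_batch, n_heads, head_dim, n_audio)     # cross-attn keys per batch item

k = k.repeat_interleave(n_beam, dim=0)   # duplicate keys for each beam -> (70, 4, 16, 1500)
print((q @ k).shape)                     # torch.Size([70, 4, 1, 1500])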
* It ensures that audio features are correctly duplicated across beams for each batch item.
* Added a test for `decode()` that includes a regression test for this.
* Update *.github/workflows/test.yml* to run the new test for `decode()` in tiny.
* This issue was introduced in PR openai#1483.
Since the kv_cache from the cross attention block is the same for each beam, we can avoid rearranging or recalculating it multiple times.
I saw about a 20% speedup for the large model with beam_size = 5 on the CPU backend.
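The single-segment case works because the un-repeated cross-attention keys keep a batch dimension of 1, which broadcasts across the beams; a minimal sketch with the shapes printed earlier in the thread (illustrative, not the actual model code):

import torch

n_beam, n_heads, head_dim, n_audio = 5, 12, 64, 1500
q = torch.randn(n_beam, n_heads, 1, head_dim)    # one decoder query per beam
k = torch.randn(1, n_heads, head_dim, n_audio)   # cross-attn keys computed once

# the size-1 batch dim broadcasts across the 5 beams, so the cross-attention
# kv_cache never needs to be repeated or rearranged between decoding steps
print((q @ k).shape)   # torch.Size([5, 12, 1, 1500])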