Fine-tuning Whisper with timestamp tokens #620
-
Hi, I've successfully fine-tuned Whisper without timestamp tokens, but I'm hoping to fine-tune it with timestamp tokens inserted in the decoder inputs. When I feed a decoder input containing timestamp tokens into the model, it fails. When I remove the timestamp tokens from the decoder input, everything works fine. Can someone point me in the right direction?
-
Can you share your code? I'm really interested in fine-tuning with timestamps but don't know where to start.
-
Hi, the tokenizer should have exactly 1501 timestamp tokens, starting with `tokenizer.timestamp_begin`, which corresponds to `<|0.0|>`, through `tokenizer.timestamp_begin + 1500`, which corresponds to `<|30.0|>`, at an interval of 0.02 seconds. (Please note that these `<|...|>` forms are just a notation for convenience used by `decode_with_timestamps()` and are not included in the tokenizer as special tokens.) It appears that your timestamp tokens are well above this range; fine-tuning should work if you adjust them to stay under the 30.0-second mark!
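To make those ranges concrete, here is a minimal sketch of mapping times to token ids with the openai/whisper tokenizer; the helper name `timestamp_token` is mine, not something from this thread:

```python
# Minimal sketch, assuming the openai/whisper package is installed.
from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=True)

def timestamp_token(seconds: float) -> int:
    """Map a time in seconds to its timestamp token id (0.02 s steps)."""
    if not 0.0 <= seconds <= 30.0:
        raise ValueError("timestamps must fall within the 30-second window")
    return tokenizer.timestamp_begin + round(seconds / 0.02)

# Bracket a segment's text tokens with begin/end timestamp tokens.
segment = [timestamp_token(0.0), *tokenizer.encode(" Hello world"), timestamp_token(5.44)]
print(tokenizer.decode_with_timestamps(segment))  # <|0.00|> Hello world<|5.44|>
```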
-
Hi, I'm interested in fine-tuning with timestamps too. When I try to encode `<|0.0|>`, the tokenizer gives me `[27, 91, 15, 13, 15, 91, 29]` instead of a single special token. Thanks!
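Those ids are the string split character by character, which suggests the timestamp strings are not entries in the tokenizer's vocabulary (as was the case for the Hugging Face `WhisperTokenizer` at the time). Here is a minimal sketch, assuming that tokenizer, of computing the ids arithmetically instead; the offset from `<|notimestamps|>` is my assumption about the vocabulary layout:

```python
# Minimal sketch, assuming the Hugging Face transformers WhisperTokenizer,
# where "<|0.0|>" is not a single vocabulary entry.
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

# Assumption: timestamp ids begin right after the <|notimestamps|> token.
timestamp_begin = tokenizer.convert_tokens_to_ids("<|notimestamps|>") + 1

def timestamp_token(seconds: float) -> int:
    """Map a time in seconds (0.0-30.0) to its timestamp token id."""
    return timestamp_begin + round(seconds / 0.02)

print(timestamp_token(0.0))   # id for <|0.0|>
print(timestamp_token(30.0))  # id for <|30.0|>, i.e. timestamp_begin + 1500
```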
-
Sorry for bothering you, but I have one more question. Should we insert end-time tokens like in the image on https://github.com/openai/whisper? Can I put timestamp tokens as in my first example below, or should I do it as in my second? Thanks in advance.
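For what it's worth, the multitask format pictured in the openai/whisper README brackets every segment with both a begin and an end timestamp, so the end time of one segment is repeated as the begin time of the next. A sketch of that layout in the `<|t|>` notation; the texts and times here are invented:

```python
# Sketch of a timestamped target in the decode_with_timestamps() notation.
# Every segment carries a begin AND an end timestamp token, and the end
# time of one segment reappears as the begin time of the next.
target = (
    "<|startoftranscript|><|en|><|transcribe|>"
    "<|0.00|> Hello there.<|2.40|>"
    "<|2.40|> How are you?<|5.00|>"
    "<|endoftext|>"
)
```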