Replies: 1 comment
-
I finally understood this. The scale is applied to both the query and the key before their matmul, so the effective factor after the matmul is the square of the per-tensor scale: (d_k^-0.25)^2 = d_k^-0.5, which is equivalent to the 1/sqrt(d_k) factor from the original paper.
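A minimal PyTorch sketch of that equivalence (dummy shapes and values for illustration, not the repository's actual code):

```python
import torch

d_k = 64                      # per-head dimension (illustrative value)
q = torch.randn(10, d_k)      # dummy query vectors
k = torch.randn(10, d_k)      # dummy key vectors

scale = d_k ** -0.25          # the exponent the question is about

# Scaling both q and k before the matmul, as Whisper does ...
scores_split = (q * scale) @ (k * scale).T

# ... gives the same result as the paper's single 1/sqrt(d_k) after the matmul.
scores_paper = (q @ k.T) / (d_k ** 0.5)

print(torch.allclose(scores_split, scores_paper, atol=1e-6))  # True
```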
-
Hello folks,
I am trying to understand the code of the Whisper model, starting from the paper and the current state of the art in transformers. In particular, I was reviewing the Multi-Head Attention implementation in this repository. There is a scale factor in that layer, applied on line 97:
whisper/whisper/model.py, lines 96 to 98 (commit ba3f3cd)
According to the original paper that introduced the transformer architecture, this should be a square root (i.e. dividing by sqrt(d_k)), so why the exponent 0.25? Thank you.
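For reference, the scaled dot-product attention in the original transformer paper (Vaswani et al., 2017) applies the scaling once, after the query-key matmul:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head dimension.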