Replies: 1 comment
-
I finally understood this. The scale is applied to both the query and the key before their matmul, so the effective factor after the matmul is the square of the per-tensor scale: (d_k^-0.25)^2 = d_k^-0.5, which is equivalent to the 1/sqrt(d_k) factor from the original paper.
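A minimal PyTorch sketch of that equivalence (dummy shapes and values for illustration, not the repository's actual code):

```python
import torch

d_k = 64                      # per-head dimension (illustrative value)
q = torch.randn(10, d_k)      # dummy query vectors
k = torch.randn(10, d_k)      # dummy key vectors

scale = d_k ** -0.25          # the exponent the question is about

# Scaling both q and k before the matmul, as Whisper does ...
scores_split = (q * scale) @ (k * scale).T

# ... gives the same result as the paper's single 1/sqrt(d_k) after the matmul.
scores_paper = (q @ k.T) / (d_k ** 0.5)

print(torch.allclose(scores_split, scores_paper, atol=1e-6))  # True
```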
-
Hello folks,
I am trying to understand the code of the Whisper model, starting from the paper and the current state of the art in transformers. In particular, I was reviewing the Multi-Head Attention implementation in this repository. There is a scale factor in that layer, applied on line 97:
whisper/whisper/model.py, lines 96 to 98 (commit ba3f3cd)
According to the original paper that introduced the transformer architecture, this should be a square root (i.e. dividing by sqrt(d_k)), so why the exponent 0.25? Thank you.
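For reference, the scaled dot-product attention in the original transformer paper (Vaswani et al., 2017) applies the scaling once, after the query-key matmul:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

where $d_k$ is the per-head dimension.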