An implementation of the Transformer from scratch, as presented in the paper "Attention Is All You Need".
Useful references:
- Excellent illustration of Transformers: Illustrated Guide to Transformers Neural Network: A step by step explanation
- Keys, queries, and values in the attention mechanism: What exactly are keys, queries, and values in attention mechanisms?
- Positional encoding: Transformer Architecture: The Positional Encoding
- Data flow, parameters, and dimensions in the Transformer: Into The Transformer; Transformers: report on Attention Is All You Need
The Transformer is a neural network architecture widely used in natural language processing (NLP) tasks such as machine translation and text classification. It was introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need".
At a high level, the Transformer consists of an encoder and a decoder, each built from a stack of identical layers. Each encoder layer has two sub-layers: a self-attention layer and a position-wise feedforward layer; each decoder layer adds a third sub-layer that attends over the encoder's output. The self-attention layer allows the model to attend to different parts of the input sequence, while the feedforward layer applies a non-linear transformation to each position independently.
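As a rough sketch of how these sub-layers compose, the snippet below wires up one encoder layer in NumPy. Note that the paper also wraps each sub-layer in a residual connection followed by layer normalization ("Add & Norm"); the function names and the identity-attention placeholder here are purely illustrative, not the repo's actual API.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: ReLU(x W1 + b1) W2 + b2, applied at each position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn, ffn):
    # Sub-layer 1: self-attention, wrapped in a residual connection + LayerNorm.
    x = layer_norm(x + attn(x))
    # Sub-layer 2: position-wise feedforward network, with the same wrapping.
    x = layer_norm(x + ffn(x))
    return x

# Toy wiring test: sequence of 5 positions, model dimension 8.
d_model, d_ff, n = 8, 32, 5
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(n, d_model))
# Identity "attention" just to exercise the structure; a real layer would
# use scaled dot-product self-attention here.
out = encoder_layer(x, attn=lambda h: h,
                    ffn=lambda h: feed_forward(h, W1, b1, W2, b2))
print(out.shape)  # (5, 8)
```

Stacking several such layers, each with its own learned parameters, gives the full encoder.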
Now, let's break down the math behind the self-attention layer. Suppose we have an input sequence of length $n$, represented as vectors $x_1, \dots, x_n \in \mathbb{R}^{d_{\text{model}}}$. A simple way to mix information across positions would be a fixed weighted sum of these vectors. However, we want to compute the weights dynamically, based on the similarity between each pair of input vectors. This is where self-attention comes in. We first compute a "query" vector for each position:

$$q_i = W^Q x_i,$$

where $W^Q$ is a learned projection matrix. Similarly, we compute a "key" vector $k_j = W^K x_j$ and a "value" vector $v_j = W^V x_j$ for every position, where $W^K$ and $W^V$ are also learned. The attention weight between positions $i$ and $j$ is the scaled dot product of query and key, normalized with a softmax:

$$\alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'=1}^{n} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)},$$

where $d_k$ is the dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products in a range where the softmax has useful gradients. Finally, we compute the output of the self-attention layer as a weighted sum of the value vectors:

$$z_i = \sum_{j=1}^{n} \alpha_{ij} v_j,$$

where $z_i$ is the output for position $i$.
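This computation can be sketched in a few lines of NumPy; the matrix names ($W^Q$, $W^K$, $W^V$) follow the notation above, and the toy dimensions are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence X of shape (n, d_model).

    W_q, W_k, W_v project each input vector to queries, keys, and values,
    mirroring q_i = W^Q x_i, k_j = W^K x_j, v_j = W^V x_j.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Scaled dot-product scores: scores[i, j] = q_i . k_j / sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over j turns each row of scores into attention weights alpha_ij.
    weights = softmax(scores, axis=-1)
    # Output z_i = sum_j alpha_ij v_j: a weighted sum of the value vectors.
    return weights @ V, weights

# Toy example: 4 positions, model dimension 8, key/value dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
Z, A = self_attention(X, W_q, W_k, W_v)
print(Z.shape)          # (4, 4)
print(A.sum(axis=-1))   # each row of attention weights sums to 1
```

In practice the Transformer runs several such heads in parallel on smaller projections and concatenates their outputs (multi-head attention), but each head computes exactly this.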
Overall, the Transformer architecture is a powerful tool for NLP tasks, and its self-attention mechanism allows it to model long-range dependencies in the input sequence.