The Transformer architecture ditched the recurrence mechanism in favor of multi-head self-attention. Avoiding the RNNs' recurrence yields a massive speed-up in training time, and in theory it can capture longer dependencies in a sentence.
Because each word in a sentence flows through the Transformer's encoder/decoder stack simultaneously, the model itself doesn't have any sense of the position/order of each word. Consequently, we still need a way to incorporate the order of the words into the model.
Off the top of one's head, one can think of a few brute-force ideas, such as numbering positions 0, 1, 2, … or normalizing them into [0, 1], but integer indices grow unboundedly with sentence length, and normalized indices mean the same position gets different values in sentences of different lengths.
Ideally, the solution should be to:
Inject a d-dimensional vector into the model input that carries information about each specific position in a sentence. Mathematically, this is how it looks:
$$
\vec{p_t}^{(i)} = f(t)^{(i)} :=
\begin{cases}
\sin(\omega_k \cdot t), & \text{if}\ i = 2k \\
\cos(\omega_k \cdot t), & \text{if}\ i = 2k + 1
\end{cases}
$$
where the frequencies are defined as

$$ \omega_k = \frac{1}{10000^{2k / d}} $$

so the sine/cosine pairs sweep from high frequency to low frequency as the dimension index grows.
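As a minimal sketch (assuming NumPy and an even d), the formula above can be computed for a whole sequence at once:

```python
import numpy as np

def positional_encoding(n, d):
    """Build the n x d sinusoidal positional encoding matrix.

    Row t is the vector p_t: even dimensions (i = 2k) hold sin(w_k * t),
    odd dimensions (i = 2k + 1) hold cos(w_k * t), where
    w_k = 1 / 10000**(2k / d). Assumes d is even.
    """
    P = np.zeros((n, d))
    positions = np.arange(n)[:, np.newaxis]   # t = 0 .. n-1, as a column
    two_k = np.arange(0, d, 2)                # the values 2k for each sin/cos pair
    omega = 1.0 / np.power(10000.0, two_k / d)  # w_k = 1 / 10000^(2k/d)
    P[:, 0::2] = np.sin(positions * omega)    # even dims: sin(w_k * t)
    P[:, 1::2] = np.cos(positions * omega)    # odd dims:  cos(w_k * t)
    return P

pe = positional_encoding(50, 128)
print(pe.shape)  # (50, 128)
```

Note that position t = 0 encodes to all zeros in the sine dimensions and all ones in the cosine dimensions, since sin(0) = 0 and cos(0) = 1.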
Relation between the position of a token, the dimension of the encoding, and the length of the input sequence:
- t (position of token): the position of a specific token in the input sequence of length n.
- d (dimension of encoding): the dimensionality of the positional encoding vector; it determines the number of sine and cosine values used to encode each position t. Typically, d is a hyperparameter chosen to match the model's architecture; in Transformer models like BERT or GPT, d equals the token embedding size (e.g., 512 or 768).

<aside> 💡
When we convert decimals to binary, we can spot the rate of change across the bits: the LSB alternates on every number, the second-lowest bit alternates every 2 numbers, and so on. The continuous, floating-point equivalent of these alternating bits is a set of sinusoidal functions at different frequencies.
[Figure: the 128-dimensional positional encoding for a sentence with a maximum length of 50; each row is the embedding vector p_t.]
</aside>
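The binary analogy in the aside above can be sketched by printing the bits of consecutive integers; each bit flips at half the rate of the one below it, just as each sinusoid pair in the encoding oscillates at a lower frequency than the previous one:

```python
# Print 4-bit binary for 0..7: the LSB flips on every number,
# the next bit every 2 numbers, the next every 4, and so on.
for n in range(8):
    print(f"{n:2d} -> {n:04b}")
```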
PS: ChatGPT :)