The Transformer architecture ditched the recurrence mechanism in favor of multi-head self-attention. Avoiding the RNNs' recurrence yields a massive speed-up in training time, and, in principle, it can capture longer dependencies in a sentence.
Because each word in a sentence flows through the Transformer's encoder/decoder stack simultaneously, the model itself has no sense of the position/order of the words. Consequently, we still need a way to incorporate word order into the model.
Off the top of one's head, one can think of a few brute-force ideas. Ideally, though, the solution would be to inject a d-dimensional vector into the model input that carries information about a specific position in the sentence. Mathematically, this is how it looks:
$$
\vec{p_t}^{(i)} = f(t)^{(i)} :=
\begin{cases}
\sin(\omega_k \cdot t), & \text{if } i = 2k \\
\cos(\omega_k \cdot t), & \text{if } i = 2k + 1
\end{cases}
$$
where:

- t is the position of a specific token in the input sequence of length n.
- d is the dimensionality of the positional encoding vector; it determines the number of sine and cosine values used to encode each position t. Typically, d is a hyperparameter chosen to match the model's architecture: in transformers like BERT or GPT, d equals the token embedding size (e.g., 512 or 768).
- ω_k = 1/10000^(2k/d) is the frequency of the k-th sine/cosine pair, as in the original Transformer paper.

*Relation between the position of a token, the dimension of the encoding, and the length of the input sequence.*

<aside> 💡
When we convert decimal numbers to binary, we can spot a constant rate of change across the bits: the LSB alternates on every number, the second-lowest bit flips every 2 numbers, the next every 4, and so on. The floating-point equivalent of these alternating bits is a family of sinusoidal functions at different frequencies.
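The bit-flip pattern is easy to see by printing a few small numbers in binary:

```python
# Binary representations of 0..7: watch how often each bit flips.
# The least-significant bit alternates on every number; the next bit
# flips every 2 numbers, the next every 4 numbers, and so on.
rows = [f"{n}: {n:03b}" for n in range(8)]
print("\n".join(rows))
```

Each bit behaves like a square wave with a different period; the sinusoidal encoding replaces those square waves with smooth, continuous ones.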

*The 128-dimensional positional encoding for a sentence with a maximum length of 50. Each row represents the embedding vector p_t.*
</aside>
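The encoding above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the common frequency choice ω_k = 1/10000^(2k/d) from the original Transformer paper; the function name `positional_encoding` is ours:

```python
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """One d-dimensional sinusoidal vector p_t per position t."""
    positions = np.arange(seq_len)[:, np.newaxis]   # t, shape (seq_len, 1)
    k = np.arange(d // 2)[np.newaxis, :]            # pair index k, shape (1, d/2)
    omega = 1.0 / (10000 ** (2 * k / d))            # frequencies ω_k
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(omega * positions)         # even dims i = 2k
    pe[:, 1::2] = np.cos(omega * positions)         # odd dims  i = 2k + 1
    return pe

pe = positional_encoding(50, 128)
print(pe.shape)  # → (50, 128), matching the figure above
```

Each row of `pe` is the vector p_t that gets added to the token embedding at position t; because ω_k shrinks geometrically with k, low dimensions oscillate fast (like the LSB) and high dimensions slowly (like the high bits).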
PS: ChatGPT :)