The Transformer architecture ditched the recurrence mechanism in favor of multi-head self-attention. Avoiding the RNNs' recurrence yields a massive speed-up in training time, and in theory it can capture longer dependencies in a sentence.
Because each word in a sentence flows through the Transformer's encoder/decoder stack simultaneously, the model itself doesn't have any sense of the position/order of each word. Consequently, we still need a way to incorporate the order of the words into the model.
Off the top of one's head, one can think of a few brute-force ideas, such as numbering positions 0, 1, 2, … or normalizing them into [0, 1], but integer indices grow unboundedly with sentence length, and normalized indices mean the same position gets different values in sentences of different lengths.
Ideally, the solution should be to:
Inject a d-dimensional vector into the model input that carries information about each specific position in a sentence. Mathematically, this is how it looks:
$$
\vec{p_t}^{(i)} = f(t)^{(i)} :=
\begin{cases}
\sin(\omega_k \cdot t), & \text{if}\ i = 2k \\
\cos(\omega_k \cdot t), & \text{if}\ i = 2k + 1
\end{cases}
$$
where the frequencies are defined as

$$ \omega_k = \frac{1}{10000^{2k / d}} $$

so the sine/cosine pairs sweep from high frequency to low frequency as the dimension index grows.
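As a minimal sketch (assuming NumPy and an even d), the formula above can be computed for a whole sequence at once:

```python
import numpy as np

def positional_encoding(n, d):
    """Build the n x d sinusoidal positional encoding matrix.

    Row t is the vector p_t: even dimensions (i = 2k) hold sin(w_k * t),
    odd dimensions (i = 2k + 1) hold cos(w_k * t), where
    w_k = 1 / 10000**(2k / d). Assumes d is even.
    """
    P = np.zeros((n, d))
    positions = np.arange(n)[:, np.newaxis]   # t = 0 .. n-1, as a column
    two_k = np.arange(0, d, 2)                # the values 2k for each sin/cos pair
    omega = 1.0 / np.power(10000.0, two_k / d)  # w_k = 1 / 10000^(2k/d)
    P[:, 0::2] = np.sin(positions * omega)    # even dims: sin(w_k * t)
    P[:, 1::2] = np.cos(positions * omega)    # odd dims:  cos(w_k * t)
    return P

pe = positional_encoding(50, 128)
print(pe.shape)  # (50, 128)
```

Note that position t = 0 encodes to all zeros in the sine dimensions and all ones in the cosine dimensions, since sin(0) = 0 and cos(0) = 1.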
Relation between the position of a token, the dimension of the encoding, and the length of the input sequence:
- t (position of token): the position of a specific token in the input sequence of length n.
- d (dimension of encoding): the dimensionality of the positional encoding vector; it determines the number of sine and cosine values used to encode each position t. Typically, d is a hyperparameter chosen to match the model's architecture; in Transformer models like BERT or GPT, d equals the token embedding size (e.g., 512 or 768).

<aside> 💡
When we convert decimals to binary, we can spot the rate of change across the bits: the LSB alternates on every number, the second-lowest bit alternates every 2 numbers, and so on. The continuous, floating-point equivalent of these alternating bits is a set of sinusoidal functions at different frequencies.
[Figure: the 128-dimensional positional encoding for a sentence with a maximum length of 50; each row is the embedding vector p_t.]
</aside>
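The binary analogy in the aside above can be sketched by printing the bits of consecutive integers; each bit flips at half the rate of the one below it, just as each sinusoid pair in the encoding oscillates at a lower frequency than the previous one:

```python
# Print 4-bit binary for 0..7: the LSB flips on every number,
# the next bit every 2 numbers, the next every 4, and so on.
for n in range(8):
    print(f"{n:2d} -> {n:04b}")
```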
PS: ChatGPT :)