Why do we need PE?

<aside> 💡

When we convert decimal to binary we can spot a regular rate of change across the bits: the LSB alternates on every number, the next-lowest bit flips every 2 numbers, and each higher bit flips half as often as the one below it. The floating-point equivalent of these alternating bits is a set of sinusoidal functions at geometrically decreasing frequencies (see the sketch after this aside).

Figure: the 128-dimensional positional encoding for a sentence with a maximum length of 50. Each row represents the embedding vector p_t.

</aside>
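To make the analogy concrete, here is a minimal NumPy sketch of the sinusoidal encoding from "Attention Is All You Need", with `d_model=128` and `max_len=50` chosen to match the figure above (the function name is mine, not from the paper):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]  # (max_len, 1)
    # Frequencies fall off geometrically, like the flip rate of binary bits:
    # div_term[i] = 1 / 10000^(2i / d_model)
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_term)  # even dimensions: sine
    pe[:, 1::2] = np.cos(positions * div_term)  # odd dimensions: cosine
    return pe

# 128-dim encodings for positions 0..49, one row per embedding vector p_t.
pe = sinusoidal_positional_encoding(max_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```

The low-index dimensions oscillate fastest (like the LSB), while the high-index dimensions change slowly, giving each position a unique fingerprint.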

Interview Questions

PS: these questions were generated with ChatGPT :)

Questions

Shortcomings of Positional Encoding (and their resolutions)

  1. Traditional encodings primarily provide absolute positional information. They don't inherently capture relative distances or relationships between tokens, which can be more informative for understanding context. A common resolution is relative positional encoding, sketched below.
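
A toy NumPy sketch of one such resolution, in the spirit of the learned relative-position biases used by Shaw et al. and T5 (the variable names and the clipping window are my assumptions, not from a specific implementation):

```python
import numpy as np

# A relative-position bias is added to attention scores so the model
# sees distances between tokens rather than absolute indices.
seq_len, max_rel = 6, 4
rng = np.random.default_rng(0)
# One learned scalar per clipped offset in [-max_rel, +max_rel];
# random values stand in for trained parameters here.
bias_table = rng.normal(size=2 * max_rel + 1)

i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
rel = np.clip(j - i, -max_rel, max_rel) + max_rel  # offsets -> table indices
rel_bias = bias_table[rel]                         # (seq_len, seq_len) bias

# rel_bias[i, j] depends only on the offset j - i, so the same distance
# always gets the same bias, wherever the token pair sits in the sequence.
print(np.allclose(rel_bias[0, 3], rel_bias[2, 5]))  # True: both offsets are +3
```

Because the bias is indexed by offset rather than position, the model generalizes the same distance relationship across the whole sequence, which is exactly what absolute encodings fail to guarantee.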