Introduction — Why Positional Information Is Needed
Self-attention takes a set of input tokens and produces a weighted average of them. The attention operation itself, however, has no sense of order. "The dog bit the man" and "The man bit the dog" contain exactly the same words yet describe opposite events, and without positional information the model has no cue to tell them apart. Mathematically, self-attention is permutation equivariant: permuting the inputs permutes the outputs in the same way and changes nothing else, so the operation cannot perceive order.
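The permutation-equivariance claim can be checked numerically. The following is a minimal sketch in plain Python, assuming a single head with identity projections (Q = K = V = x); the function name `self_attention` and the toy vectors are illustrative and not taken from any library.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity projections (Q = K = V = x)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, xs)) for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -0.5]]
perm = [2, 0, 1]
shuffled = [tokens[p] for p in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row r of the shuffled output should match row perm[r] of the original output
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for p, row in zip(perm, out_shuffled)
    for a, b in zip(out[p], row)
)
print(equivariant)
```

Each token receives the same output vector regardless of where it sits in the sequence, which is exactly why order has to be supplied from outside the attention operation.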
So a Transformer must inject token positions explicitly. This article follows how the methods for doing so have evolved. Starting from the simplest sinusoidal encoding, we look at learned and relative positions, then RoPE, which has become the de facto standard in open LLMs, and ALiBi, and finish with length-extrapolation techniques for handling inputs longer than those seen in training.
1. Sinusoidal — Fixed Sine-Wave Encoding
This is the approach of the original Transformer paper ("Attention Is All You Need"). For each position, it builds sine/cosine values at different frequencies and adds them to the token embedding.
PE(pos, 2i) = sin( pos / 10000^(2i / D) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / D) )
pos = position index
i = dimension index (0 .. D/2-1)
D = model dimension
Low dimensions carry fast frequencies and high dimensions slow frequencies, creating a unique pattern per position. The advantage is that there are no learned parameters and, in theory, values can be computed for positions beyond the training length. In practice, however, performance often degrades well beyond the training length.
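import torch

def sinusoidal_encoding(n_positions, d_model):
    # d_model is assumed even so that sine and cosine columns pair up
    pe = torch.zeros(n_positions, d_model)
    pos = torch.arange(n_positions).unsqueeze(1).float()  # (n_positions, 1)
    i = torch.arange(0, d_model, 2).float()               # 0, 2, 4, ... (the "2i" in the formula)
    denom = torch.pow(10000, i / d_model)                 # 10000^(2i / D)
    pe[:, 0::2] = torch.sin(pos / denom)                  # even columns: sine
    pe[:, 1::2] = torch.cos(pos / denom)                  # odd columns: cosine
    return pe  # (n_positions, d_model)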
2. Learned — Learning Positions
Instead of sine waves, this approach uses a learnable embedding vector per position. Early BERT, GPT, and others used it. The advantage is that the positional representation is optimized to the data, but there is a decisive drawback: positions never seen during training (inputs exceeding the maximum length) have no embedding at all. That is, extrapolation is fundamentally impossible.
learned PE:
learns a (D,) vector for each of positions 0..max_len-1
positions beyond max_len -> no corresponding vector -> cannot process
Because of this limitation, learned absolute positional encoding has fallen out of favor recently, now that long context matters.
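The hard limit can be seen in a few lines. This is a minimal sketch in plain Python, assuming a randomly initialized table as a stand-in for trained embeddings; `pos_table` and `add_learned_position` are hypothetical names used only for illustration.

```python
import random

max_len, d_model = 8, 4
random.seed(0)
# Stand-in for a trained table: one vector per position 0..max_len-1
pos_table = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_len)]

def add_learned_position(token_vecs):
    """Add the position vector for index i to the i-th token embedding."""
    if len(token_vecs) > max_len:
        raise ValueError(f"sequence length {len(token_vecs)} exceeds max_len={max_len}")
    return [[t + p for t, p in zip(tok, pos_table[i])] for i, tok in enumerate(token_vecs)]

ok = add_learned_position([[0.0] * d_model] * 8)      # exactly max_len tokens: fine

try:
    add_learned_position([[0.0] * d_model] * 9)       # one token too many
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
```

There is simply no row to look up for position 8 here, so no amount of clever inference-time handling recovers a meaningful vector without further training.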
3. Relative Positions — Relative Feels More Natural Than Absolute
The meaning of a sentence often depends more on "how far this token is from that token" than on "which position in the document this token occupies." Relative positional encoding reflects the distance (relative position) between two tokens in the attention score.
Instead of adding absolute positions to the embedding, it adds a term that depends on the distance between query and key at the attention-score stage. This way the same distance relationship is treated identically no matter where it appears in the document, which helps generalization and extrapolation. RoPE and ALiBi both follow this relative-position philosophy.
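One generic way to realize "a term that depends on distance" is a bias table indexed by the clipped offset between key and query. The sketch below assumes that design; the table values are made up, and `relative_bias_matrix` is a hypothetical helper rather than the formulation of any specific paper.

```python
def relative_bias_matrix(n, bias_by_offset, max_dist):
    """bias[i][j] depends only on the clipped offset (j - i)."""
    def clip(d):
        return max(-max_dist, min(max_dist, d))
    return [[bias_by_offset[clip(j - i)] for j in range(n)] for i in range(n)]

max_dist = 2
# Hypothetical learned values, one per offset in [-max_dist, max_dist]
bias_by_offset = {-2: -0.8, -1: -0.2, 0: 0.5, 1: -0.1, 2: -0.6}

bias = relative_bias_matrix(5, bias_by_offset, max_dist)

# Same offset gives the same bias wherever the pair sits in the sequence
same_offset_same_bias = bias[0][1] == bias[3][4]
# Offsets beyond max_dist share the boundary value
clipped = bias[4][0] == bias_by_offset[-2]
```

In use, `bias[i][j]` would be added to the raw score `q_i · k_j` before the softmax. Because the table is indexed by offset rather than by absolute position, a longer sequence needs no new entries.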
4. RoPE — Encoding Relative Position via Rotation
The Core Idea
RoPE (Rotary Position Embedding) is the approach used today by most open LLMs such as Llama and Qwen. The idea is elegant: rather than adding positional information, it rotates the Query and Key vectors by an angle proportional to position.
Rotating the vector at position m in a 2D subspace by an angle of m times theta makes the dot product of Query and Key depend naturally only on the difference of the two positions (m - n). That is, you apply rotation using absolute positions, yet the result encodes relative position.
key property of RoPE:
rotate the query at position m by R(m), the key at position n by R(n), then
dot( R(m) q, R(n) k ) = function( q, k, m - n )
-> the dot product depends not on absolute positions m, n but only on relative distance (m - n)
The Rotation Matrix
It groups the D dimensions into D/2 2D pairs and rotates each pair at a different frequency.
rotate each 2D pair (x_a, x_b) by an angle phi:
[ x_a' ]   [ cos(phi)  -sin(phi) ] [ x_a ]
[ x_b' ] = [ sin(phi)   cos(phi) ] [ x_b ]
angle of the j-th pair at position m:
phi = m * theta_j, where theta_j = 1 / base^(2j / D) (base is usually 10000)
low-dimension pairs: fast rotation (high frequency)
high-dimension pairs: slow rotation (low frequency)
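import torch

def apply_rope(x, positions, base=10000):
    # x: (..., N, D) with D even; positions: (N,) integer positions
    *_, N, D = x.shape
    half = D // 2
    # theta_j = base^(-2j / D) for j = 0 .. half-1
    freq = 1.0 / (base ** (torch.arange(0, half).float() / half))
    angles = positions.unsqueeze(-1).float() * freq  # (N, half)
    cos = torch.cos(angles)
    sin = torch.sin(angles)
    # This variant pairs dimension j with dimension j + half instead of
    # adjacent dimensions; the relative-position property is unchanged.
    x1 = x[..., :half]
    x2 = x[..., half:]
    # rotate each 2D pair (x1_j, x2_j) by angles[..., j]
    rotated = torch.cat([x1 * cos - x2 * sin,
                         x1 * sin + x2 * cos], dim=-1)
    return rotated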
RoPE is popular because it (1) encodes relative position naturally, (2) adds no parameters, (3) applies only to Q and K, fitting well with the KV cache, and (4) leaves room to handle extrapolation by adjusting the base.
5. ALiBi — A Distance-Proportional Attention Penalty
ALiBi (Attention with Linear Biases) is simpler. Rather than putting positional encoding into the embedding, it adds a negative bias proportional to the query-key distance directly to the attention score. The farther a token is, the more its score is reduced.
ALiBi attention score:
score(i, j) = q_i · k_j - m * |i - j|
m = a per-head slope constant
|i - j| = distance between the two tokens
-> small penalty for near tokens, large penalty for far tokens
Giving each head a different slope m differentiates them so that some heads focus on near tokens and others on far ones. ALiBi's strength is that it extrapolates relatively well to inputs much longer than training. It is not adopted as universally as RoPE, however, and the choice varies by model design.
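As a concrete illustration, the sketch below builds per-head slopes and the bias matrix for one head. It assumes the geometric slope schedule described in the ALiBi paper for a power-of-two head count and a causal (decoder) mask; the function names are illustrative.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ...; assumes n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(n_tokens, slope):
    """Causal bias: -slope * (i - j) for keys j <= i, -inf for future keys."""
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])

print(slopes[0], slopes[-1])  # 0.5 0.00390625
print(bias[3])                # [-1.5, -1.0, -0.5, 0.0]
```

The bias is added to `q_i · k_j` before the softmax. The head with slope 0.5 strongly prefers nearby tokens, while the head with slope 1/256 applies almost no penalty and can look far back.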
6. Comparison Summary
| Method | Injection point | Relative/absolute | Extra parameters | Extrapolation | Representative use |
| --- | --- | --- | --- | --- | --- |
| Sinusoidal | added to embedding | absolute | none | limited | original paper |
| Learned | added to embedding | absolute | yes | impossible | early BERT/GPT |
| Relative position | attention score | relative | depends on method | good | Transformer-XL, etc. |
| RoPE | rotate Q, K | relative | none | limited alone; good with scaling (NTK/YaRN) | Llama, Qwen, etc. |
| ALiBi | attention score bias | relative | none | good | some models |
7. Length Extrapolation and Context Extension
The Problem
What happens if you train a model with a 4K context but feed it 32K input? For RoPE, large rotation angles never seen during training appear, the attention pattern collapses, and performance drops sharply. This is called extrapolation failure. Extending context without retraining is a major practical concern.
NTK Scaling
NTK-aware scaling raises RoPE's base (the frequency reference value) so that rotation angles grow more slowly with position. Intuitively, it stretches the position axis so that longer inputs fall back within the angle range seen in training. Because the change is concentrated in the low-frequency pairs, it extends context while limiting the loss of high-frequency information.
intuition of NTK-aware scaling:
adjust RoPE base from 10000 -> a larger value
-> rotation angle per dimension becomes gentler
-> positions longer than training map closer to inside the training distribution
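The last point in the intuition above can be checked with a small calculation. This sketch assumes one commonly used rule for the new base, base * s^(D / (D - 2)), with head dimension D = 128 and a 4K to 16K extension; all variable names are illustrative.

```python
import math

head_dim, base = 128, 10000.0
train_len, target_len = 4096, 16384
scale = target_len / train_len                      # s = 4

# Assumed NTK-aware rule: base * s^(D / (D - 2))
new_base = base * scale ** (head_dim / (head_dim - 2))

slowest = head_dim // 2 - 1                         # index of the lowest-frequency pair
exponent = -2.0 * slowest / head_dim

angle_at_train_len = train_len * base ** exponent            # largest angle seen in training
angle_unscaled = target_len * base ** exponent               # 4x beyond anything seen
angle_ntk = target_len * new_base ** exponent                # pulled back into range

back_in_range = math.isclose(angle_ntk, angle_at_train_len, rel_tol=1e-9)
```

With this rule the slowest pair at position 16384 turns by the same angle it reached at position 4096 during training, whereas the unscaled model would see an angle four times larger.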
YaRN
YaRN advances the NTK family by adjusting the interpolation/extrapolation strength differently per frequency band and also correcting the attention temperature. It preserves high frequencies (near-distance information) while stretching low frequencies (far distance), designed to maintain quality even at longer contexts. Combined with a small amount of additional fine-tuning, it works well.
summary of context-extension strategies:
position interpolation (PI): compress position indices by a ratio -> simple but damages high frequencies
NTK-aware: adjust base -> improves high-frequency preservation
YaRN: per-band correction + temperature adjustment -> favorable for longer extension
| Technique | Core | Retraining | Strength | Caution |
| --- | --- | --- | --- | --- |
| Position interpolation (PI) | compress position indices | light fine-tuning recommended | simple | damages high-frequency info |
| NTK-aware | adjust base | none to light | preserves high frequencies | limited for big extension |
| YaRN | per-band correction + temperature | small amount recommended | strong for long extension | complex to implement |
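The per-band correction of YaRN can be sketched as a ramp over "how many full turns a pair completes within the training window". The code below is a simplified reading of the YaRN paper, assuming its suggested ramp bounds alpha = 1 and beta = 32 and its 0.1 * ln(s) + 1 scaling factor; the function names are illustrative and this is not a drop-in replacement for any library implementation.

```python
import math

def yarn_frequencies(head_dim, train_len, scale, base=10000.0, alpha=1.0, beta=32.0):
    """Blend original and interpolated frequencies per pair."""
    freqs = []
    for j in range(head_dim // 2):
        theta = base ** (-2.0 * j / head_dim)
        rotations = train_len * theta / (2.0 * math.pi)   # full turns inside the training window
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        # gamma = 1: keep theta (high frequency); gamma = 0: interpolate to theta / scale
        freqs.append(gamma * theta + (1.0 - gamma) * theta / scale)
    return freqs

def yarn_attention_scale(scale):
    """Factor applied to q and k (equivalently to cos/sin) as the temperature correction."""
    return 0.1 * math.log(scale) + 1.0

freqs = yarn_frequencies(head_dim=128, train_len=4096, scale=4.0)
attn_scale = yarn_attention_scale(4.0)
```

The fastest pair keeps its original frequency, the slowest pair is fully interpolated as in PI, and the pairs in between transition linearly, which matches the "preserve high, stretch low" description above.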
8. Multimodal M-RoPE
In vision-language models (VLMs) that handle images or video, position is not one-dimensional. An image has rows and columns; a video adds a time axis. M-RoPE (Multimodal RoPE) decomposes RoPE across multiple axes, distributing rotation to dimensions like time/height/width. Models such as Qwen2-VL use this multi-axis positional encoding to handle arbitrary-resolution inputs.
intuition of M-RoPE:
text: 1D position (order)
image: 2D position (row, column)
video: 3D position (time, row, column)
split RoPE dimensions by axis and rotate by each axis's position
-> handle multiple modalities with one unified positional encoding
This lets text and image tokens attend together while preserving their relative relationships within the same positional system.
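The axis split can be made concrete with a toy angle computation. This sketch assumes that consecutive groups of 2D pairs are assigned to the time, height, and width axes in that order and that text tokens use the same index on all three axes; the `sections` split of (2, 3, 3) for a head dimension of 16 is hypothetical and is not the configuration of any released model.

```python
def mrope_angles(position, sections, base=10000.0):
    """position = (t, h, w); sections = number of 2D pairs assigned to each axis."""
    half = sum(sections)
    angles = []
    j = 0
    for axis_pos, n_pairs in zip(position, sections):
        for _ in range(n_pairs):
            angles.append(axis_pos * base ** (-j / half))   # base^(-2j / D) with D = 2 * half
            j += 1
    return angles

sections = (2, 3, 3)                              # hypothetical split for D = 16

text_token = mrope_angles((5, 5, 5), sections)    # text: all axes share the 1D index
patch_a = mrope_angles((0, 2, 3), sections)       # image patch at row 2, column 3
patch_b = mrope_angles((0, 2, 4), sections)       # one cell to the right

same_time_and_row = patch_a[:5] == patch_b[:5]    # time and height pairs are identical
width_differs = patch_a[5:] != patch_b[5:]        # only the width pairs change
```

For a text token the result is identical to ordinary 1D RoPE at position 5, which is how a single scheme can cover text and images at once.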
8.5. Why Absolute Positional Encoding Is Weak at Extrapolation
Here we summarize why the relative-position family (RoPE, ALiBi) is strong at extrapolation, contrasted with the limits of absolute positional encoding.
extrapolation weaknesses of absolute positional encoding:
sinusoidal -> position patterns beyond the training length were never seen in training
values are computable in theory, but the model was not trained to
interpret those patterns, so real performance drops
learned -> there is no embedding at all for positions beyond training length -> extrapolation impossible
why the relative-position family is strong:
the same distance relationship is represented identically regardless of position
-> the meaning of "10 apart" is the same whether at position 5 or 5000
-> distance relationships seen in training can be reused on long inputs
That said, relative position is not a panacea either. Even RoPE drops in performance on inputs much longer than training, as large rotation angles fall outside the training distribution. That is exactly why corrections like NTK/YaRN are needed. The key lesson is that "relative position gives a good starting point for extrapolation, but large extensions almost always need additional correction."
8.7. Arbitrary Resolution and Positional Encoding
In multimodality, positional encoding is directly tied to handling arbitrary resolution. When image sizes vary, a fixed positional grid struggles to handle diverse resolutions consistently.
limits of a fixed positional grid:
if you learn positions for the resolution seen in training (e.g., 224x224)
-> the meaning of each position shifts on inputs of a different resolution (e.g., 1024x768)
advantage of multi-axis relative position (M-RoPE):
encode position as the relative relationship of coordinates like (row, column)
-> even when resolution changes, the meaning of "one cell to the right" is preserved
-> handles arbitrary-resolution inputs more naturally
Approaches like Qwen2-VL's naive dynamic resolution do not force-resize images to a fixed size; they process them while preserving the original aspect ratio and resolution, maintaining spatial relationships with multi-axis positional encoding. This is especially advantageous for tasks where fine spatial information matters, such as OCR and document understanding.
9. RoPE Rotation Through a Small Example
If RoPE feels abstract, rotating a single 2D pair yourself builds intuition. Say the dimension is D=4 (that is, two 2D pairs).
vector x = [x0, x1, x2, x3]
group into 2D pairs: (x0, x1), (x2, x3)
at position m=2, taking theta_0 = 0.5 for the first pair purely for easy arithmetic (the formula in section 4 gives theta_0 = 1):
after rotation:
x0' = x0 * cos(2*0.5) - x1 * sin(2*0.5)
x1' = x0 * sin(2*0.5) + x1 * cos(2*0.5)
the second pair rotates at a slower frequency (smaller theta_1)
-> even at the same position, each pair turns by a different angle
The key is that the same vector rotates by a different angle depending on position m, and computing the Query-Key dot product of two tokens leaves only the difference of rotation angles (i.e., the position difference m-n). This is why applying rotation by absolute position shows up as relative position in the result.
def rope_dot_demo(q, k, pos_q, pos_k, base=10000):
    # q, k: (D,) tensors; reuses apply_rope from section 4 (requires import torch).
    # Even for the same token content, different positions change the dot product.
    qr = apply_rope(q.unsqueeze(0), torch.tensor([pos_q]), base)
    kr = apply_rope(k.unsqueeze(0), torch.tensor([pos_k]), base)
    return (qr * kr).sum()  # depends only on q, k, and pos_q - pos_k
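The relative-position property can also be verified without PyTorch. The following is a minimal sketch in plain Python that rotates adjacent pairs, as in the D=4 walkthrough above; `rope_rotate`, `rope_score`, and the sample vectors are illustrative.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate adjacent pairs (vec[2j], vec[2j+1]) by pos * base^(-2j / D)."""
    d = len(vec)
    out = []
    for j in range(d // 2):
        angle = pos * base ** (-2.0 * j / d)
        a, b = vec[2 * j], vec[2 * j + 1]
        out += [a * math.cos(angle) - b * math.sin(angle),
                a * math.sin(angle) + b * math.cos(angle)]
    return out

def rope_score(q, k, pos_q, pos_k):
    """Dot product of the rotated query and the rotated key."""
    return sum(x * y for x, y in zip(rope_rotate(q, pos_q), rope_rotate(k, pos_k)))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near_start = rope_score(q, k, 3, 1)        # offset 2 near the start
far_along = rope_score(q, k, 103, 101)     # same offset, shifted by 100 positions
other_offset = rope_score(q, k, 3, 2)      # different offset
```

`near_start` and `far_along` agree up to floating-point error, while `other_offset` is clearly different: the score follows the offset, not the absolute positions.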
10. Getting a Feel for Context Extension with Numbers
Say you want to extend a model trained at 4K to 16K. The extension factor is 4x. Let us see intuitively how each technique handles this factor.
extension factor s = target length / training length = 16K / 4K = 4
position interpolation (PI):
compress all position indices by 1/s (pos -> pos / 4)
-> falls within the training range, but the angular spacing between adjacent
tokens also shrinks by 1/4, blurring near-distance resolution (high frequency)
NTK-aware:
raise base (e.g., 10000 -> about 10000 * s^(D/(D-2))), which leaves the highest frequencies
nearly untouched and stretches mainly the low frequencies
-> retains short-context accuracy better than PI
YaRN:
split frequency bands: leave high frequencies as is, interpolate low frequencies,
smoothly transition the middle band + correct attention temperature
-> stable even at large factors, large effect when combined with a little fine-tuning
Seeing it numerically makes clear why simple position interpolation loses out on short context: compressing all frequencies equally damages most the high-frequency information that distinguishes nearby tokens. NTK and YaRN reduce this loss by differentiating "which frequency to touch and how much."
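To put numbers on "which frequency to touch and how much", the sketch below compares how far each pair turns between two adjacent tokens under PI and under NTK-aware scaling, relative to the original model. It assumes D = 128, s = 4, and the base rule 10000 * s^(D / (D - 2)); the names are illustrative.

```python
def angle_step(j, head_dim, base=10000.0, pos_scale=1.0):
    """Angle that pair j advances between two adjacent tokens."""
    return pos_scale * base ** (-2.0 * j / head_dim)

head_dim, s = 128, 4.0
ntk_base = 10000.0 * s ** (head_dim / (head_dim - 2))

table = {}
for j in (0, 21, 42, 63):
    original = angle_step(j, head_dim)
    pi_ratio = angle_step(j, head_dim, pos_scale=1.0 / s) / original   # PI: pos -> pos / s
    ntk_ratio = angle_step(j, head_dim, base=ntk_base) / original      # NTK-aware: larger base
    table[j] = (round(pi_ratio, 3), round(ntk_ratio, 3))

for j, (pi_ratio, ntk_ratio) in table.items():
    print(f"pair {j:2d}: PI keeps {pi_ratio} of the step, NTK keeps {ntk_ratio}")
```

PI keeps only a quarter of the step for every pair, including the fastest one that separates neighboring tokens. NTK-aware scaling keeps the full step for pair 0 and falls smoothly to a quarter only at the slowest pair (roughly 0.63 and 0.40 for the two pairs in between).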
11. The Procedure for Extending Context in Practice
Here is the general order when actually extending a model's context.
1) decide the target length and extension factor (e.g., 4K -> 32K, 8x)
2) check the model's positional encoding (is it RoPE, what is the base value)
3) choose an extension technique:
- small extension + retraining impossible -> NTK-aware (adjust base)
- large extension + a little fine-tuning possible -> YaRN
4) apply the chosen scaling and do inference/fine-tuning
5) evaluate long context:
- check per-position recall with needle-in-a-haystack
- regression-check that short-context performance has not degraded
6) at serving time, manage the increased KV memory with a paged KV cache
The regression check in step 5 is especially important. Context-extension techniques almost always trade off against short-context performance, so do not only look at whether long context recalls well — also verify that the short tasks the model was originally good at have not broken.
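The needle-in-a-haystack check in step 5 can be sketched as a tiny harness. Everything here is a stand-in: `generate` is a hypothetical callable mapping a prompt string to the model's output string, and `short_window_model` is a fake model that only reads the end of its prompt, used to show what a failing report looks like.

```python
def build_haystack(filler, needle, n_sentences, depth):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)

def needle_recall(generate, answer, n_sentences, depths):
    """Return {depth: True/False} for whether the answer was recalled at each depth."""
    needle = f"The secret code is {answer}."
    results = {}
    for depth in depths:
        context = build_haystack("The sky was clear that day.", needle, n_sentences, depth)
        prompt = context + "\nQuestion: What is the secret code?\nAnswer:"
        results[depth] = answer in generate(prompt)
    return results

# Stand-in "model" that only sees the last 2000 characters of its prompt
def short_window_model(prompt):
    window = prompt[-2000:]
    return "7421" if "7421" in window else "unknown"

report = needle_recall(short_window_model, "7421", n_sentences=200, depths=[0.0, 0.5, 1.0])
print(report)  # {0.0: False, 0.5: False, 1.0: True}
```

Running the same harness with a short haystack before and after applying the scaling gives a simple form of the short-context regression check.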
12. Connecting Positional Encoding with the KV Cache and Attention
The choice of positional encoding is intertwined with attention and KV cache optimization.
RoPE applies only to Q, K
-> the rotation is already reflected in the K stored in the KV cache
-> sharing KV heads as in GQA/MQA does not conflict with RoPE
ALiBi adds a bias to the attention score
-> no need to carry a separate positional embedding in the KV cache
-> simpler on the memory side
length extrapolation (NTK/YaRN) only changes the RoPE angle computation
-> the KV cache structure stays the same -> easy to be compatible with the serving stack
There are strong practical reasons — not just quality — that the RoPE family is widely used: it fits cleanly with serving optimizations like GQA/MQA and a paged KV cache. When choosing positional encoding, it is wise to weigh not only model quality but also compatibility with the entire serving pipeline.
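Since KV memory grows linearly with context length, a quick estimate helps when planning an extension. The sketch below uses the usual accounting of one K and one V vector per layer, per KV head, per token; the 32-layer, head-dimension-128, fp16 configuration is a hypothetical example, not a specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """K and V per layer: 2 * n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

GIB = 1024 ** 3

# Hypothetical 32-layer model, head_dim 128, fp16 cache, a single sequence
mha_4k = kv_cache_bytes(4096, 32, 32, 128) / GIB      # 2.0 GiB with 32 KV heads
mha_32k = kv_cache_bytes(32768, 32, 32, 128) / GIB    # 16.0 GiB after an 8x extension
gqa_32k = kv_cache_bytes(32768, 32, 8, 128) / GIB     # 4.0 GiB with 8 shared KV heads (GQA)
```

An 8x context extension multiplies the cache by 8, and sharing KV heads claws most of that back, which is why RoPE's compatibility with GQA/MQA matters in practice.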
Pitfalls and Troubleshooting
- **Not adjusting base when extending context**: Using a RoPE model on longer input as-is causes a sharp quality drop from extrapolation failure. Always apply an extension technique like NTK/YaRN.
- **High-frequency damage from position interpolation**: Simple position interpolation (PI) damages the high frequencies responsible for near-distance information, which can reduce short-context accuracy.
- **Attempting extrapolation with learned PE**: Learned absolute positional encoding has no embedding beyond the maximum length, so extrapolation is fundamentally impossible. RoPE/ALiBi families suit long context.
- **Applying RoPE to V**: RoPE must be applied only to Q and K for the relative-position property to hold. Rotating Value as well breaks the meaning.
- **Odd-dimension problem**: RoPE groups dimensions into 2D pairs, so the head dimension must be even. An odd dimension breaks the pairing.
- **Axis-allocation mistakes in multimodal**: Misallocating dimensions to the time/row/column axes in M-RoPE distorts the spatial relationships of an image. Follow the model implementation's axis-splitting rules.
12.5. RoPE Application Point and Caching Details
Let us note a detail that often confuses people when putting RoPE into an inference pipeline. RoPE applies to Q and K, but where you place its application changes what the KV cache stores.
what to store in the KV cache:
typically store the K after applying RoPE in the cache
-> the cached K already has the rotation reflected
-> rotate the new token's Q to its position as well, then take the dot product
-> only the position difference remains, so relative position works naturally
caution:
storing pre-rotation K in the cache and rotating every time wastes compute + risks inconsistency
-> the standard implementation usually stores the post-rotation K
This flow is preserved when applying context-extension techniques. NTK and YaRN only change the rotation-angle computation at each position, so you can keep the KV cache storage structure and the attention kernel itself the same and just swap the position scaling. This is yet another reason the RoPE family fits well with the serving stack.
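The caching flow described above can be traced in a small decode-step sketch. It is plain Python with adjacent-pair rotation; `rotate`, `k_cache`, and the sample vectors are illustrative, and a real implementation would do this per layer and per head on tensors.

```python
import math

def rotate(vec, pos, base=10000.0):
    """RoPE rotation of adjacent pairs by pos * base^(-2j / D)."""
    d = len(vec)
    out = []
    for j in range(d // 2):
        angle = pos * base ** (-2.0 * j / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * j], vec[2 * j + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

keys = [[0.2, 1.0, -0.4, 0.3], [0.9, -0.1, 0.5, 0.5], [-0.7, 0.6, 0.1, -0.2]]

# Prefill: rotate each K once at its own position and store the result
k_cache = [rotate(k, pos) for pos, k in enumerate(keys)]

# Decode: rotate only the new token's Q, then score against the cached K
new_q, new_pos = [0.5, -0.3, 0.8, 0.1], 3
q_rot = rotate(new_q, new_pos)
cached_scores = [dot(q_rot, k) for k in k_cache]

# Reference: rotate Q by the position difference and leave K unrotated
relative_scores = [dot(rotate(new_q, new_pos - p), k) for p, k in enumerate(keys)]

matches = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(cached_scores, relative_scores))
```

The cached path and the purely relative reference give the same scores, so the cache never needs to be rewritten as decoding proceeds, provided the rotation parameters stay fixed for the whole sequence.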
13. Practical FAQ
Here we collect questions you frequently run into when applying positional encoding.
Q. If I will only use it up to the training length, does the choice of positional encoding matter?
A. Mostly no, but relative position (RoPE/ALiBi) tends to be slightly better for generalization.
If there is any chance of extending context later, the RoPE family is the safe bet.
Q. If I change the RoPE base value, do I have to retrain the existing weights?
A. A small adjustment (NTK-aware) works to some degree without retraining.
Large extensions are stable when accompanied by a little fine-tuning (e.g., YaRN).
Q. Should I choose ALiBi or RoPE?
A. Most of the open-LLM ecosystem centers on RoPE, so RoPE is the safe default for tooling/compatibility.
If you need a specific extrapolation property, ALiBi is also a candidate.
Q. For multimodal, is it always M-RoPE?
A. It is advantageous for images/video that need multi-axis position. Follow the axis split the model implementation defines.
14. A Core Summary Checklist
Compressing this long article into one page:
[ ] self-attention has no sense of order -> injecting positional information is essential
[ ] sinusoidal: no parameters, limited extrapolation
[ ] learned: fits the data, no extrapolation
[ ] relative position: distance-based -> good for generalization/extrapolation
[ ] RoPE: encodes relative position by rotating Q,K, no extra parameters, current standard
[ ] ALiBi: distance-proportional penalty, good extrapolation
[ ] length extrapolation: PI < NTK-aware < YaRN (in order of extension range/quality)
[ ] M-RoPE: RoPE across multiple axes -> multimodal/arbitrary resolution
[ ] after extension, always run a short-context regression check
Closing
Positional encoding is the device that compensates for the fundamental limitation that self-attention has no sense of order. Starting from sine waves and moving through learned and relative positions, the mainstream today is RoPE, which elegantly encodes relative position via rotation, alongside ALiBi, which applies a distance penalty. Extrapolation techniques such as NTK-aware scaling and YaRN have opened the way to extend context with little or no retraining, while M-RoPE carries the same idea into multimodality.
Understanding positional encoding lets you answer practical questions such as "why does this model collapse on long context" and "how do I extend context safely." Read together with the previous two articles (Transformer structure, the evolution of attention), the core skeleton of the modern LLM connects into one.
References
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (arXiv:2104.09864): https://arxiv.org/abs/2104.09864
- Press et al., "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (arXiv:2108.12409): https://arxiv.org/abs/2108.12409
- Peng et al., "YaRN: Efficient Context Window Extension of Large Language Models" (arXiv:2309.00071): https://arxiv.org/abs/2309.00071
- Vaswani et al., "Attention Is All You Need" (arXiv:1706.03762): https://arxiv.org/abs/1706.03762
- Wang et al., "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution" (arXiv:2409.12191): https://arxiv.org/abs/2409.12191
- Hugging Face Transformers docs: https://huggingface.co/docs/transformers/index
- PyTorch official docs: https://pytorch.org/docs/stable/index.html
- Qwen model repository: https://github.com/QwenLM