- 1. Introduction
- 2. The Memory Problem of Standard Attention
- 3. GPU Memory Hierarchy
- 4. IO Complexity Analysis
- 5. Tiling: Block-wise Computation That Fits in SRAM
- 6. Online Softmax (Safe Softmax) Algorithm
- 7. Backward Pass Recomputation Strategy
- 8. FlashAttention-2 Improvements
- 9. FlashAttention-3: Latest Advances
- 10. Benchmarks: Speed and Memory Comparison
- 11. Integration with PyTorch torch.nn.functional.scaled_dot_product_attention
- 12. Summary and Key Takeaways
- References
1. Introduction
Self-Attention, the core component of the Transformer architecture, computes relationships between all token pairs in a sequence. While this operation provides powerful representational capacity, it suffers from a fundamental limitation: both time and memory complexity grow as $O(N^2)$ with respect to the sequence length $N$. For state-of-the-art LLMs such as GPT-4, LLaMA, and Gemini to handle long contexts exceeding 128K tokens, this bottleneck must be effectively addressed.
FlashAttention (Dao et al., 2022) solves this problem without any approximation. The core idea is simple yet profound: rather than reducing the computational cost of the attention operation itself, it minimizes data movement (IO) between GPU memory hierarchy levels. In this article, we systematically analyze the principles of FlashAttention from a GPU hardware perspective and trace its evolution through FlashAttention-2 and FlashAttention-3.
2. The Memory Problem of Standard Attention
2.1 Standard Attention Computation Flow
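Standard Self-Attention is computed as follows. Given inputs $Q, K, V \in \mathbb{R}^{N \times d}$:
$$S = QK^\top \in \mathbb{R}^{N \times N}, \qquad P = \mathrm{softmax}(S) \in \mathbb{R}^{N \times N}, \qquad O = PV \in \mathbb{R}^{N \times d}$$
Here, $N$ is the sequence length and $d$ is the head dimension. The softmax is applied row-wise; the $1/\sqrt{d}$ scaling is omitted here, as in the pseudocode of Section 5.2.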
2.2 Memory Complexity Analysis
The crux of the problem lies in the intermediate matrices $S$ and $P$. These matrices are of size $N \times N$, requiring quadratic memory with respect to sequence length. To put this in concrete numbers:
| Sequence Length ($N$) | Attention Matrix Size | FP16 Memory |
|---|---|---|
| 1,024 | 1M elements | 2 MB |
| 4,096 | 16.7M elements | 33 MB |
| 16,384 | 268M elements | 536 MB |
| 65,536 | 4.3B elements | 8.6 GB |
| 131,072 | 17.2B elements | 34.4 GB |
These figures are for a single head and a single batch element. In multi-head attention, the total is multiplied by the number of heads and the batch size, so actual memory consumption is far larger. At sequence length 65,536, the matrix for a single head already takes 8.6 GB, more than a tenth of the HBM on an A100 80GB GPU.
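The table entries follow directly from $N^2$ elements at 2 bytes each. A minimal sketch that reproduces them and scales to several heads and batch elements; the function names and the 32-head, batch-8 configuration are assumptions chosen only for illustration:

```python
def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to store one N x N attention matrix (FP16 by default)."""
    return seq_len * seq_len * bytes_per_element


def total_attention_bytes(seq_len: int, num_heads: int, batch_size: int,
                          bytes_per_element: int = 2) -> int:
    """Bytes for one N x N matrix per head and per batch element."""
    return batch_size * num_heads * attention_matrix_bytes(seq_len, bytes_per_element)


for n in (1_024, 4_096, 16_384, 65_536, 131_072):
    print(f"N={n:>7,}: {attention_matrix_bytes(n) / 1e6:>10,.1f} MB per head")

# Hypothetical configuration: 32 heads, batch size 8, N = 4,096.
print(f"{total_attention_bytes(4_096, num_heads=32, batch_size=8) / 1e9:.1f} GB")
```

Under these assumed settings, even a modest 4K context needs several gigabytes for a single materialized score matrix per layer.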
2.3 HBM Bottleneck
In the standard attention implementation, these matrices are materialized in GPU HBM (High Bandwidth Memory). That is, $S = QK^\top$ is computed and written to HBM, then read back for softmax, the result $P$ is written to HBM, and then read back for $O = PV$. The total number of HBM reads and writes in this process is $O(Nd + N^2)$.
The real reason this operation is slow on actual GPUs is that memory access, not compute, is the bottleneck. The A100 GPU delivers 312 TFLOPS (FP16) of compute throughput, while its HBM bandwidth is only about 2 TB/s. Attention is a classic memory-bound operation due to its low arithmetic intensity (ratio of compute to memory access).
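A rough roofline-style check makes the memory-bound claim concrete. The sketch below assumes only matmul FLOPs are counted, that $S$ and $P$ are each written once and read once, and that $N = 4096$ and $d = 64$ (all illustrative assumptions); the bandwidth figure is the A100 80GB SXM value:

```python
PEAK_FLOPS = 312e12        # A100 FP16 Tensor Core peak, FLOP/s
HBM_BANDWIDTH = 2.039e12   # A100 80GB SXM HBM bandwidth, bytes/s


def standard_attention_intensity(n: int, d: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte of HBM traffic for unfused attention (matmul FLOPs only)."""
    flops = 4 * n * n * d                 # 2*N^2*d for QK^T plus 2*N^2*d for PV
    elements = 4 * n * d + 4 * n * n      # Q, K, V, O once; S and P written and read
    return flops / (elements * bytes_per_element)


ridge = PEAK_FLOPS / HBM_BANDWIDTH        # intensity needed to become compute-bound
intensity = standard_attention_intensity(n=4096, d=64)
print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"standard attention: {intensity:.1f} FLOPs/byte")
print("memory-bound" if intensity < ridge else "compute-bound")
```

For large $N$ this intensity tends to $d/2$ FLOPs per byte under the stated counting, well below the ridge point, so the Tensor Cores spend most of their time waiting on HBM.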
3. GPU Memory Hierarchy
Understanding FlashAttention requires precise knowledge of the GPU memory hierarchy.
3.1 HBM (High Bandwidth Memory)
- Capacity: 40GB or 80GB on the A100
- Bandwidth: Approximately 1.5-2.0 TB/s (A100 80GB SXM: 2,039 GB/s)
- Access latency: Approximately 200-600 cycles
- Role: The GPU's main memory. All data including model parameters, input tensors, and output tensors are stored here
3.2 SRAM (On-chip Shared Memory)
- Capacity: Approximately 192KB per SM on the A100, approximately 20MB total (108 SMs)
- Bandwidth: Approximately 19 TB/s
- Access latency: Approximately 20-30 cycles
- Role: High-speed on-chip memory within each Streaming Multiprocessor (SM)
3.3 The Critical Asymmetry
A dramatic asymmetry exists between SRAM and HBM:
| Property | SRAM | HBM |
|---|---|---|
| Bandwidth | ~19 TB/s | ~2 TB/s |
| Capacity | ~20 MB | 40-80 GB |
| Access Latency | 20-30 cycles | 200-600 cycles |
SRAM is approximately 10x faster than HBM, but approximately 2,000-4,000x smaller in capacity (about 20 MB versus 40-80 GB). FlashAttention's key insight is to actively exploit this asymmetry: instead of materializing the entire $N \times N$ matrix in HBM, performing computations in small blocks that fit in SRAM can dramatically reduce HBM accesses.
4. IO Complexity Analysis
4.1 IO Complexity of Standard Attention
Standard attention exhibits the following HBM access pattern:
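- Read $Q, K$ from HBM, compute $S = QK^\top$, write $S$ to HBM: $O(Nd + N^2)$ IO
- Read $S$ from HBM, compute $P = \mathrm{softmax}(S)$, write $P$ to HBM: $O(N^2)$ IO
- Read $P, V$ from HBM, compute $O = PV$, write $O$ to HBM: $O(Nd + N^2)$ IO
Total HBM access:
$$O(Nd + N^2)$$
Since the sequence length $N$ is typically much larger than the head dimension $d$ (usually 64 or 128), the $N^2$ term dominates.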
4.2 FlashAttention's IO Complexity
FlashAttention reduces HBM access through tiling to:
$$O\left(\frac{N^2 d^2}{M}\right)$$
where $M$ is the SRAM size. Intuitively, larger SRAM allows processing larger blocks at once, reducing HBM accesses.
4.3 Optimality Proof (Lower Bound)
The paper goes further to prove the following lower bound:
Theorem: No algorithm can compute exact attention with $o(N^2 d^2 / M)$ HBM accesses for all SRAM sizes $M$ in the range $d \le M \le Nd$.
This means FlashAttention is asymptotically optimal in terms of IO complexity across this range of SRAM sizes. No exact attention algorithm can beat its HBM access count by more than a constant factor for every $M$ in the range.
4.4 Numerical Example
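For the A100, take SRAM size $M$ (up to roughly 192 KB per SM), head dimension $d$ (typically 64 or 128), and sequence length $N$:
- Standard attention IO: $\Theta(Nd + N^2)$ elements
- FlashAttention IO: $\Theta(N^2 d^2 / M)$ elements (varies with block size)
Because $d^2$ is many times smaller than $M$ for these values, FlashAttention needs many times fewer HBM accesses. In practice, since the $N \times N$-sized intermediate matrices are never written to HBM at all, the savings are even greater. The benefits become particularly pronounced as sequence length increases.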
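To get a feel for the two bounds, the sketch below plugs in illustrative values. Everything numeric is an assumption made for this example: 192 KB of SRAM counted as 2-byte elements, $d = 64$, and constant factors ignored, so the results indicate orders of magnitude rather than measured traffic:

```python
def standard_io(n: int, d: int) -> float:
    """Asymptotic HBM accesses of standard attention, Theta(N*d + N^2), in elements."""
    return n * d + n * n


def flash_io(n: int, d: int, sram_elements: int) -> float:
    """Asymptotic HBM accesses of FlashAttention, Theta(N^2 * d^2 / M), in elements."""
    return n * n * d * d / sram_elements


# Assumed values for illustration only.
SRAM_ELEMENTS = 192 * 1024 // 2   # 192 KB of SRAM holding 2-byte (FP16) elements
D = 64
for n in (1_024, 4_096, 16_384):
    std, fa = standard_io(n, D), flash_io(n, D, SRAM_ELEMENTS)
    print(f"N={n:>6,}: standard ~{std:,.0f}  flash ~{fa:,.0f}  ratio ~{std / fa:.1f}x")
```

Under these assumptions the ratio approaches $M / d^2 = 24$ as $N$ grows, which is the sense in which a larger SRAM or a smaller head dimension widens the gap.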
5. Tiling: Block-wise Computation That Fits in SRAM
5.1 Algorithm Overview
The core FlashAttention algorithm works as follows:
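- Partition $Q$ into $T_r = \lceil N / B_r \rceil$ blocks: $Q_1, \dots, Q_{T_r}$, each of size $B_r \times d$
- Partition $K, V$ into $T_c = \lceil N / B_c \rceil$ blocks: $K_1, \dots, K_{T_c}$ and $V_1, \dots, V_{T_c}$, each of size $B_c \times d$
- Block sizes are set to fit in SRAM of size $M$: $B_c = \lceil M / (4d) \rceil$, $B_r = \min(\lceil M / (4d) \rceil, d)$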
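The partitioning above can be sketched numerically. The helper names are hypothetical, and the SRAM size (192 KB counted in FP16 elements), $d = 64$, and $N = 4096$ are assumptions for illustration:

```python
import math


def block_sizes(sram_elements: int, d: int) -> tuple[int, int]:
    """Return (B_r, B_c) following B_c = ceil(M / 4d), B_r = min(ceil(M / 4d), d)."""
    b_c = math.ceil(sram_elements / (4 * d))
    b_r = min(b_c, d)
    return b_r, b_c


def num_blocks(n: int, b_r: int, b_c: int) -> tuple[int, int]:
    """Return (T_r, T_c): the number of Q blocks and of K/V blocks."""
    return math.ceil(n / b_r), math.ceil(n / b_c)


M = 192 * 1024 // 2  # assumed: 192 KB of SRAM counted in FP16 elements
b_r, b_c = block_sizes(M, d=64)
t_r, t_c = num_blocks(4096, b_r, b_c)
print(f"B_r={b_r}, B_c={b_c}, T_r={t_r}, T_c={t_c}")
```

Real kernels pick block sizes per GPU and head dimension by tuning, so these formula-derived values are a starting point rather than what a production kernel necessarily uses.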
5.2 Forward Pass Pseudocode
Algorithm: FlashAttention Forward Pass
---------------------------------------
Input: Q, K, V in HBM, SRAM size M
Output: O in HBM
1. Set block sizes: B_c = ceil(M / (4d)), B_r = min(ceil(M / (4d)), d)
2. Initialize O = zeros(N, d), l = zeros(N), m = -inf * ones(N) in HBM
3. for j = 1 to T_c:                      # Outer loop: K, V blocks
       Load K_j, V_j from HBM to SRAM
       for i = 1 to T_r:                  # Inner loop: Q blocks
           Load Q_i, O_i, l_i, m_i from HBM to SRAM
           # Perform block-wise computation in SRAM
           S_ij = Q_i @ K_j^T             # (B_r x B_c)
           m_ij = rowmax(S_ij)
           P_ij = exp(S_ij - m_ij)
           l_ij = rowsum(P_ij)
           # Combine with statistics from previous blocks (Online Softmax)
           m_new = max(m_i, m_ij)
           l_new = exp(m_i - m_new) * l_i + exp(m_ij - m_new) * l_ij
           # Update output: undo the old normalization, rescale both terms
           # to the new maximum, then normalize by the new sum
           O_i = diag(l_new)^(-1) * (diag(l_i) * exp(m_i - m_new) * O_i
                                     + exp(m_ij - m_new) * P_ij @ V_j)
           # Update statistics
           m_i = m_new, l_i = l_new
           Write O_i, l_i, m_i back to HBM
       end for
   end for
4. return O
5.3 Why This Works
The key point is that the attention matrices $S$ and $P$ are never materialized in HBM. Each block $S_{ij}$ is computed within SRAM, immediately used for softmax statistics updates and output accumulation, and then discarded.
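The block-wise forward pass can be checked against a reference on small inputs. The following is a NumPy sketch of the tiling logic only, not the CUDA kernel: ordinary arrays stand in for HBM and SRAM, the block sizes are arbitrary assumptions, and the output update rescales the old accumulator by `exp(m_old - m_new)` before renormalizing:

```python
import numpy as np


def standard_attention(q, k, v):
    """Reference: materializes the full N x N matrices S and P."""
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def tiled_attention(q, k, v, block_r=4, block_c=8):
    """Block-wise attention that only ever holds one B_r x B_c tile of S."""
    n, d = q.shape
    o = np.zeros((n, d))
    l = np.zeros((n, 1))            # running softmax denominators
    m = np.full((n, 1), -np.inf)    # running row maxima
    for j in range(0, n, block_c):              # outer loop over K, V blocks
        k_j, v_j = k[j:j + block_c], v[j:j + block_c]
        for i in range(0, n, block_r):          # inner loop over Q blocks
            rows = slice(i, i + block_r)
            s_ij = q[rows] @ k_j.T
            m_ij = s_ij.max(axis=1, keepdims=True)
            p_ij = np.exp(s_ij - m_ij)
            l_ij = p_ij.sum(axis=1, keepdims=True)
            m_new = np.maximum(m[rows], m_ij)
            alpha = np.exp(m[rows] - m_new)     # rescales the old statistics
            beta = np.exp(m_ij - m_new)         # rescales the new tile
            l_new = alpha * l[rows] + beta * l_ij
            o[rows] = (l[rows] * alpha * o[rows] + beta * (p_ij @ v_j)) / l_new
            m[rows], l[rows] = m_new, l_new
    return o


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 16)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), standard_attention(q, k, v)))
```

The comparison uses a tolerance because the tiled version sums in a different order, which is the only source of difference between the two results.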
The mathematical technique that makes this possible is Online Softmax.
6. Online Softmax (Safe Softmax) Algorithm
6.1 The Problem with Standard Softmax
Softmax is a global operation. For a row vector $x = (x_1, \dots, x_N)$:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$
Computing this requires seeing the entire row at once to calculate the denominator sum. This is the fundamental barrier that makes tiling difficult -- looking only at block $S_{ij}$ is insufficient to complete the softmax, because the denominator changes depending on the values in the remaining blocks $S_{ij'}$ with $j' \ne j$.
Additionally, for numerical stability, "safe softmax" is used:
$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_{j=1}^{N} e^{x_j - m}}, \qquad m = \max_{j} x_j$$
This also requires the global maximum $m$, necessitating a scan of the entire row first.
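A short pure-Python sketch shows why the maximum is subtracted. The input row is an arbitrary example chosen so that the naive form overflows in double precision:

```python
import math


def naive_softmax(xs):
    """Direct definition: exp(x_i) / sum_j exp(x_j)."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def safe_softmax(xs):
    """Subtract the row maximum before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


row = [1000.0, 1001.0, 1002.0]
try:
    naive_softmax(row)
except OverflowError as err:
    print("naive softmax failed:", err)
print(safe_softmax(row))
```

Both functions agree on inputs small enough for the naive form to work, since subtracting a constant from every element leaves the ratio unchanged.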
6.2 The Online Softmax Trick
The key idea of Online Softmax (Milakov & Gimelshein, 2018) is to compute softmax incrementally, block by block, while maintaining running statistics.
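Two scalars are maintained per row:
- $m$: the maximum of all elements seen so far (running max)
- $\ell$: the normalization constant so far (running sum of exponentials)
When a new block $S_{ij}$ arrives, with $P_{ij} = \exp(S_{ij} - m_{ij})$ and $\ell_{ij} = \mathrm{rowsum}(P_{ij})$:
- Compute the row-wise maximum of the new block: $m_{ij} = \mathrm{rowmax}(S_{ij})$
- Update the global maximum: $m^{\text{new}} = \max(m, m_{ij})$
- Rescale the previous normalization constant: $\ell^{\text{new}} = e^{m - m^{\text{new}}}\,\ell + e^{m_{ij} - m^{\text{new}}}\,\ell_{ij}$
- Rescale the previous output: $O^{\text{new}} = \mathrm{diag}(\ell^{\text{new}})^{-1}\left(\mathrm{diag}(\ell)\,e^{m - m^{\text{new}}}\,O + e^{m_{ij} - m^{\text{new}}}\,P_{ij} V_j\right)$
This process is mathematically exact, not an approximation. Regardless of the order in which blocks are processed, the final result equals that of standard attention, up to minor numerical differences caused by floating-point operation ordering.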
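The running-statistics update can be sketched for a single row in pure Python. The row values and the block size of 3 are arbitrary assumptions; the function streams over blocks and never sees the whole row at once:

```python
import math


def online_softmax_stats(blocks):
    """Return the running max m and running sum l after streaming over blocks."""
    m, l = -math.inf, 0.0
    for block in blocks:
        m_block = max(block)
        l_block = sum(math.exp(x - m_block) for x in block)
        m_new = max(m, m_block)
        # Rescale both partial sums to the shared maximum before adding them.
        l = math.exp(m - m_new) * l + math.exp(m_block - m_new) * l_block
        m = m_new
    return m, l


row = [0.5, 2.0, -1.0, 3.5, 3.0, 0.0, -2.5, 1.5]
blocks = [row[i:i + 3] for i in range(0, len(row), 3)]   # blocks of 3, 3, 2

m, l = online_softmax_stats(blocks)
probs = [math.exp(x - m) / l for x in row]

# Reference: safe softmax statistics over the whole row at once.
m_ref = max(row)
l_ref = sum(math.exp(x - m_ref) for x in row)
print(m == m_ref, math.isclose(l, l_ref))
```

Processing the blocks in a different order yields the same statistics up to floating-point rounding, which is why the sketch compares sums with `math.isclose` rather than exact equality.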
6.3 Mathematical Justification
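The core of the proof is the rescaling property of softmax:
$$\frac{e^{x_i - m}}{\sum_{j} e^{x_j - m}} = \frac{e^{m - m'}\, e^{x_i - m}}{e^{m - m'} \sum_{j} e^{x_j - m}} = \frac{e^{x_i - m'}}{\sum_{j} e^{x_j - m'}}$$
Even when the maximum is updated from $m$ to $m'$, the same factor $e^{m - m'}$ multiplies both numerator and denominator, so the ratio remains unchanged. This property allows the results from previous blocks to be safely rescaled to the new maximum.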
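The same argument can be written out for the running sum. As a sketch, assume $m$ and $\ell$ are the running maximum and running sum over the elements seen so far (index set $A$), $m_B$ and $\ell_B$ are the same statistics for a newly arrived block (index set $B$), and $m' = \max(m, m_B)$. The set names $A$ and $B$ are introduced here only for illustration:

$$
\begin{aligned}
e^{m - m'}\,\ell + e^{m_B - m'}\,\ell_B
&= e^{m - m'} \sum_{k \in A} e^{x_k - m} + e^{m_B - m'} \sum_{k \in B} e^{x_k - m_B} \\
&= \sum_{k \in A} e^{x_k - m'} + \sum_{k \in B} e^{x_k - m'} \\
&= \sum_{k \in A \cup B} e^{x_k - m'}
\end{aligned}
$$

The last line is exactly the safe-softmax denominator over everything seen so far, taken relative to the updated maximum, so the invariant holds after every block by induction.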
7. Backward Pass Recomputation Strategy
7.1 The Problem with Standard Backward Pass
In the standard attention backward pass, the intermediate matrices $S$ and $P$ saved during the forward pass are needed for gradient computation. Since their size is $N \times N$, storing them in the forward pass and reading them back in the backward pass requires $O(N^2)$ memory.
7.2 FlashAttention's Recomputation
FlashAttention uses a variant of gradient checkpointing. During the forward pass, $S$ and $P$ are not saved. Instead, only the following are stored:
- The final output $O$ (size $N \times d$)
- The softmax normalization statistics $m$ and $\ell$ (per-row maximum and sum, each of size $N$)
During the backward pass, these statistics and the original $Q, K, V$ are used to recompute the needed blocks of $S$ and $P$ in SRAM. This recomputation requires additional FLOPs but significantly reduces HBM access.
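The recomputation step can be illustrated in NumPy. This sketch covers only how one tile of $P$ is rebuilt from $Q$, $K$, and the saved row statistics, not the gradient computation itself; the function name, sizes, and tile indices are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

# Forward pass statistics: only these O(N) values are kept.
s = q @ k.T                                   # full S, built here only for the reference
m = s.max(axis=1, keepdims=True)              # per-row maximum, shape (N, 1)
l = np.exp(s - m).sum(axis=1, keepdims=True)  # per-row sum of exponentials, shape (N, 1)
p_full = np.exp(s - m) / l                    # reference P, never stored by FlashAttention


def recompute_p_block(q, k, m, l, rows, cols):
    """Rebuild one tile of P from Q, K and the saved row statistics."""
    s_block = q[rows] @ k[cols].T
    return np.exp(s_block - m[rows]) / l[rows]


rows, cols = slice(4, 8), slice(8, 16)
p_block = recompute_p_block(q, k, m, l, rows, cols)
print(np.allclose(p_block, p_full[rows, cols]))
```

Because the saved statistics already describe the whole row, each tile can be rebuilt independently, with one small matmul and one exponential per tile.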
7.3 The Paradoxical Effect of Recomputation
Typically, gradient checkpointing saves memory at the cost of speed. However, FlashAttention's recomputation actually improves speed as well. The reason is:
- FLOPs increase: Since what was computed once in the forward pass is recomputed in the backward pass, total FLOPs slightly increase.
- HBM IO decreases: The cost of writing and reading the $N \times N$-sized $S$ and $P$ to and from HBM is eliminated.
On modern GPUs, HBM access is much slower than computation, so the benefit from IO reduction outweighs the FLOP increase. Experimental results show that the additional runtime overhead from recomputation is less than 5%, while memory usage decreases from $O(N^2)$ to $O(N)$.
7.4 Memory Savings
| Sequence Length | Standard Attention Memory | FlashAttention Memory | Savings Ratio |
|---|---|---|---|
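| 1K | ~2 MB | ~0.13 MB | ~16x |
| 2K | ~8 MB | ~0.26 MB | ~32x |
| 4K | ~33 MB | ~0.52 MB | ~64x |
| 8K | ~134 MB | ~1.05 MB | ~128x |
These savings enable processing longer sequences or using larger batch sizes with the same GPU memory. The figures compare one FP16 $N \times N$ matrix against one FP16 $N \times d$ output for a single head; they are consistent with $d = 64$, in which case the savings ratio is $N/d$.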
8. FlashAttention-2 Improvements
Dao (2023) introduced three key improvements in FlashAttention-2.
8.1 Minimizing Non-matmul FLOPs
The A100 GPU's Tensor Cores deliver 312 TFLOPS (FP16) for matrix multiplication (matmul), but non-matmul operations (softmax's exp, max, sum, etc.) run at 19.5 TFLOPS (FP32) -- approximately 16x slower. In FlashAttention-1, the proportion of non-matmul operations was significant.
FlashAttention-2 restructures the algorithm to minimize these non-matmul FLOPs. Specifically, it reduces the number of rescaling operations and performs softmax statistics updates more efficiently. The key change is performing the final rescaling only once at the end of the loop.
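The idea of deferring normalization can be sketched for one output row in pure Python. To keep the example short, scalar values stand in for the rows of $V$, and the block size and inputs are arbitrary assumptions; this illustrates the bookkeeping, not the actual kernel:

```python
import math


def attention_row_deferred(scores, values, block_size=3):
    """One output row with a single normalization after the loop (FlashAttention-2 style)."""
    m, l, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(scores), block_size):
        s_block = scores[start:start + block_size]
        v_block = values[start:start + block_size]
        m_new = max(m, max(s_block))
        scale = math.exp(m - m_new)            # one rescale of the running state per block
        p_block = [math.exp(s - m_new) for s in s_block]
        l = scale * l + sum(p_block)
        acc = scale * acc + sum(p * v for p, v in zip(p_block, v_block))
        m = m_new
    return acc / l                             # the only division, done once at the end


scores = [0.5, 2.0, -1.0, 3.5, 3.0, 0.0, -2.5]
values = [1.0, -2.0, 0.5, 4.0, 3.0, -1.0, 2.0]

# Reference: ordinary softmax-weighted average over the whole row.
m_ref = max(scores)
weights = [math.exp(s - m_ref) for s in scores]
reference = sum(w * v for w, v in zip(weights, values)) / sum(weights)
print(math.isclose(attention_row_deferred(scores, values), reference))
```

Compared with normalizing on every block, the accumulator stays unnormalized inside the loop, which removes a per-block division and multiplication by the running sum.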
8.2 Improved Parallelism: Sequence Length Dimension Parallelization
FlashAttention-1 only parallelized across the batch and head dimensions. When the batch size was small or the number of heads was low, the GPU's SMs (Streaming Multiprocessors) were underutilized.
FlashAttention-2 also parallelizes across the sequence length dimension. By changing the outer loop to iterate over $Q$ blocks (rather than $K, V$ blocks), each $Q$ block can be processed by an independent thread block. This change significantly improves occupancy during the forward pass.
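A back-of-the-envelope count shows why this matters for long sequences with small batches. The sketch assumes one thread block per unit of parallel work and uses a hypothetical configuration (batch 1, 32 heads, $N = 16{,}384$, $B_r = 128$); actual launch configurations depend on the kernel:

```python
import math

NUM_SMS = 108  # streaming multiprocessors on the A100


def thread_blocks_fa1(batch: int, heads: int) -> int:
    """One thread block per (batch, head) pair."""
    return batch * heads


def thread_blocks_fa2(batch: int, heads: int, seq_len: int, block_r: int) -> int:
    """One thread block per (batch, head, Q block) triple."""
    return batch * heads * math.ceil(seq_len / block_r)


# Hypothetical long-context setting: batch 1, 32 heads, N = 16,384, B_r = 128.
fa1 = thread_blocks_fa1(1, 32)
fa2 = thread_blocks_fa2(1, 32, 16_384, 128)
print(f"FA1: {fa1} blocks for {NUM_SMS} SMs, FA2: {fa2} blocks")
```

In this assumed setting the first scheme cannot give every SM a block, while the second produces far more blocks than SMs, so the scheduler can keep the whole GPU busy.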
8.3 Work Partitioning Optimization
Work distribution among warps within a thread block was also improved:
- FlashAttention-1: K, V are split across 4 warps; each warp independently computes its slice of $QK^\top$, and the warps then synchronize to combine their results. This approach incurs communication and synchronization overhead through shared memory.
- FlashAttention-2: Q is split across 4 warps, while K and V are shared by all warps. Since each warp computes outputs for different parts of Q independently, no inter-warp communication is needed.
8.4 Performance Results
Combining these three improvements:
- Approximately 2x speedup over FlashAttention-1
- Achieves 230 TFLOPS in FP16/BF16 on the A100 (approximately 73% of the theoretical maximum)
- Up to 9x speedup over standard PyTorch attention
- Approaches the efficiency of GEMM (matrix multiplication) operations
9. FlashAttention-3: Latest Advances
FlashAttention-3 (Shah et al., 2024) took a further step forward by leveraging the new hardware capabilities of the NVIDIA Hopper architecture (H100).
9.1 New Capabilities of the Hopper GPU
The H100 GPU provides the following key capabilities over the A100:
- WGMMA (Warpgroup Matrix Multiply-Accumulate): A new Tensor Core instruction with much higher throughput than the A100's mma.sync
- TMA (Tensor Memory Accelerator): A dedicated hardware unit for data transfers between global memory and shared memory, handling index computation and bounds checking in hardware
9.2 Three Key Techniques
1. Asynchronous Execution via Warp Specialization
Computation (WGMMA) and data movement (TMA) are assigned to different warp groups for pipelined, overlapping execution. While one warp group computes the current block, another prefetches data for the next block.
2. Interleaving of Matmul and Softmax
Previously, matmul was followed by softmax, then another matmul, in a sequential manner. FlashAttention-3 interleaves these so that matmul and softmax execute simultaneously on different hardware units. While Tensor Cores compute $QK^\top$ for the next block, CUDA Cores process the softmax of the current block.
3. FP8 Low-precision Support
Leveraging the H100's FP8 Tensor Cores roughly doubles peak matmul throughput relative to FP16. Naive FP8 quantization degrades accuracy, but FlashAttention-3 addresses this with two techniques:
- Block quantization: Maintaining separate scale factors per block to preserve dynamic range
- Incoherent processing: Multiplying by a random orthogonal matrix to distribute outliers before quantization, achieving 2.6x lower numerical error compared to the FP8 baseline
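The reason a random orthogonal matrix can be applied without changing the result is that it cancels in the score computation. The NumPy sketch below illustrates only that invariance and the spreading of an outlier channel; it does not simulate FP8. Building the matrix via QR decomposition, the sizes, and the injected outlier are all assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 32
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
q[:, 3] *= 50.0    # inject an outlier feature channel into Q

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
# (Assumption for this sketch; cheaper structured transforms can play the same role.)
ortho, _ = np.linalg.qr(rng.standard_normal((d, d)))

q_rot, k_rot = q @ ortho, k @ ortho

# Orthogonality gives (Q M)(K M)^T = Q M M^T K^T = Q K^T, so scores are unchanged.
print(np.allclose(q_rot @ k_rot.T, q @ k.T))

# The outlier's magnitude is spread across channels, shrinking the largest entry.
print(np.abs(q).max(), np.abs(q_rot).max())
```

A smaller maximum magnitude means a quantization scale can cover the tensor with finer resolution, which is the intuition behind applying the rotation before quantizing.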
9.3 Performance Results
FlashAttention-3 performance on the H100:
| Configuration | TFLOPS | GPU Utilization |
|---|---|---|
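| FP16 FlashAttention-2 | ~400 | ~40% |
| FP16 FlashAttention-3 | ~740 | ~75% |
| FP8 FlashAttention-3 | ~1,200 | ~60% |
In FP16, it achieves a 1.5-2.0x speedup over FlashAttention-2, and in FP8, it approaches 1.2 PFLOPS. Utilization here is measured against the H100 SXM's dense Tensor Core peak of roughly 989 TFLOPS in FP16 and 1,979 TFLOPS in FP8.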
10. Benchmarks: Speed and Memory Comparison
10.1 Attention Forward Pass Speed (A100 80GB, FP16)
Key figures reported in the FlashAttention paper and subsequent benchmarks:
| Sequence Length | Standard Attention | FlashAttention | FlashAttention-2 | Speedup (FA2 vs Std) |
|---|---|---|---|---|
| 512 | 12.2 ms | 3.5 ms | 1.9 ms | 6.4x |
| 1K | 45.8 ms | 7.8 ms | 4.1 ms | 11.2x |
| 2K | 178 ms | 18.9 ms | 9.8 ms | 18.2x |
| 4K | 710 ms | 52.3 ms | 27.1 ms | 26.2x |
| 8K | OOM | 145 ms | 75 ms | - |
| 16K | OOM | 520 ms | 270 ms | - |
The speedup becomes increasingly dramatic as sequence length grows. At 8K and above, standard attention fails with OOM (Out of Memory), while FlashAttention handles them without issue.
10.2 End-to-End Training Performance
| Model | Standard | FlashAttention | Speedup |
|---|---|---|---|
| BERT-large (seq 512) | 100% (MLPerf ref.) | 115% | 1.15x |
| GPT-2 (seq 1K) | 100% | 300% | 3.0x |
| Long-range Arena (seq 1K-4K) | 100% | 240% | 2.4x |
10.3 Memory Usage Comparison
FlashAttention's attention operation memory scales linearly with sequence length, a dramatic improvement over standard attention's quadratic scaling:
- Sequence length 2K: approximately 10x memory savings
- Sequence length 4K: approximately 20x memory savings
- Sequence length 64K: standard attention causes OOM even on an A100 80GB, while FlashAttention runs normally
11. Integration with PyTorch torch.nn.functional.scaled_dot_product_attention
11.1 Native Integration
Starting with PyTorch 2.0, FlashAttention is natively integrated into torch.nn.functional.scaled_dot_product_attention (SDPA). Since PyTorch 2.2, the flash backend is implemented with the FlashAttention-2 kernel, and SDPA selects it automatically whenever the inputs are eligible.
import torch
import torch.nn.functional as F
# Example sizes (illustrative values)
batch_size, num_heads, seq_len, head_dim = 2, 8, 1024, 64
# Basic usage - SDPA can pick the FlashAttention backend for these inputs
query = torch.randn(batch_size, num_heads, seq_len, head_dim,
                    device='cuda', dtype=torch.float16)
key = torch.randn(batch_size, num_heads, seq_len, head_dim,
                  device='cuda', dtype=torch.float16)
value = torch.randn(batch_size, num_heads, seq_len, head_dim,
                    device='cuda', dtype=torch.float16)
# PyTorch automatically selects the optimal backend
output = F.scaled_dot_product_attention(query, key, value)
11.2 Explicit Backend Selection
You can force or exclude specific backends:
from torch.nn.attention import sdpa_kernel, SDPBackend
# Use FlashAttention backend only
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    output = F.scaled_dot_product_attention(query, key, value)
# Use Memory-efficient attention backend only
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    output = F.scaled_dot_product_attention(query, key, value)
# Use Math (naive) backend - for debugging
with sdpa_kernel(SDPBackend.MATH):
    output = F.scaled_dot_product_attention(query, key, value)
# Use cuDNN backend (only in recent PyTorch releases; availability depends on version and build)
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    output = F.scaled_dot_product_attention(query, key, value)
11.3 Using with Causal Mask
Causal masking, essential for autoregressive generation in LLMs, is also supported:
# Apply causal mask with is_causal=True
# FlashAttention handles the causal mask within the fused kernel, requiring no extra memory
output = F.scaled_dot_product_attention(
    query, key, value,
    is_causal=True
)
# Using a custom attention mask
# (an explicit attn_mask generally makes SDPA choose a backend other than FlashAttention)
attn_mask = torch.tril(torch.ones(seq_len, seq_len, device='cuda', dtype=torch.bool))
output = F.scaled_dot_product_attention(
    query, key, value,
    attn_mask=attn_mask
)
11.4 Backend Selection Criteria
Conditions for PyTorch SDPA to select the FlashAttention backend:
- dtype: float16 or bfloat16 (float32 is not supported)
- device: CUDA GPU (CPU is not supported)
- head dimension: Maximum 256 (for FlashAttention-2)
- attention mask: is_causal=True is supported; passing an explicit attn_mask (boolean or float) generally rules out the FlashAttention backend
If these conditions are not met, PyTorch automatically falls back to the memory-efficient attention or math backend.
11.5 Practical Tips
# Check which backends are enabled (this does not show which one a given call used)
import torch.backends.cuda
# Check the enabled state of each backend
print(f"Flash SDP enabled: {torch.backends.cuda.flash_sdp_enabled()}")
print(f"Mem efficient SDP enabled: {torch.backends.cuda.mem_efficient_sdp_enabled()}")
print(f"Math SDP enabled: {torch.backends.cuda.math_sdp_enabled()}")
# Globally disable a specific backend
torch.backends.cuda.enable_flash_sdp(False) # Disable FlashAttention
torch.backends.cuda.enable_mem_efficient_sdp(True)
11.6 Using the flash-attn Library Directly
In addition to PyTorch's native SDPA, you can use Tri Dao's flash-attn package directly. This package provides more features than PyTorch SDPA (e.g., sliding window attention, ALiBi, cross-attention optimization):
# pip install flash-attn
from flash_attn import flash_attn_func
# Shape: (batch, seqlen, nheads, headdim)
output = flash_attn_func(q, k, v, causal=True)
12. Summary and Key Takeaways
The key lesson of FlashAttention is that FLOP complexity alone does not determine performance. On modern GPUs, memory access patterns dominate actual execution time, and IO-aware algorithm design is decisive for practical performance.
The main contributions can be summarized as follows:
- IO-Aware Design Principle: Algorithm design that exploits the asymmetry of the GPU memory hierarchy (HBM vs SRAM)
- Tiling + Online Softmax: Block-wise computation that fits in SRAM, eliminating HBM materialization of the $N \times N$ matrix
- Recomputation Strategy: Recomputing intermediate values in the backward pass to reduce memory from $O(N^2)$ to $O(N)$, while simultaneously improving speed
- Optimality Proof: Proving the lower bound from an IO complexity perspective to establish the algorithm's optimality
- Exact Computation: Maintaining exact attention without approximation despite all optimizations
FlashAttention is a rare piece of research that combines theoretical elegance with practical effectiveness, and it has become core infrastructure for modern LLM training and inference. Thanks to its native integration in PyTorch, its benefits can be enjoyed simply by calling F.scaled_dot_product_attention without any additional implementation.
References
- Dao, T., Fu, D.Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022. https://arxiv.org/abs/2205.14135
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ICLR 2024. https://arxiv.org/abs/2307.08691
- Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. NeurIPS 2024 Spotlight. https://arxiv.org/abs/2407.08608
- Dao-AILab. flash-attention GitHub Repository. https://github.com/Dao-AILab/flash-attention
- PyTorch Documentation. torch.nn.functional.scaled_dot_product_attention. https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html
- PyTorch Documentation. torch.nn.attention.sdpa_kernel. https://docs.pytorch.org/docs/stable/generated/torch.nn.attention.sdpa_kernel.html
- PyTorch Blog. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. https://pytorch.org/blog/flashattention-3/
- Milakov, M. & Gimelshein, N. (2018). Online Normalizer Calculation for Softmax. arXiv:1805.02867. https://arxiv.org/abs/1805.02867
- NVIDIA. A100 Tensor Core GPU Architecture Whitepaper. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
- NVIDIA. Hopper Architecture In-Depth. https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/