Chaos and Order

https://www.youngju.dev/blog
Slowly and correctly. AI Researcher & DevOps Engineer Youngju's tech blog. GPU/CUDA, LLM, MLOps, Kubernetes AI workloads, distributed training, and data engineering.
Language: ko
Author: fjvbn2003@gmail.com (Youngju Kim)
Fri, 26 Jun 2026 00:00:00 GMT

The Evolution of Attention — MQA, GQA, FlashAttention, and Long Context
https://www.youngju.dev/blog/llm/2026-06-26-attention-variants-gqa-mqa-flashattention.en
We analyze the memory and compute cost of standard attention, then explain how MQA and GQA shrink the KV cache and how FlashAttention optimizes IO. We compare sliding-window and long-context techniques and trace how all of these choices affect serving memory.
Published: Fri, 26 Jun 2026 00:00:00 GMT
Author: fjvbn2003@gmail.com (Youngju Kim)
Tags: llm, attention, flashattention, gqa, mqa, kv-cache, long-context
Japanese version: https://www.youngju.dev/blog/llm/2026-06-26-attention-variants-gqa-mqa-flashattention.ja
Korean version: https://www.youngju.dev/blog/llm/2026-06-26-attention-variants-gqa-mqa-flashattention
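
The article summary above says that MQA and GQA shrink the KV cache. As a rough illustration of why, the sketch below estimates cache size from the number of key/value heads. It is not taken from the article: the function name `kv_cache_bytes` and the model configuration (32 layers, 32 query heads, head dimension 128, fp16, 8192 tokens) are assumptions chosen only to make the arithmetic concrete.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 counts one K tensor and one V tensor per layer.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, used only for illustration.
config = dict(num_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # 4 query heads share each KV head
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
```

Under these assumptions the cache scales linearly with the number of KV heads, so moving from 32 KV heads to 8 cuts it by 4x, and to a single shared head by 32x. Real serving memory also depends on batch size, cache dtype, and allocator overhead, which this sketch leaves out.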