Mixture-of-experts

Published on
May 16, 2026
Foundation Model Architectures 2026: After the Transformer / Mamba 2 / Hyena / RWKV / RetNet / Griffin / Jamba / xLSTM / TTT / DiT / MoE / Flash Attention 3 In-Depth Guide
foundation-models transformer attention-is-all-you-need vaswani mamba state-space-model ssm albert-gu tri-dao mamba-2 hyena stanford-h2o linear-attention schmidhuber rwkv bo-peng retnet microsoft-retentive griffin deepmind-griffin s5 jamba ai21 falcon-mamba xlstm sepp-hochreiter test-time-training ttt sun-et-al dit diffusion-transformer sora-dit mixture-of-experts moe mixtral deepseek-v3-moe million-experts google-mome flash-attention-3 ring-attention gemini-2m magic-ltm-2-mini sakana-ai-evolutionary 2026 deep-dive
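In 2026, the foundation-model world is no longer Transformer-only. Vaswani et al.'s 2017 "Attention Is All You Need" is still the standard, but alongside it now stand state-space models (SSMs) such as Mamba and Mamba 2; the linear-RNN revival camp of RWKV, RetNet, and Griffin; hybrids such as AI21 Jamba and Falcon Mamba; Sepp Hochreiter's xLSTM; Test-Time Training; Sora's DiT; MoE models such as Mixtral, DeepSeek-V3 671B, and Google Million Experts; Flash Attention 3 and Ring Attention; and the ultra-long contexts of Gemini 2M and Magic LTM-2-mini 100M. This guide sums up in one place which architecture is strong on which problems, and what the Korean and Japanese camps are building.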
Published on
May 16, 2026
LLM Paper Curation 2024-2026 - Llama · DeepSeek · Qwen · Mistral · Phi · RLHF · DPO · CoT · RAG · FlashAttention · vLLM In-Depth Guide
llm papers llama deepseek qwen mistral phi rlhf dpo chain-of-thought rag flashattention vllm foundation-models mixture-of-experts
A curated list of 30+ must-read papers from 2024-2026 for engineers who build and operate LLMs. It covers foundation models (Llama 3/4, DeepSeek-V3/R1, Qwen3, Mistral, Phi-4, Gemma 3), training innovations (MoE, MLA, GQA), post-training (RLHF, DPO, ORPO, KTO), reasoning (CoT, ToT, GRPO), agents (ReAct, SWE-Agent), retrieval (RAG, GraphRAG, ColBERT), efficiency (FlashAttention 1/2/3, vLLM PagedAttention, SGLang), evaluation (MMLU, GSM8K, SWE-Bench, OSWorld), safety, and Korean and Japanese models, with each paper's arXiv ID and "why it matters" summarized in one paragraph.
Published on
March 14, 2026
In-Depth Analysis of Mixture of Experts (MoE) Architecture Papers: From GShard to DeepSeek-MoE
ai-papers mixture-of-experts moe transformer deepseek
Analyzes the key papers on the Mixture of Experts architecture and compares the routing strategies and training-stability techniques of GShard, Switch Transformer, Mixtral, and DeepSeek-MoE.
Published on
March 11, 2026
In-Depth Analysis of the Mixture of Experts (MoE) Architecture: The Evolution from Switch Transformer to Mixtral and Efficient Scaling Strategies
ai-papers mixture-of-experts switch-transformer mixtral model-architecture 2026-03 2026-03-11
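An in-depth analysis that runs from the core principles of the Mixture of Experts (MoE) architecture through Switch Transformer's single-expert routing, Mixtral 8x7B's Sparse MoE implementation, and DeepSeek-MoE's fine-grained expert segmentation strategy. Covers routing mechanisms, load-balancing loss, training-stabilization techniques, inference optimization, failure cases, and a checklist.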
Published on
March 10, 2026
In-Depth Analysis of the Mixture of Experts (MoE) Architecture: From Switch Transformer to Mixtral and DeepSeek
ai-papers mixture-of-experts moe transformer mixtral deepseek 2026-03 2026-03-10
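An in-depth analysis of the Mixture of Experts (MoE) architecture. Covers, in detail and grounded in the papers, everything from the mathematical foundations of Sparse MoE to the routing strategies, training-stability techniques, and inference optimizations of Switch Transformer, Mixtral 8x7B, and DeepSeek-V3.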
Published on
March 6, 2026
In-Depth Analysis of the Sparse Mixture of Experts (MoE) Architecture: From Design Principles to DeepSeek-V3 and Qwen3
ai-papers moe mixture-of-experts sparse-model deepseek 2026-03 2026-03-06
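Analyzes the mathematical principles, routing strategies, and load-balancing techniques of the Sparse MoE architecture, and covers the design choices and practical training and inference optimizations of recent MoE models, from Switch Transformer to DeepSeek-V3 and Qwen3-235B.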
