[Virtualization] 05. AWS EC2 and Nitro System: The Evolution of Cloud Virtualization

Introduction
AWS Nitro System Architecture
GPU Instance Types
EC2 Networking
- EFA (Elastic Fabric Adapter)
- Placement Groups
Elastic Graphics (Deprecated)
Comparison with On-Premises GPU Virtualization
EC2 Instance Selection Guide
Hands-On: EC2 GPU Instance Setup
Key Innovations of the Nitro System

Introduction

AWS EC2 (Elastic Compute Cloud) is one of the world's largest cloud virtualization platforms. At its core is the Nitro System, developed in-house by AWS. Nitro addresses the overhead of traditional hypervisors by offloading I/O processing to dedicated hardware, so nearly all host CPU and memory go to instances.

AWS Nitro System Architecture

Traditional Virtualization vs Nitro

[Traditional Virtualization]          [AWS Nitro]

+------------------+                 +------------------+
|    VM 1 | VM 2   |                 |    VM 1 | VM 2   |
+------------------+                 +------------------+
| Hypervisor       |                 | Nitro Hypervisor |
| - CPU/Mem mgmt   |                 | (lightweight,    |
| - Network proc   |                 |  CPU/Mem only)   |
| - Storage proc   |                 +------------------+
| - Security/Mgmt  |                 | Nitro Cards      |
+------------------+                 | (dedicated HW:   |
|   Hardware       |                 |  NIC, EBS, Mgmt, |
+------------------+                 |  Security)       |
                                     +------------------+
Host CPU: up to ~30%                 |   Hardware       |
consumed by hypervisor               +------------------+
                                     Host CPU: ~100%
                                     available to instances

Nitro System Components

+----------------------------------------------------------+
|                    AWS Nitro System                        |
+----------------------------------------------------------+
|                                                            |
|  +------------------+  +------------------------------+   |
|  | Nitro Hypervisor |  |       Nitro Cards            |   |
|  | - Lightweight    |  | +--------+ +--------+       |   |
|  |   KVM-based      |  | | VPC    | | EBS    |       |   |
|  | - CPU/Memory     |  | | Card   | | Card   |       |   |
|  |   isolation only |  | +--------+ +--------+       |   |
|  +------------------+  | +--------+ +--------+       |   |
|                         | | NVMe   | | Mgmt   |       |   |
|  +------------------+  | | Card   | | Card   |       |   |
|  | Nitro Security   |  | +--------+ +--------+       |   |
|  | Chip             |  +------------------------------+   |
|  | - HW Root of Trust|                                     |
|  | - Firmware protect|  +------------------------------+   |
|  +------------------+  | Nitro Enclaves               |   |
|                         | - Isolated compute           |   |
|                         | - Sensitive data processing  |   |
|                         +------------------------------+   |
+----------------------------------------------------------+

1. Nitro Hypervisor

- Lightweight and KVM-based: handles only CPU and memory isolation
- Network, storage, and management functions are all offloaded to Nitro Cards
- Delivers nearly 100% of host CPU and memory to instances
- Minimizes the software attack surface

2. Nitro Cards

Hardware cards built with dedicated ASICs.

Nitro Card	Role
VPC Card	Virtual network processing (VPC, SG, NACL, EFA)
EBS Card	EBS volume I/O, encryption, NVMe protocol
Local NVMe Card	Instance store NVMe SSD management
Management Card	Instance monitoring, boot, security management

3. Nitro Security Chip

[Nitro Security Chain]

Server Boot
    |
    v
Nitro Security Chip (HW Root of Trust)
    |
    v  Firmware integrity verification
    |
Nitro Hypervisor loads
    |
    v  Hypervisor integrity verification
    |
EC2 Instance starts
    |
    v  Runtime monitoring (continuous)

- Hardware-based root of trust
- Verifies server firmware integrity at every boot
- No operator access: even AWS employees cannot access instance memory
- NitroTPM provides an instance-level TPM 2.0
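
The boot chain above follows the general chain-of-trust pattern: each stage is measured and compared with an expected value before control passes to it. The following is a minimal Python sketch of that pattern only. The stage names, the byte strings standing in for images, and the `verify_chain` helper are all hypothetical, and nothing here reflects AWS's actual firmware or verification logic.

```python
import hashlib


def digest(image: bytes) -> str:
    """Return the SHA-256 hex digest of a boot-stage image."""
    return hashlib.sha256(image).hexdigest()


def verify_chain(stages, trusted_digests):
    """Verify stages in boot order and stop at the first mismatch.

    stages: list of (name, image_bytes) in boot order.
    trusted_digests: dict of name -> expected digest (held by the root of trust).
    Returns the names of the stages that were verified and allowed to run.
    """
    booted = []
    for name, image in stages:
        if digest(image) != trusted_digests.get(name):
            break  # refuse to hand control to an unverified stage
        booted.append(name)
    return booted


# Hypothetical images standing in for real firmware and hypervisor binaries.
firmware = b"firmware-v1"
hypervisor = b"hypervisor-v1"
trusted = {"firmware": digest(firmware), "hypervisor": digest(hypervisor)}

ok = verify_chain([("firmware", firmware), ("hypervisor", hypervisor)], trusted)
tampered = verify_chain([("firmware", firmware), ("hypervisor", b"modified")], trusted)
print(ok)
print(tampered)
```

In the tampered run the chain stops after the firmware stage, which is the property the hardware root of trust is meant to guarantee: nothing later in the chain runs unless everything before it was verified.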

4. Nitro Enclaves

Isolated compute environments for processing sensitive data.

+-------------------------------------+
|           EC2 Instance              |
|  +-------------+  +-------------+  |
|  | Application |  | Nitro       |  |
|  | (general)   |  | Enclave     |  |
|  |             |  | (isolated)  |  |
|  |             |  | - own kernel|  |
|  |             |  | - no network|  |
|  |             |  | - no storage|  |
|  |             |  | - vsock only|  |
|  +-------------+  +-------------+  |
+-------------------------------------+

- Even the parent instance cannot access Enclave memory
- No external networking or persistent storage; the only channel is a vsock connection to the parent instance
- Cryptographic attestation verifies the Enclave's identity and integrity
- Use cases: encryption key management, financial data, medical information
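
Because vsock is a plain byte stream, the parent application and the enclave usually agree on a small message format on top of it. The sketch below shows one common choice, length-prefixed JSON messages. It is an illustration under assumptions: the helper names and the message fields are hypothetical, and it uses `socket.socketpair()` so that it runs anywhere. On a real parent instance the socket would instead be created with `socket.AF_VSOCK` (Linux only) and connected to the enclave's CID and port.

```python
import json
import socket
import struct


def send_msg(sock: socket.socket, payload: dict) -> None:
    """Send a JSON payload prefixed with its 4-byte big-endian length."""
    data = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack(">I", len(data)) + data)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock: socket.socket) -> dict:
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


# Stand-in for the parent <-> enclave channel (a vsock stream in practice).
parent, enclave = socket.socketpair()

send_msg(parent, {"op": "sign", "data": "hello"})   # parent sends a request
request = recv_msg(enclave)                          # enclave reads it
send_msg(enclave, {"status": "ok", "echo": request["data"]})
reply = recv_msg(parent)                             # parent reads the reply
print(reply)

parent.close()
enclave.close()
```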

GPU Instance Types

P-Series (Training/HPC)

Instance	GPU	Count	GPU Memory	vCPU	Memory	Network
p5.48xlarge	H100	8	640GB HBM3	192	2TB	3,200 Gbps EFAv2
p5e.48xlarge	H200	8	1,128GB HBM3e	192	2TB	3,200 Gbps EFAv2
p5en.48xlarge	H200	8	1,128GB HBM3e	192	2TB	3,200 Gbps EFAv3
p4d.24xlarge	A100 40G	8	320GB HBM2	96	1.1TB	400 Gbps EFA
p4de.24xlarge	A100 80G	8	640GB HBM2e	96	1.1TB	400 Gbps EFA
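
The "GPU Memory" column is the per-GPU memory multiplied by the GPU count. A short Python check of that arithmetic follows; the per-GPU capacities are the commonly published figures for each GPU model and are stated here as assumptions rather than values taken from the table.

```python
# Per-GPU memory in GB (assumed from the published specs of each GPU model).
PER_GPU_MEMORY_GB = {
    "A100 40GB": 40,
    "A100 80GB": 80,
    "H100": 80,
    "H200": 141,
}

# (instance type, GPU model, GPU count) as listed in the P-series table.
P_SERIES = [
    ("p5.48xlarge", "H100", 8),
    ("p5e.48xlarge", "H200", 8),
    ("p5en.48xlarge", "H200", 8),
    ("p4d.24xlarge", "A100 40GB", 8),
    ("p4de.24xlarge", "A100 80GB", 8),
]


def total_gpu_memory_gb(gpu_model: str, count: int) -> int:
    """Aggregate GPU memory of one instance in GB."""
    return PER_GPU_MEMORY_GB[gpu_model] * count


totals = {name: total_gpu_memory_gb(gpu, n) for name, gpu, n in P_SERIES}
for name, total in totals.items():
    print(f"{name}: {total} GB")
```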

G-Series (Inference/Graphics)

Instance	GPU	Count	GPU Memory	vCPU	Memory	Network
g6.xlarge to 48xlarge	L4	1-8	24-192GB	4-192	16-768GB	Up to 100 Gbps
g6e.xlarge to 48xlarge	L40S	1-8	48-384GB	4-192	32-1,536GB	Up to 400 Gbps
g5.xlarge to 48xlarge	A10G	1-8	24-192GB	4-192	16-768GB	Up to 100 Gbps

GPU Provisioning Method

AWS provides GPUs in passthrough mode via the Nitro System.

[AWS GPU Passthrough via Nitro]

+------------------+
|   EC2 Instance   |
|  (GPU Driver)    |
+------------------+
|  Nitro Hypervisor|
|  (CPU/Mem only)  |
+------------------+
|  Nitro VPC Card  |  Nitro EBS Card  |  Nitro Mgmt Card
+------------------+------------------+-----------------+
|                Physical Server                         |
|  CPU | RAM | GPU (Direct Passthrough) | NVMe           |
+-------------------------------------------------------+

- GPUs are assigned directly to instances (passthrough, not vGPU)
- GPU performance is close to bare metal
- The full native GPU stack is available (CUDA, cuDNN, NCCL, etc.)
- Users can configure MIG themselves inside the instance on MIG-capable GPUs (A100/H100)
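
Since MIG is configured by the user, it helps to know which profile combinations fit on one GPU. The sketch below is a simplified capacity check for an A100 40GB, assuming the profile IDs and slice counts from NVIDIA's MIG documentation (7 compute slices and 8 memory slices per GPU). Real placement rules are stricter than a simple sum, so treat this as a first-pass filter only; the `fits` helper is hypothetical.

```python
# MIG profiles on an A100 40GB: profile ID -> (name, compute slices, memory slices).
A100_40GB_PROFILES = {
    19: ("1g.5gb", 1, 1),
    14: ("2g.10gb", 2, 2),
    9: ("3g.20gb", 3, 4),
    5: ("4g.20gb", 4, 4),
    0: ("7g.40gb", 7, 8),
}
TOTAL_COMPUTE_SLICES = 7
TOTAL_MEMORY_SLICES = 8


def fits(profile_ids) -> bool:
    """Return True if the requested profiles stay within the GPU's slice budget."""
    compute = sum(A100_40GB_PROFILES[p][1] for p in profile_ids)
    memory = sum(A100_40GB_PROFILES[p][2] for p in profile_ids)
    return compute <= TOTAL_COMPUTE_SLICES and memory <= TOTAL_MEMORY_SLICES


print(fits([9, 9]))      # two 3g.20gb instances
print(fits([9, 9, 19]))  # two 3g.20gb plus one 1g.5gb
print(fits([5, 9]))      # one 4g.20gb plus one 3g.20gb
```

Two `3g.20gb` instances (profile ID 9, the combination used with `-cgi 9,9` in the hands-on section) consume all 8 memory slices, which is why a further `1g.5gb` instance does not fit even though one compute slice is still free.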

EC2 Networking

EFA (Elastic Fabric Adapter)

[EFA Architecture]

+----------+  +----------+  +----------+  +----------+
| Instance |  | Instance |  | Instance |  | Instance |
| GPU x8   |  | GPU x8   |  | GPU x8   |  | GPU x8   |
+----+-----+  +----+-----+  +----+-----+  +----+-----+
     |              |              |              |
+----+--------------+--------------+--------------+----+
|              EFA Network (RDMA-like)                  |
|         (OS bypass, low-latency, high-bandwidth)      |
+------------------------------------------------------+

Feature	Description
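OS Bypass	Applications access the NIC directly, bypassing the kernel network stack
Bandwidth	P5: 3,200 Gbps, P4d: 400 Gbps (aggregate per instance)
NCCL Support	NCCL collectives (All-reduce, etc.) between GPUs on different instances run over EFA
GDR (GPUDirect RDMA)	Direct network transfer from GPU memory
SRD Protocol	Scalable Reliable Datagram, the AWS-designed transport that EFA uses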

[GPUDirect RDMA Path]

Standard path:     GPU -> CPU Memory -> NIC -> Network
GPUDirect RDMA:    GPU -> NIC -> Network  (CPU bypass)
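
To see why the bandwidth figures matter for All-reduce, a rough bandwidth-only estimate can be written down. In a ring all-reduce over N ranks, each rank sends about 2(N-1)/N times the message size. The Python sketch below applies that formula, treating each instance as one rank with its full aggregate EFA bandwidth. The 10 GB gradient size is hypothetical, and latency, protocol overhead, and intra-node traffic are ignored, so the results are lower bounds rather than predictions.

```python
def ring_allreduce_seconds(message_bytes: float, ranks: int, link_gbps: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce.

    Each rank sends (and receives) 2 * (ranks - 1) / ranks * message_bytes.
    """
    if ranks < 2:
        return 0.0  # nothing to exchange with a single rank
    bytes_on_wire = 2 * (ranks - 1) / ranks * message_bytes
    bytes_per_second = link_gbps * 1e9 / 8  # Gbps -> bytes per second
    return bytes_on_wire / bytes_per_second


GRADIENT_BYTES = 10e9  # hypothetical: 10 GB of gradients per step

for name, gbps in [("P4d (400 Gbps)", 400), ("P5 (3,200 Gbps)", 3200)]:
    seconds = ring_allreduce_seconds(GRADIENT_BYTES, ranks=4, link_gbps=gbps)
    print(f"{name}: {seconds * 1000:.1f} ms per all-reduce (lower bound)")
```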

Placement Groups

Placement groups control how instances are placed on the underlying hardware. For GPU clusters, they are used to optimize network performance.

[Cluster Placement Group]

+---------------------------------------------------+
|  Same AZ, Same Rack (or Adjacent Racks)           |
|                                                    |
|  +--------+  +--------+  +--------+  +--------+  |
|  | p5.48xl|  | p5.48xl|  | p5.48xl|  | p5.48xl|  |
|  | 8xH100 |  | 8xH100 |  | 8xH100 |  | 8xH100 |  |
|  +--------+  +--------+  +--------+  +--------+  |
|                                                    |
|  --> Minimal network latency, maximum bandwidth    |
|  --> Essential for large-scale distributed training|
+---------------------------------------------------+

Placement Group Type	Description	Use Case
Cluster	Dense placement in same AZ	Distributed GPU training, HPC
Spread	Distributed across different racks	High availability
Partition	Separate racks per partition	Large distributed systems

Elastic Graphics (Deprecated)

Elastic Graphics was a service that attached remote GPUs to EC2 instances over the network.

- Reached end of life on January 8, 2024
- Offered only limited OpenGL support
- Alternatives: G-series instances, or remote display with NICE DCV (now Amazon DCV)

Comparison with On-Premises GPU Virtualization

Aspect	AWS EC2 (Nitro)	On-Premises (ESXi/KVM)
GPU Assignment	Passthrough (Nitro)	Passthrough, vGPU, MIG selectable
GPU Sharing	Exclusive per instance	Multi-VM sharing via vGPU
Network	EFA (up to 3,200 Gbps aggregate per instance)	InfiniBand (e.g., NDR at 400 Gbps per port)
GPU Type Change	Change the instance type (requires stop/start)	Physical GPU replacement needed
Scalability	Hundreds of GPUs in minutes (subject to capacity and quotas)	Weeks to months of procurement
Cost Model	Pay-per-use (billed per second)	CAPEX + maintenance
MIG Support	User configures within instance	Managed at hypervisor level
Multi-tenancy	Nitro HW isolation between instances	Logical isolation via vGPU/MIG
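
The pay-per-use row can be made concrete with a small calculation. The Python sketch below assumes per-second On-Demand billing with a 60-second minimum, which applies to common Linux instances. The hourly rate is a made-up round number for illustration, not an actual EC2 price.

```python
def on_demand_cost(runtime_seconds: int, hourly_rate_usd: float,
                   minimum_seconds: int = 60) -> float:
    """Cost of one run under per-second billing with a minimum charge."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return billed_seconds * hourly_rate_usd / 3600


HYPOTHETICAL_HOURLY_RATE = 36.0  # USD per hour, illustrative only

short_run = on_demand_cost(30, HYPOTHETICAL_HOURLY_RATE)      # billed as 60 s
long_run = on_demand_cost(90 * 60, HYPOTHETICAL_HOURLY_RATE)  # 90 minutes
print(f"30-second run:  ${short_run:.2f}")
print(f"90-minute run:  ${long_run:.2f}")
```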

EC2 Instance Selection Guide

[GPU Instance Selection Flowchart]

What is your use case?
  |
  +-- AI/ML Training --> Model size?
  |                       |
  |                       +-- Large LLM --> P5 (H100/H200)
  |                       |                 8 GPUs, EFA 3200Gbps
  |                       |
  |                       +-- Medium --> P4d (A100)
  |                                      8 GPUs, EFA 400Gbps
  |
  +-- Inference --> Throughput needs?
  |                  |
  |                  +-- High throughput --> G6e (L40S)
  |                  |                      Up to 8 GPUs
  |                  |
  |                  +-- Cost efficient --> G6 (L4)
  |                                         Up to 8 GPUs
  |
  +-- Graphics/Rendering --> G5 (A10G)
  |                          3D rendering, video processing
  |
  +-- Dev/Prototype --> g6.xlarge (1x L4)
                        Low-cost single-GPU option
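
The flowchart can also be written as a small lookup function, which is convenient when the choice is made from a script or a provisioning tool. This is only a direct transcription of the branches above; the function name and the string keys (`"training"`, `"large"`, and so on) are hypothetical.

```python
def recommend_instance(use_case: str, detail: str = "") -> str:
    """Map a use case (plus one detail) to an instance family, following the flowchart."""
    if use_case == "training":
        # detail: "large" for large LLMs, anything else for medium-sized models
        return "P5 (H100/H200)" if detail == "large" else "P4d (A100)"
    if use_case == "inference":
        # detail: "high-throughput" or anything else for cost efficiency
        return "G6e (L40S)" if detail == "high-throughput" else "G6 (L4)"
    if use_case == "graphics":
        return "G5 (A10G)"
    if use_case == "dev":
        return "g6.xlarge (1x L4)"
    raise ValueError(f"unknown use case: {use_case}")


print(recommend_instance("training", "large"))
print(recommend_instance("inference", "cost"))
print(recommend_instance("dev"))
```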

Hands-On: EC2 GPU Instance Setup

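# Create the cluster placement group first (the launch fails if it does not exist)
aws ec2 create-placement-group \
  --group-name my-gpu-cluster \
  --strategy cluster

# Launch a GPU instance with the AWS CLI (all IDs below are placeholders)
aws ec2 run-instances \
  --instance-type p4d.24xlarge \
  --image-id ami-0abcdef1234567890 \
  --key-name my-key \
  --security-group-ids sg-12345678 \
  --subnet-id subnet-12345678 \
  --placement "GroupName=my-gpu-cluster,Tenancy=default" \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=gpu-training}]'

# Create an EFA network interface
# (it must still be attached to an instance, at launch or while the instance is stopped)
aws ec2 create-network-interface \
  --subnet-id subnet-12345678 \
  --interface-type efa \
  --groups sg-12345678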

# Check GPU status (inside instance)
nvidia-smi

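# Enable MIG mode on GPU 0 (MIG-capable GPUs such as A100/H100)
sudo nvidia-smi -i 0 -mig 1
# If the mode change is reported as pending, reset the GPU or reboot first.
# Then create two GPU instances with profile ID 9 (3g.20gb on an A100 40GB)
# together with their default compute instances (-C)
sudo nvidia-smi mig -cgi 9,9 -C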

# NCCL test (multi-node): 2 nodes x 8 GPUs = 16 MPI ranks, one GPU per rank (-g 1)
# all_reduce_perf comes from nccl-tests; NCCL over EFA assumes the aws-ofi-nccl plugin is installed
mpirun -np 16 --hostfile hosts \
  -x NCCL_DEBUG=INFO \
  -x FI_PROVIDER=efa \
  -x FI_EFA_USE_DEVICE_RDMA=1 \
  all_reduce_perf -b 8 -e 1G -f 2 -g 1
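
The `hosts` file and the `-np` value in the `mpirun` command have to agree with the cluster size: `-np 16` matches two 8-GPU nodes when each MPI rank drives one GPU. The Python sketch below generates an Open MPI style hostfile (`hostname slots=N`) and the matching rank count. The IP addresses and helper names are hypothetical.

```python
def build_hostfile(hosts, gpus_per_node: int = 8) -> str:
    """Return Open MPI hostfile text with one slot per GPU on each node."""
    return "\n".join(f"{host} slots={gpus_per_node}" for host in hosts) + "\n"


def total_ranks(hosts, gpus_per_node: int = 8) -> int:
    """Number of MPI ranks when each rank drives exactly one GPU."""
    return len(hosts) * gpus_per_node


nodes = ["10.0.0.11", "10.0.0.12"]  # hypothetical private IPs of two 8-GPU nodes

hostfile_text = build_hostfile(nodes)
print(hostfile_text, end="")
print(f"mpirun -np {total_ranks(nodes)} --hostfile hosts ...")
```

Writing `hostfile_text` to a file named `hosts` gives the file that the `mpirun` command expects.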

Key Innovations of the Nitro System

[Core Innovation: I/O Offloading]

Before Nitro:
  Host CPU:  [==VM==][==VM==][===Hypervisor I/O===]
                                    ~30% consumed

After Nitro:
  Host CPU:  [========VM========][========VM========]
  Nitro HW:  [Net][EBS][NVMe][Mgmt][Security]
                  ~100% available to VMs

- I/O hardware offload: network, storage, and management are handled by dedicated chips
- Security isolation: hardware-level root of trust and memory isolation
- Bare-metal performance: near-zero virtualization overhead
- Consistent performance: I/O processing is independent of the host CPU, which reduces "noisy neighbor" effects
- Rapid innovation: hardware components can be updated independently

Quiz: AWS EC2/Nitro Knowledge Check

Q1. What is the key advantage of the Nitro System over traditional hypervisors?

It offloads network, storage, and management functions to dedicated Nitro Cards (ASICs), providing nearly 100% of host CPU and memory to instances. Traditional hypervisors handle these functions in software, which can consume up to about 30% of host resources.

Q2. How does AWS EC2 assign GPUs?

Via passthrough. The Nitro System directly assigns physical GPUs to instances, delivering bare-metal equivalent performance. It does not use vGPU-style sharing.

Q3. How does EFA differ from standard networking?

EFA provides OS bypass for direct NIC access without going through the kernel. It offers RDMA-like low-latency, high-bandwidth communication, and GPUDirect RDMA enables direct network transfer from GPU memory.

Q4. What is the security model of Nitro Enclaves?

Even the parent instance cannot access Enclave memory. There is no network or storage access; communication occurs only through vsock. Cryptographic attestation verifies Enclave integrity.

Q5. Why are Cluster Placement Groups important for distributed GPU training?

They pack instances close together within the same AZ (in the same or adjacent racks), minimizing network latency and maximizing bandwidth. Distributed training involves frequent GPU-to-GPU communication (All-reduce, etc.), so network performance directly affects training speed.