Kubernetes RuntimeClass and GPU Scheduling Operations Guide

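Where RuntimeClass and GPU Scheduling Intersect
Configuring RuntimeClass
Configuring and Operating the NVIDIA Device Plugin
- Deploying the Device Plugin DaemonSet
- GPU Operator vs. Direct Device Plugin Deployment
GPU Node Isolation and Taint/Toleration Strategy
- Setting Node Labels and Taints
- Example GPU Workload Deployment
Operating MIG (Multi-Instance GPU)
- Configuring MIG Profiles
- Declarative MIG Management with the GPU Operator
GPU Monitoring and Alerting
Troubleshooting: Real Error Messages and Fixes
Automating GPU Policy with Kyverno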

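Where RuntimeClass and GPU Scheduling Intersect

Running GPU workloads on Kubernetes takes more than adding nvidia.com/gpu: 1 to a container's resources. For the container runtime to expose a GPU, the node needs a runtime handler backed by the NVIDIA Container Toolkit, and RuntimeClass is the mechanism that connects that handler to a Pod.

The full GPU scheduling chain looks like this:

Pod spec (runtimeClassName + resources)
    → RuntimeClass (handler mapping)
        → Container runtime (containerd + nvidia handler)
            → NVIDIA Device Plugin (registers GPU resources)
                → GPU driver (host kernel module)

A version mismatch or misconfiguration at any link in this chain leaves the Pod stuck in Pending or unable to see the GPU. This article covers how to configure and operate the entire chain.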

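Configuring RuntimeClass

Creating the RuntimeClass Resource

RuntimeClass is a cluster-scoped resource. Each RuntimeClass names exactly one container runtime handler, and that handler must be installed on the nodes where its Pods run.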

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# handler must match the runtime handler name in the containerd configuration
handler: nvidia
scheduling:
  # Schedule Pods that use this RuntimeClass only onto specific nodes
  nodeSelector:
    accelerator: nvidia-gpu
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
overhead:
  # Runtime overhead (set this when using a sandboxed runtime such as gVisor)
  podFixed:
    cpu: '100m'
    memory: '128Mi'

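Configuring the containerd Runtime Handler

Register the nvidia handler in /etc/containerd/config.toml on each GPU node. The NVIDIA GPU Operator applies this configuration automatically. The snippet uses the version 2 configuration schema; containerd 2.x defaults to a newer schema with different plugin section names, so adjust the table paths if your nodes run it.

# /etc/containerd/config.toml (GPU node)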
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

Restart containerd after changing the configuration:

sudo systemctl restart containerd
# Verify that the handler is registered
sudo crictl info | jq '.config.containerd.runtimes | keys'
# Example output: ["nvidia", "runc"]
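
Because a handler mismatch only surfaces when a Pod tries to start, it can help to compare the two name lists ahead of time. The following is a minimal Python sketch, assuming the RuntimeClass objects (for example from kubectl get runtimeclass -o json) and the handler names reported by crictl have already been collected. The function name missing_handlers and the sandboxed/runsc entry are hypothetical and exist only to illustrate a mismatch.

```python
def missing_handlers(runtime_classes, node_handlers):
    """Return names of RuntimeClasses whose handler is not registered on the node."""
    registered = set(node_handlers)
    return sorted(
        rc["metadata"]["name"]
        for rc in runtime_classes
        if rc["handler"] not in registered
    )


# Illustrative inputs (assumptions, not taken from a real cluster).
runtime_classes = [
    {"metadata": {"name": "nvidia"}, "handler": "nvidia"},
    {"metadata": {"name": "sandboxed"}, "handler": "runsc"},  # hypothetical
]
node_handlers = ["nvidia", "runc"]  # keys printed by the crictl/jq command

print(missing_handlers(runtime_classes, node_handlers))
```

With these inputs only the hypothetical sandboxed class is reported, since its handler is not among the node's registered runtimes.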

Using the RuntimeClass in a Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  runtimeClassName: nvidia # select the RuntimeClass
  containers:
    - name: model-server
      image: nvcr.io/nvidia/tritonserver:24.08-py3
      resources:
        limits:
          nvidia.com/gpu: 1
      ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc

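Configuring and Operating the NVIDIA Device Plugin

The NVIDIA Device Plugin is a DaemonSet that advertises each node's GPUs to Kubernetes as the nvidia.com/gpu resource. If you are not using the GPU Operator, you must deploy it yourself.

Deploying the Device Plugin DaemonSet

# Deploy the NVIDIA Device Plugin (based on v0.17.0)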
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

After deployment, confirm that the GPUs were registered:

# Check GPU resources per node
kubectl get nodes -o json | jq '.items[] |
  select(.status.capacity["nvidia.com/gpu"] != null) |
  {name: .metadata.name,
   gpu_capacity: .status.capacity["nvidia.com/gpu"],
   gpu_allocatable: .status.allocatable["nvidia.com/gpu"]}'

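Example output (on gpu-node-02 the allocatable count is lower than capacity, which usually means one GPU has been marked unhealthy):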

{
  "name": "gpu-node-01",
  "gpu_capacity": "4",
  "gpu_allocatable": "4"
}
{
  "name": "gpu-node-02",
  "gpu_capacity": "8",
  "gpu_allocatable": "7"
}
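
A gap between capacity and allocatable is easy to miss when scanning many nodes by eye. The sketch below, written in Python under the assumption that it receives the JSON printed by kubectl get nodes -o json, lists only the nodes where fewer GPUs are allocatable than present. The function name gpu_gaps and the sample data are illustrative; the sample mirrors the example output above plus one hypothetical CPU-only node.

```python
import json


def gpu_gaps(nodes_json):
    """Return (node, capacity, allocatable) for nodes with fewer allocatable GPUs."""
    gaps = []
    for item in json.loads(nodes_json).get("items", []):
        status = item.get("status", {})
        capacity = status.get("capacity", {}).get("nvidia.com/gpu")
        if capacity is None:
            continue  # not a GPU node
        allocatable = status.get("allocatable", {}).get("nvidia.com/gpu", "0")
        if int(allocatable) < int(capacity):
            gaps.append((item["metadata"]["name"], int(capacity), int(allocatable)))
    return gaps


def _node(name, capacity=None, allocatable=None):
    """Build a minimal node object for the sample below."""
    status = {"capacity": {}, "allocatable": {}}
    if capacity is not None:
        status["capacity"]["nvidia.com/gpu"] = capacity
        status["allocatable"]["nvidia.com/gpu"] = allocatable
    return {"metadata": {"name": name}, "status": status}


sample = json.dumps({"items": [
    _node("gpu-node-01", "4", "4"),
    _node("gpu-node-02", "8", "7"),
    _node("cpu-node-01"),  # hypothetical CPU-only node
]})

print(gpu_gaps(sample))
```

In practice the input would come from piping the kubectl output into the script instead of the hard-coded sample.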

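GPU Operator vs. Direct Device Plugin Deployment

Item	GPU Operator	Direct Device Plugin deployment
Driver management	Automatic install/update	Manual installation required
containerd configuration	Configured automatically	Manual config.toml edits
DCGM monitoring	Deployed automatically	Separate installation required
MIG management	Declarative management via CRD	Manual configuration with nvidia-smi
Flexibility	Tied to the Operator's release cycle	Each component managed independently
Recommended environment	Managed cloud, large scale	On-premises, when fine-grained control is needed

For most production environments, the GPU Operator is the recommended choice: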

# Install the GPU Operator with Helm
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false  # when MIG is not used

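GPU Node Isolation and Taint/Toleration Strategy

If general workloads land on GPU nodes, they consume the CPU and memory that GPU workloads need while the GPUs themselves sit idle. Conversely, a GPU workload that cannot be placed on a GPU node stays Pending, because no other node advertises nvidia.com/gpu. An isolation strategy that prevents both situations is essential.

Setting Node Labels and Taints

# Add labels to the GPU node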
kubectl label node gpu-node-01 \
  accelerator=nvidia-gpu \
  gpu.nvidia.com/model=A100 \
  gpu.nvidia.com/memory=80Gi

# Add a taint to the GPU node (only workloads that tolerate it are scheduled)
kubectl taint node gpu-node-01 \
  nvidia.com/gpu=present:NoSchedule

Example GPU Workload Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      runtimeClassName: nvidia
      nodeSelector:
        accelerator: nvidia-gpu
        gpu.nvidia.com/model: A100
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4
          args:
            - '--model'
            - 'meta-llama/Llama-3.1-8B-Instruct'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          resources:
            requests:
              cpu: '4'
              memory: '32Gi'
              nvidia.com/gpu: 1
            limits:
              cpu: '8'
              memory: '64Gi'
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
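
The Deployment tolerates the taint with operator Equal, while the RuntimeClass shown earlier uses operator Exists. The sketch below restates the standard Kubernetes matching rules in Python so that a manifest can be checked before it reaches the scheduler. It is an illustration rather than the scheduler's own code, and the function names tolerates and untolerated_taints are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False  # an empty effect would have matched every effect
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if not key:
        return operator == "Exists"  # empty key + Exists matches every taint
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True  # Exists ignores the taint value
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(taints, tolerations):
    """Return the scheduling-relevant taints that no toleration matches."""
    return [
        t for t in taints
        if t["effect"] in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


# Taint and tolerations taken from the examples in this article.
gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
deployment_toleration = {
    "key": "nvidia.com/gpu", "operator": "Equal",
    "value": "present", "effect": "NoSchedule",
}
runtimeclass_toleration = {
    "key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule",
}

print(untolerated_taints([gpu_taint], [deployment_toleration]))
print(untolerated_taints([gpu_taint], []))
```

A non-empty result corresponds to the "untolerated taint" scheduling failure: either the toleration is missing, or an Equal toleration carries a value other than present.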

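Operating MIG (Multi-Instance GPU)

On GPUs such as the A100 and H100, enabling MIG lets you partition one physical GPU into several isolated instances. This can improve cost efficiency considerably when inference workloads do not need a whole GPU.

Configuring MIG Profiles

# Enable MIG mode on the GPU (after connecting to the node over SSH)
sudo nvidia-smi -i 0 -mig 1

# Create MIG instances (A100 80GB)
# Create two 3g.40gb instances (profile ID 9)
sudo nvidia-smi mig -i 0 -cgi 9,9 -C

# List the created MIG instances
nvidia-smi -L

Example output: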

GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxxxx)
  MIG 3g.40gb Device 0: (UUID: MIG-xxxxx)
  MIG 3g.40gb Device 1: (UUID: MIG-xxxxx)

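Declarative MIG Management with the GPU Operator

With the GPU Operator, MIG layouts can be managed declaratively through a ConfigMap. This requires the MIG Manager to be enabled (migManager.enabled=true), and the Operator must be pointed at a custom ConfigMap, for example through the migManager.config.name Helm value: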

apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # Profile A: inference (many small instances)
      inference-optimized:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
      # Profile B: training (a few large instances)
      training-optimized:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2
      # Profile C: MIG disabled
      full-gpu:
        - devices: [0]
          mig-enabled: false

Apply a profile to a node by setting a label:

# Apply the MIG profile for inference servers
kubectl label node gpu-node-01 nvidia.com/mig.config=inference-optimized --overwrite
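
Before adding a new entry under mig-devices, a quick budget check can catch layouts that cannot possibly fit. The Python sketch below assumes an A100 80GB with 7 compute slices and 80 GB of memory, and reads both numbers from the profile name (for example 3g.40gb means 3 compute slices and 40 GB). This is a necessary condition only: real MIG placement has additional rules that the check ignores, and the names fits_budget and parse_profile are hypothetical.

```python
import re

# Assumed budget for one A100 80GB.
A100_80GB = {"compute_slices": 7, "memory_gb": 80}


def parse_profile(name):
    """Split a profile name such as '3g.40gb' into (compute_slices, memory_gb)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", name)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {name}")
    return int(match.group(1)), int(match.group(2))


def fits_budget(mig_devices, gpu=A100_80GB):
    """Check total compute slices and memory against the GPU's budget."""
    compute = memory = 0
    for profile, count in mig_devices.items():
        slices, gigabytes = parse_profile(profile)
        compute += slices * count
        memory += gigabytes * count
    return compute <= gpu["compute_slices"] and memory <= gpu["memory_gb"]


print(fits_budget({"1g.10gb": 7}))                # inference-optimized
print(fits_budget({"3g.40gb": 2}))                # training-optimized
print(fits_budget({"3g.40gb": 2, "1g.10gb": 1}))  # 90 GB of memory requested
```

Both layouts from the ConfigMap pass, while the third combination exceeds the memory budget even though its compute slices add up to 7.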

GPU Monitoring and Alerting

DCGM Exporter + Prometheus Integration

DCGM Exporter exposes GPU metrics in Prometheus format. The GPU Operator deploys it automatically, but it can also be installed on its own.

# ServiceMonitor (when using the Prometheus Operator)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus
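spec:
  # Assumption: the exporter Service lives in the gpu-operator namespace.
  # Without a namespaceSelector, only Services in this ServiceMonitor's own
  # namespace are selected.
  namespaceSelector:
    matchNames:
      - gpu-operator
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    # Must match the Service's port name (verify with kubectl get svc -o yaml)
    - port: metrics
      interval: 15s
      path: /metrics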

Key GPU Metrics and Alert Rules

# PrometheusRule - GPU alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: gpu.rules
      rules:
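        # 30-minute average GPU utilization below 10% → wasted resources
        - alert: GPUUnderutilized
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
            and on(pod) kube_pod_status_phase{phase="Running"} == 1
          # The 30m window already covers the observation period; a short
          # "for" avoids waiting a further 30 minutes before the alert fires.
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU utilization below 10% ({{ $labels.gpu }})'
            description: 'GPU {{ $labels.gpu }} on node {{ $labels.node }} has averaged under 10% utilization for 30 minutes. Consider MIG partitioning or consolidating workloads.'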

        # GPU memory above 95% → risk of OOM
        - alert: GPUMemoryPressure
          expr: |
            DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: 'GPU memory above 95% ({{ $labels.gpu }})'
            description: 'GPU {{ $labels.gpu }} on node {{ $labels.node }} is using more than 95% of its memory. Reduce the batch size or quantize the model.'

        # GPU temperature above 85°C → risk of thermal throttling
        - alert: GPUTemperatureHigh
          expr: DCGM_FI_DEV_GPU_TEMP > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: 'GPU temperature above 85°C ({{ $labels.gpu }})'
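
A memory-pressure rule can be written either as used / free or as used / (used + free), and the two thresholds are related by simple algebra: used / (used + free) > u is the same condition as used / free > u / (1 - u). The Python sketch below works through that arithmetic with exact fractions; the function names and the MiB figures are made up for illustration.

```python
from fractions import Fraction


def used_over_free_threshold(utilization):
    """Convert a used/(used+free) threshold into the equivalent used/free threshold."""
    u = Fraction(utilization)
    return u / (1 - u)


def memory_utilization(used_mib, free_mib):
    """Fraction of framebuffer memory in use."""
    return Fraction(used_mib, used_mib + free_mib)


# A 95% utilization threshold corresponds to a used/free ratio of 19.
print(used_over_free_threshold("0.95"))

# Hypothetical sample: 77000 MiB used, 4000 MiB free.
print(float(memory_utilization(77000, 4000)), float(Fraction(77000, 4000)))
```

For the sample values both forms agree that the GPU is over the threshold; the utilization form is usually easier to read in an alert rule because the number in the expression is the percentage being discussed.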

Key Grafana Dashboard Panels

Core queries for a GPU operations dashboard in Grafana:

# GPU utilization (cluster-wide, per node and GPU)
avg(DCGM_FI_DEV_GPU_UTIL) by (node, gpu)

# GPU memory utilization (%)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100

# Number of containers with a GPU limit, per node
count(kube_pod_container_resource_limits{resource="nvidia_com_gpu"}) by (node)

# Number of Pending Pods that request a GPU (waiting to be scheduled)
count(
  kube_pod_status_phase{phase="Pending"} == 1
  and on(namespace, pod)
  kube_pod_container_resource_requests{resource="nvidia_com_gpu"} > 0
)

Troubleshooting: Real Error Messages and Fixes

Case 1: Pod Stays in Pending

$ kubectl describe pod gpu-job-xyz
Events:
  Warning  FailedScheduling  0/10 nodes are available:
    3 node(s) had untolerated taint {nvidia.com/gpu: present},
    7 node(s) didn't match Pod's node affinity/selector.

Root cause:

The GPU node has a taint but the Pod has no matching toleration, or
the Pod's nodeSelector does not match the GPU node's labels.

Fix:

# 1. Check node labels
kubectl get nodes --show-labels | grep accelerator

# 2. Check node taints
kubectl describe node gpu-node-01 | grep -A5 Taints

# 3. Confirm that the Pod spec has the matching toleration and nodeSelector

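Case 2: RuntimeClass Not Found

$ kubectl describe pod gpu-job-xyz
Events:
  Warning  FailedValidation  RuntimeClass "nvidia" not found

The exact wording varies by Kubernetes version. When the RuntimeClass admission check rejects the Pod at creation time, the Pod object is never created, so the error appears in the kubectl apply output or in the owning ReplicaSet's events instead.

Fix:

# Check whether the RuntimeClass exists
kubectl get runtimeclass

# Create it if it is missing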
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

Case 3: GPU Is Not Registered on the Node

$ kubectl describe node gpu-node-01 | grep nvidia
# No output

Resolution steps:

# 1. Check the Device Plugin Pod status
kubectl -n kube-system get pods | grep nvidia-device-plugin
# If STATUS is not Running, check the logs

# 2. Check the Device Plugin logs
kubectl -n kube-system logs nvidia-device-plugin-xxxxx
# Example error: "Failed to initialize NVML: Driver Not Loaded"
# → The NVIDIA driver is not installed on the host

# 3. Check directly on the node (SSH)
nvidia-smi
# Command not found → driver not installed
# NVIDIA-SMI has failed → driver version and kernel do not match

# 4. Check the installed driver package
dpkg -l | grep nvidia-driver
# or
rpm -qa | grep nvidia-driver

Case 4: Out of GPU Memory (CUDA OOM)

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB
(GPU 0; 79.35 GiB total capacity; 77.31 GiB already allocated)

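Options:

# Option 1: Use a larger GPU or more GPUs
# (with vLLM, also raise --tensor-parallel-size to match the GPU count)
resources:
  limits:
    nvidia.com/gpu: 2  # 1 → 2

# Option 2: Tune the model server
# With vLLM, adjust --gpu-memory-utilization to leave more headroom
args: ["--gpu-memory-utilization", "0.85"]  # 0.9 → 0.85

# Option 3: Reduce the batch size
# Lower the inference server's max-batch-size setting

# Option 4: Quantize the model (INT8/INT4)
# Replace the model files with a quantized version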
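
To judge which option is likely to help, a rough estimate of where the memory goes is useful. The Python sketch below computes the size of the weights alone for a nominal 8-billion-parameter model at different precisions, and the memory budget implied by --gpu-memory-utilization on the 79.35 GiB GPU from the error message. It deliberately ignores the KV cache, activations, and framework overhead, so treat the numbers as lower bounds; all function names are hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 2**30


def weight_gib(params_billion, dtype):
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / GIB


def vllm_budget_gib(total_gib, gpu_memory_utilization):
    """Memory the server may use when given a utilization fraction."""
    return total_gib * gpu_memory_utilization


for dtype in ("fp16", "int8", "int4"):
    print(dtype, round(weight_gib(8, dtype), 1), "GiB of weights")

print("budget at 0.90:", round(vllm_budget_gib(79.35, 0.90), 2), "GiB")
print("budget at 0.85:", round(vllm_budget_gib(79.35, 0.85), 2), "GiB")
```

If the weights are a small share of the budget, the pressure is more likely coming from batch size or context length (options 2 and 3); if they dominate, quantization or more GPUs (options 1 and 4) are the more promising direction.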

Case 5: GPU Workloads Break After an Upgrade

The most common reason GPU workloads break after a Kubernetes version upgrade is an incompatibility between the Device Plugin and the Container Toolkit.

# Check the compatibility matrix before upgrading
# 1. Kubernetes version (the --short flag has been removed from recent kubectl releases)
kubectl version

# 2. NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# 3. Device Plugin version
kubectl -n kube-system get ds nvidia-device-plugin-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# 4. Container Toolkit version
nvidia-container-cli --version

Compatibility matrix (as of March 2026):

Kubernetes	Device Plugin	NVIDIA Driver	Container Toolkit
v1.33	v0.17.x	550+	1.17+
v1.32	v0.16.x ~ v0.17.x	535+	1.16+
v1.31	v0.15.x ~ v0.17.x	535+	1.15+
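
The matrix can be encoded as data so that a pre-upgrade script flags mismatches automatically. The following Python sketch assumes the four version strings were gathered with the commands above; the thresholds are copied from the table, and the function names are hypothetical.

```python
# Rows copied from the compatibility matrix in this article.
# plugin: (lowest, highest) supported (major, minor); driver: minimum major
# version; toolkit: minimum (major, minor).
MATRIX = {
    (1, 33): {"plugin": ((0, 17), (0, 17)), "driver": 550, "toolkit": (1, 17)},
    (1, 32): {"plugin": ((0, 16), (0, 17)), "driver": 535, "toolkit": (1, 16)},
    (1, 31): {"plugin": ((0, 15), (0, 17)), "driver": 535, "toolkit": (1, 15)},
}


def major_minor(version):
    """Parse 'v1.33.1' or '1.17.3' into a (major, minor) tuple."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def compatibility_problems(k8s, plugin, driver, toolkit):
    """Return a list of human-readable mismatches (empty if none)."""
    row = MATRIX.get(major_minor(k8s))
    if row is None:
        return [f"Kubernetes {k8s} is not covered by the matrix"]
    problems = []
    low, high = row["plugin"]
    if not low <= major_minor(plugin) <= high:
        problems.append(f"Device Plugin {plugin} is outside the listed range")
    if int(driver.split(".")[0]) < row["driver"]:
        problems.append(f"Driver {driver} is older than {row['driver']}")
    if major_minor(toolkit) < row["toolkit"]:
        problems.append(f"Container Toolkit {toolkit} is too old")
    return problems


# Hypothetical version strings for illustration.
print(compatibility_problems("v1.33.1", "v0.17.0", "550.54.15", "1.17.3"))
print(compatibility_problems("v1.33.0", "v0.16.2", "535.104.05", "1.17.0"))
```

The first call reports no problems; the second flags both the Device Plugin and the driver. Keeping the table in one data structure also makes it straightforward to update when the matrix changes.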

Automating GPU Policy with Kyverno

GPU resources are expensive, so indiscriminate requests should be controlled by policy.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: gpu-resource-governance
spec:
  validationFailureAction: Enforce
  rules:
    # Rule 1: a GPU request must also set limits
    # (the pattern is checked against every container in the Pod)
    - name: require-gpu-limits
      match:
        any:
          - resources:
              kinds: ['Pod']
      preconditions:
        all:
          - key: '{{ request.object.spec.containers[].resources.requests."nvidia.com/gpu" || '''' }}'
            operator: NotEquals
            value: ''
      validate:
        message: 'If a GPU is set in requests, set the same value in limits.'
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    nvidia.com/gpu: '?*'

    # Rule 2: forbid privileged mode in GPU Pods
    - name: deny-privileged-gpu-pods
      match:
        any:
          - resources:
              kinds: ['Pod']
      preconditions:
        all:
          - key: '{{ request.object.spec.containers[].resources.limits."nvidia.com/gpu" || '''' }}'
            operator: NotEquals
            value: ''
      validate:
        message: 'Privileged mode is forbidden in GPU Pods for security reasons.'
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): false

    # Rule 3: GPU Pods must set runtimeClassName
    - name: require-nvidia-runtimeclass
      match:
        any:
          - resources:
              kinds: ['Pod']
      preconditions:
        all:
          - key: '{{ request.object.spec.containers[].resources.limits."nvidia.com/gpu" || '''' }}'
            operator: NotEquals
            value: ''
      validate:
        message: 'GPU Pods must specify runtimeClassName: nvidia.'
        pattern:
          spec:
            runtimeClassName: nvidia
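
The intent of the three rules can also be checked locally, for example in CI before a manifest is applied. The Python sketch below mirrors that intent on a Pod manifest that has already been loaded into a dictionary. It is not Kyverno and does not evaluate Kyverno patterns; in particular it checks the limits rule per container. The function names are hypothetical.

```python
GPU = "nvidia.com/gpu"


def _resources(container, kind):
    """Return the requests or limits mapping of a container (never None)."""
    return (container.get("resources") or {}).get(kind) or {}


def gpu_policy_violations(pod):
    """Return a list of violations of the three GPU rules for one Pod."""
    spec = pod.get("spec", {})
    containers = spec.get("containers", [])
    uses_gpu = any(
        GPU in _resources(c, "requests") or GPU in _resources(c, "limits")
        for c in containers
    )
    if not uses_gpu:
        return []  # the rules only apply to GPU Pods
    violations = []
    for c in containers:
        name = c.get("name", "?")
        if GPU in _resources(c, "requests") and GPU not in _resources(c, "limits"):
            violations.append(f"{name}: GPU request without a GPU limit")
        if (c.get("securityContext") or {}).get("privileged") is True:
            violations.append(f"{name}: privileged mode is not allowed")
    if spec.get("runtimeClassName") != "nvidia":
        violations.append("spec.runtimeClassName must be nvidia")
    return violations


# Hypothetical manifests for illustration.
good_pod = {"spec": {
    "runtimeClassName": "nvidia",
    "containers": [{"name": "model-server",
                    "resources": {"limits": {GPU: 1}}}],
}}
bad_pod = {"spec": {
    "containers": [{"name": "trainer",
                    "resources": {"requests": {GPU: 1}},
                    "securityContext": {"privileged": True}}],
}}

print(gpu_policy_violations(good_pod))
print(gpu_policy_violations(bad_pod))
```

A local check like this does not replace admission control; it only shortens the feedback loop for manifest authors.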

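Quiz

Q1. What happens if the RuntimeClass handler field does not match a runtime name in containerd's config.toml?

Answer: ||The Pod is scheduled, but sandbox creation fails on the node: the kubelet reports a FailedCreatePodSandBox event because containerd has no runtime configured for that handler, and the containers never start. The handler value must exactly match a runtime name registered in containerd.||

Q2. What operational problem arises if GPU nodes are not tainted?

Answer: ||General (CPU-only) workloads are scheduled onto GPU nodes and occupy node resources (memory, CPU). When GPU workloads actually need to be placed, they can end up Pending for lack of resources.||

Q3. What is the most common reason the NVIDIA Device Plugin fails to register GPUs on a node?

Answer: ||The NVIDIA driver is not installed on the host, or the driver version does not match the kernel version, so NVML initialization fails.||

Q4. Why is declarative management with the GPU Operator better than manual nvidia-smi configuration when using MIG?

Answer: ||MIG instances created by hand are lost on a node reboot or driver update, whereas the GPU Operator restores the layout automatically from the ConfigMap. Manual configuration has to be reapplied every time.||

Q5. How should you respond when GPU utilization stays below 10% for more than 30 minutes?

Answer: ||Partition the GPU with MIG and distribute it among several small workloads, or place multiple inference models on the same GPU to raise utilization. Time-slicing is also an option.||

Q6. What must be checked from a GPU workload perspective before upgrading the Kubernetes version?

Answer: ||Check the compatibility matrix across the Kubernetes version, NVIDIA driver version, Device Plugin version, and Container Toolkit version. If any one of them does not match, GPU registration fails or runtime errors occur.||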

Kubernetes RuntimeClass and GPU Scheduling Operations Guide

1. Why / How / When Perspective Overview

This article was written from an operational perspective so that teams can apply it immediately. The key focus is not the technology itself but consistency in operational decision-making and recovery speed. By separating why it is needed (Why), how to apply it (How), and when to choose it (When), both team onboarding and incident response quality improve simultaneously. In particular, this reflects changes in default values and recommended patterns based on official documentation updated in 2025-2026.

Practical Code Example 1: Environment Check

set -euo pipefail
kubectl version || true
python3 --version

Practical Code Example 2: Automation Script

#!/usr/bin/env bash
for env in dev staging prod; do
  echo "apply to $env"
done

Practical Code Example 3: Python Validation

from datetime import datetime, timezone
print("validated", datetime.now(timezone.utc).isoformat())

Practical Code Example 4: YAML Template

apiVersion: v1
kind: ConfigMap
metadata:
  name: sample
data:
  mode: production

Practical Code Example 5: SQL/Query Example

select now() as checked_at, current_database();

Comparison Table

Item	Option A	Option B	When A	When B
Operational Difficulty	Low	Medium to High	When the team is small	When a platform team exists
Scalability	Medium	High	Single service	Multi-service / Multi-team
Cost	Low	High	Early stage	Traffic / Organization growth

Troubleshooting

Symptom: Increased latency after deployment
- Cause: Missing cache warming, excessive HPA thresholds
- Resolution: Recalibrate thresholds based on load testing
Symptom: Sudden spike in error rate
- Cause: Timeout mismatch with dependent services
- Resolution: Unify timeout/retry/circuit breaker policies
Symptom: Increased rollback time
- Cause: Irreversible DB migration
- Resolution: Use expand/contract pattern + pre-validate rollback scripts

Quiz

What are the 3 key decision-making axes of this article? Answer: Why, How, When
What is the criterion that separates Option A and B? Answer: Team maturity and system complexity
Which 2 metrics should be checked first during incident response? Answer: Error rate, latency
Why is expand/contract needed in rollback strategy? Answer: To avoid irreversible changes
Scenario: Error rate tripled within 5 minutes after deployment. What is the first action? Answer: Reduce traffic or immediately roll back
Scenario: No performance degradation but costs increased by 40%. What should you look at? Answer: Autoscale thresholds and resource requests/limits
Comparison: If simple operations are the priority, which one -- A or B? Answer: A
Comparison: If multi-team independent deployment is the priority, which one -- A or B? Answer: B
Short answer: What document must be produced during monthly reviews? Answer: ADR or operational retrospective
