GitHub Actions Self-Hosted Runner 대규모 운영과 보안 하드닝 가이드
- 왜 Self-Hosted Runner인가
- GitHub-Hosted vs Self-Hosted vs ARC 비교
- ARC(Actions Runner Controller) 아키텍처
- ARC 설치 및 구성
- 커스텀 Runner 이미지 빌드
- 보안 하드닝
- 캐시 전략
- 모니터링과 관찰 가능성
- 장애 케이스와 복구 절차
- 대규모 운영 최적화
- 운영 체크리스트
- 마무리
- References

왜 Self-Hosted Runner인가
GitHub-hosted runner는 빠르게 시작할 수 있지만, 조직 규모가 커지면 한계에 부딪힌다. 빌드 시간이 30분을 넘기고, GPU가 필요하거나, 내부망 리소스에 접근해야 하거나, 비용이 월 수백만 원을 넘기기 시작하면 self-hosted runner 도입을 검토해야 한다.
2026년 3월부터 GitHub은 self-hosted runner에 대해서도 분당 $0.002의 컨트롤 플레인 비용을 부과하기 시작했다(퍼블릭 리포지토리와 GitHub Enterprise Server 고객은 제외). 하지만 대규모 조직에서는 여전히 GitHub-hosted runner 대비 60~80% 비용 절감이 가능하며, 무엇보다 인프라 커스터마이징의 자유도가 압도적이다.
Self-hosted runner를 도입하면 다음이 가능해진다:
- 내부 네트워크의 프라이빗 레지스트리, 데이터베이스, 시크릿 매니저에 직접 접근
- GPU, ARM, Apple Silicon 등 특수 하드웨어에서의 빌드 및 테스트
- 빌드 캐시를 로컬 스토리지에 유지하여 의존성 설치 시간 90% 이상 단축
- 조직 보안 정책에 맞는 네트워크 격리와 감사 로그 구현
GitHub-Hosted vs Self-Hosted vs ARC 비교
러너 선택은 팀 규모, 보안 요건, 운영 역량에 따라 달라진다. 아래 비교표를 기준으로 판단하라.
| 항목 | GitHub-Hosted | Self-Hosted (VM) | ARC (Kubernetes) |
|---|---|---|---|
| 초기 설정 난이도 | 없음 | 중간 | 높음 |
| 오토스케일링 | 자동 | 직접 구현 필요 | 네이티브 지원 |
| 비용 (월 1000시간 기준) | ~$480 (Linux 2-core) | EC2 비용 + 운영 인건비 | K8s 클러스터 비용 + 운영 |
| 빌드 캐시 | 10GB 제한, Azure Blob | 로컬 디스크 무제한 | PVC 또는 S3 |
| 내부망 접근 | 불가 | 가능 | 가능 |
| 보안 격리 | GitHub 관리 | 직접 하드닝 | Pod 수준 격리 |
| GPU 지원 | 제한적 (대형 러너) | 완전 지원 | NVIDIA Device Plugin |
| 최대 동시 러너 | 플랜에 따라 제한 | 인프라 한도 | 클러스터 노드 한도 |
| 유지보수 부담 | 없음 | 높음 (OS 패치, 버전 관리) | 중간 (Helm 업그레이드) |
| Ephemeral 지원 | 기본 | --ephemeral 플래그 | 기본 |
판단 기준: 월 CI/CD 시간이 500시간 이하이고 내부망 접근이 불필요하면 GitHub-hosted가 합리적이다. 500~2000시간이고 K8s 운영 역량이 없으면 VM 기반 self-hosted를 추천한다. 2000시간 이상이거나 K8s 클러스터가 이미 있다면 ARC가 최선이다.
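이 판단 기준은 간단한 함수로 옮겨 내부 도구나 팀 위키에 넣어둘 수 있다. 아래는 본문 임계값을 그대로 따른 스케치이며, 함수 이름과 시그니처는 설명용 가정이다.

```python
def recommend_runner(monthly_ci_hours: int, needs_internal_network: bool,
                     has_k8s_cluster: bool) -> str:
    """본문의 판단 기준을 그대로 옮긴 선택 로직 (임계값: 500시간, 2000시간)."""
    if monthly_ci_hours <= 500 and not needs_internal_network:
        return "github-hosted"
    if monthly_ci_hours >= 2000 or has_k8s_cluster:
        return "arc"
    return "self-hosted-vm"

print(recommend_runner(300, False, False))   # github-hosted
print(recommend_runner(1200, True, False))   # self-hosted-vm
print(recommend_runner(3000, True, False))   # arc
```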
ARC(Actions Runner Controller) 아키텍처
Actions Runner Controller는 GitHub이 공식으로 관리하는 Kubernetes 오퍼레이터다. 커뮤니티 프로젝트로 시작했으나 2023년부터 GitHub이 직접 개발하며 Runner Scale Sets라는 새로운 아키텍처로 진화했다. 레거시 모드의 webhook 기반 오토스케일링과 달리, Runner Scale Sets는 GitHub API와 직접 통신하며 잡 큐를 실시간으로 감지한다.
ARC 동작 원리
┌─────────────────┐ ┌──────────────────────┐
│ GitHub.com │ │ Kubernetes Cluster │
│ │ │ │
│ Job Queue │◄───►│ ARC Controller │
│ (workflow_job) │ │ │ │
│ │ │ ▼ │
│ Scale Set API │◄───►│ ScaleSet Listener │
│ │ │ │ │
└─────────────────┘ │ ▼ │
│ EphemeralRunnerSet │
│ │ │
│ ├─► Runner Pod 1 │
│ ├─► Runner Pod 2 │
│ └─► Runner Pod N │
└──────────────────────┘
- ScaleSet Listener가 GitHub의 잡 큐를 Long Polling 방식으로 감시한다
- Job Available 메시지를 수신하면 현재 러너 수와 maxRunners 설정을 비교한다
- 스케일업이 가능하면 메시지를 ACK하고 Kubernetes API를 통해 EphemeralRunnerSet의 replica 수를 패치한다
- 새 Runner Pod가 생성되고, JIT(Just-In-Time) 토큰으로 GitHub에 등록된다
- 잡 실행이 완료되면 Pod는 즉시 삭제된다 (ephemeral 기본 동작)
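위 흐름 중 Listener의 목표 러너 수 계산을 단순화해 스케치하면 다음과 같다. 실제 ARC 구현이 아닌 개념 모델이며, 함수와 변수 이름은 설명용 가정이다.

```python
def desired_replicas(current: int, queued_jobs: int,
                     min_runners: int, max_runners: int) -> int:
    """큐에 쌓인 잡 수를 기준으로 목표 러너 수를 계산하는 개념 모델.

    실제 Listener는 Job Available 메시지를 ACK한 뒤
    EphemeralRunnerSet의 replica 수를 이 목표값으로 패치한다.
    """
    target = max(min_runners, current + queued_jobs)
    return min(target, max_runners)  # maxRunners를 넘지 않도록 상한 적용

# minRunners=2, maxRunners=30 설정에서
print(desired_replicas(current=2, queued_jobs=5, min_runners=2, max_runners=30))   # 7
print(desired_replicas(current=28, queued_jobs=5, min_runners=2, max_runners=30))  # 30
```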
ARC 설치 및 구성
사전 요건
- Kubernetes 1.27 이상
- Helm 3.x
- GitHub App 또는 Personal Access Token (org 수준 admin:org, repo 수준 repo 스코프)
- cert-manager (TLS 인증서 자동 관리용, 선택)
Step 1: Controller 설치
# ARC 컨트롤러용 네임스페이스 생성
kubectl create namespace arc-systems
# Helm으로 컨트롤러 설치 (OCI 레지스트리에서 차트 직접 설치)
helm install arc \
--namespace arc-systems \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.10.1
Step 2: GitHub App 인증 설정
Personal Access Token보다 GitHub App 인증을 강력하게 추천한다. PAT는 사용자에 귀속되어 퇴사 시 문제가 발생하며, 권한 범위가 넓다. GitHub App은 조직 수준에서 관리되고 필요한 최소 권한만 부여할 수 있다.
# GitHub App 시크릿 생성
kubectl create secret generic github-app-secret \
--namespace arc-runners \
--from-literal=github_app_id=12345 \
--from-literal=github_app_installation_id=67890 \
--from-file=github_app_private_key=./private-key.pem
Step 3: Runner Scale Set 배포
# values.yaml - Runner Scale Set 설정
githubConfigUrl: 'https://github.com/my-org'
githubConfigSecret: github-app-secret
# 오토스케일링 설정
minRunners: 2 # 최소 대기 러너 (콜드 스타트 방지)
maxRunners: 30 # 최대 러너 수 (클러스터 리소스 고려)
# 러너 그룹 지정 (Enterprise/Org 수준)
runnerGroup: 'production-runners'
# 컨테이너 모드 설정
containerMode:
type: 'kubernetes'
kubernetesModeWorkVolumeClaim:
accessModes: ['ReadWriteOnce']
storageClassName: 'gp3'
resources:
requests:
storage: 50Gi
# Pod 템플릿 커스터마이징
template:
spec:
containers:
- name: runner
image: ghcr.io/actions/actions-runner:latest
resources:
requests:
cpu: '2'
memory: '4Gi'
limits:
cpu: '4'
memory: '8Gi'
env:
- name: RUNNER_GRACEFUL_STOP_TIMEOUT
value: '60'
# 노드 선택 (빌드 전용 노드풀)
nodeSelector:
workload-type: ci-runner
tolerations:
- key: 'ci-runner'
operator: 'Equal'
value: 'true'
effect: 'NoSchedule'
# Runner Scale Set 배포
helm install arc-runner-set \
--namespace arc-runners \
--create-namespace \
-f values.yaml \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.10.1
커스텀 Runner 이미지 빌드
기본 Runner 이미지에는 빌드 도구가 포함되어 있지 않다. 조직에서 사용하는 도구를 미리 포함한 커스텀 이미지를 빌드하면 워크플로우 실행 시간을 크게 줄일 수 있다.
Dockerfile 작성 원칙
- 베이스 이미지는 가능한 slim 변형을 사용한다
- Runner 바이너리 버전은 v2.329.0 이상을 사용한다 (2026년 3월 16일부터 이전 버전은 등록이 차단된다)
- 불필요한 패키지 설치를 피하고, 멀티스테이지 빌드로 이미지 크기를 최소화한다
- root가 아닌 전용 사용자로 Runner 프로세스를 실행한다
# Dockerfile.runner - 커스텀 GitHub Actions Runner 이미지
FROM ubuntu:22.04 AS base
# 시스템 패키지 설치 (최소한으로 유지)
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
git \
jq \
unzip \
zip \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Runner 전용 사용자 생성
RUN useradd -m -d /home/runner -s /bin/bash runner
# Runner 바이너리 설치
ARG RUNNER_VERSION=2.329.0
RUN curl -fsSL -o runner.tar.gz \
"https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \
&& mkdir -p /home/runner/actions-runner \
&& tar xzf runner.tar.gz -C /home/runner/actions-runner \
&& rm runner.tar.gz \
&& /home/runner/actions-runner/bin/installdependencies.sh
# Node.js 22 LTS 설치
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*
# Docker CLI 설치 (DinD가 아닌 CLI만)
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker.gpg \
&& echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" \
> /etc/apt/sources.list.d/docker.list \
&& apt-get update && apt-get install -y docker-ce-cli \
&& rm -rf /var/lib/apt/lists/*
# 컨테이너 실행 환경 설정 (시그널 직접 처리, 진단 로그를 stdout으로 출력)
ENV RUNNER_MANUALLY_TRAP_SIG=1
ENV ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1
# 권한 설정 및 사용자 전환
RUN chown -R runner:runner /home/runner
USER runner
WORKDIR /home/runner/actions-runner
ENTRYPOINT ["./run.sh"]
# 이미지 빌드 및 푸시
docker build -t ghcr.io/my-org/actions-runner:v2.329.0-custom -f Dockerfile.runner .
docker push ghcr.io/my-org/actions-runner:v2.329.0-custom
주의: Runner 버전을 이미지 빌드로 고정했으므로, 새로운 Runner 버전이 릴리스되면 이미지를 다시 빌드하고 ARC의 Runner Scale Set 이미지 태그를 업데이트해야 한다. GitHub은 최소 버전 요건을 주기적으로 상향하므로 릴리스 노트를 모니터링하라.
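릴리스 모니터링은 GitHub Releases API를 조회하는 작은 스크립트로 자동화할 수 있다. 아래는 버전 비교 로직의 스케치이며, 함수 이름과 고정 버전 값은 설명용 가정이다.

```python
import json
import urllib.request

def parse_version(tag: str) -> tuple:
    """'v2.329.0' 형태의 태그를 비교 가능한 정수 튜플로 변환한다."""
    return tuple(int(p) for p in tag.lstrip("v").split("."))

def needs_rebuild(pinned: str, latest: str) -> bool:
    """이미지에 고정한 버전이 최신 릴리스보다 낮으면 True."""
    return parse_version(latest) > parse_version(pinned)

def latest_runner_version() -> str:
    """actions/runner 리포지토리의 최신 릴리스 태그를 조회한다."""
    url = "https://api.github.com/repos/actions/runner/releases/latest"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["tag_name"]

# 사용 예 (네트워크 필요, CI에서 주기 실행):
#   if needs_rebuild("v2.329.0", latest_runner_version()):
#       print("이미지 재빌드 및 Scale Set 이미지 태그 업데이트 필요")
```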
보안 하드닝
Self-hosted runner는 조직의 인프라에서 외부 코드를 실행한다. 하드닝 없이 운영하면 공급망 공격, 시크릿 유출, 네트워크 침투의 경로가 된다.
Ephemeral Runner 필수화
Persistent runner는 이전 잡의 파일, 환경 변수, 프로세스가 다음 잡에 영향을 미칠 수 있다. 공격자가 악성 워크플로우를 통해 runner에 백도어를 설치하면, 이후 모든 잡이 오염된다. Ephemeral runner는 잡 완료 후 즉시 파기되므로 이 위험을 원천 차단한다.
# 워크플로우에서 ephemeral runner 사용 확인
runs-on: arc-runner-set # ARC는 기본적으로 ephemeral
# VM 기반 self-hosted runner의 경우
# ./config.sh --ephemeral 플래그로 등록
네트워크 격리
Runner Pod는 인터넷과 내부 네트워크 모두에 접근할 수 있으므로, NetworkPolicy로 필요한 트래픽만 허용해야 한다.
# network-policy.yaml - Runner Pod 네트워크 제한
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: runner-network-policy
namespace: arc-runners
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: runner
policyTypes:
- Egress
- Ingress
ingress: [] # 외부에서 Runner로의 인바운드 차단
egress:
# GitHub API 및 Actions 서비스
- to:
- ipBlock:
cidr: 140.82.112.0/20 # github.com
- ipBlock:
cidr: 185.199.108.0/22 # GitHub Pages/CDN
ports:
- protocol: TCP
port: 443
# 내부 컨테이너 레지스트리
- to:
- namespaceSelector:
matchLabels:
name: registry
ports:
- protocol: TCP
port: 5000
# DNS
- to: []
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
RBAC 최소 권한 원칙
Runner Pod의 ServiceAccount에는 최소한의 권한만 부여한다. 특히 Kubernetes API에 대한 접근을 제한해야 한다.
# rbac.yaml - Runner ServiceAccount 최소 권한
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-sa
namespace: arc-runners
automountServiceAccountToken: false # K8s API 토큰 자동 마운트 비활성화
---
# 필요한 경우에만 최소 권한의 Role 바인딩
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: runner-minimal
namespace: arc-runners
rules: [] # 기본적으로 권한 없음
워크플로우 수준 보안
# 안전한 워크플로우 작성 패턴
name: Secure CI Pipeline
on:
pull_request:
branches: [main]
# 최소 권한 토큰
permissions:
contents: read
packages: read
jobs:
build:
runs-on: arc-runner-set
steps:
# SHA 고정으로 Action 사용 (태그가 아님)
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
# 환경 변수에 시크릿 직접 노출 금지
- name: Build
run: |
echo "Building..."
env:
# 시크릿은 필요한 스텝에서만 주입
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
Harden-Runner를 활용한 런타임 보안
StepSecurity의 Harden-Runner는 GitHub Actions 전용 EDR(Endpoint Detection and Response)로, 네트워크 이그레스 모니터링, 파일 무결성 검사, 프로세스 활동 추적을 제공한다.
jobs:
build:
runs-on: arc-runner-set
steps:
- uses: step-security/harden-runner@0634a2670c59f64b4a01f0f96f84700a4088b9f0 # v2.12.0
with:
egress-policy: audit # 먼저 audit으로 트래픽 패턴 파악
# 패턴 파악 후 block으로 전환
# egress-policy: block
# allowed-endpoints: >
# github.com:443
# registry.npmjs.org:443
# ghcr.io:443
퍼블릭 리포지토리 주의사항
퍼블릭 리포지토리에서는 절대로 self-hosted runner를 사용하지 마라. 외부 공격자가 Fork PR을 통해 임의의 코드를 runner에서 실행할 수 있다. pull_request_target 이벤트와 self-hosted runner의 조합은 특히 위험하다. 반드시 프라이빗 리포지토리 또는 조직 내부 리포지토리에서만 사용하라.
캐시 전략
Ephemeral runner의 가장 큰 단점은 잡마다 캐시가 사라진다는 것이다. 효율적인 캐시 전략 없이는 매 빌드마다 의존성 다운로드부터 시작해야 한다.
전략 1: PersistentVolumeClaim (PVC) 기반 캐시
# ARC Runner Scale Set values.yaml에 PVC 추가
template:
spec:
containers:
- name: runner
image: ghcr.io/my-org/actions-runner:latest
volumeMounts:
- name: cache-volume
mountPath: /opt/cache
env:
- name: RUNNER_TOOL_CACHE
value: /opt/cache/tool-cache
- name: npm_config_cache
value: /opt/cache/npm
- name: GOPATH
value: /opt/cache/go
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: runner-cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: runner-cache-pvc
namespace: arc-runners
spec:
accessModes:
- ReadWriteMany # 여러 Runner Pod가 동시 접근
storageClassName: efs # AWS EFS 또는 NFS
resources:
requests:
storage: 100Gi
주의: ReadWriteMany 모드는 EFS, NFS, GlusterFS 등 네트워크 파일 시스템이 필요하다. EBS 같은 블록 스토리지는 ReadWriteOnce만 지원하므로 한 번에 하나의 Pod만 접근 가능하다.
전략 2: S3 호환 캐시 서버
GitHub의 기본 캐시(actions/cache)는 Azure Blob Storage를 사용한다. AWS 환경이라면 리전 간 지연이 발생한다. S3 호환 캐시 서버를 self-hosted로 운영하면 네트워크 지연을 최소화할 수 있다.
# MinIO 기반 캐시 서버 배포 (같은 VPC/리전 내)
apiVersion: apps/v1
kind: Deployment
metadata:
name: actions-cache-server
namespace: arc-systems
spec:
replicas: 1
selector:
matchLabels:
app: actions-cache
template:
spec:
containers:
- name: minio
image: minio/minio:latest
args: ['server', '/data', '--console-address', ':9001']
env:
- name: MINIO_ROOT_USER
valueFrom:
secretKeyRef:
name: minio-credentials
key: user
- name: MINIO_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: minio-credentials
key: password
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: minio-data-pvc
전략 3: Docker Layer 캐시 (BuildKit)
컨테이너 이미지 빌드가 CI의 주요 워크로드라면, BuildKit의 캐시 백엔드를 레지스트리로 설정해 레이어 캐시를 공유하라.
# 워크플로우에서 BuildKit 레지스트리 캐시 활용
- name: Build and Push
uses: docker/build-push-action@48aba3b46d1b1fec4febb7c5d0c644b249a11355 # v6
with:
push: true
tags: ghcr.io/my-org/my-app:${{ github.sha }}
cache-from: type=registry,ref=ghcr.io/my-org/my-app:buildcache
cache-to: type=registry,ref=ghcr.io/my-org/my-app:buildcache,mode=max
모니터링과 관찰 가능성
Self-hosted runner를 운영하면서 가장 자주 듣는 질문은 "러너가 왜 안 뜨나요?"이다. 모니터링 없이는 답을 줄 수 없다.
Prometheus + Grafana 메트릭
ARC는 Prometheus 메트릭을 기본 노출한다. 핵심 메트릭은 다음과 같다:
- gha_runner_scale_set_desired_replicas: 현재 요청된 러너 수
- gha_runner_scale_set_running_replicas: 실행 중인 러너 수
- gha_runner_scale_set_registered_replicas: GitHub에 등록 완료된 러너 수
- gha_runner_scale_set_idle_replicas: 유휴 러너 수
# Prometheus ServiceMonitor 설정
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: arc-controller-monitor
namespace: arc-systems
spec:
selector:
matchLabels:
app.kubernetes.io/name: gha-runner-scale-set-controller
endpoints:
- port: metrics
interval: 30s
path: /metrics
핵심 알림 규칙
# Alertmanager 규칙
groups:
- name: arc-runner-alerts
rules:
# 러너 풀 고갈 경고
- alert: RunnerPoolExhausted
expr: |
gha_runner_scale_set_desired_replicas
>= gha_runner_scale_set_max_replicas * 0.9
for: 5m
labels:
severity: warning
annotations:
summary: 'Runner pool이 90% 이상 사용 중'
description: 'maxRunners 증설 또는 워크플로우 최적화 필요'
# 러너 등록 실패 감지
- alert: RunnerRegistrationFailed
expr: |
rate(gha_runner_scale_set_registration_failures_total[5m]) > 0
for: 2m
labels:
severity: critical
annotations:
summary: 'Runner 등록 실패 발생'
description: 'GitHub App 인증 또는 네트워크 확인 필요'
# Pending Pod 장시간 대기
- alert: RunnerPodPending
expr: |
kube_pod_status_phase{namespace="arc-runners", phase="Pending"} > 0
for: 10m
labels:
severity: warning
annotations:
summary: 'Runner Pod가 10분 이상 Pending 상태'
description: '노드 리소스 부족 또는 PVC 바인딩 실패 가능성'
장애 케이스와 복구 절차
장애 1: ScaleSet Listener CrashLoopBackOff
증상: Listener Pod가 반복 재시작되며 러너가 전혀 스케일업되지 않는다.
원인 분석 순서:
# 1. Listener Pod 로그 확인
kubectl logs -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener --tail=100
# 2. 흔한 원인: GitHub App 인증 만료
# - Private key 파일 확인
# - App installation 상태 확인 (org settings > GitHub Apps)
# 3. 네트워크 문제: GitHub API 접근 불가
kubectl exec -n arc-systems deploy/arc-gha-runner-scale-set-controller -- \
curl -s https://api.github.com/meta | jq '.actions[]'
복구: GitHub App의 private key를 갱신하고 시크릿을 업데이트한다.
kubectl create secret generic github-app-secret \
--namespace arc-runners \
--from-literal=github_app_id=12345 \
--from-literal=github_app_installation_id=67890 \
--from-file=github_app_private_key=./new-private-key.pem \
--dry-run=client -o yaml | kubectl apply -f -
# Controller 재시작
kubectl rollout restart deployment -n arc-systems arc-gha-runner-scale-set-controller
장애 2: Runner Pod가 Pending 상태에서 멈춤
증상: 잡은 큐에 쌓이지만 Runner Pod가 생성되지 않거나 Pending 상태로 남는다.
# Pod 이벤트 확인
kubectl describe pod -n arc-runners -l actions.github.com/scale-set-name=arc-runner-set
# 흔한 원인별 대응
# 1. 노드 리소스 부족
kubectl top nodes
# -> Cluster Autoscaler가 동작하는지 확인, 또는 maxRunners 하향
# 2. PVC 바인딩 대기
kubectl get pvc -n arc-runners
# -> StorageClass 설정, 가용영역 불일치 확인
# 3. 이미지 Pull 실패
kubectl get events -n arc-runners --sort-by='.lastTimestamp' | grep -i pull
# -> 이미지 태그, 레지스트리 인증 확인
장애 3: 잡이 Runner에 할당되지 않음
증상: GitHub UI에서 잡이 "Queued" 상태로 무기한 대기한다.
# Runner 등록 상태 확인
kubectl get ephemeralrunner -n arc-runners
# Runner labels 확인 (runs-on과 일치해야 함)
kubectl get autoscalingrunnersets -n arc-runners -o yaml | grep -A5 labels
# GitHub에서 Runner 그룹 설정 확인
# Settings > Actions > Runner groups > 해당 그룹에 리포지토리가 포함되어 있는지 확인
복구: 워크플로우의 runs-on 라벨이 ARC Runner Scale Set의 이름과 정확히 일치하는지 확인하라. Runner 그룹이 설정된 경우, 해당 리포지토리가 그룹에 포함되어 있는지도 검증하라.
장애 4: Runner 버전 호환성 문제
2026년 3월 16일부터 v2.329.0 미만 Runner의 등록이 차단된다. 커스텀 이미지를 사용 중이라면 반드시 Runner 버전을 확인하라.
# 현재 Runner 버전 확인
kubectl exec -n arc-runners -it <runner-pod> -- ./config.sh --version
# 이미지 업데이트 (values.yaml 수정 후)
helm upgrade arc-runner-set \
--namespace arc-runners \
-f values.yaml \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
대규모 운영 최적화
러너 그룹 분리 전략
워크로드 특성에 따라 Runner Scale Set을 분리 운영하라. 단일 Scale Set으로 모든 워크로드를 처리하면 리소스 경합과 노이지 네이버 문제가 발생한다.
# 용도별 Runner Scale Set 분리
# 1. 일반 CI (가벼운 테스트, 린트)
# values-ci-light.yaml
minRunners: 2
maxRunners: 20
template:
spec:
containers:
- name: runner
resources:
requests:
cpu: "1"
memory: "2Gi"
# 2. 빌드 전용 (컴파일, Docker 빌드)
# values-ci-build.yaml
minRunners: 1
maxRunners: 10
template:
spec:
containers:
- name: runner
resources:
requests:
cpu: "4"
memory: "8Gi"
# 3. GPU 워크로드 (ML 모델 테스트)
# values-gpu.yaml
minRunners: 0
maxRunners: 4
template:
spec:
containers:
- name: runner
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
accelerator: nvidia-a10g
Graceful Shutdown 처리
러너가 잡을 실행 중일 때 노드 드레인이나 스케일다운이 발생하면 잡이 실패한다. RUNNER_GRACEFUL_STOP_TIMEOUT을 설정하여 진행 중인 잡이 완료될 때까지 기다리도록 하라.
template:
spec:
terminationGracePeriodSeconds: 3600 # 최대 1시간 대기
containers:
- name: runner
env:
- name: RUNNER_GRACEFUL_STOP_TIMEOUT
value: '3500' # terminationGracePeriodSeconds보다 약간 짧게
노드 오토스케일러와의 연동
ARC가 러너 Pod를 생성해도 노드가 부족하면 Pod는 Pending 상태에 머문다. Cluster Autoscaler 또는 Karpenter를 함께 구성하라.
# Karpenter NodePool 예시 (CI Runner 전용)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: ci-runners
spec:
template:
metadata:
labels:
workload-type: ci-runner
spec:
taints:
- key: ci-runner
value: 'true'
effect: NoSchedule
requirements:
- key: kubernetes.io/arch
operator: In
values: ['amd64']
- key: karpenter.sh/capacity-type
operator: In
values: ['on-demand', 'spot']
- key: node.kubernetes.io/instance-type
operator: In
values: ['m7i.xlarge', 'm7i.2xlarge', 'm6i.xlarge', 'm6i.2xlarge']
limits:
cpu: 200
memory: 400Gi
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 60s
Spot 인스턴스를 활용하면 비용을 추가로 50~70% 절감할 수 있다. 단, Spot 인터럽션 시 잡이 실패할 수 있으므로, 중요도가 낮은 CI 워크로드에만 적용하라. 프로덕션 배포 파이프라인에는 On-Demand 인스턴스를 사용하라.
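Spot 적용 비율을 정할 때는 인터럽션으로 인한 재실행 비용까지 포함해 가늠하는 것이 좋다. 아래는 단순 혼합 비용 모델의 스케치이며, 시간당 단가·할인율·재실행 비율은 모두 설명용 가정값이다.

```python
def blended_cost(on_demand_hourly: float, ci_hours: float,
                 spot_ratio: float, spot_discount: float,
                 interruption_rerun_ratio: float) -> float:
    """Spot/On-Demand 혼합 시 월 예상 비용 (단순 모델).

    interruption_rerun_ratio: Spot 인터럽션으로 다시 실행되는 시간 비율 (가정값).
    """
    spot_hours = ci_hours * spot_ratio * (1 + interruption_rerun_ratio)
    od_hours = ci_hours * (1 - spot_ratio)
    return (spot_hours * on_demand_hourly * (1 - spot_discount)
            + od_hours * on_demand_hourly)

# 시간당 $0.2 가정, 월 2000시간, 70%를 Spot(60% 할인, 재실행 5%)으로 돌린 경우
full_od = 2000 * 0.2
mixed = blended_cost(0.2, 2000, 0.7, 0.6, 0.05)
print(f"On-Demand only: ${full_od:.0f}, mixed: ${mixed:.0f}")
```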
운영 체크리스트
Self-hosted runner 도입 전후로 아래 체크리스트를 점검하라.
초기 구축 체크리스트
- GitHub App 인증 구성 완료 (PAT 대신 GitHub App 사용)
- Runner 이미지에 필요한 도구 사전 설치 완료
- Runner 버전 v2.329.0 이상 확인
- Ephemeral 모드 활성화 확인
- NetworkPolicy 적용 (최소한의 이그레스만 허용)
- ServiceAccount에 automountServiceAccountToken: false 설정
- Runner Pod에 리소스 requests/limits 설정
- 노드 선택기(nodeSelector) 또는 Taint/Toleration으로 빌드 노드 분리
- 캐시 전략 결정 및 구현 (PVC, S3, 레지스트리 캐시)
보안 하드닝 체크리스트
- 퍼블릭 리포지토리에서 self-hosted runner 사용 차단
- Runner 그룹으로 리포지토리 접근 범위 제한
- 워크플로우에서 permissions 최소 권한 선언
- Action 참조 시 commit SHA 고정 (태그 대신)
- Docker 소켓 마운트 금지 (컨테이너 모드 사용)
- 시크릿 스캐닝 및 누출 방지 도구 적용
- Runner 호스트 OS 하드닝 (불필요한 서비스 제거, 방화벽 설정)
- OIDC를 활용한 단기 토큰 기반 클라우드 인증
운영 모니터링 체크리스트
- Prometheus 메트릭 수집 및 Grafana 대시보드 구성
- Runner Pool 고갈 알림 설정 (maxRunners의 90% 임계)
- Runner 등록 실패 알림 설정
- Pod Pending 장시간 대기 알림 설정
- Runner 버전 업데이트 알림 (GitHub Changelog 구독)
- 월간 보안 감사 일정 수립 (네트워크 정책 리뷰, 시크릿 로테이션)
마무리
Self-hosted runner 운영은 단순히 VM이나 Pod를 띄우는 것이 아니다. 보안, 스케일링, 캐시, 모니터링, 장애 대응까지 포괄하는 플랫폼 엔지니어링 영역이다. ARC와 Runner Scale Sets의 등장으로 Kubernetes 위에서의 운영이 크게 안정화되었지만, 결국 조직의 워크로드에 맞는 튜닝과 지속적인 모니터링이 필요하다.
핵심을 다시 정리하면:
- Ephemeral은 선택이 아닌 필수다. 보안과 재현성을 동시에 보장한다.
- ARC Runner Scale Sets가 현재 최선의 오토스케일링 방식이다. 레거시 webhook 기반 모드는 사용하지 마라.
- 보안 하드닝을 빌드가 아닌 Day 0에 적용하라. NetworkPolicy, RBAC, SHA 고정, 퍼블릭 리포 차단은 기본이다.
- 캐시 전략 없는 ephemeral runner는 느린 runner일 뿐이다. PVC, S3, 레지스트리 캐시를 반드시 구성하라.
- 모니터링과 알림은 운영의 생명줄이다. Runner Pool 고갈과 등록 실패를 즉시 감지할 수 있어야 한다.
References
- GitHub Docs - Actions Runner Controller - ARC 공식 문서 및 아키텍처 설명
- GitHub Docs - Deploying Runner Scale Sets with ARC - Runner Scale Set 배포 튜토리얼
- GitHub Docs - Self-Hosted Runners - Self-hosted runner 공식 가이드
- GitHub Docs - Secure Use Reference - GitHub Actions 보안 참조 문서
- GitHub Actions Runner Controller Repository - ARC 소스 코드 및 Helm chart values 참조
- AWS Blog - Best Practices for Self-Hosted Runners at Scale - AWS 환경에서의 대규모 runner 운영
- StepSecurity Harden-Runner - GitHub Actions 런타임 보안 모니터링 도구
- GitHub Blog - Self-Hosted Runner Minimum Version Enforcement - 2026년 runner 최소 버전 요건 변경
GitHub Actions Self-Hosted Runner: Large-Scale Operations and Security Hardening Guide
- Why Self-Hosted Runners
- GitHub-Hosted vs Self-Hosted vs ARC Comparison
- ARC (Actions Runner Controller) Architecture
- ARC Installation and Configuration
- Custom Runner Image Build
- Security Hardening
- Cache Strategies
- Monitoring and Observability
- Failure Cases and Recovery Procedures
- Large-Scale Operations Optimization
- Operations Checklist
- Conclusion
- References
- Quiz

Why Self-Hosted Runners
GitHub-hosted runners are quick to get started with, but they hit limitations as organizations scale. When build times exceed 30 minutes, when GPU access is needed, when you need to reach internal network resources, or when costs start exceeding thousands of dollars per month, it is time to consider self-hosted runners.
Starting March 2026, GitHub began charging a control plane fee of $0.002 per minute for self-hosted runners (public repositories and GitHub Enterprise Server customers are excluded). However, for large organizations, 60-80% cost savings compared to GitHub-hosted runners are still achievable, and above all, the freedom of infrastructure customization is overwhelming.
Adopting self-hosted runners enables the following:
- Direct access to private registries, databases, and secret managers on your internal network
- Builds and tests on specialized hardware such as GPU, ARM, and Apple Silicon
- Maintaining build caches on local storage, reducing dependency installation time by over 90%
- Network isolation and audit logging that conforms to organizational security policies
GitHub-Hosted vs Self-Hosted vs ARC Comparison
Runner selection depends on team size, security requirements, and operational capabilities. Use the comparison table below as your decision criteria.
| Item | GitHub-Hosted | Self-Hosted (VM) | ARC (Kubernetes) |
|---|---|---|---|
| Initial setup difficulty | None | Medium | High |
| Autoscaling | Automatic | Must implement yourself | Native support |
| Cost (per 1000 hours/month) | ~$480 (Linux 2-core) | EC2 cost + ops personnel | K8s cluster cost + ops |
| Build cache | 10GB limit, Azure Blob | Local disk unlimited | PVC or S3 |
| Internal network access | Not possible | Possible | Possible |
| Security isolation | Managed by GitHub | Manual hardening | Pod-level isolation |
| GPU support | Limited (larger runners) | Full support | NVIDIA Device Plugin |
| Max concurrent runners | Plan-dependent limits | Infrastructure limits | Cluster node limits |
| Maintenance burden | None | High (OS patches, version mgmt) | Medium (Helm upgrades) |
| Ephemeral support | Default | --ephemeral flag | Default |
Decision criteria: If your monthly CI/CD time is under 500 hours and internal network access is not needed, GitHub-hosted is practical. If 500-2000 hours and you lack K8s operational capability, VM-based self-hosted is recommended. If over 2000 hours or you already have a K8s cluster, ARC is the best option.
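These decision criteria can be captured in a small helper for a team wiki or internal tool. The sketch below follows the thresholds stated above; the function name and signature are illustrative assumptions.

```python
def recommend_runner(monthly_ci_hours: int, needs_internal_network: bool,
                     has_k8s_cluster: bool) -> str:
    """Selection logic mirroring the decision criteria (500h and 2000h thresholds)."""
    if monthly_ci_hours <= 500 and not needs_internal_network:
        return "github-hosted"
    if monthly_ci_hours >= 2000 or has_k8s_cluster:
        return "arc"
    return "self-hosted-vm"

print(recommend_runner(300, False, False))   # github-hosted
print(recommend_runner(1200, True, False))   # self-hosted-vm
print(recommend_runner(3000, True, False))   # arc
```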
ARC (Actions Runner Controller) Architecture
Actions Runner Controller is an officially GitHub-maintained Kubernetes operator. It started as a community project but has been developed directly by GitHub since 2023, evolving into the new Runner Scale Sets architecture. Unlike the legacy mode's webhook-based autoscaling, Runner Scale Sets communicates directly with the GitHub API and detects job queues in real time.
How ARC Works
┌─────────────────┐ ┌──────────────────────┐
│ GitHub.com │ │ Kubernetes Cluster │
│ │ │ │
│ Job Queue │◄───►│ ARC Controller │
│ (workflow_job) │ │ │ │
│ │ │ ▼ │
│ Scale Set API │◄───►│ ScaleSet Listener │
│ │ │ │ │
└─────────────────┘ │ ▼ │
│ EphemeralRunnerSet │
│ │ │
│ ├─► Runner Pod 1 │
│ ├─► Runner Pod 2 │
│ └─► Runner Pod N │
└──────────────────────┘
- The ScaleSet Listener monitors GitHub's job queue via Long Polling
- Upon receiving a Job Available message, it compares the current runner count against the maxRunners setting
- If scale-up is possible, it ACKs the message and patches the EphemeralRunnerSet replica count via the Kubernetes API
- A new Runner Pod is created and registered with GitHub using a JIT (Just-In-Time) token
- Once job execution completes, the Pod is immediately deleted (default ephemeral behavior)
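The Listener's target-runner calculation in the flow above can be sketched in simplified form. This is a conceptual model, not the actual ARC implementation; function and variable names are illustrative assumptions.

```python
def desired_replicas(current: int, queued_jobs: int,
                     min_runners: int, max_runners: int) -> int:
    """Conceptual model of computing the target runner count from queued jobs.

    The actual Listener ACKs Job Available messages and then patches
    the EphemeralRunnerSet replica count to this target.
    """
    target = max(min_runners, current + queued_jobs)
    return min(target, max_runners)  # cap at maxRunners

# With minRunners=2, maxRunners=30:
print(desired_replicas(current=2, queued_jobs=5, min_runners=2, max_runners=30))   # 7
print(desired_replicas(current=28, queued_jobs=5, min_runners=2, max_runners=30))  # 30
```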
ARC Installation and Configuration
Prerequisites
- Kubernetes 1.27 or higher
- Helm 3.x
- GitHub App or Personal Access Token (org-level admin:org, repo-level repo scope)
- cert-manager (optional, for automated TLS certificate management)
Step 1: Controller Installation
# Create namespace for ARC controller
kubectl create namespace arc-systems
# Install via Helm
helm install arc \
--namespace arc-systems \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.10.1
Step 2: GitHub App Authentication Setup
GitHub App authentication is strongly recommended over Personal Access Tokens. PATs are tied to individual users, causing issues when employees leave, and their permission scope is broad. GitHub Apps are managed at the organization level and can be granted minimum necessary permissions.
# Create GitHub App secret
kubectl create secret generic github-app-secret \
--namespace arc-runners \
--from-literal=github_app_id=12345 \
--from-literal=github_app_installation_id=67890 \
--from-file=github_app_private_key=./private-key.pem
Step 3: Runner Scale Set Deployment
# values.yaml - Runner Scale Set configuration
githubConfigUrl: 'https://github.com/my-org'
githubConfigSecret: github-app-secret
# Autoscaling settings
minRunners: 2 # Minimum standby runners (prevents cold starts)
maxRunners: 30 # Maximum runners (consider cluster resources)
# Runner group assignment (Enterprise/Org level)
runnerGroup: 'production-runners'
# Container mode settings
containerMode:
type: 'kubernetes'
kubernetesModeWorkVolumeClaim:
accessModes: ['ReadWriteOnce']
storageClassName: 'gp3'
resources:
requests:
storage: 50Gi
# Pod template customization
template:
spec:
containers:
- name: runner
image: ghcr.io/actions/actions-runner:latest
resources:
requests:
cpu: '2'
memory: '4Gi'
limits:
cpu: '4'
memory: '8Gi'
env:
- name: RUNNER_GRACEFUL_STOP_TIMEOUT
value: '60'
# Node selection (dedicated build node pool)
nodeSelector:
workload-type: ci-runner
tolerations:
- key: 'ci-runner'
operator: 'Equal'
value: 'true'
effect: 'NoSchedule'
# Deploy Runner Scale Set
helm install arc-runner-set \
--namespace arc-runners \
--create-namespace \
-f values.yaml \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.10.1
Custom Runner Image Build
The default Runner image does not include build tools. Building a custom image with your organization's pre-installed tools can significantly reduce workflow execution time.
Dockerfile Writing Principles
- Use slim variants for base images when possible
- Use Runner binary version v2.329.0 or higher (versions below this will be blocked from registration starting March 16, 2026)
- Avoid installing unnecessary packages and minimize image size with multi-stage builds
- Run the Runner process as a dedicated user, not root
# Dockerfile.runner - Custom GitHub Actions Runner image
FROM ubuntu:22.04 AS base
# Install system packages (keep minimal)
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
git \
jq \
unzip \
zip \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Create dedicated runner user
RUN useradd -m -d /home/runner -s /bin/bash runner
# Install Runner binary
ARG RUNNER_VERSION=2.329.0
RUN curl -fsSL -o runner.tar.gz \
"https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \
&& mkdir -p /home/runner/actions-runner \
&& tar xzf runner.tar.gz -C /home/runner/actions-runner \
&& rm runner.tar.gz \
&& /home/runner/actions-runner/bin/installdependencies.sh
# Install Node.js 22 LTS
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*
# Install Docker CLI (CLI only, not DinD)
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker.gpg \
&& echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" \
> /etc/apt/sources.list.d/docker.list \
&& apt-get update && apt-get install -y docker-ce-cli \
&& rm -rf /var/lib/apt/lists/*
# Container runtime behavior (trap signals manually, print diagnostic logs to stdout)
ENV RUNNER_MANUALLY_TRAP_SIG=1
ENV ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1
# Set permissions and switch user
RUN chown -R runner:runner /home/runner
USER runner
WORKDIR /home/runner/actions-runner
ENTRYPOINT ["./run.sh"]
# Build and push image
docker build -t ghcr.io/my-org/actions-runner:v2.329.0-custom -f Dockerfile.runner .
docker push ghcr.io/my-org/actions-runner:v2.329.0-custom
Note: Since the Runner version is pinned at image build time, you must rebuild the image and update the ARC Runner Scale Set image tag when new Runner versions are released. GitHub periodically raises minimum version requirements, so monitor release notes.
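Release monitoring can be automated with a small script that queries the GitHub Releases API. Below is a sketch of the version-comparison logic; function names and the pinned version value are illustrative assumptions.

```python
import json
import urllib.request

def parse_version(tag: str) -> tuple:
    """Convert a tag like 'v2.329.0' into a comparable integer tuple."""
    return tuple(int(p) for p in tag.lstrip("v").split("."))

def needs_rebuild(pinned: str, latest: str) -> bool:
    """True if the version pinned in the image is older than the latest release."""
    return parse_version(latest) > parse_version(pinned)

def latest_runner_version() -> str:
    """Fetch the latest release tag of the actions/runner repository."""
    url = "https://api.github.com/repos/actions/runner/releases/latest"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["tag_name"]

# Usage (requires network, run periodically in CI):
#   if needs_rebuild("v2.329.0", latest_runner_version()):
#       print("Rebuild image and update the Scale Set image tag")
```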
Security Hardening
Self-hosted runners execute external code within your organization's infrastructure. Operating without hardening creates pathways for supply chain attacks, secret leaks, and network infiltration.
Mandate Ephemeral Runners
Persistent runners allow files, environment variables, and processes from previous jobs to affect subsequent jobs. If an attacker installs a backdoor on the runner through a malicious workflow, all subsequent jobs become compromised. Ephemeral runners are destroyed immediately after job completion, eliminating this risk at its source.
# Verify ephemeral runner usage in workflows
runs-on: arc-runner-set # ARC is ephemeral by default
# For VM-based self-hosted runners
# Register with the --ephemeral flag via ./config.sh
Network Isolation
Runner Pods can access both the internet and internal networks, so NetworkPolicy must be used to allow only necessary traffic.
# network-policy.yaml - Runner Pod network restrictions
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: runner-network-policy
namespace: arc-runners
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: runner
policyTypes:
- Egress
- Ingress
ingress: [] # Block inbound traffic from outside to Runner
egress:
# GitHub API and Actions services
- to:
- ipBlock:
cidr: 140.82.112.0/20 # github.com
- ipBlock:
cidr: 185.199.108.0/22 # GitHub Pages/CDN
ports:
- protocol: TCP
port: 443
# Internal container registry
- to:
- namespaceSelector:
matchLabels:
name: registry
ports:
- protocol: TCP
port: 5000
# DNS
- to: []
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
RBAC Least Privilege Principle
Grant only minimum permissions to the Runner Pod's ServiceAccount. Access to the Kubernetes API must be restricted in particular.
# rbac.yaml - Runner ServiceAccount minimum permissions
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-sa
namespace: arc-runners
automountServiceAccountToken: false # Disable auto-mounting K8s API token
---
# Only bind minimal Role when necessary
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: runner-minimal
namespace: arc-runners
rules: [] # No permissions by default
Workflow-Level Security
# Secure workflow writing patterns
name: Secure CI Pipeline
on:
pull_request:
branches: [main]
# Minimum privilege tokens
permissions:
contents: read
packages: read
jobs:
build:
runs-on: arc-runner-set
steps:
# Use Actions with SHA pinning (not tags)
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
# Avoid directly exposing secrets in environment variables
- name: Build
run: |
echo "Building..."
env:
# Inject secrets only in the steps that need them
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
Runtime Security with Harden-Runner
StepSecurity's Harden-Runner is a GitHub Actions-specific EDR (Endpoint Detection and Response) that provides network egress monitoring, file integrity checking, and process activity tracking.
jobs:
build:
runs-on: arc-runner-set
steps:
- uses: step-security/harden-runner@0634a2670c59f64b4a01f0f96f84700a4088b9f0 # v2.12.0
with:
egress-policy: audit # First use audit to understand traffic patterns
# Switch to block after understanding patterns
# egress-policy: block
# allowed-endpoints: >
# github.com:443
# registry.npmjs.org:443
# ghcr.io:443
Public Repository Considerations
Never use self-hosted runners with public repositories. External attackers can execute arbitrary code on the runner through Fork PRs. The combination of pull_request_target events and self-hosted runners is especially dangerous. Only use them with private repositories or internal organization repositories.
Cache Strategies
The biggest drawback of ephemeral runners is that the cache is lost with every job. Without an efficient cache strategy, every build must start from dependency downloads.
Strategy 1: PersistentVolumeClaim (PVC) Based Cache
# Add a PVC to the ARC Runner Scale Set values.yaml
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/my-org/actions-runner:latest
        volumeMounts:
          - name: cache-volume
            mountPath: /opt/cache
        env:
          - name: RUNNER_TOOL_CACHE
            value: /opt/cache/tool-cache
          - name: npm_config_cache
            value: /opt/cache/npm
          - name: GOPATH
            value: /opt/cache/go
    volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: runner-cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-cache-pvc
  namespace: arc-runners
spec:
  accessModes:
    - ReadWriteMany  # Multiple runner Pods mount it simultaneously
  storageClassName: efs  # AWS EFS or NFS
  resources:
    requests:
      storage: 100Gi
Note: ReadWriteMany mode requires a network file system such as EFS, NFS, or GlusterFS. Block storage like EBS only supports ReadWriteOnce, allowing access by only one Pod at a time.
Strategy 2: S3-Compatible Cache Server
GitHub's default cache (actions/cache) is backed by Azure Blob Storage, so runners in AWS pay cross-cloud latency on every cache save and restore. Running a self-hosted S3-compatible cache server in the same VPC and region minimizes that network latency.
# MinIO-based cache server deployment (same VPC/region as the runners)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: actions-cache-server
  namespace: arc-systems
spec:
  replicas: 1
  selector:
    matchLabels:
      app: actions-cache
  template:
    metadata:
      labels:
        app: actions-cache  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args: ['server', '/data', '--console-address', ':9001']
          env:
            - name: MINIO_ROOT_USER
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: user
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: password
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: minio-data-pvc
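A quick smoke test from inside the cluster network, assuming a ClusterIP Service named `actions-cache-server` exposes MinIO on port 9000 (the Service, bucket name, and credentials below are placeholders):

```shell
# Credentials come from the minio-credentials Secret; substitute real values.
export AWS_ACCESS_KEY_ID='<minio user>'
export AWS_SECRET_ACCESS_KEY='<minio pass>'
ENDPOINT=http://actions-cache-server.arc-systems.svc:9000

# Create the cache bucket once.
aws --endpoint-url "$ENDPOINT" s3 mb s3://ci-cache

# Save and restore an archive the way a cache step would.
tar czf deps.tgz node_modules/
aws --endpoint-url "$ENDPOINT" s3 cp deps.tgz "s3://ci-cache/deps-${GITHUB_SHA}.tgz"
aws --endpoint-url "$ENDPOINT" s3 cp "s3://ci-cache/deps-${GITHUB_SHA}.tgz" deps.tgz
```

If these round-trips are fast, workflow cache steps pointed at the same endpoint will be too.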
Strategy 3: Docker Layer Cache (BuildKit)
If container image builds are the primary CI workload, set BuildKit's cache backend to a registry to share layer caches.
# Use BuildKit registry cache in workflows
- name: Build and Push
  uses: docker/build-push-action@48aba3b46d1b1fec4febb7c5d0c644b249a11355 # v6
  with:
    push: true
    tags: ghcr.io/my-org/my-app:${{ github.sha }}
    cache-from: type=registry,ref=ghcr.io/my-org/my-app:buildcache
    cache-to: type=registry,ref=ghcr.io/my-org/my-app:buildcache,mode=max
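The same registry cache can be exercised directly with `docker buildx`, which is handy for debugging cache hits outside the workflow. A sketch with the image names from the step above:

```shell
# Equivalent buildx invocation to the action above (BuildKit required).
# A cold build populates the buildcache ref; repeat builds should hit it.
docker buildx build \
  --push \
  -t "ghcr.io/my-org/my-app:${GITHUB_SHA:-dev}" \
  --cache-from type=registry,ref=ghcr.io/my-org/my-app:buildcache \
  --cache-to type=registry,ref=ghcr.io/my-org/my-app:buildcache,mode=max \
  .
```

`mode=max` exports intermediate layers as well as the final ones, which costs registry storage but maximizes cache hits across runners.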
Monitoring and Observability
The most common question when operating self-hosted runners is "Why isn't the runner starting?" Without monitoring, you cannot provide an answer.
Prometheus + Grafana Metrics
ARC exposes Prometheus metrics by default. The key metrics are:
- gha_runner_scale_set_desired_replicas: currently requested runner count
- gha_runner_scale_set_running_replicas: currently running runner count
- gha_runner_scale_set_registered_replicas: runners successfully registered with GitHub
- gha_runner_scale_set_idle_replicas: idle runner count
# Prometheus ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arc-controller-monitor
  namespace: arc-systems
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gha-runner-scale-set-controller
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Key Alert Rules
# Prometheus alerting rules (routed through Alertmanager)
groups:
  - name: arc-runner-alerts
    rules:
      # Runner pool exhaustion warning
      - alert: RunnerPoolExhausted
        expr: |
          gha_runner_scale_set_desired_replicas
            >= gha_runner_scale_set_max_replicas * 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Runner pool is over 90% utilized'
          description: 'Increase maxRunners or optimize workflows'
      # Runner registration failure detection
      - alert: RunnerRegistrationFailed
        expr: |
          rate(gha_runner_scale_set_registration_failures_total[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Runner registration failure detected'
          description: 'Check GitHub App authentication or network'
      # Prolonged Pod Pending state
      - alert: RunnerPodPending
        expr: |
          kube_pod_status_phase{namespace="arc-runners", phase="Pending"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Runner Pod has been Pending for over 10 minutes'
          description: 'Possible node resource shortage or PVC binding failure'
Failure Cases and Recovery Procedures
Failure 1: ScaleSet Listener CrashLoopBackOff
Symptoms: The Listener Pod repeatedly restarts and runners do not scale up at all.
Root cause analysis order:
# 1. Check Listener Pod logs
kubectl logs -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener --tail=100

# 2. Common cause: expired GitHub App credentials
#    - Check the private key file
#    - Check the App installation status (org settings > GitHub Apps)

# 3. Network issue: cannot reach the GitHub API
kubectl exec -n arc-systems deploy/arc-gha-runner-scale-set-controller -- \
  curl -s https://api.github.com/meta | jq '.actions[]'
Recovery: Renew the GitHub App's private key and update the secret.
kubectl create secret generic github-app-secret \
  --namespace arc-runners \
  --from-literal=github_app_id=12345 \
  --from-literal=github_app_installation_id=67890 \
  --from-file=github_app_private_key=./new-private-key.pem \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart the controller
kubectl rollout restart deployment -n arc-systems arc-gha-runner-scale-set-controller
Failure 2: Runner Pod Stuck in Pending State
Symptoms: Jobs queue up but Runner Pods are not created or remain in Pending state.
# Check Pod events
kubectl describe pod -n arc-runners -l actions.github.com/scale-set-name=arc-runner-set
# Then work through the common causes:
# 1. Node resource shortage
kubectl top nodes
# -> Verify Cluster Autoscaler is working, or lower maxRunners
# 2. PVC binding waiting
kubectl get pvc -n arc-runners
# -> Check StorageClass settings, availability zone mismatch
# 3. Image pull failure
kubectl get events -n arc-runners --sort-by='.lastTimestamp' | grep -i pull
# -> Check image tag, registry authentication
Failure 3: Jobs Not Assigned to Runners
Symptoms: Jobs remain in "Queued" state indefinitely in the GitHub UI.
# Check Runner registration status
kubectl get ephemeralrunner -n arc-runners
# Check Runner labels (must match runs-on)
kubectl get autoscalingrunnersets -n arc-runners -o yaml | grep -A5 labels
# Check Runner group settings on GitHub
# Settings > Actions > Runner groups > Verify the repository is included in the group
Recovery: Verify that the workflow's runs-on label exactly matches the ARC Runner Scale Set name. If a Runner group is configured, also verify that the repository is included in the group.
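Concretely, the runs-on label is the installation name of the scale set, i.e. the Helm release name (`arc-runner-set` throughout this guide). A minimal sketch:

```yaml
# runs-on must equal the Runner Scale Set's installation name,
# i.e. the name passed to: helm install arc-runner-set ...
jobs:
  test:
    runs-on: arc-runner-set
```

Unlike classic self-hosted runners, Scale Set runners do not support arbitrary extra labels, so a typo here silently leaves the job queued forever.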
Failure 4: Runner Version Compatibility Issues
Starting March 16, 2026, registration of Runners below v2.329.0 will be blocked. If you are using custom images, you must verify the Runner version.
# Check current Runner version
kubectl exec -n arc-runners -it <runner-pod> -- ./config.sh --version
# Update image (after modifying values.yaml)
helm upgrade arc-runner-set \
  --namespace arc-runners \
  -f values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
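To see how far behind your custom image is, you can query the latest published runner release. A sketch using the `gh` CLI (requires network access):

```shell
# Fetch the latest actions/runner release tag from GitHub.
LATEST=$(gh api repos/actions/runner/releases/latest --jq '.tag_name')
echo "latest runner release: $LATEST"

# Compare this against the version baked into your custom image
# and rebuild before the March 16, 2026 minimum-version cutoff.
```

Wiring this into a scheduled workflow turns the version check into an alert instead of a surprise.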
Large-Scale Operations Optimization
Runner Group Separation Strategy
Separate Runner Scale Sets by workload characteristics. Handling all workloads with a single Scale Set causes resource contention and noisy neighbor problems.
# Separate Runner Scale Sets by purpose

# 1. General CI (lightweight tests, lint)
# values-ci-light.yaml
minRunners: 2
maxRunners: 20
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"

# 2. Build-dedicated (compilation, Docker builds)
# values-ci-build.yaml
minRunners: 1
maxRunners: 10
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "4"
            memory: "8Gi"

# 3. GPU workloads (ML model testing)
# values-gpu.yaml
minRunners: 0
maxRunners: 4
template:
  spec:
    containers:
      - name: runner
        resources:
          limits:
            nvidia.com/gpu: 1
    nodeSelector:
      accelerator: nvidia-a10g
Graceful Shutdown Handling
If a node drain or scale-down occurs while a runner is executing a job, the job fails. Set RUNNER_GRACEFUL_STOP_TIMEOUT to wait until in-progress jobs complete.
template:
  spec:
    terminationGracePeriodSeconds: 3600  # Wait up to 1 hour
    containers:
      - name: runner
        env:
          - name: RUNNER_GRACEFUL_STOP_TIMEOUT
            value: '3500'  # Slightly shorter than terminationGracePeriodSeconds
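The invariant worth checking in CI for your values files: the runner's graceful-stop timeout must leave headroom below the Pod's termination grace period, or the kubelet will SIGKILL the runner mid-job. A minimal sanity-check sketch using the values above:

```shell
# Values copied from the snippet above; adjust to match your values.yaml.
TERMINATION_GRACE=3600
RUNNER_GRACEFUL_STOP_TIMEOUT=3500

if [ "$RUNNER_GRACEFUL_STOP_TIMEOUT" -lt "$TERMINATION_GRACE" ]; then
  echo "ok: runner finishes its graceful stop before the Pod is force-killed"
else
  echo "error: kubelet may SIGKILL the runner mid-job" >&2
  exit 1
fi
```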
Integration with Node Autoscalers
Even when ARC creates Runner Pods, if nodes are insufficient, Pods remain in Pending state. Configure Cluster Autoscaler or Karpenter alongside ARC.
# Karpenter NodePool example (dedicated to CI runners)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci-runners
spec:
  template:
    metadata:
      labels:
        workload-type: ci-runner
    spec:
      taints:
        - key: ci-runner
          value: 'true'
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['m7i.xlarge', 'm7i.2xlarge', 'm6i.xlarge', 'm6i.2xlarge']
  limits:
    cpu: 200
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
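Because the NodePool taints its nodes with `ci-runner=true:NoSchedule`, the Runner Scale Set Pods must tolerate that taint and select the node label, or they will never land on these nodes. A matching sketch for the scale set's values.yaml:

```yaml
# values.yaml: schedule runner Pods onto the tainted CI NodePool above
template:
  spec:
    nodeSelector:
      workload-type: ci-runner
    tolerations:
      - key: ci-runner
        operator: Equal
        value: 'true'
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/my-org/actions-runner:latest
```

The taint/toleration pair keeps unrelated workloads off the CI nodes while letting Karpenter scale the pool to zero when no jobs are queued.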
Utilizing Spot instances can reduce costs by an additional 50-70%. However, since jobs may fail upon Spot interruption, apply this only to low-priority CI workloads. Use On-Demand instances for production deployment pipelines.
Operations Checklist
Review the following checklists before and after adopting self-hosted runners.
Initial Setup Checklist
- GitHub App authentication configuration complete (use GitHub App instead of PAT)
- Required tools pre-installed in Runner image
- Runner version v2.329.0 or higher confirmed
- Ephemeral mode activation confirmed
- NetworkPolicy applied (allow only minimum egress)
- automountServiceAccountToken: false set on ServiceAccount
- Resource requests/limits set on Runner Pods
- Build nodes isolated via nodeSelector or Taint/Toleration
- Cache strategy decided and implemented (PVC, S3, registry cache)
Security Hardening Checklist
- Self-hosted runner usage blocked for public repositories
- Repository access scope limited via Runner groups
- Minimum-privilege permissions declared in workflows
- Actions referenced by commit SHA (not tags)
- Docker socket mounting prohibited (use container mode)
- Secret scanning and leak prevention tools applied
- Runner host OS hardened (unnecessary services removed, firewall configured)
- Short-lived token-based cloud authentication via OIDC
Operations Monitoring Checklist
- Prometheus metrics collection and Grafana dashboard configured
- Runner Pool exhaustion alert set (90% of maxRunners threshold)
- Runner registration failure alert set
- Prolonged Pod Pending alert set
- Runner version update alerts (subscribe to GitHub Changelog)
- Monthly security audit schedule established (network policy review, secret rotation)
Conclusion
Operating self-hosted runners is not just about spinning up VMs or Pods. It is a platform engineering domain that encompasses security, scaling, caching, monitoring, and incident response. While ARC and Runner Scale Sets have significantly stabilized Kubernetes-based operations, ultimately you need tuning tailored to your organization's workloads and continuous monitoring.
To recap the key points:
- Ephemeral is mandatory, not optional. It ensures both security and reproducibility.
- ARC Runner Scale Sets is currently the best autoscaling approach. Do not use the legacy webhook-based mode.
- Apply security hardening on Day 0, not as an afterthought. NetworkPolicy, RBAC, SHA pinning, and public repo blocking are baseline requirements.
- An ephemeral runner without a cache strategy is just a slow runner. Always configure PVC, S3, or registry cache.
- Monitoring and alerts are the lifeline of operations. You must be able to immediately detect Runner Pool exhaustion and registration failures.
References
- GitHub Docs - Actions Runner Controller - ARC official documentation and architecture explanation
- GitHub Docs - Deploying Runner Scale Sets with ARC - Runner Scale Set deployment tutorial
- GitHub Docs - Self-Hosted Runners - Self-hosted runner official guide
- GitHub Docs - Secure Use Reference - GitHub Actions security reference document
- GitHub Actions Runner Controller Repository - ARC source code and Helm chart values reference
- AWS Blog - Best Practices for Self-Hosted Runners at Scale - Large-scale runner operations in AWS environments
- StepSecurity Harden-Runner - GitHub Actions runtime security monitoring tool
- GitHub Blog - Self-Hosted Runner Minimum Version Enforcement - 2026 runner minimum version requirement changes