GitHub Actions Self-Hosted Runner 대규모 운영과 보안 하드닝 가이드
- 왜 Self-Hosted Runner인가
- GitHub-Hosted vs Self-Hosted vs ARC 비교
- ARC(Actions Runner Controller) 아키텍처
- ARC 설치 및 구성
- 커스텀 Runner 이미지 빌드
- 보안 하드닝
- 캐시 전략
- 모니터링과 관찰 가능성
- 장애 케이스와 복구 절차
- 대규모 운영 최적화
- 운영 체크리스트
- 마무리
- References

왜 Self-Hosted Runner인가
GitHub-hosted runner는 빠르게 시작할 수 있지만, 조직 규모가 커지면 한계에 부딪힌다. 빌드 시간이 30분을 넘기고, GPU가 필요하거나, 내부망 리소스에 접근해야 하거나, 비용이 월 수백만 원을 넘기기 시작하면 self-hosted runner 도입을 검토해야 한다.
2026년 3월부터 GitHub은 self-hosted runner에 대해서도 분당 $0.002의 컨트롤 플레인 비용을 부과하기 시작했다(퍼블릭 리포지토리와 GitHub Enterprise Server 고객은 제외). 하지만 대규모 조직에서는 여전히 GitHub-hosted runner 대비 60~80% 비용 절감이 가능하며, 무엇보다 인프라 커스터마이징의 자유도가 압도적이다.
Self-hosted runner를 도입하면 다음이 가능해진다:
- 내부 네트워크의 프라이빗 레지스트리, 데이터베이스, 시크릿 매니저에 직접 접근
- GPU, ARM, Apple Silicon 등 특수 하드웨어에서의 빌드 및 테스트
- 빌드 캐시를 로컬 스토리지에 유지하여 의존성 설치 시간 90% 이상 단축
- 조직 보안 정책에 맞는 네트워크 격리와 감사 로그 구현
GitHub-Hosted vs Self-Hosted vs ARC 비교
러너 선택은 팀 규모, 보안 요건, 운영 역량에 따라 달라진다. 아래 비교표를 기준으로 판단하라.
| 항목 | GitHub-Hosted | Self-Hosted (VM) | ARC (Kubernetes) |
|---|---|---|---|
| 초기 설정 난이도 | 없음 | 중간 | 높음 |
| 오토스케일링 | 자동 | 직접 구현 필요 | 네이티브 지원 |
| 비용 (월 1000시간 기준) | ~$480 (Linux 2-core) | EC2 비용 + 운영 인건비 | K8s 클러스터 비용 + 운영 |
| 빌드 캐시 | 10GB 제한, Azure Blob | 로컬 디스크 무제한 | PVC 또는 S3 |
| 내부망 접근 | 불가 | 가능 | 가능 |
| 보안 격리 | GitHub 관리 | 직접 하드닝 | Pod 수준 격리 |
| GPU 지원 | 제한적 (대형 러너) | 완전 지원 | NVIDIA Device Plugin |
| 최대 동시 러너 | 플랜에 따라 제한 | 인프라 한도 | 클러스터 노드 한도 |
| 유지보수 부담 | 없음 | 높음 (OS 패치, 버전 관리) | 중간 (Helm 업그레이드) |
| Ephemeral 지원 | 기본 | --ephemeral 플래그 | 기본 |
판단 기준: 월 CI/CD 시간이 500시간 이하이고 내부망 접근이 불필요하면 GitHub-hosted가 합리적이다. 500~2000시간이고 K8s 운영 역량이 없으면 VM 기반 self-hosted를 추천한다. 2000시간 이상이거나 K8s 클러스터가 이미 있다면 ARC가 최선이다.
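이 판단 기준은 간단한 함수로 옮겨 내부 도구나 팀 위키에 넣어둘 수 있다. 아래는 본문 임계값을 그대로 따른 스케치이며, 함수 이름과 시그니처는 설명용 가정이다.

```python
def recommend_runner(monthly_ci_hours: int, needs_internal_network: bool,
                     has_k8s_cluster: bool) -> str:
    """본문의 판단 기준을 그대로 옮긴 선택 로직 (임계값: 500시간, 2000시간)."""
    if monthly_ci_hours <= 500 and not needs_internal_network:
        return "github-hosted"
    if monthly_ci_hours >= 2000 or has_k8s_cluster:
        return "arc"
    return "self-hosted-vm"

print(recommend_runner(300, False, False))   # github-hosted
print(recommend_runner(1200, True, False))   # self-hosted-vm
print(recommend_runner(3000, True, False))   # arc
```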
ARC(Actions Runner Controller) 아키텍처
Actions Runner Controller는 GitHub이 공식으로 관리하는 Kubernetes 오퍼레이터다. 커뮤니티 프로젝트로 시작했으나 2023년부터 GitHub이 직접 개발하며 Runner Scale Sets라는 새로운 아키텍처로 진화했다. 레거시 모드의 webhook 기반 오토스케일링과 달리, Runner Scale Sets는 GitHub API와 직접 통신하며 잡 큐를 실시간으로 감지한다.
ARC 동작 원리
┌─────────────────┐ ┌──────────────────────┐
│ GitHub.com │ │ Kubernetes Cluster │
│ │ │ │
│ Job Queue │◄───►│ ARC Controller │
│ (workflow_job) │ │ │ │
│ │ │ ▼ │
│ Scale Set API │◄───►│ ScaleSet Listener │
│ │ │ │ │
└─────────────────┘ │ ▼ │
│ EphemeralRunnerSet │
│ │ │
│ ├─► Runner Pod 1 │
│ ├─► Runner Pod 2 │
│ └─► Runner Pod N │
└──────────────────────┘
- ScaleSet Listener가 GitHub의 잡 큐를 Long Polling 방식으로 감시한다
- Job Available 메시지를 수신하면 현재 러너 수와 maxRunners 설정을 비교한다
- 스케일업이 가능하면 메시지를 ACK하고 Kubernetes API를 통해 EphemeralRunnerSet의 replica 수를 패치한다
- 새 Runner Pod가 생성되고, JIT(Just-In-Time) 토큰으로 GitHub에 등록된다
- 잡 실행이 완료되면 Pod는 즉시 삭제된다 (ephemeral 기본 동작)
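위 흐름 중 Listener의 목표 러너 수 계산을 단순화해 스케치하면 다음과 같다. 실제 ARC 구현이 아닌 개념 모델이며, 함수와 변수 이름은 설명용 가정이다.

```python
def desired_replicas(current: int, queued_jobs: int,
                     min_runners: int, max_runners: int) -> int:
    """큐에 쌓인 잡 수를 기준으로 목표 러너 수를 계산하는 개념 모델.

    실제 Listener는 Job Available 메시지를 ACK한 뒤
    EphemeralRunnerSet의 replica 수를 이 목표값으로 패치한다.
    """
    target = max(min_runners, current + queued_jobs)
    return min(target, max_runners)  # maxRunners를 넘지 않도록 상한 적용

# minRunners=2, maxRunners=30 설정에서
print(desired_replicas(current=2, queued_jobs=5, min_runners=2, max_runners=30))   # 7
print(desired_replicas(current=28, queued_jobs=5, min_runners=2, max_runners=30))  # 30
```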
ARC 설치 및 구성
사전 요건
- Kubernetes 1.27 이상
- Helm 3.x
- GitHub App 또는 Personal Access Token (org 수준 admin:org, repo 수준 repo 스코프)
- cert-manager (TLS 인증서 자동 관리용, 선택)
Step 1: Controller 설치
# ARC 컨트롤러용 네임스페이스 생성
kubectl create namespace arc-systems
# Helm으로 컨트롤러 설치 (OCI 레지스트리에서 차트 직접 설치)
helm install arc \
--namespace arc-systems \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.10.1
Step 2: GitHub App 인증 설정
Personal Access Token보다 GitHub App 인증을 강력하게 추천한다. PAT는 사용자에 귀속되어 퇴사 시 문제가 발생하며, 권한 범위가 넓다. GitHub App은 조직 수준에서 관리되고 필요한 최소 권한만 부여할 수 있다.
# GitHub App 시크릿 생성
kubectl create secret generic github-app-secret \
--namespace arc-runners \
--from-literal=github_app_id=12345 \
--from-literal=github_app_installation_id=67890 \
--from-file=github_app_private_key=./private-key.pem
Step 3: Runner Scale Set 배포
# values.yaml - Runner Scale Set 설정
githubConfigUrl: 'https://github.com/my-org'
githubConfigSecret: github-app-secret
# 오토스케일링 설정
minRunners: 2 # 최소 대기 러너 (콜드 스타트 방지)
maxRunners: 30 # 최대 러너 수 (클러스터 리소스 고려)
# 러너 그룹 지정 (Enterprise/Org 수준)
runnerGroup: 'production-runners'
# 컨테이너 모드 설정
containerMode:
type: 'kubernetes'
kubernetesModeWorkVolumeClaim:
accessModes: ['ReadWriteOnce']
storageClassName: 'gp3'
resources:
requests:
storage: 50Gi
# Pod 템플릿 커스터마이징
template:
spec:
containers:
- name: runner
image: ghcr.io/actions/actions-runner:latest
resources:
requests:
cpu: '2'
memory: '4Gi'
limits:
cpu: '4'
memory: '8Gi'
env:
- name: RUNNER_GRACEFUL_STOP_TIMEOUT
value: '60'
# 노드 선택 (빌드 전용 노드풀)
nodeSelector:
workload-type: ci-runner
tolerations:
- key: 'ci-runner'
operator: 'Equal'
value: 'true'
effect: 'NoSchedule'
# Runner Scale Set 배포
helm install arc-runner-set \
--namespace arc-runners \
--create-namespace \
-f values.yaml \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.10.1
커스텀 Runner 이미지 빌드
기본 Runner 이미지에는 빌드 도구가 포함되어 있지 않다. 조직에서 사용하는 도구를 미리 포함한 커스텀 이미지를 빌드하면 워크플로우 실행 시간을 크게 줄일 수 있다.
Dockerfile 작성 원칙
- 베이스 이미지는 가능한 slim 변형을 사용한다
- Runner 바이너리 버전은 v2.329.0 이상을 사용한다 (2026년 3월 16일부터 이전 버전은 등록이 차단된다)
- 불필요한 패키지 설치를 피하고, 멀티스테이지 빌드로 이미지 크기를 최소화한다
- root가 아닌 전용 사용자로 Runner 프로세스를 실행한다
# Dockerfile.runner - 커스텀 GitHub Actions Runner 이미지
FROM ubuntu:22.04 AS base
# 시스템 패키지 설치 (최소한으로 유지)
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
git \
jq \
unzip \
zip \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Runner 전용 사용자 생성
RUN useradd -m -d /home/runner -s /bin/bash runner
# Runner 바이너리 설치
ARG RUNNER_VERSION=2.329.0
RUN curl -fsSL -o runner.tar.gz \
"https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \
&& mkdir -p /home/runner/actions-runner \
&& tar xzf runner.tar.gz -C /home/runner/actions-runner \
&& rm runner.tar.gz \
&& /home/runner/actions-runner/bin/installdependencies.sh
# Node.js 22 LTS 설치
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*
# Docker CLI 설치 (DinD가 아닌 CLI만)
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker.gpg \
&& echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" \
> /etc/apt/sources.list.d/docker.list \
&& apt-get update && apt-get install -y docker-ce-cli \
&& rm -rf /var/lib/apt/lists/*
# 컨테이너 실행 환경 설정 (시그널 직접 처리, 진단 로그를 stdout으로 출력)
ENV RUNNER_MANUALLY_TRAP_SIG=1
ENV ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1
# 권한 설정 및 사용자 전환
RUN chown -R runner:runner /home/runner
USER runner
WORKDIR /home/runner/actions-runner
ENTRYPOINT ["./run.sh"]
# 이미지 빌드 및 푸시
docker build -t ghcr.io/my-org/actions-runner:v2.329.0-custom -f Dockerfile.runner .
docker push ghcr.io/my-org/actions-runner:v2.329.0-custom
주의: Runner 버전을 이미지 빌드로 고정했으므로, 새로운 Runner 버전이 릴리스되면 이미지를 다시 빌드하고 ARC의 Runner Scale Set 이미지 태그를 업데이트해야 한다. GitHub은 최소 버전 요건을 주기적으로 상향하므로 릴리스 노트를 모니터링하라.
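릴리스 모니터링은 GitHub Releases API를 조회하는 작은 스크립트로 자동화할 수 있다. 아래는 버전 비교 로직의 스케치이며, 함수 이름과 고정 버전 값은 설명용 가정이다.

```python
import json
import urllib.request

def parse_version(tag: str) -> tuple:
    """'v2.329.0' 형태의 태그를 비교 가능한 정수 튜플로 변환한다."""
    return tuple(int(p) for p in tag.lstrip("v").split("."))

def needs_rebuild(pinned: str, latest: str) -> bool:
    """이미지에 고정한 버전이 최신 릴리스보다 낮으면 True."""
    return parse_version(latest) > parse_version(pinned)

def latest_runner_version() -> str:
    """actions/runner 리포지토리의 최신 릴리스 태그를 조회한다."""
    url = "https://api.github.com/repos/actions/runner/releases/latest"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["tag_name"]

# 사용 예 (네트워크 필요, CI에서 주기 실행):
#   if needs_rebuild("v2.329.0", latest_runner_version()):
#       print("이미지 재빌드 및 Scale Set 이미지 태그 업데이트 필요")
```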
보안 하드닝
Self-hosted runner는 조직의 인프라에서 외부 코드를 실행한다. 하드닝 없이 운영하면 공급망 공격, 시크릿 유출, 네트워크 침투의 경로가 된다.
Ephemeral Runner 필수화
Persistent runner는 이전 잡의 파일, 환경 변수, 프로세스가 다음 잡에 영향을 미칠 수 있다. 공격자가 악성 워크플로우를 통해 runner에 백도어를 설치하면, 이후 모든 잡이 오염된다. Ephemeral runner는 잡 완료 후 즉시 파기되므로 이 위험을 원천 차단한다.
# 워크플로우에서 ephemeral runner 사용 확인
runs-on: arc-runner-set # ARC는 기본적으로 ephemeral
# VM 기반 self-hosted runner의 경우
# ./config.sh --ephemeral 플래그로 등록
네트워크 격리
Runner Pod는 인터넷과 내부 네트워크 모두에 접근할 수 있으므로, NetworkPolicy로 필요한 트래픽만 허용해야 한다.
# network-policy.yaml - Runner Pod 네트워크 제한
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: runner-network-policy
namespace: arc-runners
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: runner
policyTypes:
- Egress
- Ingress
ingress: [] # 외부에서 Runner로의 인바운드 차단
egress:
# GitHub API 및 Actions 서비스
- to:
- ipBlock:
cidr: 140.82.112.0/20 # github.com
- ipBlock:
cidr: 185.199.108.0/22 # GitHub Pages/CDN
ports:
- protocol: TCP
port: 443
# 내부 컨테이너 레지스트리
- to:
- namespaceSelector:
matchLabels:
name: registry
ports:
- protocol: TCP
port: 5000
# DNS
- to: []
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
RBAC 최소 권한 원칙
Runner Pod의 ServiceAccount에는 최소한의 권한만 부여한다. 특히 Kubernetes API에 대한 접근을 제한해야 한다.
# rbac.yaml - Runner ServiceAccount 최소 권한
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-sa
namespace: arc-runners
automountServiceAccountToken: false # K8s API 토큰 자동 마운트 비활성화
---
# 필요한 경우에만 최소 권한의 Role 바인딩
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: runner-minimal
namespace: arc-runners
rules: [] # 기본적으로 권한 없음
워크플로우 수준 보안
# 안전한 워크플로우 작성 패턴
name: Secure CI Pipeline
on:
pull_request:
branches: [main]
# 최소 권한 토큰
permissions:
contents: read
packages: read
jobs:
build:
runs-on: arc-runner-set
steps:
# SHA 고정으로 Action 사용 (태그가 아님)
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
# 환경 변수에 시크릿 직접 노출 금지
- name: Build
run: |
echo "Building..."
env:
# 시크릿은 필요한 스텝에서만 주입
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
Harden-Runner를 활용한 런타임 보안
StepSecurity의 Harden-Runner는 GitHub Actions 전용 EDR(Endpoint Detection and Response)로, 네트워크 이그레스 모니터링, 파일 무결성 검사, 프로세스 활동 추적을 제공한다.
jobs:
build:
runs-on: arc-runner-set
steps:
- uses: step-security/harden-runner@0634a2670c59f64b4a01f0f96f84700a4088b9f0 # v2.12.0
with:
egress-policy: audit # 먼저 audit으로 트래픽 패턴 파악
# 패턴 파악 후 block으로 전환
# egress-policy: block
# allowed-endpoints: >
# github.com:443
# registry.npmjs.org:443
# ghcr.io:443
퍼블릭 리포지토리 주의사항
퍼블릭 리포지토리에서는 절대로 self-hosted runner를 사용하지 마라. 외부 공격자가 Fork PR을 통해 임의의 코드를 runner에서 실행할 수 있다. pull_request_target 이벤트와 self-hosted runner의 조합은 특히 위험하다. 반드시 프라이빗 리포지토리 또는 조직 내부 리포지토리에서만 사용하라.
캐시 전략
Ephemeral runner의 가장 큰 단점은 잡마다 캐시가 사라진다는 것이다. 효율적인 캐시 전략 없이는 매 빌드마다 의존성 다운로드부터 시작해야 한다.
전략 1: PersistentVolumeClaim (PVC) 기반 캐시
# ARC Runner Scale Set values.yaml에 PVC 추가
template:
spec:
containers:
- name: runner
image: ghcr.io/my-org/actions-runner:latest
volumeMounts:
- name: cache-volume
mountPath: /opt/cache
env:
- name: RUNNER_TOOL_CACHE
value: /opt/cache/tool-cache
- name: npm_config_cache
value: /opt/cache/npm
- name: GOPATH
value: /opt/cache/go
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: runner-cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: runner-cache-pvc
namespace: arc-runners
spec:
accessModes:
- ReadWriteMany # 여러 Runner Pod가 동시 접근
storageClassName: efs # AWS EFS 또는 NFS
resources:
requests:
storage: 100Gi
주의: ReadWriteMany 모드는 EFS, NFS, GlusterFS 등 네트워크 파일 시스템이 필요하다. EBS 같은 블록 스토리지는 ReadWriteOnce만 지원하므로 한 번에 하나의 Pod만 접근 가능하다.
전략 2: S3 호환 캐시 서버
GitHub의 기본 캐시(actions/cache)는 Azure Blob Storage를 사용한다. AWS 환경이라면 리전 간 지연이 발생한다. S3 호환 캐시 서버를 self-hosted로 운영하면 네트워크 지연을 최소화할 수 있다.
# MinIO 기반 캐시 서버 배포 (같은 VPC/리전 내)
apiVersion: apps/v1
kind: Deployment
metadata:
name: actions-cache-server
namespace: arc-systems
spec:
replicas: 1
selector:
matchLabels:
app: actions-cache
template:
spec:
containers:
- name: minio
image: minio/minio:latest
args: ['server', '/data', '--console-address', ':9001']
env:
- name: MINIO_ROOT_USER
valueFrom:
secretKeyRef:
name: minio-credentials
key: user
- name: MINIO_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: minio-credentials
key: password
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: minio-data-pvc
전략 3: Docker Layer 캐시 (BuildKit)
컨테이너 이미지 빌드가 CI의 주요 워크로드라면, BuildKit의 캐시 백엔드를 레지스트리로 설정해 레이어 캐시를 공유하라.
# 워크플로우에서 BuildKit 레지스트리 캐시 활용
- name: Build and Push
uses: docker/build-push-action@48aba3b46d1b1fec4febb7c5d0c644b249a11355 # v6
with:
push: true
tags: ghcr.io/my-org/my-app:${{ github.sha }}
cache-from: type=registry,ref=ghcr.io/my-org/my-app:buildcache
cache-to: type=registry,ref=ghcr.io/my-org/my-app:buildcache,mode=max
모니터링과 관찰 가능성
Self-hosted runner를 운영하면서 가장 자주 듣는 질문은 "러너가 왜 안 뜨나요?"이다. 모니터링 없이는 답을 줄 수 없다.
Prometheus + Grafana 메트릭
ARC는 Prometheus 메트릭을 기본 노출한다. 핵심 메트릭은 다음과 같다:
- gha_runner_scale_set_desired_replicas: 현재 요청된 러너 수
- gha_runner_scale_set_running_replicas: 실행 중인 러너 수
- gha_runner_scale_set_registered_replicas: GitHub에 등록 완료된 러너 수
- gha_runner_scale_set_idle_replicas: 유휴 러너 수
# Prometheus ServiceMonitor 설정
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: arc-controller-monitor
namespace: arc-systems
spec:
selector:
matchLabels:
app.kubernetes.io/name: gha-runner-scale-set-controller
endpoints:
- port: metrics
interval: 30s
path: /metrics
핵심 알림 규칙
# Alertmanager 규칙
groups:
- name: arc-runner-alerts
rules:
# 러너 풀 고갈 경고
- alert: RunnerPoolExhausted
expr: |
gha_runner_scale_set_desired_replicas
>= gha_runner_scale_set_max_replicas * 0.9
for: 5m
labels:
severity: warning
annotations:
summary: 'Runner pool이 90% 이상 사용 중'
description: 'maxRunners 증설 또는 워크플로우 최적화 필요'
# 러너 등록 실패 감지
- alert: RunnerRegistrationFailed
expr: |
rate(gha_runner_scale_set_registration_failures_total[5m]) > 0
for: 2m
labels:
severity: critical
annotations:
summary: 'Runner 등록 실패 발생'
description: 'GitHub App 인증 또는 네트워크 확인 필요'
# Pending Pod 장시간 대기
- alert: RunnerPodPending
expr: |
kube_pod_status_phase{namespace="arc-runners", phase="Pending"} > 0
for: 10m
labels:
severity: warning
annotations:
summary: 'Runner Pod가 10분 이상 Pending 상태'
description: '노드 리소스 부족 또는 PVC 바인딩 실패 가능성'
장애 케이스와 복구 절차
장애 1: ScaleSet Listener CrashLoopBackOff
증상: Listener Pod가 반복 재시작되며 러너가 전혀 스케일업되지 않는다.
원인 분석 순서:
# 1. Listener Pod 로그 확인
kubectl logs -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener --tail=100
# 2. 흔한 원인: GitHub App 인증 만료
# - Private key 파일 확인
# - App installation 상태 확인 (org settings > GitHub Apps)
# 3. 네트워크 문제: GitHub API 접근 불가
kubectl exec -n arc-systems deploy/arc-gha-runner-scale-set-controller -- \
curl -s https://api.github.com/meta | jq '.actions[]'
복구: GitHub App의 private key를 갱신하고 시크릿을 업데이트한다.
kubectl create secret generic github-app-secret \
--namespace arc-runners \
--from-literal=github_app_id=12345 \
--from-literal=github_app_installation_id=67890 \
--from-file=github_app_private_key=./new-private-key.pem \
--dry-run=client -o yaml | kubectl apply -f -
# Controller 재시작
kubectl rollout restart deployment -n arc-systems arc-gha-runner-scale-set-controller
장애 2: Runner Pod가 Pending 상태에서 멈춤
증상: 잡은 큐에 쌓이지만 Runner Pod가 생성되지 않거나 Pending 상태로 남는다.
# Pod 이벤트 확인
kubectl describe pod -n arc-runners -l actions.github.com/scale-set-name=arc-runner-set
# 흔한 원인별 대응
# 1. 노드 리소스 부족
kubectl top nodes
# -> Cluster Autoscaler가 동작하는지 확인, 또는 maxRunners 하향
# 2. PVC 바인딩 대기
kubectl get pvc -n arc-runners
# -> StorageClass 설정, 가용영역 불일치 확인
# 3. 이미지 Pull 실패
kubectl get events -n arc-runners --sort-by='.lastTimestamp' | grep -i pull
# -> 이미지 태그, 레지스트리 인증 확인
장애 3: 잡이 Runner에 할당되지 않음
증상: GitHub UI에서 잡이 "Queued" 상태로 무기한 대기한다.
# Runner 등록 상태 확인
kubectl get ephemeralrunner -n arc-runners
# Runner labels 확인 (runs-on과 일치해야 함)
kubectl get autoscalingrunnersets -n arc-runners -o yaml | grep -A5 labels
# GitHub에서 Runner 그룹 설정 확인
# Settings > Actions > Runner groups > 해당 그룹에 리포지토리가 포함되어 있는지 확인
복구: 워크플로우의 runs-on 라벨이 ARC Runner Scale Set의 이름과 정확히 일치하는지 확인하라. Runner 그룹이 설정된 경우, 해당 리포지토리가 그룹에 포함되어 있는지도 검증하라.
장애 4: Runner 버전 호환성 문제
2026년 3월 16일부터 v2.329.0 미만 Runner의 등록이 차단된다. 커스텀 이미지를 사용 중이라면 반드시 Runner 버전을 확인하라.
# 현재 Runner 버전 확인
kubectl exec -n arc-runners -it <runner-pod> -- ./config.sh --version
# 이미지 업데이트 (values.yaml 수정 후)
helm upgrade arc-runner-set \
--namespace arc-runners \
-f values.yaml \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
대규모 운영 최적화
러너 그룹 분리 전략
워크로드 특성에 따라 Runner Scale Set을 분리 운영하라. 단일 Scale Set으로 모든 워크로드를 처리하면 리소스 경합과 노이지 네이버 문제가 발생한다.
# 용도별 Runner Scale Set 분리
# 1. 일반 CI (가벼운 테스트, 린트)
# values-ci-light.yaml
minRunners: 2
maxRunners: 20
template:
spec:
containers:
- name: runner
resources:
requests:
cpu: "1"
memory: "2Gi"
# 2. 빌드 전용 (컴파일, Docker 빌드)
# values-ci-build.yaml
minRunners: 1
maxRunners: 10
template:
spec:
containers:
- name: runner
resources:
requests:
cpu: "4"
memory: "8Gi"
# 3. GPU 워크로드 (ML 모델 테스트)
# values-gpu.yaml
minRunners: 0
maxRunners: 4
template:
spec:
containers:
- name: runner
resources:
limits:
nvidia.com/gpu: 1
nodeSelector:
accelerator: nvidia-a10g
Graceful Shutdown 처리
러너가 잡을 실행 중일 때 노드 드레인이나 스케일다운이 발생하면 잡이 실패한다. RUNNER_GRACEFUL_STOP_TIMEOUT을 설정하여 진행 중인 잡이 완료될 때까지 기다리도록 하라.
template:
spec:
terminationGracePeriodSeconds: 3600 # 최대 1시간 대기
containers:
- name: runner
env:
- name: RUNNER_GRACEFUL_STOP_TIMEOUT
value: '3500' # terminationGracePeriodSeconds보다 약간 짧게
노드 오토스케일러와의 연동
ARC가 러너 Pod를 생성해도 노드가 부족하면 Pod는 Pending 상태에 머문다. Cluster Autoscaler 또는 Karpenter를 함께 구성하라.
# Karpenter NodePool 예시 (CI Runner 전용)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: ci-runners
spec:
template:
metadata:
labels:
workload-type: ci-runner
spec:
taints:
- key: ci-runner
value: 'true'
effect: NoSchedule
requirements:
- key: kubernetes.io/arch
operator: In
values: ['amd64']
- key: karpenter.sh/capacity-type
operator: In
values: ['on-demand', 'spot']
- key: node.kubernetes.io/instance-type
operator: In
values: ['m7i.xlarge', 'm7i.2xlarge', 'm6i.xlarge', 'm6i.2xlarge']
limits:
cpu: 200
memory: 400Gi
disruption:
consolidationPolicy: WhenEmpty
consolidateAfter: 60s
Spot 인스턴스를 활용하면 비용을 추가로 50~70% 절감할 수 있다. 단, Spot 인터럽션 시 잡이 실패할 수 있으므로, 중요도가 낮은 CI 워크로드에만 적용하라. 프로덕션 배포 파이프라인에는 On-Demand 인스턴스를 사용하라.
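Spot 적용 비율을 정할 때는 인터럽션으로 인한 재실행 비용까지 포함해 가늠하는 것이 좋다. 아래는 단순 혼합 비용 모델의 스케치이며, 시간당 단가·할인율·재실행 비율은 모두 설명용 가정값이다.

```python
def blended_cost(on_demand_hourly: float, ci_hours: float,
                 spot_ratio: float, spot_discount: float,
                 interruption_rerun_ratio: float) -> float:
    """Spot/On-Demand 혼합 시 월 예상 비용 (단순 모델).

    interruption_rerun_ratio: Spot 인터럽션으로 다시 실행되는 시간 비율 (가정값).
    """
    spot_hours = ci_hours * spot_ratio * (1 + interruption_rerun_ratio)
    od_hours = ci_hours * (1 - spot_ratio)
    return (spot_hours * on_demand_hourly * (1 - spot_discount)
            + od_hours * on_demand_hourly)

# 시간당 $0.2 가정, 월 2000시간, 70%를 Spot(60% 할인, 재실행 5%)으로 돌린 경우
full_od = 2000 * 0.2
mixed = blended_cost(0.2, 2000, 0.7, 0.6, 0.05)
print(f"On-Demand only: ${full_od:.0f}, mixed: ${mixed:.0f}")
```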
운영 체크리스트
Self-hosted runner 도입 전후로 아래 체크리스트를 점검하라.
초기 구축 체크리스트
- GitHub App 인증 구성 완료 (PAT 대신 GitHub App 사용)
- Runner 이미지에 필요한 도구 사전 설치 완료
- Runner 버전 v2.329.0 이상 확인
- Ephemeral 모드 활성화 확인
- NetworkPolicy 적용 (최소한의 이그레스만 허용)
- ServiceAccount에 automountServiceAccountToken: false 설정
- Runner Pod에 리소스 requests/limits 설정
- 노드 선택기(nodeSelector) 또는 Taint/Toleration으로 빌드 노드 분리
- 캐시 전략 결정 및 구현 (PVC, S3, 레지스트리 캐시)
보안 하드닝 체크리스트
- 퍼블릭 리포지토리에서 self-hosted runner 사용 차단
- Runner 그룹으로 리포지토리 접근 범위 제한
- 워크플로우에서 permissions 최소 권한 선언
- Action 참조 시 commit SHA 고정 (태그 대신)
- Docker 소켓 마운트 금지 (컨테이너 모드 사용)
- 시크릿 스캐닝 및 누출 방지 도구 적용
- Runner 호스트 OS 하드닝 (불필요한 서비스 제거, 방화벽 설정)
- OIDC를 활용한 단기 토큰 기반 클라우드 인증
운영 모니터링 체크리스트
- Prometheus 메트릭 수집 및 Grafana 대시보드 구성
- Runner Pool 고갈 알림 설정 (maxRunners의 90% 임계)
- Runner 등록 실패 알림 설정
- Pod Pending 장시간 대기 알림 설정
- Runner 버전 업데이트 알림 (GitHub Changelog 구독)
- 월간 보안 감사 일정 수립 (네트워크 정책 리뷰, 시크릿 로테이션)
마무리
Self-hosted runner 운영은 단순히 VM이나 Pod를 띄우는 것이 아니다. 보안, 스케일링, 캐시, 모니터링, 장애 대응까지 포괄하는 플랫폼 엔지니어링 영역이다. ARC와 Runner Scale Sets의 등장으로 Kubernetes 위에서의 운영이 크게 안정화되었지만, 결국 조직의 워크로드에 맞는 튜닝과 지속적인 모니터링이 필요하다.
핵심을 다시 정리하면:
- Ephemeral은 선택이 아닌 필수다. 보안과 재현성을 동시에 보장한다.
- ARC Runner Scale Sets가 현재 최선의 오토스케일링 방식이다. 레거시 webhook 기반 모드는 사용하지 마라.
- 보안 하드닝을 빌드가 아닌 Day 0에 적용하라. NetworkPolicy, RBAC, SHA 고정, 퍼블릭 리포 차단은 기본이다.
- 캐시 전략 없는 ephemeral runner는 느린 runner일 뿐이다. PVC, S3, 레지스트리 캐시를 반드시 구성하라.
- 모니터링과 알림은 운영의 생명줄이다. Runner Pool 고갈과 등록 실패를 즉시 감지할 수 있어야 한다.
References
- GitHub Docs - Actions Runner Controller - ARC 공식 문서 및 아키텍처 설명
- GitHub Docs - Deploying Runner Scale Sets with ARC - Runner Scale Set 배포 튜토리얼
- GitHub Docs - Self-Hosted Runners - Self-hosted runner 공식 가이드
- GitHub Docs - Secure Use Reference - GitHub Actions 보안 참조 문서
- GitHub Actions Runner Controller Repository - ARC 소스 코드 및 Helm chart values 참조
- AWS Blog - Best Practices for Self-Hosted Runners at Scale - AWS 환경에서의 대규모 runner 운영
- StepSecurity Harden-Runner - GitHub Actions 런타임 보안 모니터링 도구
- GitHub Blog - Self-Hosted Runner Minimum Version Enforcement - 2026년 runner 최소 버전 요건 변경
GitHub Actions Self-Hosted Runner: Large-Scale Operations and Security Hardening Guide
- Why Self-Hosted Runners
- GitHub-Hosted vs Self-Hosted vs ARC Comparison
- ARC (Actions Runner Controller) Architecture
- ARC Installation and Configuration
- Custom Runner Image Build
- Security Hardening
- Cache Strategies
- Monitoring and Observability
- Failure Cases and Recovery Procedures
- Large-Scale Operations Optimization
- Operations Checklist
- Conclusion
- References
- Quiz

Why Self-Hosted Runners
GitHub-hosted runners are quick to get started with, but they hit limitations as organizations scale. When build times exceed 30 minutes, when GPU access is needed, when you need to reach internal network resources, or when costs start exceeding thousands of dollars per month, it is time to consider self-hosted runners.
Starting March 2026, GitHub began charging a control plane fee of $0.002 per minute for self-hosted runners (public repositories and GitHub Enterprise Server customers are excluded). However, for large organizations, 60-80% cost savings compared to GitHub-hosted runners are still achievable, and above all, the freedom of infrastructure customization is overwhelming.
Adopting self-hosted runners enables the following:
- Direct access to private registries, databases, and secret managers on your internal network
- Builds and tests on specialized hardware such as GPU, ARM, and Apple Silicon
- Maintaining build caches on local storage, reducing dependency installation time by over 90%
- Network isolation and audit logging that conforms to organizational security policies
GitHub-Hosted vs Self-Hosted vs ARC Comparison
Runner selection depends on team size, security requirements, and operational capabilities. Use the comparison table below as your decision criteria.
| Item | GitHub-Hosted | Self-Hosted (VM) | ARC (Kubernetes) |
|---|---|---|---|
| Initial setup difficulty | None | Medium | High |
| Autoscaling | Automatic | Must implement yourself | Native support |
| Cost (per 1000 hours/month) | ~$480 (Linux 2-core) | EC2 cost + ops personnel | K8s cluster cost + ops |
| Build cache | 10GB limit, Azure Blob | Local disk unlimited | PVC or S3 |
| Internal network access | Not possible | Possible | Possible |
| Security isolation | Managed by GitHub | Manual hardening | Pod-level isolation |
| GPU support | Limited (larger runners) | Full support | NVIDIA Device Plugin |
| Max concurrent runners | Plan-dependent limits | Infrastructure limits | Cluster node limits |
| Maintenance burden | None | High (OS patches, version mgmt) | Medium (Helm upgrades) |
| Ephemeral support | Default | --ephemeral flag | Default |
Decision criteria: If your monthly CI/CD time is under 500 hours and internal network access is not needed, GitHub-hosted is practical. If 500-2000 hours and you lack K8s operational capability, VM-based self-hosted is recommended. If over 2000 hours or you already have a K8s cluster, ARC is the best option.
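These decision criteria can be captured in a small helper for a team wiki or internal tool. The sketch below follows the thresholds stated above; the function name and signature are illustrative assumptions.

```python
def recommend_runner(monthly_ci_hours: int, needs_internal_network: bool,
                     has_k8s_cluster: bool) -> str:
    """Selection logic mirroring the decision criteria (500h and 2000h thresholds)."""
    if monthly_ci_hours <= 500 and not needs_internal_network:
        return "github-hosted"
    if monthly_ci_hours >= 2000 or has_k8s_cluster:
        return "arc"
    return "self-hosted-vm"

print(recommend_runner(300, False, False))   # github-hosted
print(recommend_runner(1200, True, False))   # self-hosted-vm
print(recommend_runner(3000, True, False))   # arc
```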
ARC (Actions Runner Controller) Architecture
Actions Runner Controller is an officially GitHub-maintained Kubernetes operator. It started as a community project but has been developed directly by GitHub since 2023, evolving into the new Runner Scale Sets architecture. Unlike the legacy mode's webhook-based autoscaling, Runner Scale Sets communicates directly with the GitHub API and detects job queues in real time.
How ARC Works
┌─────────────────┐ ┌──────────────────────┐
│ GitHub.com │ │ Kubernetes Cluster │
│ │ │ │
│ Job Queue │◄───►│ ARC Controller │
│ (workflow_job) │ │ │ │
│ │ │ ▼ │
│ Scale Set API │◄───►│ ScaleSet Listener │
│ │ │ │ │
└─────────────────┘ │ ▼ │
│ EphemeralRunnerSet │
│ │ │
│ ├─► Runner Pod 1 │
│ ├─► Runner Pod 2 │
│ └─► Runner Pod N │
└──────────────────────┘
- The ScaleSet Listener monitors GitHub's job queue via Long Polling
- Upon receiving a Job Available message, it compares the current runner count against the maxRunners setting
- If scale-up is possible, it ACKs the message and patches the EphemeralRunnerSet replica count via the Kubernetes API
- A new Runner Pod is created and registered with GitHub using a JIT (Just-In-Time) token
- Once job execution completes, the Pod is immediately deleted (default ephemeral behavior)
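The Listener's target-runner calculation in the flow above can be sketched in simplified form. This is a conceptual model, not the actual ARC implementation; function and variable names are illustrative assumptions.

```python
def desired_replicas(current: int, queued_jobs: int,
                     min_runners: int, max_runners: int) -> int:
    """Conceptual model of computing the target runner count from queued jobs.

    The actual Listener ACKs Job Available messages and then patches
    the EphemeralRunnerSet replica count to this target.
    """
    target = max(min_runners, current + queued_jobs)
    return min(target, max_runners)  # cap at maxRunners

# With minRunners=2, maxRunners=30:
print(desired_replicas(current=2, queued_jobs=5, min_runners=2, max_runners=30))   # 7
print(desired_replicas(current=28, queued_jobs=5, min_runners=2, max_runners=30))  # 30
```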
ARC Installation and Configuration
Prerequisites
- Kubernetes 1.27 or higher
- Helm 3.x
- GitHub App or Personal Access Token (org-level admin:org, repo-level repo scope)
- cert-manager (optional, for automated TLS certificate management)
Step 1: Controller Installation
# Create namespace for ARC controller
kubectl create namespace arc-systems
# Install via Helm
helm install arc \
--namespace arc-systems \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
--version 0.10.1
Step 2: GitHub App Authentication Setup
GitHub App authentication is strongly recommended over Personal Access Tokens. PATs are tied to individual users, causing issues when employees leave, and their permission scope is broad. GitHub Apps are managed at the organization level and can be granted minimum necessary permissions.
# Create GitHub App secret
kubectl create secret generic github-app-secret \
--namespace arc-runners \
--from-literal=github_app_id=12345 \
--from-literal=github_app_installation_id=67890 \
--from-file=github_app_private_key=./private-key.pem
Step 3: Runner Scale Set Deployment
# values.yaml - Runner Scale Set configuration
githubConfigUrl: 'https://github.com/my-org'
githubConfigSecret: github-app-secret
# Autoscaling settings
minRunners: 2 # Minimum standby runners (prevents cold starts)
maxRunners: 30 # Maximum runners (consider cluster resources)
# Runner group assignment (Enterprise/Org level)
runnerGroup: 'production-runners'
# Container mode settings
containerMode:
type: 'kubernetes'
kubernetesModeWorkVolumeClaim:
accessModes: ['ReadWriteOnce']
storageClassName: 'gp3'
resources:
requests:
storage: 50Gi
# Pod template customization
template:
spec:
containers:
- name: runner
image: ghcr.io/actions/actions-runner:latest
resources:
requests:
cpu: '2'
memory: '4Gi'
limits:
cpu: '4'
memory: '8Gi'
env:
- name: RUNNER_GRACEFUL_STOP_TIMEOUT
value: '60'
# Node selection (dedicated build node pool)
nodeSelector:
workload-type: ci-runner
tolerations:
- key: 'ci-runner'
operator: 'Equal'
value: 'true'
effect: 'NoSchedule'
# Deploy Runner Scale Set
helm install arc-runner-set \
--namespace arc-runners \
--create-namespace \
-f values.yaml \
oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
--version 0.10.1
Custom Runner Image Build
The default Runner image does not include build tools. Building a custom image with your organization's pre-installed tools can significantly reduce workflow execution time.
Dockerfile Writing Principles
- Use slim variants for base images when possible
- Use Runner binary version v2.329.0 or higher (versions below this will be blocked from registration starting March 16, 2026)
- Avoid installing unnecessary packages and minimize image size with multi-stage builds
- Run the Runner process as a dedicated user, not root
# Dockerfile.runner - Custom GitHub Actions Runner image
FROM ubuntu:22.04 AS base
# Install system packages (keep minimal)
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
ca-certificates \
git \
jq \
unzip \
zip \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Create dedicated runner user
RUN useradd -m -d /home/runner -s /bin/bash runner
# Install Runner binary
ARG RUNNER_VERSION=2.329.0
RUN curl -fsSL -o runner.tar.gz \
"https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \
&& mkdir -p /home/runner/actions-runner \
&& tar xzf runner.tar.gz -C /home/runner/actions-runner \
&& rm runner.tar.gz \
&& /home/runner/actions-runner/bin/installdependencies.sh
# Install Node.js 22 LTS
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
&& apt-get install -y nodejs \
&& rm -rf /var/lib/apt/lists/*
# Install Docker CLI (CLI only, not DinD)
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker.gpg \
&& echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" \
> /etc/apt/sources.list.d/docker.list \
&& apt-get update && apt-get install -y docker-ce-cli \
&& rm -rf /var/lib/apt/lists/*
# Container runtime behavior (trap signals manually, print diagnostic logs to stdout)
ENV RUNNER_MANUALLY_TRAP_SIG=1
ENV ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1
# Set permissions and switch user
RUN chown -R runner:runner /home/runner
USER runner
WORKDIR /home/runner/actions-runner
ENTRYPOINT ["./run.sh"]
# Build and push image
docker build -t ghcr.io/my-org/actions-runner:v2.329.0-custom -f Dockerfile.runner .
docker push ghcr.io/my-org/actions-runner:v2.329.0-custom
Note: Since the Runner version is pinned at image build time, you must rebuild the image and update the ARC Runner Scale Set image tag when new Runner versions are released. GitHub periodically raises minimum version requirements, so monitor release notes.
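Release monitoring can be automated with a small script that queries the GitHub Releases API. Below is a sketch of the version-comparison logic; function names and the pinned version value are illustrative assumptions.

```python
import json
import urllib.request

def parse_version(tag: str) -> tuple:
    """Convert a tag like 'v2.329.0' into a comparable integer tuple."""
    return tuple(int(p) for p in tag.lstrip("v").split("."))

def needs_rebuild(pinned: str, latest: str) -> bool:
    """True if the version pinned in the image is older than the latest release."""
    return parse_version(latest) > parse_version(pinned)

def latest_runner_version() -> str:
    """Fetch the latest release tag of the actions/runner repository."""
    url = "https://api.github.com/repos/actions/runner/releases/latest"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["tag_name"]

# Usage (requires network, run periodically in CI):
#   if needs_rebuild("v2.329.0", latest_runner_version()):
#       print("Rebuild image and update the Scale Set image tag")
```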
Security Hardening
Self-hosted runners execute external code within your organization's infrastructure. Operating without hardening creates pathways for supply chain attacks, secret leaks, and network infiltration.
Mandate Ephemeral Runners
Persistent runners allow files, environment variables, and processes from previous jobs to affect subsequent jobs. If an attacker installs a backdoor on the runner through a malicious workflow, all subsequent jobs become compromised. Ephemeral runners are destroyed immediately after job completion, eliminating this risk at its source.
# Verify ephemeral runner usage in workflows
runs-on: arc-runner-set # ARC is ephemeral by default
# For VM-based self-hosted runners
# Register with the --ephemeral flag via ./config.sh
Network Isolation
Runner Pods can access both the internet and internal networks, so NetworkPolicy must be used to allow only necessary traffic.
# network-policy.yaml - Runner Pod network restrictions
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: runner-network-policy
namespace: arc-runners
spec:
podSelector:
matchLabels:
app.kubernetes.io/component: runner
policyTypes:
- Egress
- Ingress
ingress: [] # Block inbound traffic from outside to Runner
egress:
# GitHub API and Actions services
- to:
- ipBlock:
cidr: 140.82.112.0/20 # github.com
- ipBlock:
cidr: 185.199.108.0/22 # GitHub Pages/CDN
ports:
- protocol: TCP
port: 443
# Internal container registry
- to:
- namespaceSelector:
matchLabels:
name: registry
ports:
- protocol: TCP
port: 5000
# DNS
- to: []
ports:
- protocol: UDP
port: 53
- protocol: TCP
port: 53
RBAC Least Privilege Principle
Grant only minimum permissions to the Runner Pod's ServiceAccount. Access to the Kubernetes API must be restricted in particular.
# rbac.yaml - Runner ServiceAccount minimum permissions
apiVersion: v1
kind: ServiceAccount
metadata:
name: runner-sa
namespace: arc-runners
automountServiceAccountToken: false # Disable auto-mounting K8s API token
---
# Only bind minimal Role when necessary
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: runner-minimal
namespace: arc-runners
rules: [] # No permissions by default
Workflow-Level Security
# Secure workflow writing patterns
name: Secure CI Pipeline
on:
pull_request:
branches: [main]
# Minimum privilege tokens
permissions:
contents: read
packages: read
jobs:
build:
runs-on: arc-runner-set
steps:
# Use Actions with SHA pinning (not tags)
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
# Avoid directly exposing secrets in environment variables
- name: Build
run: |
echo "Building..."
env:
# Inject secrets only in the steps that need them
REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}
Runtime Security with Harden-Runner
StepSecurity's Harden-Runner is a GitHub Actions-specific EDR (Endpoint Detection and Response) that provides network egress monitoring, file integrity checking, and process activity tracking.
jobs:
build:
runs-on: arc-runner-set
steps:
- uses: step-security/harden-runner@0634a2670c59f64b4a01f0f96f84700a4088b9f0 # v2.12.0
with:
egress-policy: audit # First use audit to understand traffic patterns
# Switch to block after understanding patterns
# egress-policy: block
# allowed-endpoints: >
# github.com:443
# registry.npmjs.org:443
# ghcr.io:443
Public Repository Considerations
Never use self-hosted runners with public repositories. External attackers can execute arbitrary code on the runner through Fork PRs. The combination of pull_request_target events and self-hosted runners is especially dangerous. Only use them with private repositories or internal organization repositories.
Cache Strategies
The biggest drawback of ephemeral runners is that the cache is lost with every job. Without an efficient cache strategy, every build must start from dependency downloads.
Strategy 1: PersistentVolumeClaim (PVC) Based Cache
# Add a PVC to the ARC Runner Scale Set values.yaml
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/my-org/actions-runner:latest
        volumeMounts:
          - name: cache-volume
            mountPath: /opt/cache
        env:
          - name: RUNNER_TOOL_CACHE
            value: /opt/cache/tool-cache
          - name: npm_config_cache
            value: /opt/cache/npm
          - name: GOPATH
            value: /opt/cache/go
    volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: runner-cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-cache-pvc
  namespace: arc-runners
spec:
  accessModes:
    - ReadWriteMany  # Multiple runner Pods mount it simultaneously
  storageClassName: efs  # AWS EFS or NFS
  resources:
    requests:
      storage: 100Gi
Note: ReadWriteMany mode requires a network file system such as EFS, NFS, or GlusterFS. Block storage like EBS only supports ReadWriteOnce, allowing access by only one Pod at a time.
Strategy 2: S3-Compatible Cache Server
GitHub's default cache (actions/cache) is backed by Azure Blob Storage, so runners in AWS pay cross-cloud latency on every cache save and restore. Running a self-hosted S3-compatible cache server in the same VPC and region minimizes that network latency.
# MinIO-based cache server deployment (same VPC/region as the runners)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: actions-cache-server
  namespace: arc-systems
spec:
  replicas: 1
  selector:
    matchLabels:
      app: actions-cache
  template:
    metadata:
      labels:
        app: actions-cache  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args: ['server', '/data', '--console-address', ':9001']
          env:
            - name: MINIO_ROOT_USER
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: user
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: password
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: minio-data-pvc
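A quick smoke test from inside the cluster network, assuming a ClusterIP Service named `actions-cache-server` exposes MinIO on port 9000 (the Service, bucket name, and credentials below are placeholders):

```shell
# Credentials come from the minio-credentials Secret; substitute real values.
export AWS_ACCESS_KEY_ID='<minio user>'
export AWS_SECRET_ACCESS_KEY='<minio pass>'
ENDPOINT=http://actions-cache-server.arc-systems.svc:9000

# Create the cache bucket once.
aws --endpoint-url "$ENDPOINT" s3 mb s3://ci-cache

# Save and restore an archive the way a cache step would.
tar czf deps.tgz node_modules/
aws --endpoint-url "$ENDPOINT" s3 cp deps.tgz "s3://ci-cache/deps-${GITHUB_SHA}.tgz"
aws --endpoint-url "$ENDPOINT" s3 cp "s3://ci-cache/deps-${GITHUB_SHA}.tgz" deps.tgz
```

If these round-trips are fast, workflow cache steps pointed at the same endpoint will be too.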
Strategy 3: Docker Layer Cache (BuildKit)
If container image builds are the primary CI workload, set BuildKit's cache backend to a registry to share layer caches.
# Use BuildKit registry cache in workflows
- name: Build and Push
  uses: docker/build-push-action@48aba3b46d1b1fec4febb7c5d0c644b249a11355 # v6
  with:
    push: true
    tags: ghcr.io/my-org/my-app:${{ github.sha }}
    cache-from: type=registry,ref=ghcr.io/my-org/my-app:buildcache
    cache-to: type=registry,ref=ghcr.io/my-org/my-app:buildcache,mode=max
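The same registry cache can be exercised directly with `docker buildx`, which is handy for debugging cache hits outside the workflow. A sketch with the image names from the step above:

```shell
# Equivalent buildx invocation to the action above (BuildKit required).
# A cold build populates the buildcache ref; repeat builds should hit it.
docker buildx build \
  --push \
  -t "ghcr.io/my-org/my-app:${GITHUB_SHA:-dev}" \
  --cache-from type=registry,ref=ghcr.io/my-org/my-app:buildcache \
  --cache-to type=registry,ref=ghcr.io/my-org/my-app:buildcache,mode=max \
  .
```

`mode=max` exports intermediate layers as well as the final ones, which costs registry storage but maximizes cache hits across runners.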
Monitoring and Observability
The most common question when operating self-hosted runners is "Why isn't the runner starting?" Without monitoring, you cannot provide an answer.
Prometheus + Grafana Metrics
ARC exposes Prometheus metrics by default. The key metrics are:
- gha_runner_scale_set_desired_replicas: currently requested runner count
- gha_runner_scale_set_running_replicas: currently running runner count
- gha_runner_scale_set_registered_replicas: runners successfully registered with GitHub
- gha_runner_scale_set_idle_replicas: idle runner count
# Prometheus ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arc-controller-monitor
  namespace: arc-systems
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gha-runner-scale-set-controller
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Key Alert Rules
# Prometheus alerting rules (routed through Alertmanager)
groups:
  - name: arc-runner-alerts
    rules:
      # Runner pool exhaustion warning
      - alert: RunnerPoolExhausted
        expr: |
          gha_runner_scale_set_desired_replicas
            >= gha_runner_scale_set_max_replicas * 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Runner pool is over 90% utilized'
          description: 'Increase maxRunners or optimize workflows'
      # Runner registration failure detection
      - alert: RunnerRegistrationFailed
        expr: |
          rate(gha_runner_scale_set_registration_failures_total[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Runner registration failure detected'
          description: 'Check GitHub App authentication or network'
      # Prolonged Pod Pending state
      - alert: RunnerPodPending
        expr: |
          kube_pod_status_phase{namespace="arc-runners", phase="Pending"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Runner Pod has been Pending for over 10 minutes'
          description: 'Possible node resource shortage or PVC binding failure'
Failure Cases and Recovery Procedures
Failure 1: ScaleSet Listener CrashLoopBackOff
Symptoms: The Listener Pod repeatedly restarts and runners do not scale up at all.
Root cause analysis order:
# 1. Check Listener Pod logs
kubectl logs -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener --tail=100

# 2. Common cause: expired GitHub App credentials
#    - Check the private key file
#    - Check the App installation status (org settings > GitHub Apps)

# 3. Network issue: cannot reach the GitHub API
kubectl exec -n arc-systems deploy/arc-gha-runner-scale-set-controller -- \
  curl -s https://api.github.com/meta | jq '.actions[]'
Recovery: Renew the GitHub App's private key and update the secret.
kubectl create secret generic github-app-secret \
  --namespace arc-runners \
  --from-literal=github_app_id=12345 \
  --from-literal=github_app_installation_id=67890 \
  --from-file=github_app_private_key=./new-private-key.pem \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart the controller
kubectl rollout restart deployment -n arc-systems arc-gha-runner-scale-set-controller
Failure 2: Runner Pod Stuck in Pending State
Symptoms: Jobs queue up but Runner Pods are not created or remain in Pending state.
# Check Pod events
kubectl describe pod -n arc-runners -l actions.github.com/scale-set-name=arc-runner-set
# Then work through the common causes:
# 1. Node resource shortage
kubectl top nodes
# -> Verify Cluster Autoscaler is working, or lower maxRunners
# 2. PVC binding waiting
kubectl get pvc -n arc-runners
# -> Check StorageClass settings, availability zone mismatch
# 3. Image pull failure
kubectl get events -n arc-runners --sort-by='.lastTimestamp' | grep -i pull
# -> Check image tag, registry authentication
Failure 3: Jobs Not Assigned to Runners
Symptoms: Jobs remain in "Queued" state indefinitely in the GitHub UI.
# Check Runner registration status
kubectl get ephemeralrunner -n arc-runners
# Check Runner labels (must match runs-on)
kubectl get autoscalingrunnersets -n arc-runners -o yaml | grep -A5 labels
# Check Runner group settings on GitHub
# Settings > Actions > Runner groups > Verify the repository is included in the group
Recovery: Verify that the workflow's runs-on label exactly matches the ARC Runner Scale Set name. If a Runner group is configured, also verify that the repository is included in the group.
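Concretely, the runs-on label is the installation name of the scale set, i.e. the Helm release name (`arc-runner-set` throughout this guide). A minimal sketch:

```yaml
# runs-on must equal the Runner Scale Set's installation name,
# i.e. the name passed to: helm install arc-runner-set ...
jobs:
  test:
    runs-on: arc-runner-set
```

Unlike classic self-hosted runners, Scale Set runners do not support arbitrary extra labels, so a typo here silently leaves the job queued forever.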
Failure 4: Runner Version Compatibility Issues
Starting March 16, 2026, registration of Runners below v2.329.0 will be blocked. If you are using custom images, you must verify the Runner version.
# Check current Runner version
kubectl exec -n arc-runners -it <runner-pod> -- ./config.sh --version
# Update image (after modifying values.yaml)
helm upgrade arc-runner-set \
  --namespace arc-runners \
  -f values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
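To see how far behind your custom image is, you can query the latest published runner release. A sketch using the `gh` CLI (requires network access):

```shell
# Fetch the latest actions/runner release tag from GitHub.
LATEST=$(gh api repos/actions/runner/releases/latest --jq '.tag_name')
echo "latest runner release: $LATEST"

# Compare this against the version baked into your custom image
# and rebuild before the March 16, 2026 minimum-version cutoff.
```

Wiring this into a scheduled workflow turns the version check into an alert instead of a surprise.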
Large-Scale Operations Optimization
Runner Group Separation Strategy
Separate Runner Scale Sets by workload characteristics. Handling all workloads with a single Scale Set causes resource contention and noisy neighbor problems.
# Separate Runner Scale Sets by purpose

# 1. General CI (lightweight tests, lint)
# values-ci-light.yaml
minRunners: 2
maxRunners: 20
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"

# 2. Build-dedicated (compilation, Docker builds)
# values-ci-build.yaml
minRunners: 1
maxRunners: 10
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "4"
            memory: "8Gi"

# 3. GPU workloads (ML model testing)
# values-gpu.yaml
minRunners: 0
maxRunners: 4
template:
  spec:
    containers:
      - name: runner
        resources:
          limits:
            nvidia.com/gpu: 1
    nodeSelector:
      accelerator: nvidia-a10g
Graceful Shutdown Handling
If a node drain or scale-down occurs while a runner is executing a job, the job fails. Set RUNNER_GRACEFUL_STOP_TIMEOUT to wait until in-progress jobs complete.
template:
  spec:
    terminationGracePeriodSeconds: 3600  # Wait up to 1 hour
    containers:
      - name: runner
        env:
          - name: RUNNER_GRACEFUL_STOP_TIMEOUT
            value: '3500'  # Slightly shorter than terminationGracePeriodSeconds
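The invariant worth checking in CI for your values files: the runner's graceful-stop timeout must leave headroom below the Pod's termination grace period, or the kubelet will SIGKILL the runner mid-job. A minimal sanity-check sketch using the values above:

```shell
# Values copied from the snippet above; adjust to match your values.yaml.
TERMINATION_GRACE=3600
RUNNER_GRACEFUL_STOP_TIMEOUT=3500

if [ "$RUNNER_GRACEFUL_STOP_TIMEOUT" -lt "$TERMINATION_GRACE" ]; then
  echo "ok: runner finishes its graceful stop before the Pod is force-killed"
else
  echo "error: kubelet may SIGKILL the runner mid-job" >&2
  exit 1
fi
```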
Integration with Node Autoscalers
Even when ARC creates Runner Pods, if nodes are insufficient, Pods remain in Pending state. Configure Cluster Autoscaler or Karpenter alongside ARC.
# Karpenter NodePool example (dedicated to CI runners)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci-runners
spec:
  template:
    metadata:
      labels:
        workload-type: ci-runner
    spec:
      taints:
        - key: ci-runner
          value: 'true'
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['m7i.xlarge', 'm7i.2xlarge', 'm6i.xlarge', 'm6i.2xlarge']
  limits:
    cpu: 200
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
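Because the NodePool taints its nodes with `ci-runner=true:NoSchedule`, the Runner Scale Set Pods must tolerate that taint and select the node label, or they will never land on these nodes. A matching sketch for the scale set's values.yaml:

```yaml
# values.yaml: schedule runner Pods onto the tainted CI NodePool above
template:
  spec:
    nodeSelector:
      workload-type: ci-runner
    tolerations:
      - key: ci-runner
        operator: Equal
        value: 'true'
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/my-org/actions-runner:latest
```

The taint/toleration pair keeps unrelated workloads off the CI nodes while letting Karpenter scale the pool to zero when no jobs are queued.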
Utilizing Spot instances can reduce costs by an additional 50-70%. However, since jobs may fail upon Spot interruption, apply this only to low-priority CI workloads. Use On-Demand instances for production deployment pipelines.
Operations Checklist
Review the following checklists before and after adopting self-hosted runners.
Initial Setup Checklist
- GitHub App authentication configuration complete (use GitHub App instead of PAT)
- Required tools pre-installed in Runner image
- Runner version v2.329.0 or higher confirmed
- Ephemeral mode activation confirmed
- NetworkPolicy applied (allow only minimum egress)
- automountServiceAccountToken: false set on ServiceAccount
- Resource requests/limits set on Runner Pods
- Build nodes isolated via nodeSelector or Taint/Toleration
- Cache strategy decided and implemented (PVC, S3, registry cache)
Security Hardening Checklist
- Self-hosted runner usage blocked for public repositories
- Repository access scope limited via Runner groups
- Minimum-privilege permissions declared in workflows
- Actions referenced by commit SHA (not tags)
- Docker socket mounting prohibited (use container mode)
- Secret scanning and leak prevention tools applied
- Runner host OS hardened (unnecessary services removed, firewall configured)
- Short-lived token-based cloud authentication via OIDC
Operations Monitoring Checklist
- Prometheus metrics collection and Grafana dashboard configured
- Runner Pool exhaustion alert set (90% of maxRunners threshold)
- Runner registration failure alert set
- Prolonged Pod Pending alert set
- Runner version update alerts (subscribe to GitHub Changelog)
- Monthly security audit schedule established (network policy review, secret rotation)
Conclusion
Operating self-hosted runners is not just about spinning up VMs or Pods. It is a platform engineering domain that encompasses security, scaling, caching, monitoring, and incident response. While ARC and Runner Scale Sets have significantly stabilized Kubernetes-based operations, ultimately you need tuning tailored to your organization's workloads and continuous monitoring.
To recap the key points:
- Ephemeral is mandatory, not optional. It ensures both security and reproducibility.
- ARC Runner Scale Sets is currently the best autoscaling approach. Do not use the legacy webhook-based mode.
- Apply security hardening on Day 0, not as an afterthought. NetworkPolicy, RBAC, SHA pinning, and public repo blocking are baseline requirements.
- An ephemeral runner without a cache strategy is just a slow runner. Always configure PVC, S3, or registry cache.
- Monitoring and alerts are the lifeline of operations. You must be able to immediately detect Runner Pool exhaustion and registration failures.
References
- GitHub Docs - Actions Runner Controller - ARC official documentation and architecture explanation
- GitHub Docs - Deploying Runner Scale Sets with ARC - Runner Scale Set deployment tutorial
- GitHub Docs - Self-Hosted Runners - Self-hosted runner official guide
- GitHub Docs - Secure Use Reference - GitHub Actions security reference document
- GitHub Actions Runner Controller Repository - ARC source code and Helm chart values reference
- AWS Blog - Best Practices for Self-Hosted Runners at Scale - Large-scale runner operations in AWS environments
- StepSecurity Harden-Runner - GitHub Actions runtime security monitoring tool
- GitHub Blog - Self-Hosted Runner Minimum Version Enforcement - 2026 runner minimum version requirement changes