GitHub Actions Self-Hosted Runner: Large-Scale Operations and Security Hardening Guide


Why Self-Hosted Runners

GitHub-hosted runners are quick to get started with, but they hit limitations as organizations scale. When build times exceed 30 minutes, when GPU access is needed, when you need to reach internal network resources, or when costs start exceeding thousands of dollars per month, it is time to consider self-hosted runners.

Starting in March 2026, GitHub charges a control plane fee of $0.002 per minute for self-hosted runners (public repositories and GitHub Enterprise Server customers are exempt). Even so, large organizations can still achieve 60-80% cost savings compared to GitHub-hosted runners, and above all gain far greater freedom to customize their infrastructure.

Adopting self-hosted runners enables the following:

  • Direct access to private registries, databases, and secret managers on your internal network
  • Builds and tests on specialized hardware such as GPU, ARM, and Apple Silicon
  • Maintaining build caches on local storage, reducing dependency installation time by over 90%
  • Network isolation and audit logging that conforms to organizational security policies

GitHub-Hosted vs Self-Hosted vs ARC Comparison

Runner selection depends on team size, security requirements, and operational capabilities. Use the comparison table below as your decision criteria.

| Item | GitHub-Hosted | Self-Hosted (VM) | ARC (Kubernetes) |
|---|---|---|---|
| Initial setup difficulty | None | Medium | High |
| Autoscaling | Automatic | Must implement yourself | Native support |
| Cost (per 1000 hours/month) | ~$480 (Linux 2-core) | EC2 cost + ops personnel | K8s cluster cost + ops |
| Build cache | 10GB limit, Azure Blob | Local disk (unlimited) | PVC or S3 |
| Internal network access | Not possible | Possible | Possible |
| Security isolation | Managed by GitHub | Manual hardening | Pod-level isolation |
| GPU support | Limited (larger runners) | Full support | NVIDIA Device Plugin |
| Max concurrent runners | Plan-dependent limits | Infrastructure limits | Cluster node limits |
| Maintenance burden | None | High (OS patches, version mgmt) | Medium (Helm upgrades) |
| Ephemeral support | Default | --ephemeral flag | Default |

Decision criteria: If your monthly CI/CD time is under 500 hours and internal network access is not needed, GitHub-hosted is practical. If 500-2000 hours and you lack K8s operational capability, VM-based self-hosted is recommended. If over 2000 hours or you already have a K8s cluster, ARC is the best option.
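As a sanity check on those thresholds, the per-minute rates can be turned into monthly figures. A minimal sketch, assuming the ~$0.008/min Linux 2-core rate implied by the comparison table and the $0.002/min control plane fee (actual rates vary by plan):

```shell
# Rough monthly cost at 1,000 CI hours (assumed example rates)
HOURS=1000
HOSTED=$(awk -v h="$HOURS" 'BEGIN { printf "%.0f", h * 60 * 0.008 }')
FEE=$(awk -v h="$HOURS" 'BEGIN { printf "%.0f", h * 60 * 0.002 }')
echo "GitHub-hosted:             \$${HOSTED}/month"
echo "Self-hosted control plane: \$${FEE}/month (infrastructure cost extra)"
```

At 1,000 hours the hosted bill is roughly $480 against a $120 control plane fee, which is why infrastructure and staffing costs, not the fee, dominate the self-hosting decision.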

ARC (Actions Runner Controller) Architecture

Actions Runner Controller (ARC) is a Kubernetes operator officially maintained by GitHub. It started as a community project but has been developed directly by GitHub since 2023, evolving into the new Runner Scale Sets architecture. Unlike the legacy mode's webhook-based autoscaling, a Runner Scale Set communicates directly with the GitHub API and detects queued jobs in real time.

How ARC Works

┌─────────────────┐      ┌──────────────────────┐
│   GitHub.com    │      │  Kubernetes Cluster  │
│                 │      │                      │
│  Job Queue      │◄────►│  ARC Controller      │
│  (workflow_job) │      │         ▼            │
│                 │      │  ScaleSet Listener   │
│  Scale Set API  │◄────►│         ▼            │
└─────────────────┘      │  EphemeralRunnerSet  │
                         │    ├─► Runner Pod 1  │
                         │    ├─► Runner Pod 2  │
                         │    └─► Runner Pod N  │
                         └──────────────────────┘
  1. The ScaleSet Listener monitors GitHub's job queue via Long Polling
  2. Upon receiving a Job Available message, it compares the current runner count against the maxRunners setting
  3. If scale-up is possible, it ACKs the message and patches the EphemeralRunnerSet replica count via the Kubernetes API
  4. A new Runner Pod is created and registered with GitHub using a JIT (Just-In-Time) token
  5. Once job execution completes, the Pod is immediately deleted (default ephemeral behavior)
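Once ARC is installed, each hop in this flow can be inspected directly. The commands below assume the namespaces and resource names used in the installation steps of this guide; adjust them to your setup:

```shell
# Listener pod: one per Runner Scale Set, performs the long polling
kubectl get pods -n arc-systems \
  -l app.kubernetes.io/component=runner-scale-set-listener

# Replica count patched on scale-up; --watch shows it change as jobs queue
kubectl get ephemeralrunnersets -n arc-runners --watch

# Individual ephemeral runners and their lifecycle phase
kubectl get ephemeralrunners -n arc-runners
```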

ARC Installation and Configuration

Prerequisites

  • Kubernetes 1.27 or higher
  • Helm 3.x
  • GitHub App or Personal Access Token (org-level admin:org, repo-level repo scope)
  • cert-manager (optional, for automated TLS certificate management)

Step 1: Controller Installation

# Create namespace for ARC controller
kubectl create namespace arc-systems

# Install via Helm
helm install arc \
  --namespace arc-systems \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.10.1

Step 2: GitHub App Authentication Setup

GitHub App authentication is strongly recommended over Personal Access Tokens. PATs are tied to individual users, causing issues when employees leave, and their permission scope is broad. GitHub Apps are managed at the organization level and can be granted minimum necessary permissions.

# Create GitHub App secret
kubectl create secret generic github-app-secret \
  --namespace arc-runners \
  --from-literal=github_app_id=12345 \
  --from-literal=github_app_installation_id=67890 \
  --from-file=github_app_private_key=./private-key.pem

Step 3: Runner Scale Set Deployment

# values.yaml - Runner Scale Set configuration
githubConfigUrl: 'https://github.com/my-org'
githubConfigSecret: github-app-secret

# Autoscaling settings
minRunners: 2 # Minimum standby runners (prevents cold starts)
maxRunners: 30 # Maximum runners (consider cluster resources)

# Runner group assignment (Enterprise/Org level)
runnerGroup: 'production-runners'

# Container mode settings
containerMode:
  type: 'kubernetes'
  kubernetesModeWorkVolumeClaim:
    accessModes: ['ReadWriteOnce']
    storageClassName: 'gp3'
    resources:
      requests:
        storage: 50Gi

# Pod template customization
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        resources:
          requests:
            cpu: '2'
            memory: '4Gi'
          limits:
            cpu: '4'
            memory: '8Gi'
        env:
          - name: RUNNER_GRACEFUL_STOP_TIMEOUT
            value: '60'
    # Node selection (dedicated build node pool)
    nodeSelector:
      workload-type: ci-runner
    tolerations:
      - key: 'ci-runner'
        operator: 'Equal'
        value: 'true'
        effect: 'NoSchedule'
# Deploy Runner Scale Set
helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  -f values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.10.1

Custom Runner Image Build

The default Runner image does not include build tools. Building a custom image with your organization's tools pre-installed can significantly reduce workflow execution time.

Dockerfile Writing Principles

  • Use slim variants for base images when possible
  • Use Runner binary version v2.329.0 or higher (versions below this will be blocked from registration starting March 16, 2026)
  • Avoid installing unnecessary packages and minimize image size with multi-stage builds
  • Run the Runner process as a dedicated user, not root
# Dockerfile.runner - Custom GitHub Actions Runner image
FROM ubuntu:22.04 AS base

# Install system packages (keep minimal)
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    ca-certificates \
    git \
    jq \
    unzip \
    zip \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Create dedicated runner user
RUN useradd -m -d /home/runner -s /bin/bash runner

# Install Runner binary
ARG RUNNER_VERSION=2.329.0
RUN curl -fsSL -o runner.tar.gz \
    "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" \
    && mkdir -p /home/runner/actions-runner \
    && tar xzf runner.tar.gz -C /home/runner/actions-runner \
    && rm runner.tar.gz \
    && /home/runner/actions-runner/bin/installdependencies.sh

# Install Node.js 22 LTS
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y nodejs \
    && rm -rf /var/lib/apt/lists/*

# Install Docker CLI (CLI only, not DinD)
RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker.gpg \
    && echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" \
    > /etc/apt/sources.list.d/docker.list \
    && apt-get update && apt-get install -y docker-ce-cli \
    && rm -rf /var/lib/apt/lists/*

# Container-friendly runner settings: trap signals in run.sh and stream
# diagnostic logs to stdout (version pinning is handled by this image,
# not by the runner's auto-update mechanism)
ENV RUNNER_MANUALLY_TRAP_SIG=1
ENV ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1

# Set permissions and switch user
RUN chown -R runner:runner /home/runner
USER runner
WORKDIR /home/runner/actions-runner

ENTRYPOINT ["./run.sh"]
# Build and push image
docker build -t ghcr.io/my-org/actions-runner:v2.329.0-custom -f Dockerfile.runner .
docker push ghcr.io/my-org/actions-runner:v2.329.0-custom

Note: with auto-updates disabled (runner versions are pinned through the image), you must rebuild the image and update the ARC Runner Scale Set image tag whenever a new Runner version is needed. GitHub periodically raises the minimum version requirement, so monitor the release notes.

Security Hardening

Self-hosted runners execute external code within your organization's infrastructure. Operating without hardening creates pathways for supply chain attacks, secret leaks, and network infiltration.

Mandate Ephemeral Runners

Persistent runners allow files, environment variables, and processes from previous jobs to affect subsequent jobs. If an attacker installs a backdoor on the runner through a malicious workflow, all subsequent jobs become compromised. Ephemeral runners are destroyed immediately after job completion, eliminating this risk at its source.

# Verify ephemeral runner usage in workflows
runs-on: arc-runner-set # ARC is ephemeral by default

# For VM-based self-hosted runners
# Register with the --ephemeral flag via ./config.sh
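For the VM case, a registration sketch is shown below. The URL, token variable, and runner name are placeholders; the flags are standard config.sh options:

```shell
# Hypothetical VM registration: the runner deregisters after one job
export RUNNER_REG_TOKEN="<registration token from GitHub UI or API>"
./config.sh \
  --url https://github.com/my-org/my-repo \
  --token "$RUNNER_REG_TOKEN" \
  --ephemeral \
  --disableupdate \
  --unattended \
  --name "vm-runner-$(hostname)"
./run.sh
```

Pair this with an external loop (systemd unit or user-data script) that re-registers the VM after each job, since the ephemeral runner removes itself on completion.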

Network Isolation

Runner Pods can access both the internet and internal networks, so NetworkPolicy must be used to allow only necessary traffic.

# network-policy.yaml - Runner Pod network restrictions
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: runner-network-policy
  namespace: arc-runners
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: runner
  policyTypes:
    - Egress
    - Ingress
  ingress: [] # Block inbound traffic from outside to Runner
  egress:
    # GitHub API and Actions services
    # NOTE: example CIDRs only; GitHub's IP ranges change, so source
    # current values from https://api.github.com/meta
    - to:
        - ipBlock:
            cidr: 140.82.112.0/20 # github.com
        - ipBlock:
            cidr: 185.199.108.0/22 # GitHub Pages/CDN
      ports:
        - protocol: TCP
          port: 443
    # Internal container registry
    - to:
        - namespaceSelector:
            matchLabels:
              name: registry
      ports:
        - protocol: TCP
          port: 5000
    # DNS
    - to: []
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53

RBAC Least Privilege Principle

Grant only the minimum permissions to the Runner Pod's ServiceAccount. In particular, access to the Kubernetes API must be restricted.

# rbac.yaml - Runner ServiceAccount minimum permissions
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runner-sa
  namespace: arc-runners
automountServiceAccountToken: false # Disable auto-mounting K8s API token
---
# Only bind minimal Role when necessary
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner-minimal
  namespace: arc-runners
rules: [] # No permissions by default

Workflow-Level Security

# Secure workflow writing patterns
name: Secure CI Pipeline
on:
  pull_request:
    branches: [main]

# Minimum privilege tokens
permissions:
  contents: read
  packages: read

jobs:
  build:
    runs-on: arc-runner-set
    steps:
      # Use Actions with SHA pinning (not tags)
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

      # Avoid directly exposing secrets in environment variables
      - name: Build
        run: |
          echo "Building..."
        env:
          # Inject secrets only in the steps that need them
          REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}

Runtime Security with Harden-Runner

StepSecurity's Harden-Runner is a GitHub Actions-specific EDR (Endpoint Detection and Response) that provides network egress monitoring, file integrity checking, and process activity tracking.

jobs:
  build:
    runs-on: arc-runner-set
    steps:
      - uses: step-security/harden-runner@0634a2670c59f64b4a01f0f96f84700a4088b9f0 # v2.12.0
        with:
          egress-policy: audit # First use audit to understand traffic patterns
          # Switch to block after understanding patterns
          # egress-policy: block
          # allowed-endpoints: >
          #   github.com:443
          #   registry.npmjs.org:443
          #   ghcr.io:443

Public Repository Considerations

Never use self-hosted runners with public repositories. External attackers can execute arbitrary code on the runner through Fork PRs. The combination of pull_request_target events and self-hosted runners is especially dangerous. Only use them with private repositories or internal organization repositories.
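Even on private repositories, a fork-origin guard adds defense in depth. The sketch below (job name and runner label follow the earlier examples) skips the job whenever the pull request head comes from a different repository:

```yaml
jobs:
  build:
    # Skip jobs triggered from forked repositories
    if: github.event.pull_request.head.repo.full_name == github.repository
    runs-on: arc-runner-set
```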

Cache Strategies

The biggest drawback of ephemeral runners is that the cache is lost with every job. Without an efficient cache strategy, every build must start from dependency downloads.

Strategy 1: PersistentVolumeClaim (PVC) Based Cache

# Add PVC to ARC Runner Scale Set values.yaml
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/my-org/actions-runner:latest
        volumeMounts:
          - name: cache-volume
            mountPath: /opt/cache
        env:
          - name: RUNNER_TOOL_CACHE
            value: /opt/cache/tool-cache
          - name: npm_config_cache
            value: /opt/cache/npm
          - name: GOPATH
            value: /opt/cache/go
    volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: runner-cache-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-cache-pvc
  namespace: arc-runners
spec:
  accessModes:
    - ReadWriteMany # Multiple Runner Pods access simultaneously
  storageClassName: efs # AWS EFS or NFS
  resources:
    requests:
      storage: 100Gi

Note: ReadWriteMany mode requires a network file system such as EFS, NFS, or GlusterFS. Block storage like EBS only supports ReadWriteOnce, allowing access by only one Pod at a time.

Strategy 2: S3-Compatible Cache Server

GitHub's default cache (actions/cache) is backed by Azure Blob Storage, so runners in AWS pay cross-cloud latency on every cache operation. Running an S3-compatible cache server in the same VPC and region minimizes that latency.

# MinIO-based cache server deployment (within the same VPC/region)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: actions-cache-server
  namespace: arc-systems
spec:
  replicas: 1
  selector:
    matchLabels:
      app: actions-cache
  template:
    metadata:
      labels:
        app: actions-cache
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          args: ['server', '/data', '--console-address', ':9001']
          env:
            - name: MINIO_ROOT_USER
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: user
            - name: MINIO_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: minio-credentials
                  key: password
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: minio-data-pvc

Strategy 3: Docker Layer Cache (BuildKit)

If container image builds are the primary CI workload, set BuildKit's cache backend to a registry to share layer caches.

# Use BuildKit registry cache in workflows
- name: Build and Push
  uses: docker/build-push-action@48aba3b46d1b1fec4febb7c5d0c644b249a11355 # v6
  with:
    push: true
    tags: ghcr.io/my-org/my-app:${{ github.sha }}
    cache-from: type=registry,ref=ghcr.io/my-org/my-app:buildcache
    cache-to: type=registry,ref=ghcr.io/my-org/my-app:buildcache,mode=max

Monitoring and Observability

The most common question when operating self-hosted runners is "Why isn't the runner starting?" Without monitoring, you cannot provide an answer.

Prometheus + Grafana Metrics

ARC exposes Prometheus metrics by default. The key metrics are:

  • gha_runner_scale_set_desired_replicas: Currently requested runner count
  • gha_runner_scale_set_running_replicas: Currently running runner count
  • gha_runner_scale_set_registered_replicas: Runners successfully registered with GitHub
  • gha_runner_scale_set_idle_replicas: Idle runner count
# Prometheus ServiceMonitor configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: arc-controller-monitor
  namespace: arc-systems
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gha-runner-scale-set-controller
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
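With these metrics scraped, runner pool utilization reduces to a one-line PromQL ratio over the gauges listed above (gha_runner_scale_set_max_replicas also appears in the alert rules):

```promql
# Fraction of the runner pool in use; investigate when this nears 1.0
gha_runner_scale_set_running_replicas / gha_runner_scale_set_max_replicas
```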

Key Alert Rules

# Prometheus alerting rules (routed through Alertmanager)
groups:
  - name: arc-runner-alerts
    rules:
      # Runner pool exhaustion warning
      - alert: RunnerPoolExhausted
        expr: |
          gha_runner_scale_set_desired_replicas
          >= gha_runner_scale_set_max_replicas * 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Runner pool is over 90% utilized'
          description: 'Increase maxRunners or optimize workflows'

      # Runner registration failure detection
      - alert: RunnerRegistrationFailed
        expr: |
          rate(gha_runner_scale_set_registration_failures_total[5m]) > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Runner registration failure detected'
          description: 'Check GitHub App authentication or network'

      # Prolonged Pod Pending state
      - alert: RunnerPodPending
        expr: |
          kube_pod_status_phase{namespace="arc-runners", phase="Pending"} > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Runner Pod has been Pending for over 10 minutes'
          description: 'Possible node resource shortage or PVC binding failure'

Failure Cases and Recovery Procedures

Failure 1: ScaleSet Listener CrashLoopBackOff

Symptoms: The Listener Pod repeatedly restarts and runners do not scale up at all.

Root cause analysis order:

# 1. Check Listener Pod logs
kubectl logs -n arc-systems -l app.kubernetes.io/component=runner-scale-set-listener --tail=100

# 2. Common cause: GitHub App authentication expiry
# - Check private key file
# - Check App installation status (org settings > GitHub Apps)

# 3. Network issue: Cannot reach GitHub API
kubectl exec -n arc-systems deploy/arc-gha-runner-scale-set-controller -- \
  curl -s https://api.github.com/meta | jq '.actions[]'

Recovery: Renew the GitHub App's private key and update the secret.

kubectl create secret generic github-app-secret \
  --namespace arc-runners \
  --from-literal=github_app_id=12345 \
  --from-literal=github_app_installation_id=67890 \
  --from-file=github_app_private_key=./new-private-key.pem \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Controller
kubectl rollout restart deployment -n arc-systems arc-gha-runner-scale-set-controller

Failure 2: Runner Pod Stuck in Pending State

Symptoms: Jobs queue up but Runner Pods are not created or remain in Pending state.

# Check Pod events
kubectl describe pod -n arc-runners -l actions.github.com/scale-set-name=arc-runner-set

# Response per common cause
# 1. Node resource shortage
kubectl top nodes
# -> Verify Cluster Autoscaler is working, or lower maxRunners

# 2. PVC binding waiting
kubectl get pvc -n arc-runners
# -> Check StorageClass settings, availability zone mismatch

# 3. Image pull failure
kubectl get events -n arc-runners --sort-by='.lastTimestamp' | grep -i pull
# -> Check image tag, registry authentication

Failure 3: Jobs Not Assigned to Runners

Symptoms: Jobs remain in "Queued" state indefinitely in the GitHub UI.

# Check Runner registration status
kubectl get ephemeralrunner -n arc-runners

# Check Runner labels (must match runs-on)
kubectl get autoscalingrunnersets -n arc-runners -o yaml | grep -A5 labels

# Check Runner group settings on GitHub
# Settings > Actions > Runner groups > Verify the repository is included in the group

Recovery: Verify that the workflow's runs-on label exactly matches the ARC Runner Scale Set name. If a Runner group is configured, also verify that the repository is included in the group.

Failure 4: Runner Version Compatibility Issues

Starting March 16, 2026, registration of Runners below v2.329.0 will be blocked. If you are using custom images, you must verify the Runner version.

# Check current Runner version
kubectl exec -n arc-runners -it <runner-pod> -- ./config.sh --version

# Update image (after modifying values.yaml)
helm upgrade arc-runner-set \
  --namespace arc-runners \
  -f values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set

Large-Scale Operations Optimization

Runner Group Separation Strategy

Separate Runner Scale Sets by workload characteristics. Handling all workloads with a single Scale Set causes resource contention and noisy neighbor problems.

# Separate Runner Scale Sets by purpose
# 1. General CI (lightweight tests, lint)
# values-ci-light.yaml
minRunners: 2
maxRunners: 20
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "1"
            memory: "2Gi"

# 2. Build-dedicated (compilation, Docker builds)
# values-ci-build.yaml
minRunners: 1
maxRunners: 10
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "4"
            memory: "8Gi"

# 3. GPU workloads (ML model testing)
# values-gpu.yaml
minRunners: 0
maxRunners: 4
template:
  spec:
    containers:
      - name: runner
        resources:
          limits:
            nvidia.com/gpu: 1
    nodeSelector:
      accelerator: nvidia-a10g

Graceful Shutdown Handling

If a node drain or scale-down occurs while a runner is executing a job, the job fails. Set RUNNER_GRACEFUL_STOP_TIMEOUT to wait until in-progress jobs complete.

template:
  spec:
    terminationGracePeriodSeconds: 3600 # Wait up to 1 hour
    containers:
      - name: runner
        env:
          - name: RUNNER_GRACEFUL_STOP_TIMEOUT
            value: '3500' # Slightly shorter than terminationGracePeriodSeconds

Integration with Node Autoscalers

Even when ARC creates Runner Pods, if nodes are insufficient, Pods remain in Pending state. Configure Cluster Autoscaler or Karpenter alongside ARC.

# Karpenter NodePool example (dedicated to CI Runners)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ci-runners
spec:
  template:
    metadata:
      labels:
        workload-type: ci-runner
    spec:
      taints:
        - key: ci-runner
          value: 'true'
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['m7i.xlarge', 'm7i.2xlarge', 'm6i.xlarge', 'm6i.2xlarge']
  limits:
    cpu: 200
    memory: 400Gi
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s

Utilizing Spot instances can reduce costs by an additional 50-70%. However, since jobs may fail upon Spot interruption, apply this only to low-priority CI workloads. Use On-Demand instances for production deployment pipelines.
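One way to implement that split, assuming the Karpenter NodePool above, is to pin only the low-priority Scale Set to Spot nodes via the standard Karpenter capacity-type node label. A sketch to merge into values-ci-light.yaml:

```yaml
# values-ci-light.yaml - run lightweight CI only on Spot capacity
template:
  spec:
    nodeSelector:
      workload-type: ci-runner
      karpenter.sh/capacity-type: spot
```

The build and GPU Scale Sets keep their default selectors, so deployment-critical jobs land on On-Demand capacity.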

Operations Checklist

Review the following checklists before and after adopting self-hosted runners.

Initial Setup Checklist

  • GitHub App authentication configuration complete (use GitHub App instead of PAT)
  • Required tools pre-installed in Runner image
  • Runner version v2.329.0 or higher confirmed
  • Ephemeral mode activation confirmed
  • NetworkPolicy applied (allow only minimum egress)
  • automountServiceAccountToken: false set on ServiceAccount
  • Resource requests/limits set on Runner Pods
  • Build nodes isolated via nodeSelector or Taint/Toleration
  • Cache strategy decided and implemented (PVC, S3, registry cache)

Security Hardening Checklist

  • Self-hosted runner usage blocked for public repositories
  • Repository access scope limited via Runner groups
  • Minimum privilege permissions declared in workflows
  • Actions referenced by commit SHA (not tags)
  • Docker socket mounting prohibited (use container mode)
  • Secret scanning and leak prevention tools applied
  • Runner host OS hardened (unnecessary services removed, firewall configured)
  • Short-lived token-based cloud authentication via OIDC

Operations Monitoring Checklist

  • Prometheus metrics collection and Grafana dashboard configured
  • Runner Pool exhaustion alert set (90% of maxRunners threshold)
  • Runner registration failure alert set
  • Prolonged Pod Pending alert set
  • Runner version update alerts (subscribe to GitHub Changelog)
  • Monthly security audit schedule established (network policy review, secret rotation)

Conclusion

Operating self-hosted runners is not just about spinning up VMs or Pods. It is a platform engineering domain that encompasses security, scaling, caching, monitoring, and incident response. While ARC and Runner Scale Sets have significantly stabilized Kubernetes-based operations, ultimately you need tuning tailored to your organization's workloads and continuous monitoring.

To recap the key points:

  1. Ephemeral is mandatory, not optional. It ensures both security and reproducibility.
  2. ARC Runner Scale Sets is currently the best autoscaling approach. Do not use the legacy webhook-based mode.
  3. Apply security hardening on Day 0, not as an afterthought. NetworkPolicy, RBAC, SHA pinning, and public repo blocking are baseline requirements.
  4. An ephemeral runner without a cache strategy is just a slow runner. Always configure PVC, S3, or registry cache.
  5. Monitoring and alerts are the lifeline of operations. You must be able to immediately detect Runner Pool exhaustion and registration failures.

References

  1. GitHub Docs - Actions Runner Controller - ARC official documentation and architecture explanation
  2. GitHub Docs - Deploying Runner Scale Sets with ARC - Runner Scale Set deployment tutorial
  3. GitHub Docs - Self-Hosted Runners - Self-hosted runner official guide
  4. GitHub Docs - Secure Use Reference - GitHub Actions security reference document
  5. GitHub Actions Runner Controller Repository - ARC source code and Helm chart values reference
  6. AWS Blog - Best Practices for Self-Hosted Runners at Scale - Large-scale runner operations in AWS environments
  7. StepSecurity Harden-Runner - GitHub Actions runtime security monitoring tool
  8. GitHub Blog - Self-Hosted Runner Minimum Version Enforcement - 2026 runner minimum version requirement changes