Kubernetes RBAC Deep Dive and OPA Gatekeeper Policy-as-Code Operations Guide

Why RBAC Alone Is Not Enough

Kubernetes RBAC (Role-Based Access Control) is the core mechanism for controlling "who can perform which actions on which resources." However, RBAC alone cannot satisfy requirements such as:

  • Constraints on resource content: RBAC controls whether you may create a Pod, but it cannot validate whether that Pod sets privileged: true or uses an image from an approved registry.
  • Enforcing naming conventions: An organizational policy that specific labels or annotations must always be present cannot be expressed in RBAC.
  • Dynamic policy changes: RBAC changes require editing YAML and running kubectl apply, which makes consistent policy rollout across hundreds of clusters difficult.

The Admission Controller-based Policy-as-Code approach bridges this gap, and OPA Gatekeeper is its best-known implementation. This article covers the whole journey, from advanced RBAC design through codifying policies with Gatekeeper and operating them.

Advanced RBAC Design Principles

Applying Least Privilege in Practice

The first principle of RBAC design is to grant only the minimum permissions required. Follow these rules:

  1. Prefer RoleBinding over ClusterRoleBinding: Do not use a ClusterRoleBinding when a namespace-scoped RoleBinding is sufficient.
  2. No wildcards: Never use resources: ["*"] or verbs: ["*"].
  3. Never use the system:masters group: Members of this group bypass all RBAC checks. Manage it separately, for break-glass procedures only.
# bad-example: excessive permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: too-permissive
rules:
  - apiGroups: ['*']
    resources: ['*']
    verbs: ['*']
---
# good-example: only the permissions that are needed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-deployer
  namespace: production
rules:
  - apiGroups: ['apps']
    resources: ['deployments']
    verbs: ['get', 'list', 'watch', 'create', 'update', 'patch']
  - apiGroups: ['']
    resources: ['services', 'configmaps']
    verbs: ['get', 'list', 'watch', 'create', 'update']
  - apiGroups: ['']
    resources: ['pods']
    verbs: ['get', 'list', 'watch']
  - apiGroups: ['']
    resources: ['pods/log']
    verbs: ['get']
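A Role grants nothing until it is bound to a subject. As a minimal sketch (the group name ci-deployers is illustrative, not from the original), a namespace-scoped RoleBinding that attaches the app-deployer Role above to a CI group:

```yaml
# Sketch: bind the app-deployer Role to a hypothetical "ci-deployers" group,
# scoped to the production namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-deployer-binding
  namespace: production
subjects:
  - kind: Group
    name: ci-deployers # illustrative group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-deployer
  apiGroup: rbac.authorization.k8s.io
```

Because the binding is a RoleBinding rather than a ClusterRoleBinding, the permissions cannot leak outside the production namespace even if the referenced Role is later widened.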

Preventing Privilege Escalation

The Kubernetes RBAC API blocks privilege escalation by default: to create or modify a Role or RoleBinding, a user must already hold every permission contained in that Role. The following verbs can bypass this protection, however, and deserve special attention.

| Dangerous verb | Description | Mitigation |
| --- | --- | --- |
| escalate | Allows adding permissions to a Role that the user does not hold | Grant to platform admins only; double-check with OPA policies |
| bind | Allows binding a Role whose permissions the user does not hold | Restrict the ability to create ClusterRoleBindings itself |
| impersonate | Allows acting as another user or group | Monitor audit logs without exception; allow only specific targets |
# Audit policy that monitors privilege-escalation-related verbs
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    verbs: ['escalate', 'bind', 'impersonate']
    resources:
      - group: 'rbac.authorization.k8s.io'
        resources: ['clusterroles', 'clusterrolebindings', 'roles', 'rolebindings']
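The bind mitigation in the table above can be made concrete with RBAC itself: the bind verb can be scoped to an explicit allowlist of roles via resourceNames. A sketch (the role names in the allowlist are illustrative):

```yaml
# Sketch: let a subject create RoleBindings, but only for pre-approved Roles.
# The API server checks the "bind" permission against the referenced role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: restricted-role-granter
rules:
  - apiGroups: ['rbac.authorization.k8s.io']
    resources: ['rolebindings']
    verbs: ['create', 'update']
  - apiGroups: ['rbac.authorization.k8s.io']
    resources: ['clusterroles']
    verbs: ['bind']
    resourceNames: ['app-deployer', 'viewer'] # illustrative allowlist
```

With this shape, a team lead can hand out well-known roles without ever being able to bind cluster-admin.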

Using Aggregated ClusterRoles Safely

An Aggregated ClusterRole automatically merges the rules of multiple ClusterRoles selected by label. Convenient, but it can produce unintended permission accumulation (role explosion).

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-aggregate
  labels:
    rbac.example.com/aggregate-to-monitoring: 'true'
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.example.com/aggregate-to-monitoring: 'true'
rules: [] # rules are populated automatically
---
# This ClusterRole's rules are merged into the aggregate above automatically
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-pods
  labels:
    rbac.example.com/aggregate-to-monitoring: 'true'
rules:
  - apiGroups: ['']
    resources: ['pods', 'pods/log']
    verbs: ['get', 'list', 'watch']

Operational tip: periodically review which rules have actually been merged into each Aggregated ClusterRole.

# Inspect the effective rules of the Aggregated ClusterRole
kubectl get clusterrole monitoring-aggregate -o jsonpath='{.rules}' | jq .

# List all ClusterRoles carrying a given label
kubectl get clusterrole -l rbac.example.com/aggregate-to-monitoring=true

ServiceAccount Management Strategy

Every Namespace gets an auto-created default ServiceAccount. Do not use it as-is for workloads.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  namespace: production
automountServiceAccountToken: false # disable token mounting when it is not needed
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: production
spec:
  template:
    spec:
      serviceAccountName: my-app-sa
      automountServiceAccountToken: false # also set it explicitly at the Pod level
      containers:
        - name: app
          image: registry.example.com/my-app:v2.1.0
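The dedicated ServiceAccount still needs its permissions granted explicitly. A sketch pairing a minimal read-only Role with a RoleBinding for my-app-sa (the role name and rules are illustrative):

```yaml
# Sketch: give my-app-sa read-only access to ConfigMaps in its own namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader # illustrative name
  namespace: production
rules:
  - apiGroups: ['']
    resources: ['configmaps']
    verbs: ['get', 'list', 'watch']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-sa-configmap-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: my-app-sa
    namespace: production
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io
```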

RBAC vs ABAC vs OPA Comparison

Before choosing a policy engine, be clear about how the approaches differ.

| Item | RBAC | ABAC | OPA Gatekeeper |
| --- | --- | --- | --- |
| Policy unit | Role-based | Attribute-based | Rule-based (Rego) |
| Configuration | Kubernetes API objects | Static file (API server restart required) | CRDs (ConstraintTemplate + Constraint) |
| Resource content validation | Not possible | Limited | Fully supported |
| Dynamic updates | kubectl apply | API server restart | kubectl apply (zero downtime) |
| Mutation support | N/A | N/A | Supported (Assign, AssignMetadata) |
| Audit capability | None | None | Can audit existing resources |
| Learning curve | Low | Medium | High (requires learning Rego) |
| Maturity | Built-in feature | Deprecated | CNCF Graduated (OPA) |

Key point: RBAC controls "whether you may access"; OPA Gatekeeper validates "whether the resource content is acceptable." They are complements, not replacements for each other.

OPA Gatekeeper Architecture

Admission Controller Flow

When handling a request, the Kubernetes API server passes it through Admission Controllers in this order:

API request → Authentication → Authorization (RBAC) → Mutating Admission → Validating Admission → stored in etcd
                                                             ↑                       ↑
                                                    Gatekeeper Mutation     Gatekeeper Validation

Gatekeeper registers both a ValidatingAdmissionWebhook and a MutatingAdmissionWebhook, so it evaluates policy before the API server persists a resource.
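Gatekeeper's mutation side works through mutator CRDs. As a sketch, an AssignMetadata mutator that fills in a default label when it is missing (the label key, default value, and match scope are illustrative; AssignMetadata can only set fields under metadata.labels and metadata.annotations, and never overwrites an existing value):

```yaml
# Sketch: default the "owner" label on production Pods that lack it.
apiVersion: mutations.gatekeeper.sh/v1
kind: AssignMetadata
metadata:
  name: set-default-owner-label
spec:
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ['*']
        kinds: ['Pod']
    namespaces: ['production']
  location: 'metadata.labels.owner' # illustrative label key
  parameters:
    assign:
      value: 'unknown' # illustrative default
```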

Core Components

Gatekeeper consists of three main components:

  1. Controller Manager: manages the ConstraintTemplate and Constraint CRDs and compiles the Rego policies.
  2. Audit Controller: periodically scans existing resources for policy violations (every 60 seconds by default).
  3. Webhook Server: receives Admission requests from the API server and evaluates policies in real time.

Installing Gatekeeper

# Install with Helm (recommended)
helm repo add gatekeeper https://open-policy-agent.github.io/gatekeeper/charts
helm repo update

helm install gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --create-namespace \
  --set replicas=3 \
  --set audit.replicas=1 \
  --set audit.logLevel=INFO \
  --set logDenies=true \
  --set emitAdmissionEvents=true \
  --set emitAuditEvents=true

# Verify the installation
kubectl get pods -n gatekeeper-system
kubectl get crd | grep gatekeeper

CRDs that should be present after installation:

assign.mutations.gatekeeper.sh
assignmetadata.mutations.gatekeeper.sh
configs.config.gatekeeper.sh
constraintpodstatuses.status.gatekeeper.sh
constrainttemplatepodstatuses.status.gatekeeper.sh
constrainttemplates.templates.gatekeeper.sh
expansiontemplate.expansion.gatekeeper.sh
modifyset.mutations.gatekeeper.sh
mutatorpodstatuses.status.gatekeeper.sh
providers.externaldata.gatekeeper.sh

ConstraintTemplate Authoring in Practice

A ConstraintTemplate defines the template of a policy; a Constraint fills in its parameters to activate the actual policy.

Example 1: Enforcing Required Labels

A policy requiring that every Deployment carries the app.kubernetes.io/name and app.kubernetes.io/owner labels.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              description: 'List of label names that must be present'
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("Resource is missing required labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployment-must-have-labels
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: ['apps']
        kinds: ['Deployment']
    namespaces: ['production', 'staging']
    excludedNamespaces: ['kube-system', 'gatekeeper-system']
  parameters:
    labels:
      - 'app.kubernetes.io/name'
      - 'app.kubernetes.io/owner'

Example 2: Allowing Only Approved Container Registries

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedrepos
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedRepos
      validation:
        openAPIV3Schema:
          type: object
          properties:
            repos:
              type: array
              description: 'List of allowed container registry prefixes'
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedrepos

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not startswith_any(container.image, input.parameters.repos)
          msg := sprintf("Container '%v' image '%v' is not from an allowed registry. Allowed: %v", [container.name, container.image, input.parameters.repos])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          not startswith_any(container.image, input.parameters.repos)
          msg := sprintf("initContainer '%v' image '%v' is not from an allowed registry. Allowed: %v", [container.name, container.image, input.parameters.repos])
        }

        startswith_any(str, prefixes) {
          prefix := prefixes[_]
          startswith(str, prefix)
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-repos-production
spec:
  enforcementAction: deny
  match:
    kinds:
      - apiGroups: ['']
        kinds: ['Pod']
    namespaces: ['production']
  parameters:
    repos:
      - 'registry.example.com/'
      - 'gcr.io/my-project/'

Example 3: Blocking Privileged Containers

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spspprivileged
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivileged
      validation:
        openAPIV3Schema:
          type: object
          properties:
            exemptImages:
              type: array
              description: 'Images exempt from this policy'
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivileged

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged == true
          not is_exempt(container.image)
          msg := sprintf("Privileged containers are not allowed: '%v'", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          container.securityContext.privileged == true
          not is_exempt(container.image)
          msg := sprintf("Privileged initContainers are not allowed: '%v'", [container.name])
        }

        is_exempt(image) {
          exempt := input.parameters.exemptImages[_]
          image == exempt
        }

enforcementAction Strategy: From Audit to Deny

The core Gatekeeper rollout strategy is gradual enforcement. Going straight to deny can block existing workloads en masse.

Staged Rollout Flow

Stage 1: dryrun  →  Stage 2: warn  →  Stage 3: deny
(audit only)        (show warnings)    (actually block)

| enforcementAction | Behavior | When to use |
| --- | --- | --- |
| dryrun | Violations are recorded in audit results only; requests are allowed | Initial rollout of a policy; impact assessment |
| warn | A warning message is returned, but the request is allowed | Raising awareness among development teams |
| deny | The request is rejected | Production, after sufficient testing |

Checking Audit Results

# Inspect violations recorded on a specific Constraint
kubectl get k8srequiredlabels deployment-must-have-labels -o yaml

# Filter just the violating resources (using jq)
kubectl get k8srequiredlabels deployment-must-have-labels -o json | \
  jq '.status.violations[] | {name: .name, namespace: .namespace, message: .message}'

# Check the Gatekeeper audit logs
kubectl logs -n gatekeeper-system -l control-plane=audit-controller --tail=100 | \
  grep '"process":"audit"'

Switching Safely from dryrun to deny

#!/bin/bash
# safe-enforcement-switch.sh
# Check for violations before switching dryrun -> deny

CONSTRAINT_KIND=$1
CONSTRAINT_NAME=$2

echo "=== Checking current violation count ==="
VIOLATIONS=$(kubectl get ${CONSTRAINT_KIND} ${CONSTRAINT_NAME} -o json | \
  jq '.status.totalViolations')

echo "Total violations: ${VIOLATIONS}"

if [ "${VIOLATIONS}" -gt 0 ]; then
  echo ""
  echo "=== Violating resources ==="
  kubectl get ${CONSTRAINT_KIND} ${CONSTRAINT_NAME} -o json | \
    jq -r '.status.violations[] | "\(.namespace)/\(.name): \(.message)"'
  echo ""
  echo "[WARNING] Violations exist. Switching to deny will block updates to these resources."
  echo "Fix the violating resources first."
  exit 1
fi

echo ""
echo "No violations found. Switching to deny."
kubectl patch ${CONSTRAINT_KIND} ${CONSTRAINT_NAME} --type=merge \
  -p '{"spec":{"enforcementAction":"deny"}}'
echo "Done."

Gatekeeper vs Kyverno: Choosing a Policy Engine

OPA Gatekeeper and Kyverno are the two dominant Kubernetes policy engines. Choose according to your project's circumstances.

| Comparison | OPA Gatekeeper | Kyverno |
| --- | --- | --- |
| Policy language | Rego (dedicated language) | YAML (Kubernetes-native) |
| CNCF stage | Graduated (OPA) | Incubating |
| Validating webhook | Supported | Supported |
| Mutating webhook | Supported (Assign, AssignMetadata) | Supported (native) |
| Resource generation | Not supported | Supported |
| Image signature verification | Needs external data integration | Built in (Cosign, Notary) |
| Audit | Built in (periodic scan) | Built in (PolicyReport CRD) |
| External data | External Data Provider API | API calls supported |
| Multi-cluster | External tooling such as Config Sync | Limited native support |
| ValidatingAdmissionPolicy integration | Integrated since v3.22 | Supported |
| Learning curve | High (Rego) | Low (YAML) |
| Expressiveness | Very high (complex logic possible) | Medium (supplemented with CEL expressions) |
| Resource footprint | High (multiple Pods) | Medium (single controller) |

Selection criteria in short:

  • Teams already fluent in Rego, or needing complex cross-resource policies: Gatekeeper
  • Teams comfortable with Kubernetes YAML where Mutation/Generation is central: Kyverno
  • Large enterprises leveraging the OPA ecosystem (Styra DAS, etc.): Gatekeeper

Integrating Policies into the CI/CD Pipeline

Git-Based Policy Repository Layout

policies/
├── templates/
│   ├── k8s-required-labels.yaml
│   ├── k8s-allowed-repos.yaml
│   └── k8s-psp-privileged.yaml
├── constraints/
│   ├── production/
│   │   ├── required-labels.yaml
│   │   └── allowed-repos.yaml
│   └── staging/
│       └── required-labels.yaml
├── tests/
│   ├── required-labels_test.rego
│   └── allowed-repos_test.rego
└── Makefile

Writing Rego Unit Tests

Every Rego policy should have unit tests. Use the opa test command from the OPA CLI.

# tests/required-labels_test.rego
package k8srequiredlabels

test_violation_missing_label {
  inp := {
    "review": {
      "object": {
        "metadata": {
          "labels": {
            "app.kubernetes.io/name": "myapp"
          }
        }
      }
    },
    "parameters": {
      "labels": ["app.kubernetes.io/name", "app.kubernetes.io/owner"]
    }
  }
  results := violation with input as inp
  count(results) > 0
}

test_no_violation_all_labels_present {
  inp := {
    "review": {
      "object": {
        "metadata": {
          "labels": {
            "app.kubernetes.io/name": "myapp",
            "app.kubernetes.io/owner": "team-platform"
          }
        }
      }
    },
    "parameters": {
      "labels": ["app.kubernetes.io/name", "app.kubernetes.io/owner"]
    }
  }
  results := violation with input as inp
  count(results) == 0
}
# Run the tests
opa test ./policies/templates/ ./policies/tests/ -v

# Makefile targets for CI
# Makefile
.PHONY: test-rego lint-rego apply-dryrun

test-rego:
	opa test ./policies/templates/ ./policies/tests/ -v --fail

lint-rego:
	opa check ./policies/templates/ --strict

apply-dryrun:
	kubectl apply -f ./policies/templates/ --dry-run=server
	kubectl apply -f ./policies/constraints/ --dry-run=server

GitHub Actions Integration Example

# .github/workflows/policy-ci.yaml
name: Policy CI
on:
  pull_request:
    paths:
      - 'policies/**'

jobs:
  test-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup OPA
        uses: open-policy-agent/setup-opa@v2
        with:
          version: latest

      - name: Rego Lint
        run: |
          opa check ./policies/templates/ --strict

      - name: Rego Unit Tests
        run: |
          opa test ./policies/templates/ ./policies/tests/ -v --fail

      - name: Validate YAML syntax
        run: |
          for f in $(find policies/ -name '*.yaml'); do
            echo "Validating: $f"
            kubectl apply -f "$f" --dry-run=client 2>&1 || exit 1
          done

      - name: Conftest Policy Check
        uses: instrumenta/conftest-action@main
        with:
          files: policies/constraints/

Troubleshooting Guide

Symptom 1: The Gatekeeper Webhook Stops Responding and All Requests Are Blocked

This is the most critical failure scenario. If every Gatekeeper Pod is down, behavior depends on the failurePolicy setting.

# Check the webhook configuration
kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration -o yaml | \
  grep failurePolicy

# failurePolicy: Fail   → all requests are blocked while Gatekeeper is down (dangerous!)
# failurePolicy: Ignore → policy checks are skipped while Gatekeeper is down

Emergency response procedure:

# 1. Temporarily disable the webhook (emergencies only)
kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration

# 2. Check and recover the Gatekeeper Pods
kubectl get pods -n gatekeeper-system
kubectl describe pod -n gatekeeper-system -l control-plane=controller-manager

# 3. Re-register the webhook once the Pods have recovered (re-apply Helm)
helm upgrade gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --reuse-values

# 4. Confirm the webhook is back
kubectl get validatingwebhookconfiguration | grep gatekeeper

Operational recommendation: in production, set failurePolicy: Ignore so that a Gatekeeper outage cannot escalate into a cluster-wide outage. Because this temporarily disables policy checks while Gatekeeper is down, make sure monitoring alerts are in place.
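For reference, the setting lives on the webhook object itself. A fragment of the ValidatingWebhookConfiguration showing the recommended production value (in practice this object is managed by the Helm chart, and most fields are omitted here):

```yaml
# Fragment only - not a complete, applyable manifest.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gatekeeper-validating-webhook-configuration
webhooks:
  - name: validation.gatekeeper.sh
    failurePolicy: Ignore # skip policy checks if Gatekeeper is unreachable
    timeoutSeconds: 3     # keep admission latency bounded
    # clientConfig, rules, sideEffects, etc. omitted
```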

Symptom 2: ConstraintTemplate Status Stays "Not Ready" After Applying

# Check the ConstraintTemplate status
kubectl get constrainttemplate k8srequiredlabels -o yaml | grep -A 20 status

# Most common cause: a Rego syntax error
# Look for compile errors in the Controller Manager logs
kubectl logs -n gatekeeper-system -l control-plane=controller-manager --tail=50 | \
  grep -i "error\|compile\|template"

Common Rego syntax mistakes:

  • Typos in the input.review.object path
  • Indentation errors when relying on newlines instead of semicolons
  • A violation rule whose return value does not include {"msg": msg}

Symptom 3: Audit Fails to Detect Violations

# Check whether resource sync is configured in the Config
kubectl get config config -n gatekeeper-system -o yaml

# Resources to be audited must be registered in the Config
# Gatekeeper Config: register the resources the audit references
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      - group: ''
        version: 'v1'
        kind: 'Namespace'
      - group: ''
        version: 'v1'
        kind: 'Pod'
      - group: 'apps'
        version: 'v1'
        kind: 'Deployment'

Symptom 4: A Constraint Fails to Exclude a Specific Namespace

# Check excludedNamespaces in the match block
spec:
  match:
    excludedNamespaces:
      - 'kube-system'
      - 'gatekeeper-system'
      - 'cert-manager' # system component namespace
      - 'monitoring' # monitoring stack

Gatekeeper's global configuration can also designate exempt namespaces.

# Set global exemptions via Helm values
helm upgrade gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --set 'exemptNamespaces={kube-system,gatekeeper-system}'

ValidatingAdmissionPolicy Integration (Kubernetes 1.30+)

ValidatingAdmissionPolicy (VAP) went GA in Kubernetes 1.30. Starting with Gatekeeper v3.22, VAP integration has been strengthened: the sync-vap-enforcement-scope flag aligns Gatekeeper's enforcement scope with VAP's.

# Enable VAP integration on Gatekeeper 3.22+
helm upgrade gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --set 'controllerManager.extraArgs={--sync-vap-enforcement-scope=true}'

Because VAP evaluates CEL expressions inside the API server, with no external webhook round-trip, its latency overhead is small. An effective hybrid strategy is to handle simple policies with VAP and complex cross-resource policies with Gatekeeper.
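As a sketch of the "simple policies in VAP" half of that split, a required-label check expressed as a native ValidatingAdmissionPolicy plus binding (policy name, label key, and namespace selector are illustrative):

```yaml
# Sketch: require an owner label on Deployments, enforced by the API server itself.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-owner-label
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ['apps']
        apiVersions: ['v1']
        operations: ['CREATE', 'UPDATE']
        resources: ['deployments']
  validations:
    - expression: "has(object.metadata.labels) && 'app.kubernetes.io/owner' in object.metadata.labels"
      message: "Deployments must carry the app.kubernetes.io/owner label."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-owner-label-binding
spec:
  policyName: require-owner-label
  validationActions: ['Deny']
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: production # illustrative selector
```

The same staged-rollout idea applies here too: validationActions also accepts Warn and Audit, mirroring Gatekeeper's warn and dryrun modes.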

Operations Checklists

RBAC Checklist

  • Are any regular users members of the system:masters group?
  • Is usage of the escalate, bind, and impersonate verbs being monitored?
  • Does every ServiceAccount hold only the minimum permissions it needs?
  • Are workloads avoiding direct use of the default ServiceAccount?
  • Is automountServiceAccountToken: false set on Pods that do not need a token?
  • Are namespace-scoped RoleBindings preferred over ClusterRoleBindings?
  • Are the effective rules of Aggregated ClusterRoles reviewed regularly?
  • Are RBAC-related audit logs being collected?

OPA Gatekeeper Checklist

  • Is Gatekeeper running with at least 3 replicas?
  • Is failurePolicy configured appropriately for the environment (production: Ignore recommended)?
  • Does every ConstraintTemplate have Rego unit tests?
  • Are new policies always rolled out first as dryrun or warn?
  • Is the Audit Controller healthy, and are violations being monitored?
  • Are system namespaces such as kube-system and gatekeeper-system excluded?
  • Is Gatekeeper's CPU/memory usage being monitored?
  • Is webhook response latency being monitored (P99 latency)?
  • Do policy changes go through Git-based PR review?
  • Is the emergency webhook-disable procedure documented?

Incident Response Priorities

| Priority | Incident | Immediate action | Root-cause fix |
| --- | --- | --- | --- |
| P0 | Webhook outage blocks all deployments | Delete the webhook to restore service | Change failurePolicy, add replicas |
| P1 | Audit not running; violations go undetected | Restart the Audit Controller | Verify Config sync settings; check the log pipeline |
| P2 | A policy produces false positives | Switch the Constraint to dryrun | Fix the Rego logic and strengthen tests |
| P3 | A missing policy let violating resources deploy | Audit manually and fix the resources | Add a ConstraintTemplate; strengthen the CI pipeline |

End-to-End Scenario

Scenario: Applying an Image-Registry Restriction Policy to a Production Cluster

# Step 1: Assess the current state - which registries are images pulled from?
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | \
  sort | uniq -c | sort -rn | head -20

# Step 2: Deploy the ConstraintTemplate
kubectl apply -f policies/templates/k8s-allowed-repos.yaml

# Step 3: Deploy the Constraint in dryrun mode
cat <<EOF | kubectl apply -f -
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-repos-production
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["production"]
  parameters:
    repos:
      - "registry.example.com/"
      - "gcr.io/my-project/"
EOF

# Step 4: After a day or two, review the violations
kubectl get k8sallowedrepos allowed-repos-production -o json | \
  jq '.status.totalViolations'

# Step 5: Fix the violating resources, then switch to warn
kubectl patch k8sallowedrepos allowed-repos-production --type=merge \
  -p '{"spec":{"enforcementAction":"warn"}}'

# Step 6: After gathering feedback from development teams, switch to deny
kubectl patch k8sallowedrepos allowed-repos-production --type=merge \
  -p '{"spec":{"enforcementAction":"deny"}}'

# Step 7: Verify the policy is enforced
kubectl run test-blocked --image=docker.io/nginx:latest -n production
# Error: admission webhook "validation.gatekeeper.sh" denied the request

Closing Thoughts

RBAC and OPA Gatekeeper cover different layers of Kubernetes security. RBAC controls "who can access what," while Gatekeeper validates "which resources are acceptable." Only by operating both layers together do you get a complete policy system.

To recap the core principles:

  1. Apply least privilege rigorously in RBAC. Prefer RoleBinding over ClusterRoleBinding, and explicit resources/verbs over wildcards.
  2. Manage Gatekeeper policies strictly as code. Version them in Git, review them via PRs, and run Rego tests automatically in CI.
  3. Always follow the staged rollout (dryrun, warn, deny). Deploying deny straight to production is how incidents start.
  4. Prepare incident-response procedures in advance. Put the webhook-deletion command in a runbook and rehearse it regularly.

          msg := sprintf("Privileged containers are not allowed: '%v'", [container.name])
        }

        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          container.securityContext.privileged == true
          not is_exempt(container.image)
          msg := sprintf("Privileged initContainers are not allowed: '%v'", [container.name])
        }

        is_exempt(image) {
          exempt := input.parameters.exemptImages[_]
          image == exempt
        }
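
Unlike Examples 1 and 2, no Constraint is shown for this template. A minimal sketch that would activate it, following the same phased-rollout convention used elsewhere in this guide (the constraint name and exempt image are illustrative assumptions):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivileged
metadata:
  name: psp-privileged-containers
spec:
  # Start in dryrun and promote to deny per the phased rollout strategy
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: ['']
        kinds: ['Pod']
    excludedNamespaces: ['kube-system', 'gatekeeper-system']
  parameters:
    exemptImages:
      - 'registry.example.com/node-exporter:v1.8.0' # hypothetical exemption
```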

enforcementAction Strategy: From Audit to Deny

The core operational strategy for Gatekeeper is phased rollout. Enforcing deny from the start can mass-block existing workloads.

Phased Rollout Flow

Phase 1: dryrun  ->  Phase 2: warn  ->  Phase 3: deny
(audit only)         (show warnings)     (actual blocking)
| enforcementAction | Behavior | When to Use |
| --- | --- | --- |
| dryrun | Records violations in audit results only; requests are allowed | Initial policy deployment, impact assessment phase |
| warn | Returns warning messages on violations but allows requests | Phase for notifying dev teams before enforcement |
| deny | Rejects violating requests | Production enforcement after sufficient testing |

Checking Audit Results

# Check violations for a specific Constraint
kubectl get k8srequiredlabels deployment-must-have-labels -o yaml

# Filter only violating resources (using jq)
kubectl get k8srequiredlabels deployment-must-have-labels -o json | \
  jq '.status.violations[] | {name: .name, namespace: .namespace, message: .message}'

# Check Gatekeeper audit logs
kubectl logs -n gatekeeper-system -l control-plane=audit-controller --tail=100 | \
  grep '"process":"audit"'

Safe Method to Transition from dryrun to deny

#!/bin/bash
# safe-enforcement-switch.sh
# Script to check violations before transitioning dryrun -> deny

CONSTRAINT_KIND=$1
CONSTRAINT_NAME=$2

echo "=== Checking current violation count ==="
VIOLATIONS=$(kubectl get "${CONSTRAINT_KIND}" "${CONSTRAINT_NAME}" -o json | \
  jq '.status.totalViolations // 0')  # default to 0 if audit has not populated status yet

echo "Total violations: ${VIOLATIONS}"

if [ "${VIOLATIONS}" -gt 0 ]; then
  echo ""
  echo "=== Violating resource list ==="
  kubectl get ${CONSTRAINT_KIND} ${CONSTRAINT_NAME} -o json | \
    jq -r '.status.violations[] | "\(.namespace)/\(.name): \(.message)"'
  echo ""
  echo "[WARNING] Violating resources exist. Switching to deny will block updates to those resources."
  echo "Fix the violating resources first."
  exit 1
fi

echo ""
echo "No violations found. Switching to deny."
kubectl patch ${CONSTRAINT_KIND} ${CONSTRAINT_NAME} --type=merge \
  -p '{"spec":{"enforcementAction":"deny"}}'
echo "Transition complete."

Gatekeeper vs Kyverno: Policy Engine Selection Guide

OPA Gatekeeper and Kyverno are the two leading Kubernetes policy engines; choosing the one that fits your team's skills and requirements matters.

| Comparison Item | OPA Gatekeeper | Kyverno |
| --- | --- | --- |
| Policy language | Rego (dedicated language) | YAML (Kubernetes-native) |
| CNCF stage | Graduated (OPA) | Incubating |
| Validating webhook | Supported | Supported |
| Mutating webhook | Supported (Assign, AssignMetadata) | Supported (native) |
| Resource generation | Not supported | Supported |
| Image signature verification | Requires external data integration | Built-in (Cosign, Notary) |
| Audit capability | Built-in (periodic scan) | Built-in (Policy Report CRD) |
| External data integration | External Data Provider API | API call support |
| Multi-cluster | External tools such as Config Sync | Limited built-in support |
| ValidatingAdmissionPolicy integration | From v3.22 | Supported |
| Learning curve | High (Rego) | Low (YAML) |
| Expressiveness | Very high (complex logic possible) | Medium (supplemented by CEL) |
| Resource usage | High (multiple Pods) | Medium (single controller) |

Selection criteria summary:

  • Teams already familiar with Rego, needing complex cross-resource policies: Gatekeeper
  • Teams familiar with Kubernetes YAML, where Mutation/Generation is core: Kyverno
  • Large enterprise environments leveraging the OPA ecosystem (Styra DAS, etc.): Gatekeeper
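
To make the learning-curve difference concrete, here is a sketch of the required-labels policy from Example 1 expressed as a Kyverno ClusterPolicy. The policy name is illustrative, and the `pattern` block uses Kyverno's validate-pattern style (`?*` meaning a non-empty value):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-standard-labels
spec:
  validationFailureAction: Audit # Kyverno's analogue of dryrun; switch to Enforce later
  rules:
    - name: check-required-labels
      match:
        any:
          - resources:
              kinds: ['Deployment']
      validate:
        message: 'app.kubernetes.io/name and app.kubernetes.io/owner labels are required.'
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: '?*'
              app.kubernetes.io/owner: '?*'
```

The same rule that took a ConstraintTemplate plus Rego in Gatekeeper fits in one Kubernetes-native YAML resource, which is the trade-off the table above summarizes.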

Integrating Policies into CI/CD Pipelines

Git-Based Policy Management Structure

policies/
├── templates/
│   ├── k8s-required-labels.yaml
│   ├── k8s-allowed-repos.yaml
│   └── k8s-psp-privileged.yaml
├── constraints/
│   ├── production/
│   │   ├── required-labels.yaml
│   │   └── allowed-repos.yaml
│   └── staging/
│       └── required-labels.yaml
├── tests/
│   ├── required-labels_test.rego
│   └── allowed-repos_test.rego
└── Makefile

Writing Rego Unit Tests

Rego policies must have unit tests. Use the OPA CLI's opa test command.

# tests/required-labels_test.rego
package k8srequiredlabels

# `input` is a reserved document in Rego and cannot be shadowed inside a rule body,
# so bind the test document to a local variable instead.
test_violation_missing_label {
  test_input := {
    "review": {
      "object": {
        "metadata": {
          "labels": {
            "app.kubernetes.io/name": "myapp"
          }
        }
      }
    },
    "parameters": {
      "labels": ["app.kubernetes.io/name", "app.kubernetes.io/owner"]
    }
  }
  results := violation with input as test_input
  count(results) > 0
}

test_no_violation_all_labels_present {
  test_input := {
    "review": {
      "object": {
        "metadata": {
          "labels": {
            "app.kubernetes.io/name": "myapp",
            "app.kubernetes.io/owner": "team-platform"
          }
        }
      }
    },
    "parameters": {
      "labels": ["app.kubernetes.io/name", "app.kubernetes.io/owner"]
    }
  }
  results := violation with input as test_input
  count(results) == 0
}
# Run tests
opa test ./policies/templates/ ./policies/tests/ -v

# Makefile targets for CI
# Makefile
.PHONY: test-rego lint-rego apply-dryrun

test-rego:
	opa test ./policies/templates/ ./policies/tests/ -v

lint-rego:
	opa check ./policies/templates/ --strict

apply-dryrun:
	kubectl apply -f ./policies/templates/ --dry-run=server
	kubectl apply -f ./policies/constraints/ --dry-run=server

GitHub Actions Integration Example

# .github/workflows/policy-ci.yaml
name: Policy CI
on:
  pull_request:
    paths:
      - 'policies/**'

jobs:
  test-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup OPA
        uses: open-policy-agent/setup-opa@v2
        with:
          version: latest

      - name: Rego Lint
        run: |
          opa check ./policies/templates/ --strict

      - name: Rego Unit Tests
        run: |
          opa test ./policies/templates/ ./policies/tests/ -v

      - name: Validate YAML syntax
        run: |
          for f in $(find policies/ -name '*.yaml'); do
            echo "Validating: $f"
            kubectl apply -f "$f" --dry-run=client 2>&1 || exit 1
          done

      - name: Conftest Policy Check
        uses: instrumenta/conftest-action@main
        with:
          files: policies/constraints/

Troubleshooting Guide

Symptom 1: Gatekeeper Webhook Not Responding, Blocking All Requests

This is the most critical failure scenario. If all Gatekeeper Pods go down, behavior varies depending on the failurePolicy setting.

# Check webhook configuration
kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration -o yaml | \
  grep failurePolicy

# failurePolicy: Fail -> All requests blocked during Gatekeeper outage (dangerous!)
# failurePolicy: Ignore -> Policy validation skipped during Gatekeeper outage

Emergency Response Procedure:

# 1. Temporarily disable webhook (emergency)
kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration

# 2. Check Gatekeeper Pod status and recover
kubectl get pods -n gatekeeper-system
kubectl describe pod -n gatekeeper-system -l control-plane=controller-manager

# 3. Re-register webhook after Pod recovery (Helm reapply)
helm upgrade gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --reuse-values

# 4. Verify webhook re-registration
kubectl get validatingwebhookconfiguration | grep gatekeeper

Operational recommendation: In production environments, set failurePolicy: Ignore to prevent Gatekeeper outages from cascading into cluster-wide outages. However, since policy validation is temporarily disabled when Gatekeeper is down, monitoring alerts must be configured.
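
If you manage Gatekeeper via Helm, this recommendation can be pinned in a values file instead of patched by hand. A sketch, assuming the chart's `validatingWebhookFailurePolicy` and `replicas` keys (verify the exact key names against your chart version with `helm show values gatekeeper/gatekeeper`):

```yaml
# values-production.yaml (key names assumed from the Gatekeeper Helm chart)
replicas: 3
validatingWebhookFailurePolicy: Ignore
```

Applied with `helm upgrade gatekeeper gatekeeper/gatekeeper -n gatekeeper-system -f values-production.yaml`, which keeps the setting from silently reverting on the next chart upgrade.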

Symptom 2: ConstraintTemplate Status Shows "Not Ready" After Applying

# Check ConstraintTemplate status
kubectl get constrainttemplate k8srequiredlabels -o yaml | grep -A 20 status

# Common cause: Rego syntax errors
# Check compilation errors in Controller Manager logs
kubectl logs -n gatekeeper-system -l control-plane=controller-manager --tail=50 | \
  grep -i "error\|compile\|template"

Common Rego syntax mistakes:

  • Typos in input.review.object paths
  • Missing separators between expressions in a rule body (each expression must end on its own line or with a semicolon)
  • violation rules that do not return the expected {"msg": msg} object

Symptom 3: Audit Not Detecting Violations

# Check if resource sync is configured in Config
kubectl get config config -n gatekeeper-system -o yaml

# When audit-from-cache is enabled, resources must be registered in Config to be audited
# Gatekeeper Config: register resources for audit/sync reference
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  sync:
    syncOnly:
      - group: ''
        version: 'v1'
        kind: 'Namespace'
      - group: ''
        version: 'v1'
        kind: 'Pod'
      - group: 'apps'
        version: 'v1'
        kind: 'Deployment'

Symptom 4: Constraint Not Excluding Specific Namespaces

# Check excludedNamespaces in the match block
spec:
  match:
    excludedNamespaces:
      - 'kube-system'
      - 'gatekeeper-system'
      - 'cert-manager' # System component namespace
      - 'monitoring' # Monitoring stack

Additionally, you can specify exempt namespaces in Gatekeeper's global configuration.

# Set global exclusions via Helm values
helm upgrade gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --set 'exemptNamespaces={kube-system,gatekeeper-system}'
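
The same effect can also be declared on the Gatekeeper Config resource itself, which exempts namespaces from every Gatekeeper process rather than a single Constraint. A sketch of the `spec.match` exemption form:

```yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: gatekeeper-system
spec:
  match:
    - excludedNamespaces: ['kube-system', 'gatekeeper-system']
      processes: ['*'] # applies to webhook, audit, sync, and mutation
```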

ValidatingAdmissionPolicy Integration (Kubernetes 1.30+)

Starting with Kubernetes 1.30, ValidatingAdmissionPolicy (VAP) became GA. From Gatekeeper v3.22, integration with VAP has been strengthened, allowing the sync-vap-enforcement-scope flag to align Gatekeeper's enforcement scope with VAP's enforcement scope.

# Enable VAP integration in Gatekeeper 3.22+
helm upgrade gatekeeper gatekeeper/gatekeeper \
  --namespace gatekeeper-system \
  --set 'controllerManager.extraArgs={--sync-vap-enforcement-scope=true}'

VAP performs validation using CEL expressions within the API server without external webhook calls, resulting in lower latency. A hybrid strategy where simple policies use VAP and complex cross-resource policies use Gatekeeper is effective.
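
As an illustration of the "simple policies in VAP" half of that hybrid strategy, here is a sketch of the required-owner-label check from Example 1 rewritten as a CEL-based ValidatingAdmissionPolicy (resource names are illustrative):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-owner-label
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ['apps']
        apiVersions: ['v1']
        operations: ['CREATE', 'UPDATE']
        resources: ['deployments']
  validations:
    # CEL evaluated inside the API server -- no webhook round trip
    - expression: "has(object.metadata.labels) && 'app.kubernetes.io/owner' in object.metadata.labels"
      message: 'Deployments must carry the app.kubernetes.io/owner label.'
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-owner-label-binding
spec:
  policyName: require-owner-label
  validationActions: ['Deny']
```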

Operations Checklist

RBAC Checklist

  • Is the system:masters group free of regular users?
  • Is usage of the escalate, bind, and impersonate verbs being monitored?
  • Are all ServiceAccounts granted only the minimum permissions they need?
  • Are workloads avoiding direct use of the default ServiceAccount?
  • Is automountServiceAccountToken: false set on Pods that don't need the token?
  • Are namespace-scoped RoleBindings preferred over ClusterRoleBindings?
  • Are the effective rules of aggregated ClusterRoles reviewed regularly?
  • Are RBAC-related audit logs being collected?

OPA Gatekeeper Checklist

  • Are Gatekeeper Pods running with 3 or more replicas?
  • Is the failurePolicy setting appropriate for the environment? (Production: Ignore recommended)
  • Do all ConstraintTemplates have Rego unit tests?
  • Are new policies always deployed first as dryrun or warn?
  • Is the Audit Controller operating normally, and are violations being monitored?
  • Are system namespaces such as kube-system and gatekeeper-system excluded?
  • Is Gatekeeper resource (CPU/memory) usage being monitored?
  • Is webhook response latency (P99) being monitored?
  • Do policy changes go through Git-based PR reviews?
  • Is the emergency webhook-deactivation procedure documented?

Incident Response Priority

| Priority | Failure Scenario | Immediate Action | Root Cause Response |
| --- | --- | --- | --- |
| P0 | All deployments blocked due to webhook failure | Delete the webhook and restore service | Change failurePolicy, increase replicas |
| P1 | Violations undetected because audit is not working | Restart the Audit Controller | Verify Config sync settings, check the log pipeline |
| P2 | False positive in a specific policy | Switch that Constraint to dryrun | Fix the Rego logic and strengthen tests |
| P3 | Violating resource deployed due to a missing policy | Manual audit, then fix the resources | Add a ConstraintTemplate, strengthen the CI pipeline |

End-to-End Practical Scenario

Scenario: Applying Image Registry Restriction Policy to Production Cluster

# Step 1: Assess current state - check which registry images are in use
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' | \
  sort | uniq -c | sort -rn | head -20

# Step 2: Deploy ConstraintTemplate
kubectl apply -f policies/templates/k8s-allowed-repos.yaml

# Step 3: Deploy Constraint in dryrun mode
cat <<EOF | kubectl apply -f -
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-repos-production
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["production"]
  parameters:
    repos:
      - "registry.example.com/"
      - "gcr.io/my-project/"
EOF

# Step 4: Check violations after 1-2 days
kubectl get k8sallowedrepos allowed-repos-production -o json | \
  jq '.status.totalViolations'

# Step 5: Fix violating resources then switch to warn
kubectl patch k8sallowedrepos allowed-repos-production --type=merge \
  -p '{"spec":{"enforcementAction":"warn"}}'

# Step 6: Switch to deny after collecting dev team feedback
kubectl patch k8sallowedrepos allowed-repos-production --type=merge \
  -p '{"spec":{"enforcementAction":"deny"}}'

# Step 7: Verify policy enforcement
kubectl run test-blocked --image=docker.io/nginx:latest -n production
# Error: admission webhook "validation.gatekeeper.sh" denied the request

Conclusion

RBAC and OPA Gatekeeper handle different layers of Kubernetes security. RBAC controls "who can access" while Gatekeeper validates "which resources are allowed." Operating both layers together is what creates a complete policy framework.

Here are the key principles once more:

  1. Apply the principle of least privilege rigorously to RBAC. Prefer RoleBinding over ClusterRoleBinding, and explicit resource/verb specification over wildcards.
  2. Manage Gatekeeper policies as code. Version-control them in a Git repository, go through PR reviews, and automatically run Rego tests in CI.
  3. Always follow phased rollout (dryrun, warn, deny). Deploying deny directly to production is the beginning of an incident.
  4. Prepare incident response procedures in advance. Include the webhook deletion command in your runbook and drill regularly.
