DevOps Internal Developer Platform Playbook 2026

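Play 1: Determining When You Need an IDP

An Internal Developer Platform (IDP) is an internal platform that lets developers create, deploy, and monitor services through self-service, without filing infrastructure requests. Gartner predicts that by 2026, 80% of large software engineering organizations will have established platform engineering teams.

However, not every organization needs an IDP. If three or more of the following conditions apply, it is time to consider adopting one.

Adoption signal checklist:

  • The number of services exceeds 20
  • Creating a new service takes more than one week
  • Each team has a different CI/CD pipeline and there is no standard
  • Onboarding takes more than 2 weeks (dev environment setup, permission requests, etc.)
  • The infrastructure team spends more than 80% of its time handling repetitive Jira tickets
  • Rollback procedures after a failed deployment differ across teams or are undocumented
  • Cost attribution is impossible

If fewer than three apply, simple shell-script automation or standardized GitHub Actions templates are sufficient.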

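Play 2: Building the Platform Team and Defining Roles

An IDP is a product. It should not be a side project built by a few infrastructure engineers; it requires a dedicated team operating with a product mindset.

Team Composition (for organizations of 50–200 developers)

| Role | Headcount | Key Responsibilities |
| --- | --- | --- |
| Platform PM | 1 | Collecting developer requirements, roadmap management, tracking adoption rate |
| Platform Engineer | 2–3 | Infrastructure abstraction, API/UI development, golden path design |
| SRE / DevOps | 1–2 | Monitoring pipelines, on-call, incident response automation |
| Developer Advocate | 0.5 (shared) | Documentation, onboarding guides, internal training |

Core Principles:

  • The platform team's customer is the internal developer. Measure NPS (Net Promoter Score) every quarter.
  • Drive adoption through appeal, not enforcement. Following the golden path should take 30 minutes; the off-path route remains available but keeps its manual cost of roughly 2 weeks.
  • Keep the feedback loop under 2 weeks. Every developer feature request should receive at least a minimal response (an implementation plan or a reason for rejection) within 2 weeks.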

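Play 3: Choosing the Technology Stack

Backstage vs. Build-Your-Own vs. SaaS Comparison

| Criteria | Backstage (Open Source) | Build-Your-Own | SaaS (Port/Cortex, etc.) |
| --- | --- | --- | --- |
| Initial Cost | Medium (3–6 months to build) | High (6–12 months) | Low (start immediately) |
| Customization | High (plugin ecosystem) | Highest | Limited |
| Maintenance Burden | High (upgrades, security patches) | Very High | None (vendor responsibility) |
| Org Size Fit | 100+ | 500+ | 50–300 |
| Vendor Lock-in | None | None | High |
| Plugins/Integrations | 2000+ plugins | Only what you need | Vendor-provided scope |

Recommended Strategy: For organizations with fewer than 100 people, start with SaaS. For 100–500, adopt Backstage but also consider a managed Backstage offering such as Roadie. For 500+, have a dedicated team customize and operate Backstage.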

Play 4: Backstage Setup and Software Catalog

Installing Backstage

# Create a new project using the Backstage CLI
npx @backstage/create-app@latest --skip-install

# Resulting directory structure
my-backstage/
├── app-config.yaml           # Core configuration file
├── app-config.production.yaml
├── packages/
│   ├── app/                  # Frontend (React)
│   └── backend/              # Backend (Node.js)
├── plugins/                  # Custom plugins
├── catalog-info.yaml         # Catalog registration for this project itself
└── package.json

# Install dependencies and run
cd my-backstage
yarn install
yarn dev
# Access at http://localhost:3000

Key app-config.yaml Settings

# app-config.yaml
app:
  title: 'MyOrg Developer Platform'
  baseUrl: http://localhost:3000

organization:
  name: 'MyOrg'

backend:
  baseUrl: http://localhost:7007
  database:
    client: pg
    connection:
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}

# GitHub integration (automatic service catalog discovery)
integrations:
  github:
    - host: github.com
      token: ${GITHUB_TOKEN}

# Software Catalog settings
catalog:
  import:
    entityFilename: catalog-info.yaml
  rules:
    # Template is included so the Software Templates in Play 5 can be registered
    - allow: [Component, System, API, Resource, Location, Group, User, Template]
  locations:
    # Automatically discover catalog-info.yaml from all repositories in the organization
    - type: github-discovery
      target: https://github.com/my-org/*/blob/main/catalog-info.yaml
    # Manual registration
    - type: file
      target: ./catalog-entities/all-systems.yaml

# Authentication (GitHub OAuth)
auth:
  environment: development
  providers:
    github:
      development:
        clientId: ${GITHUB_OAUTH_CLIENT_ID}
        clientSecret: ${GITHUB_OAUTH_CLIENT_SECRET}

Service Catalog Registration Standard

Place a catalog-info.yaml at the root of each service repository.

# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: 'Order processing microservice'
  annotations:
    github.com/project-slug: my-org/order-service
    backstage.io/techdocs-ref: dir:.
    datadoghq.com/dashboard-url: https://app.datadoghq.com/dashboard/abc-123
    pagerduty.com/service-id: PXXXXXX
    argocd/app-name: order-service-prod
  tags:
    - java
    - spring-boot
    - tier-1
  links:
    - url: https://order.internal.example.com
      title: Production URL
    - url: https://grafana.internal.example.com/d/order-service
      title: Grafana Dashboard
spec:
  type: service
  lifecycle: production
  owner: team-commerce
  system: commerce-platform
  dependsOn:
    - component:payment-service
    - resource:orders-database
  providesApis:
    - order-api
  consumesApis:
    - payment-api
    - inventory-api
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: order-api
  description: 'Order REST API'
spec:
  type: openapi
  lifecycle: production
  owner: team-commerce
  definition:
    $text: ./docs/openapi.yaml

Play 5: Service Templates (Scaffolding)

Backstage Software Templates create new services with a standardized structure. A service with CI/CD, monitoring, and catalog registration already configured can be ready within 30 minutes.

# templates/spring-boot-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: spring-boot-service
  title: 'Spring Boot Microservice'
  description: 'Creates a Spring Boot service with auto-configured CI/CD, monitoring, and catalog registration'
  tags:
    - java
    - spring-boot
    - recommended
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service Information
      required:
        - serviceName
        - ownerTeam
        - tier
      properties:
        serviceName:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: 'Only lowercase letters, numbers, and hyphens allowed'
        ownerTeam:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['tier-1', 'tier-2', 'tier-3']
          enumNames: ['Tier 1 (99.99% SLO)', 'Tier 2 (99.9% SLO)', 'Tier 3 (99% SLO)']
        javaVersion:
          title: Java Version
          type: string
          default: '21'
          enum: ['17', '21']

    - title: Infrastructure Settings
      properties:
        database:
          title: Database
          type: string
          default: 'postgresql'
          enum: ['postgresql', 'mysql', 'none']
        messageQueue:
          title: Message Queue
          type: string
          default: 'none'
          enum: ['kafka', 'rabbitmq', 'none']
        replicaCount:
          title: Default Replica Count
          type: integer
          default: 3
          minimum: 1
          maximum: 20

  steps:
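    # 1. Generate repository contents from the template skeleton
    - id: fetch-template
      name: Generate Template Code
      action: fetch:template
      input:
        url: ./skeleton
        values:
          serviceName: ${{ parameters.serviceName }}
          ownerTeam: ${{ parameters.ownerTeam }}
          tier: ${{ parameters.tier }}
          javaVersion: ${{ parameters.javaVersion }}
          database: ${{ parameters.database }}
          # messageQueue is collected in the form, so pass it to the skeleton as well
          messageQueue: ${{ parameters.messageQueue }}
          replicaCount: ${{ parameters.replicaCount }}

    # 2. Create GitHub repository
    - id: publish
      name: Create GitHub Repository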
      action: publish:github
      input:
        allowedHosts: ['github.com']
        repoUrl: github.com?owner=my-org&repo=${{ parameters.serviceName }}
        defaultBranch: main
        repoVisibility: internal
        protectDefaultBranch: true
        requireCodeOwnerReviews: true

    # 3. Register ArgoCD application
    - id: register-argocd
      name: Register ArgoCD Application
      action: argocd:create-resources
      input:
        appName: ${{ parameters.serviceName }}-prod
        argoInstance: main
        namespace: ${{ parameters.serviceName }}
        repoUrl: https://github.com/my-org/${{ parameters.serviceName }}
        path: k8s/overlays/production

    # 4. Register in Backstage catalog
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

  output:
    links:
      - title: Repository
        url: ${{ steps['publish'].output.remoteUrl }}
      - title: Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}

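Play 6: Unifying Documentation with TechDocs

Consolidating scattered documentation into Backstage TechDocs lets developers read technical documentation directly from the service catalog.

# mkdocs.yml (root of each service repository)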
site_name: order-service
nav:
  - Home: index.md
  - Architecture: architecture.md
  - API Reference: api.md
  - Runbook: runbook.md
  - ADR:
      - adr/001-database-choice.md
      - adr/002-event-schema.md

plugins:
  - techdocs-core

<!-- docs/runbook.md -->

# Order Service Operations Runbook

## Incident Response

### Order Processing Delay (P95 > 500ms)

1. Check the Grafana dashboard: [Link]
2. Check DB connection pool status:
   kubectl exec -it deploy/order-service -- curl localhost:8080/actuator/metrics/hikaricp.connections.active
3. If the connection pool is saturated:
   kubectl scale deploy/order-service --replicas=6
4. Check DB slow queries:
   SELECT pid, now() - query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '5s';

### Order Creation Failure (HTTP 500)

1. Check error logs:
   kubectl logs deploy/order-service --tail=100 | grep ERROR
2. Response by error code:
   - ORDER-001: Payment service connection failure -> Check payment-service status
   - ORDER-002: Insufficient inventory -> Check inventory-service synchronization
   - ORDER-003: DB deadlock -> Check transaction isolation level

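Play 7: Self-Service Infrastructure Provisioning

Enable developers to provision databases, message queues, caches, and similar resources directly from the Backstage UI. The actual infrastructure is created through Terraform and GitOps.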

# templates/provision-postgresql/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: provision-postgresql
  title: 'PostgreSQL Database Provisioning'
  description: 'Self-service provisioning of RDS PostgreSQL instances'
spec:
  owner: team-platform
  type: resource
  parameters:
    - title: Database Settings
      required:
        - dbName
        - environment
        - instanceClass
      properties:
        dbName:
          title: DB Name
          type: string
          pattern: '^[a-z][a-z0-9_]*$'
        environment:
          title: Environment
          type: string
          enum: ['dev', 'staging', 'production']
        instanceClass:
          title: Instance Size
          type: string
          default: 'db.r7g.large'
          enum:
            - 'db.t4g.medium'
            - 'db.r7g.large'
            - 'db.r7g.xlarge'
            - 'db.r7g.2xlarge'
          enumNames:
            - 'Small (2 vCPU, 4GB) - dev/staging'
            - 'Medium (2 vCPU, 16GB) - production'
            - 'Large (4 vCPU, 32GB) - production'
            - 'XLarge (8 vCPU, 64GB) - high traffic'
        storageGb:
          title: Storage (GB)
          type: integer
          default: 100
          minimum: 20
          maximum: 16000
        multiAz:
          title: Multi-AZ Deployment
          type: boolean
          default: false

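  steps:
    # Render the Terraform skeleton into the workspace so the PR step has files to commit.
    # Assumes a terraform-template/ directory next to this template.yaml.
    - id: fetch-terraform
      name: Render Terraform Template
      action: fetch:template
      input:
        url: ./terraform-template
        targetPath: ./terraform-template
        values:
          dbName: ${{ parameters.dbName }}
          environment: ${{ parameters.environment }}
          instanceClass: ${{ parameters.instanceClass }}
          storageGb: ${{ parameters.storageGb }}
          multiAz: ${{ parameters.multiAz }}

    - id: create-terraform-pr
      name: Create Terraform PR
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=my-org&repo=infrastructure
        branchName: provision-db-${{ parameters.dbName }}
        title: 'DB Provisioning: ${{ parameters.dbName }} (${{ parameters.environment }})'
        description: |
          Automatically generated DB provisioning request.

          - DB Name: ${{ parameters.dbName }}
          - Environment: ${{ parameters.environment }}
          - Instance: ${{ parameters.instanceClass }}
          - Storage: ${{ parameters.storageGb }}GB
          - Multi-AZ: ${{ parameters.multiAz }}
        targetPath: terraform/rds/${{ parameters.environment }}/${{ parameters.dbName }}
        sourcePath: ./terraform-template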

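Play 8: Measuring Platform Success Metrics

Without measuring the IDP's results, you cannot demonstrate the return on the investment. Track the following metrics quarterly.

Key Performance Indicators (KPIs)

# platform_metrics.py - Platform KPI dashboard data collection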

import requests
from datetime import datetime, timedelta

class PlatformMetrics:
    def __init__(self, github_token: str, backstage_url: str):
        self.github = github_token
        self.backstage = backstage_url

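    def service_creation_lead_time(self) -> dict:
        """New service creation lead time (target: under 30 minutes)"""
        # Extract from Backstage scaffolder task records.
        # NOTE: the query parameter and field names used below (createdAfter, items,
        # completedAt) depend on the Backstage version; verify them against your instance.
        response = requests.get(
            f"{self.backstage}/api/scaffolder/v2/tasks",
            params={"createdAfter": (datetime.now() - timedelta(days=90)).isoformat()},
            timeout=30,
        )
        response.raise_for_status()
        tasks = response.json()["items"]

        lead_times = []
        for task in tasks:
            if task["status"] == "completed":
                start = datetime.fromisoformat(task["createdAt"])
                end = datetime.fromisoformat(task["completedAt"])
                lead_times.append((end - start).total_seconds() / 60)

        if not lead_times:
            # No completed tasks in the window; avoid indexing an empty list
            return {"median_minutes": None, "p95_minutes": None, "total_services_created": 0}

        lead_times.sort()
        return {
            "median_minutes": lead_times[len(lead_times) // 2],
            "p95_minutes": lead_times[int(len(lead_times) * 0.95)],
            "total_services_created": len(lead_times),
        }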

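    def golden_path_adoption_rate(self) -> dict:
        """Golden path adoption rate (target: 80% or above)"""
        # Query reusable workflow usage from the GitHub API, following pagination
        repos = []
        page = 1
        while True:
            resp = requests.get(
                "https://api.github.com/orgs/my-org/repos",
                headers={"Authorization": f"token {self.github}"},
                params={"per_page": 100, "type": "internal", "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                break
            repos.extend(batch)
            page += 1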

        using_golden_path = 0
        total_active = 0

        for repo in repos:
            if repo["archived"]:
                continue
            total_active += 1
            # Heuristic: a workflow whose file path contains "golden" counts as golden path usage
            workflows = requests.get(
                f"https://api.github.com/repos/my-org/{repo['name']}/actions/workflows",
                headers={"Authorization": f"token {self.github}"}
            ).json()

            for wf in workflows.get("workflows", []):
                if "golden" in wf.get("path", "").lower():
                    using_golden_path += 1
                    break

        return {
            "adoption_rate": using_golden_path / max(total_active, 1),
            "using_golden_path": using_golden_path,
            "total_active_repos": total_active,
        }

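    def developer_nps(self) -> dict:
        """Developer satisfaction NPS (target: 40 or above)"""
        # Quarterly survey results (Google Forms / Typeform, etc.)
        # Integrate via API directly, or enter manually.
        # The values below are placeholder sample data.
        return {
            "nps_score": 42,
            "promoters_pct": 55,
            "detractors_pct": 13,
            "response_rate": 0.72,
            "top_complaints": [
                "Slow build times",
                "Inconvenient log search UI",
                "Insufficient automation for permission requests",
            ]
        }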

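KPI Targets

| Metric | Poor | Fair | Good | Target |
| --- | --- | --- | --- | --- |
| Service creation time | 1 week+ | 1–3 days | 1 hour | 30 minutes |
| Golden path adoption rate | Below 30% | 30–60% | 60–80% | 80%+ |
| Developer NPS | Below 0 | 0–20 | 20–40 | 40+ |
| Onboarding time | 2 weeks+ | 1–2 weeks | 2–5 days | 1 day |
| Infrastructure tickets per month | 50+ | 20–50 | 5–20 | Fewer than 5 |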
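
As a rough illustration of how the golden path adoption row could be applied in a dashboard, the following Python sketch maps a measured rate to one of the four bands. The function name and the handling of boundary values (for example, treating exactly 60% as "good") are assumptions, since the table leaves the boundaries open.

```python
def adoption_band(rate: float) -> str:
    """Classify a golden path adoption rate (0.0-1.0) into a KPI band.

    Boundary handling is an assumption: a value on a boundary falls
    into the better band.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    if rate < 0.30:
        return "poor"
    if rate < 0.60:
        return "fair"
    if rate < 0.80:
        return "good"
    return "target"
```

The same pattern extends to the other rows by swapping the thresholds.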

Play 9: Troubleshooting

Issue 1: Backstage catalog synchronization delay

WARN: Entity refresh for component:order-service took 45s (threshold: 10s)

Cause: GitHub discovery scans hundreds of repositories and runs into the API rate limit.

# Fix: limit the scan scope and lengthen the refresh interval
catalog:
  providers:
    github:
      myOrg:
        organization: 'my-org'
        catalogPath: '/catalog-info.yaml'
        filters:
          repository: '^(?!archived-).*$' # exclude repositories with the archived- prefix
          topic:
            include: ['backstage-enabled'] # topic-based filtering
        schedule:
          frequency: { minutes: 30 } # every 30 minutes (default: 5 minutes)
          timeout: { minutes: 5 }
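
To see why a longer refresh interval helps, a back-of-the-envelope estimate is enough. The sketch below assumes one GitHub API request per repository per discovery cycle, which is a simplification (the real number depends on the provider and the Backstage version), and compares the result with the 5,000 requests per hour that GitHub allows for a personal access token. The function name is hypothetical.

```python
GITHUB_PAT_HOURLY_LIMIT = 5000  # REST requests per hour for a personal access token


def discovery_requests_per_hour(
    repo_count: int, interval_minutes: int, requests_per_repo: int = 1
) -> float:
    """Estimate hourly API requests made by catalog discovery.

    requests_per_repo=1 is an assumed simplification, not a measured value.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    cycles_per_hour = 60 / interval_minutes
    return repo_count * requests_per_repo * cycles_per_hour


# 800 repositories scanned every 5 minutes vs. every 30 minutes
every_5 = discovery_requests_per_hour(800, 5)
every_30 = discovery_requests_per_hour(800, 30)
print(every_5 > GITHUB_PAT_HOURLY_LIMIT, every_30 > GITHUB_PAT_HOURLY_LIMIT)
```

Under these assumptions the 5-minute schedule exceeds the hourly budget while the 30-minute schedule stays well inside it, which is the effect the configuration above aims for.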

Issue 2: Software Template execution failure - GitHub permissions

Error: Resource not accessible by integration
HttpError: 403 - Resource not accessible by integration

Cause: The GitHub App lacks the required permissions, or it has no access to the organization in which the repository is being created.

# Check the GitHub App permissions
# Settings > Developer settings > GitHub Apps > [app name] > Permissions
# Required permissions:
# - Repository: Administration (Read & Write)
# - Repository: Contents (Read & Write)
# - Organization: Members (Read)
# Or, when using a Personal Access Token (PAT), the required scopes are:
# repo, workflow, admin:org

Issue 3: TechDocs build failure

mkdocs build failed: No module named 'techdocs_core'

# Fix: install the plugin in the TechDocs build environment
pip install mkdocs-techdocs-core

# When building with Docker
docker run --rm -v "$(pwd)":/content \
  spotify/techdocs:latest \
  build --site-dir /content/site

# Configure the build method in app-config.yaml
# techdocs:
#   builder: 'external' # build in CI
#   publisher:
#     type: 'awsS3'
#     awsS3:
#       bucketName: 'my-org-techdocs'
#       region: 'ap-northeast-2'

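Issue 4: Platform adoption is not increasing

This is an organizational problem, not a technical one.

Strategies:

  1. Secure champion teams first: select 2–3 early-adopter teams and share their success stories internally.
  2. Remove friction on the golden path: maximize its convenience while leaving the cost of going off-path (manual deployment, manual monitoring setup) as it is.
  3. Do not mandate: instead, hold a monthly "Platform Day" with demos and feedback sessions.
  4. Prove it with metrics: share data such as "teams on the golden path deploy three times as often."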

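Play 10: Implementing the IDP Roadmap in Phases

Trying to build everything at once fails. Build incrementally in three phases.

Phase 1 (months 1–3): Foundation

  • Build the Software Catalog (register all services, teams, and APIs)
  • Standardize the CI golden path (GitHub Actions reusable workflows)
  • 1–2 service creation templates

Phase 2 (months 4–6): Expansion

  • CD golden path (canary deployments with Argo Rollouts)
  • TechDocs integration (runbooks, ADRs)
  • Self-service infrastructure provisioning (DB, cache)
  • Cost tagging and dashboards

Phase 3 (months 7–12): Maturity

  • Automatic security policy enforcement (OPA/Kyverno)
  • Automatic DORA metrics collection and dashboards
  • Internal marketplace (shared libraries, plugins)
  • One-click development environment provisioning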

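Quiz

Q1. What characterizes an organization for which adopting an IDP is premature? Answer: ||It has fewer than 20 services, the infrastructure team spends little of its time on repetitive tickets, and a new service can be created within a week. In that case, simple script automation or standard CI/CD templates are sufficient.||

Q2. Why does a platform team need a Platform PM? Answer: ||An IDP is an internal product, so someone has to collect requirements from the customers (developers), set priorities, and measure adoption. A team made up only of engineers tends to drift toward technology and build features that are technically interesting rather than the ones developers actually need.||

Q3. Why is the dependsOn field in catalog-info.yaml important in the Backstage Software Catalog?

Answer: ||It declares dependencies between services, so the blast radius of an outage can be identified immediately. If order-service has a dependsOn on payment-service, the catalog shows at a glance that a payment-service outage also affects order-service.||

Q4. Why does the Software Template include an ArgoCD application registration step? Answer: ||The GitOps-based deployment pipeline is configured at the same moment the service is created, so deployment starts as soon as the developer pushes code. Without this step, the developer has to file a separate ticket requesting ArgoCD setup, which is a major cause of long onboarding times.||

Q5. When platform adoption is at or below 50%, how do you raise it through appeal rather than enforcement? Answer: ||Leave the inconvenience of going off-path as it is (2 weeks of manual deployment, manual monitoring setup) while maximizing the convenience of the golden path (service creation within 30 minutes, automatic deployment, automatic monitoring). Share the success stories of champion teams and prove the effect with metrics.||

Q6. Why should IDP construction be split into three phases? Answer: ||Building every feature at once takes 12 months or more, and the project may be cancelled before it can demonstrate ROI. Phase 1 (3 months) shows value quickly through the catalog and CI standardization, and that result secures the investment for Phases 2 and 3.||

Q7. What should you watch out for when measuring developer NPS? Answer: ||The response rate should be at least 70% for the metric to be meaningful. Do not look only at the NPS score; analyze the specific complaints of detractors. Track top_complaints quarterly to confirm whether they improve, and announce the items that were fixed to close the feedback loop.||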

DevOps Internal Developer Platform Playbook 2026

Play 1: Determining When You Need an IDP

An Internal Developer Platform (IDP) is an internal platform that lets developers create, deploy, and monitor services through self-service, without filing infrastructure requests. Gartner predicts that by 2026, 80% of large software engineering organizations will have established platform engineering teams.

However, not every organization needs an IDP. If three or more of the following conditions apply, it's time to consider adopting an IDP.

Adoption Signal Checklist:

  • Number of services exceeds 20
  • Creating a new service takes more than one week
  • Each team has different CI/CD pipelines with no standards
  • Onboarding takes more than 2 weeks (dev environment setup, permission requests, etc.)
  • The infrastructure team spends more than 80% of its time handling repetitive Jira tickets
  • Rollback procedures after deployment failures differ across teams or are undocumented
  • Cost attribution is impossible

If fewer than three conditions apply, simple shell-script automation or standardized GitHub Actions templates are sufficient, and an IDP is not yet needed.
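
The checklist rule can be expressed as a small decision function. The sketch below is only an illustration: the signal identifiers and return values are hypothetical names chosen here, and the threshold of three comes from the text above.

```python
# Hypothetical signal names; each corresponds to one checklist item.
ADOPTION_SIGNALS = (
    "more_than_20_services",
    "new_service_takes_over_a_week",
    "no_ci_cd_standard",
    "onboarding_over_two_weeks",
    "infra_team_over_80pct_on_tickets",
    "rollback_undocumented_or_inconsistent",
    "no_cost_attribution",
)


def idp_recommendation(observed: set[str], threshold: int = 3) -> str:
    """Return 'consider-idp' when at least `threshold` signals are observed."""
    unknown = observed - set(ADOPTION_SIGNALS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    if len(observed) >= threshold:
        return "consider-idp"
    return "scripts-and-templates"


print(idp_recommendation({"more_than_20_services", "no_ci_cd_standard"}))
```

Keeping the signals as an explicit tuple makes it easy to review the assessment with the team before committing to a platform investment.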

Play 2: Building the Platform Team and Defining Roles

An IDP is a product. It should not be a side project built by a few infrastructure engineers; it requires a dedicated team operating with a product mindset.

Team Composition (for organizations of 50–200 developers)

| Role | Headcount | Key Responsibilities |
| --- | --- | --- |
| Platform PM | 1 | Collecting developer requirements, roadmap management, tracking adoption rate |
| Platform Engineer | 2–3 | Infrastructure abstraction, API/UI development, golden path design |
| SRE / DevOps | 1–2 | Monitoring pipelines, on-call, incident response automation |
| Developer Advocate | 0.5 (shared) | Documentation, onboarding guides, internal training |

Core Principles:

  • The platform team's customer is the internal developer. Measure NPS (Net Promoter Score) every quarter.
  • Drive adoption through appeal, not enforcement. Following the golden path should take 30 minutes; the off-path route remains available but keeps its manual cost of roughly 2 weeks.
  • Keep the feedback loop under 2 weeks. Developer feature requests must receive at least a minimum response (implementation plan or rejection reason) within 2 weeks.
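
For the quarterly NPS measurement, the standard calculation is the percentage of promoters (scores 9–10) minus the percentage of detractors (scores 0–6) on a 0–10 "would you recommend" question. A minimal sketch follows, assuming the survey answers are already available as a list of integers; the function name is hypothetical.

```python
def net_promoter_score(scores: list[int]) -> float:
    """Compute NPS from 0-10 survey answers.

    Promoters score 9-10, detractors 0-6; passives (7-8) only
    count toward the total number of responses.
    """
    if not scores:
        raise ValueError("no survey responses")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


print(net_promoter_score([10, 10, 9, 8]))
```

The result ranges from -100 to 100, so a quarter-over-quarter trend is usually more informative than a single value.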

Play 3: Choosing the Technology Stack

Backstage vs. Build-Your-Own vs. SaaS Comparison

| Criteria | Backstage (Open Source) | Build-Your-Own | SaaS (Port/Cortex, etc.) |
| --- | --- | --- | --- |
| Initial Cost | Medium (3–6 months to build) | High (6–12 months) | Low (start immediately) |
| Customization | High (plugin ecosystem) | Highest | Limited |
| Maintenance Burden | High (upgrades, security patches) | Very High | None (vendor responsibility) |
| Org Size Fit | 100+ | 500+ | 50–300 |
| Vendor Lock-in | None | None | High |
| Plugins/Integrations | 2000+ plugins | Only what you need | Vendor-provided scope |

Recommended Strategy: For organizations with fewer than 100 people, start with SaaS. For 100–500, adopt Backstage but also consider managed Backstage options like Roadie. For 500+, have a dedicated team customize and operate Backstage.
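
The recommended strategy reduces to a lookup on organization size. The sketch below encodes it under one assumption that the text leaves open: an organization of exactly 500 people is placed in the top tier. The function name and return labels are hypothetical.

```python
def recommend_platform(headcount: int) -> str:
    """Map organization size to the starting point suggested above.

    Assumption: exactly 500 people is treated as the 500+ tier.
    """
    if headcount < 0:
        raise ValueError("headcount must not be negative")
    if headcount < 100:
        return "saas"
    if headcount < 500:
        return "backstage-or-managed-backstage"
    return "backstage-with-dedicated-team"


for size in (40, 250, 800):
    print(size, recommend_platform(size))
```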

Play 4: Backstage Setup and Software Catalog

Installing Backstage

# Create a new project using the Backstage CLI
npx @backstage/create-app@latest --skip-install

# Resulting directory structure
my-backstage/
├── app-config.yaml           # Core configuration file
├── app-config.production.yaml
├── packages/
│   ├── app/                  # Frontend (React)
│   └── backend/              # Backend (Node.js)
├── plugins/                  # Custom plugins
├── catalog-info.yaml         # Catalog registration for this project itself
└── package.json

# Install dependencies and run
cd my-backstage
yarn install
yarn dev
# Access at http://localhost:3000

Key app-config.yaml Settings

# app-config.yaml
app:
  title: 'MyOrg Developer Platform'
  baseUrl: http://localhost:3000

organization:
  name: 'MyOrg'

backend:
  baseUrl: http://localhost:7007
  database:
    client: pg
    connection:
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}

# GitHub integration (automatic service catalog discovery)
integrations:
  github:
    - host: github.com
      token: ${GITHUB_TOKEN}

# Software Catalog settings
catalog:
  import:
    entityFilename: catalog-info.yaml
  rules:
    # Template is included so the Software Templates in Play 5 can be registered
    - allow: [Component, System, API, Resource, Location, Group, User, Template]
  locations:
    # Automatically discover catalog-info.yaml from all repositories in the organization
    - type: github-discovery
      target: https://github.com/my-org/*/blob/main/catalog-info.yaml
    # Manual registration
    - type: file
      target: ./catalog-entities/all-systems.yaml

# Authentication (GitHub OAuth)
auth:
  environment: development
  providers:
    github:
      development:
        clientId: ${GITHUB_OAUTH_CLIENT_ID}
        clientSecret: ${GITHUB_OAUTH_CLIENT_SECRET}

Service Catalog Registration Standard

Place a catalog-info.yaml at the root of each service repository.

# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: 'Order processing microservice'
  annotations:
    github.com/project-slug: my-org/order-service
    backstage.io/techdocs-ref: dir:.
    datadoghq.com/dashboard-url: https://app.datadoghq.com/dashboard/abc-123
    pagerduty.com/service-id: PXXXXXX
    argocd/app-name: order-service-prod
  tags:
    - java
    - spring-boot
    - tier-1
  links:
    - url: https://order.internal.example.com
      title: Production URL
    - url: https://grafana.internal.example.com/d/order-service
      title: Grafana Dashboard
spec:
  type: service
  lifecycle: production
  owner: team-commerce
  system: commerce-platform
  dependsOn:
    - component:payment-service
    - resource:orders-database
  providesApis:
    - order-api
  consumesApis:
    - payment-api
    - inventory-api
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: order-api
  description: 'Order REST API'
spec:
  type: openapi
  lifecycle: production
  owner: team-commerce
  definition:
    $text: ./docs/openapi.yaml
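
One practical use of the dependsOn relations is estimating which entities are affected when a dependency fails. The following sketch is not Backstage code; it assumes the dependsOn lists have already been exported into a plain dictionary, and it walks the reversed edges breadth-first. The entity component:checkout-web is a hypothetical example that does not appear in the catalog file above.

```python
from collections import defaultdict, deque


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every entity that directly or transitively depends on `failed`."""
    # Reverse the edges: dependency -> entities that depend on it
    dependents = defaultdict(set)
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(entity)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for entity in dependents[current]:
            if entity not in impacted:
                impacted.add(entity)
                queue.append(entity)
    impacted.discard(failed)  # a dependency cycle must not report the failed entity itself
    return impacted


graph = {
    "component:order-service": ["component:payment-service", "resource:orders-database"],
    "component:checkout-web": ["component:order-service"],  # hypothetical entity
}
print(sorted(impacted_by("component:payment-service", graph)))
```

With this graph, an outage of payment-service is reported as affecting order-service directly and checkout-web transitively.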

Play 5: Service Templates (Scaffolding)

Backstage Software Templates create new services with a standardized structure. A service with CI/CD, monitoring, and catalog registration already configured can be ready within 30 minutes.

# templates/spring-boot-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: spring-boot-service
  title: 'Spring Boot Microservice'
  description: 'Creates a Spring Boot service with auto-configured CI/CD, monitoring, and catalog registration'
  tags:
    - java
    - spring-boot
    - recommended
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service Information
      required:
        - serviceName
        - ownerTeam
        - tier
      properties:
        serviceName:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: 'Only lowercase letters, numbers, and hyphens allowed'
        ownerTeam:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['tier-1', 'tier-2', 'tier-3']
          enumNames: ['Tier 1 (99.99% SLO)', 'Tier 2 (99.9% SLO)', 'Tier 3 (99% SLO)']
        javaVersion:
          title: Java Version
          type: string
          default: '21'
          enum: ['17', '21']

    - title: Infrastructure Settings
      properties:
        database:
          title: Database
          type: string
          default: 'postgresql'
          enum: ['postgresql', 'mysql', 'none']
        messageQueue:
          title: Message Queue
          type: string
          default: 'none'
          enum: ['kafka', 'rabbitmq', 'none']
        replicaCount:
          title: Default Replica Count
          type: integer
          default: 3
          minimum: 1
          maximum: 20

  steps:
    # 1. Generate repository from template
    - id: fetch-template
      name: Generate Template Code
      action: fetch:template
      input:
        url: ./skeleton
        values:
          serviceName: ${{ parameters.serviceName }}
          ownerTeam: ${{ parameters.ownerTeam }}
          tier: ${{ parameters.tier }}
          javaVersion: ${{ parameters.javaVersion }}
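          database: ${{ parameters.database }}
          # messageQueue is collected in the form, so pass it to the skeleton as well
          messageQueue: ${{ parameters.messageQueue }}
          replicaCount: ${{ parameters.replicaCount }}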

    # 2. Create GitHub repository
    - id: publish
      name: Create GitHub Repository
      action: publish:github
      input:
        allowedHosts: ['github.com']
        repoUrl: github.com?owner=my-org&repo=${{ parameters.serviceName }}
        defaultBranch: main
        repoVisibility: internal
        protectDefaultBranch: true
        requireCodeOwnerReviews: true

    # 3. Register ArgoCD application
    - id: register-argocd
      name: Register ArgoCD Application
      action: argocd:create-resources
      input:
        appName: ${{ parameters.serviceName }}-prod
        argoInstance: main
        namespace: ${{ parameters.serviceName }}
        repoUrl: https://github.com/my-org/${{ parameters.serviceName }}
        path: k8s/overlays/production

    # 4. Register in Backstage catalog
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

  output:
    links:
      - title: Repository
        url: ${{ steps['publish'].output.remoteUrl }}
      - title: Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}
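
The same constraints the form enforces can be checked before a request ever reaches Backstage, for example in a CLI wrapper or a test. The sketch below mirrors the serviceName pattern, the tier enum, and the replicaCount range from the template; the function name and error messages are hypothetical.

```python
import re

# Values copied from the template parameters above
SERVICE_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]*$")
ALLOWED_TIERS = {"tier-1", "tier-2", "tier-3"}


def validate_service_request(service_name: str, tier: str, replica_count: int = 3) -> list[str]:
    """Return a list of validation errors; an empty list means the request is valid."""
    errors = []
    if not SERVICE_NAME_PATTERN.fullmatch(service_name):
        errors.append("serviceName: only lowercase letters, numbers, and hyphens allowed")
    if tier not in ALLOWED_TIERS:
        errors.append(f"tier: must be one of {sorted(ALLOWED_TIERS)}")
    if not 1 <= replica_count <= 20:
        errors.append("replicaCount: must be between 1 and 20")
    return errors


print(validate_service_request("order-service", "tier-1"))
print(validate_service_request("Order_Service", "tier-9", replica_count=0))
```

Returning all errors at once, rather than raising on the first one, matches how the form reports problems to the user.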

Play 6: Unifying Documentation with TechDocs

By consolidating scattered documentation into Backstage TechDocs, you can view technical documentation directly from the service catalog.

# mkdocs.yml (root of each service repository)
site_name: order-service
nav:
  - Home: index.md
  - Architecture: architecture.md
  - API Reference: api.md
  - Runbook: runbook.md
  - ADR:
      - adr/001-database-choice.md
      - adr/002-event-schema.md

plugins:
  - techdocs-core

<!-- docs/runbook.md -->

# Order Service Operations Runbook

## Incident Response

### Order Processing Delay (P95 > 500ms)

1. Check the Grafana dashboard: [Link]
2. Check DB connection pool status:
   kubectl exec -it deploy/order-service -- curl localhost:8080/actuator/metrics/hikaricp.connections.active
3. If the connection pool is saturated:
   kubectl scale deploy/order-service --replicas=6
4. Check DB slow queries:
   SELECT pid, now() - query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '5s';

### Order Creation Failure (HTTP 500)

1. Check error logs:
   kubectl logs deploy/order-service --tail=100 | grep ERROR
2. Response by error code:
   - ORDER-001: Payment service connection failure -> Check payment-service status
   - ORDER-002: Insufficient inventory -> Check inventory-service synchronization
   - ORDER-003: DB deadlock -> Check transaction isolation level

Play 7: Self-Service Infrastructure Provisioning

Enable developers to provision databases, message queues, caches, and more directly from the Backstage UI. Actual infrastructure creation is handled via Terraform + GitOps.

# templates/provision-postgresql/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: provision-postgresql
  title: 'PostgreSQL Database Provisioning'
  description: 'Self-service provisioning of RDS PostgreSQL instances'
spec:
  owner: team-platform
  type: resource
  parameters:
    - title: Database Settings
      required:
        - dbName
        - environment
        - instanceClass
      properties:
        dbName:
          title: DB Name
          type: string
          pattern: '^[a-z][a-z0-9_]*$'
        environment:
          title: Environment
          type: string
          enum: ['dev', 'staging', 'production']
        instanceClass:
          title: Instance Size
          type: string
          default: 'db.r7g.large'
          enum:
            - 'db.t4g.medium'
            - 'db.r7g.large'
            - 'db.r7g.xlarge'
            - 'db.r7g.2xlarge'
          enumNames:
            - 'Small (2 vCPU, 4GB) - dev/staging'
            - 'Medium (2 vCPU, 16GB) - production'
            - 'Large (4 vCPU, 32GB) - production'
            - 'XLarge (8 vCPU, 64GB) - high traffic'
        storageGb:
          title: Storage (GB)
          type: integer
          default: 100
          minimum: 20
          maximum: 16000
        multiAz:
          title: Multi-AZ Deployment
          type: boolean
          default: false

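  steps:
    # Render the Terraform skeleton into the workspace so the PR step has files to commit.
    # Assumes a terraform-template/ directory next to this template.yaml.
    - id: fetch-terraform
      name: Render Terraform Template
      action: fetch:template
      input:
        url: ./terraform-template
        targetPath: ./terraform-template
        values:
          dbName: ${{ parameters.dbName }}
          environment: ${{ parameters.environment }}
          instanceClass: ${{ parameters.instanceClass }}
          storageGb: ${{ parameters.storageGb }}
          multiAz: ${{ parameters.multiAz }}

    - id: create-terraform-pr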
      name: Create Terraform PR
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=my-org&repo=infrastructure
        branchName: provision-db-${{ parameters.dbName }}
        title: 'DB Provisioning: ${{ parameters.dbName }} (${{ parameters.environment }})'
        description: |
          Automatically generated DB provisioning request.

          - DB Name: ${{ parameters.dbName }}
          - Environment: ${{ parameters.environment }}
          - Instance: ${{ parameters.instanceClass }}
          - Storage: ${{ parameters.storageGb }}GB
          - Multi-AZ: ${{ parameters.multiAz }}
        targetPath: terraform/rds/${{ parameters.environment }}/${{ parameters.dbName }}
        sourcePath: ./terraform-template

Play 8: Measuring Platform Success Metrics

Without measuring IDP performance, you cannot prove the return on investment. Track the following metrics quarterly.

Key Performance Indicators (KPIs)

# platform_metrics.py - Platform KPI dashboard data collection

import requests
from datetime import datetime, timedelta

class PlatformMetrics:
    def __init__(self, github_token: str, backstage_url: str):
        self.github = github_token
        self.backstage = backstage_url

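    def service_creation_lead_time(self) -> dict:
        """New service creation lead time (target: under 30 minutes)"""
        # Extract from Backstage scaffolder task records.
        # NOTE: the query parameter and field names used below (createdAfter, items,
        # completedAt) depend on the Backstage version; verify them against your instance.
        response = requests.get(
            f"{self.backstage}/api/scaffolder/v2/tasks",
            params={"createdAfter": (datetime.now() - timedelta(days=90)).isoformat()},
            timeout=30,
        )
        response.raise_for_status()
        tasks = response.json()["items"]

        lead_times = []
        for task in tasks:
            if task["status"] == "completed":
                start = datetime.fromisoformat(task["createdAt"])
                end = datetime.fromisoformat(task["completedAt"])
                lead_times.append((end - start).total_seconds() / 60)

        if not lead_times:
            # No completed tasks in the window; avoid indexing an empty list
            return {"median_minutes": None, "p95_minutes": None, "total_services_created": 0}

        lead_times.sort()
        return {
            "median_minutes": lead_times[len(lead_times) // 2],
            "p95_minutes": lead_times[int(len(lead_times) * 0.95)],
            "total_services_created": len(lead_times),
        }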

    def golden_path_adoption_rate(self) -> dict:
        """Golden path adoption rate (target: 80% or above)"""
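        # List the organization's repositories via the GitHub API.
        # Only the first page (up to 100 repos) is fetched here; follow the
        # Link response header to paginate in larger organizations.
        repos = requests.get(
            "https://api.github.com/orgs/my-org/repos",
            headers={"Authorization": f"token {self.github}"},
            params={"per_page": 100, "type": "internal"},
            timeout=30,
        ).json()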

        using_golden_path = 0
        total_active = 0

        for repo in repos:
            if repo["archived"]:
                continue
            total_active += 1
            # Heuristic: count the repo if any workflow file path contains "golden"
            workflows = requests.get(
                f"https://api.github.com/repos/my-org/{repo['name']}/actions/workflows",
                headers={"Authorization": f"token {self.github}"}
            ).json()

            for wf in workflows.get("workflows", []):
                if "golden" in wf.get("path", "").lower():
                    using_golden_path += 1
                    break

        return {
            "adoption_rate": using_golden_path / max(total_active, 1),
            "using_golden_path": using_golden_path,
            "total_active_repos": total_active,
        }

    def developer_nps(self) -> dict:
        """Developer satisfaction NPS (target: 30 or above)"""
        # Quarterly survey results (Google Forms, Typeform, etc.).
        # The values below are placeholders: load them from the survey tool's
        # API or enter them manually each quarter.
        return {
            "nps_score": 42,
            "promoters_pct": 55,
            "detractors_pct": 13,
            "response_rate": 0.72,
            "top_complaints": [
                "Build times are slow",
                "Log search UI is inconvenient",
                "Insufficient permission request automation",
            ]
        }
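
The path-based check in golden_path_adoption_rate only catches workflow files whose name contains "golden". Because the metric is described as reusable workflow usage, a stricter check could look inside each workflow file for a `uses:` entry that calls the shared workflow repository. The following is a minimal sketch under two assumptions that are not in the original: the shared repository is named my-org/golden-workflows (hypothetical), and the workflow YAML text has already been downloaded.

```python
import re

# Hypothetical shared repository that hosts the reusable workflows (assumption).
GOLDEN_REPO = "my-org/golden-workflows"

# Reusable workflows are called as:
#   uses: <owner>/<repo>/.github/workflows/<file>@<ref>
USES_PATTERN = re.compile(
    r"^\s*(?:-\s*)?uses:\s*['\"]?(?P<target>[^\s'\"#]+)", re.MULTILINE
)


def uses_golden_path(workflow_text: str, golden_repo: str = GOLDEN_REPO) -> bool:
    """Return True if any `uses:` entry calls a reusable workflow in golden_repo."""
    prefix = f"{golden_repo}/.github/workflows/".lower()
    for match in USES_PATTERN.finditer(workflow_text):
        if match.group("target").lower().startswith(prefix):
            return True
    return False


def adoption_rate(workflows_by_repo: dict[str, list[str]]) -> float:
    """Share of repos with at least one workflow that calls the golden path."""
    if not workflows_by_repo:
        return 0.0
    adopted = sum(
        1
        for texts in workflows_by_repo.values()
        if any(uses_golden_path(text) for text in texts)
    )
    return adopted / len(workflows_by_repo)
```

This is a text-level heuristic rather than a YAML parse, so it trades some precision for having no dependencies. It would still miss golden path usage that is wrapped in another organization-internal workflow.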

KPI Target Values

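| Metric | Poor | Average | Good | Target |
| --- | --- | --- | --- | --- |
| Service Creation Time | 1 week+ | 1–3 days | 1 hour | 30 min |
| Golden Path Adoption Rate | Below 30% | 30–60% | 60–80% | 80%+ |
| Developer NPS | Below 0 | 0–20 | 20–40 | 40+ |
| Onboarding Time | 2 weeks+ | 1–2 weeks | 2–5 days | 1 day |
| Infra Tickets per Month | 50+ | 20–50 | 5–20 | Below 5 |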
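
The Developer NPS values follow the standard Net Promoter Score definition: the percentage of promoters (scores 9–10 on a 0–10 scale) minus the percentage of detractors (scores 0–6). The placeholder figures in developer_nps are consistent with it (55 − 13 = 42). Below is a minimal sketch of deriving the score from raw survey responses. The function name and the sample data are illustrative and not part of the original script.

```python
def compute_nps(scores: list[int]) -> dict:
    """Compute NPS from 0-10 survey scores (promoters: 9-10, detractors: 0-6)."""
    if not scores:
        raise ValueError("at least one survey response is required")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")

    total = len(scores)
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)

    promoters_pct = 100 * promoters / total
    detractors_pct = 100 * detractors / total
    return {
        "nps_score": round(promoters_pct - detractors_pct),
        "promoters_pct": round(promoters_pct),
        "detractors_pct": round(detractors_pct),
    }


# Illustrative sample: 20 responses with 11 promoters, 6 passives, 3 detractors
sample = [10] * 6 + [9] * 5 + [8] * 4 + [7] * 2 + [5] * 2 + [3]
print(compute_nps(sample))
```

Passives (scores 7–8) count toward the total but toward neither percentage, which is why a team can raise its NPS by converting passives into promoters even if the number of detractors stays the same.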

Play 9: Troubleshooting

Issue 1: Backstage Catalog Synchronization Delay


WARN: Entity refresh for component:order-service took 45s (threshold: 10s)

Cause: GitHub discovery is scanning hundreds of repositories and hitting the API rate limit.

# Solution: Limit scan scope + configure caching
catalog:
  providers:
    github:
      myOrg:
        organization: 'my-org'
        catalogPath: '/catalog-info.yaml'
        filters:
          repository: '^(?!archived-).*$' # Exclude repositories with archived- prefix
          topic:
            include: ['backstage-enabled'] # Topic-based filtering
        schedule:
          frequency: { minutes: 30 } # 30-minute interval (default is 5 minutes)
          timeout: { minutes: 5 }
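
To see why a longer schedule and a narrower scan scope help, it can be useful to estimate the request budget. The sketch below rests on two illustrative assumptions: one GitHub API request per repository per refresh, and a limit of 5,000 requests per hour (the usual limit for authenticated REST requests; GitHub App installations can have higher limits). Real discovery traffic depends on the provider implementation, so treat the numbers as orders of magnitude only.

```python
def requests_per_hour(repo_count: int, frequency_minutes: int,
                      requests_per_repo: int = 1) -> float:
    """Estimated API requests per hour for a periodic catalog scan."""
    if frequency_minutes <= 0:
        raise ValueError("frequency_minutes must be positive")
    refreshes_per_hour = 60 / frequency_minutes
    return repo_count * requests_per_repo * refreshes_per_hour


def fits_rate_limit(repo_count: int, frequency_minutes: int,
                    hourly_limit: int = 5000) -> bool:
    """True if the estimated request volume stays within the hourly limit."""
    return requests_per_hour(repo_count, frequency_minutes) <= hourly_limit


# Hypothetical scenarios: all repos every 5 min, all repos every 30 min,
# and only topic-filtered repos every 30 min.
scenarios = [
    ("500 repos, every 5 min", 500, 5),
    ("500 repos, every 30 min", 500, 30),
    ("120 filtered repos, every 30 min", 120, 30),
]
for label, repos, minutes in scenarios:
    estimate = requests_per_hour(repos, minutes)
    print(f"{label}: {estimate:.0f} req/h, within limit: {fits_rate_limit(repos, minutes)}")
```

Under these assumptions, the 5-minute schedule exceeds the hourly budget on its own, while the 30-minute schedule combined with topic filtering leaves most of it free for other integrations.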

Issue 2: Software Template Execution Failure — GitHub Permissions

Error: Resource not accessible by integration
HttpError: 403 - Resource not accessible by integration

Cause: The GitHub App lacks sufficient permissions, or it doesn't have access to the organization where the repository is being created.

# Check GitHub App permissions
# Settings > Developer settings > GitHub Apps > [App Name] > Permissions
# Required permissions:
# - Repository: Administration (Read & Write)
# - Repository: Contents (Read & Write)
# - Organization: Members (Read)
# Or when using a Personal Access Token (PAT), required scopes:
# repo, workflow, admin:org
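
For classic personal access tokens, GitHub reports the granted scopes in the X-OAuth-Scopes response header, so the scope list above can be checked before running a template. Below is a minimal sketch that compares a header value against the required scopes. The helper name is hypothetical, and this check is not expected to work for GitHub Apps or fine-grained tokens, which use a different permission model.

```python
# Scopes listed above as required when using a classic PAT
REQUIRED_SCOPES = frozenset({"repo", "workflow", "admin:org"})


def missing_scopes(scopes_header: str,
                   required: frozenset[str] = REQUIRED_SCOPES) -> set[str]:
    """Return the required scopes absent from an X-OAuth-Scopes header value."""
    granted = {scope.strip() for scope in scopes_header.split(",") if scope.strip()}
    return set(required) - granted


# Example header value as it might be returned for a token missing admin:org
print(missing_scopes("repo, workflow"))
```

The header value can be read from any authenticated API response, for example `requests.get("https://api.github.com/user", headers=...).headers.get("X-OAuth-Scopes", "")`. An empty result from missing_scopes means the token carries every scope the template needs.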

Issue 3: TechDocs Build Failure

mkdocs build failed: No module named 'techdocs_core'

# Solution: Install the plugin in the TechDocs build environment
pip install mkdocs-techdocs-core

# When using Docker build
docker run --rm -v "$(pwd)":/content \
  spotify/techdocs:latest \
  build --site-dir /content/site

# Configure build method in app-config.yaml
# techdocs:
#   builder: 'external' # Build in CI
#   publisher:
#     type: 'awsS3'
#     awsS3:
#       bucketName: 'my-org-techdocs'
#       region: 'ap-northeast-2'

Issue 4: Platform Adoption Rate Won't Increase

This is an organizational problem, not a technical one.

Resolution Strategy:

  1. Secure champion teams first: Select 2–3 early adopter teams and share their success stories internally.
  2. Remove friction from the golden path: Leave the cost of going off-path as it is (about 2 weeks of manual deployment and manual monitoring setup) and make the golden path as convenient as possible.
  3. Don't force it: Instead of mandating adoption, hold a monthly "Platform Day" with demos and feedback sessions.
  4. Prove it with data: Share metrics like "Teams using the golden path deploy 3x more frequently."

Play 10: Phased IDP Roadmap Implementation

Trying to build everything at once will lead to failure. Divide the effort into three phases and build incrementally.

Phase 1 (1–3 months): Foundation

  • Build the Software Catalog (register all services, teams, APIs)
  • Standardize CI golden path (GitHub Actions reusable workflows)
  • Create 1–2 service creation templates

Phase 2 (4–6 months): Expansion

  • CD golden path (Argo Rollouts canary deployments)
  • TechDocs integration (runbooks, ADRs)
  • Self-service infrastructure provisioning (databases, caches)
  • Cost tagging and dashboards

Phase 3 (7–12 months): Maturity

  • Automated security policy enforcement (OPA/Kyverno)
  • Automated DORA metrics collection and dashboards
  • Internal marketplace (shared libraries, plugins)
  • One-click development environment provisioning

Quiz

Q1. What characterizes an organization where IDP adoption is premature?

Answer: ||An organization with fewer than 20 services, a low ratio of repetitive ticket processing by the infrastructure team, and the ability to create new services within one week. In such cases, simple script automation or standard CI/CD templates are sufficient rather than an IDP.||

Q2. Why is a Platform PM needed when building a platform team?

Answer: ||Since an IDP is an internal product, it requires collecting customer (developer) requirements, prioritizing them, and measuring adoption rates. If composed only of engineers, the team tends to skew toward technically interesting features rather than what developers actually need.||

Q3. Why is the dependsOn field in catalog-info.yaml important in the Backstage Software Catalog?

Answer: ||It explicitly declares inter-service dependencies, enabling immediate identification of the blast radius during incidents. If order-service has a dependsOn on payment-service, you can instantly see in the catalog that a payment-service outage would also affect order-service.||

Q4. Why include the ArgoCD app registration step in a Software Template?

Answer: ||A GitOps-based deployment pipeline is automatically configured at the same time the service is created, so deployment begins immediately when a developer pushes code. Without this step, developers would have to create a separate ticket to request ArgoCD configuration, which is a major cause of increased onboarding time.||

Q5. How do you increase platform adoption from below 50% through appeal rather than enforcement?

Answer: ||Maintain the inconvenience of not following the golden path (2 weeks for manual deployment, manual monitoring setup) while maximizing the golden path's convenience (service creation in 30 minutes, automatic deployment, automatic monitoring). Share champion teams' success stories and prove the impact with metrics.||

Q6. Why should IDP construction be divided into three phases?

Answer: ||Building all features at once takes 12+ months, and the project risks being canceled before ROI can be demonstrated. By delivering value quickly with the catalog and CI standardization in Phase 1 (3 months), you can secure investment for Phases 2 and 3 based on those results.||

Q7. What should you watch out for when measuring developer NPS?

Answer: ||The response rate must be 70% or higher for the metric to be meaningful. Also, don't just look at the NPS score; analyze the specific complaints from detractors. Track top_complaints quarterly to verify improvements, and publicly announce resolved items to close the feedback loop.||

References