DevOps Internal Developer Platform Playbook 2026

- Play 1: Determining When You Need an IDP
- Play 2: Building the Platform Team and Defining Roles
- Play 3: Choosing the Technology Stack
- Play 4: Backstage Setup and Software Catalog
- Play 5: Service Templates (Scaffolding)
- Play 6: Unifying Documentation with TechDocs
- Play 7: Self-Service Infrastructure Provisioning
- Play 8: Measuring Platform Success Metrics
- Play 9: Troubleshooting
- Play 10: Phased IDP Roadmap Implementation
- Quiz
- References
Play 1: Determining When You Need an IDP
An Internal Developer Platform (IDP) lets developers create, deploy, and monitor services through self-service, without filing infrastructure requests. Gartner has predicted that by 2026, 80% of large software engineering organizations will have established platform engineering teams.
However, not every organization needs an IDP. If three or more of the following conditions apply, it's time to consider adopting an IDP.
Adoption Signal Checklist:
- Number of services exceeds 20
- Creating a new service takes more than one week
- Each team has different CI/CD pipelines with no standards
- Onboarding takes more than 2 weeks (dev environment setup, permission requests, etc.)
- The infrastructure team spends more than 80% of its time handling repetitive Jira tickets
- Rollback procedures after deployment failures differ across teams or are undocumented
- Cost attribution is impossible
If fewer than three conditions apply, simple shell-script automation or standardized GitHub Actions templates are usually enough, and an IDP is not yet needed.
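The three-or-more rule can be written down as a small scoring function, which makes the decision easy to revisit each quarter. This is a minimal sketch: the signal identifiers are hypothetical labels for the seven checklist items, not names from any tool.

```python
# Hypothetical identifiers for the seven checklist items above.
ADOPTION_SIGNALS = (
    "more_than_20_services",
    "new_service_takes_over_a_week",
    "no_ci_cd_standard",
    "onboarding_over_two_weeks",
    "infra_team_over_80pct_tickets",
    "rollback_inconsistent_or_undocumented",
    "no_cost_attribution",
)


def should_consider_idp(observed: set, threshold: int = 3) -> bool:
    """Return True when at least `threshold` checklist signals are observed."""
    unknown = set(observed) - set(ADOPTION_SIGNALS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    return len(set(observed)) >= threshold
```

With two signals the function returns False (scripts and CI templates suffice); with three or more it returns True.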
Play 2: Building the Platform Team and Defining Roles
An IDP is a product. It should not be a side project built by a few infrastructure engineers; it needs a dedicated team that operates with a product mindset.
Team Composition (for organizations of 50–200 developers)
| Role | Headcount | Key Responsibilities |
|---|---|---|
| Platform PM | 1 | Collecting developer requirements, roadmap management, tracking adoption rate |
| Platform Engineer | 2–3 | Infrastructure abstraction, API/UI development, golden path design |
| SRE / DevOps | 1–2 | Monitoring pipelines, on-call, incident response automation |
| Developer Advocate | 0.5 (shared) | Documentation, onboarding guides, internal training |
Core Principles:
- The platform team's customer is the internal developer. Measure NPS (Net Promoter Score) every quarter.
- Drive adoption through appeal, not enforcement. Design the platform so that following the golden path takes about 30 minutes, while going off-path still takes about 2 weeks.
- Keep the feedback loop under 2 weeks. Every developer feature request should receive at least an initial response (an implementation plan or a reason for rejection) within 2 weeks.
Play 3: Choosing the Technology Stack
Backstage vs. Build-Your-Own vs. SaaS Comparison
| Criteria | Backstage (Open Source) | Build-Your-Own | SaaS (Port/Cortex, etc.) |
|---|---|---|---|
| Initial Cost | Medium (3–6 months to build) | High (6–12 months) | Low (start immediately) |
| Customization | High (plugin ecosystem) | Highest | Limited |
| Maintenance Burden | High (upgrades, security patches) | Very High | None (vendor responsibility) |
| Org Size Fit | 100+ | 500+ | 50–300 |
| Vendor Lock-in | None | None | High |
| Plugins/Integrations | Large open-source plugin ecosystem | Only what you need | Vendor-provided scope |
Recommended Strategy: For organizations with fewer than 100 people, start with SaaS. For 100–500, adopt Backstage but also consider managed Backstage options like Roadie. For 500+, have a dedicated team customize and operate Backstage.
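The sizing guidance can be expressed as a simple lookup. The sketch below is illustrative only: the function name and return labels are made up, and because the text leaves the boundaries ambiguous, it assumes exactly 100 developers falls into the Backstage band and exactly 500 into the self-operated band.

```python
def recommend_platform(developer_count: int) -> str:
    """Map organization size to the starting point suggested above."""
    if developer_count < 0:
        raise ValueError("developer_count must be non-negative")
    if developer_count < 100:
        return "saas"
    if developer_count < 500:
        # Managed Backstage (for example Roadie) is worth evaluating here.
        return "backstage-or-managed-backstage"
    return "self-operated-backstage"
```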
Play 4: Backstage Setup and Software Catalog
Installing Backstage
# Create a new project using the Backstage CLI
npx @backstage/create-app@latest --skip-install
# Resulting directory structure
my-backstage/
├── app-config.yaml # Core configuration file
├── app-config.production.yaml
├── packages/
│ ├── app/ # Frontend (React)
│ └── backend/ # Backend (Node.js)
├── plugins/ # Custom plugins
├── catalog-info.yaml # Catalog registration for this project itself
└── package.json
# Install dependencies and run
cd my-backstage
yarn install
yarn dev
# Access at http://localhost:3000
Key app-config.yaml Settings
# app-config.yaml
app:
  title: 'MyOrg Developer Platform'
  baseUrl: http://localhost:3000
organization:
  name: 'MyOrg'
backend:
  baseUrl: http://localhost:7007
  database:
    client: pg
    connection:
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}
# GitHub integration (automatic service catalog discovery)
integrations:
  github:
    - host: github.com
      token: ${GITHUB_TOKEN}
# Software Catalog settings
catalog:
  import:
    entityFilename: catalog-info.yaml
  rules:
    - allow: [Component, System, API, Resource, Location, Group, User]
  locations:
    # Automatically discover catalog-info.yaml from all repositories in the organization
    - type: github-discovery
      target: https://github.com/my-org/*/blob/main/catalog-info.yaml
    # Manual registration
    - type: file
      target: ./catalog-entities/all-systems.yaml
# Authentication (GitHub OAuth)
auth:
  environment: development
  providers:
    github:
      development:
        clientId: ${GITHUB_OAUTH_CLIENT_ID}
        clientSecret: ${GITHUB_OAUTH_CLIENT_SECRET}
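The configuration above depends on several `${VAR}` placeholders that are filled from environment variables. A pre-flight check can catch a missing variable before the backend starts. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and the check is plain text matching, not Backstage's own config loader.

```python
import os
import re

# Matches ${NAME} placeholders as they appear in app-config.yaml.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(config_text: str, env=None) -> list:
    """Return the sorted placeholder names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    names = sorted(set(PLACEHOLDER.findall(config_text)))
    return [name for name in names if not env.get(name)]
```

For example, `missing_env_vars(open("app-config.yaml").read())` lists every variable that still needs to be exported in the current shell.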
Service Catalog Registration Standard
Place a catalog-info.yaml at the root of each service repository.
# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: 'Order processing microservice'
  annotations:
    github.com/project-slug: my-org/order-service
    backstage.io/techdocs-ref: dir:.
    datadoghq.com/dashboard-url: https://app.datadoghq.com/dashboard/abc-123
    pagerduty.com/service-id: PXXXXXX
    argocd/app-name: order-service-prod
  tags:
    - java
    - spring-boot
    - tier-1
  links:
    - url: https://order.internal.example.com
      title: Production URL
    - url: https://grafana.internal.example.com/d/order-service
      title: Grafana Dashboard
spec:
  type: service
  lifecycle: production
  owner: team-commerce
  system: commerce-platform
  dependsOn:
    - component:payment-service
    - resource:orders-database
  providesApis:
    - order-api
  consumesApis:
    - payment-api
    - inventory-api
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: order-api
  description: 'Order REST API'
spec:
  type: openapi
  lifecycle: production
  owner: team-commerce
  definition:
    $text: ./docs/openapi.yaml
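A registration standard is easier to enforce when CI rejects incomplete descriptors. The sketch below checks a Component entity that has already been parsed into a dictionary (for example with a YAML library, which is not shown here). The function name is hypothetical, and it only covers the fields used in the example above, not the full Backstage schema.

```python
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity: dict) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    errors = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        errors.append("apiVersion must be backstage.io/v1alpha1")
    if entity.get("kind") != "Component":
        errors.append("kind must be Component")
    if not (entity.get("metadata") or {}).get("name"):
        errors.append("metadata.name is required")
    spec = entity.get("spec") or {}
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            errors.append(f"spec.{field} is required")
    return errors
```

Running this in a pull-request check keeps ownerless or lifecycle-less services out of the catalog.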
Play 5: Service Templates (Scaffolding)
Backstage Software Templates create new services with a standardized structure. Within 30 minutes, a new service can have CI/CD, monitoring, and catalog registration all set up.
# templates/spring-boot-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: spring-boot-service
  title: 'Spring Boot Microservice'
  description: 'Creates a Spring Boot service with auto-configured CI/CD, monitoring, and catalog registration'
  tags:
    - java
    - spring-boot
    - recommended
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service Information
      required:
        - serviceName
        - ownerTeam
        - tier
      properties:
        serviceName:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: 'Only lowercase letters, numbers, and hyphens allowed'
        ownerTeam:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['tier-1', 'tier-2', 'tier-3']
          enumNames: ['Tier 1 (99.99% SLO)', 'Tier 2 (99.9% SLO)', 'Tier 3 (99% SLO)']
        javaVersion:
          title: Java Version
          type: string
          default: '21'
          enum: ['17', '21']
    - title: Infrastructure Settings
      properties:
        database:
          title: Database
          type: string
          default: 'postgresql'
          enum: ['postgresql', 'mysql', 'none']
        messageQueue:
          title: Message Queue
          type: string
          default: 'none'
          enum: ['kafka', 'rabbitmq', 'none']
        replicaCount:
          title: Default Replica Count
          type: integer
          default: 3
          minimum: 1
          maximum: 20
  steps:
    # 1. Generate code from the template skeleton
    - id: fetch-template
      name: Generate Template Code
      action: fetch:template
      input:
        url: ./skeleton
        values:
          serviceName: ${{ parameters.serviceName }}
          ownerTeam: ${{ parameters.ownerTeam }}
          tier: ${{ parameters.tier }}
          javaVersion: ${{ parameters.javaVersion }}
          database: ${{ parameters.database }}
          # Pass the queue choice too, so the skeleton can act on it
          messageQueue: ${{ parameters.messageQueue }}
          replicaCount: ${{ parameters.replicaCount }}
    # 2. Create GitHub repository
    - id: publish
      name: Create GitHub Repository
      action: publish:github
      input:
        allowedHosts: ['github.com']
        repoUrl: github.com?owner=my-org&repo=${{ parameters.serviceName }}
        defaultBranch: main
        repoVisibility: internal
        protectDefaultBranch: true
        requireCodeOwnerReviews: true
    # 3. Register ArgoCD application
    - id: register-argocd
      name: Register ArgoCD Application
      action: argocd:create-resources
      input:
        appName: ${{ parameters.serviceName }}-prod
        argoInstance: main
        namespace: ${{ parameters.serviceName }}
        repoUrl: https://github.com/my-org/${{ parameters.serviceName }}
        path: k8s/overlays/production
    # 4. Register in Backstage catalog
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps['publish'].output.remoteUrl }}
      - title: Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}
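The `serviceName` pattern and the `repoUrl` format in the template can be exercised outside Backstage, which is handy when testing naming rules before publishing a template. This is a small sketch; `repo_url_for` is a hypothetical helper, and `my-org` and `github.com` are simply the placeholder values used in the template.

```python
import re

# Same rule as the template's pattern '^[a-z][a-z0-9-]*$'; fullmatch() anchors both ends.
SERVICE_NAME = re.compile(r"[a-z][a-z0-9-]*")


def repo_url_for(service_name: str, owner: str = "my-org", host: str = "github.com") -> str:
    """Validate the name and build the repoUrl string passed to publish:github."""
    if not SERVICE_NAME.fullmatch(service_name):
        raise ValueError(
            f"invalid service name {service_name!r}: use lowercase letters, digits, and hyphens, "
            "starting with a letter"
        )
    return f"{host}?owner={owner}&repo={service_name}"
```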
Play 6: Unifying Documentation with TechDocs
By consolidating scattered documentation into Backstage TechDocs, you can view technical documentation directly from the service catalog.
# mkdocs.yml (root of each service repository)
site_name: order-service
nav:
  - Home: index.md
  - Architecture: architecture.md
  - API Reference: api.md
  - Runbook: runbook.md
  - ADR:
      - adr/001-database-choice.md
      - adr/002-event-schema.md
plugins:
  - techdocs-core

<!-- docs/runbook.md -->
# Order Service Operations Runbook
## Incident Response
### Order Processing Delay (P95 > 500ms)
1. Check Grafana dashboard: [Link]
2. Check DB connection pool status:

       kubectl exec -it deploy/order-service -- curl localhost:8080/actuator/metrics/hikaricp.connections.active

3. If connection pool is saturated:

       kubectl scale deploy/order-service --replicas=6

4. Check DB slow queries:

       SELECT pid, now() - query_start AS duration, query
       FROM pg_stat_activity
       WHERE state = 'active' AND now() - query_start > interval '5s';

### Order Creation Failure (HTTP 500)
1. Check error logs:

       kubectl logs deploy/order-service --tail=100 | grep ERROR

2. Response by error code:
   - ORDER-001: Payment service connection failure -> Check payment-service status
   - ORDER-002: Insufficient inventory -> Check inventory-service synchronization
   - ORDER-003: DB deadlock -> Check transaction isolation level
Play 7: Self-Service Infrastructure Provisioning
Enable developers to provision databases, message queues, caches, and more directly from the Backstage UI. Actual infrastructure creation is handled via Terraform + GitOps.
# templates/provision-postgresql/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: provision-postgresql
  title: 'PostgreSQL Database Provisioning'
  description: 'Self-service provisioning of RDS PostgreSQL instances'
spec:
  owner: team-platform
  type: resource
  parameters:
    - title: Database Settings
      required:
        - dbName
        - environment
        - instanceClass
      properties:
        dbName:
          title: DB Name
          type: string
          pattern: '^[a-z][a-z0-9_]*$'
        environment:
          title: Environment
          type: string
          enum: ['dev', 'staging', 'production']
        instanceClass:
          title: Instance Size
          type: string
          default: 'db.r7g.large'
          enum:
            - 'db.t4g.medium'
            - 'db.r7g.large'
            - 'db.r7g.xlarge'
            - 'db.r7g.2xlarge'
          enumNames:
            - 'Small (2 vCPU, 4GB) - dev/staging'
            - 'Medium (2 vCPU, 16GB) - production'
            - 'Large (4 vCPU, 32GB) - production'
            - 'XLarge (8 vCPU, 64GB) - high traffic'
        storageGb:
          title: Storage (GB)
          type: integer
          default: 100
          minimum: 20
          maximum: 16000
        multiAz:
          title: Multi-AZ Deployment
          type: boolean
          default: false
  steps:
    # Render the Terraform skeleton into the workspace first; the pull request
    # step below only publishes what already exists under sourcePath.
    # (Assumes a terraform-template directory next to this template.yaml.)
    - id: fetch-terraform
      name: Render Terraform Template
      action: fetch:template
      input:
        url: ./terraform-template
        targetPath: ./terraform-template
        values:
          dbName: ${{ parameters.dbName }}
          environment: ${{ parameters.environment }}
          instanceClass: ${{ parameters.instanceClass }}
          storageGb: ${{ parameters.storageGb }}
          multiAz: ${{ parameters.multiAz }}
    - id: create-terraform-pr
      name: Create Terraform PR
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=my-org&repo=infrastructure
        branchName: provision-db-${{ parameters.dbName }}
        title: 'DB Provisioning: ${{ parameters.dbName }} (${{ parameters.environment }})'
        description: |
          Automatically generated DB provisioning request.
          - DB Name: ${{ parameters.dbName }}
          - Environment: ${{ parameters.environment }}
          - Instance: ${{ parameters.instanceClass }}
          - Storage: ${{ parameters.storageGb }}GB
          - Multi-AZ: ${{ parameters.multiAz }}
        targetPath: terraform/rds/${{ parameters.environment }}/${{ parameters.dbName }}
        sourcePath: ./terraform-template
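Because the pull request ends up in the infrastructure repository, it can help to re-check the request there as well, for instance in a CI job on that repository. The sketch below mirrors the constraints from the form schema above (name pattern, environment and instance enums, storage bounds) and derives the same target path. `DbRequest`, `validate`, and `target_path` are hypothetical names introduced for illustration.

```python
import re
from dataclasses import dataclass

DB_NAME = re.compile(r"[a-z][a-z0-9_]*")
ENVIRONMENTS = ("dev", "staging", "production")
INSTANCE_CLASSES = ("db.t4g.medium", "db.r7g.large", "db.r7g.xlarge", "db.r7g.2xlarge")


@dataclass(frozen=True)
class DbRequest:
    db_name: str
    environment: str
    instance_class: str = "db.r7g.large"
    storage_gb: int = 100
    multi_az: bool = False


def validate(req: DbRequest) -> list:
    """Return the list of schema violations; empty means the request is acceptable."""
    errors = []
    if not DB_NAME.fullmatch(req.db_name):
        errors.append("db_name must match ^[a-z][a-z0-9_]*$")
    if req.environment not in ENVIRONMENTS:
        errors.append(f"environment must be one of {ENVIRONMENTS}")
    if req.instance_class not in INSTANCE_CLASSES:
        errors.append(f"instance_class must be one of {INSTANCE_CLASSES}")
    if not 20 <= req.storage_gb <= 16000:
        errors.append("storage_gb must be between 20 and 16000")
    return errors


def target_path(req: DbRequest) -> str:
    """Directory the Terraform files are written to, as in the template's targetPath."""
    return f"terraform/rds/{req.environment}/{req.db_name}"
```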
Play 8: Measuring Platform Success Metrics
Without measuring IDP performance, you cannot prove the return on investment. Track the following metrics quarterly.
Key Performance Indicators (KPIs)
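# platform_metrics.py - Platform KPI dashboard data collection
import statistics
from datetime import datetime, timedelta, timezone

import requests


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' (needed before Python 3.11)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


class PlatformMetrics:
    def __init__(self, github_token: str, backstage_url: str, org: str = "my-org"):
        self.github_headers = {"Authorization": f"token {github_token}"}
        self.backstage = backstage_url.rstrip("/")
        self.org = org

    def service_creation_lead_time(self) -> dict:
        """New service creation lead time (target: under 30 minutes)"""
        # Extract from the Backstage scaffolder task list.
        # NOTE: the createdAfter parameter and the items/createdAt/completedAt
        # fields are taken as given here; verify them against the scaffolder
        # API of your Backstage version, and add auth headers if it requires them.
        since = datetime.now(timezone.utc) - timedelta(days=90)
        response = requests.get(
            f"{self.backstage}/api/scaffolder/v2/tasks",
            params={"createdAfter": since.isoformat()},
            timeout=30,
        )
        response.raise_for_status()
        lead_times = sorted(
            (parse_timestamp(t["completedAt"]) - parse_timestamp(t["createdAt"])).total_seconds() / 60
            for t in response.json()["items"]
            if t["status"] == "completed"
        )
        if not lead_times:  # no completed tasks: avoid indexing an empty list
            return {"median_minutes": None, "p95_minutes": None, "total_services_created": 0}
        return {
            "median_minutes": statistics.median(lead_times),
            "p95_minutes": lead_times[int(len(lead_times) * 0.95)],
            "total_services_created": len(lead_times),
        }

    def _list_repos(self) -> list:
        """Fetch every internal repository of the org, one page of 100 at a time."""
        repos, page = [], 1
        while True:
            response = requests.get(
                f"https://api.github.com/orgs/{self.org}/repos",
                headers=self.github_headers,
                params={"per_page": 100, "type": "internal", "page": page},
                timeout=30,
            )
            response.raise_for_status()
            batch = response.json()
            repos.extend(batch)
            if len(batch) < 100:
                return repos
            page += 1

    def golden_path_adoption_rate(self) -> dict:
        """Golden path adoption rate (target: 80% or above)"""
        using_golden_path = 0
        total_active = 0
        for repo in self._list_repos():
            if repo["archived"]:
                continue
            total_active += 1
            # Heuristic: a repo counts as adopted when one of its workflow file
            # paths contains "golden". This relies on a naming convention and
            # does not inspect the reusable workflows a file actually calls.
            response = requests.get(
                f"https://api.github.com/repos/{self.org}/{repo['name']}/actions/workflows",
                headers=self.github_headers,
                timeout=30,
            )
            response.raise_for_status()
            workflows = response.json().get("workflows", [])
            if any("golden" in wf.get("path", "").lower() for wf in workflows):
                using_golden_path += 1
        return {
            "adoption_rate": using_golden_path / max(total_active, 1),
            "using_golden_path": using_golden_path,
            "total_active_repos": total_active,
        }

    def developer_nps(self) -> dict:
        """Developer satisfaction NPS (target: 40 or above)"""
        # Quarterly survey results (Google Forms / Typeform, etc.).
        # Integrate via API directly, or enter manually.
        # The values below are sample data, not real measurements.
        return {
            "nps_score": 42,  # promoters_pct - detractors_pct
            "promoters_pct": 55,
            "detractors_pct": 13,
            "response_rate": 0.72,
            "top_complaints": [
                "Build times are slow",
                "Log search UI is inconvenient",
                "Insufficient permission request automation",
            ],
        }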
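The NPS figures above are entered by hand, so it is worth showing how the score is derived from raw survey answers. On the standard 0 to 10 "how likely are you to recommend" scale, respondents scoring 9 or 10 are promoters and those scoring 0 to 6 are detractors; NPS is the percentage of promoters minus the percentage of detractors. The function name below is a hypothetical helper.

```python
def net_promoter_score(scores: list) -> float:
    """Compute NPS (-100 to 100) from a list of 0-10 survey answers."""
    if not scores:
        raise ValueError("at least one survey response is required")
    if any(not 0 <= s <= 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)
```

Passives (7 and 8) count toward the denominator only, which is why a survey full of 8s yields an NPS of 0.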
KPI Target Values
| Metric | Poor | Average | Good | Target |
|---|---|---|---|---|
| Service Creation Time | 1 week+ | 1–3 days | Within 1 hour | Within 30 min |
| Golden Path Adoption Rate | Below 30% | 30–60% | 60–80% | 80%+ |
| Developer NPS | Below 0 | 0–20 | 20–40 | 40+ |
| Onboarding Time | 2 weeks+ | 1–2 weeks | 2–5 days | Within 1 day |
| Infra Tickets per Month | 50+ | 20–50 | 5–20 | Below 5 |
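A dashboard can translate a measured value into one of the table's bands. The sketch below does this for the golden path adoption rate only. Since the table's ranges share their endpoints, it assumes each boundary value belongs to the better band (exactly 30% is "average", exactly 80% is "target"); the function name is hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the "average", "good", and "target" bands, as fractions.
ADOPTION_CUTOFFS = (0.30, 0.60, 0.80)
BAND_LABELS = ("poor", "average", "good", "target")


def adoption_band(rate: float) -> str:
    """Classify an adoption rate between 0.0 and 1.0 into a KPI band."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    return BAND_LABELS[bisect_right(ADOPTION_CUTOFFS, rate)]
```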
Play 9: Troubleshooting
Issue 1: Backstage Catalog Synchronization Delay
WARN: Entity refresh for component:order-service took 45s (threshold: 10s)
Cause: GitHub discovery is scanning hundreds of repositories and hitting the API rate limit.
# Solution: Limit scan scope + lengthen the refresh interval
catalog:
  providers:
    github:
      myOrg:
        organization: 'my-org'
        catalogPath: '/catalog-info.yaml'
        filters:
          repository: '^(?!archived-).*$' # Exclude repositories with archived- prefix
          topic:
            include: ['backstage-enabled'] # Topic-based filtering
        schedule:
          frequency: { minutes: 30 } # 30-minute interval (default is 5 minutes)
          timeout: { minutes: 5 }
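Before rolling out the filters, it is useful to estimate how many repositories they would leave in scope. The sketch below reproduces the two filters in plain Python, assuming a repository is scanned when its name passes the regular expression and it carries at least one of the included topics. It is an approximation for planning purposes, not Backstage's own implementation.

```python
import re

REPO_FILTER = re.compile(r"^(?!archived-).*$")  # same expression as in the config
INCLUDED_TOPICS = {"backstage-enabled"}


def is_scanned(repo_name: str, topics: list) -> bool:
    """Return True when the repository would remain in the discovery scope."""
    name_ok = REPO_FILTER.match(repo_name) is not None
    topic_ok = bool(INCLUDED_TOPICS & set(topics))
    return name_ok and topic_ok
```

Applying `is_scanned` to a listing of the organization's repositories gives a rough count of the API calls each refresh cycle will make.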
Issue 2: Software Template Execution Failure — GitHub Permissions
Error: Resource not accessible by integration
HttpError: 403 - Resource not accessible by integration
Cause: The GitHub App lacks sufficient permissions, or it doesn't have access to the organization where the repository is being created.
# Check GitHub App permissions
# Settings > Developer settings > GitHub Apps > [App Name] > Permissions
# Required permissions:
# - Repository: Administration (Read & Write)
# - Repository: Contents (Read & Write)
# - Organization: Members (Read)
# Or when using a Personal Access Token (PAT), required scopes:
# repo, workflow, admin:org
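For the PAT case, GitHub reports the scopes of a classic token in the `X-OAuth-Scopes` response header (fine-grained tokens do not use this header). A small sketch of comparing that header against the scopes listed above follows; `missing_scopes` is a hypothetical helper, and fetching the header from an API response is left out.

```python
REQUIRED_SCOPES = {"repo", "workflow", "admin:org"}


def missing_scopes(x_oauth_scopes: str) -> set[str]:
    """Return the required scopes absent from an X-OAuth-Scopes header value."""
    granted = {s.strip() for s in x_oauth_scopes.split(",") if s.strip()}
    return REQUIRED_SCOPES - granted


# Example header value from a token created without org administration rights
gaps = missing_scopes("repo, workflow")
```

A non-empty result (here `{"admin:org"}`) points at the scope to add before re-running the template.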
Issue 3: TechDocs Build Failure
mkdocs build failed: No module named 'techdocs_core'
# Solution: Install the plugin in the TechDocs build environment
pip install mkdocs-techdocs-core
# When using Docker build
docker run --rm -v "$(pwd)":/content \
  spotify/techdocs:latest \
  build --site-dir /content/site
# Configure build method in app-config.yaml
# techdocs:
#   builder: 'external' # Build in CI
#   publisher:
#     type: 'awsS3'
#     awsS3:
#       bucketName: 'my-org-techdocs'
#       region: 'ap-northeast-2'
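When the build runs in CI, a preflight check can surface the missing module before `mkdocs build` fails. This is a minimal sketch using only the standard library; `missing_modules` is an illustrative helper, and the module names are taken from the error message above plus `mkdocs` itself.

```python
import importlib.util


def missing_modules(modules: list[str]) -> list[str]:
    """Return the top-level modules that cannot be found in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


missing = missing_modules(["mkdocs", "techdocs_core"])
if missing:
    print(f"Missing in this environment: {', '.join(missing)}")
```

If `techdocs_core` is reported, installing `mkdocs-techdocs-core` in the same environment that runs the build should resolve it.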
Issue 4: Platform Adoption Rate Won't Increase
This is an organizational problem, not a technical one.
Resolution Strategy:
- Secure champion teams first: Select 2–3 early adopter teams and share their success stories internally.
- Make the golden path the easy path: Do not invest in smoothing the manual route (about 2 weeks of manual deployment and monitoring setup); put that effort into making the golden path as convenient as possible.
- Don't force it: Hold a monthly "Platform Day" with demos and feedback sessions.
- Prove it with data: Share metrics like "Teams using the golden path deploy 3x more frequently."
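A claim such as "3x more frequently" is easier to defend when the calculation is explicit. One way to compute it, assuming weekly deployment counts per team are already collected, is sketched below; the function name, team names, and numbers are hypothetical.

```python
from statistics import mean


def deploy_frequency_ratio(deploys_per_week: dict[str, int],
                           golden_path_teams: set[str]) -> float:
    """Average weekly deploys of golden-path teams divided by everyone else's."""
    on_path = [n for team, n in deploys_per_week.items() if team in golden_path_teams]
    off_path = [n for team, n in deploys_per_week.items() if team not in golden_path_teams]
    if not on_path or not off_path or mean(off_path) == 0:
        raise ValueError("need non-zero deploy data for both groups")
    return mean(on_path) / mean(off_path)


# Hypothetical weekly deployment counts
deploys = {"checkout": 12, "search": 9, "billing": 3, "reports": 4}
ratio = deploy_frequency_ratio(deploys, golden_path_teams={"checkout", "search"})
```

With these sample numbers the golden-path teams average 10.5 deploys per week against 3.5 for the others, a ratio of 3.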
Play 10: Phased IDP Roadmap Implementation
Trying to build everything at once will lead to failure. Divide the effort into three phases and build incrementally.
Phase 1 (1–3 months): Foundation
- Build the Software Catalog (register all services, teams, APIs)
- Standardize CI golden path (GitHub Actions reusable workflows)
- Create 1–2 service creation templates
Phase 2 (4–6 months): Expansion
- CD golden path (Argo Rollouts canary deployments)
- TechDocs integration (runbooks, ADRs)
- Self-service infrastructure provisioning (databases, caches)
- Cost tagging and dashboards
Phase 3 (7–12 months): Maturity
- Automated security policy enforcement (OPA/Kyverno)
- Automated DORA metrics collection and dashboards
- Internal marketplace (shared libraries, plugins)
- One-click development environment provisioning
Quiz
Q1. What characterizes an organization where IDP adoption is premature?
Answer: ||An organization with fewer than 20 services, a low ratio of repetitive ticket processing by the infrastructure team, and the ability to create new services within one week. In such cases, simple script automation or standard CI/CD templates are sufficient rather than an IDP.||
Q2. Why is a Platform PM needed when building a platform team?
Answer: ||Since an IDP is an internal product, it requires collecting customer (developer) requirements, prioritizing, and measuring adoption rates. If composed only of engineers, the team tends to skew toward technically interesting features rather than what developers actually need.||
Q3. Why is the dependsOn field in catalog-info.yaml important in the Backstage Software Catalog?
Answer: ||It explicitly declares inter-service dependencies, enabling immediate identification of the blast radius during incidents. If order-service has a dependsOn on payment-service, you can instantly see in the catalog that a payment-service outage would also affect order-service.||
Q4. Why include the ArgoCD app registration step in a Software Template?
Answer: ||A GitOps-based deployment pipeline is automatically configured at the same time the service is created, so deployment begins immediately when a developer pushes code. Without this step, developers would have to create a separate ticket to request ArgoCD configuration, which is a major cause of increased onboarding time.||
Q5. How do you increase platform adoption from below 50% through appeal rather than enforcement?
Answer: ||Maintain the inconvenience of not following the golden path (2 weeks for manual deployment, manual monitoring setup) while maximizing the golden path's convenience (service creation in 30 minutes, automatic deployment, automatic monitoring). Share champion teams' success stories and prove the impact with metrics.||
Q6. Why should IDP construction be divided into three phases?
Answer: ||Building all features at once takes 12+ months, and the project risks being canceled before ROI can be demonstrated. By delivering value quickly with the catalog and CI standardization in Phase 1 (3 months), you can secure investment for Phases 2 and 3 based on those results.||
Q7. What should you watch out for when measuring developer NPS?
Answer: ||The response rate must be 70% or higher for the metric to be meaningful. Also, don't just look at the NPS score; analyze the specific complaints from detractors. Track top_complaints quarterly to verify improvements, and publicly announce resolved items to close the feedback loop.||