- Play 1: Determining When You Need an IDP
- Play 2: Building the Platform Team and Defining Roles
- Play 3: Choosing the Technology Stack
- Play 4: Backstage Setup and Software Catalog
- Play 5: Service Templates (Scaffolding)
- Play 6: Unifying Documentation with TechDocs
- Play 7: Self-Service Infrastructure Provisioning
- Play 8: Measuring Platform Success Metrics
- Play 9: Troubleshooting
- Play 10: Phased IDP Roadmap Implementation
- Quiz
- References
Play 1: Determining When You Need an IDP
An Internal Developer Platform (IDP) lets developers create, deploy, and monitor services through self-service, without filing infrastructure requests. Gartner has predicted that by 2026, 80% of large software engineering organizations will have established platform engineering teams.
However, not every organization needs an IDP. If three or more of the following conditions apply, it's time to consider adopting an IDP.
Adoption Signal Checklist:
- Number of services exceeds 20
- Creating a new service takes more than one week
- Each team has different CI/CD pipelines with no standards
- Onboarding takes more than 2 weeks (dev environment setup, permission requests, etc.)
- The infrastructure team spends more than 80% of its time handling repetitive Jira tickets
- Rollback procedures after deployment failures differ across teams or are undocumented
- Cost attribution is impossible
If fewer than three conditions apply, simple shell script automation or standardized GitHub Actions templates are sufficient, and an IDP is not needed yet.
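The checklist and its three-signal threshold can be written down as a small self-assessment. This is a minimal sketch: the field names and the wording of the two recommendations are hypothetical, and only the seven signals and the threshold of three come from the checklist above.

```python
from dataclasses import dataclass, fields


@dataclass
class AdoptionSignals:
    """One boolean per item in the adoption signal checklist (field names are illustrative)."""
    more_than_20_services: bool = False
    new_service_takes_over_a_week: bool = False
    no_ci_cd_standard: bool = False
    onboarding_over_two_weeks: bool = False
    infra_team_mostly_on_tickets: bool = False
    rollback_inconsistent_or_undocumented: bool = False
    cost_attribution_impossible: bool = False


def count_signals(signals: AdoptionSignals) -> int:
    """Number of checklist items that apply."""
    return sum(bool(getattr(signals, f.name)) for f in fields(signals))


def recommend(signals: AdoptionSignals, threshold: int = 3) -> str:
    """Apply the 'three or more signals' rule from the checklist."""
    if count_signals(signals) >= threshold:
        return "consider an IDP"
    return "scripts or shared GitHub Actions templates are enough"


if __name__ == "__main__":
    org = AdoptionSignals(more_than_20_services=True,
                          no_ci_cd_standard=True,
                          cost_attribution_impossible=True)
    print(count_signals(org), recommend(org))
```

Keeping the threshold as a parameter makes it easy to rerun the assessment with a stricter or looser rule as the organization grows.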
Play 2: Building the Platform Team and Defining Roles
An IDP is a product. It should not be a side project built by a few infrastructure engineers. It requires a dedicated team operating with a product mindset.
Team Composition (for organizations of 50–200 developers)
| Role | Headcount | Key Responsibilities |
|---|---|---|
| Platform PM | 1 | Collecting developer requirements, roadmap management, tracking adoption rate |
| Platform Engineer | 2–3 | Infrastructure abstraction, API/UI development, golden path design |
| SRE / DevOps | 1–2 | Monitoring pipelines, on-call, incident response automation |
| Developer Advocate | 0.5 (shared) | Documentation, onboarding guides, internal training |
Core Principles:
- The platform team's customer is the internal developer. Measure NPS (Net Promoter Score) every quarter.
- Drive adoption through appeal, not enforcement. Design the platform so that following the golden path takes 30 minutes, while going without it takes 2 weeks.
- Keep the feedback loop under 2 weeks. Every developer feature request must receive at least an initial response (an implementation plan or a reason for rejection) within 2 weeks.
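For the quarterly NPS measurement, the score is the percentage of promoters minus the percentage of detractors. The sketch below uses the standard NPS convention on a 0 to 10 scale (9 to 10 are promoters, 0 to 6 are detractors); the function name and the sample responses are illustrative only.

```python
def net_promoter_score(scores: list[int]) -> float:
    """Return NPS in the range -100..100 for 0-10 survey answers."""
    if not scores:
        raise ValueError("no survey responses")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


if __name__ == "__main__":
    # Illustrative answers to "How likely are you to recommend the platform?"
    answers = [10, 9, 8, 7, 6, 3, 10, 9, 9, 10]
    print(net_promoter_score(answers))  # 6 promoters, 2 detractors, 10 answers
```

Passive answers (7 and 8) count toward the denominator only, which is why a survey full of lukewarm responses scores close to zero.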
Play 3: Choosing the Technology Stack
Backstage vs. Build-Your-Own vs. SaaS Comparison
| Criteria | Backstage (Open Source) | Build-Your-Own | SaaS (Port/Cortex, etc.) |
|---|---|---|---|
| Initial Cost | Medium (3–6 months to build) | High (6–12 months) | Low (start immediately) |
| Customization | High (plugin ecosystem) | Highest | Limited |
| Maintenance Burden | High (upgrades, security patches) | Very High | None (vendor responsibility) |
| Org Size Fit | 100+ | 500+ | 50–300 |
| Vendor Lock-in | None | None | High |
| Plugins/Integrations | Hundreds of community plugins | Only what you need | Vendor-provided scope |
Recommended Strategy: For organizations with fewer than 100 people, start with SaaS. For 100–500, adopt Backstage but also consider managed Backstage options like Roadie. For 500+, have a dedicated team customize and operate Backstage.
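The recommended strategy reduces to a lookup on organization size. The following is a rough sketch of that rule, not a decision tool: the function name is hypothetical, and treating exactly 500 people as part of the top bracket is an assumption, since the text lists both "100–500" and "500+".

```python
def recommend_stack(org_size: int) -> str:
    """Map organization size to the strategy recommended in this play."""
    if org_size < 0:
        raise ValueError("org_size must be non-negative")
    if org_size < 100:
        return "SaaS (Port, Cortex, etc.)"
    if org_size < 500:  # assumption: 500 itself falls into the top bracket
        return "Backstage, or managed Backstage such as Roadie"
    return "Backstage, customized and operated by a dedicated team"


if __name__ == "__main__":
    for size in (40, 250, 1200):
        print(size, "->", recommend_stack(size))
```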
Play 4: Backstage Setup and Software Catalog
Installing Backstage
```bash
# Create a new project using the Backstage CLI
npx @backstage/create-app@latest --skip-install

# Resulting directory structure
# my-backstage/
# ├── app-config.yaml              # Core configuration file
# ├── app-config.production.yaml
# ├── packages/
# │   ├── app/                     # Frontend (React)
# │   └── backend/                 # Backend (Node.js)
# ├── plugins/                     # Custom plugins
# ├── catalog-info.yaml            # Catalog registration for this project itself
# └── package.json

# Install dependencies and run
cd my-backstage
yarn install
yarn dev
# Access at http://localhost:3000
```
Key app-config.yaml Settings
```yaml
# app-config.yaml
app:
  title: 'MyOrg Developer Platform'
  baseUrl: http://localhost:3000

organization:
  name: 'MyOrg'

backend:
  baseUrl: http://localhost:7007
  database:
    client: pg
    connection:
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}

# GitHub integration (automatic service catalog discovery)
integrations:
  github:
    - host: github.com
      token: ${GITHUB_TOKEN}

# Software Catalog settings
catalog:
  import:
    entityFilename: catalog-info.yaml
  rules:
    - allow: [Component, System, API, Resource, Location, Group, User]
  locations:
    # Automatically discover catalog-info.yaml from all repositories in the organization
    - type: github-discovery
      target: https://github.com/my-org/*/blob/main/catalog-info.yaml
    # Manual registration
    - type: file
      target: ./catalog-entities/all-systems.yaml

# Authentication (GitHub OAuth)
auth:
  environment: development
  providers:
    github:
      development:
        clientId: ${GITHUB_OAUTH_CLIENT_ID}
        clientSecret: ${GITHUB_OAUTH_CLIENT_SECRET}
```
Service Catalog Registration Standard
Place a catalog-info.yaml at the root of each service repository.
```yaml
# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: 'Order processing microservice'
  annotations:
    github.com/project-slug: my-org/order-service
    backstage.io/techdocs-ref: dir:.
    datadoghq.com/dashboard-url: https://app.datadoghq.com/dashboard/abc-123
    pagerduty.com/service-id: PXXXXXX
    argocd/app-name: order-service-prod
  tags:
    - java
    - spring-boot
    - tier-1
  links:
    - url: https://order.internal.example.com
      title: Production URL
    - url: https://grafana.internal.example.com/d/order-service
      title: Grafana Dashboard
spec:
  type: service
  lifecycle: production
  owner: team-commerce
  system: commerce-platform
  dependsOn:
    - component:payment-service
    - resource:orders-database
  providesApis:
    - order-api
  consumesApis:
    - payment-api
    - inventory-api
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: order-api
  description: 'Order REST API'
spec:
  type: openapi
  lifecycle: production
  owner: team-commerce
  definition:
    $text: ./docs/openapi.yaml
```
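The `dependsOn` entries form a directed graph, and walking it in reverse answers the question "what breaks if this entity goes down?". The sketch below illustrates that idea on plain dictionaries; it does not call the Backstage catalog API, and `component:checkout-web` is a hypothetical entity added only to show a transitive dependency.

```python
from collections import defaultdict, deque


def blast_radius(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every entity that depends on `failed`, directly or transitively."""
    # Invert the graph: for each dependency, record who depends on it.
    dependents: dict[str, set[str]] = defaultdict(set)
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(entity)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:  # breadth-first walk over reverse edges
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent != failed and dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


if __name__ == "__main__":
    catalog = {
        "component:order-service": ["component:payment-service",
                                    "resource:orders-database"],
        "component:checkout-web": ["component:order-service"],  # hypothetical
        "component:payment-service": [],
    }
    print(sorted(blast_radius(catalog, "component:payment-service")))
```

This is why keeping `dependsOn` accurate matters: the result is only as complete as the declared edges.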
Play 5: Service Templates (Scaffolding)
Backstage Software Templates create new services with a standardized structure. A developer can have a service with CI/CD, monitoring, and catalog registration all set up within 30 minutes.
```yaml
# templates/spring-boot-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: spring-boot-service
  title: 'Spring Boot Microservice'
  description: 'Creates a Spring Boot service with auto-configured CI/CD, monitoring, and catalog registration'
  tags:
    - java
    - spring-boot
    - recommended
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service Information
      required:
        - serviceName
        - ownerTeam
        - tier
      properties:
        serviceName:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: 'Only lowercase letters, numbers, and hyphens allowed'
        ownerTeam:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['tier-1', 'tier-2', 'tier-3']
          enumNames: ['Tier 1 (99.99% SLO)', 'Tier 2 (99.9% SLO)', 'Tier 3 (99% SLO)']
        javaVersion:
          title: Java Version
          type: string
          default: '21'
          enum: ['17', '21']
    - title: Infrastructure Settings
      properties:
        database:
          title: Database
          type: string
          default: 'postgresql'
          enum: ['postgresql', 'mysql', 'none']
        messageQueue:
          title: Message Queue
          type: string
          default: 'none'
          enum: ['kafka', 'rabbitmq', 'none']
        replicaCount:
          title: Default Replica Count
          type: integer
          default: 3
          minimum: 1
          maximum: 20
  steps:
    # 1. Generate repository from template
    - id: fetch-template
      name: Generate Template Code
      action: fetch:template
      input:
        url: ./skeleton
        values:
          serviceName: ${{ parameters.serviceName }}
          ownerTeam: ${{ parameters.ownerTeam }}
          tier: ${{ parameters.tier }}
          javaVersion: ${{ parameters.javaVersion }}
          database: ${{ parameters.database }}
          replicaCount: ${{ parameters.replicaCount }}
    # 2. Create GitHub repository
    - id: publish
      name: Create GitHub Repository
      action: publish:github
      input:
        allowedHosts: ['github.com']
        repoUrl: github.com?owner=my-org&repo=${{ parameters.serviceName }}
        defaultBranch: main
        repoVisibility: internal
        protectDefaultBranch: true
        requireCodeOwnerReviews: true
    # 3. Register ArgoCD application
    - id: register-argocd
      name: Register ArgoCD Application
      action: argocd:create-resources
      input:
        appName: ${{ parameters.serviceName }}-prod
        argoInstance: main
        namespace: ${{ parameters.serviceName }}
        repoUrl: https://github.com/my-org/${{ parameters.serviceName }}
        path: k8s/overlays/production
    # 4. Register in Backstage catalog
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
  output:
    links:
      - title: Repository
        url: ${{ steps['publish'].output.remoteUrl }}
      - title: Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}
```
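The scaffolder form enforces the `parameters` constraints itself, so a separate check is only useful when parameters are prepared outside the UI, for example in a test or a bulk-creation script. The sketch below mirrors a few of the constraints from the template above (required fields, the `serviceName` pattern, the tier enum, and the replica range). The function name and error messages are hypothetical, and it is not a substitute for Backstage's own validation.

```python
import re

# Same pattern as the template; fullmatch() makes the ^...$ anchors implicit.
SERVICE_NAME_PATTERN = re.compile(r"[a-z][a-z0-9-]*")
TIERS = ("tier-1", "tier-2", "tier-3")
REQUIRED = ("serviceName", "ownerTeam", "tier")


def validate_parameters(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = [f"{key} is required" for key in REQUIRED if not params.get(key)]

    name = params.get("serviceName", "")
    if name and not SERVICE_NAME_PATTERN.fullmatch(name):
        errors.append("serviceName must match ^[a-z][a-z0-9-]*$")

    tier = params.get("tier")
    if tier and tier not in TIERS:
        errors.append(f"tier must be one of {', '.join(TIERS)}")

    replicas = params.get("replicaCount", 3)  # template default is 3
    if isinstance(replicas, bool) or not isinstance(replicas, int) or not 1 <= replicas <= 20:
        errors.append("replicaCount must be an integer between 1 and 20")
    return errors


if __name__ == "__main__":
    print(validate_parameters({"serviceName": "order-service",
                               "ownerTeam": "team-commerce",
                               "tier": "tier-1"}))
    print(validate_parameters({"serviceName": "Order_Service",
                               "ownerTeam": "team-commerce",
                               "tier": "tier-9",
                               "replicaCount": 50}))
```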
Play 6: Unifying Documentation with TechDocs
By consolidating scattered documentation into Backstage TechDocs, you can view technical documentation directly from the service catalog.
```yaml
# mkdocs.yml (root of each service repository)
site_name: order-service
nav:
  - Home: index.md
  - Architecture: architecture.md
  - API Reference: api.md
  - Runbook: runbook.md
  - ADR:
      - adr/001-database-choice.md
      - adr/002-event-schema.md
plugins:
  - techdocs-core
```
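A common cause of broken TechDocs pages is a `nav` entry that points at a file that does not exist. MkDocs resolves `nav` paths relative to its docs directory (`docs/` by default), so a check can be sketched as below. To keep the example dependency-free, the `nav` structure is written as a Python literal instead of being parsed from `mkdocs.yml`; the function names are hypothetical.

```python
from pathlib import Path

# The nav section of the mkdocs.yml above, as a Python structure.
NAV = [
    {"Home": "index.md"},
    {"Architecture": "architecture.md"},
    {"API Reference": "api.md"},
    {"Runbook": "runbook.md"},
    {"ADR": ["adr/001-database-choice.md", "adr/002-event-schema.md"]},
]


def nav_files(nav: list) -> list[str]:
    """Flatten a (possibly nested) MkDocs nav into a list of file paths."""
    files: list[str] = []
    for item in nav:
        if isinstance(item, str):
            files.append(item)
        elif isinstance(item, dict):
            for value in item.values():
                if isinstance(value, str):
                    files.append(value)
                else:
                    files.extend(nav_files(value))
    return files


def missing_docs(nav: list, docs_dir: str = "docs") -> list[str]:
    """Return nav entries that have no matching file under docs_dir."""
    root = Path(docs_dir)
    return [f for f in nav_files(nav) if not (root / f).is_file()]


if __name__ == "__main__":
    for path in missing_docs(NAV):
        print(f"missing: docs/{path}")
```

Running a check like this in CI catches a renamed or deleted page before the TechDocs build fails.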
````markdown
<!-- docs/runbook.md -->
# Order Service Operations Runbook

## Incident Response

### Order Processing Delay (P95 > 500ms)

1. Check Grafana dashboard: [Link]
2. Check DB connection pool status:
   ```bash
   kubectl exec -it deploy/order-service -- curl localhost:8080/actuator/metrics/hikaricp.connections.active
   ```
3. If connection pool is saturated:
   ```bash
   kubectl scale deploy/order-service --replicas=6
   ```
4. Check DB slow queries:
   ```sql
   SELECT pid, now() - query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active' AND now() - query_start > interval '5s';
   ```

### Order Creation Failure (HTTP 500)

1. Check error logs:
   ```bash
   kubectl logs deploy/order-service --tail=100 | grep ERROR
   ```
2. Response by error code:
   - ORDER-001: Payment service connection failure -> Check payment-service status
   - ORDER-002: Insufficient inventory -> Check inventory-service synchronization
   - ORDER-003: DB deadlock -> Check transaction isolation level
````
Play 7: Self-Service Infrastructure Provisioning
Enable developers to provision databases, message queues, caches, and more directly from the Backstage UI. Actual infrastructure creation is handled via Terraform + GitOps.
```yaml
# templates/provision-postgresql/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: provision-postgresql
  title: 'PostgreSQL Database Provisioning'
  description: 'Self-service provisioning of RDS PostgreSQL instances'
spec:
  owner: team-platform
  type: resource
  parameters:
    - title: Database Settings
      required:
        - dbName
        - environment
        - instanceClass
      properties:
        dbName:
          title: DB Name
          type: string
          pattern: '^[a-z][a-z0-9_]*$'
        environment:
          title: Environment
          type: string
          enum: ['dev', 'staging', 'production']
        instanceClass:
          title: Instance Size
          type: string
          default: 'db.r7g.large'
          enum:
            - 'db.t4g.medium'
            - 'db.r7g.large'
            - 'db.r7g.xlarge'
            - 'db.r7g.2xlarge'
          enumNames:
            - 'Small (2 vCPU, 4GB) - dev/staging'
            - 'Medium (2 vCPU, 16GB) - production'
            - 'Large (4 vCPU, 32GB) - production'
            - 'XLarge (8 vCPU, 64GB) - high traffic'
        storageGb:
          title: Storage (GB)
          type: integer
          default: 100
          minimum: 20
          maximum: 16000
        multiAz:
          title: Multi-AZ Deployment
          type: boolean
          default: false
  steps:
    - id: create-terraform-pr
      name: Create Terraform PR
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=my-org&repo=infrastructure
        branchName: provision-db-${{ parameters.dbName }}
        title: 'DB Provisioning: ${{ parameters.dbName }} (${{ parameters.environment }})'
        description: |
          Automatically generated DB provisioning request.
          - DB Name: ${{ parameters.dbName }}
          - Environment: ${{ parameters.environment }}
          - Instance: ${{ parameters.instanceClass }}
          - Storage: ${{ parameters.storageGb }}GB
          - Multi-AZ: ${{ parameters.multiAz }}
        targetPath: terraform/rds/${{ parameters.environment }}/${{ parameters.dbName }}
        sourcePath: ./terraform-template
```
Play 8: Measuring Platform Success Metrics
Without measuring IDP performance, you cannot prove the return on investment. Track the following metrics quarterly.
Key Performance Indicators (KPIs)
```python
# platform_metrics.py - Platform KPI dashboard data collection
import statistics
from datetime import datetime, timedelta, timezone

import requests


def _parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' on Python < 3.11."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


class PlatformMetrics:
    def __init__(self, github_token: str, backstage_url: str):
        self.github = github_token
        self.backstage = backstage_url.rstrip("/")

    def _github_get(self, url: str, **params):
        """GET a GitHub API URL and fail loudly on HTTP errors."""
        response = requests.get(
            url,
            headers={"Authorization": f"token {self.github}"},
            params=params,
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    def service_creation_lead_time(self) -> dict:
        """New service creation lead time (target: under 30 minutes)"""
        # Extract from Backstage scaffolder task history.
        # NOTE: the query parameter and the response fields ("items", "completedAt")
        # are assumptions; verify them against your Backstage version.
        since = datetime.now(timezone.utc) - timedelta(days=90)
        response = requests.get(
            f"{self.backstage}/api/scaffolder/v2/tasks",
            params={"createdAfter": since.isoformat()},
            timeout=30,
        )
        response.raise_for_status()
        lead_times = sorted(
            (_parse_ts(t["completedAt"]) - _parse_ts(t["createdAt"])).total_seconds() / 60
            for t in response.json()["items"]
            if t["status"] == "completed"
        )
        if not lead_times:  # no completed tasks in the window
            return {"median_minutes": None, "p95_minutes": None, "total_services_created": 0}
        p95_index = min(int(len(lead_times) * 0.95), len(lead_times) - 1)
        return {
            "median_minutes": statistics.median(lead_times),
            "p95_minutes": lead_times[p95_index],
            "total_services_created": len(lead_times),
        }

    def golden_path_adoption_rate(self) -> dict:
        """Golden path adoption rate (target: 80% or above)"""
        # Page through all internal repositories; one page holds at most 100.
        repos, page = [], 1
        while True:
            batch = self._github_get(
                "https://api.github.com/orgs/my-org/repos",
                per_page=100, type="internal", page=page,
            )
            if not batch:
                break
            repos.extend(batch)
            page += 1

        using_golden_path = 0
        total_active = 0
        for repo in repos:
            if repo["archived"]:
                continue
            total_active += 1
            # Heuristic: a workflow file whose path contains "golden" counts as adoption.
            workflows = self._github_get(
                f"https://api.github.com/repos/my-org/{repo['name']}/actions/workflows"
            )
            if any("golden" in wf.get("path", "").lower() for wf in workflows.get("workflows", [])):
                using_golden_path += 1
        return {
            "adoption_rate": using_golden_path / max(total_active, 1),
            "using_golden_path": using_golden_path,
            "total_active_repos": total_active,
        }

    def developer_nps(self) -> dict:
        """Developer satisfaction NPS (target: 40 or above)"""
        # Quarterly survey results (Google Forms / Typeform, etc.).
        # Integrate via API directly, or enter manually; the values below are placeholders.
        return {
            "nps_score": 42,  # promoters_pct - detractors_pct
            "promoters_pct": 55,
            "detractors_pct": 13,
            "response_rate": 0.72,
            "top_complaints": [
                "Build times are slow",
                "Log search UI is inconvenient",
                "Insufficient permission request automation",
            ],
        }
```
KPI Target Values
| Metric | Poor | Average | Good | Target |
|---|---|---|---|---|
| Service Creation Time | 1 week+ | 1–3 days | 1 hour | 30 min |
| Golden Path Adoption Rate | Below 30% | 30–60% | 60–80% | 80%+ |
| Developer NPS | Below 0 | 0–20 | 20–40 | 40+ |
| Onboarding Time | 2 weeks+ | 1–2 weeks | 2–5 days | 1 day |
| Infra Tickets per Month | 50+ | 20–50 | 5–20 | Below 5 |
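The bands in the table can be applied mechanically once the thresholds are written down. The sketch below covers the two "higher is better" rows, adoption rate and NPS. It assumes a value that sits exactly on a boundary (for example 60%) belongs to the better band, which the table leaves open; the metric keys and function name are hypothetical.

```python
from bisect import bisect_right

BANDS = ("Poor", "Average", "Good", "Target")

# Lower bounds of the Average, Good, and Target bands, taken from the table.
THRESHOLDS = {
    "golden_path_adoption_pct": (30, 60, 80),
    "developer_nps": (0, 20, 40),
}


def rate(metric: str, value: float) -> str:
    """Return the band name for a 'higher is better' metric."""
    # bisect_right counts how many lower bounds the value has reached.
    return BANDS[bisect_right(THRESHOLDS[metric], value)]


if __name__ == "__main__":
    print(rate("golden_path_adoption_pct", 72))
    print(rate("developer_nps", 42))
```

Metrics where lower is better, such as infra tickets per month, would need the comparison reversed and are left out of this sketch.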
Play 9: Troubleshooting
Issue 1: Backstage Catalog Synchronization Delay
WARN: Entity refresh for component:order-service took 45s (threshold: 10s)
Cause: GitHub discovery is scanning hundreds of repositories and hitting the API rate limit.
```yaml
# Solution: Limit scan scope + configure caching
catalog:
  providers:
    github:
      myOrg:
        organization: 'my-org'
        catalogPath: '/catalog-info.yaml'
        filters:
          repository: '^(?!archived-).*$' # Exclude repositories with archived- prefix
          topic:
            include: ['backstage-enabled'] # Topic-based filtering
        schedule:
          frequency: { minutes: 30 } # 30-minute interval (default is 5 minutes)
          timeout: { minutes: 5 }
```
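A back-of-the-envelope estimate shows why both the interval and the repository filter matter. The sketch below assumes one API request per repository per refresh, which is a deliberate simplification (the real number depends on how the provider queries GitHub), and compares the result with the 5,000 requests per hour that GitHub documents for authenticated REST requests made with a personal access token. The repository count of 800 is illustrative.

```python
GITHUB_HOURLY_LIMIT = 5000  # authenticated REST requests per hour with a PAT


def requests_per_hour(repo_count: int, interval_minutes: float,
                      requests_per_repo: int = 1) -> float:
    """Estimated API requests per hour for periodic discovery."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    refreshes_per_hour = 60 / interval_minutes
    return repo_count * requests_per_repo * refreshes_per_hour


def exceeds_limit(repo_count: int, interval_minutes: float,
                  limit: int = GITHUB_HOURLY_LIMIT) -> bool:
    """True if the estimate alone would exhaust the hourly budget."""
    return requests_per_hour(repo_count, interval_minutes) > limit


if __name__ == "__main__":
    for interval in (5, 30):
        estimate = requests_per_hour(800, interval)
        print(f"{interval}-minute interval: ~{estimate:.0f} requests/hour, "
              f"over limit: {exceeds_limit(800, interval)}")
```

Under these assumptions, moving from a 5-minute to a 30-minute interval cuts the request volume by a factor of six, and the topic filter reduces `repo_count` further.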
Issue 2: Software Template Execution Failure — GitHub Permissions
Error: Resource not accessible by integration
HttpError: 403 - Resource not accessible by integration
Cause: The GitHub App lacks sufficient permissions, or it doesn't have access to the organization where the repository is being created.
Check the GitHub App permissions under Settings > Developer settings > GitHub Apps > [App Name] > Permissions. Required permissions:
- Repository: Administration (Read & Write)
- Repository: Contents (Read & Write)
- Organization: Members (Read)
If you use a Personal Access Token (PAT) instead, the required scopes are repo, workflow, and admin:org.
Issue 3: TechDocs Build Failure
mkdocs build failed: No module named 'techdocs_core'
```bash
# Solution: Install the plugin in the TechDocs build environment
pip install mkdocs-techdocs-core

# When using Docker build
docker run --rm -v "$(pwd)":/content \
  spotify/techdocs:latest \
  build --site-dir /content/site
```
```yaml
# Configure build method in app-config.yaml
techdocs:
  builder: 'external' # Build in CI
  publisher:
    type: 'awsS3'
    awsS3:
      bucketName: 'my-org-techdocs'
      region: 'ap-northeast-2'
```
Issue 4: Platform Adoption Rate Won't Increase
This is not a technical problem. It is an organizational problem.
Resolution Strategy:
- Secure champion teams first: Select 2–3 early adopter teams and share their success stories internally.
- Make the golden path the easiest route: Leave the manual route as it is (2 weeks of manual deployment and monitoring setup) and put the effort into making the golden path as convenient as possible.
- Don't force it: Hold a monthly "Platform Day" with demos and feedback sessions.
- Prove it with data: Share metrics like "Teams using the golden path deploy 3x more frequently."
Play 10: Phased IDP Roadmap Implementation
Trying to build everything at once will lead to failure. Divide the effort into three phases and build incrementally.
Phase 1 (1–3 months): Foundation
- Build the Software Catalog (register all services, teams, APIs)
- Standardize CI golden path (GitHub Actions reusable workflows)
- Create 1–2 service creation templates
Phase 2 (4–6 months): Expansion
- CD golden path (Argo Rollouts canary deployments)
- TechDocs integration (runbooks, ADRs)
- Self-service infrastructure provisioning (databases, caches)
- Cost tagging and dashboards
Phase 3 (7–12 months): Maturity
- Automated security policy enforcement (OPA/Kyverno)
- Automated DORA metrics collection and dashboards
- Internal marketplace (shared libraries, plugins)
- One-click development environment provisioning
Quiz
Q1. What characterizes an organization where IDP adoption is premature?
Answer: ||An organization with fewer than 20 services, a low ratio of repetitive ticket processing by the infrastructure team, and the ability to create new services within one week. In such cases, simple script automation or standard CI/CD templates are sufficient rather than an IDP.||
Q2. Why is a Platform PM needed when building a platform team?
Answer: ||Since an IDP is an internal product, it requires collecting customer (developer) requirements, prioritizing, and measuring adoption rates. If composed only of engineers, the team tends to skew toward technically interesting features rather than what developers actually need.||
Q3. Why is the dependsOn field in catalog-info.yaml important in the Backstage Software Catalog?
Answer: ||It explicitly declares inter-service dependencies, enabling immediate identification of the blast radius during incidents. If order-service has a dependsOn on payment-service, you can instantly see in the catalog that a payment-service outage would also affect order-service.||
Q4. Why include the ArgoCD app registration step in a Software Template?
Answer: ||A GitOps-based deployment pipeline is automatically configured at the same time the service is created, so deployment begins immediately when a developer pushes code. Without this step, developers would have to create a separate ticket to request ArgoCD configuration, which is a major cause of increased onboarding time.||
Q5. How do you increase platform adoption from below 50% through appeal rather than enforcement?
Answer: ||Maintain the inconvenience of not following the golden path (2 weeks for manual deployment, manual monitoring setup) while maximizing the golden path's convenience (service creation in 30 minutes, automatic deployment, automatic monitoring). Share champion teams' success stories and prove the impact with metrics.||
Q6. Why should IDP construction be divided into three phases?
Answer: ||Building all features at once takes 12+ months, and the project risks being canceled before ROI can be demonstrated. By delivering value quickly with the catalog and CI standardization in Phase 1 (3 months), you can secure investment for Phases 2 and 3 based on those results.||
Q7. What should you watch out for when measuring developer NPS?
Answer: ||The response rate must be 70% or higher for the metric to be meaningful. Also, don't just look at the NPS score—analyze the specific complaints from detractors. Track top_complaints quarterly to verify improvements, and publicly announce resolved items to close the feedback loop.||