DevOps Internal Developer Platform Playbook 2026

Play 1: Determining When You Need an IDP

An Internal Developer Platform (IDP) lets developers create, deploy, and monitor services through self-service, without filing infrastructure requests. Gartner has predicted that by 2026, 80% of large software engineering organizations will have established platform engineering teams.

However, not every organization needs an IDP. If three or more of the following conditions apply, it's time to consider adopting an IDP.

Adoption Signal Checklist:

  • Number of services exceeds 20
  • Creating a new service takes more than one week
  • Each team has different CI/CD pipelines with no standards
  • Onboarding takes more than 2 weeks (dev environment setup, permission requests, etc.)
  • The infrastructure team spends more than 80% of its time handling repetitive Jira tickets
  • Rollback procedures after deployment failures differ across teams or are undocumented
  • Infrastructure costs cannot be attributed to individual teams or services

If fewer than three conditions apply, simple shell script automation or standardized GitHub Actions templates are sufficient—no IDP needed.
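
The checklist and its three-signal threshold can be expressed as a small scoring function. This is a minimal sketch: the `OrgSnapshot` fields and the function names are hypothetical, and the thresholds are copied from the checklist above.

```python
from dataclasses import dataclass


@dataclass
class OrgSnapshot:
    """Hypothetical measurements of an organization (field names are illustrative)."""
    service_count: int
    new_service_days: float
    ci_standardized: bool
    onboarding_days: float
    infra_ticket_time_share: float  # fraction of infra team time, 0.0 to 1.0
    rollback_documented: bool
    cost_attribution_possible: bool


def adoption_signals(org: OrgSnapshot) -> list[str]:
    """Return the checklist items that apply to this organization."""
    checks = [
        ("more than 20 services", org.service_count > 20),
        ("new service takes over a week", org.new_service_days > 7),
        ("no CI/CD standard", not org.ci_standardized),
        ("onboarding over 2 weeks", org.onboarding_days > 14),
        ("infra team over 80% on tickets", org.infra_ticket_time_share > 0.80),
        ("rollback inconsistent or undocumented", not org.rollback_documented),
        ("no cost attribution", not org.cost_attribution_possible),
    ]
    return [label for label, applies in checks if applies]


def should_consider_idp(org: OrgSnapshot, threshold: int = 3) -> bool:
    """Three or more signals means it is time to consider an IDP."""
    return len(adoption_signals(org)) >= threshold
```

Returning the matching labels rather than only a count makes the result easier to discuss with stakeholders, since it shows which pain points drove the decision.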

Play 2: Building the Platform Team and Defining Roles

An IDP is a product. It should not be a side project built by a few infrastructure engineers—it requires a dedicated team operating with a product mindset.

Team Composition (for organizations of 50–200 developers)

| Role | Headcount | Key Responsibilities |
| --- | --- | --- |
| Platform PM | 1 | Collecting developer requirements, roadmap management, tracking adoption rate |
| Platform Engineer | 2–3 | Infrastructure abstraction, API/UI development, golden path design |
| SRE / DevOps | 1–2 | Monitoring pipelines, on-call, incident response automation |
| Developer Advocate | 0.5 (shared) | Documentation, onboarding guides, internal training |

Core Principles:

  • The platform team's customer is the internal developer. Measure NPS (Net Promoter Score) every quarter.
  • Drive adoption through appeal, not enforcement. Aim for a gap where following the golden path takes 30 minutes and going without it takes 2 weeks.
  • Keep the feedback loop under 2 weeks. Developer feature requests must receive at least a minimum response (implementation plan or rejection reason) within 2 weeks.
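
For the quarterly NPS measurement, the score is the percentage of promoters minus the percentage of detractors. The sketch below assumes the conventional 0 to 10 survey scale, where 9 and 10 count as promoters and 0 through 6 as detractors; the function name is illustrative.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """Compute NPS from 0-10 survey ratings: % promoters minus % detractors."""
    if not ratings:
        raise ValueError("no survey responses")
    if any(r < 0 or r > 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)
```

Passives (7 and 8) count toward the total number of responses but toward neither group, which is why the score can sit well below the share of satisfied developers.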

Play 3: Choosing the Technology Stack

Backstage vs. Build-Your-Own vs. SaaS Comparison

| Criteria | Backstage (Open Source) | Build-Your-Own | SaaS (Port/Cortex, etc.) |
| --- | --- | --- | --- |
| Initial Cost | Medium (3–6 months to build) | High (6–12 months) | Low (start immediately) |
| Customization | High (plugin ecosystem) | Highest | Limited |
| Maintenance Burden | High (upgrades, security patches) | Very High | None (vendor responsibility) |
| Org Size Fit | 100+ | 500+ | 50–300 |
| Vendor Lock-in | None | None | High |
| Plugins/Integrations | 2000+ plugins | Only what you need | Vendor-provided scope |

Recommended Strategy: For organizations with fewer than 100 people, start with SaaS. For 100–500, adopt Backstage but also consider managed Backstage options like Roadie. For 500+, have a dedicated team customize and operate Backstage.
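
The recommended strategy reduces to a size-based lookup. In this sketch the function name and return labels are made up for illustration, and an organization of exactly 500 is assumed to fall into the "500+" tier.

```python
def recommend_stack(org_size: int) -> str:
    """Map organization size to the recommended starting point."""
    if org_size < 0:
        raise ValueError("org_size must be non-negative")
    if org_size < 100:
        return "saas"
    if org_size < 500:
        # Self-hosted Backstage, or a managed option such as Roadie
        return "backstage (consider managed)"
    return "backstage (dedicated team)"
```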

Play 4: Backstage Setup and Software Catalog

Installing Backstage

# Create a new project using the Backstage CLI
npx @backstage/create-app@latest --skip-install

# Resulting directory structure
my-backstage/
├── app-config.yaml           # Core configuration file
├── app-config.production.yaml
├── packages/
│   ├── app/                  # Frontend (React)
│   └── backend/              # Backend (Node.js)
├── plugins/                  # Custom plugins
├── catalog-info.yaml         # Catalog registration for this project itself
└── package.json

# Install dependencies and run
cd my-backstage
yarn install
yarn dev   # newer app templates expose this as `yarn start`; check the scripts in package.json
# Access at http://localhost:3000

Key app-config.yaml Settings

# app-config.yaml
app:
  title: 'MyOrg Developer Platform'
  baseUrl: http://localhost:3000

organization:
  name: 'MyOrg'

backend:
  baseUrl: http://localhost:7007
  database:
    client: pg
    connection:
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}

# GitHub integration (automatic service catalog discovery)
integrations:
  github:
    - host: github.com
      token: ${GITHUB_TOKEN}

# Software Catalog settings
catalog:
  import:
    entityFilename: catalog-info.yaml
  rules:
    - allow: [Component, System, API, Resource, Location, Group, User]
  locations:
    # Automatically discover catalog-info.yaml from all repositories in the organization
    - type: github-discovery
      target: https://github.com/my-org/*/blob/main/catalog-info.yaml
    # Manual registration
    - type: file
      target: ./catalog-entities/all-systems.yaml

# Authentication (GitHub OAuth)
auth:
  environment: development
  providers:
    github:
      development:
        clientId: ${GITHUB_OAUTH_CLIENT_ID}
        clientSecret: ${GITHUB_OAUTH_CLIENT_SECRET}
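
Because this configuration relies on `${VAR}` placeholders, a missing environment variable tends to surface only at startup. A small pre-flight check can list what is unset. This is a sketch with hypothetical helper names; it handles plain `${VAR}` placeholders only, not any extended substitution syntax.

```python
import re

# Matches plain ${VAR_NAME} placeholders as used in the config above
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def referenced_env_vars(config_text: str) -> list[str]:
    """Return distinct placeholder names in order of first appearance."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(config_text):
        if name not in seen:
            seen.append(name)
    return seen


def missing_env_vars(config_text: str, environ: dict) -> list[str]:
    """Return placeholder names that are unset or empty in `environ`."""
    return [name for name in referenced_env_vars(config_text) if not environ.get(name)]
```

In practice it would be called with the file contents and `os.environ`, for example `missing_env_vars(open("app-config.yaml").read(), os.environ)`, before running the backend.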

Service Catalog Registration Standard

Place a catalog-info.yaml at the root of each service repository.

# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: 'Order processing microservice'
  annotations:
    github.com/project-slug: my-org/order-service
    backstage.io/techdocs-ref: dir:.
    datadoghq.com/dashboard-url: https://app.datadoghq.com/dashboard/abc-123
    pagerduty.com/service-id: PXXXXXX
    argocd/app-name: order-service-prod
  tags:
    - java
    - spring-boot
    - tier-1
  links:
    - url: https://order.internal.example.com
      title: Production URL
    - url: https://grafana.internal.example.com/d/order-service
      title: Grafana Dashboard
spec:
  type: service
  lifecycle: production
  owner: team-commerce
  system: commerce-platform
  dependsOn:
    - component:payment-service
    - resource:orders-database
  providesApis:
    - order-api
  consumesApis:
    - payment-api
    - inventory-api
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: order-api
  description: 'Order REST API'
spec:
  type: openapi
  lifecycle: production
  owner: team-commerce
  definition:
    $text: ./docs/openapi.yaml
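
The `dependsOn` relations declared above form a graph, and walking it in reverse gives the set of entities affected when one fails. The following is a minimal sketch over a plain dictionary rather than the Backstage catalog API; `component:checkout-web` is a hypothetical entity added for illustration.

```python
from collections import defaultdict, deque


def blast_radius(depends_on: dict[str, list[str]], failed: str) -> set[str]:
    """Return every entity that directly or transitively depends on `failed`."""
    # Invert the graph: for each dependency, record who depends on it
    dependents: dict[str, set[str]] = defaultdict(set)
    for entity, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(entity)

    # Breadth-first walk over the inverted edges
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    affected.discard(failed)  # a cycle must not put the failed entity in its own radius
    return affected


catalog = {
    "component:order-service": ["component:payment-service", "resource:orders-database"],
    "component:checkout-web": ["component:order-service"],  # hypothetical
    "component:payment-service": [],
}
print(sorted(blast_radius(catalog, "component:payment-service")))
```

The same traversal works for `resource:` entities, so a database outage can be traced to the services that sit on top of it.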

Play 5: Service Templates (Scaffolding)

Backstage Software Templates create new services with a standardized structure. A service with CI/CD, monitoring, and catalog registration already set up can be ready within 30 minutes.

# templates/spring-boot-service/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: spring-boot-service
  title: 'Spring Boot Microservice'
  description: 'Creates a Spring Boot service with auto-configured CI/CD, monitoring, and catalog registration'
  tags:
    - java
    - spring-boot
    - recommended
spec:
  owner: team-platform
  type: service
  parameters:
    - title: Service Information
      required:
        - serviceName
        - ownerTeam
        - tier
      properties:
        serviceName:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
          description: 'Only lowercase letters, numbers, and hyphens allowed'
        ownerTeam:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
          ui:options:
            catalogFilter:
              kind: Group
        tier:
          title: Service Tier
          type: string
          enum: ['tier-1', 'tier-2', 'tier-3']
          enumNames: ['Tier 1 (99.99% SLO)', 'Tier 2 (99.9% SLO)', 'Tier 3 (99% SLO)']
        javaVersion:
          title: Java Version
          type: string
          default: '21'
          enum: ['17', '21']

    - title: Infrastructure Settings
      properties:
        database:
          title: Database
          type: string
          default: 'postgresql'
          enum: ['postgresql', 'mysql', 'none']
        messageQueue:
          title: Message Queue
          type: string
          default: 'none'
          enum: ['kafka', 'rabbitmq', 'none']
        replicaCount:
          title: Default Replica Count
          type: integer
          default: 3
          minimum: 1
          maximum: 20

  steps:
    # 1. Generate repository from template
    - id: fetch-template
      name: Generate Template Code
      action: fetch:template
      input:
        url: ./skeleton
        values:
          serviceName: ${{ parameters.serviceName }}
          ownerTeam: ${{ parameters.ownerTeam }}
          tier: ${{ parameters.tier }}
          javaVersion: ${{ parameters.javaVersion }}
          database: ${{ parameters.database }}
          messageQueue: ${{ parameters.messageQueue }}
          replicaCount: ${{ parameters.replicaCount }}

    # 2. Create GitHub repository
    - id: publish
      name: Create GitHub Repository
      action: publish:github
      input:
        allowedHosts: ['github.com']
        repoUrl: github.com?owner=my-org&repo=${{ parameters.serviceName }}
        defaultBranch: main
        repoVisibility: internal
        protectDefaultBranch: true
        requireCodeOwnerReviews: true

    # 3. Register ArgoCD application
    - id: register-argocd
      name: Register ArgoCD Application
      action: argocd:create-resources
      input:
        appName: ${{ parameters.serviceName }}-prod
        argoInstance: main
        namespace: ${{ parameters.serviceName }}
        repoUrl: https://github.com/my-org/${{ parameters.serviceName }}
        path: k8s/overlays/production

    # 4. Register in Backstage catalog
    - id: register-catalog
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

  output:
    links:
      - title: Repository
        url: ${{ steps['publish'].output.remoteUrl }}
      - title: Catalog
        icon: catalog
        entityRef: ${{ steps['register-catalog'].output.entityRef }}
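
The `serviceName` pattern matters because the same value is reused as the repository name, the ArgoCD application name, and the namespace. The sketch below mirrors that derivation outside Backstage, for example in a pre-commit check or a test; the helper names are hypothetical and `my-org` is the placeholder organization used in the template.

```python
import re

# Same pattern as the serviceName parameter in the template
SERVICE_NAME = re.compile(r"^[a-z][a-z0-9-]*$")


def is_valid_service_name(name: str) -> bool:
    """True if the name uses only lowercase letters, digits, and hyphens."""
    return SERVICE_NAME.fullmatch(name) is not None


def derived_names(service_name: str, owner: str = "my-org") -> dict[str, str]:
    """Reproduce the values the template steps derive from serviceName."""
    if not is_valid_service_name(service_name):
        raise ValueError(f"invalid service name: {service_name!r}")
    return {
        "repo_url": f"github.com?owner={owner}&repo={service_name}",
        "argocd_app": f"{service_name}-prod",
        "namespace": service_name,
    }
```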

Play 6: Unifying Documentation with TechDocs

By consolidating scattered documentation into Backstage TechDocs, you can view technical documentation directly from the service catalog.

# mkdocs.yml (root of each service repository)
site_name: order-service
nav:
  - Home: index.md
  - Architecture: architecture.md
  - API Reference: api.md
  - Runbook: runbook.md
  - ADR:
      - adr/001-database-choice.md
      - adr/002-event-schema.md

plugins:
  - techdocs-core

<!-- docs/runbook.md -->

# Order Service Operations Runbook

## Incident Response

### Order Processing Delay (P95 > 500ms)

1. Check Grafana dashboard: [Link]
2. Check DB connection pool status:

       kubectl exec -it deploy/order-service -- curl localhost:8080/actuator/metrics/hikaricp.connections.active

3. If connection pool is saturated:

       kubectl scale deploy/order-service --replicas=6

4. Check DB slow queries:

       SELECT pid, now() - query_start AS duration, query
       FROM pg_stat_activity
       WHERE state = 'active' AND now() - query_start > interval '5s';

### Order Creation Failure (HTTP 500)

1. Check error logs:

       kubectl logs deploy/order-service --tail=100 | grep ERROR

2. Response by error code:
   - ORDER-001: Payment service connection failure -> Check payment-service status
   - ORDER-002: Insufficient inventory -> Check inventory-service synchronization
   - ORDER-003: DB deadlock -> Check transaction isolation level

Play 7: Self-Service Infrastructure Provisioning

Enable developers to provision databases, message queues, caches, and more directly from the Backstage UI. Actual infrastructure creation is handled via Terraform + GitOps.

# templates/provision-postgresql/template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: provision-postgresql
  title: 'PostgreSQL Database Provisioning'
  description: 'Self-service provisioning of RDS PostgreSQL instances'
spec:
  owner: team-platform
  type: resource
  parameters:
    - title: Database Settings
      required:
        - dbName
        - environment
        - instanceClass
      properties:
        dbName:
          title: DB Name
          type: string
          pattern: '^[a-z][a-z0-9_]*$'
        environment:
          title: Environment
          type: string
          enum: ['dev', 'staging', 'production']
        instanceClass:
          title: Instance Size
          type: string
          default: 'db.r7g.large'
          enum:
            - 'db.t4g.medium'
            - 'db.r7g.large'
            - 'db.r7g.xlarge'
            - 'db.r7g.2xlarge'
          enumNames:
            - 'Small (2 vCPU, 4GB) - dev/staging'
            - 'Medium (2 vCPU, 16GB) - production'
            - 'Large (4 vCPU, 32GB) - production'
            - 'XLarge (8 vCPU, 64GB) - high traffic'
        storageGb:
          title: Storage (GB)
          type: integer
          default: 100
          minimum: 20
          maximum: 16000
        multiAz:
          title: Multi-AZ Deployment
          type: boolean
          default: false

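  steps:
    # Render the Terraform skeleton into the workspace so the PR step has files to commit.
    # Assumes a ./terraform-template directory next to this template.yaml.
    - id: fetch-terraform
      name: Render Terraform Module
      action: fetch:template
      input:
        url: ./terraform-template
        targetPath: ./terraform-template
        values:
          dbName: ${{ parameters.dbName }}
          environment: ${{ parameters.environment }}
          instanceClass: ${{ parameters.instanceClass }}
          storageGb: ${{ parameters.storageGb }}
          multiAz: ${{ parameters.multiAz }}
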
    - id: create-terraform-pr
      name: Create Terraform PR
      action: publish:github:pull-request
      input:
        repoUrl: github.com?owner=my-org&repo=infrastructure
        branchName: provision-db-${{ parameters.dbName }}
        title: 'DB Provisioning: ${{ parameters.dbName }} (${{ parameters.environment }})'
        description: |
          Automatically generated DB provisioning request.

          - DB Name: ${{ parameters.dbName }}
          - Environment: ${{ parameters.environment }}
          - Instance: ${{ parameters.instanceClass }}
          - Storage: ${{ parameters.storageGb }}GB
          - Multi-AZ: ${{ parameters.multiAz }}
        targetPath: terraform/rds/${{ parameters.environment }}/${{ parameters.dbName }}
        sourcePath: ./terraform-template
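
On the Terraform side, the pull request ultimately carries a set of variable values. The sketch below validates the same constraints as the form and renders them as a `terraform.tfvars.json` payload, a format Terraform loads natively. The variable names (`db_name`, `storage_gb`, and so on) are hypothetical and would need to match the variables your module actually declares.

```python
import json
import re

DB_NAME = re.compile(r"^[a-z][a-z0-9_]*$")
ENVIRONMENTS = ("dev", "staging", "production")
INSTANCE_CLASSES = ("db.t4g.medium", "db.r7g.large", "db.r7g.xlarge", "db.r7g.2xlarge")


def render_tfvars(db_name: str, environment: str, instance_class: str = "db.r7g.large",
                  storage_gb: int = 100, multi_az: bool = False) -> str:
    """Validate the request and return the contents of a terraform.tfvars.json file."""
    if not DB_NAME.fullmatch(db_name):
        raise ValueError(f"invalid db name: {db_name!r}")
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    if instance_class not in INSTANCE_CLASSES:
        raise ValueError(f"unsupported instance class: {instance_class!r}")
    if not 20 <= storage_gb <= 16000:
        raise ValueError("storage_gb must be between 20 and 16000")
    return json.dumps(
        {
            "db_name": db_name,
            "environment": environment,
            "instance_class": instance_class,
            "storage_gb": storage_gb,
            "multi_az": multi_az,
        },
        indent=2,
        sort_keys=True,
    )
```

Repeating the validation here is deliberate: the form constraints only protect requests made through Backstage, while the infrastructure repository may also receive hand-written pull requests.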

Play 8: Measuring Platform Success Metrics

Without measuring IDP performance, you cannot prove the return on investment. Track the following metrics quarterly.

Key Performance Indicators (KPIs)

# platform_metrics.py - Platform KPI dashboard data collection
import statistics
from datetime import datetime, timedelta, timezone

import requests


def _parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; tolerates a trailing 'Z' on Python < 3.11."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


class PlatformMetrics:
    def __init__(self, github_token: str, backstage_url: str):
        self.github = github_token
        self.backstage = backstage_url.rstrip("/")

    def service_creation_lead_time(self) -> dict:
        """New service creation lead time (target: under 30 minutes)"""
        # Extract from Backstage scaffolder task history.
        # NOTE: the query parameter and the "items" / "completedAt" field names
        # depend on your Backstage version; verify them against your instance,
        # and add an Authorization header if the backend requires one.
        since = datetime.now(timezone.utc) - timedelta(days=90)
        response = requests.get(
            f"{self.backstage}/api/scaffolder/v2/tasks",
            params={"createdAfter": since.isoformat()},
            timeout=30,
        )
        response.raise_for_status()
        tasks = response.json()["items"]

        lead_times = []
        for task in tasks:
            if task["status"] == "completed":
                start = _parse_ts(task["createdAt"])
                end = _parse_ts(task["completedAt"])
                lead_times.append((end - start).total_seconds() / 60)

        if not lead_times:
            # No completed tasks in the window: avoid indexing an empty list
            return {"median_minutes": None, "p95_minutes": None, "total_services_created": 0}

        lead_times.sort()
        p95_index = min(int(len(lead_times) * 0.95), len(lead_times) - 1)
        return {
            "median_minutes": statistics.median(lead_times),
            "p95_minutes": lead_times[p95_index],
            "total_services_created": len(lead_times),
        }

    def golden_path_adoption_rate(self) -> dict:
        """Golden path adoption rate (target: 80% or above)"""
        headers = {"Authorization": f"token {self.github}"}

        # List organization repositories, following pagination (100 per page)
        repos, page = [], 1
        while True:
            resp = requests.get(
                "https://api.github.com/orgs/my-org/repos",
                headers=headers,
                params={"per_page": 100, "type": "internal", "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            repos.extend(batch)
            if len(batch) < 100:
                break
            page += 1

        using_golden_path = 0
        total_active = 0

        for repo in repos:
            if repo["archived"]:
                continue
            total_active += 1
            # Heuristic: a workflow file whose path contains "golden" counts as adoption.
            # This looks at file names only, not at which reusable workflows are called.
            wf_resp = requests.get(
                f"https://api.github.com/repos/my-org/{repo['name']}/actions/workflows",
                headers=headers,
                timeout=30,
            )
            wf_resp.raise_for_status()

            for wf in wf_resp.json().get("workflows", []):
                if "golden" in wf.get("path", "").lower():
                    using_golden_path += 1
                    break

        return {
            "adoption_rate": using_golden_path / max(total_active, 1),
            "using_golden_path": using_golden_path,
            "total_active_repos": total_active,
        }

    def developer_nps(self) -> dict:
        """Developer satisfaction NPS (target: 40 or above)"""
        # Quarterly survey results (Google Forms / Typeform, etc.)
        # Integrate via API directly, or enter manually.
        # The values below are placeholders, not measured data.
        return {
            "nps_score": 42,  # promoters_pct - detractors_pct
            "promoters_pct": 55,
            "detractors_pct": 13,
            "response_rate": 0.72,
            "top_complaints": [
                "Build times are slow",
                "Log search UI is inconvenient",
                "Insufficient permission request automation",
            ],
        }

KPI Target Values

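| Metric | Poor | Average | Good | Target |
| --- | --- | --- | --- | --- |
| Service Creation Time | 1 week+ | 1–3 days | 1 hour | 30 min |
| Golden Path Adoption Rate | Below 30% | 30–60% | 60–80% | 80%+ |
| Developer NPS | Below 0 | 0–20 | 20–40 | 40+ |
| Onboarding Time | 2 weeks+ | 1–2 weeks | 2–5 days | 1 day |
| Infra Tickets per Month | 50+ | 20–50 | 5–20 | Below 5 |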
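
To put these bands on a dashboard, each measured value can be mapped to its rating. The sketch below covers three of the metrics; the function names are illustrative, and because the table's ranges share their endpoints, a boundary value is assumed to belong to the better band for the higher-is-better metrics and to the worse band for ticket counts.

```python
def rate_higher_is_better(value: float, average_from: float, good_from: float,
                          target_from: float) -> str:
    """Rate a metric where larger values are better."""
    if value >= target_from:
        return "Target"
    if value >= good_from:
        return "Good"
    if value >= average_from:
        return "Average"
    return "Poor"


def rate_lower_is_better(value: float, target_below: float, good_below: float,
                         average_below: float) -> str:
    """Rate a metric where smaller values are better."""
    if value < target_below:
        return "Target"
    if value < good_below:
        return "Good"
    if value < average_below:
        return "Average"
    return "Poor"


def rate_adoption(rate: float) -> str:
    """Golden path adoption rate as a fraction (0.0 to 1.0)."""
    return rate_higher_is_better(rate, 0.30, 0.60, 0.80)


def rate_nps(score: float) -> str:
    return rate_higher_is_better(score, 0, 20, 40)


def rate_infra_tickets(per_month: int) -> str:
    return rate_lower_is_better(per_month, 5, 20, 50)
```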

Play 9: Troubleshooting

Issue 1: Backstage Catalog Synchronization Delay

WARN: Entity refresh for component:order-service took 45s (threshold: 10s)

Cause: GitHub discovery is scanning hundreds of repositories and hitting the API rate limit.

# Solution: Limit scan scope and lengthen the refresh interval
catalog:
  providers:
    github:
      myOrg:
        organization: 'my-org'
        catalogPath: '/catalog-info.yaml'
        filters:
          repository: '^(?!archived-).*$' # Exclude repositories with archived- prefix
          topic:
            include: ['backstage-enabled'] # Topic-based filtering
        schedule:
          frequency: { minutes: 30 } # 30-minute interval (default is 5 minutes)
          timeout: { minutes: 5 }
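
A rough budget calculation shows why both changes help. In the sketch below, 5,000 requests per hour is GitHub's documented REST limit for a personal access token (GitHub App installations can differ), while `calls_per_repo` is a made-up knob: the number of requests the provider really issues per repository depends on the Backstage version and should be measured. The repository filter reuses the regular expression from the configuration above.

```python
import re

GITHUB_PAT_HOURLY_LIMIT = 5000  # documented REST limit for a personal access token
REPO_FILTER = re.compile(r"^(?!archived-).*$")  # same pattern as filters.repository


def scanned_repos(names: list[str]) -> list[str]:
    """Apply the repository-name filter from the provider configuration."""
    return [n for n in names if REPO_FILTER.match(n)]


def hourly_requests(repo_count: int, interval_minutes: float, calls_per_repo: int = 2) -> float:
    """Estimate requests per hour; calls_per_repo is an assumption, not a measured value."""
    scans_per_hour = 60 / interval_minutes
    return repo_count * calls_per_repo * scans_per_hour


print(hourly_requests(400, 5))   # 5-minute interval
print(hourly_requests(400, 30))  # 30-minute interval
```

Under these assumed numbers, 400 repositories at a 5-minute interval would exceed the hourly limit, while a 30-minute interval stays well inside it; narrowing the scan by topic reduces `repo_count` further.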

Issue 2: Software Template Execution Failure — GitHub Permissions

Error: Resource not accessible by integration
HttpError: 403 - Resource not accessible by integration

Cause: The GitHub App lacks sufficient permissions, or it doesn't have access to the organization where the repository is being created.

# Check GitHub App permissions
# Settings > Developer settings > GitHub Apps > [App Name] > Permissions
# Required permissions:
# - Repository: Administration (Read & Write)
# - Repository: Contents (Read & Write)
# - Organization: Members (Read)

# Or when using a Personal Access Token (PAT), required scopes:
# repo, workflow, admin:org
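
When a classic PAT is in use, GitHub reports the token's granted scopes in the `X-OAuth-Scopes` response header, so the scopes listed above can be checked before running a template. The helper below is a sketch that only parses the header value; the function name is hypothetical, and fine-grained tokens and GitHub Apps do not return this header.

```python
REQUIRED_SCOPES = frozenset({"repo", "workflow", "admin:org"})


def missing_scopes(scopes_header: str, required=REQUIRED_SCOPES) -> set[str]:
    """Return required scopes absent from an X-OAuth-Scopes header value."""
    granted = {scope.strip() for scope in scopes_header.split(",") if scope.strip()}
    return set(required) - granted
```

The header value can be read from any authenticated call, for example `requests.get("https://api.github.com/user", headers={"Authorization": f"token {token}"}).headers.get("X-OAuth-Scopes", "")`.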

Issue 3: TechDocs Build Failure

mkdocs build failed: No module named 'techdocs_core'

# Solution: Install the plugin in the TechDocs build environment
pip install mkdocs-techdocs-core

# When using Docker build
docker run --rm -v $(pwd):/content \
  spotify/techdocs:latest \
  build --site-dir /content/site

# Configure build method in app-config.yaml
techdocs:
  builder: 'external'          # Build in CI
  publisher:
    type: 'awsS3'
    awsS3:
      bucketName: 'my-org-techdocs'
      region: 'ap-northeast-2'

Issue 4: Platform Adoption Rate Won't Increase

This is not a technical problem—it's an organizational problem.

Resolution Strategy:

  1. Secure champion teams first: Select 2–3 early adopter teams and share their success stories internally.
  2. Remove friction: Maximize the convenience of the golden path, and do not spend platform effort easing the off-path route, which stays at roughly 2 weeks of manual deployment and monitoring setup.
  3. Don't force it: Hold a monthly "Platform Day" with demos and feedback sessions.
  4. Prove it with data: Share metrics like "Teams using the golden path deploy 3x more frequently."
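
A claim such as "3x more frequently" should come from a reproducible calculation. The sketch below compares mean weekly deployments between teams on and off the golden path; the function name, team names, and numbers are hypothetical, and the deployment counts would come from your CD system.

```python
from statistics import mean


def deploy_frequency_ratio(deploys_per_week: dict[str, float],
                           golden_path_teams: set[str]) -> float:
    """Mean weekly deployments of golden-path teams divided by that of the others."""
    on_path = [v for team, v in deploys_per_week.items() if team in golden_path_teams]
    off_path = [v for team, v in deploys_per_week.items() if team not in golden_path_teams]
    if not on_path or not off_path:
        raise ValueError("need at least one team in each group")
    off_mean = mean(off_path)
    if off_mean == 0:
        raise ValueError("off-path teams have no deployments; ratio is undefined")
    return mean(on_path) / off_mean
```

With only a handful of teams the mean is sensitive to outliers, so it is worth reporting the per-team numbers alongside the ratio.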

Play 10: Phased IDP Roadmap Implementation

Trying to build everything at once will lead to failure. Divide the effort into three phases and build incrementally.

Phase 1 (1–3 months): Foundation

  • Build the Software Catalog (register all services, teams, APIs)
  • Standardize CI golden path (GitHub Actions reusable workflows)
  • Create 1–2 service creation templates

Phase 2 (4–6 months): Expansion

  • CD golden path (Argo Rollouts canary deployments)
  • TechDocs integration (runbooks, ADRs)
  • Self-service infrastructure provisioning (databases, caches)
  • Cost tagging and dashboards

Phase 3 (7–12 months): Maturity

  • Automated security policy enforcement (OPA/Kyverno)
  • Automated DORA metrics collection and dashboards
  • Internal marketplace (shared libraries, plugins)
  • One-click development environment provisioning

Quiz

Q1. What characterizes an organization where IDP adoption is premature?

Answer: ||An organization with fewer than 20 services, a low ratio of repetitive ticket processing by the infrastructure team, and the ability to create new services within one week. In such cases, simple script automation or standard CI/CD templates are sufficient rather than an IDP.||

Q2. Why is a Platform PM needed when building a platform team?

Answer: ||Since an IDP is an internal product, it requires collecting customer (developer) requirements, prioritizing, and measuring adoption rates. If composed only of engineers, the team tends to skew toward technically interesting features rather than what developers actually need.||

Q3. Why is the dependsOn field in catalog-info.yaml important in the Backstage Software Catalog?

Answer: ||It explicitly declares inter-service dependencies, enabling immediate identification of the blast radius during incidents. If order-service has a dependsOn on payment-service, you can instantly see in the catalog that a payment-service outage would also affect order-service.||

Q4. Why include the ArgoCD app registration step in a Software Template?

Answer: ||A GitOps-based deployment pipeline is automatically configured at the same time the service is created, so deployment begins immediately when a developer pushes code. Without this step, developers would have to create a separate ticket to request ArgoCD configuration, which is a major cause of increased onboarding time.||

Q5. How do you increase platform adoption from below 50% through appeal rather than enforcement?

Answer: ||Maintain the inconvenience of not following the golden path (2 weeks for manual deployment, manual monitoring setup) while maximizing the golden path's convenience (service creation in 30 minutes, automatic deployment, automatic monitoring). Share champion teams' success stories and prove the impact with metrics.||

Q6. Why should IDP construction be divided into three phases?

Answer: ||Building all features at once takes 12+ months, and the project risks being canceled before ROI can be demonstrated. By delivering value quickly with the catalog and CI standardization in Phase 1 (3 months), you can secure investment for Phases 2 and 3 based on those results.||

Q7. What should you watch out for when measuring developer NPS?

Answer: ||The response rate must be 70% or higher for the metric to be meaningful. Also, don't just look at the NPS score; analyze the specific complaints from detractors. Track top_complaints quarterly to verify improvements, and publicly announce resolved items to close the feedback loop.||
