Backstage in Production: TechDocs, Plugin Development, and the Production Checklist
Introduction: Installation Is Easy, Operation Is Hard
Spinning up Backstage as a demo takes half a day. But a Backstage that is still alive six months later — documentation up to date, plugins running reliably, upgrades not falling behind, hundreds of people using it daily — is an entirely different problem. Part 1 covered the catalog and Part 2 the Scaffolder; this final Part 3 is about operations. We will walk through the pipeline that manages documentation like code with TechDocs, plugin architecture and custom plugin development, and the full scope of production operations (permissions, upgrades, performance, observability, security, and the organizational model), all wrapped up with checklists.
TechDocs — Documentation as Code
TechDocs is the docs-as-code solution in Backstage. The core idea is simple: keep documentation as Markdown next to the code, build it with MkDocs, and serve it as a documentation site attached to a catalog entity. Because docs go through the same repository, the same PRs, and the same reviews as the code, the "code changed but docs did not" problem shrinks structurally.
Two things are needed on the repository side.
# mkdocs.yml (repository root)
site_name: payment-api
site_description: Payment API service documentation
nav:
  - Introduction: index.md
  - Architecture: architecture.md
  - Runbook: runbook.md
  - Onboarding: onboarding.md
plugins:
  - techdocs-core

payment-api/
├── catalog-info.yaml   # backstage.io/techdocs-ref: dir:. annotation
├── mkdocs.yml
└── docs/
    ├── index.md
    ├── architecture.md
    ├── runbook.md
    └── onboarding.md
Build Strategy — Local Build vs External Pipeline
TechDocs has two build modes.
| Strategy | Behavior | Suited for |
| --- | --- | --- |
| Local build (out-of-the-box) | Backstage backend builds on demand | Demos, small scale, early adoption |
| External pipeline (recommended) | CI builds and uploads to object storage | Production, large scale |
The recommended production setup is the external pipeline. It offloads build work from the Backstage instance itself and makes documentation refresh happen at code merge time.
Developer merges PR
        |
        v
CI (GitHub Actions)
        |  techdocs-cli generate  (MkDocs build)
        |  techdocs-cli publish   (S3 upload)
        v
S3 bucket (techdocs-bucket)
        ^
        |  read-only
Backstage TechDocs Backend ---> user browser
A CI pipeline example:
# .github/workflows/techdocs.yaml
name: techdocs
on:
  push:
    branches: [main]
    paths: ['docs/**', 'mkdocs.yml']
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Install tooling
        run: |
          pip install mkdocs-techdocs-core
          npm install -g @techdocs/cli
      - name: Generate docs
        run: techdocs-cli generate --no-docker
      - name: Publish to S3
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.TECHDOCS_AWS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.TECHDOCS_AWS_SECRET }}
          # should match techdocs.publisher.awsS3.region in app-config.yaml
          AWS_REGION: ap-northeast-2
        run: |
          techdocs-cli publish \
            --publisher-type awsS3 \
            --storage-name techdocs-bucket \
            --entity default/Component/payment-api
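The `--entity` flag takes a `namespace/kind/name` triplet, and a typo there publishes the docs under the wrong entity. As a rough sketch, a CI script could validate the value before calling the CLI. `parseEntityTriplet` is a hypothetical helper written for this article, not part of `@techdocs/cli`.

```typescript
// Hypothetical helper (not part of @techdocs/cli): validates the
// namespace/kind/name triplet that is passed to `--entity`.
interface EntityTriplet {
  namespace: string;
  kind: string;
  name: string;
}

function parseEntityTriplet(value: string): EntityTriplet {
  const parts = value.split('/');
  // Exactly three non-empty segments are expected.
  if (parts.length !== 3 || parts.some(part => part.trim() === '')) {
    throw new Error(`Expected namespace/kind/name, got "${value}"`);
  }
  return { namespace: parts[0], kind: parts[1], name: parts[2] };
}
```

Calling `parseEntityTriplet('default/Component/payment-api')` returns the three segments, while a bare `payment-api` fails fast instead of producing a confusing upload.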
The Backstage-side configuration:
# app-config.yaml
techdocs:
  builder: 'external'   # disable backend builds
  publisher:
    type: 'awsS3'
    awsS3:
      bucketName: techdocs-bucket
      region: ap-northeast-2
One operational tip: include mkdocs.yml and a docs directory skeleton in the Scaffolder skeleton from Part 2. Making every new service born "with documentation" is the most realistic way to build a documentation culture.
Plugin Architecture — Everything Is a Plugin
The Backstage design philosophy is "every feature is a plugin." The catalog, the Scaffolder, and TechDocs are all plugins. Understanding the structure makes custom development much easier.
+------------------- Backstage monorepo (yarn workspaces) -------------------------+
| |
| packages/app (frontend shell) packages/backend (backend shell) |
| +--------------------------+ +-----------------------------+ |
| | React SPA | | Node.js | |
| | - plugin page routing | HTTP | - mounts per-plugin routers | |
| | - entity page tabs/cards | <-------> | - provides DB/cache/sched | |
| +--------------------------+ +-----------------------------+ |
| |
| plugins/ |
| ├── my-plugin (frontend plugin: React components) |
| ├── my-plugin-backend (backend plugin: REST API, DB access) |
| └── my-plugin-common (shared types, isomorphic code) |
+---------------------------------------------------------------------------------+
On the backend, the "new backend system" is the standard. It is dependency-injection based: a plugin declares the infrastructure services it needs (logger, DB, scheduler, discovery) and the framework injects them. The key change is that the backend entry point became declarative, like this:
// packages/backend/src/index.ts (new backend system)
import { createBackend } from '@backstage/backend-defaults';

const backend = createBackend();

backend.add(import('@backstage/plugin-app-backend'));
backend.add(import('@backstage/plugin-catalog-backend'));
backend.add(import('@backstage/plugin-scaffolder-backend'));
backend.add(import('@backstage/plugin-techdocs-backend'));
backend.add(import('@backstage/plugin-auth-backend'));
backend.add(import('@backstage/plugin-permission-backend'));

// internal custom plugin
backend.add(import('@internal/plugin-deploy-board-backend'));

backend.start();
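To make the "declare what you need, the framework injects it" idea concrete, here is a deliberately tiny model of it. This is a toy and not how Backstage implements its backend system: `createToyBackend`, `ToyPlugin`, and the two service shapes are all invented for illustration.

```typescript
// Toy dependency injection, for illustration only. All names are hypothetical.
interface Services {
  logger: { info(message: string): void };
  database: { getClient(): string };
}

type ServiceName = keyof Services;

// Each factory builds a service instance scoped to one plugin.
type Factories = { [K in ServiceName]: (pluginId: string) => Services[K] };

interface ToyPlugin {
  pluginId: string;
  deps: ServiceName[]; // the plugin only declares what it needs
  init(services: Partial<Services>): void;
}

function createToyBackend(factories: Factories) {
  const plugins: ToyPlugin[] = [];
  return {
    add(plugin: ToyPlugin): void {
      plugins.push(plugin);
    },
    start(): void {
      for (const plugin of plugins) {
        const services: Partial<Services> = {};
        // Only declared dependencies are created and injected.
        for (const dep of plugin.deps) {
          if (dep === 'logger') services.logger = factories.logger(plugin.pluginId);
          if (dep === 'database') services.database = factories.database(plugin.pluginId);
        }
        plugin.init(services);
      }
    },
  };
}
```

The point of the design is that a plugin never constructs its own logger or database connection. It names them, and the shell decides how they are built, which is what lets the real backend swap implementations per environment.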
Custom Plugin in Practice — An Internal Deployment Board
Suppose we build a deployment board showing "which service is running which version in which environment right now." Let us look at the frontend and backend skeletons. Plugin creation starts with the CLI.
yarn new # choose plugin / backend-plugin from the menu
Backend — the deployment data API
// plugins/deploy-board-backend/src/plugin.ts
import {
  coreServices,
  createBackendPlugin,
} from '@backstage/backend-plugin-api';
import { createRouter } from './router';

export const deployBoardPlugin = createBackendPlugin({
  pluginId: 'deploy-board',
  register(env) {
    env.registerInit({
      // infrastructure services this plugin asks the framework to inject
      deps: {
        logger: coreServices.logger,
        httpRouter: coreServices.httpRouter,
        database: coreServices.database,
      },
      async init({ logger, httpRouter, database }) {
        httpRouter.use(await createRouter({ logger, database }));
      },
    });
  },
});

// default export so that backend.add(import(...)) can load the plugin
export default deployBoardPlugin;
// plugins/deploy-board-backend/src/router.ts
import express from 'express';
import Router from 'express-promise-router';
import {
  DatabaseService,
  LoggerService,
} from '@backstage/backend-plugin-api';

// Assumes a `deployments` table created by this plugin's Knex migrations.
export async function createRouter(options: {
  logger: LoggerService;
  database: DatabaseService;
}): Promise<express.Router> {
  const { logger, database } = options;
  const router = Router();
  router.use(express.json());

  // Ingest deployment events (called by the CD pipeline)
  router.post('/deployments', async (req, res) => {
    const { component, environment, version, status } = req.body;
    logger.info(`deployment: ${component} ${version} -> ${environment}`);
    // get a knex instance from the database handle and persist
    const knex = await database.getClient();
    await knex('deployments').insert({
      component,
      environment,
      version,
      status,
      deployed_at: new Date(),
    });
    res.status(201).json({ ok: true });
  });

  // Latest deployments per component (called by the frontend)
  router.get('/deployments/:component', async (req, res) => {
    const knex = await database.getClient();
    const rows = await knex('deployments')
      .where({ component: req.params.component })
      .orderBy('deployed_at', 'desc')
      .limit(20);
    res.json(rows);
  });

  return router;
}
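The POST handler above trusts `req.body` as-is, so a malformed event from a pipeline would end up as a row with missing columns. One possible guard, sketched here as a plain function with no Backstage dependency, is to validate the payload first and answer with a 400 when it fails. `validateDeployment` and `DeploymentEvent` are hypothetical names, and the rule "all four fields are non-empty strings" is an assumption about what the board needs.

```typescript
// Hypothetical payload check for POST /deployments; not part of Backstage.
interface DeploymentEvent {
  component: string;
  environment: string;
  version: string;
  status: string;
}

type ValidationResult =
  | { ok: true; event: DeploymentEvent }
  | { ok: false; errors: string[] };

const REQUIRED_FIELDS = ['component', 'environment', 'version', 'status'] as const;

function validateDeployment(body: unknown): ValidationResult {
  if (typeof body !== 'object' || body === null) {
    return { ok: false, errors: ['body must be a JSON object'] };
  }
  const record = body as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    if (typeof value !== 'string' || value.trim() === '') {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    event: {
      component: record['component'] as string,
      environment: record['environment'] as string,
      version: record['version'] as string,
      status: record['status'] as string,
    },
  };
}
```

In the router this would sit at the top of the handler: on `ok: false`, respond with `res.status(400).json({ errors })` and skip the insert.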
Frontend — an entity page card
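// plugins/deploy-board/src/components/DeployBoardCard.tsx
import React from 'react';
import { useAsync } from 'react-use';
import { InfoCard, Table } from '@backstage/core-components';
import {
  discoveryApiRef,
  fetchApiRef,
  useApi,
} from '@backstage/core-plugin-api';
import { useEntity } from '@backstage/plugin-catalog-react';

export const DeployBoardCard = () => {
  const { entity } = useEntity();
  const fetchApi = useApi(fetchApiRef);
  const discoveryApi = useApi(discoveryApiRef);

  const { value, loading, error } = useAsync(async () => {
    // resolve the backend plugin's base URL instead of hard-coding it
    const baseUrl = await discoveryApi.getBaseUrl('deploy-board');
    const res = await fetchApi.fetch(
      `${baseUrl}/deployments/${encodeURIComponent(entity.metadata.name)}`,
    );
    // surface HTTP errors through the `error` state below
    if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
    return res.json();
  }, [entity.metadata.name]);

  if (error) return <InfoCard title="Deployments">Failed to load</InfoCard>;

  return (
    <Table
      title="Deployments"
      isLoading={loading}
      options={{ paging: false, search: false }}
      columns={[
        { title: 'Environment', field: 'environment' },
        { title: 'Version', field: 'version' },
        { title: 'Status', field: 'status' },
        { title: 'Deployed at', field: 'deployed_at' },
      ]}
      data={value ?? []}
    />
  );
};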
Slot this card into the overview tab of the entity page (EntityPage), and every time someone opens a service in the catalog they see per-environment deployment status. This is the canonical pattern of layering data on top of the catalog (Part 1). On the CD pipeline side, all that is needed is a single line that calls the POST endpoint above when a deployment finishes.
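What that pipeline-side call might look like, as a sketch: the function below only builds the request, so it can be reused with `fetch` or any HTTP client. It assumes the backend plugin is reachable under `/api/deploy-board` (the usual mount point for a plugin with that ID) and that your backend accepts a bearer token from the pipeline; how that token is issued depends on your auth setup and is not covered here. `buildDeploymentRequest` and its types are hypothetical.

```typescript
// Hypothetical request builder for the CD pipeline; names are illustrative.
interface DeploymentReport {
  component: string;
  environment: string;
  version: string;
  status: string;
}

interface HttpRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

function buildDeploymentRequest(
  backendBaseUrl: string,
  token: string,
  report: DeploymentReport,
): HttpRequest {
  // Tolerate a trailing slash in the configured base URL.
  const base = backendBaseUrl.replace(/\/+$/, '');
  return {
    // Assumption: plugin routes are served under /api/<pluginId>.
    url: `${base}/api/deploy-board/deployments`,
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(report),
  };
}
```

A pipeline step would then do something like `const { url, ...init } = buildDeploymentRequest(...)` followed by `await fetch(url, init)`, and fail the step on a non-2xx response only if you want deployment reporting to be blocking.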
Key Ecosystem Plugins
Check the ecosystem before building anything yourself. These plugins have high adoption in practice.
| Plugin | Capability | Integration key |
| --- | --- | --- |
| Kubernetes | Pod/deployment status per entity | kubernetes-id annotation |
| ArgoCD | Sync status, deployment history | argocd app selector annotation |
| SonarQube | Quality gates, coverage | sonarqube project key |
| Grafana | Embedded dashboards/alerts | grafana dashboard selector |
| PagerDuty | On-call display, incident creation | pagerduty integration key |
| GitHub Actions | Workflow run status | project-slug annotation |
| Cost Insights | Cloud cost visibility | cost data adapter implementation |
You will notice the common pattern: all of them are activated via catalog annotations. The "annotation governance" emphasized in Part 1 pays off here. With a well-annotated catalog, adding a plugin is a matter of a few lines of configuration.
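The annotation pattern is easy to turn into a governance check. The sketch below lists which plugins an entity is ready for, based on commonly used annotation keys. The keys reflect the conventions of these plugins as far as I know them; verify each against the documentation of the plugin version you install, since some plugins accept alternative keys. `pluginsEnabledFor` is a hypothetical helper.

```typescript
// Hypothetical governance helper. Annotation keys are the commonly documented
// ones; confirm them against the plugin versions you actually run.
const PLUGIN_ANNOTATIONS: Array<[plugin: string, annotation: string]> = [
  ['Kubernetes', 'backstage.io/kubernetes-id'],
  ['ArgoCD', 'argocd/app-selector'],
  ['SonarQube', 'sonarqube.org/project-key'],
  ['Grafana', 'grafana/dashboard-selector'],
  ['PagerDuty', 'pagerduty.com/integration-key'],
  ['GitHub Actions', 'github.com/project-slug'],
];

// Returns the plugins whose activating annotation is present and non-empty.
function pluginsEnabledFor(annotations: Record<string, string>): string[] {
  const enabled: string[] = [];
  for (const [plugin, annotation] of PLUGIN_ANNOTATIONS) {
    const value = annotations[annotation];
    if (typeof value === 'string' && value.trim() !== '') {
      enabled.push(plugin);
    }
  }
  return enabled;
}
```

Run over the whole catalog, a check like this gives you a coverage number per plugin ("62% of services have a PagerDuty key"), which is a more useful adoption metric than whether the plugin is installed.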
Permissions — The Permission Framework
By default, every logged-in Backstage user can see everything. As the organization grows, requirements appear like "only full-time employees may run Scaffolder templates" or "entities of a given system are visible only to that division." The permission framework solves this with policy code.
The structure has three parts: plugins issue permission queries (for example catalog.entity.delete), a policy decides allow/deny, and decisions can take user identity and conditions (conditional decisions) into account.
// packages/backend/src/extensions/permissionPolicy.ts
import { createBackendModule } from '@backstage/backend-plugin-api';
import {
  PermissionPolicy,
  PolicyQuery,
  PolicyQueryUser,
} from '@backstage/plugin-permission-node';
import { policyExtensionPoint } from '@backstage/plugin-permission-node/alpha';
import {
  AuthorizeResult,
  PolicyDecision,
  isPermission,
} from '@backstage/plugin-permission-common';
import { catalogEntityDeletePermission } from '@backstage/plugin-catalog-common/alpha';

class AcmePermissionPolicy implements PermissionPolicy {
  async handle(
    request: PolicyQuery,
    user?: PolicyQueryUser,
  ): Promise<PolicyDecision> {
    // only platform-team may delete catalog entities
    if (isPermission(request.permission, catalogEntityDeletePermission)) {
      const isPlatformTeam =
        user?.info.ownershipEntityRefs.includes(
          'group:default/platform-team',
        ) ?? false;
      return {
        result: isPlatformTeam ? AuthorizeResult.ALLOW : AuthorizeResult.DENY,
      };
    }
    // everything else is allowed by default
    return { result: AuthorizeResult.ALLOW };
  }
}

export const permissionModuleAcmePolicy = createBackendModule({
  pluginId: 'permission',
  moduleId: 'acme-policy',
  register(reg) {
    reg.registerInit({
      deps: { policy: policyExtensionPoint },
      async init({ policy }) {
        policy.setPolicy(new AcmePermissionPolicy());
      },
    });
  },
});
Enable it in app-config:
permission:
  enabled: true
The guiding principle for policy design is to start with "allow by default, restrict only dangerous actions." Laying down a dense deny-first policy from day one blocks portal adoption itself.
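The "allow by default, restrict only dangerous actions" shape can be modeled without the framework at all, which is handy for unit-testing the rules before wiring them into a real policy. The sketch below is a toy: `decide` is not a Backstage API, and the permission names and group refs in the table are examples that you would replace with your own.

```typescript
// Toy model of an allow-by-default policy. Not the Backstage API; the
// permission names and groups in this table are illustrative assumptions.
type Decision = 'ALLOW' | 'DENY';

const RESTRICTED_ACTIONS = new Map<string, string[]>([
  ['catalog.entity.delete', ['group:default/platform-team']],
  ['scaffolder.task.create', ['group:default/full-time-employees']],
]);

function decide(permissionName: string, ownershipEntityRefs: string[]): Decision {
  const allowedGroups = RESTRICTED_ACTIONS.get(permissionName);
  // Anything not explicitly listed as dangerous is allowed.
  if (allowedGroups === undefined) {
    return 'ALLOW';
  }
  const isMember = allowedGroups.some(group => ownershipEntityRefs.includes(group));
  return isMember ? 'ALLOW' : 'DENY';
}
```

Keeping the restricted list as data makes the policy reviewable: adding a restriction is a one-line diff that can go through the RFC process, rather than another `if` branch buried in code.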
Production Operations — Upgrades, the Monorepo, and the DB
Release Trains and Upgrade Strategy
Backstage ships a mainline release every month, with patches in between. A universal lesson from practice: the longer you defer upgrades, the more nonlinearly the migration cost grows. Recommended operating mode:
- **Pin an upgrade slot in the calendar at least quarterly, ideally monthly.**
- Perform upgrades with the dedicated CLI.
  # bump all @backstage packages to the target release
  yarn backstage-cli versions:bump
  # check for duplicates/conflicts among changed packages
  yarn backstage-cli versions:check
  # verify with build and tests
  yarn tsc && yarn build:all && yarn test:all
- Apply each release while reviewing diffs with the Backstage Upgrade Helper. In particular, the migration guides for the backend system and the frontend system must be read alongside the release notes.
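Because mainline releases are monthly, the gap in minor versions is a rough proxy for how many months of upgrades you owe. A small sketch that could feed a dashboard or a CI warning follows. It assumes you read the current release from the `version` field of `backstage.json` in the app root and obtain the latest release separately; `minorReleasesBehind` is a hypothetical helper.

```typescript
// Hypothetical helper: how many minor releases behind is this instance?
function parseRelease(version: string): { major: number; minor: number } {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (match === null) {
    throw new Error(`Not a release version: "${version}"`);
  }
  return { major: Number(match[1]), minor: Number(match[2]) };
}

function minorReleasesBehind(current: string, latest: string): number {
  const a = parseRelease(current);
  const b = parseRelease(latest);
  if (a.major !== b.major) {
    // A major jump needs a human to read the migration notes.
    throw new Error('Major versions differ; compare manually');
  }
  // Patch-level differences are ignored on purpose.
  return Math.max(0, b.minor - a.minor);
}
```

A reasonable policy on top of this is to warn at two releases behind and page the platform team at four, but the thresholds are yours to pick.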
Managing the yarn Monorepo
A Backstage app is a yarn workspaces monorepo. The most common operational problem is dependency duplication: different versions of the same @backstage package coexisting can cause type errors or runtime misbehavior. Resolve it with `yarn backstage-cli versions:check --fix`, and manage the @backstage dependencies of custom plugins in a peerDependency style so they track the host version.
DB Migrations
Each backend plugin manages its own schema with Knex migrations, applied automatically at boot. Two things matter operationally. First, the first boot right after an upgrade may be slow for as long as migrations take, so leave headroom in readiness probe timeouts. Second, the rollback scenario: rolling back to an older version after the schema has advanced can break, so include a DB snapshot before upgrades in your standard procedure.
Performance — Catalog Scale and Caching
Past a few thousand entities, several points need attention.
- **Processing loop load**: every entity is reprocessed periodically. As entity count grows, DB and CPU load grow with it. If you wrote custom processors, make sure slow work like external API calls does not sit inside the loop.
- **DB tuning**: monitor the PostgreSQL connection pool size and the index behavior centered on the refresh_state table. Most catalog load shows up in the database.
- **Search indexing**: tune the search plugin indexing schedule to your entity volume, and at large scale move the search backend from PostgreSQL search to a dedicated engine (e.g. OpenSearch/Elasticsearch).
- **Cache layer**: switching the cache store from memory to an external store like Redis stabilizes cache hit rates across multiple replicas.
  backend:
    cache:
      store: redis
      connection: redis://backstage-redis:6379
- **Horizontal scaling**: the Backstage backend is designed close to stateless, so read load can be spread by adding replicas. Scheduled tasks avoid duplicate execution through DB coordination.
Observability — The Portal Is a Service Too
Backstage itself must be treated as a production service with SLOs.
- **Metrics**: the backend supports OpenTelemetry instrumentation. Collect HTTP request latency/error rate, catalog processing queue delay, and Scaffolder task success rate as core indicators and export them to Prometheus.
- **Logs**: send structured JSON logs to your standard collection pipeline (e.g. Loki, CloudWatch). Catalog processing error logs in particular are the primary signal telling you "which entity is failing to refresh and why."
- **Dashboards and alerts**: portal availability, login success rate, catalog freshness (timestamp of last successful sync), and TechDocs build failure rate are enough for an initial operations dashboard.
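Catalog freshness is the least standard of these indicators, so here is one way to define it. This sketch treats freshness as "minutes since the last successful sync" compared against a target. How you obtain the timestamp (a metric, a log query, a database lookup) is left open, and `catalogFreshness` is a hypothetical function.

```typescript
// Hypothetical freshness check; the data source for the timestamp is up to you.
interface FreshnessReport {
  stalenessMinutes: number;
  withinSlo: boolean;
}

function catalogFreshness(
  lastSuccessfulSync: Date,
  now: Date,
  maxStalenessMinutes: number,
): FreshnessReport {
  const elapsedMs = now.getTime() - lastSuccessfulSync.getTime();
  // Clock skew can make the difference negative; clamp to zero.
  const stalenessMinutes = Math.max(0, elapsedMs / 60_000);
  return {
    stalenessMinutes,
    withinSlo: stalenessMinutes <= maxStalenessMinutes,
  };
}
```

With a 30-minute target, a sync that last succeeded 45 minutes ago reports `withinSlo: false`, which is the condition to alert on.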
Security — Plugin Supply Chain and Secrets
- **Plugin supply chain**: community plugins are npm packages. Before adopting one, review maintenance activity and the scope of permissions it requires, and include internal npm proxying/lockfile auditing (yarn npm audit, socket-style tools) in CI. Never forget this is code entering your trust boundary.
- **Secret management**: keep all app-config secrets as environment variable references and inject real values from Kubernetes Secrets or an external secret manager (Vault, AWS Secrets Manager). Credentials like GitHub App key files must never be baked into images.
- **Network boundary**: the Backstage backend becomes a hub with access to systems all over the company. Make outbound access targets explicit with NetworkPolicy and issue tokens with least privilege.
- **Auth token verification**: backend-to-backend calls are protected with service tokens; regularly verify that externally exposed endpoints sit behind authentication middleware.
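The secret-management rule in the list above can be enforced mechanically. As a sketch, the function below walks an already-parsed config object and reports sensitive-looking keys whose value is a literal rather than a `${VAR}` environment reference. The key pattern is a heuristic I chose for illustration, so expect to tune it, and `findInlineSecrets` is a hypothetical helper rather than a Backstage tool.

```typescript
// Hypothetical CI check: flags literal values under sensitive-looking keys.
// The key pattern is a heuristic, not an exhaustive list.
const SENSITIVE_KEY = /(secret|token|password|privatekey|apikey)/i;
const ENV_REFERENCE = /^\$\{[A-Za-z_][A-Za-z0-9_]*\}$/;

function findInlineSecrets(node: unknown, path: string[] = []): string[] {
  const findings: string[] = [];
  if (Array.isArray(node)) {
    node.forEach((item, index) => {
      findings.push(...findInlineSecrets(item, [...path, String(index)]));
    });
    return findings;
  }
  if (typeof node !== 'object' || node === null) {
    return findings;
  }
  const record = node as Record<string, unknown>;
  for (const key of Object.keys(record)) {
    const value = record[key];
    const nextPath = [...path, key];
    if (typeof value === 'string') {
      // A string is fine only if it is an environment variable reference.
      if (SENSITIVE_KEY.test(key) && !ENV_REFERENCE.test(value)) {
        findings.push(nextPath.join('.'));
      }
    } else {
      findings.push(...findInlineSecrets(value, nextPath));
    }
  }
  return findings;
}
```

Failing the build when the returned list is non-empty catches the most common leak, a developer pasting a real token into `app-config.yaml` to test something locally and committing it.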
Organizational Operating Model — Platform Team and Contribution Model
Half of operating Backstage is organizational design. The proven model is the combination of "a central platform team + internal open-source contributions."
- **The platform team** owns core operations (upgrades, auth, catalog governance, shared plugins). Starting with two to four dedicated people is common.
- **Domain teams** contribute plugins/templates for their own tools directly. This requires an internal open-source model with documented contribution guidelines (code review criteria, plugin quality bar, ownership commitments).
- **An RFC process**: when introducing major changes to the portal (a new core plugin, permission policy changes, metadata schema changes), a lightweight RFC document process for gathering input prevents both platform-team unilateralism and field-level chaos.
Adoption Maturity Roadmap
Tying the three parts of this series into a maturity model:
| Level | Stage | What it looks like | Covered in |
| --- | --- | --- | --- |
| 0 | Installed | Demo instance, a few manually registered entities | - |
| 1 | Catalog live | Discovery automation, ownership model, auth | Part 1 |
| 2 | Self-service | Golden path templates, day-2 actions, adoption | Part 2 |
| 3 | Knowledge hub | TechDocs everywhere, ecosystem plugins, search | Part 3 |
| 4 | Mature ops | Permission policy, SLOs, contribution model, regular upgrades | Part 3 |
| 5 | Standard platform | Portal is the default path, metric-driven continuous improvement | - |
Let me re-emphasize: attempts to skip levels (plugins before the catalog, permissions before adoption) are a leading cause of failure.
Production Checklist
**TechDocs**
- [ ] builder external + object storage publisher configured
- [ ] Documentation CI pipeline (auto build/upload on merge)
- [ ] mkdocs skeleton included in Scaffolder skeletons
**Plugins**
- [ ] New needs evaluated against ecosystem plugins first
- [ ] Custom plugins follow the frontend/backend/common three-package structure
- [ ] Owner and on-call assigned per plugin
**Permissions/Security**
- [ ] Permission framework enabled, policies restricting dangerous actions
- [ ] Secrets injected externally, no credentials in images
- [ ] Plugin dependency auditing included in CI
**Operations**
- [ ] Monthly/quarterly upgrade slots pinned, versions:bump procedure documented
- [ ] Standard DB snapshot procedure before upgrades
- [ ] OpenTelemetry metrics + structured log collection
- [ ] Portal SLOs defined (availability, catalog freshness)
- [ ] Redis cache + multi-replica configuration
- [ ] Platform team ownership and contribution guidelines documented
- [ ] RFC process in operation
Closing
Across three parts we have walked through the skeleton of a Backstage-based IDP. The catalog in Part 1 turned the organization's software reality into data; the Scaffolder in Part 2 turned organizational standards into executable code; and this Part 3 layered documentation and the plugin ecosystem on top, covering how to operate the whole thing at production quality. One final thought: an IDP is not a tooling project but a product. It has users (developers), a metric called adoption, and it needs a roadmap and operational ownership. Backstage is the most mature open-source foundation for building that product.
References
- [Backstage TechDocs documentation](https://backstage.io/docs/features/techdocs/)
- [TechDocs recommended CI/CD architecture](https://backstage.io/docs/features/techdocs/architecture)
- [MkDocs official documentation](https://www.mkdocs.org/)
- [Backstage plugin development guide](https://backstage.io/docs/plugins/)
- [New backend system documentation](https://backstage.io/docs/backend-system/)
- [Permission Framework documentation](https://backstage.io/docs/permissions/overview)
- [Backstage Upgrade Helper](https://backstage.github.io/upgrade-helper/)
- [Backstage release and versioning policy](https://backstage.io/docs/releases/versioning-policy)
- [OpenTelemetry official site](https://opentelemetry.io/)
- [CNCF Backstage project page](https://www.cncf.io/projects/backstage/)