2026 AI Coding Agent Head-to-Head — Claude Code vs Cursor vs GitHub Copilot vs OpenAI Codex vs Aider vs OpenClaw: A Practitioner's Buyer's Guide
Prologue — The field has consolidated into a few serious harnesses
The AI coding tool market in 2023 was chaos. A new extension shipped every week, the demos were dazzling, and almost nothing survived contact with real work. The view in spring 2026 is different. The field has consolidated. The number of "harnesses" you can seriously trust with production code now fits on one hand.
I use the word harness deliberately. What we're comparing is not the model. Claude, GPT, Gemini — they're all good. What we're comparing is the runtime that connects a model to your codebase, terminal, and CI — how it gathers context, how it calls tools, how it applies changes, where it puts the guardrails. The same Claude Opus gives you a completely different experience inside Claude Code, Cursor, and Aider. The harness makes the difference.
This is not a career-survival post. It does not touch questions like "will AI replace developers." This is a practitioner's buyer's guide. It compares six tools — Claude Code, Cursor, GitHub Copilot, OpenAI Codex, Aider, OpenClaw — head-to-head along the same axes, tells you which tool fits which situation, and above all, shows you how to verify the choice on your own codebase.
Why these six? The criteria are simple. (1) Is it actually maintained and updated as of spring 2026? (2) Does it have enough autonomy to trust with production code, not just toy problems? (3) Does it represent a distinct workflow? I picked the tools that satisfy all three. Tools like Windsurf, JetBrains AI, Cline, Antigravity, and Kiro are serious too, but these six cover nearly the whole design space of "surface x autonomy x pricing model." Understand the six and the rest read as variations.
Pricing and feature numbers change fast. In early 2026 alone, three of these tools changed their pricing model. I'll pin specific numbers as "as of early 2026" and focus on the structural differences that drive decisions — the parts that will still be true six months out. Verify the numbers yourself, but if you understand the structure, your judgment doesn't wobble when the numbers move.
Models are becoming commodities; the harness is becoming the moat. Choosing a tool means choosing a workflow, not a model.
Chapter 1 · The comparison axes — what to actually look at
Pick a tool by "vibe" and you'll regret it in three months. Decompose it along these seven axes.
Axis 1 · Surface (where it runs) CLI, IDE, or cloud? This is not taste — it's a workflow decision. A CLI harness attaches naturally to the terminal, Git, and CI, and is easy to script. An IDE harness is strong on inline diffs, tab completion, and debugger integration. A cloud harness is asynchronous — you throw it a ticket, go do something else, and get a PR back.
Axis 2 · Autonomy level Completion (next-line suggestion) -> inline edit (block level) -> agent (multi-file, multi-step, runs its own tests) -> async agent (finishes end-to-end with no human). Every tool has a different "default mode." Copilot started from completion; Claude Code and Codex started from the agent.
Axis 3 · Context handling A large model context window and a harness that fills it well are two different things. The key questions: how does it find relevant files (embedding index, grep, both?), how does it compress a large repo, how does it manage context over a long session? As of early 2026, some harnesses experimentally support a 1M-token window — roughly 25,000-30,000 lines seen at once without chunking.
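To ground the 1M-token figure on your own repo, a crude census is enough. A minimal sketch, assuming ~4 characters per token for source code (the ratio and the extension list are assumptions — calibrate against a real tokenizer for your stack):

```python
import os

CHARS_PER_TOKEN = 4   # rough heuristic for code; an assumption, not a spec
WINDOW = 1_000_000    # the 1M-token window discussed above

def estimate_repo_tokens(root: str, exts=(".py", ".ts", ".go", ".java")) -> int:
    """Walk a repo and estimate the total token count of its source files."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip noise directories that a harness would also ignore.
        dirnames[:] = [d for d in dirnames if d not in (".git", "node_modules")]
        for name in filenames:
            if name.endswith(exts):
                try:
                    with open(os.path.join(dirpath, name),
                              encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    pass  # unreadable file; skip it
    return total_chars // CHARS_PER_TOKEN

if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens; fits the 1M window whole: {tokens <= WINDOW}")
```

If the estimate comes in well under the window, whole-repo reading is plausible; if it's several times over, the harness's compression and retrieval strategy is what you're actually buying.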
Axis 4 · Tool / MCP support An agent needs tools to do work. Bash, file editing, and Git are table stakes. Above that, support for MCP (Model Context Protocol) is the dividing line. MCP is a protocol for attaching external tools — databases, issue trackers, browsers, internal APIs — in a standard way, and as of 2026 it has effectively become the industry standard. Support MCP and you borrow the whole ecosystem.
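To make "attaching a tool via MCP" concrete, here is a minimal server sketch using the official MCP Python SDK (the import path matches the SDK at the time of writing; the lookup_ticket tool and its data are invented for illustration):

```python
# pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tools")  # the server name a harness will see

# Hypothetical stand-in for an internal issue-tracker lookup.
_TICKETS = {"OPS-42": "Rotate the staging DB credentials"}

@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Return the summary of an internal ticket by ID."""
    return _TICKETS.get(ticket_id, f"No ticket found for {ticket_id}")

if __name__ == "__main__":
    mcp.run()  # serves MCP over stdio; register it in your harness's MCP config
```

Any harness that speaks MCP can now call lookup_ticket without bespoke glue — that's the "borrow the whole ecosystem" point.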
Axis 5 · Pricing model There are three patterns. (a) Flat subscription — predictable, favors heavy users. (b) Token/credit based — pay for what you use, favors light users but high variance. (c) Seat based — per team. As of early 2026 the industry is broadly moving toward token-based pricing, so "how much per month" is an increasingly hard answer. Always estimate the real monthly cost for a heavy user.
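The flat-vs-token decision is a one-screen calculation. A minimal sketch — every number in it (flat fee, per-Mtok rates, daily volumes) is a placeholder assumption; substitute your own plan's figures:

```python
# Placeholder figures -- substitute your actual plan and usage.
FLAT_MONTHLY = 20.00   # flat subscription, $/month
IN_PER_MTOK = 3.00     # token plan: $ per million input tokens
OUT_PER_MTOK = 15.00   # token plan: $ per million output tokens

def token_plan_cost(in_mtok_day: float, out_mtok_day: float,
                    workdays: int = 21) -> float:
    """Monthly cost on a token/credit plan for a given daily usage."""
    return (in_mtok_day * IN_PER_MTOK + out_mtok_day * OUT_PER_MTOK) * workdays

for label, i, o in [("light", 0.05, 0.01), ("medium", 1.0, 0.25), ("heavy", 5.0, 1.5)]:
    cost = token_plan_cost(i, o)
    print(f"{label:>6}: ${cost:8.2f}/month on tokens -> "
          f"{'token plan' if cost < FLAT_MONTHLY else 'flat plan'} wins")
```

The shape of the answer is what matters: light users favor tokens, heavy users favor flat, and the crossover point moves whenever the rates do.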
Axis 6 · Sandbox model Can the agent run rm -rf? The permission model is central. (a) Approval gate — a human says yes/no on every dangerous command. (b) Sandbox — runs in an isolated environment (container/VM) and only shows you the diff. (c) Full access — fast but dangerous. Cloud harnesses are usually (b); CLI harnesses offer (a) and (c) as options.
Axis 7 · Ecosystem and governance SSO, audit logs, team policy, third-party extensions, community size. Trivial for a solo developer, decisive for a 50-person team. Can you trace who ran an agent on which code? Can cost be split per team and per project? Is there a data-handling policy your security team will approve? Without answers to these, enterprise adoption stalls.
How to use the axes Don't use these seven as a checklist — assign weights. For a solo IC, axes 1, 2, 3, and 5 matter and axis 7 is nearly meaningless. For a platform engineer on a 50-person team, axes 5, 6, and 7 are decisive and the fine differences on axis 2 are noise. Looking at the same table, a different tool wins depending on the role. That's why a headline like "the best AI coding tool" is meaningless — the question is wrong.
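One way to make the weighting mechanical: score each tool 1-5 per axis from the comparison table, weight by role, and rank the sums. Scores and weights below are illustrative placeholders, not measurements:

```python
AXES = ["surface", "autonomy", "context", "tools", "pricing", "sandbox", "governance"]

# Illustrative 1-5 scores per axis -- fill in your own from the Chapter 8 table.
scores = {
    "tool_a": [5, 5, 5, 4, 3, 4, 4],
    "tool_b": [3, 2, 3, 3, 5, 5, 5],
}

# The same axes, weighted differently per role (0 = ignore).
weights = {
    "solo_ic":      [3, 3, 3, 1, 3, 1, 0],
    "platform_eng": [1, 1, 1, 2, 3, 3, 3],
}

for role, w in weights.items():
    ranked = sorted(scores, key=lambda t: -sum(s * x for s, x in zip(scores[t], w)))
    print(f"{role}: {ranked}")
```

With these placeholder numbers the two roles rank the same pair in opposite orders — which is exactly the point the paragraph above makes.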
Keep these seven axes in mind, and now let's go through the tools one by one. Each chapter follows the same frame — Surface, strengths, autonomy/sandbox, pricing, weaknesses, one-line summary. Fixing the frame is what makes the comparison fair.
Chapter 2 · Claude Code — the reference point for terminal-native agents
Surface: CLI-first. It's an agent that runs in the terminal; there are IDE extensions (VS Code, etc.) too, but its identity is the CLI.
What it does well Claude Code is the reference point for the "agent by default" harness. It holds the filesystem, Git, and Bash as tools, and is strong at multi-file refactors and large-codebase exploration. As of early 2026, Claude Opus 4.6 processes a 1M-token context — meaning it reads a large repo whole, without chunking, and the felt difference is large on tasks like "find everywhere this pattern breaks."
It treats MCP as a first-class citizen. It attaches internal databases, issue trackers, and browser automation via the standard protocol. The skill and subagent concepts break a large task into small units, and project memory like CLAUDE.md injects your conventions.
Autonomy and sandbox Approval gate by default — a human confirms dangerous commands. You can reduce friction by pre-listing permissions in an allowlist. Loosen it as trust builds; tighten it on a codebase you don't know.
Pricing As of early 2026, a Claude Pro subscription runs around $20/month, with Max tiers ($100/month and $200/month) for heavy users. If your usage is high, a higher tier is effectively mandatory.
Weaknesses The pure inline-edit and tab-completion experience is weaker than IDE-native tools. The terminal is the primary interface, so don't expect GUI debugger integration. Use it heavily and cost climbs fast, pushing you to a higher tier — it can be overkill for a light user.
When not to use it If most of your day is "writing a few functions fast within a single file," Claude Code is overkill. IDE tab completion is faster for that loop. Claude Code's value comes from multi-file, large-scale, exploratory work — if you have little of that, another tool is better.
One-line summary: The quality reference point for multi-file work and large-repo exploration. The first candidate for anyone on a terminal workflow.
Chapter 3 · Cursor — the speed of an AI-native IDE
Surface: IDE. A standalone editor forked from VS Code.
What it does well Cursor's identity is speed. Tab completion (next-edit prediction) is the smoothest in the industry, and multi-file editing is handled by Agent/Composer mode. The loop of seeing a change inline and instantly accepting or rejecting it is fast — the "never take your hands off the editor" experience.
You can choose among several backend models, and it finds relevant files via a codebase embedding index. The turnaround speed of everyday editing — writing functions, small refactors, boilerplate — is the core strength.
Autonomy and sandbox Completion and inline editing are the sweet spot, but Agent mode does multi-step autonomous execution too. Running terminal commands goes through an approval gate. Sandbox isolation is shallower than what the CLI harnesses offer.
Pricing As of early 2026, the individual plans are Hobby (free), Pro (around $20/month), and Ultra (around $200/month). The catch is usage-based overage: heavy Agent users report real bills of $60-100/month of usage on Pro, power users often $200+ — so beware: you come in expecting a flat fee and get surprised by usage billing.
Weaknesses As a standalone editor, you have to leave VS Code (an advantage if you're used to it, a downside if not). It's weak on async ticket work. The real cost for a heavy user runs higher than the surface price — this is the most common complaint.
When not to use it If your main mode is the async "throw it an issue and walk away" workflow, Cursor is not the fit. Cursor's strength shows when a human is sitting in front of the editor. Also, in an environment that can't tolerate cost variance (a budget-tight team), a tool with a predictable flat fee is better.
One-line summary: If speed inside the editor is the top priority, Cursor. Just estimate your real usage cost first.
Chapter 4 · GitHub Copilot — value and integration
Surface: Multi-IDE extension. It attaches to VS Code, JetBrains, and a CLI. Not a standalone app — it sits on top of "the editor you already use."
What it does well Copilot started from completion and expanded into agent mode and a coding agent. There are two strengths. First, value — it's the cheapest serious option. Second, GitHub integration — coupling with issues, PRs, and Actions, plus mature enterprise licensing, SSO, and policy management.
The coding agent is an async workflow: assign it a GitHub issue and it creates a branch and opens a PR in the background. If your team already lives on GitHub, the friction is the lowest.
Autonomy and sandbox Completion and inline are still the core, but agent mode does multi-file work and the coding agent does async work. The cloud agent runs in an isolated environment and returns its result as a PR.
Pricing As of early 2026: Free (limited), Pro (around $10/month, Pro+ $39/month), and Business (around $19/user/month, Enterprise $39/user/month). Note: it was announced that billing moves from request-based to usage-based as of June 1, 2026, so keep the billing-structure change in mind.
Weaknesses The "depth" of agent autonomy is widely seen as not yet matching the full-agent experience of Claude Code or Codex. As a multi-IDE extension, the weight is on "editor augmentation" rather than the most aggressive agent workflows.
When not to use it If the most aggressive autonomous workflow — "the agent finishes it all on its own" — is the core value, Copilot's agent depth may feel lacking. Also, for an organization not on GitHub (GitLab/Bitbucket centric), its biggest strength, the integration, disappears.
One-line summary: If you already live on GitHub and value and enterprise management matter, Copilot. The safe default for a team.
Chapter 5 · OpenAI Codex — ambidextrous across CLI and cloud
Surface: CLI + cloud + desktop app. Three branches: an open-source CLI tool, a cloud agent bundled into ChatGPT subscriptions, and a macOS desktop app launched February 2026.
What it does well
Codex's strength is that it binds CLI and cloud into one flow. The codex cloud command lets you launch and triage cloud tasks without leaving the terminal, and you browse active and finished tasks in an interactive picker. You can also give a task --attempts (1-4) to request best-of-N runs — run the same task several times and pick the best.
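The value of --attempts is easy to quantify under one simplifying assumption: if a single attempt passes with probability p and attempts are independent, at least one of N passes with probability 1-(1-p)^N. Real attempts are correlated (they share the model and the prompt), so treat this as an optimistic bound:

```python
def best_of_n(p: float, n: int) -> float:
    """P(at least one of n independent attempts succeeds)."""
    return 1 - (1 - p) ** n

for p in (0.4, 0.6, 0.8):
    print(f"p={p:.0%} -> " +
          ", ".join(f"n={n}: {best_of_n(p, n):.0%}" for n in (1, 2, 4)))
# p=40% -> n=1: 40%, n=2: 64%, n=4: 87%
```

The marginal gain shrinks as n grows — and you pay for best-of-N in tokens, not just wall-clock time.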
As of early 2026, GPT-5.4 has native computer-use capability and experimental support for a 1M context window, and stronger tool use plus tool search help the agent find the right tool more efficiently. Remote workflows are polished too — codex remote-control brings up a headless, remotely controllable app server.
Autonomy and sandbox
Agent by default. The local CLI offers an approval gate and a sandbox mode as options; the cloud runs in an isolated environment and returns its result. The /goal workflow creates a long-horizon goal you can pause, resume, and clear.
Pricing As of early 2026, Codex is included with ChatGPT Plus, Pro, Business, and Enterprise/Edu, with limited-time Free and Go access. But as of April 2, 2026, Codex pricing moved to token-based credits for most Plus, Pro, Business, and Enterprise customers — usage tracking is mandatory.
Weaknesses The three-branch surface (CLI/cloud/desktop) is both a strength and a learning curve. The token-based shift made cost prediction harder. You're tied to the OpenAI ecosystem.
When not to use it If you don't want to be tied to a model vendor, Codex is not the fit — it presumes OpenAI models. Also, if you only want simple inline editing but have to learn the concepts of all three branches (CLI, cloud, desktop), the learning cost is excessive.
One-line summary: If you want one tool to move between async cloud work and terminal work, and you already use ChatGPT, Codex.
Chapter 6 · Aider — Git-first, model-neutral
Surface: CLI. A pair-programming tool that runs in the terminal, and it's open source.
What it does well
Aider's philosophy is Git-first. It auto-commits every change as a meaningful unit — what the agent did is perfectly traceable via git log, and if you don't like it, it's one git revert. This isn't a small detail; it changes the entire trust model.
The second strength is model neutrality. GPT, Claude, Gemini, local models — attach anything. Architect mode is especially clever: a strong (expensive) model designs "how to solve it," and a cheap, fast editor model translates that design into specific file edits. The recommended 2026 workflow is a GPT-5 architect plus a cheaper editor, and on multi-file refactors it measurably reduces errors versus a single model while costing 30-50% less.
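A back-of-envelope sketch of why the split saves money. Every rate and token count below is an invented placeholder, but with plausible numbers the savings land in the 30-50% band the paragraph cites:

```python
# Placeholder $/Mtok rates -- substitute the real ones for your model pair.
STRONG_IN, STRONG_OUT = 3.00, 15.00   # "architect" model, input/output
CHEAP_IN, CHEAP_OUT = 0.50, 1.50      # "editor" model, input/output

CONTEXT = 200_000  # repo context read by whichever model does the work
PLAN = 2_000       # short high-level plan emitted by the architect
EDITS = 40_000     # bulk of the concrete file edits

# One strong model does everything: reads context, writes plan + edits.
single = CONTEXT / 1e6 * STRONG_IN + (PLAN + EDITS) / 1e6 * STRONG_OUT

# Architect mode: strong model writes only the plan; cheap model writes the edits.
split = (CONTEXT / 1e6 * STRONG_IN + PLAN / 1e6 * STRONG_OUT
         + CONTEXT / 1e6 * CHEAP_IN + EDITS / 1e6 * CHEAP_OUT)

print(f"single strong model: ${single:.2f}/task")
print(f"architect + editor:  ${split:.2f}/task ({1 - split / single:.0%} cheaper)")
```

The structural insight: the plan is short and the edits are long, so routing the long output to the cheap model — and paying the cheap input rate for its context read — is where the saving comes from.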
Practical features are solid — watch mode (instructing via code comments), prompt caching, /web and /voice, the .aider.conf.yml config model, and the polyglot leaderboard. Being open source, there's no subscription cost — you only pay model API costs.
Autonomy and sandbox Inline editing plus auto-commit is the core loop. It's closer to a "traceable pair programmer" than a large autonomous agent. The guardrail is Git itself — everything is committed, so reverting is easy.
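Because every agent step lands as a commit, auditing and undoing stays plain Git. A minimal sketch using only core Git commands (wrapped in Python just to match the other examples; run the same two commands by hand if you prefer):

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command and return its stdout."""
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout

# Audit: each agent edit is one commit, so the log is the activity trail.
print(git("log", "--oneline", "-n", "10"))

# Undo: revert the most recent commit without opening an editor.
# Uncomment only when you actually dislike the change.
# git("revert", "--no-edit", "HEAD")
```

This is the trust model the paragraph describes: you don't need to trust the agent, only the reversibility of its commits.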
Pricing The tool itself is free (open source). Cost is entirely model API usage. Architect mode lowers cost substantially.
Weaknesses The MCP and third-party extension ecosystem is thinner than commercial tools. There's no IDE integration or GUI (the CLI is everything). It's weak on the most aggressive async agent workflows.
One-line summary: If Git traceability, model-choice freedom, and cost control are the top priorities, Aider. The choice of the open-source minimalist.
Chapter 7 · OpenClaw — an autonomous agent with a messaging interface
Surface: Messaging app. It works as a chatbot inside Signal, Telegram, Discord, and WhatsApp, runs locally, and is open source.
What it does well OpenClaw is the odd one out on this list. It did not start as a coding-only IDE agent; it's a general-purpose personal AI agent — first released in November 2025 under the name Clawdbot, it went through two renames in early 2026 (Moltbot -> OpenClaw). It was created by PSPDFKit founder Peter Steinberger, and in early 2026 it became a phenomenon as its GitHub star count crossed 100,000.
The core trait is self-improvement. For a task you want done, it writes its own code to create new skills, implements proactive automation, and maintains long-term memory of your preferences. It does coding work through a coding-agent skill. It plugs into an external LLM (Claude, DeepSeek, OpenAI GPT, etc.), so it's model-neutral.
The real appeal is the interface. It lives in a messenger, not an IDE or a terminal — making an async, ambient workflow possible, like sending "fix that bug from yesterday and open a PR" over Signal on your commute.
Autonomy and sandbox It aims for high autonomy — that's why it's called "self-improving." Because it runs locally, you have to design the sandbox and permission management yourself. The higher the autonomy, the more careful the setup needs to be.
Pricing Open source and run locally. There's no tool cost — you only pay the API cost of the LLM you attach.
Weaknesses Its maturity as a pure coding harness lags Claude Code, Codex, and Cursor — its essence is a general-purpose assistant. A messaging interface is inconvenient for fast inline code review. The higher the autonomy, the heavier the burden of local security and permission design. As of early 2026, the governance structure (a non-profit foundation) is only just settling in.
One-line summary: If you want an ambient agent that automates not just coding but your whole life, and you can manage the local setup yourself, OpenClaw. The most experimental choice.
Chapter 8 · The big comparison table
Six tools, seven axes, at a glance. All figures are as of early 2026 and change fast.
| Axis | Claude Code | Cursor | GitHub Copilot | OpenAI Codex | Aider | OpenClaw |
|---|---|---|---|---|---|---|
| Surface | CLI-first (+IDE ext.) | AI-native IDE | Multi-IDE ext. +CLI | CLI +cloud +desktop | CLI | Messaging app |
| Default autonomy | Agent | Completion/inline (+agent) | Completion/inline (+agent) | Agent (+async) | Inline +auto-commit | High-autonomy general |
| Context handling | 1M window, whole large repo | Embedding index | Repo-aware | 1M window exp., tool search | Repo map +manual add | Long-term memory |
| MCP / tools | MCP first-class | Tool support | Tools +GitHub integration | Stronger tool use/search | Thin extensions | Self-written skills |
| Pricing model | Subscription (Pro/Max) | Subscription+usage (surprise risk) | Seat+usage (transition coming) | Token credits (moved) | Free (API cost only) | Free (API cost only) |
| Sandbox | Approval gate | Approval gate | Cloud isolation | Gate+sandbox, cloud isolation | Git = guardrail | User-designed |
| Ecosystem/governance | MCP ecosystem, fast | Editor ecosystem | Mature enterprise/SSO | OpenAI ecosystem | Open source, thin | New foundation, huge community |
| Async ticket work | Moderate | Weak | Strong (coding agent) | Strong (cloud) | Weak | Strong (messenger) |
| Solo IC fit | High | Very high | High | High | High | Medium |
| Team/governance fit | High | Medium | Very high | High | Medium | Low |
| Cost predictability | Medium | Low | Medium | Low | High (controlled via architect) | High |
| One-line identity | Multi-file quality reference | Editor speed | Value + integration | CLI/cloud ambidextrous | Git-first, model-neutral | Ambient autonomous agent |
Don't pick from the table alone. The table is a tool for narrowing candidates — the decision happens in the next two chapters.
Chapter 9 · The decision matrix — which tool for which situation
There is no "best" tool. There is only "fits this situation."
Situation 1 · Solo IC, everyday-editing centric If "never taking your hands off the editor, writing functions and small refactors fast" is 80% of your day -> Cursor. But if you're a heavy user, estimate the monthly cost first. If you want tight cost control and the terminal is comfortable -> Aider (architect mode).
Situation 2 · Solo IC, big-refactor / exploration centric If you have a lot of multi-file, large-scale work like "find everywhere this pattern breaks" or "migrate this entire module to the new API" -> Claude Code. It sees a 1M context without chunking. Codex CLI is a strong alternative too.
Situation 3 · Async ticket work If you want to throw it an issue, go do something else, and get a PR back -> GitHub Copilot coding agent (when you already live on GitHub) or OpenAI Codex cloud. If a messenger-based ambient workflow appeals to you -> OpenClaw.
Situation 4 · Team, governance matters If you need SSO, audit logs, seat management, and policy -> GitHub Copilot is the safest default. Claude Code has high team fit too. Cursor is possible but weigh its cost variance; OpenClaw, weigh its governance maturity.
Situation 5 · Control cost to the cent If you want model API cost only, no subscription, and even that minimized via architect mode -> Aider. OpenClaw is open source and local too, so the tool cost is zero.
Situation 6 · You need model-choice freedom If you don't want to be tied to a specific vendor and want to freely swap GPT, Claude, Gemini, and local models -> Aider or OpenClaw. Both are model-neutral.
The realistic combination The common 2026 setup is not a single tool but a combination — everyday editing in Cursor or Copilot (IDE), complex multi-file work in Claude Code or Codex (terminal). Don't make a religion of one tool; switch hands to match the work type.
Chapter 10 · How to actually evaluate them on your codebase
Review posts, benchmarks, and leaderboards are only a starting point. Performance on your own repo is the only data that means anything. Verify with this protocol within one to two weeks.
Step 1 · Pick 5 representative tasks Pull them from your real backlog. Not toy problems for a demo, but: (a) one small bug fix, (b) one new feature, (c) one multi-file refactor, (d) one test addition, (e) one understand-and-explain of an unfamiliar code area. These five should represent the distribution of your work.
Step 2 · Run the same task through 2-3 candidates You should have narrowed to 2-3 candidates in Chapter 9. Run each with the same task, the same prompt, the same starting commit. A fair comparison comes from controlled inputs.
Step 3 · Record the quantitative metrics Measure per task: (a) first-attempt accuracy (did it pass with no human intervention?), (b) wall-clock time, (c) tokens/cost, (d) number of human revision rounds, (e) cleanliness of the final diff (did unnecessary changes get mixed in?).
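A minimal way to capture these metrics as you go — one row per (task, tool) run, matching the record skeleton shown later in this chapter. The field names and the CSV file name are just suggestions:

```python
import csv
from dataclasses import dataclass, asdict, fields

@dataclass
class Run:
    task: str          # e.g. "T1-bug"
    tool: str          # candidate harness
    first_pass: bool   # passed with no human intervention?
    wall_min: float    # wall-clock minutes
    cost_usd: float    # tokens/credits converted to dollars
    revisions: int     # human revision rounds
    diff_clean: int    # 1-5 cleanliness of the final diff
    notes: str = ""

def append_run(run: Run, path: str = "eval_runs.csv") -> None:
    """Append one evaluation row, writing a header on first use."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fl.name for fl in fields(Run)])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(asdict(run))

append_run(Run("T1-bug", "A", True, 4, 0.12, 0, 5, "follows conventions"))
```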
Step 4 · Watch the qualitative signals The things numbers can't catch: does it follow conventions, does it add guardrails (tests, types, validation) on its own, when stuck does it honestly say it's stuck or does it emit a plausible falsehood, is its context handling smooth?
Step 5 · Compute the friction cost Are there so many approval gates that the flow keeps breaking? So few that you're anxious? How much time went into setup, configuration, and MCP wiring? The cumulative friction of using the tool every day matters more than a one-time impression.
Step 6 · Decide, then re-evaluate in 3 months This field is fast. There's no guarantee that "best now" is still best six months out. Re-verify briefly every quarter — with the 5-task protocol it's half a day.
Keep the evaluation record a simple table You don't need a grand tool. One spreadsheet does it. Avoid just one common trap — being swayed by first impressions. If tool A finishes the first task dazzlingly, you grade the other four generously. So run all five, then score them all at once. The evaluation-record skeleton is this simple.
| task | tool | first-pass | wall (min) | cost ($) | revisions | diff cleanliness (1-5) | notes |
|---|---|---|---|---|---|---|---|
| T1-bug | A | Y | 4 | 0.12 | 0 | 5 | follows conventions |
| T1-bug | B | N | 9 | 0.21 | 2 | 3 | unrelated changes mixed in |
| ... | | | | | | | |
5 tasks x 3 candidates = 15 rows. Fill them all and the pattern becomes visible — which tool is strong at which type. Don't only look at the average; look at the variance too. A tool with a good average that occasionally misses badly isn't trustworthy.
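Once the 15 rows exist, the mean-and-variance read is a few more lines (this assumes the eval_runs.csv written by the earlier sketch; diff_clean stands in for whichever score you care about most):

```python
import csv
from collections import defaultdict
from statistics import mean, pstdev

scores: dict[str, list[int]] = defaultdict(list)
with open("eval_runs.csv", newline="") as f:
    for row in csv.DictReader(f):
        scores[row["tool"]].append(int(row["diff_clean"]))

for tool, xs in sorted(scores.items()):
    # A good mean with a large spread = occasionally misses badly -> distrust.
    print(f"{tool}: mean={mean(xs):.1f} stdev={pstdev(xs):.1f} over {len(xs)} runs")
```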
Someone else's benchmark is about someone else's codebase. Spend half a day measuring on your own repo and you prevent six months of the wrong tool choice.
Epilogue — checklist, anti-patterns, next-post teaser
In spring 2026, the AI coding agent field has consolidated. The six tools exist for different workflows, and there is no "best." There is only the tool that fits the distribution of your work.
Tool-selection checklist (in order)
- Know the distribution of your work first — everyday editing vs. big refactors vs. async tickets; write down the ratios.
- Decide the surface — CLI / IDE / cloud / messenger, whichever fits the workflow.
- Set the autonomy level you need — is completion enough, or do you need a full agent?
- Look at your context needs — do you have a lot of work that needs to see a large repo whole?
- Weigh the need for an MCP/tool ecosystem — do you have to attach internal tools?
- Understand the pricing model — flat / token / seat, and estimate the real heavy-user cost.
- Check the sandbox/permission model — for a team, governance (SSO, audit logs) too.
- Narrow to 2-3 candidates — the table is a narrowing tool, not a decision tool.
- Verify on your own codebase with the 5-task protocol — quantitative plus qualitative.
- Decide, and re-evaluate half a day every quarter — this field is fast.
Anti-patterns (do not do these)
- Deciding from benchmarks and leaderboards alone — that's about someone else's codebase. Measure on your own repo.
- Looking at the surface price and relaxing because it's flat — pricing is moving to token/usage based. Estimate the real heavy-user cost.
- Making a religion of one tool — a different tool is better for everyday editing vs. multi-file work. Use a combination.
- Granting full permissions on a codebase you don't know — before trust is built, tighten the approval gate.
- Skipping convention injection — run it without project memory like CLAUDE.md or .aider.conf.yml and the agent doesn't know your style.
- Trading away traceability for autonomy — the higher the autonomy, the more you reinforce traceability with Git commits, diff review, and a sandbox.
- Choosing once and never looking again — skip the quarterly re-evaluation and six months later you're on an outdated tool.
- Ignoring setup friction — daily cumulative friction matters more than a one-time impression.
Next-post teaser
The next post covers the step after tool selection — agent workflow engineering. Once you've picked a tool, the next thing is using it well. Designing project memory (CLAUDE.md, rule files), building your own MCP server to attach internal tools, decomposing large work with subagents, and the team process for safely reviewing and merging agent-made PRs. The tool is only the start; the workflow makes the result.