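AI Coding Assistant Limitations 2026: An Honest but Fair Assessment (On the Things That Do Not Work)

Prologue: A time for calibration, not praise

Over the past 18 months, most AI coding content has fallen into one of three modes.

Utopia: "Developers are no longer needed."
Dystopia: "All AI code is garbage."
Marketing: "Our tool automates X% of your work."

This post is none of those. It is about calibration.

AI coding assistants are powerful, but the shape of that power is not uniform. On some tasks the multiplier is 3x to 10x, on some it is 0x, and on some it is negative: the result is slower and more wrong than if a person had done the work directly.

As of May 2026, the top agents on SWE-bench Verified have reached roughly 75%. That is a good number. But if we do not know which tasks make up the remaining 25%, we extend full trust in exactly that 25% and pay for it.

This post covers ten things AI does poorly. For each one:

The actual failure pattern: code, not abstraction.
Why it happens: traced back to the limits of the model architecture and the harness.
What developers must keep doing: the points where human judgment is the multiplier.

It closes with a matrix, a checklist, and a list of anti-patterns. This is not AI skepticism; it is calibration for using AI tools better.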

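Chapter 1 · Deep bug debugging: "Agents write code, but they cannot narrow down a bug"

Pattern

A production service had a bug where P99 latency spiked once every five minutes. The cause turned out to be a storm of TLS session renegotiations, not GC. The agent spent eight hours on GC tuning and, at the end of those eight hours, asked for permission to "look at the GC logs in more detail."

Why it happens

LLM agents are vulnerable to hypothesis tunneling. Once "this looks like a GC problem" enters the context, every subsequent tool call and reasoning step is biased toward confirming that hypothesis. A human debugger stops after an hour or so, asks "maybe this is not it?", and looks at another branch. An agent reinforces its first hypothesis as the context grows thicker, which is the most insidious form of context rot.

On top of that, the agent's tool calls see only local information. Attaching a debugger to inspect a core dump, following a packet capture, reading kernel traces: this kind of work is only partly automated. Bugs that do not reproduce are another agent weak point, because when reproduction is hard the agent simply guesses more.

What humans must keep doing

Do not close the hypothesis space. Force the agent: "This may not be GC. List five other hypotheses."
Design falsifiable experiments. Agents are good at finding "evidence that I am right" and bad at finding "evidence that I am wrong."
The eight-hour rule. If you have been chasing the same hypothesis for more than eight hours, stop. The agent will not stop, so you have to.

Sample dialog

H: "P99 suddenly spikes and I think it's GC. Take a look."
Agent: (1 hour) "Let us tune the G1GC parameters..."
Human: "What other hypotheses could we hold? List five."
Agent: "1) GC, 2) disk I/O storm, 3) network retransmits, 4) TLS renegotiation, 5) compaction work."
Human: "Spend five minutes on each of the four besides GC. Check whether there is evidence."

Without this forced branching the agent never gets out of hypothesis 1. The meta-instruction, "look at the other hypotheses too," is the multiplier.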

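Chapter 2 · Architecture in a large codebase: "Agents see a cross-section, not the whole"

Pattern

You ask the agent to add a feature to a one-million-line monorepo. It builds the feature cleanly under new-feature/. The code works. Then a senior engineer on the same team looks at it and says, "Uh, isn't there already something like this in platform/shared/orchestration?" The agent could not see it, because it never read that directory.

Why it happens

LLMs reason only inside their context window. Even a 1M-token model cannot see a million-line monorepo at once. Agents explore a codebase through name-based search and explicitly mentioned paths, and that misses implicit knowledge.

The especially dangerous case is abstraction conflict. The agent builds a "clean abstraction." But if that abstraction differs from an existing one only in shape while overlapping in intent, the codebase ends up with two similar abstraction layers. Six months later a new hire does not know which one to use, uses both, and eventually builds a third.

What humans must keep doing

Force an explicit "first search for anything similar that already exists" step, in AGENTS.md or in the prompt.
Add an "abstraction review" at the PR stage, where a senior asks "how is this different from the existing X?" AI is not good at asking itself this question.
Keep architecture documents in a form the agent can actually read, not just as prose somewhere: decision records (ADRs), module responsibility matrices. The agent cannot see what it cannot read.

One step further: "make it read the existing code twice"

Here is one trick that has worked well in large codebases. Before handing the agent the task, explicitly require three steps.

1) "Spend five minutes searching for existing modules that could be related to this task, and list them."
2) "Describe how the responsibilities of those modules overlap with the task, and propose three candidate locations for the new code."
3) "Start coding only after a human has confirmed."

In our team's experience this three-step gate cuts abstraction conflicts by 30 to 50%. The agent's default is "start writing code," so you have to tell it explicitly to stop before writing.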

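Chapter 3 · Subtle performance regressions: "Looks the same, runs ten times slower"

Pattern

The agent "refactors" an N+1 query. The code looks cleaner. Every test passes. You deploy to production. P95 latency goes up 10x. It turns out the cleaner-looking code switched to a query pattern that does not hit the index, or something that used to be lazily loaded is now eagerly loaded.

Why it happens

LLMs see the shape of code. They do not observe its runtime cost directly; they can only infer it. And that inference often depends on the data distribution. An N+1 that is invisible at 100 rows kills P95 at one million rows. The agent does not know the distribution of those one million rows.

Agents are also poor at writing micro-benchmarks. JIT warm-up, cache effects, measurement noise: all of these call for human intuition.

What humans must keep doing

Build performance regression tests into CI. The agent does not watch for regressions, so CI has to.
Feed profiler output to the agent directly. Not "take a look at this PR" but "look at this flamegraph and find the regression." Different tools give different capabilities.
Humans own responsibility for the data distribution. The agent knows "it runs fine at 1,000 rows" but not "what happens at 100 million rows."

A small case: the hidden upgrade in algorithmic complexity

We often see PRs where the agent "simplifies" a Set-based dedup into Array.filter + includes. The code is shorter. At 100 items it is even faster (smaller constant). At 100,000 items it is O(n²) and 30x slower. The unit test uses 10 items, so it does not catch this. Production does.

This pattern is dangerous because it is a regression that looks like a refactor. PR reviewers say "that's cleaner" and wave it through. Automated performance regression tests are the only objective safety net.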

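Chapter 4 · Concurrency correctness: "It imitates race-free code by pattern matching"

Pattern

The agent writes a multi-threaded queue. The code uses Mutex and Condvar by the book. In review you say, "Nicely done." Six months later a deadlock shows up under load. Two locks were being acquired in opposite orders in two places, on a path where contention almost never happens.

Why it happens

Concurrent code is not well captured by learning from examples. The model imitates the "common race-free patterns" very well: producer/consumer, reader/writer locks, channel-based message passing. But lock-order consistency, the memory model (Acquire/Release/SeqCst), and reentrancy conditions require global reasoning. The model sees slices.

Tests do not catch these problems either. Race conditions are inherently non-deterministic, and unit tests mostly exercise deterministic cases. You need tools such as ThreadSanitizer, loom, and Jepsen, and agents do not handle those tools fluently.

What humans must keep doing

Concurrent code always gets a second pair of human eyes. Put a "concurrency: review needed" label on the PR.
Use model checkers and proof tools: TLA+, loom, Coq. Real guarantees about concurrency correctness still come from formal verification.
Tell the agent explicitly to "work through the interleavings." It will not do so unless asked.

Enforcing lock order with a one-line comment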

// LOCK_ORDER: always (config -> session) — see SECURITY.md L42
let _c = config.lock();
let _s = session.lock();

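An explicit comment like this becomes the tool that enforces the same order in both places. Agents follow "context-marker comments" well. Writing the comment, however, is the human's job.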

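Chapter 5 · Judgment under ambiguity: "Which trade-off does the team prefer?"

Pattern

"Add error handling to this API." The agent adds clean Result-based error handling. But this team's convention was panic-on-invariant-violation. Another team's convention was to send errors through OpenTelemetry. The agent does not know the "right" answer, because the answer depends on the team's choice.

Why it happens

A large part of software engineering is taste and trade-offs. "Which paradigm should we use?" "Is this the right level of abstraction?" "Is 50ms of latency worth trading for 30% better readability?" These questions need information from outside the context: the team's values, the company's priorities, the users' pain points.

The agent guesses. It guesses well. But a guess is still a guess.

What humans must keep doing

Write a "preferences" document. Put statements such as "error handling uses Result, not panic," "logging is structured," and "async is channel-based" in AGENTS.md.
Have the agent ask which trade-offs are in play. When things are ambiguous, make it ask. If it does not ask, it guesses.
Check "matches team convention" explicitly at the PR stage. Do not leave it to automation.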

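Chapter 6 · Genuinely new technology: "Training-cutoff bias; even with new docs it reverts to old patterns"

Pattern

Take a library released in early 2026 (say Effect-TS 3.x, the new concurrent features in React 19.5, or PPR stabilization in Next.js 16) and put its new documentation into the context wholesale. At first the agent uses the new API well. Then, about 200 tokens in, it reverts to 2024 patterns: it layers useEffect patterns on top of the new concurrency hooks, or it cannot recall the new API's signature and calls it with the old one.

Why it happens

LLMs are strongly influenced by how often they saw something in training data. The new library's documentation is only tens of thousands of tokens, while the old library is baked deep into the model weights. Information in the context cannot fully override the prior in the weights.

RAG does not solve this well either. Even when retrieval supplies the new docs, the model does not follow the new pattern consistently; it produces a mixed hallucination of new signature plus old pattern.

What humans must keep doing

Write the new library by hand once. Learn the pattern yourself before handing it to the agent. Then you can spot the agent's regressions.
Put "version enforcement" assertions around agent code: lint rules, type checks. Make it so that "calling the old API is a compile error."
Correct the agent explicitly every time it reverts to an old pattern. Repeated correction within one session does work.

Verified regression signals

While working with a new library, suspect a regression if you see any of the following:

Import paths quietly drift back to the old path (react rather than react/jsx-runtime).
The mental model of an old hook shows up inside a new hook.
Error messages do not match the new version's signatures (which means the try/catch assumes the old API).

If you see even one of these three, start a new session and inject the new docs with stronger emphasis.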

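Chapter 7 · Multi-repo cross-cutting changes: "Still hard"

Pattern

The same change (for example, deprecated SDK 1.x to 2.x) has to be made across 40 microservices. The agent handles the first repo well. The second one too. By the fifth repo it fails to catch the subtle differences each repo carries (different test runners, different build systems, different dependency versions). It opens 40 PRs, and 12 of them go up with a broken build.

Why it happens

Each repo has its own hidden context. A multi-repo change is fundamentally about handling N contexts, but the LLM has one context window. The agent tries to generalize a decision made in one repo to the others, and it fails where that generalization does not hold.

A multi-repo system is also a dependency graph. To change A you must first deploy B, and for that C has to be compatible. This kind of graph reasoning does not work well inside a single context.

What humans must keep doing

Settle the pattern in one repo, then extend it to the others. The first five are done by a person directly; the rest are agent plus human review.
Turn the shared change into a codemod. AST transformation plus AI review is more stable than pure AI.
Humans own the deployment order. Agents are weak at dependency-graph reasoning.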

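Chapter 8 · Security-critical code: "Looks right, but has a subtle flaw"

Pattern

The agent writes a password hashing function. It uses bcrypt and takes the salt from random. It looks textbook. But the comparison is vulnerable to timing attacks (it uses == where it should use constant_time_eq), the bcrypt cost factor (salt rounds) is 8 (10 to 12 is recommended as of 2026), and the error message echoes the username, which leaks which accounts exist. All of these are subtle and can be missed in code review.

Why it happens

Security is the discipline of negative space: "if something that should be there is missing, that is a security flaw." LLMs look at "what is there" and reason from it. Catching "what is not there" requires a security mental model. The agent's mental model is average, not expert-level.

Security vulnerabilities are also discovered slowly. They become visible only when a breach happens six months later. By then the agent is working on something else.

What humans must keep doing

Security-critical code requires review by a security expert. AI review is a supplement.
Enforce automated tools (semgrep, bandit, gosec, cargo-audit). They are more deterministic than AI.
Add an explicit "security risk" section to AGENTS.md. State that "code in this directory always gets human security review."

Commonly missed negative-space list

These are the items we check when security-reviewing agent-written code:

Is a timing-safe comparison used?
Can a secret key end up in the logs?
Does the error response create a timing oracle?
Is rate limiting applied on every path?
Does input validation run right after sanitization, rather than right before it?
Are CSRF/CORS configured explicitly (rather than relying on defaults)?

Put these six items in the PR template and you catch 80% of security regressions. The remaining 20% is the domain of expert review.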

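Chapter 9 · Legacy code: "The code does not 'say' what the code 'knows'"

Pattern

A function in a 15-year-old monolith looks odd. One variable is named magic_offset_42. The agent decides "this needs cleaning up" and refactors it. You deploy. A month later an old data migration pipeline dies. It turns out magic_offset_42 was an off-by-one correction that came out of a database migration incident in 2012. The code never wrote that down. But the code knew it.

Why it happens

Legacy code is compressed implicit knowledge. The reason the code has the shape it has is often not written in the code itself. It lives somewhere in a Slack channel, in the head of an engineer who has left, or in an incident report from five years ago. The LLM sees only the code.

RAG does not solve this either. Even if the incident report is retrieved, the model has to work out that the report is connected to this particular line of code, and that connection often does not happen.

What humans must keep doing

Put "do not refactor" markers on legacy code, as comments or in a separate file.
Force git blame and the change history into the agent's context. "To understand why this line was written this way, read the commit history."
A human always looks at legacy code changes once. Automated refactoring is dangerous.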

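Chapter 10 · Prompt drift in long sessions: "The agent slowly loses the thread"

Pattern

An eight-hour agent session. The first hour is flawless. Around hour four the code style starts to get strange: the variable naming convention shifts a little, the error-handling pattern is slightly different, and comments gradually thin out. By about hour seven the agent behaves as if it has forgotten the initial instructions.

Why it happens

The longer the context gets, the more weight the model gives to recent tool results over the instructions at the start. This is, in essence, attention dilution. If automatic context compression (summarization) is in use, subtle information also disappears during compression. As a result the model slowly forgets who it is.

Model size does not fully solve this. Even with 200K, 1M, or 10M context windows, relative attention stays the same.

What humans must keep doing

Keep sessions short. Split large work into multiple sessions, each with a clear input and output.
Re-inject AGENTS.md every time. Even when there is a system prompt, restate the important conventions explicitly.
When you see drift, start a new session. Do not cling to the old one. A context reset is the most powerful tool you have.

A one-line heuristic for measuring drift

At the start of the session, ask "state the three core rules of this task, one line each," and note the answer. Four hours later, ask the same question again. If the answer has changed, that is drift. If the answer is the same but the code style has changed, that is partial drift. In either case, start a new session.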

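Chapter 11 · So what is AI good at? (the control group)

This post is not a rejection of AI. To be fair, it has to name the areas where AI is clearly a multiplier.

Tasks where AI shines

Bootstrapping: new project scaffolds, boilerplate, the first 100 lines of a familiar pattern.
Well-defined single tasks: "convert this function to TypeScript," "turn this SQL into ORM code," "add the missing cases to this test."
Reading, summarizing, explaining: summarizing a 1000-line file in 200 lines, finding the real error in a build log, explaining the intent of a PR change.
Repetitive (well-defined) migrations: changes that can be expressed as a consistent codemod.
Writing tests: especially unit tests with a clear input/output shape.
Generating documentation: docstrings from code, README drafts, API specs.
Familiar debugging: recognizing patterns such as null pointers, off-by-one errors, and common missing transactions.

In these areas AI is a 3x to 10x multiplier. There is no reason to deny it.

Areas where AI is faster than humans

Typing itself: AI always out-types a human at the keyboard.
Memorizing API surface: the time a human spends digging through manuals drops to zero.
Format conversion: JSON ↔ YAML, schema ↔ TypeScript, REST ↔ GraphQL.
Translation: both natural languages and programming languages.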

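Chapter 12 · Matrix: "where AI shines" vs "where humans multiply"

Task type	AI alone	AI + human (human multiplier)
Boilerplate, scaffolds	Strong	Final five minutes of human review
Writing well-defined functions	Strong	Human defines the function signature
Unit tests	Strong	Human suggests edge case candidates
Code summary and explanation	Strong	Fine without a human
Familiar debugging	Good	Human forces hypothesis diversity
Deep bug debugging	Weak	Human must break hypothesis tunneling
Small refactors	Good	Human bounds the scope
Large-scale architecture	Weak	Human owns the big picture
Catching performance regressions	Weak	Human interprets profiling results
Concurrent code	Risky	Human + formal verification
Security code	Risky	Human security expert + automated tools
Legacy changes	Risky	Human must supply the implicit assumptions
New libraries	Regression risk	Human writes the pattern first, then delegates
Multi-repo changes	Consistency breaks	Human writes the codemod + reviews
Ambiguous trade-offs	Guesses	Human must supply the team's preferences
Long-session work	Drift	Split sessions (human decision)

This matrix is the message. The question is not "should I use AI or not" but "where in this task is the human the multiplier."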

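Chapter 13 · Anti-patterns: mistakes we see often

Anti-pattern 1: "The agent has been chasing one hypothesis for eight hours, and you just leave it"

→ Stop and doubt the hypothesis. The first rule of a human debugger.

Anti-pattern 2: "Nobody looks at the security code the AI wrote"

→ Security is negative space. Automated tools plus expert review are non-negotiable.

Anti-pattern 3: "A one-line instruction to 'add feature X' in a large codebase"

→ The agent sees a cross-section. Force it to search for anything similar first.

Anti-pattern 4: "Trusting the performance of agent changes without CI"

→ Regressions have to be caught by deterministic measurement. AI measurement is a supplement.

Anti-pattern 5: "Letting a long session run on"

→ Past four hours, assume drift. Split the session.

Anti-pattern 6: "Trusting RAG alone for a new library"

→ RAG does not override the weight prior. Write it yourself once, then delegate.

Anti-pattern 7: "Delegating a legacy refactor to the agent wholesale"

→ Legacy code is implicit knowledge. Human review is mandatory.

Anti-pattern 8: "Verifying agent-written concurrent code with unit tests only"

→ Races are not caught by unit tests. Use ThreadSanitizer and loom.

Anti-pattern 9: "Not writing team conventions into AGENTS.md"

→ When the agent does not know, it guesses. Making things explicit is the multiplier.

Anti-pattern 10: "Handling AI output with automatic PR merges"

→ Human review is the last safety net. Auto-merge makes trust too expensive.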

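Chapter 14 · Developer checklist (a calibration tool)

Before starting the task:

How many of the ten failure modes above apply to this task?
At which stages is human review needed? (PR? Design? Before merge?)
Are the automated safety nets (CI, lint, SAST) in place?
Are the team conventions written down in AGENTS.md?
If a new library or API is involved, have you written the pattern yourself once?
If it is a multi-repo change, are you doing the first one or two yourself?
Does it include security-, concurrency-, or performance-critical parts? Is an expert review scheduled?

During the task:

Have you been chasing the same hypothesis for more than four hours?
Has the session run longer than four hours? Check for drift.
Is the agent reverting to "old API patterns"?

After the task:

Has a human read the result once?
Did the performance regression tests and security scans run?
If multi-repo, did the build pass in every repo?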

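Epilogue: the developer as multiplier

To say it again, this post is not AI skepticism.

AI coding assistants are a real multiplier. But a multiplier only means something when it is applied to a non-zero value. That "non-zero value" is your judgment.

If your judgment is 0, that is, if you hand 100% of the work to the agent and do not review the result, then any multiplier times 0 is 0. Sometimes it is negative. (Cleaning up code that an agent got wrong in complicated ways takes longer than writing it from scratch.)

If your judgment is 1, that is, if you know the ten failure modes above, install the right safety nets, break hypothesis tunneling, split sessions, and hand security, concurrency, and performance to people where appropriate, then AI finally becomes a real multiplier.

The most important skill for a developer in mid-2026 is not the ability to doubt AI but the ability to estimate AI's confidence interval accurately.

The capabilities of AI coding assistants are improving explosively. But some of the ten failure modes above will not be solved by model size. Hypothesis tunneling, implicit knowledge, team preferences, the negative space of security: these need human context. For a while that context will exist only in people's heads.

Your work does not shrink. It changes shape. Time spent typing code goes down, and time spent deciding what to trust goes up. That is the shape of the developer as multiplier.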

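Coming up next

"PR review of AI code: seven signals human reviewers should look for": where AI code is suspicious, which patterns lead to "passed review, then caused an incident," and a PR review checklist.
"How to write AGENTS.md: teaching team conventions to the agent": real working AGENTS.md examples, which information helps and which is noise.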

References

Benchmarks and evaluation

SWE-bench Verified leaderboard: https://www.swebench.com/ (tracks the ceiling of agent performance)
OSWorld benchmark: https://os-world.github.io/ (limits of tasks in desktop environments)
MLE-bench: https://github.com/openai/mle-bench (agent evaluation on ML engineering tasks)
HumanEval and LiveCodeBench: the gap between simple function writing and real engineering ability

Context rot, hypothesis tunneling, attention dilution

"Lost in the Middle": Liu et al., 2023, the paper that first quantified attention dilution in long contexts
"Context Rot": Anthropic, analysis of long-context degradation in the Claude 4 series
"Confirmation Bias in LLM-based Agents": a pattern reported repeatedly in recent evaluation studies

LLM limits in concurrency and security

"LLMs and Concurrency: A Survey": 2025 study, on why formal verification is still needed
"Security Implications of Code Generated by LLMs": Pearce et al., 2025 update, frequency across common CWEs

Practical guides and opinion

Simon Willison's blog: https://simonwillison.net/ (a model of the calibrated take)
"When AI Coding Assistants Fail": a collection of failure cases from various engineering blogs
Anthropic's "Best practices for agentic coding": official guide that states failure modes explicitly

Tools

Claude Code, Cursor, Codex CLI: referenced in the body
semgrep, gosec, bandit: security automation tools (to supplement AI)
ThreadSanitizer, loom: concurrency verification tools
TLA+: formal verification

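TL;DR: AI coding assistants are powerful but not uniformly so. Deep debugging, architecture, performance regressions, concurrency, security, legacy code, ambiguity, new technology, multi-repo changes, and long sessions: in these ten areas human judgment is still the multiplier. Do not doubt AI. Estimate its confidence interval accurately. That is the most valuable skill for a developer in mid-2026.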

AI Coding Assistant Limitations 2026 — An Honest but Fair Take on What Doesn't Work

Prologue — This is the moment for calibration, not cheerleading

Most AI-coding content over the past 18 months has been in one of three modes:

Utopia — "Developers are obsolete now."
Dystopia — "All AI code is trash."
Marketing — "Our tool automates X% of your work."

This post is none of those. This post is about calibration.

AI coding assistants are powerful. But the shape of that power is not uniform. On some tasks the multiplier is 3x to 10x. On some it is 0x. On some it is negative — slower and worse than a human doing it alone.

As of May 2026, the top agents on SWE-bench Verified sit around 75%. That is a strong number. But if you do not know which tasks make up the failing 25%, you end up extending full trust in exactly that 25%, and you pay for it.

This post covers the ten things AI does not do well. For each:

Real failure pattern — code, not abstraction.
Why it happens — reduced to model architecture and harness limits.
What the developer should keep doing — where human judgment is a multiplier.

At the end you get a matrix, a checklist, and an anti-pattern list. Not AI skepticism — calibration so you can use AI tools better.

Chapter 1 · Deep bug debugging — "Agents write code well; they do not narrow bugs well"

Pattern

A production service had a P99 latency spike every five minutes. Root cause was not GC — it was TLS session renegotiation storms. The agent burned eight hours on GC tuning. At hour eight it asked for permissions to "look at GC logs in more detail."

Why it happens

LLM agents are vulnerable to hypothesis tunneling. Once "this looks like a GC issue" enters the context, every subsequent tool call and reasoning step is biased toward confirming that hypothesis. A human debugger pauses around the one-hour mark and asks, "what if it is not this?" An agent keeps reinforcing the initial hypothesis as the context thickens — the most insidious form of context rot.

Also, the agent's tool calls see local information. Loading a debugger, inspecting a core dump, tailing packet captures, reading kernel traces — only some of this is well-automated. And bugs that do not reproduce are an agent weak point: when reproduction is hard, the agent multiplies its guessing rather than narrowing.

What humans should keep doing

Do not close hypotheses. Force the agent to "list five alternative hypotheses; this might not be GC."
Design falsifiable experiments. Agents are good at finding "evidence I am right" and bad at finding "evidence I am wrong."
The eight-hour rule. If you have been chasing the same hypothesis for eight hours, stop. The agent will not stop — you have to.

Sample dialog

Human: "P99 is spiking, looks like GC. Take a look."
Agent: (1 hour) "Let us tune the G1GC parameters..."
Human: "What other hypotheses could fit? Enumerate five."
Agent: "1) GC, 2) disk I/O storm, 3) network retransmits, 4) TLS renegotiation, 5) compaction work."
Human: "Spend five minutes on each of 2 through 5. Is there evidence?"

Without this forced branching, the agent never escapes hypothesis 1. The meta-instruction — "look at the other hypotheses" — is the multiplier.
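The forced branching above can also be tracked mechanically. The following is a minimal sketch, assuming you log the minutes spent per hypothesis by hand or from session notes; `HypothesisLog`, its methods, and the idea of a minute budget are illustrative names, not part of any agent tool.

```rust
use std::collections::BTreeMap;

/// Minutes spent per hypothesis in one debugging session (hypothetical helper).
#[derive(Default)]
struct HypothesisLog {
    minutes: BTreeMap<String, u32>,
}

impl HypothesisLog {
    /// Registers a hypothesis without spending time on it yet.
    fn propose(&mut self, name: &str) {
        self.minutes.entry(name.to_string()).or_insert(0);
    }

    /// Adds investigation time to a hypothesis.
    fn spend(&mut self, name: &str, minutes: u32) {
        *self.minutes.entry(name.to_string()).or_insert(0) += minutes;
    }

    /// Hypotheses that have received no attention so far.
    fn unexplored(&self) -> Vec<&str> {
        self.minutes
            .iter()
            .filter(|(_, m)| **m == 0)
            .map(|(name, _)| name.as_str())
            .collect()
    }

    /// True when one hypothesis has used up the budget while others sit untouched.
    fn is_tunneling(&self, budget_minutes: u32) -> bool {
        let over_budget = self.minutes.values().any(|m| *m >= budget_minutes);
        over_budget && !self.unexplored().is_empty()
    }
}
```

With the five hypotheses from the dialog proposed and 60 minutes logged against GC only, `is_tunneling(60)` reports true; once each alternative has had its five minutes it reports false. The 60-minute budget is an assumption borrowed from the "one-hour mark" at which a human debugger tends to pause.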

Chapter 2 · Architecture in a large codebase — "Agents see slices, not the whole"

Pattern

In a one-million-line monorepo you ask the agent to add a feature. The agent cleanly builds it under new-feature/. Code works. A senior on the same team reads it and goes, "uh, there is already something similar under platform/shared/orchestration." The agent could not see it — it never read that directory.

Why it happens

LLMs reason inside their context window. A 1M-token model still cannot fit a million-line monorepo. Agents navigate by name-based search and mentioned paths — these miss implicit knowledge about which code is canonical.

The most dangerous variant is abstraction conflict. The agent makes a "clean abstraction." But if the new abstraction overlaps in intent with an existing one and differs only in shape, the codebase ends up with two similar abstraction layers. Six months later a new hire does not know which to use, uses both, and ultimately creates a third.

What humans should keep doing

Force a "search for existing similar" step. Put it in AGENTS.md or in the prompt.
An "abstraction review" stage in PR — a senior asks, "how is this different from existing X?" Agents do not ask this of themselves.
Store architecture docs in agent-readable form. ADRs, module responsibility matrices. Agents do not see what they cannot read.

One step further — make it read existing code twice

A trick that has worked well in big monorepos: before giving the agent the actual task, force three explicit steps.

1) "Spend five minutes searching for existing modules that could be related to this task. Enumerate them."
2) "For each module, describe how its responsibility overlaps with the task. Propose three candidate locations for the new code."
3) "Wait for human confirmation before writing code."

This three-step gate reduces abstraction conflicts by 30 to 50% in our team's experience. The agent's default is "start writing code"; you have to explicitly tell it to stop and think first.
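If the gate lives in a harness rather than in a prompt, it can be modeled as a small state machine that refuses to reach the coding stage without a human approval event. This is a sketch under assumptions: the `Gate` and `Event` types and the `advance` function are hypothetical, and the requirement of at least three candidate locations simply mirrors step 2 above.

```rust
/// Stages of the pre-coding gate described above (hypothetical types).
#[derive(Debug, Clone, PartialEq)]
enum Gate {
    Searching,
    Proposing { related_modules: Vec<String> },
    AwaitingApproval { candidate_locations: Vec<String> },
    Coding { location: String },
}

/// Events that may move the gate forward. Only a human should emit `Approve`.
enum Event {
    ModulesListed(Vec<String>),
    LocationsProposed(Vec<String>),
    Approve(usize), // index into the proposed locations
}

/// Advances the gate, rejecting any event that skips a stage.
fn advance(gate: Gate, event: Event) -> Result<Gate, String> {
    match (gate, event) {
        (Gate::Searching, Event::ModulesListed(modules)) => {
            Ok(Gate::Proposing { related_modules: modules })
        }
        (Gate::Proposing { .. }, Event::LocationsProposed(candidates))
            if candidates.len() >= 3 =>
        {
            Ok(Gate::AwaitingApproval { candidate_locations: candidates })
        }
        (Gate::AwaitingApproval { candidate_locations }, Event::Approve(index)) => {
            candidate_locations
                .get(index)
                .cloned()
                .map(|location| Gate::Coding { location })
                .ok_or_else(|| "approved index is out of range".to_string())
        }
        (gate, _) => Err(format!("event not allowed in stage {gate:?}")),
    }
}
```

The point of the design is that there is no transition from `Searching` or `Proposing` straight to `Coding`; the only path goes through `AwaitingApproval`.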

Chapter 3 · Subtle performance regressions — "Looks the same, ten times slower"

Pattern

The agent "refactors" an N+1 query. The code looks cleaner. All tests pass. You ship. P95 latency goes up 10x. Turns out the clean-looking code now uses a query pattern that misses the index, or lazy-load became eager-load.

Why it happens

LLMs see the shape of code. The runtime cost is something they infer, not observe. And that inference often depends on the data distribution — an N+1 invisible at 100 rows kills P95 at 1M rows. The agent does not know what your distribution looks like.

Agents also write poor micro-benchmarks. JIT warm-up, cache effects, measurement noise — all need human intuition.

What humans should keep doing

Bake performance regression tests into CI. The agent will not check regressions — CI has to.
Feed profiler results directly to the agent. "Look at this PR" is not the same as "look at this flamegraph and find the regression." Tool changes capability.
Owning the data distribution is a human job. The agent knows "works at 1k rows," not "works at 100M rows."

Small case — algorithmic complexity's hidden upgrade

A common PR: the agent "simplifies" a Set-based dedup into Array.filter + includes. The code is shorter. At 100 items it is even faster (smaller constant). At 100k items it is O(n²) and 30x slower. The unit test has 10 items and does not catch it. Production catches it.

This pattern is dangerous because it looks like a refactor but is a regression. PR reviewers go, "ooh cleaner," and merge. Only automated performance regression tests give you an objective safety net.
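To make the hidden complexity change concrete, here is a Rust analogue of the two dedup styles. It is a sketch, not the JavaScript from the PRs described above: `dedup_quadratic` stands in for filter + includes, `dedup_hashed` for the Set-based version, and the comparison counter is an illustrative addition so the growth can be checked deterministically instead of by wall-clock time.

```rust
use std::collections::HashSet;

/// Quadratic dedup: every item is checked against everything kept so far.
/// Returns the deduplicated items plus the number of equality checks made.
fn dedup_quadratic(items: &[u32]) -> (Vec<u32>, u64) {
    let mut kept: Vec<u32> = Vec::new();
    let mut comparisons: u64 = 0;
    for &item in items {
        let mut seen = false;
        for &existing in &kept {
            comparisons += 1;
            if existing == item {
                seen = true;
                break;
            }
        }
        if !seen {
            kept.push(item);
        }
    }
    (kept, comparisons)
}

/// Linear dedup: one hash lookup per item, first occurrence wins.
fn dedup_hashed(items: &[u32]) -> Vec<u32> {
    let mut seen = HashSet::new();
    items
        .iter()
        .copied()
        .filter(|item| seen.insert(*item))
        .collect()
}
```

Both functions return the same result, which is why a small unit test cannot tell them apart. For 1,000 distinct items the quadratic version already performs 499,500 comparisons (n(n-1)/2), and that count grows a hundredfold for every tenfold increase in input. A CI check on an operation count or a time budget at realistic sizes is what exposes the difference.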

Chapter 4 · Concurrency correctness — "Pattern-matches race-free idioms but misses the subtle data race"

Pattern

The agent writes a multi-threaded queue. The code uses Mutex and Condvar canonically. Reviewing it you go, "looks fine." Six months later, under load, you get a deadlock. Two locks were acquired in opposite orders in two places — on a path with rare contention.

Why it happens

Concurrency does not yield to case-based learning. The model imitates "common race-free patterns" very well — producer/consumer, reader/writer lock, channel-based message passing. But lock-order consistency, memory model semantics (Acquire/Release/SeqCst), reentrancy conditions — these require global reasoning. The model sees slices.

And tests do not catch them. Race conditions are non-deterministic; unit tests usually cover the deterministic cases. You need ThreadSanitizer, loom, jepsen — and agents are not fluent operators of those tools.

What humans should keep doing

Concurrency code always gets a second pair of human eyes. Label PRs "concurrency: review needed."
Use model checkers and proof tools: TLA+, loom, Coq. The only real guarantees for concurrency correctness still come from formal verification.
Tell the agent explicitly to "enumerate interleavings." It will not do it unless asked.

A one-line comment to enforce lock order

// LOCK_ORDER: always (config -> session) — see SECURITY.md L42
let _c = config.lock();
let _s = session.lock();

That explicit comment is the tool that forces the same order in both places. Agents respect "context-marker comments" well. The humans, however, have to write the comment.
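A comment can still be ignored, so one complementary option is to make the order structural: route every double acquisition through a single helper. The sketch below assumes `std::sync::Mutex`; the `Config` and `Session` types, `lock_both`, and `open_sessions` are hypothetical names for illustration.

```rust
use std::sync::{Arc, Mutex, MutexGuard};
use std::thread;

// Hypothetical state types, for illustration only.
struct Config {
    max_sessions: u32,
}
struct Session {
    active: u32,
}

/// The only place where both locks are taken.
/// LOCK_ORDER: always (config -> session).
fn lock_both<'a>(
    config: &'a Mutex<Config>,
    session: &'a Mutex<Session>,
) -> (MutexGuard<'a, Config>, MutexGuard<'a, Session>) {
    let c = config.lock().unwrap();
    let s = session.lock().unwrap();
    (c, s)
}

/// Each worker tries to open `attempts` sessions; returns the final count.
fn open_sessions(workers: u32, attempts: u32, max_sessions: u32) -> u32 {
    let config = Arc::new(Mutex::new(Config { max_sessions }));
    let session = Arc::new(Mutex::new(Session { active: 0 }));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let config = Arc::clone(&config);
            let session = Arc::clone(&session);
            thread::spawn(move || {
                for _ in 0..attempts {
                    // Both guards come from the helper, so the order cannot differ.
                    let (c, mut s) = lock_both(&config, &session);
                    if s.active < c.max_sessions {
                        s.active += 1;
                    }
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let active = session.lock().unwrap().active;
    active
}
```

A reviewer, or a lint that flags direct `session.lock()` calls outside the helper, then has one function to audit instead of every call site. This does not replace loom or ThreadSanitizer runs; it only removes one class of ordering mistake.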

Chapter 5 · Judgment under ambiguity — "Which trade-off does the team prefer here"

Pattern

"Add error handling to this API." The agent adds clean Result-based error handling. The team's convention was panic-on-invariant-violation. Another team's convention was to emit errors via OpenTelemetry. The agent does not know the "right" answer — the answer depends on team choice.

Why it happens

Most of software engineering is taste and trade-offs. "Which paradigm here," "what is the right level of abstraction," "is 50ms of latency worth 30% better readability." These need information outside the context — team values, company priorities, user pain points.

The agent guesses. It makes good guesses. But a guess is a guess.

What humans should keep doing

Write a "preferences" doc. Put "errors are Results not panics," "logging is structured," "async is channel-based" in AGENTS.md.
Make the agent ask about trade-offs. When the answer is ambiguous, the agent should request clarification. Otherwise it guesses.
Make "matches team convention" an explicit PR review step. Do not automate away taste.

Chapter 6 · Genuinely new tech — "Training cutoff bias; even with new docs, it reverts to old patterns"

Pattern

Take a library released in early 2026 (say, Effect-TS 3.x, new concurrent features in React 19.5, PPR stabilization in Next.js 16). You dump the full new docs into context. The agent uses the new API correctly for the first few hundred tokens. Then it reverts to 2024 patterns: layering useEffect patterns on top of the new concurrency hooks, or calling new APIs with old signatures because it cannot remember the new ones.

Why it happens

LLMs are strongly influenced by frequency in training data. The new library's docs are tens of thousands of tokens; the old library is deeply baked into the weights. Context cannot fully overwrite the weight prior.

RAG does not fix this. Even with the new docs retrieved, the model produces a mixed hallucination — new signature plus old pattern.

What humans should keep doing

Write code in the new library by hand once. Learn the pattern yourself before delegating. Then you can recognize regressions.
Bake "version enforcement" into the toolchain. Lint rules, type checks. "Old API call is a compile error."
Correct the agent every time it reverts. Repeated correction within a session works.
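As one way to get "old API call is a compile error," here is a sketch in Rust using the built-in `deprecated` lint. The SDK functions `connect` and `connect_v2` are hypothetical; only the `#[deprecated]` attribute and the `deprecated` lint level are real language features.

```rust
/// Old 1.x entry point of a hypothetical SDK, kept so callers get a clear message.
#[allow(dead_code)]
#[deprecated(since = "2.0.0", note = "use `connect_v2` with an explicit timeout")]
fn connect(host: &str) -> String {
    connect_v2(host, 30)
}

/// New 2.x entry point: the timeout is now mandatory.
fn connect_v2(host: &str, timeout_secs: u32) -> String {
    format!("{host} (timeout {timeout_secs}s)")
}

/// Inside this function any use of a deprecated item is a hard error.
/// Putting `#![deny(deprecated)]` at the crate root would extend that to all code.
#[deny(deprecated)]
fn open_default() -> String {
    // connect("db.internal") // would fail to compile: use of deprecated function
    connect_v2("db.internal", 30)
}
```

With this in place, an agent that drifts back to the 1.x call gets a compiler error containing the migration note, which is a much stronger correction signal than a reviewer comment.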

Verified regression signals

While working with a new library, suspect regression if you see any of:

import paths quietly returning to the old path (react instead of react/jsx-runtime).
new hooks used with the mental model of old hooks.
error messages that do not match the new version's signatures (meaning the try/catch assumes the old API).

If you see any one of these, start a new session and reinject the new docs with stronger emphasis.

Chapter 7 · Multi-repo cross-cutting changes — "Still hard"

Pattern

Forty microservices need the same change (e.g. deprecated SDK 1.x to 2.x). The agent does repo one well. Repo two well. Repo five — it misses the subtle differences each repo has (test runner differences, build system differences, dependency versions). It opens 40 PRs; 12 of them are broken at build time.

Why it happens

Each repo carries hidden context. A multi-repo change is fundamentally about N contexts, but the LLM context window is one. The agent generalizes a decision from repo A to repo B — and fails wherever that generalization breaks.

Multi-repo work is also a dependency graph. To change A you have to deploy B first, which depends on C being compatible. Graph reasoning across a single context is brittle.

What humans should keep doing

Settle the pattern in one repo first, then scale. The first five repos are human-driven; the rest are agent plus human review.
Encode shared changes as codemods. AST transforms plus AI review beat pure AI.
Own the rollout order. Agents are weak at dependency-graph reasoning.
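Owning the rollout order is easier when the dependency graph is written down and sorted by a deterministic program instead of being reasoned about in a prompt. Below is a minimal sketch using Kahn's algorithm; the function name `rollout_order` and the service names in the usage note are made up for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns a deploy order in which every service comes after the services it
/// depends on, or `None` if the graph contains a cycle.
/// `deps` maps a service to the services that must be deployed before it.
fn rollout_order<'a>(deps: &BTreeMap<&'a str, Vec<&'a str>>) -> Option<Vec<&'a str>> {
    // Count unmet prerequisites per service and index the reverse edges.
    let mut pending: BTreeMap<&'a str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for (&service, prereqs) in deps {
        pending.entry(service).or_insert(0);
        for &prereq in prereqs {
            pending.entry(prereq).or_insert(0);
            *pending.entry(service).or_insert(0) += 1;
            dependents.entry(prereq).or_default().push(service);
        }
    }

    // Kahn's algorithm; the BTreeSet keeps the output deterministic.
    let mut ready: BTreeSet<&'a str> = pending
        .iter()
        .filter(|(_, count)| **count == 0)
        .map(|(service, _)| *service)
        .collect();
    let mut order = Vec::new();
    while let Some(next) = ready.pop_first() {
        order.push(next);
        for &dependent in dependents.get(next).into_iter().flatten() {
            let count = pending.get_mut(dependent).expect("every dependent is tracked");
            *count -= 1;
            if *count == 0 {
                ready.insert(dependent);
            }
        }
    }

    // Any service left with unmet prerequisites means there was a cycle.
    if order.len() == pending.len() {
        Some(order)
    } else {
        None
    }
}
```

For a graph where `api` needs `auth` and `db`, `auth` needs `db`, and `web` needs `api`, the function yields `db`, `auth`, `api`, `web`. A cycle returns `None`, which is the signal for a human to step in and decide how to break it.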

Chapter 8 · Security-critical code — "Looks right; has a subtle flaw"

Pattern

The agent writes a password hashing function. It uses bcrypt and takes salts from random. It looks canonical. But the comparison is vulnerable to timing attacks (it uses == instead of constant_time_eq); the bcrypt cost factor (salt rounds) is 8 (current guidance in 2026 is 10 to 12); and the error message echoes the username, which leaks which accounts exist. All subtle. All easy to miss in code review.

Why it happens

Security is the discipline of negative space — "if something that should be there is missing, that is a flaw." LLMs reason about "what is there." Catching "what is not there" requires a security mental model. The agent's model is average, not expert.

Security vulnerabilities are also slow discoveries — they surface six months later in an incident. By then the agent is doing something else.

What humans should keep doing

Security-critical code requires a security expert review. AI review is supplementary.
Enforce automation — semgrep, bandit, gosec, cargo-audit. More deterministic than AI.
Mark "security risk" sections in AGENTS.md. "Code in this directory always requires human security review."

Commonly missed negative-space checklist

What we check in security reviews of agent-written code:

Is a timing-safe comparison used?
Could a secret end up in logs?
Does the error response create a timing oracle?
Is rate limiting applied to every path?
Does input validation run immediately after sanitization (not before), so that the check sees the value that is actually used?
Are CSRF/CORS explicitly configured (not relying on defaults)?

Bake these six items into the PR template and you catch about 80% of security regressions. The remaining 20% is the domain of expert review.
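The first and third checklist items can be illustrated with a short sketch. It is for explanation only: the function names are hypothetical, a hand-rolled comparison can be undone by compiler optimizations, and real password verification should go through the hashing library's own verify routine or a vetted constant-time crate.

```rust
/// Compares two byte strings without returning early on the first mismatch.
/// Illustrative only; prefer a vetted constant-time implementation in production.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // Length is treated as public; digests of one algorithm share a length anyway.
    if a.len() != b.len() {
        return false;
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences instead of branching on them
    }
    diff == 0
}

/// Hypothetical login check. Unknown users are compared against a dummy digest
/// so both paths do the same work, and every failure returns the same message.
fn check_login(stored: Option<&[u8]>, presented: &[u8]) -> Result<(), &'static str> {
    let dummy = [0u8; 32];
    let (digest, user_exists) = match stored {
        Some(d) => (d, true),
        None => (&dummy[..], false),
    };
    let matches = constant_time_eq(digest, presented);
    if user_exists & matches {
        Ok(())
    } else {
        Err("invalid username or password")
    }
}
```

Two things are absent on purpose, which is the negative-space point: there is no early return on the first differing byte, and there is no error message that names the user.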

Chapter 9 · Legacy code with implicit assumptions — "The code knows something the code does not say"

Pattern

A function in a 15-year-old monolith looks weird. A variable is named magic_offset_42. The agent decides "this needs cleanup" and refactors it. You ship. A month later an old data migration pipeline dies. Turns out magic_offset_42 was an off-by-one correction dating from a 2012 database migration accident. The code did not say so. But the code knew.

Why it happens

Legacy code is compressed implicit knowledge. The reason code looks the way it does is often not in the code — it is in a Slack channel somewhere, in the head of an engineer who left, in an incident report from five years ago. LLMs see the code only.

RAG does not solve this. Even if the incident report is retrieved, the model has to make the connection between "this is the line that needs to keep its weirdness" and the report — that connection often does not fire.

What humans should keep doing

Put "do not refactor" markers on legacy code. Comments, separate files.
Force git blame + history into context. "If you want to know why this line is this way, read the commit history."
Legacy changes always get human review. Automated refactor on legacy is dangerous.
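A "do not refactor" marker is most useful when something checks it. The sketch below assumes a made-up marker convention (`DO-NOT-REFACTOR-BEGIN` and `DO-NOT-REFACTOR-END` in comments) and assumes the changed line numbers have already been extracted from a diff by some other step.

```rust
use std::collections::BTreeSet;

/// Returns the changed line numbers (1-based) that fall inside a protected
/// region. The marker strings are a hypothetical convention, not a standard.
fn protected_changes(source: &str, changed_lines: &[usize]) -> Vec<usize> {
    const BEGIN: &str = "DO-NOT-REFACTOR-BEGIN";
    const END: &str = "DO-NOT-REFACTOR-END";

    let mut protected = BTreeSet::new();
    let mut inside = false;
    for (index, line) in source.lines().enumerate() {
        if line.contains(BEGIN) {
            inside = true;
        }
        if inside {
            protected.insert(index + 1); // marker lines are protected as well
        }
        if line.contains(END) {
            inside = false;
        }
    }

    changed_lines
        .iter()
        .copied()
        .filter(|line| protected.contains(line))
        .collect()
}
```

A CI step could fail the build, or require an extra human approval, whenever this returns a non-empty list. The marker comment is also the natural place to record the reason, such as the 2012 migration incident behind `magic_offset_42`.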

Chapter 10 · Prompt drift in long sessions — "The agent slowly loses the thread"

Pattern

An eight-hour agent session. The first hour is flawless. Around hour four the code style gets weird — variable naming convention drifts slightly, error-handling patterns change a bit, comments thin out. By hour seven the agent acts as if it forgot the initial instructions.

Why it happens

The longer the context, the more the model weights recent tool results over the original instructions. This is essentially attention dilution. Automatic context compression (summarization) adds another step where subtle information disappears. The model slowly forgets who it is.

This is not fully solved by model size. 200K, 1M, 10M context windows — relative attention patterns remain.

What humans should keep doing

Keep sessions short. Break large work into multiple sessions. Each session has clear input and output.
Reinject AGENTS.md repeatedly. Even with a system prompt, restate the important conventions explicitly.
When you see drift, start a new session. No sunk-cost regret. Context reset is the most powerful tool you have.

A one-line heuristic for measuring drift

At the start of the session, ask "summarize the three core rules of this task in one line each," and save the answer. Four hours later, ask the same question. If the answer changes, that is drift. If the answer stays the same but the code style has changed, that is partial drift. Either way: start a new session.
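The comparison step of this heuristic can be automated in a crude way. The sketch below assumes the agent answers with one rule per line and treats a rule as "kept" only when it reappears verbatim after trimming and lowercasing; real answers are often paraphrased, so a mismatch is a prompt for a human look, not proof of drift. The function names are illustrative.

```rust
use std::collections::BTreeSet;

/// Normalizes an answer into a set of rules: one rule per non-empty line,
/// trimmed and lowercased so cosmetic differences do not count as drift.
fn rules(answer: &str) -> BTreeSet<String> {
    answer
        .lines()
        .map(|line| line.trim().to_lowercase())
        .filter(|line| !line.is_empty())
        .collect()
}

/// Rules stated at session start that are missing from the later answer.
fn dropped_rules(baseline: &str, later: &str) -> Vec<String> {
    let later = rules(later);
    rules(baseline)
        .into_iter()
        .filter(|rule| !later.contains(rule))
        .collect()
}
```

If the hour-zero answer lists "Errors are Results, not panics", "Logging is structured", and "Async is channel-based", and the hour-four answer no longer mentions the async rule, `dropped_rules` returns that one rule.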

Chapter 11 · So what does AI do well (the control group)

This post is not anti-AI. To be fair we have to name where AI clearly multiplies.

Tasks where AI shines

Bootstrap — new project scaffolds, boilerplate, the first 100 lines of familiar patterns.
Well-defined single tasks — "convert this function to TypeScript," "this SQL into ORM code," "add the missing test cases for this function."
Read, summarize, explain — summarize a 1000-line file in 200, find the real error in a build log, explain the intent of a PR change.
Repetitive well-defined migrations — anything expressible as a consistent codemod.
Test writing — especially unit tests where input/output shape is clear.
Doc generation — docstrings from code, README drafts, API specs.
Familiar debugging — null pointer, off-by-one, common missing-transaction patterns.

In these areas AI is a 3x to 10x multiplier. Nothing to argue against.

Areas where AI is strictly faster than humans

Typing itself — keyboard input is always slower for humans.
API surface memorization — human time spent flipping through docs goes to zero.
Format conversion — JSON to YAML, schema to TypeScript, REST to GraphQL.
Translation — both natural language and code language.

Chapter 12 · "Where AI shines" vs "Where humans multiply" matrix

Task type	AI alone	AI + human (human multiplier)
Boilerplate, scaffolds	Strong	Final 5-min human review
Well-defined function writing	Strong	Human defines the signature
Unit tests	Strong	Human suggests edge case candidates
Code summary, explanation	Strong	Fine without humans
Familiar debugging	OK	Human forces hypothesis diversity
Deep bug debugging	Weak	Human breaks hypothesis tunneling
Small refactors	OK	Human bounds scope
Large architecture	Weak	Human owns the big picture
Catching performance regressions	Weak	Human interprets profiler output
Concurrency code	Risky	Human + formal verification
Security code	Risky	Human security expert + automated tools
Legacy changes	Risky	Human teaches the implicit assumptions
New libraries	Reverts to old API patterns	Human writes a pattern first, then delegates
Multi-repo changes	Inconsistent	Human writes codemod + reviews
Ambiguous trade-offs	Guesses	Human teaches team preference
Long-session work	Drifts	Split sessions (human decision)

The matrix is the message. The question is not "do I use AI or not" but "where in this task am I the multiplier".

Chapter 13 · Anti-patterns — common mistakes

Anti-pattern 1: "The agent has been chasing one hypothesis for eight hours; let it keep going"

Stop and doubt the hypothesis. First rule of human debugging.

Anti-pattern 2: "The agent wrote security code; humans did not look at it"

Security lives in negative space, in what is missing from the code. Automated tools plus expert review are non-negotiable.

Anti-pattern 3: "Give a one-line 'add feature X' to a big codebase"

The agent sees slices. Force a "search for existing similar code" step.

Anti-pattern 4: "Trust the agent's eyeballed read on performance regressions, without CI"

Regressions need deterministic measurement. AI measurement is supplementary.

Anti-pattern 5: "Let long sessions run"

Past four hours, assume drift. Split sessions.

Anti-pattern 6: "RAG alone for new libraries"

RAG does not overwrite weight priors. Write the pattern yourself first, then delegate.

Anti-pattern 7: "Hand legacy refactor entirely to the agent"

Legacy = implicit knowledge. Human review mandatory.

Anti-pattern 8: "Verify agent-written concurrency code with unit tests only"

Races rarely show up in unit tests. Use ThreadSanitizer or loom.

Anti-pattern 9: "Leave team conventions out of AGENTS.md"

The agent guesses if it does not know. Explicit conventions are a multiplier.

Anti-pattern 10: "Auto-merge AI results"

Human review is the last safety net. Auto-merge buys trust too cheaply.
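
Anti-patterns 4 and 10 both come down to having a deterministic gate before merge. A minimal sketch of such a gate for performance is shown here, assuming you already collect repeated timings for a baseline and a candidate build. `is_regression` and the 10% tolerance are hypothetical choices, not part of any CI product.

```python
from statistics import median


def is_regression(baseline_ms: list[float], candidate_ms: list[float], tolerance: float = 0.10) -> bool:
    """Return True when the candidate median is more than `tolerance` slower than the baseline median."""
    if not baseline_ms or not candidate_ms:
        raise ValueError("both runs need at least one sample")
    return median(candidate_ms) > median(baseline_ms) * (1.0 + tolerance)


# Illustrative timings in milliseconds; real numbers would come from your benchmark job.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
candidate_ok = [103.0, 101.0, 104.0, 102.0, 105.0]
candidate_bad = [118.0, 121.0, 125.0, 119.0, 122.0]

ok_blocked = is_regression(baseline, candidate_ok)
bad_blocked = is_regression(baseline, candidate_bad)
```

The median is used because it is less sensitive to a single noisy run than the mean. The agent's reading of a profile can then be compared against this number instead of replacing it.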

Chapter 14 · Developer checklist (calibration tool)

Before starting a task:

Which of the ten failure modes above does this task touch?
Where in the workflow does human review need to happen (PR, design, before merge)?
Are the safety nets (CI, lint, SAST) actually in place?
Are team conventions written in AGENTS.md?
If new libraries or APIs are involved — have you written a sample pattern yourself?
If it is a multi-repo change — are the first one or two repos being done by hand?
Does this touch security, concurrency, or performance-critical code? Is expert review scheduled?

During the task:

Have you been chasing the same hypothesis for more than four hours?
Has the session exceeded four hours? Check for drift.
Is the agent reverting to "old API patterns"?

After the task:

Did a human actually read the result at least once?
Did performance regression tests and security scans run?
In multi-repo work — did every repo's build pass?
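
The "During the task" checks above are mechanical enough to sketch in code. This is an illustration only, assuming you track the elapsed hours yourself. `TaskState` and `during_task_warnings` are hypothetical names, and the four-hour limit is taken from the checklist.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Hand-maintained snapshot of the current task (hypothetical type)."""
    session_hours: float
    same_hypothesis_hours: float
    reverting_to_old_api: bool


def during_task_warnings(state: TaskState, limit_hours: float = 4.0) -> list[str]:
    """Return one warning per 'During the task' checklist item that is currently tripped."""
    warnings = []
    if state.same_hypothesis_hours > limit_hours:
        warnings.append("same hypothesis for too long: list alternative hypotheses")
    if state.session_hours > limit_hours:
        warnings.append("long session: check for drift or start a new session")
    if state.reverting_to_old_api:
        warnings.append("agent is reverting to old API patterns: re-supply the sample pattern")
    return warnings
```

An empty list means none of the three checks is tripped. It does not mean the work is correct.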

Epilogue — The developer as multiplier

Again: this is not AI skepticism.

AI coding assistants really are multipliers. But a multiplier only matters when applied to a non-zero value. That non-zero value is your judgment.

If judgment is zero (you hand 100% of the task to the agent and never review the result), any multiplier times zero is zero. Sometimes the result is negative: cleaning up code that the agent got wrong in hard-to-find ways takes longer than writing it from scratch.

If judgment is one — you know the ten failure modes, install safety nets, break hypothesis tunneling, split sessions, hand security/concurrency/performance to humans where appropriate — that is when AI becomes a real multiplier.

The most important skill for a mid-2026 developer is not the ability to doubt AI but the ability to estimate AI's confidence interval accurately.

AI coding assistants are improving rapidly. But some of the ten failure modes above will not be solved by model size alone. Hypothesis tunneling, implicit knowledge, team preference, and the negative space of security all need human context, and that context will live in human heads for the foreseeable future.

Your job does not shrink. It changes shape. Less time typing code, more time deciding what to trust. That is the developer-as-multiplier shape.

"Reviewing AI Code — 7 Signals Every Human Reviewer Should Watch" — where AI code is most suspect, which patterns lead to "passed review then incident," PR review checklists.
"How to Write AGENTS.md — Teaching Your Team Conventions to the Agent" — real working examples, what kinds of information move the needle and what is just noise.

References

Benchmarks and evaluations

SWE-bench Verified leaderboard — https://www.swebench.com/ (tracks agent performance ceilings)
OSWorld benchmark — https://os-world.github.io/ (desktop-environment task limits)
MLE-bench — https://github.com/openai/mle-bench (agent evaluation on ML engineering)
HumanEval and LiveCodeBench — the gap between single-function ability and real engineering ability

Context rot, hypothesis tunneling, attention dilution

"Lost in the Middle" — Liu et al., 2023, first to quantify attention dilution in long context
"Context Rot" — Anthropic, long-context degradation analysis in the Claude 4 series
"Confirmation Bias in LLM-based Agents" — repeatedly reported in recent evaluation studies

Concurrency and security limits of LLMs

"LLMs and Concurrency: A Survey" — 2025, why formal verification remains necessary
"Security Implications of Code Generated by LLMs" — Pearce et al., 2025 update, common CWE frequencies

Practical guides and opinion

Simon Willison's blog — https://simonwillison.net/ — a model of calibrated takes
"When AI Coding Assistants Fail" — collections of failure cases across engineering blogs
Anthropic "Best practices for agentic coding" — official guide, names failure modes

Tools

Claude Code, Cursor, Codex CLI — mentioned in the body
semgrep, gosec, bandit — security automation (for AI-assisted use)
ThreadSanitizer, loom — concurrency verification
TLA+ — formal verification

TL;DR — AI coding assistants are powerful but uneven. Across deep debugging, architecture, performance regressions, concurrency, security, legacy, ambiguity, new tech, multi-repo, and long sessions, human judgment is still the multiplier. Do not doubt AI. Estimate AI's confidence interval accurately. That is the most valuable developer skill in mid-2026.