How to Review AI-Generated Code: The Verification Discipline for Agent Output and Filtering 'AI Slop'

"When the cost of producing code approaches zero, the cost of confirming the code is correct becomes everything."

Prologue — The Bottleneck Is Now Review

A few years ago, the bottleneck in a software team was writing the code. Implementing one feature meant a human sat at a keyboard and typed it out line by line, and that time dominated the schedule.

In 2026, that bottleneck is gone. A coding agent generates hundreds of lines in minutes, and the time it takes for one feature to land as a PR has dropped to a tenth of what it was. But team throughput has not risen tenfold, because the bottleneck did not disappear; it moved. The bottleneck is now review.

Generation got faster; verification did not. So PRs pile up in the review queue, reviewers read hundreds of lines a day of code they did not write, and they oscillate between the temptation to "just approve it" and the dread of "I have to re-check everything."

It Is a Different Skill From Human Code Review

Here is the core claim up front. Reviewing AI-written code is a different skill from reviewing a human PR.

Human code review is largely a social act. PR etiquette, merge queues, tone calibration like "this might just be a matter of taste," comments that help a colleague grow: these make up half of it. Human review assumes trust. The assumption underneath is: "this person knows the domain, and if they didn't, they would have asked."

AI code review has no such assumption. The agent does not ask when it does not know. It fills in with the most plausible thing. Confidence and accuracy are decoupled. So AI code review is not a social act — it is a verification discipline. There is no tone to calibrate, no growth to support. Instead you confirm "is this actually correct" systematically, with suspicion as the default.

This article is a practical guide to that verification discipline. It is not a generic human-code-review article — that is a different topic. This article covers only one thing: how to verify code an agent wrote.

What this article covers:

Chapter	Topic
1	Why AI code needs a different review lens
2	The characteristic failure patterns of AI-generated code
3	The verification loop — types before tests before humans
4	Reading AI diffs efficiently
5	'AI slop' — what it is and how to filter it
6	When AI reviews AI code
7	Making your codebase verifiable
8	The human's irreducible job
Epilogue	Checklist + anti-patterns + next-post teaser

Chapter 1 · Why AI Code Needs a Different Review Lens

Bugs in human-written code and bugs in AI-written code are distributed differently. Look at both through the same lens and you will miss the AI ones.

1.1 Plausible-but-Wrong Code

A human's mistakes usually look "rough." The variable name is odd, the indentation is off, it visibly looks unfinished. The reviewer's eye naturally stops there.

AI's mistakes are different. The surface is smooth. The variable names are appropriate, the structure follows convention, there are even comments. And yet the logic is subtly wrong. This is the most dangerous case — because the code looks "well written," the reviewer's suspicion sensor never switches on.

Human code review: you look for "the parts that look wrong." AI code review: you look for "the parts that look right but are wrong." Much harder.

1.2 Confident Hallucination

An agent does not say "I'm not sure." It confidently calls APIs that do not exist, imports packages that are not there, and invents config keys with plausible names that do nothing. The tone of the output carries no trace of uncertainty.

A human leaves signals when working in unfamiliar territory — a question mark in a comment, an "is this right?" note, a draft PR. The agent's output has none of that. Confident code and hallucinated code look identical.

1.3 Missing Edge Cases

AI writes the "happy path" well. The code for when input is valid, the network is fine, and the array is not empty is almost always correct.

The problem is everything outside that. Empty arrays, null input, timeouts, concurrency, partial failure, integer overflow: a senior engineer thinks of these cases reflexively, but AI frequently omits them unless you explicitly demand them. The code runs perfectly in the demo and falls apart in the third week of production.

1.4 Over-Engineering

There is a failure in the opposite direction too. You ask "uppercase the user's name" and back comes an abstract factory, a strategy-pattern interface, and a configurable transformation pipeline. AI tends to over-apply the "enterprise-grade" patterns it saw in training data to small problems.

Over-engineering is not a bug, but it is debt. There is more code to read, the maintenance surface grows, and the next person (or the next agent) gets lost.

1.5 Subtle API Misuse

AI uses APIs "approximately" correctly. The function name is right but the argument order is wrong, one key of the options object is from an old version, an async function is called without await, or the meaning of a return value is subtly misunderstood.

When the type system is strong, a good portion of this is caught at the compile stage. So the conclusion of Chapter 1 leads not to the next chapter but to Chapter 3 — what machines can catch should be left to machines.

Chapter 2 · The Characteristic Failure Patterns of AI-Generated Code

If Chapter 1 was "why it is different," Chapter 2 is "concretely, what to look for." These are the check items to keep loaded in your head while reviewing.

2.1 The Pattern Catalog

Pattern	Symptom	How to catch it
Hallucinated API/package	Functions, options, libraries that do not exist	Build/typecheck, check the dependency lockfile
Copy-paste inconsistency	The same logic, subtly different per file	Scan the whole diff at once
No error handling	No `try`/`catch`, failures ignored	Ask "what if this fails?" of every external call
Security blind spots	Unvalidated input, hardcoded secrets, injection	Trace data flow along the trust boundary
Tests that test nothing	Pass, but verify nothing	Plant a bug in the implementation and see whether the test fails
Over-engineering	Big abstraction for a small problem	Ask "is this really needed"

2.2 Hallucinated APIs and Packages

The most common and the easiest to catch. The agent invents the API it "wishes existed."

```typescript
// AI-generated code: plausible but wrong
import { retryWithBackoff } from 'lodash' // lodash has no such function
import dayjs from 'dayjs'

const result = await fetchUser(userId, {
  retry: 3,           // fetchUser options have no retry (hallucination)
  timeout: '5s',      // timeout takes a number (ms), so this is type misuse
})

const formatted = dayjs(result.createdAt).format('YYYY-MM-DD')
```

The code above reads naturally. But lodash has no retryWithBackoff, and fetchUser has no retry option. The good news: this class is almost entirely caught by typecheck and build. A human does not need to find it by eye.
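Why the typechecker catches this class can be sketched in a few lines. The article never shows the real declaration of fetchUser, so the `FetchUserOptions` interface and the synchronous `fetchUserStub` below are hypothetical stand-ins. Each `@ts-expect-error` line marks a call the compiler rejects before any human reads it.

```typescript
// Hypothetical declaration: the real fetchUser signature is not shown in the article.
interface FetchUserOptions {
  timeout?: number; // milliseconds
}

interface UserRecord {
  id: number;
  createdAt: string;
}

// Synchronous stub so the sketch runs without a network.
function fetchUserStub(userId: number, options: FetchUserOptions = {}): UserRecord {
  const timeout = options.timeout ?? 5000;
  if (timeout <= 0) {
    throw new RangeError("timeout must be a positive number of milliseconds");
  }
  return { id: userId, createdAt: "2026-01-01T00:00:00Z" };
}

// Correct call: compiles.
const okUser = fetchUserStub(42, { timeout: 5000 });

// Hallucinated option: the compiler rejects the unknown key `retry`.
// @ts-expect-error object literal may only specify known properties
fetchUserStub(42, { retry: 3 });

// Type misuse: a string where the declaration says number.
// @ts-expect-error string is not assignable to number
fetchUserStub(42, { timeout: "5s" });
```

The same two mistakes would pass silently if the options parameter were typed as `any`, which is why loose typing shifts this work onto the reviewer.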

2.3 Copy-Paste Inconsistency

An agent does the same thing several ways even within a single PR. async/await in one file, .then() in another. Throws an error in one place, returns null in another. Each piece looks fine on its own, but as a whole there is no consistency.

To catch this you cannot look at the diff file by file — you have to scan the whole PR at once. The question is: "how many ways does this PR do the same kind of thing?"

2.4 No Error Handling

The happy path of AI code is clean. And the unhappy path does not exist. There is no try/catch around the external API call, a file-read failure is ignored, JSON.parse is called without a guard.

Review rule: ask "what happens if this fails here?" of every external boundary (network, disk, parsing, third-party calls). In AI diffs, the answer to that question is "the app crashes" surprisingly often.
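As one possible answer to "what happens if this fails here?", the sketch below wraps JSON.parse so that a bad payload becomes a value the caller has to handle instead of an uncaught exception. The `AppConfig` shape and the `parseAppConfig` helper are hypothetical names chosen for illustration.

```typescript
// The caller must inspect `kind` before it can touch the value.
type ParseResult<T> =
  | { kind: "ok"; value: T }
  | { kind: "error"; reason: string };

interface AppConfig {
  port: number;
}

// Hypothetical helper: parses a config payload without letting bad input crash the app.
function parseAppConfig(raw: string): ParseResult<AppConfig> {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: "error", reason: `invalid JSON: ${reason}` };
  }
  // A successful parse says nothing about the shape, so check that too.
  if (typeof data !== "object" || data === null) {
    return { kind: "error", reason: "expected a JSON object" };
  }
  const port = (data as Record<string, unknown>)["port"];
  if (typeof port !== "number" || port < 1 || port > 65535) {
    return { kind: "error", reason: "port must be a number between 1 and 65535" };
  }
  return { kind: "ok", value: { port } };
}
```

Returning a tagged result rather than throwing is a design choice, not the only valid one. What matters in review is that the failure path exists and is visible in the signature.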

2.5 Security Blind Spots

An agent treats security as "not a feature." It builds code that does what was asked, but it does not assume adversarial input. Common blind spots:

Wiring user input into queries, commands, or paths without validation
Hardcoding secrets in the code (to "just make it work" in the demo)
Missing authorization checks — never asking "can this user do this"
Printing sensitive information to logs

Put a finger on the screen and trace the data across each trust boundary: where it enters and where it ends up. AI does not do this on its own.
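The first blind spot, user input wired straight into a query, is the easiest to show in code. This is a minimal sketch with hypothetical function names and table; the `$1` placeholder follows the PostgreSQL convention, and `ParameterizedQuery` is a plain object, not any particular driver's API.

```typescript
// A query split into fixed SQL text and separately passed values.
interface ParameterizedQuery {
  text: string;
  values: ReadonlyArray<string | number>;
}

// Unsafe: user input becomes part of the SQL text itself.
function unsafeFindUserSql(rawName: string): string {
  return `SELECT id FROM users WHERE name = '${rawName}'`;
}

// Safer: validate against an allowlist, then pass the value as a parameter.
function findUserQuery(rawName: string): ParameterizedQuery {
  if (!/^[A-Za-z0-9_]{1,32}$/.test(rawName)) {
    throw new Error("invalid user name");
  }
  return { text: "SELECT id FROM users WHERE name = $1", values: [rawName] };
}

const hostileInput = "x' OR '1'='1";

// The hostile input rewrites the WHERE clause of the unsafe version.
const injectedSql = unsafeFindUserSql(hostileInput);
```

When tracing a diff, the thing to look for is the first form: any place where a value that crossed a trust boundary is concatenated into text that something else will interpret.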

2.6 Tests That Verify Nothing

The most insidious pattern. Ask an agent to "write tests too" and it writes tests. Tests that pass. But look at what those tests verify and it is often nothing.

```typescript
// AI-generated "test": passes but is meaningless
test('calculateDiscount works', () => {
  const result = calculateDiscount(100, 0.1)
  expect(result).toBeDefined()        // passes for anything
  expect(typeof result).toBe('number') // passes for 90 or 9999
})
```

This test passes whether calculateDiscount returns 0 or -50, because it never verifies the actual value (90). Worse, some tests copy the implementation verbatim and confirm "the implementation equals the implementation."

The check is simple: deliberately try to make the test fail. Plant a bug in the implementation, and if the test still passes, that test is fake.
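The "plant a bug" check can be sketched without a test framework. The implementation of calculateDiscount is not shown in the article, so the version below (price minus rate times price) is an assumption, and `calculateDiscountMutant` is the deliberately planted bug. `weakCheck` mirrors the assertions from the test above; `strongCheck` pins actual values.

```typescript
// Assumed behavior: calculateDiscount(100, 0.1) should be 90.
function calculateDiscount(price: number, rate: number): number {
  return price * (1 - rate);
}

// A deliberately planted bug (a "mutant"): adds the discount instead of subtracting it.
function calculateDiscountMutant(price: number, rate: number): number {
  return price * (1 + rate);
}

type DiscountFn = (price: number, rate: number) => number;

// The weak check: only asserts that some number came back.
function weakCheck(fn: DiscountFn): boolean {
  const result = fn(100, 0.1);
  return result !== undefined && typeof result === "number";
}

// A strong check: pins the expected value plus two boundary rates.
function strongCheck(fn: DiscountFn): boolean {
  const close = (a: number, b: number): boolean => Math.abs(a - b) < 1e-9;
  return close(fn(100, 0.1), 90) && close(fn(100, 0), 100) && close(fn(100, 1), 0);
}

// A check "catches" the mutant when it fails on the buggy implementation.
const weakCatchesMutant = !weakCheck(calculateDiscountMutant);
const strongCatchesMutant = !strongCheck(calculateDiscountMutant);
```

Here the weak check accepts the mutant, so it verifies nothing, while the strong check rejects it. Mutation-testing tools automate this same idea across a whole suite.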

Chapter 3 · The Verification Loop — Types Before Tests Before Humans

Finding the patterns of Chapter 2 one by one with human eyes is exhausting. The core principle is this: what machines can catch, leave to machines; spend human attention on what machines cannot catch.

3.1 Three Layers of Filter

Think of verification as a three-stage filter. Cheap filters first, expensive filters later.

Stage	Tool	What it catches	Cost
1. Types	Typechecker, linter, compiler	Hallucinated APIs, type misuse, unused variables	Near zero (seconds)
2. Tests	Unit/integration tests, static analysis	Logic errors, regressions, edge cases	Low (minutes)
3. Humans	The reviewer's judgment	Architecture, intent, "does this make sense"	Expensive (human time)

The principle: code that does not pass stage 1 is not worth a human's eyes. A human reviewing a PR with type errors is a waste of human time. Let CI filter it first.
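The "cheap filters first" ordering amounts to a pipeline that stops at the first failing stage. The sketch below uses hypothetical stand-in functions where a real setup would invoke the typechecker, the test runner, and a review queue.

```typescript
interface Stage {
  name: string;
  run: () => boolean; // true means the stage passed
}

interface StageResult {
  stage: string;
  passed: boolean;
}

// Runs stages in the given order and stops at the first failure,
// so expensive stages never run on input a cheap stage already rejected.
function runVerification(stages: Stage[]): StageResult[] {
  const results: StageResult[] = [];
  for (const stage of stages) {
    const passed = stage.run();
    results.push({ stage: stage.name, passed });
    if (!passed) break;
  }
  return results;
}

// Stand-ins for real tools; each records that it was executed.
const executedStages: string[] = [];
const pipeline: Stage[] = [
  { name: "types", run: () => { executedStages.push("types"); return false; } },
  { name: "tests", run: () => { executedStages.push("tests"); return true; } },
  { name: "human", run: () => { executedStages.push("human"); return true; } },
];

// The type stage fails, so neither the tests nor the human stage is reached.
const outcome = runVerification(pipeline);
```

In practice this is usually expressed as CI job dependencies rather than application code, but the short-circuit is the same.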

3.2 Types as the First Line of Defense

A strong type system is the biggest leverage in AI code review. The hallucinated APIs, API misuse, and wrong arguments of Chapter 2 — a large share of these fall out as compile errors at typecheck, before a human reads a single line.

So loose typing gets more expensive in the AI era. In a codebase plastered with any, the first filter does not work, and that whole load shifts to stage 3 (humans).

3.3 Tests as the Second Line of Defense

If types check "is this code in a shape that makes sense," tests check "does this code behave correctly." But remember the trap of 2.6 — the tests the AI wrote are themselves subject to verification.

Recommended order: have the human write the tests first (or have the human pin down the spec), then leave the implementation to the agent. If the tests exist first, those tests cannot be "fakes that copied the implementation."

3.4 Humans Last, but Most Important

Only code that has passed the two machine layers reaches a human. At that point, human attention concentrates on what machines cannot catch in principle: is the architecture right, does this change match the intent of the ticket, is this even a sensible approach in the first place. That is the subject of Chapter 8.

The verification loop in one line: after passing every cheap machine check, the human looks only at what machines cannot see.

Chapter 4 · Reading AI Diffs Efficiently

Even when a PR has passed the machine checks of the verification loop, a human still has to read the diff. An AI diff is read differently from a human-written one.

4.1 The First Question: Does It Match the Ticket?

When you open an AI diff, you feel the urge to look at code quality first. Resist it. The first question is always this: does this diff do what the ticket asked for?

An agent interprets the ticket "approximately." It does 80% of what was asked and 20% differently, or it adds things that were not asked for. No matter how clean the code is, if it implemented the wrong thing it is meaningless. Cross-check the ticket's acceptance criteria against the diff, line by line.

4.2 Detecting Scope Creep

The second question: did this diff touch something it should not have touched?

An agent tends to fix other things "while it is at it." It changes formatting, refactors unrelated files, bumps dependency versions. Each may be well-intentioned, but together they make a diff that cannot be reviewed.

Signal	What to suspect
Number of changed files is large for the ticket's size	Scope creep
Changes scattered across unrelated directories	"while I'm at it" refactoring
Lockfile or config files changed for no reason	Unintended dependency change
Mass of lines where only formatting changed	Noise — hides the substantive change

Send these diffs back. "Cut this down to the ticket's scope and resubmit" is a legitimate request.
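Two of the signals in the table are mechanical enough to check before reading a single line. A minimal sketch, assuming the reviewer supplies the directory prefixes the ticket is expected to touch; `checkScope` and the file paths are hypothetical.

```typescript
interface ScopeReport {
  outOfScope: string[]; // changed files outside the directories the ticket should touch
  lockfiles: string[];  // dependency lockfiles that changed
}

// The allowed prefixes come from the reviewer's reading of the ticket.
function checkScope(changedFiles: string[], allowedPrefixes: string[]): ScopeReport {
  const lockfileNames = ["package-lock.json", "yarn.lock", "pnpm-lock.yaml"];
  const isLockfile = (file: string): boolean => {
    const baseName = file.split("/").pop() ?? file;
    return lockfileNames.indexOf(baseName) !== -1;
  };
  const inScope = (file: string): boolean =>
    allowedPrefixes.some((prefix) => file.slice(0, prefix.length) === prefix);
  return {
    outOfScope: changedFiles.filter((file) => !isLockfile(file) && !inScope(file)),
    lockfiles: changedFiles.filter(isLockfile),
  };
}

const scopeReport = checkScope(
  [
    "src/billing/invoice.ts",
    "src/billing/invoice.test.ts",
    "src/auth/session.ts",
    "package-lock.json",
  ],
  ["src/billing/"],
);
```

For this input the report flags `src/auth/session.ts` as out of scope and `package-lock.json` as a lockfile change. Neither is automatically wrong, but each needs a reason the ticket can account for.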

4.3 The Order to Read a Diff

There is an efficient order:

1. PR description: what does it claim to have changed, and why
2. Tests: what does this change claim to guarantee
3. Core logic: does the claim match the actual code
4. Boundaries and error handling: is there an unhappy path
5. The rest: config, formatting, incidental changes

Why read tests before logic: tests are the spec of "what this code is supposed to do." Read the spec first, and the divergences jump out when you read the logic.

4.4 Diff Size Is Inversely Proportional to Trust

You can read a small diff carefully. Nobody stays focused to the end of an 800-line diff, whether an AI or a human wrote it. AI produces large diffs easily, so always ask whether the PR can be split into smaller ones. Reviewability is inversely proportional to diff size.

Chapter 5 · 'AI Slop' — What It Is and How to Filter It

5.1 What AI Slop Is

AI slop is the term for AI-generated output that looks plausible but has low real value. AI slop in code is "code that compiles and passes tests but makes the codebase worse."

Slop is not an obvious bug. Bugs, frankly, are easier to catch. Slop is code that is subtly bloated, subtly inconsistent, subtly unnecessary. It does not show in one PR, but stack up 100 PRs and the codebase becomes a swamp.

5.2 The Signs of Slop

Sign	Description
Verbosity	30 lines for what takes 5. Unnecessary helpers, wrappers, abstractions
Empty comments	`getUser()` under `// gets the user` — comments that repeat the code
Defensive noise	Checks for conditions that cannot happen, everywhere
Inconsistency	Five styles in the same codebase
Dead abstractions	An interface used once, a factory with a single implementation
Plausible dummies	Meaningless tests, functions with only a TODO, placeholder logic
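To make the "verbosity" and "dead abstractions" signs concrete, here is a small before-and-after sketch built on the "uppercase the user's name" request from Chapter 1. All type and function names are invented for illustration.

```typescript
// Slop version: an interface, a class, and a factory for a one-line job.
interface NameTransformer {
  transform(input: string): string;
}

class UpperCaseNameTransformer implements NameTransformer {
  transform(input: string): string {
    return input.toUpperCase();
  }
}

function createNameTransformer(): NameTransformer {
  return new UpperCaseNameTransformer();
}

const viaFactory = createNameTransformer().transform("ada");

// Reviewed version: the abstraction had exactly one implementation and one caller,
// so it is inlined.
const inlined = "ada".toUpperCase();
```

Both versions produce the same string. The first one compiles and would pass any test, which is exactly why it slips through unless the reviewer asks what the extra layers buy.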

5.3 The Questions That Filter Slop

The questions to ask while reviewing are simple:

"If I delete this line, what breaks?" — if the answer is "nothing," it is slop.
"How many places use this abstraction?" — if it is one place, inline it.
"Does this comment say something the code does not?" — if not, delete it.
"Can this PR be cut to half its size?" — usually it can.

5.4 Slop Is a Problem of Acceptance, Not Generation

An important perspective: it is AI that produces slop, but it is the human who lets slop into the codebase. The agent only proposes slop; the one who presses the merge button is the reviewer.

So the solution to the slop problem is not "do not use AI" but "do not lower the review bar." You must not pass AI code more leniently than human code. If anything, because of the confident surface, you must look more strictly.

Chapter 6 · When AI Reviews AI Code

Adding one more "AI reviewer" stage to the verification loop is reasonable. But there is a trap.

6.1 Generator and Verifier Must Be Separated

The core principle: the agent that generated the code must not review the same code.

An agent with the same model, the same context, the same assumptions cannot see the hallucination it made itself. The reasoning process that invented the wrong API will, unchanged, judge that "this API is correct." In human terms, it is grading your own exam.

The verifier must be separated — a different model, or at minimum a separate session with a different context and a different prompt.
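The separation rule is simple enough to enforce in whatever script wires the agents together. A minimal sketch, assuming each agent run is described by a model name, a session id, and a system prompt; the `AgentSession` shape and the model names are hypothetical.

```typescript
interface AgentSession {
  model: string;
  sessionId: string;
  systemPrompt: string;
}

// A reviewer counts as independent if it is a different model, or at minimum
// a separate session with a different prompt.
function isIndependentReviewer(generator: AgentSession, reviewer: AgentSession): boolean {
  if (reviewer.model !== generator.model) {
    return true;
  }
  return (
    reviewer.sessionId !== generator.sessionId &&
    reviewer.systemPrompt !== generator.systemPrompt
  );
}

const generatorRun: AgentSession = {
  model: "model-a",
  sessionId: "s-1",
  systemPrompt: "Implement the ticket.",
};

// Same model, same session, same prompt: this is self-grading.
const selfReview = isIndependentReviewer(generatorRun, generatorRun);

// Same model, but a fresh session with a review-specific prompt.
const separateReview = isIndependentReviewer(generatorRun, {
  model: "model-a",
  sessionId: "s-2",
  systemPrompt: "Review this diff for defects.",
});
```

This check only covers configuration. It cannot tell whether the reviewer's context was quietly seeded with the generator's reasoning, which still has to be avoided by how the review prompt is built.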

6.2 What an AI Reviewer Is Good and Bad At

What an AI reviewer is good at	What an AI reviewer is bad at
Pattern matching — known anti-patterns, common bugs	Architectural judgment — "is this approach right"
Consistency checks — style, naming	Intent matching — "is this really what the ticket wanted"
Checklist application — missing error handling, etc.	Trade-offs — "is this complexity worth it"
Surface-level security — hardcoded secrets, etc.	Domain correctness — are the business rules right

In summary: an AI reviewer is stage 1.5 of the verification loop — smarter than a typecheck but not a replacement for a human. It only widens the range of what machines can catch; the final judgment is still the human's.

6.3 A Practical Arrangement

The recommended configuration:

1. Agent A generates the code
2. Typecheck, tests, lint (machine stage 1)
3. Agent B (different context) reviews: patterns, consistency, checklist
4. A human does the final review: architecture, intent, judgment

The AI reviewer's comments are input, not conclusion. The human must filter both what the AI reviewer missed and what it overreacted to.

Chapter 7 · Making Your Codebase Verifiable

The efficiency of the verification loop depends less on the skill of verifying code and more on how verifiable the codebase is. A codebase that is easy to verify pays compounding returns in the AI era.

7.1 Strong Types

Already said in Chapter 3, but worth stressing again. Strong types determine the performance of the first filter. Reduce any, express the domain as types (types like UserId, Email instead of primitives), and make function signatures honest. The stronger the types, the less a human has to read.
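One common way to express the domain as types in TypeScript is "branding": the values stay plain strings at runtime, but the compiler refuses to mix them up. The sketch below is one such approach, with hypothetical validation rules (the `u_` prefix and the loose email pattern are assumptions, not a recommendation).

```typescript
// Branded types: strings at runtime, but not interchangeable at compile time.
type UserId = string & { readonly __brand: "UserId" };
type Email = string & { readonly __brand: "Email" };

// A branded value is meant to be obtained only through a validating constructor.
function toUserId(raw: string): UserId {
  if (!/^u_[0-9]+$/.test(raw)) throw new Error(`invalid user id: ${raw}`);
  return raw as UserId;
}

function toEmail(raw: string): Email {
  if (!/^[^@\s]+@[^@\s]+$/.test(raw)) throw new Error(`invalid email: ${raw}`);
  return raw as Email;
}

function sendWelcome(to: Email, userId: UserId): string {
  return `welcome mail for ${userId} queued to ${to}`;
}

const newUserId = toUserId("u_42");
const newUserEmail = toEmail("ada@example.com");
const queued = sendWelcome(newUserEmail, newUserId);

// Swapped arguments no longer compile, which is the point of the brand.
// @ts-expect-error UserId is not assignable to Email
sendWelcome(newUserId, newUserEmail);
```

With plain `string` parameters, the swapped call is exactly the kind of subtle API misuse from section 1.5 that reads fine in a diff. With brands it never reaches the reviewer.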

7.2 Fast Tests

If tests are slow, the verification loop breaks. A 30-minute test suite invites "merge it now and look later." Tests must be fast, deterministic, and trustworthy. A flaky test is as harmful as slop — because it turns signal into noise.
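"Deterministic" can be checked the blunt way: run the same test several times and see whether the outcome changes. A minimal sketch, where a test is modeled as a function returning pass or fail; the order-dependent test is a contrived stand-in for hidden shared state.

```typescript
type TestFn = () => boolean; // true means the test passed

type TestClass = "stable-pass" | "stable-fail" | "flaky";

// Runs the same test repeatedly; a deterministic test yields one distinct outcome.
function classifyTest(testFn: TestFn, runs: number): TestClass {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    if (testFn()) passes++;
  }
  if (passes === runs) return "stable-pass";
  if (passes === 0) return "stable-fail";
  return "flaky";
}

// Contrived test that depends on hidden shared state: it passes on every other call.
let callCount = 0;
const orderDependentTest: TestFn = () => callCount++ % 2 === 0;

const verdict = classifyTest(orderDependentTest, 10);
```

Repetition only surfaces flakiness that shows up within the chosen number of runs, so a "stable-pass" verdict is evidence, not proof.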

7.3 Clear Conventions

An agent imitates the patterns of the codebase. If the codebase has consistent conventions, the agent's output is consistent. If there are no conventions, the agent brings a different style every time (the copy-paste inconsistency of 2.3 is, in fact, also a symptom of absent conventions).

Pin conventions in documents and, where possible, in linter rules. A CONTRIBUTING document, a clear directory structure, a guide file for agents — these directly raise the quality of the agent's output.

7.4 Verifiability Checklist

Item	Question
Types	Is the share of `any` low? Is the domain expressed as types?
Tests	Does the full suite finish within minutes? Are there no flaky ones?
Lint	Are style and common mistakes caught automatically?
Conventions	Are the patterns new code should follow documented?
Boundaries	Are module boundaries clear so a diff's blast radius is narrow?

In a codebase with these five in place, AI code review is fast and trustworthy. In a codebase without them, every PR demands a full human review — and that is exactly the bottleneck.

Chapter 8 · The Human's Irreducible Job

Once machines take stages 1 and 2 and an AI reviewer takes stage 1.5, what is left for the human? A job that has shrunk but has not disappeared, and has in fact become more important.

8.1 Judgment

"This code is correct." That sentence means two things. (1) The code behaves per the spec — this, machines verify. (2) The spec itself is right — this, only a human can do.

An agent does what the ticket told it to. If the ticket is wrong, it implements the wrong thing perfectly. "Is this ticket even the right problem to solve" — machines do not ask that.

8.2 Architecture

Even when individual functions are correct, the system structure can be wrong. Whether this abstraction will hold up six months from now, whether this boundary is drawn in the right place, whether this dependency direction is healthy — these are questions pattern matching cannot answer. It is the work of imagining the future of the whole codebase, and that is a human's work.

8.3 "Does This Make Sense"

The most irreducible job is the simplest question: does this make sense.

The agent brought back perfectly functioning, type-correct, test-passing code — and the feature should perhaps never have been built in the first place. There may have been a simpler solution, or the problem should have been defined differently. Machines verify the "how" but the "why" and the "really?" are the human's.

8.4 Accountability

Finally, the person who presses the merge button is the owner of that code. "An AI wrote it" is not an excuse in front of a production outage. The essence of the verification discipline is here — whatever the tool generated, the decision to let it into the codebase is the human's, and that decision carries accountability.

The human's job did not shrink — it concentrated. The typing is gone; only the judgment remains.

Epilogue — Verification Is the New Core Competency

In an era when the cost of producing code approaches zero, the competitive edge is not "how fast you generate" but "how reliably you verify." Generation became a commodity; verification became scarce.

The one-line conclusion of this article: AI code review is not a social act but a verification discipline. Set suspicion as the default, leave what machines can catch to machines, and concentrate human attention on what machines cannot see in principle — judgment, architecture, "does this make sense."

AI Code Review Checklist

Types first — a PR that does not pass typecheck and build is not worth a human's eyes
Verify the tests — deliberately break the tests the AI wrote to confirm whether they are fake
Cross-check the ticket — see whether the diff satisfies the acceptance criteria line by line
Scope creep — see whether "while I'm at it" changes are mixed in
Check boundaries — ask "what if this fails?" of every external call
Trace security — finger on the screen, trace data flow along the trust boundary
Filter slop — ask "what breaks if I delete this line"
Generator != verifier — separate the AI reviewer from the agent that generated the code
Diff size — split large PRs. Reviewability is inversely proportional to size
Final judgment is the human's — the person who presses merge is the owner

Anti-Patterns to Avoid

Trusting the surface — switching off suspicion because the code looks smooth
Skipping types — having a human read a PR with type errors first
Blind faith in tests — mistaking "tests pass" for "verified"
Self-grading — having the generating agent review its own code
AI leniency — passing AI code more loosely than human code
Accepting giant diffs — approving an 800-line PR without reading it to the end
Dodging accountability — using "an AI wrote it" as an excuse for an outage

Next-Post Teaser

The next article is "A Testing Strategy for the AI Era — How to Make Agent-Written Tests Trustworthy." This article said "the tests the AI wrote are themselves subject to verification" — so then, how should trustworthy tests be designed? It covers how to pin down the spec first, how to verify tests with mutation testing, and when to leave tests to an agent and when not to.

In a world where generation became free, the most expensive skill is the verifying eye that knows how to say "no." That eye is engineering itself.