Split View: 데이터베이스 완전 비교 2025: PostgreSQL vs MySQL vs MongoDB vs Redis — 선택 기준과 실전 가이드

데이터베이스 완전 비교 2025: PostgreSQL vs MySQL vs MongoDB vs Redis — 선택 기준과 실전 가이드

들어가며
1. 2025년 데이터베이스 랭킹과 트렌드
- DB-Engines 랭킹 (2025년 3월 기준)
- 2025년 주요 트렌드
2. RDBMS vs NoSQL vs NewSQL — 언제 무엇을 써야 하는가
3. PostgreSQL 딥 다이브
4. MySQL 딥 다이브
5. MongoDB 딥 다이브
6. Redis 딥 다이브
7. DynamoDB — 서버리스 NoSQL의 대표
8. Cassandra / ScyllaDB — 대규모 쓰기의 왕
- Cassandra 아키텍처
- ScyllaDB — Cassandra 호환 고성능 대안
9. 성능 벤치마크 비교
- YCSB 벤치마크 (Yahoo Cloud Serving Benchmark)
- 확장성 비교
10. 유스케이스별 선택 매트릭스
- 최적 데이터베이스 선택 가이드
11. 멀티 DB 아키텍처 패턴
12. 비용 비교 (매니지드 서비스)
- 월간 비용 예상 (중간 규모 서비스: 100GB 데이터, 5000 TPS)
- 비용 최적화 전략
13. 면접 질문 20선
14. 퀴즈
참고 자료

들어가며

2025년 데이터베이스 생태계는 그 어느 때보다 다양하고 복잡합니다. AI/ML 워크로드의 폭발적 증가로 벡터 검색이 필수가 되었고, 실시간 스트리밍 처리, 멀티모델 지원, 서버리스 아키텍처 등 새로운 요구사항이 데이터베이스 선택을 더욱 어렵게 만들고 있습니다.

이 글에서는 PostgreSQL, MySQL, MongoDB, Redis를 중심으로 DynamoDB, Cassandra/ScyllaDB까지 8개 주요 데이터베이스의 아키텍처, 성능, 비용, 유스케이스를 체계적으로 비교합니다. 단순한 기능 나열이 아닌, 실제 프로덕션 환경에서의 의사결정에 도움이 되는 실전 가이드를 제공합니다.

1. 2025년 데이터베이스 랭킹과 트렌드

DB-Engines 랭킹 (2025년 3월 기준)

순위	데이터베이스	유형	점수	전년 대비
1	Oracle	RDBMS	1,245	-3.2%
2	MySQL	RDBMS	1,142	-1.8%
3	PostgreSQL	RDBMS	685	+12.4%
4	MongoDB	Document	397	+2.1%
5	Redis	Key-Value	168	-5.3%
6	Elasticsearch	Search	155	+1.7%
7	SQLite	RDBMS	142	+8.9%
8	Cassandra	Wide-Column	112	-4.1%

2025년 주요 트렌드

1. AI가 데이터베이스 선택을 바꾼다

RAG(Retrieval-Augmented Generation) 파이프라인이 보편화되면서 벡터 검색 기능이 데이터베이스 선택의 핵심 기준이 되었습니다. PostgreSQL의 pgvector, MongoDB Atlas Vector Search, Redis Stack이 이 경쟁에서 두각을 나타내고 있습니다.

2. 서버리스 데이터베이스의 부상

Neon(PostgreSQL 서버리스), PlanetScale(MySQL 서버리스), MongoDB Atlas Serverless, DynamoDB on-demand 등 사용량 기반 과금 모델이 표준이 되어가고 있습니다.

3. 멀티모델 데이터베이스

하나의 데이터베이스에서 관계형, 문서, 그래프, 시계열 데이터를 모두 처리하는 멀티모델 접근이 증가하고 있습니다. PostgreSQL이 JSONB, pgvector, PostGIS 등 확장으로 이 영역에서 가장 앞서가고 있습니다.

4. Valkey의 등장

Redis Labs의 라이선스 변경(SSPL)에 대응하여 Linux Foundation이 Valkey 포크를 출시했습니다. AWS ElastiCache, Google Cloud Memorystore 등이 Valkey를 채택하면서 Redis 생태계에 큰 변화가 일어나고 있습니다.

2. RDBMS vs NoSQL vs NewSQL — 언제 무엇을 써야 하는가

CAP 정리의 실전 이해

분산 시스템에서 Consistency(일관성), Availability(가용성), Partition Tolerance(분할 내성) 세 가지를 동시에 완벽하게 달성할 수 없다는 이론입니다.

         Consistency (C)
            /\
           /  \
          /    \
    CP   / CAP  \ CA
   시스템 / 불가능 \ 시스템
        /________\
  Partition(P)   Availability(A)

유형	조합	데이터베이스	특징
CA	일관성 + 가용성	전통 RDBMS (단일 노드)	네트워크 파티션 시 불가용
CP	일관성 + 분할내성	MongoDB, HBase	가용성 일부 포기
AP	가용성 + 분할내성	Cassandra, DynamoDB	최종적 일관성

RDBMS를 선택해야 할 때

트랜잭션 무결성이 필수 (금융, 결제, 재고 관리)
복잡한 JOIN과 집계 쿼리가 빈번
데이터 스키마가 명확하고 안정적
ACID 준수가 법적/규제적 요구사항

NoSQL을 선택해야 할 때

스키마가 자주 변경되거나 비정형 데이터
수평 확장(horizontal scaling)이 핵심 요구사항
초저지연(sub-millisecond) 읽기/쓰기 필요
대용량 데이터의 빠른 쓰기 처리

NewSQL을 선택해야 할 때

CockroachDB, TiDB, YugabyteDB 같은 NewSQL 데이터베이스는 RDBMS의 트랜잭션 보장과 NoSQL의 수평 확장성을 모두 제공합니다.

글로벌 분산이 필요한 트랜잭션 시스템
RDBMS 호환성을 유지하면서 수평 확장이 필요
다만, 성숙도와 커뮤니티 크기에서 전통 RDBMS에 비해 아직 부족

3. PostgreSQL 딥 다이브

아키텍처 개요

PostgreSQL은 프로세스 기반 아키텍처를 사용합니다. 각 클라이언트 연결마다 별도의 백엔드 프로세스가 할당됩니다.

┌─────────────────────────────────────────────┐
│                 PostgreSQL                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Backend  │  │ Backend  │  │ Backend  │  │
│  │ Process  │  │ Process  │  │ Process  │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │         │
│  ┌────┴─────────────┴─────────────┴────┐    │
│  │          Shared Memory              │    │
│  │  ┌──────────┐  ┌──────────────┐    │    │
│  │  │ Shared   │  │    WAL       │    │    │
│  │  │ Buffers  │  │  Buffers     │    │    │
│  │  └──────────┘  └──────────────┘    │    │
│  └─────────────────────────────────────┘    │
│                                             │
│  ┌─────────────────────────────────────┐    │
│  │     Background Workers              │    │
│  │  WAL Writer | Checkpointer |        │    │
│  │  Autovacuum | Stats Collector       │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

MVCC (Multi-Version Concurrency Control)

PostgreSQL의 MVCC는 각 행의 여러 버전을 유지하여 읽기/쓰기 간 잠금 충돌을 최소화합니다.

-- 트랜잭션 격리 수준 설정
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- MVCC 동작 확인
BEGIN;
SELECT xmin, xmax, ctid, * FROM orders WHERE id = 1;
-- xmin: 생성 트랜잭션 ID, xmax: 삭제 트랜잭션 ID
-- ctid: 물리적 위치 (page, offset)
COMMIT;

JSONB — 관계형 + 문서형의 최강 조합

-- JSONB 컬럼 생성
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- GIN 인덱스 생성 (JSONB 검색 최적화)
CREATE INDEX idx_products_metadata ON products USING GIN (metadata);

-- JSONB 쿼리 예시
SELECT name, metadata->>'category' AS category,
       (metadata->'specs'->>'weight')::numeric AS weight
FROM products
WHERE metadata @> '{"category": "electronics"}'
  AND (metadata->'specs'->>'weight')::numeric < 500;

-- JSONB 집계
SELECT metadata->>'category' AS category,
       COUNT(*),
       AVG((metadata->'price'->>'amount')::numeric) AS avg_price
FROM products
GROUP BY metadata->>'category';

pgvector — AI/RAG 시대의 게임 체인저

-- pgvector 확장 설치
CREATE EXTENSION IF NOT EXISTS vector;

-- 벡터 컬럼이 있는 테이블
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536)  -- OpenAI text-embedding-3-small 차원
);

-- HNSW 인덱스 (빠른 근사 최근접 이웃 검색)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat 인덱스 (메모리 효율적)
CREATE INDEX ON documents
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);

-- 코사인 유사도 기반 검색
SELECT id, title,
       1 - (embedding <=> query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 10;

-- 하이브리드 검색 (벡터 + 전문 검색)
SELECT id, title,
       ts_rank(to_tsvector('korean', content), plainto_tsquery('korean', '검색어')) AS text_score,
       1 - (embedding <=> query_embedding) AS vector_score
FROM documents
WHERE to_tsvector('korean', content) @@ plainto_tsquery('korean', '검색어')
ORDER BY vector_score * 0.7 + text_score * 0.3 DESC
LIMIT 10;

pg_trgm — 퍼지 검색과 유사도 매칭

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- 트라이그램 인덱스
CREATE INDEX idx_products_name_trgm ON products
USING GIN (name gin_trgm_ops);

-- 유사도 검색 (오타 허용)
SELECT name, similarity(name, 'PostgreSLQ') AS sim
FROM products
WHERE name % 'PostgreSLQ'
ORDER BY sim DESC;

파티셔닝

-- 범위 파티셔닝 예시
CREATE TABLE orders (
    id BIGSERIAL,
    order_date DATE NOT NULL,
    customer_id INTEGER NOT NULL,
    total_amount NUMERIC(10,2),
    PRIMARY KEY (id, order_date)
) PARTITION BY RANGE (order_date);

-- 파티션 생성
CREATE TABLE orders_2025_q1 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
CREATE TABLE orders_2025_q2 PARTITION OF orders
    FOR VALUES FROM ('2025-04-01') TO ('2025-07-01');

-- 자동 파티션 관리 (pg_partman)
CREATE EXTENSION pg_partman;
SELECT partman.create_parent(
    p_parent_table => 'public.orders',
    p_control => 'order_date',
    p_type => 'native',
    p_interval => '3 months'
);

PostgreSQL 17 주요 기능

Incremental Backup: pg_basebackup에서 증분 백업 지원
SQL/JSON 표준 지원 강화: JSON_TABLE, JSON_QUERY 등
성능 향상: 병렬 해시 조인 개선, 벌크 로딩 최적화
보안 강화: LDAP/SCRAM 인증 개선

4. MySQL 딥 다이브

InnoDB 아키텍처

┌─────────────────────────────────────────────┐
│                MySQL Server                  │
│  ┌───────────────────────────────────────┐  │
│  │          Connection Pool              │  │
│  │   Thread 1 | Thread 2 | Thread N     │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────────────────────────────────┐  │
│  │          Query Cache (제거됨 8.0+)     │  │
│  │          Parser -> Optimizer          │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────────────────────────────────┐  │
│  │            InnoDB Engine              │  │
│  │  ┌──────────┐  ┌──────────────────┐  │  │
│  │  │ Buffer   │  │  Change Buffer   │  │  │
│  │  │ Pool     │  │                  │  │  │
│  │  └──────────┘  └──────────────────┘  │  │
│  │  ┌──────────┐  ┌──────────────────┐  │  │
│  │  │ Redo Log │  │  Undo Log        │  │  │
│  │  └──────────┘  └──────────────────┘  │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘

MySQL은 스레드 기반 아키텍처로, PostgreSQL의 프로세스 기반과 대비됩니다. 이로 인해 연결 오버헤드가 낮고 메모리 효율이 높습니다.

MySQL vs PostgreSQL 핵심 차이

-- MySQL: UPSERT (INSERT ... ON DUPLICATE KEY)
INSERT INTO products (id, name, stock)
VALUES (1, 'Widget', 100)
ON DUPLICATE KEY UPDATE stock = stock + VALUES(stock);

-- PostgreSQL: UPSERT (INSERT ... ON CONFLICT)
INSERT INTO products (id, name, stock)
VALUES (1, 'Widget', 100)
ON CONFLICT (id)
DO UPDATE SET stock = products.stock + EXCLUDED.stock;

HeatWave ML — MySQL에서 직접 ML

-- HeatWave ML로 모델 학습
CALL sys.ML_TRAIN('my_schema.customer_data',
    'churn_label',
    JSON_OBJECT('task', 'classification'),
    @model_handle);

-- 예측 실행
CALL sys.ML_PREDICT_ROW(
    JSON_OBJECT('age', 35, 'tenure', 24, 'monthly_charge', 79.99),
    @model_handle,
    @prediction);

SELECT @prediction;

Group Replication (그룹 복제)

MySQL Group Replication은 Paxos 기반 합의 프로토콜을 사용하여 고가용성을 제공합니다.

-- Group Replication 상태 확인
SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;

-- 단일 프라이머리 모드 (기본, 권장)
-- 하나의 쓰기 노드, 나머지는 읽기 전용
SET GLOBAL group_replication_single_primary_mode = ON;

ProxySQL로 읽기/쓰기 분리

-- ProxySQL 라우팅 룰 설정
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup)
VALUES
    (1, 1, '^SELECT .* FOR UPDATE', 0),   -- 쓰기 그룹
    (2, 1, '^SELECT', 1);                  -- 읽기 그룹

-- 서버 그룹 설정
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES
    (0, 'primary.mysql.local', 3306, 1),    -- 쓰기
    (1, 'replica1.mysql.local', 3306, 50),  -- 읽기
    (1, 'replica2.mysql.local', 3306, 50);  -- 읽기

MySQL 9.0 주요 기능

JavaScript Stored Programs: GraalVM 기반 JavaScript 저장 프로시저
Vector Data Type: 벡터 데이터 타입 네이티브 지원
EXPLAIN 개선: FORMAT=TREE 기본값, 더 직관적인 실행 계획
보안: OpenID Connect 인증 지원

5. MongoDB 딥 다이브

문서 모델의 강점

// MongoDB 문서 구조 — 임베딩 패턴
db.orders.insertOne({
  orderId: 'ORD-2025-001',
  customer: {
    name: '김개발',
    email: 'kim@example.com',
    address: {
      street: '강남대로 123',
      city: '서울',
      zipCode: '06123',
    },
  },
  items: [
    {
      productId: 'PROD-001',
      name: '무선 키보드',
      price: 89000,
      quantity: 2,
    },
    {
      productId: 'PROD-002',
      name: '모니터 스탠드',
      price: 45000,
      quantity: 1,
    },
  ],
  payment: {
    method: 'card',
    status: 'completed',
    paidAt: ISODate('2025-03-15T10:30:00Z'),
  },
  totalAmount: 223000,
  createdAt: ISODate('2025-03-15T10:25:00Z'),
})

Atlas Vector Search

// 벡터 검색 인덱스 생성
db.collection.createSearchIndex({
  name: 'vector_index',
  definition: {
    mappings: {
      dynamic: true,
      fields: {
        embedding: {
          type: 'knnVector',
          dimensions: 1536,
          similarity: 'cosine',
        },
      },
    },
  },
})

// 벡터 검색 쿼리
db.documents.aggregate([
  {
    $vectorSearch: {
      index: 'vector_index',
      path: 'embedding',
      queryVector: queryEmbedding, // 1536차원 벡터
      numCandidates: 100,
      limit: 10,
      filter: { category: 'tech' },
    },
  },
  {
    $project: {
      title: 1,
      content: 1,
      score: { $meta: 'vectorSearchScore' },
    },
  },
])

Aggregation Pipeline 고급 예시

// 월별 매출 분석 파이프라인
db.orders.aggregate([
  // 기간 필터
  {
    $match: {
      createdAt: {
        $gte: ISODate('2025-01-01'),
        $lt: ISODate('2026-01-01'),
      },
      'payment.status': 'completed',
    },
  },
  // 주문 항목 분해
  { $unwind: '$items' },
  // 월별 그룹핑
  {
    $group: {
      _id: {
        year: { $year: '$createdAt' },
        month: { $month: '$createdAt' },
        category: '$items.category',
      },
      totalRevenue: { $sum: { $multiply: ['$items.price', '$items.quantity'] } },
      orderCount: { $sum: 1 },
      avgOrderValue: { $avg: { $multiply: ['$items.price', '$items.quantity'] } },
    },
  },
  // 정렬
  { $sort: { '_id.year': 1, '_id.month': 1 } },
  // 출력 형태 변환
  {
    $project: {
      _id: 0,
      period: { $concat: [{ $toString: '$_id.year' }, '-', { $toString: '$_id.month' }] },
      category: '$_id.category',
      totalRevenue: 1,
      orderCount: 1,
      avgOrderValue: { $round: ['$avgOrderValue', 0] },
    },
  },
])

MongoDB 샤딩

// 샤드 키 선택 — 카디널리티, 분산도, 쿼리 패턴 고려
sh.enableSharding('ecommerce')

// 해시 샤딩 (균등 분산)
sh.shardCollection('ecommerce.orders', { customerId: 'hashed' })

// 범위 샤딩 (범위 쿼리 최적화)
sh.shardCollection('ecommerce.logs', { timestamp: 1 })

// 존 샤딩 (지역 기반 데이터 분리)
sh.addShardTag('shard-kr', 'ASIA')
sh.addShardTag('shard-us', 'AMERICAS')
sh.addTagRange('ecommerce.users', { region: 'KR' }, { region: 'KS' }, 'ASIA')

MongoDB 8.0 주요 기능

Queryable Encryption: 암호화된 상태에서 쿼리 가능
Cluster-to-Cluster Sync: 멀티 클라우드 동기화
Time Series Collection 개선: 더 효율적인 시계열 데이터 처리
성능: 읽기/쓰기 성능 20-35% 향상

6. Redis 딥 다이브

데이터 구조와 활용

-- Strings: 캐시, 세션, 카운터
SET user:session:abc123 "session_data" EX 3600
INCR page:views:homepage

-- Hashes: 객체 저장
HSET user:1001 name "김개발" email "kim@dev.com" login_count 42
HINCRBY user:1001 login_count 1

-- Sorted Sets: 리더보드, 랭킹
ZADD leaderboard 9500 "player:alice"
ZADD leaderboard 8700 "player:bob"
ZADD leaderboard 9200 "player:charlie"
ZREVRANGE leaderboard 0 9 WITHSCORES  -- Top 10

-- Streams: 이벤트 스트리밍
XADD mystream * sensor_id "temp-01" temperature "23.5" humidity "45"
XREAD COUNT 10 BLOCK 5000 STREAMS mystream 0

-- HyperLogLog: 고유 카운트 (근사치)
PFADD unique:visitors:2025-03-23 "user1" "user2" "user3"
PFCOUNT unique:visitors:2025-03-23

Redis Stack — Search, JSON, TimeSeries

-- Redis JSON
JSON.SET product:001 $ '{"name":"무선 키보드","price":89000,"tags":["electronics","keyboard"],"specs":{"weight":350,"wireless":true}}'
JSON.GET product:001 $.specs.weight

-- Redis Search 인덱스 생성
FT.CREATE idx:products ON JSON PREFIX 1 product:
  SCHEMA
    $.name AS name TEXT SORTABLE
    $.price AS price NUMERIC SORTABLE
    $.tags[*] AS tags TAG
    $.specs.weight AS weight NUMERIC

-- 전문 검색
FT.SEARCH idx:products "@name:키보드 @price:[50000 100000]"

-- 벡터 검색 (Redis Stack)
FT.CREATE idx:docs ON HASH PREFIX 1 doc:
  SCHEMA
    title TEXT
    embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE

-- KNN 검색
FT.SEARCH idx:docs "*=>[KNN 10 @embedding $query_vec AS score]"
  PARAMS 2 query_vec "\x00\x00..."
  SORTBY score
  RETURN 2 title score

-- Redis TimeSeries
TS.CREATE temperature:sensor1 RETENTION 86400000 LABELS sensor_id temp-01 location office
TS.ADD temperature:sensor1 * 23.5
TS.RANGE temperature:sensor1 - + AGGREGATION avg 60000  -- 1분 평균

영속성: RDB vs AOF

특성	RDB (스냅샷)	AOF (Append Only File)
동작 방식	주기적 메모리 덤프	모든 쓰기 명령 로깅
데이터 손실	마지막 스냅샷 이후	fsync 정책에 따라 다름
복구 속도	빠름 (바이너리 로드)	느림 (명령 재실행)
파일 크기	작음 (압축)	큼 (모든 명령 기록)
CPU 영향	fork() 시 높음	everysec 시 낮음

# redis.conf 권장 설정
save 900 1       # 900초 동안 1개 이상 변경 시 RDB
save 300 10      # 300초 동안 10개 이상 변경 시 RDB
appendonly yes   # AOF 활성화
appendfsync everysec  # 1초마다 fsync (성능/안전 균형)

Redis Cluster

# 6노드 클러스터 생성 (3 마스터 + 3 레플리카)
redis-cli --cluster create \
  node1:6379 node2:6379 node3:6379 \
  node4:6379 node5:6379 node6:6379 \
  --cluster-replicas 1

Redis Cluster는 16384개의 해시 슬롯을 사용하여 키를 분산합니다. 키 공간은 CRC16 해시를 사용하여 슬롯에 매핑됩니다.

Valkey 포크 — Redis의 미래?

2024년 Redis Labs가 SSPL 라이선스로 전환한 후, Linux Foundation 주도로 Valkey가 탄생했습니다.

항목	Redis	Valkey
라이선스	SSPL (2024~)	BSD-3-Clause
지원 클라우드	Redis Cloud	AWS, GCP, Azure
호환성	원본	Redis 7.2 호환
개발 주체	Redis Ltd.	Linux Foundation
커뮤니티	제한적 기여	오픈 기여

7. DynamoDB — 서버리스 NoSQL의 대표

핵심 개념

┌───────────────────────────────────────────┐
│              DynamoDB Table                │
│  ┌─────────────────────────────────────┐  │
│  │  Partition Key (PK)  Sort Key (SK)  │  │
│  ├─────────────────────────────────────┤  │
│  │  USER#001            PROFILE        │  │
│  │  USER#001            ORDER#001      │  │
│  │  USER#001            ORDER#002      │  │
│  │  USER#002            PROFILE        │  │
│  │  USER#002            ORDER#001      │  │
│  └─────────────────────────────────────┘  │
│                                           │
│  GSI: OrderDate-Index                     │
│  ┌─────────────────────────────────────┐  │
│  │  OrderDate (PK)     UserId (SK)     │  │
│  └─────────────────────────────────────┘  │
└───────────────────────────────────────────┘

Single-Table Design 패턴

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('EcommerceTable')

# 사용자 프로필 저장
table.put_item(Item={
    'PK': 'USER#001',
    'SK': 'PROFILE',
    'name': '김개발',
    'email': 'kim@dev.com',
    'type': 'USER'
})

# 같은 테이블에 주문 저장
table.put_item(Item={
    'PK': 'USER#001',
    'SK': 'ORDER#2025-03-15#ORD001',
    'items': [{'name': '키보드', 'price': 89000}],
    'total': 89000,
    'status': 'completed',
    'type': 'ORDER'
})

# 사용자의 모든 주문 조회 (Query)
response = table.query(
    KeyConditionExpression='PK = :pk AND begins_with(SK, :sk)',
    ExpressionAttributeValues={
        ':pk': 'USER#001',
        ':sk': 'ORDER#'
    }
)

DynamoDB 가격 모델

모드	읽기 비용	쓰기 비용	적합한 경우
On-Demand	0.25 USD/100만 RRU	1.25 USD/100만 WRU	트래픽 예측 불가
Provisioned	0.00065 USD/RCU/시간	0.00065 USD/WCU/시간	안정적 트래픽
Reserved	최대 77% 할인	최대 77% 할인	1년/3년 약정

8. Cassandra / ScyllaDB — 대규모 쓰기의 왕

Cassandra 아키텍처

Cassandra는 Dynamo 스타일의 분산 해시 테이블과 BigTable 스타일의 데이터 모델을 결합합니다.

┌────────┐  ┌────────┐  ┌────────┐
│ Node A │──│ Node B │──│ Node C │
│ Token: │  │ Token: │  │ Token: │
│ 0-33   │  │ 34-66  │  │ 67-100 │
└───┬────┘  └───┬────┘  └───┬────┘
    │           │           │
    └───────────┼───────────┘
            Gossip Protocol

-- 키스페이스 생성 (복제 전략 포함)
CREATE KEYSPACE ecommerce WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};

-- 테이블 생성 (파티션 키 + 클러스터링 키)
CREATE TABLE ecommerce.user_activity (
    user_id UUID,
    activity_date DATE,
    activity_time TIMESTAMP,
    activity_type TEXT,
    details MAP<TEXT, TEXT>,
    PRIMARY KEY ((user_id, activity_date), activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_size': 1,
                    'compaction_window_unit': 'DAYS'};

ScyllaDB — Cassandra 호환 고성능 대안

ScyllaDB는 C++로 재작성된 Cassandra 호환 데이터베이스입니다. Shard-per-core 아키텍처로 동일 하드웨어에서 3-10배 높은 성능을 제공합니다.

비교 항목	Cassandra	ScyllaDB
언어	Java	C++ (Seastar)
아키텍처	스레드 기반	Shard-per-core
GC 영향	GC pause 존재	GC 없음
같은 하드웨어 성능	기준	3-10x
Cassandra 호환	원본	CQL 호환

9. 성능 벤치마크 비교

YCSB 벤치마크 (Yahoo Cloud Serving Benchmark)

다음은 동일 하드웨어(8 vCPU, 32GB RAM, NVMe SSD)에서의 비교입니다.

작업	PostgreSQL	MySQL	MongoDB	Redis	DynamoDB	ScyllaDB
읽기 지연 (p99)	2.1ms	1.8ms	1.5ms	0.3ms	4.2ms	0.9ms
쓰기 지연 (p99)	3.5ms	2.9ms	2.1ms	0.4ms	5.8ms	1.2ms
읽기 TPS	45K	52K	58K	250K	25K	120K
쓰기 TPS	28K	35K	42K	200K	20K	85K
혼합 (50/50) TPS	35K	42K	48K	220K	22K	100K

주의: 벤치마크 결과는 워크로드 패턴, 데이터 크기, 인덱스 구성 등에 따라 크게 달라집니다.

확장성 비교

확장 방식	PostgreSQL	MySQL	MongoDB	Redis	DynamoDB	Cassandra
수직 확장	우수	우수	좋음	좋음	N/A	좋음
읽기 확장	복제본	복제본	복제본 세트	클러스터	자동	노드 추가
쓰기 확장	제한적	제한적	샤딩	클러스터	자동	노드 추가
최대 데이터 크기	수 TB	수 TB	수 PB	수백 GB	무제한	수 PB

10. 유스케이스별 선택 매트릭스

최적 데이터베이스 선택 가이드

유스케이스	1순위	2순위	이유
이커머스	PostgreSQL	MySQL	트랜잭션 + JSONB 유연성
소셜 미디어	MongoDB	Cassandra	비정형 데이터 + 높은 쓰기
IoT 센서 데이터	ScyllaDB	DynamoDB	대용량 시계열 쓰기
실시간 분석	PostgreSQL	ClickHouse	복잡한 집계 쿼리
RAG/AI 검색	PostgreSQL (pgvector)	MongoDB (Atlas Vector)	벡터 + 전문 검색
캐시	Redis	DragonflyDB	초저지연, 풍부한 데이터 구조
세션 저장소	Redis	DynamoDB	빠른 읽기/쓰기, TTL
채팅/메시징	MongoDB	Cassandra	유연한 스키마, 시간순 정렬
게임 리더보드	Redis	DynamoDB	Sorted Set, 빠른 갱신
로그 관리	Elasticsearch	MongoDB	전문 검색, 시계열
지리 공간	PostgreSQL (PostGIS)	MongoDB	GIS 기능, 공간 인덱스
그래프 데이터	Neo4j	PostgreSQL (AGE)	관계 탐색 최적화

11. 멀티 DB 아키텍처 패턴

Polyglot Persistence (다중 데이터베이스 전략)

┌─────────────────────────────────────────────────┐
│                  Application Layer               │
└───────┬──────────┬──────────┬──────────┬────────┘
        │          │          │          │
   ┌────▼────┐ ┌──▼───┐ ┌───▼───┐ ┌───▼────┐
   │PostgreSQL│ │Redis │ │MongoDB│ │Elastic │
   │(주문/결제)│ │(캐시) │ │(상품)  │ │(검색)  │
   └─────────┘ └──────┘ └───────┘ └────────┘

CQRS 패턴 (Command Query Responsibility Segregation)

쓰기 (Command):                   읽기 (Query):
┌──────────┐                     ┌──────────┐
│  Client   │                     │  Client   │
└────┬─────┘                     └────┬─────┘
     │                                │
┌────▼─────┐                     ┌────▼─────┐
│ Command  │                     │  Query   │
│ Service  │                     │ Service  │
└────┬─────┘                     └────┬─────┘
     │                                │
┌────▼─────┐  ──이벤트──▶  ┌─────▼─────┐
│PostgreSQL│              │  Redis /   │
│ (Source  │              │  MongoDB   │
│  of Truth)│              │ (Read Model)│
└──────────┘              └───────────┘

Cache-Aside 패턴

import redis
import psycopg2
import json

r = redis.Redis(host='localhost', port=6379)

def get_product(product_id):
    # 1. 캐시 확인
    cache_key = f"product:{product_id}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # 2. 캐시 미스 — DB 조회
    conn = psycopg2.connect("dbname=ecommerce")
    cur = conn.cursor()
    cur.execute("SELECT * FROM products WHERE id = %s", (product_id,))
    product = cur.fetchone()

    if product:
        # 3. 캐시에 저장 (TTL 1시간)
        r.setex(cache_key, 3600, json.dumps(product))

    return product

def update_product(product_id, data):
    # 1. DB 업데이트
    conn = psycopg2.connect("dbname=ecommerce")
    cur = conn.cursor()
    cur.execute("UPDATE products SET name=%s WHERE id=%s",
                (data['name'], product_id))
    conn.commit()

    # 2. 캐시 무효화
    r.delete(f"product:{product_id}")

12. 비용 비교 (매니지드 서비스)

월간 비용 예상 (중간 규모 서비스: 100GB 데이터, 5000 TPS)

서비스	제공자	인스턴스	월 비용 (USD)
RDS PostgreSQL	AWS	db.r6g.xlarge	~350
Cloud SQL PostgreSQL	GCP	db-custom-4-16384	~320
RDS MySQL	AWS	db.r6g.xlarge	~350
Atlas MongoDB	MongoDB	M40	~480
ElastiCache Redis	AWS	cache.r6g.xlarge	~450
DynamoDB On-Demand	AWS	-	~180-600
Neon PostgreSQL	Neon	Scale	~69-189
PlanetScale MySQL	PlanetScale	Scaler Pro	~39-299

비용 최적화 전략

예약 인스턴스: 1-3년 약정으로 40-70% 할인
서버리스 옵션: 트래픽 변동이 큰 경우 Neon, PlanetScale, DynamoDB on-demand
읽기 전용 복제본: 읽기 부하 분산으로 메인 인스턴스 스펙 절약
적절한 인스턴스 크기: CloudWatch/모니터링으로 실제 사용률 확인 후 조정
데이터 아카이빙: 오래된 데이터를 S3/Glacier로 이전

13. 면접 질문 20선

기본 개념 (1-5)

Q1: ACID와 BASE의 차이를 설명하세요.

ACID (Atomicity, Consistency, Isolation, Durability): 전통적 RDBMS의 트랜잭션 속성. 모든 작업이 완전히 성공하거나 완전히 실패하며, 데이터 일관성을 보장합니다.

BASE (Basically Available, Soft state, Eventually consistent): NoSQL의 접근 방식. 항상 사용 가능하지만 즉각적 일관성 대신 최종적 일관성을 보장합니다.

선택 기준: 금융/결제 시스템에는 ACID, SNS 피드/좋아요 카운터 같은 곳에는 BASE가 적합합니다.

Q2: CAP 정리를 실무 예시와 함께 설명하세요.

CP 시스템 (MongoDB): 네트워크 파티션 발생 시 일관성을 위해 일부 노드의 가용성을 포기. 프라이머리 선출 중 쓰기 불가.
AP 시스템 (Cassandra, DynamoDB): 네트워크 파티션에서도 모든 노드가 응답하지만, 노드 간 데이터가 일시적으로 다를 수 있음.
CA 시스템: 네트워크 파티션이 없는 단일 노드 시스템. 분산 환경에서는 사실상 존재하지 않음.

실무 팁: PACELC 정리로 확장하면, 파티션이 없을 때(Else) 지연(Latency)과 일관성(Consistency) 사이의 트레이드오프도 고려합니다.

Q3: 인덱스의 종류와 각각의 적합한 사용 사례를 설명하세요.

B-Tree 인덱스: 범위 검색, 정렬에 최적. 대부분의 기본 인덱스. O(log N)
Hash 인덱스: 동등 비교(=)에만 사용. O(1). PostgreSQL에서는 WAL 미지원 문제로 잘 안 쓰임
GIN (Generalized Inverted Index): 배열, JSONB, 전문 검색에 적합
GiST (Generalized Search Tree): 공간 데이터(PostGIS), 범위 타입에 적합
BRIN (Block Range Index): 물리적으로 정렬된 대용량 테이블(시계열)에 효과적. 메모리 효율적
Bloom 인덱스: 많은 컬럼에 대한 동등 비교 필터링

Q4: N+1 쿼리 문제가 무엇이고 어떻게 해결하나요?

N+1 문제: 1번의 쿼리로 N개의 결과를 가져온 후, 각 결과에 대해 추가 쿼리를 N번 실행하는 비효율적 패턴.

해결 방법:

JOIN: 관계형 DB에서 한 번의 쿼리로 데이터 조인
Eager Loading: ORM에서 관련 데이터를 미리 로드 (Django: select_related, prefetch_related)
DataLoader 패턴: GraphQL에서 배치 요청으로 묶어 처리
MongoDB: 임베딩 패턴으로 관련 데이터를 하나의 문서에 저장

Q5: 데이터베이스 정규화와 비정규화의 트레이드오프를 설명하세요.

정규화 (1NF-BCNF):

장점: 데이터 중복 최소화, 일관성 유지 용이, 저장 공간 절약
단점: JOIN 증가로 읽기 성능 저하, 복잡한 쿼리

비정규화:

장점: 읽기 성능 향상 (JOIN 감소), 단순한 쿼리
단점: 데이터 중복, 갱신 이상(anomaly) 위험, 저장 공간 증가

실무 가이드: OLTP는 3NF 이상 정규화, OLAP/읽기 집중 워크로드는 의도적 비정규화 검토.

심화 문제 (6-10)

Q6: PostgreSQL의 MVCC와 MySQL InnoDB의 MVCC 차이를 설명하세요.

PostgreSQL MVCC:

각 행의 여러 버전을 테이블에 직접 저장 (xmin, xmax)
오래된 버전은 VACUUM으로 정리
장점: 읽기 시 잠금 없음, 높은 동시성
단점: 테이블 팽창(bloat), VACUUM 오버헤드

MySQL InnoDB MVCC:

Undo Log에 이전 버전 저장
현재 행에는 항상 최신 버전
Purge 스레드가 불필요한 Undo 정리
장점: 테이블 팽창 없음, 간단한 관리
단점: Undo 로그 크기 증가 시 성능 저하

Q7: 데이터베이스 샤딩 전략의 종류와 선택 기준은?

해시 샤딩: 키를 해시하여 균등 분산. 범위 쿼리 비효율적.
범위 샤딩: 키 범위로 분할. 핫스팟 위험. 시계열에 적합.
지리 기반 샤딩: 사용자 위치 기반 분할. 지역별 규제 준수에 유용.
디렉토리 기반 샤딩: 매핑 테이블로 유연한 라우팅. 관리 복잡도 높음.

샤드 키 선택 기준: 높은 카디널리티, 균등 분산, 쿼리 패턴과 일치.

Q8: Redis가 싱글 스레드인데 어떻게 빠른가요?

인메모리 처리: 모든 데이터가 RAM에 있어 디스크 I/O 없음
I/O 멀티플렉싱: epoll/kqueue로 수천 연결을 단일 스레드로 처리
잠금 불필요: 싱글 스레드이므로 동시성 제어 오버헤드 없음
최적화된 자료 구조: SDS, ziplist, intset 등 메모리 효율적 구조
Redis 6.0+: I/O 스레딩 도입 (명령 처리는 여전히 싱글 스레드)

참고: Redis 7.0+에서는 멀티스레드 I/O가 더 개선되어 네트워크 처리 성능이 향상되었습니다.

Q9: MongoDB에서 임베딩과 레퍼런싱 중 어떤 것을 선택해야 하나요?

임베딩 (하위 문서):

함께 조회되는 데이터
1:1 또는 1:소수 관계
16MB 문서 크기 제한 내
예: 주문과 주문 항목, 사용자와 주소

레퍼런싱 (별도 컬렉션):

독립적으로 조회/업데이트되는 데이터
1:다수 또는 다:다 관계
문서 크기가 빠르게 증가하는 경우
예: 블로그 글과 댓글, 상품과 리뷰

하이브리드 패턴: 자주 쓰는 데이터는 임베딩, 나머지는 레퍼런싱.

Q10: 커넥션 풀링이 왜 중요하고 어떻게 설정하나요?

DB 연결 생성 비용: TCP 핸드셰이크 + 인증 + 세션 할당 등으로 수~수십 ms 소요. 매 요청마다 연결을 생성하면 성능이 크게 저하됩니다.

설정 예시 (HikariCP):

maximumPoolSize: 일반적으로 CPU 코어 수 * 2 + 디스크 수
minimumIdle: 최소 유휴 연결 수
connectionTimeout: 풀에서 연결을 기다리는 최대 시간
idleTimeout: 유휴 연결 유지 시간

PostgreSQL은 프로세스 기반이므로 PgBouncer를 앞에 두어 연결 풀링을 관리하는 것이 모범 사례입니다.

실전/시스템 설계 (11-15)

Q11: 대규모 트래픽 전자상거래 서비스의 DB 아키텍처를 설계하세요.

상품 카탈로그: MongoDB (유연한 속성, 빠른 읽기)
주문/결제: PostgreSQL (ACID 트랜잭션)
장바구니/세션: Redis (TTL, 빠른 읽기/쓰기)
상품 검색: Elasticsearch (전문 검색, 패싯)
추천 시스템: PostgreSQL pgvector (벡터 유사도)
분석/리포팅: ClickHouse 또는 BigQuery

캐싱 전략: Redis에서 Cache-Aside 패턴으로 상품 정보 캐싱. CDN으로 정적 콘텐츠 캐싱.

Q12: 데이터베이스 마이그레이션 전략을 설명하세요.

Dual-Write 패턴: 양쪽 DB에 동시 쓰기, 점진적 읽기 전환
CDC (Change Data Capture): Debezium 등으로 실시간 데이터 동기화
Shadow Traffic: 새 DB에 트래픽 미러링하여 성능 검증
Blue-Green: 두 환경을 준비하고 한 번에 전환

핵심 포인트: 롤백 계획 필수, 데이터 일관성 검증 자동화, 점진적 트래픽 이전.

Q13: 읽기 복제본과 쓰기 마스터의 복제 지연(replication lag) 문제를 어떻게 해결하나요?

Read-your-writes 일관성: 쓰기 후 짧은 시간 동안 마스터에서 읽기
최소 복제 확인: PostgreSQL의 synchronous_commit 설정
지연 모니터링: pg_stat_replication의 replay_lag 체크
GTID 기반 라우팅: MySQL GTID로 특정 시점까지 반영된 복제본으로 라우팅
캐시 활용: 쓰기 직후 캐시에 최신 값 저장, 복제본 대신 캐시에서 읽기

Q14: 대용량 테이블에서 스키마 변경을 어떻게 안전하게 수행하나요?

PostgreSQL:

ALTER TABLE ADD COLUMN (DEFAULT 없이): 즉시 완료
ALTER TABLE ADD COLUMN DEFAULT value: pg 11+ 즉시 완료
CREATE INDEX CONCURRENTLY: 테이블 잠금 없이 인덱스 생성

MySQL:

pt-online-schema-change (Percona): 트리거 기반 온라인 변경
gh-ost (GitHub): 바이너리 로그 기반 온라인 변경
MySQL 8.0+ INSTANT DDL: 메타데이터만 변경하는 작업 지원

MongoDB:

스키마리스이므로 대부분 마이그레이션 불필요
인덱스 생성: createIndex (백그라운드 빌드 기본)

Q15: DB 성능 모니터링에서 가장 중요한 지표는?

쿼리 성능: 슬로우 쿼리 로그, p95/p99 응답 시간
연결 수: 활성/유휴 연결, 커넥션 풀 사용률
캐시 히트율: Buffer Pool/Shared Buffer 히트율 (목표: 99% 이상)
복제 지연: 마스터-레플리카 간 지연 시간
디스크 I/O: IOPS, 대기 시간, 사용률
잠금 대기: 행 잠금/테이블 잠금 대기 시간과 빈도
트랜잭션 처리량: TPS, 커밋/롤백 비율

도구: pg_stat_statements(PostgreSQL), Performance Schema(MySQL), Atlas(MongoDB), Redis INFO.

심화 실전 문제 (16-20)

Q16: Hot Partition 문제란 무엇이고 어떻게 해결하나요?

특정 파티션/샤드에 트래픽이 집중되어 성능 병목이 발생하는 문제입니다.

DynamoDB 예시: 특정 파티션 키에 쓰기가 집중되면 해당 파티션만 스로틀링 발생.

해결 방법:

Write Sharding: 키에 랜덤 접미사 추가 (예: USER#001#3)
Composite Key: 날짜/시간 등을 키에 포함하여 분산
캐싱 레이어: 읽기 핫스팟은 Redis로 흡수
분산 카운터: 하나의 카운터를 N개로 분할하여 합산

Q17: 데이터베이스 백업 전략의 종류와 RPO/RTO를 설명하세요.

RPO (Recovery Point Objective): 허용 가능한 최대 데이터 손실 시간
RTO (Recovery Time Objective): 허용 가능한 최대 복구 시간

백업 종류:

Full Backup: 전체 데이터. RTO 짧음, 저장 공간 큼
Incremental Backup: 마지막 백업 이후 변경분. 저장 공간 적음, 복구 복잡
Continuous Archiving (WAL): PostgreSQL WAL 아카이빙으로 특정 시점 복구(PITR) 가능. RPO 거의 0
스냅샷: EBS 스냅샷, RDS 자동 백업. 빠른 복구

Q18: Slow Query를 분석하고 최적화하는 프로세스를 설명하세요.

식별: 슬로우 쿼리 로그 활성화 (MySQL: slow_query_log, PostgreSQL: log_min_duration_statement)
분석: EXPLAIN ANALYZE로 실행 계획 확인
최적화 순서:
- 인덱스 추가/개선
- 쿼리 리팩토링 (불필요한 JOIN 제거, 서브쿼리를 CTE로)
- 파티셔닝 적용
- 데이터 아카이빙
- 비정규화 검토 (읽기 최적화)
검증: 변경 전후 성능 비교
모니터링: pg_stat_statements, Performance Schema 지속 관찰

Q19: Eventually Consistent 시스템에서 데이터 정합성을 어떻게 보장하나요?

Read Repair: 읽기 시 불일치 발견하면 최신 값으로 수정
Anti-Entropy (Merkle Tree): 백그라운드에서 노드 간 데이터 비교/수정
Conflict Resolution: 최종 쓰기 승리(LWW), 벡터 클록, 커스텀 병합 로직
Quorum 읽기/쓰기: R + W > N 조건으로 강한 일관성 확보
Saga 패턴: 분산 트랜잭션 대신 보상 트랜잭션으로 일관성 유지

Q20: 2025년 DB 선택 시 가장 중요하게 고려해야 할 요소 3가지는?

AI/벡터 검색 지원: RAG 파이프라인이 표준이 되면서, 벡터 검색 기능이 필수. pgvector, Atlas Vector Search, Redis Vector Search 중 선택.
운영 복잡도 vs 비용: 매니지드 서비스(RDS, Atlas, ElastiCache)의 편리함 vs 셀프 호스팅의 비용 절감. 팀 규모와 DevOps 역량에 따라 결정.
데이터 지역성과 규제: GDPR, 한국 개인정보보호법 등 데이터 주권 규제. DynamoDB Global Tables, CockroachDB의 지역 기반 분산이 해답이 될 수 있음.

14. 퀴즈

퀴즈 1: PostgreSQL에서 JSONB와 JSON 타입의 주요 차이점 2가지는?

저장 방식: JSON은 텍스트 그대로 저장, JSONB는 바이너리 분해하여 저장. JSONB가 쓰기는 약간 느리지만 읽기/검색이 훨씬 빠름.
인덱싱: JSONB는 GIN 인덱스를 지원하여 키/값 검색이 O(log N). JSON은 인덱스 불가능.

추가: JSONB는 키 순서를 보장하지 않고 중복 키를 제거합니다.

퀴즈 2: Redis의 Sorted Set이 내부적으로 사용하는 자료구조는?

Skip List + Hash Table 조합입니다.

Skip List: 정렬된 순서로 범위 검색 O(log N)
Hash Table: 점수 조회 O(1)

원소가 128개 이하이고 값이 64바이트 이하이면 ziplist(7.0+에서는 listpack)를 사용하여 메모리를 절약합니다.

퀴즈 3: DynamoDB에서 GSI(Global Secondary Index)와 LSI(Local Secondary Index)의 차이는?

항목	GSI	LSI
파티션 키	테이블과 다를 수 있음	테이블과 동일
정렬 키	자유롭게 선택	자유롭게 선택
용량 제한	없음	파티션당 10GB
생성 시점	테이블 생성 후에도 가능	테이블 생성 시에만
일관성	최종적 일관성만	강한 일관성 가능
프로비저닝	별도 RCU/WCU	테이블 공유

실무 팁: 대부분 GSI가 유연하므로 GSI를 우선 사용합니다.

퀴즈 4: MongoDB에서 트랜잭션을 언제 사용하고 언제 피해야 하나요?

사용해야 할 때:

여러 문서/컬렉션에 걸친 원자적 업데이트 (결제와 재고 동시 변경)
데이터 일관성이 비즈니스적으로 필수

피해야 할 때:

단일 문서 업데이트 (MongoDB는 단일 문서 수준에서 이미 원자적)
높은 처리량이 필요한 작업 (트랜잭션은 성능 오버헤드 발생)
샤딩된 환경에서 빈번한 멀티 샤드 트랜잭션

설계 원칙: 가능한 한 단일 문서에 관련 데이터를 임베딩하여 트랜잭션 필요성을 줄이세요.

퀴즈 5: VACUUM이 PostgreSQL에서 왜 중요한가요?

PostgreSQL의 MVCC는 업데이트/삭제 시 이전 버전을 즉시 제거하지 않습니다. VACUUM은 이 "죽은 튜플(dead tuples)"을 정리합니다.

하지 않으면:

테이블 팽창: 디스크 공간 낭비, 순차 스캔 성능 저하
인덱스 팽창: 인덱스 크기 증가, 검색 성능 저하
트랜잭션 ID 래핑: 약 20억 트랜잭션 후 데이터 손실 위험 (XID wraparound)

Autovacuum이 기본 활성화되어 있지만, 쓰기 부하가 높은 테이블은 Autovacuum 파라미터 튜닝이 필요합니다.

참고 자료

PostgreSQL 17 공식 문서: https://www.postgresql.org/docs/17/
MySQL 9.0 공식 문서: https://dev.mysql.com/doc/refman/9.0/en/
MongoDB 8.0 공식 문서: https://www.mongodb.com/docs/manual/
Redis 공식 문서: https://redis.io/docs/
Valkey 프로젝트: https://valkey.io/
DynamoDB 개발자 가이드: https://docs.aws.amazon.com/dynamodb/
ScyllaDB 문서: https://opensource.docs.scylladb.com/
DB-Engines 랭킹: https://db-engines.com/en/ranking
YCSB 벤치마크: https://github.com/brianfrankcooper/YCSB
"Designing Data-Intensive Applications" — Martin Kleppmann

이 글에서 다룬 8개 데이터베이스는 각각 고유한 강점을 가지고 있습니다. "최고의 데이터베이스"는 없으며, "최적의 데이터베이스"만 있을 뿐입니다. 워크로드 패턴, 확장 요구사항, 팀 역량, 비용 제약을 종합적으로 고려하여 선택하세요. 그리고 하나의 데이터베이스에 모든 것을 맡기려 하지 말고, Polyglot Persistence 전략으로 각 데이터베이스의 강점을 조합하는 것이 2025년의 모범 사례입니다.

Database Complete Comparison 2025: PostgreSQL vs MySQL vs MongoDB vs Redis — Selection Guide

Introduction
1. 2025 Database Rankings and Trends
- DB-Engines Rankings (March 2025)
- Key Trends in 2025
2. RDBMS vs NoSQL vs NewSQL — When to Use What
3. PostgreSQL Deep Dive
4. MySQL Deep Dive
5. MongoDB Deep Dive
6. Redis Deep Dive
7. DynamoDB — The Serverless NoSQL Standard
8. Cassandra / ScyllaDB — The Kings of High-Volume Writes
- Cassandra Architecture
- ScyllaDB — High-Performance Cassandra-Compatible Alternative
9. Performance Benchmark Comparison
- YCSB Benchmark (Yahoo Cloud Serving Benchmark)
- Scalability Comparison
10. Use Case Selection Matrix
- Optimal Database Selection Guide
11. Multi-DB Architecture Patterns
12. Cost Comparison (Managed Services)
- Monthly Cost Estimate (Medium-Scale Service: 100GB Data, 5000 TPS)
- Cost Optimization Strategies
13. Interview Questions — 20 Essential Questions
14. Quiz
References

Introduction

The database ecosystem in 2025 is more diverse and complex than ever before. The explosive growth of AI/ML workloads has made vector search essential, and new requirements like real-time streaming processing, multi-model support, and serverless architectures are making database selection increasingly challenging.

In this article, we systematically compare the architecture, performance, cost, and use cases of 8 major databases, centered on PostgreSQL, MySQL, MongoDB, and Redis, extending to DynamoDB and Cassandra/ScyllaDB. Rather than a simple feature listing, we provide a practical guide to assist real-world production decision-making.

1. 2025 Database Rankings and Trends

DB-Engines Rankings (March 2025)

Rank	Database	Type	Score	YoY Change
1	Oracle	RDBMS	1,245	-3.2%
2	MySQL	RDBMS	1,142	-1.8%
3	PostgreSQL	RDBMS	685	+12.4%
4	MongoDB	Document	397	+2.1%
5	Redis	Key-Value	168	-5.3%
6	Elasticsearch	Search	155	+1.7%
7	SQLite	RDBMS	142	+8.9%
8	Cassandra	Wide-Column	112	-4.1%

Key Trends in 2025

1. AI is Changing Database Selection

As RAG (Retrieval-Augmented Generation) pipelines become ubiquitous, vector search capabilities have become a core criterion for database selection. PostgreSQL's pgvector, MongoDB Atlas Vector Search, and Redis Stack are leading this competition.

2. The Rise of Serverless Databases

Usage-based pricing models like Neon (PostgreSQL serverless), PlanetScale (MySQL serverless), MongoDB Atlas Serverless, and DynamoDB on-demand are becoming the standard.

3. Multi-Model Databases

The multi-model approach of handling relational, document, graph, and time-series data within a single database is growing. PostgreSQL leads this space with extensions like JSONB, pgvector, and PostGIS.

4. The Emergence of Valkey

In response to Redis Labs' license change (SSPL), the Linux Foundation released the Valkey fork. With AWS ElastiCache, Google Cloud Memorystore, and others adopting Valkey, significant changes are underway in the Redis ecosystem.

2. RDBMS vs NoSQL vs NewSQL — When to Use What

Understanding the CAP Theorem in Practice

This theorem states that in distributed systems, you cannot simultaneously achieve perfect Consistency, Availability, and Partition Tolerance.

         Consistency (C)
            /\
           /  \
          /    \
    CP   / CAP  \ CA
  system/ impossi\ system
        /ble     \
       /________\
  Partition(P)   Availability(A)

Type	Combination	Databases	Characteristics
CA	Consistency + Availability	Traditional RDBMS (single node)	Unavailable during network partitions
CP	Consistency + Partition Tolerance	MongoDB, HBase	Sacrifices some availability
AP	Availability + Partition Tolerance	Cassandra, DynamoDB	Eventual consistency

When to Choose RDBMS

Transaction integrity is mandatory (finance, payments, inventory management)
Complex JOINs and aggregate queries are frequent
Data schema is clear and stable
ACID compliance is a legal/regulatory requirement

When to Choose NoSQL

Schema changes frequently or data is unstructured
Horizontal scaling is a core requirement
Sub-millisecond read/write latency needed
High-volume fast write processing

When to Choose NewSQL

NewSQL databases like CockroachDB, TiDB, and YugabyteDB provide both RDBMS transaction guarantees and NoSQL horizontal scalability.

Globally distributed transactional systems
Need horizontal scaling while maintaining RDBMS compatibility
However, maturity and community size still lag behind traditional RDBMS

3. PostgreSQL Deep Dive

Architecture Overview

PostgreSQL uses a process-based architecture. Each client connection gets a dedicated backend process.

┌─────────────────────────────────────────────┐
│                 PostgreSQL                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Backend  │  │ Backend  │  │ Backend  │  │
│  │ Process  │  │ Process  │  │ Process  │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │         │
│  ┌────┴─────────────┴─────────────┴────┐    │
│  │          Shared Memory              │    │
│  │  ┌──────────┐  ┌──────────────┐    │    │
│  │  │ Shared   │  │    WAL       │    │    │
│  │  │ Buffers  │  │  Buffers     │    │    │
│  │  └──────────┘  └──────────────┘    │    │
│  └─────────────────────────────────────┘    │
│                                             │
│  ┌─────────────────────────────────────┐    │
│  │     Background Workers              │    │
│  │  WAL Writer | Checkpointer |        │    │
│  │  Autovacuum | Stats Collector       │    │
│  └─────────────────────────────────────┘    │
└─────────────────────────────────────────────┘

MVCC (Multi-Version Concurrency Control)

PostgreSQL's MVCC maintains multiple versions of each row to minimize lock conflicts between readers and writers.

-- Set transaction isolation level
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- Verify MVCC behavior
BEGIN;
SELECT xmin, xmax, ctid, * FROM orders WHERE id = 1;
-- xmin: creation transaction ID, xmax: deletion transaction ID
-- ctid: physical location (page, offset)
COMMIT;

JSONB — The Ultimate Relational + Document Combination

-- Create table with JSONB column
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create GIN index (optimizes JSONB search)
CREATE INDEX idx_products_metadata ON products USING GIN (metadata);

-- JSONB query examples
SELECT name, metadata->>'category' AS category,
       (metadata->'specs'->>'weight')::numeric AS weight
FROM products
WHERE metadata @> '{"category": "electronics"}'
  AND (metadata->'specs'->>'weight')::numeric < 500;

-- JSONB aggregation
SELECT metadata->>'category' AS category,
       COUNT(*),
       AVG((metadata->'price'->>'amount')::numeric) AS avg_price
FROM products
GROUP BY metadata->>'category';

pgvector — The Game Changer for the AI/RAG Era

-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536)  -- OpenAI text-embedding-3-small dimensions
);

-- HNSW index (fast approximate nearest neighbor search)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- IVFFlat index (memory efficient)
CREATE INDEX ON documents
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);

-- Cosine similarity search
SELECT id, title,
       1 - (embedding <=> query_embedding) AS similarity
FROM documents
ORDER BY embedding <=> query_embedding
LIMIT 10;

-- Hybrid search (vector + full-text)
SELECT id, title,
       ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'search term')) AS text_score,
       1 - (embedding <=> query_embedding) AS vector_score
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'search term')
ORDER BY vector_score * 0.7 + text_score * 0.3 DESC
LIMIT 10;

pg_trgm — Fuzzy Search and Similarity Matching

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram index
CREATE INDEX idx_products_name_trgm ON products
USING GIN (name gin_trgm_ops);

-- Similarity search (typo-tolerant)
SELECT name, similarity(name, 'PostgreSLQ') AS sim
FROM products
WHERE name % 'PostgreSLQ'
ORDER BY sim DESC;

Partitioning

-- Range partitioning example
CREATE TABLE orders (
    id BIGSERIAL,
    order_date DATE NOT NULL,
    customer_id INTEGER NOT NULL,
    total_amount NUMERIC(10,2),
    PRIMARY KEY (id, order_date)
) PARTITION BY RANGE (order_date);

-- Create partitions
CREATE TABLE orders_2025_q1 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
CREATE TABLE orders_2025_q2 PARTITION OF orders
    FOR VALUES FROM ('2025-04-01') TO ('2025-07-01');

-- Automatic partition management (pg_partman)
CREATE EXTENSION pg_partman;
SELECT partman.create_parent(
    p_parent_table => 'public.orders',
    p_control => 'order_date',
    p_type => 'native',
    p_interval => '3 months'
);

PostgreSQL 17 Key Features

Incremental Backup: Incremental backup support in pg_basebackup
Enhanced SQL/JSON Standard Support: JSON_TABLE, JSON_QUERY, etc.
Performance Improvements: Improved parallel hash joins, bulk loading optimizations
Security Enhancements: Improved LDAP/SCRAM authentication

4. MySQL Deep Dive

InnoDB Architecture

┌─────────────────────────────────────────────┐
│                MySQL Server                  │
│  ┌───────────────────────────────────────┐  │
│  │          Connection Pool              │  │
│  │   Thread 1 | Thread 2 | Thread N     │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────────────────────────────────┐  │
│  │     Query Cache (removed in 8.0+)    │  │
│  │          Parser -> Optimizer          │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────────────────────────────────┐  │
│  │            InnoDB Engine              │  │
│  │  ┌──────────┐  ┌──────────────────┐  │  │
│  │  │ Buffer   │  │  Change Buffer   │  │  │
│  │  │ Pool     │  │                  │  │  │
│  │  └──────────┘  └──────────────────┘  │  │
│  │  ┌──────────┐  ┌──────────────────┐  │  │
│  │  │ Redo Log │  │  Undo Log        │  │  │
│  │  └──────────┘  └──────────────────┘  │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘

MySQL uses a thread-based architecture, contrasting with PostgreSQL's process-based approach. This results in lower connection overhead and higher memory efficiency.

MySQL vs PostgreSQL Key Differences

-- MySQL: UPSERT (INSERT ... ON DUPLICATE KEY)
INSERT INTO products (id, name, stock)
VALUES (1, 'Widget', 100)
ON DUPLICATE KEY UPDATE stock = stock + VALUES(stock);

-- PostgreSQL: UPSERT (INSERT ... ON CONFLICT)
INSERT INTO products (id, name, stock)
VALUES (1, 'Widget', 100)
ON CONFLICT (id)
DO UPDATE SET stock = products.stock + EXCLUDED.stock;

HeatWave ML — ML Directly in MySQL

-- Train model with HeatWave ML
CALL sys.ML_TRAIN('my_schema.customer_data',
    'churn_label',
    JSON_OBJECT('task', 'classification'),
    @model_handle);

-- Run prediction
CALL sys.ML_PREDICT_ROW(
    JSON_OBJECT('age', 35, 'tenure', 24, 'monthly_charge', 79.99),
    @model_handle,
    @prediction);

SELECT @prediction;

Group Replication

MySQL Group Replication uses a Paxos-based consensus protocol for high availability.

-- Check Group Replication status
SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;

-- Single-primary mode (default, recommended)
-- One write node, rest are read-only
SET GLOBAL group_replication_single_primary_mode = ON;

Read/Write Splitting with ProxySQL

-- ProxySQL routing rules
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup)
VALUES
    (1, 1, '^SELECT .* FOR UPDATE', 0),   -- Writer group
    (2, 1, '^SELECT', 1);                  -- Reader group

-- Server group configuration
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES
    (0, 'primary.mysql.local', 3306, 1),    -- Writer
    (1, 'replica1.mysql.local', 3306, 50),  -- Reader
    (1, 'replica2.mysql.local', 3306, 50);  -- Reader

MySQL 9.0 Key Features

JavaScript Stored Programs: GraalVM-based JavaScript stored procedures
Vector Data Type: Native vector data type support
EXPLAIN Improvements: FORMAT=TREE default, more intuitive execution plans
Security: OpenID Connect authentication support

5. MongoDB Deep Dive

Strengths of the Document Model

// MongoDB document structure — embedding pattern
db.orders.insertOne({
  orderId: 'ORD-2025-001',
  customer: {
    name: 'John Dev',
    email: 'john@example.com',
    address: {
      street: '123 Tech Street',
      city: 'San Francisco',
      zipCode: '94102',
    },
  },
  items: [
    {
      productId: 'PROD-001',
      name: 'Wireless Keyboard',
      price: 89.0,
      quantity: 2,
    },
    {
      productId: 'PROD-002',
      name: 'Monitor Stand',
      price: 45.0,
      quantity: 1,
    },
  ],
  payment: {
    method: 'card',
    status: 'completed',
    paidAt: ISODate('2025-03-15T10:30:00Z'),
  },
  totalAmount: 223.0,
  createdAt: ISODate('2025-03-15T10:25:00Z'),
})

Atlas Vector Search

// Create vector search index
db.collection.createSearchIndex({
  name: 'vector_index',
  definition: {
    mappings: {
      dynamic: true,
      fields: {
        embedding: {
          type: 'knnVector',
          dimensions: 1536,
          similarity: 'cosine',
        },
      },
    },
  },
})

// Vector search query
db.documents.aggregate([
  {
    $vectorSearch: {
      index: 'vector_index',
      path: 'embedding',
      queryVector: queryEmbedding, // 1536-dimensional vector
      numCandidates: 100,
      limit: 10,
      filter: { category: 'tech' },
    },
  },
  {
    $project: {
      title: 1,
      content: 1,
      score: { $meta: 'vectorSearchScore' },
    },
  },
])

Advanced Aggregation Pipeline

// Monthly revenue analysis pipeline
db.orders.aggregate([
  // Period filter
  {
    $match: {
      createdAt: {
        $gte: ISODate('2025-01-01'),
        $lt: ISODate('2026-01-01'),
      },
      'payment.status': 'completed',
    },
  },
  // Decompose order items
  { $unwind: '$items' },
  // Group by month
  {
    $group: {
      _id: {
        year: { $year: '$createdAt' },
        month: { $month: '$createdAt' },
        category: '$items.category',
      },
      totalRevenue: { $sum: { $multiply: ['$items.price', '$items.quantity'] } },
      orderCount: { $sum: 1 },
      avgOrderValue: { $avg: { $multiply: ['$items.price', '$items.quantity'] } },
    },
  },
  // Sort
  { $sort: { '_id.year': 1, '_id.month': 1 } },
  // Transform output
  {
    $project: {
      _id: 0,
      period: { $concat: [{ $toString: '$_id.year' }, '-', { $toString: '$_id.month' }] },
      category: '$_id.category',
      totalRevenue: 1,
      orderCount: 1,
      avgOrderValue: { $round: ['$avgOrderValue', 0] },
    },
  },
])

MongoDB Sharding

// Shard key selection — consider cardinality, distribution, query patterns
sh.enableSharding('ecommerce')

// Hash sharding (even distribution)
sh.shardCollection('ecommerce.orders', { customerId: 'hashed' })

// Range sharding (optimized for range queries)
sh.shardCollection('ecommerce.logs', { timestamp: 1 })

// Zone sharding (region-based data separation)
sh.addShardTag('shard-us', 'AMERICAS')
sh.addShardTag('shard-eu', 'EUROPE')
sh.addTagRange('ecommerce.users', { region: 'US' }, { region: 'UT' }, 'AMERICAS')

MongoDB 8.0 Key Features

Queryable Encryption: Query data while encrypted
Cluster-to-Cluster Sync: Multi-cloud synchronization
Time Series Collection Improvements: More efficient time-series data handling
Performance: 20-35% improvement in read/write performance

6. Redis Deep Dive

Data Structures and Use Cases

-- Strings: cache, sessions, counters
SET user:session:abc123 "session_data" EX 3600
INCR page:views:homepage

-- Hashes: object storage
HSET user:1001 name "John Dev" email "john@dev.com" login_count 42
HINCRBY user:1001 login_count 1

-- Sorted Sets: leaderboards, rankings
ZADD leaderboard 9500 "player:alice"
ZADD leaderboard 8700 "player:bob"
ZADD leaderboard 9200 "player:charlie"
ZREVRANGE leaderboard 0 9 WITHSCORES  -- Top 10

-- Streams: event streaming
XADD mystream * sensor_id "temp-01" temperature "23.5" humidity "45"
XREAD COUNT 10 BLOCK 5000 STREAMS mystream 0

-- HyperLogLog: approximate unique counts
PFADD unique:visitors:2025-03-23 "user1" "user2" "user3"
PFCOUNT unique:visitors:2025-03-23

Redis Stack — Search, JSON, TimeSeries

-- Redis JSON
JSON.SET product:001 $ '{"name":"Wireless Keyboard","price":89.00,"tags":["electronics","keyboard"],"specs":{"weight":350,"wireless":true}}'
JSON.GET product:001 $.specs.weight

-- Redis Search index creation
FT.CREATE idx:products ON JSON PREFIX 1 product:
  SCHEMA
    $.name AS name TEXT SORTABLE
    $.price AS price NUMERIC SORTABLE
    $.tags[*] AS tags TAG
    $.specs.weight AS weight NUMERIC

-- Full-text search
FT.SEARCH idx:products "@name:keyboard @price:[50 100]"

-- Vector search (Redis Stack)
FT.CREATE idx:docs ON HASH PREFIX 1 doc:
  SCHEMA
    title TEXT
    embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE

-- KNN search
FT.SEARCH idx:docs "*=>[KNN 10 @embedding $query_vec AS score]"
  PARAMS 2 query_vec "\x00\x00..."
  SORTBY score
  RETURN 2 title score

-- Redis TimeSeries
TS.CREATE temperature:sensor1 RETENTION 86400000 LABELS sensor_id temp-01 location office
TS.ADD temperature:sensor1 * 23.5
TS.RANGE temperature:sensor1 - + AGGREGATION avg 60000  -- 1-minute average

Persistence: RDB vs AOF

Feature	RDB (Snapshot)	AOF (Append Only File)
Mechanism	Periodic memory dump	Logs every write command
Data Loss	Since last snapshot	Depends on fsync policy
Recovery Speed	Fast (binary load)	Slow (command replay)
File Size	Small (compressed)	Large (all commands logged)
CPU Impact	High during fork()	Low with everysec

# Recommended redis.conf settings
save 900 1       # RDB if 1+ changes in 900 seconds
save 300 10      # RDB if 10+ changes in 300 seconds
appendonly yes   # Enable AOF
appendfsync everysec  # fsync every second (performance/safety balance)

Redis Cluster

# Create 6-node cluster (3 masters + 3 replicas)
redis-cli --cluster create \
  node1:6379 node2:6379 node3:6379 \
  node4:6379 node5:6379 node6:6379 \
  --cluster-replicas 1

Redis Cluster uses 16384 hash slots to distribute keys. The key space is mapped to slots using CRC16 hashing.

The Valkey Fork — The Future of Redis?

After Redis Labs switched to the SSPL license in 2024, Valkey was born under the Linux Foundation.

Item	Redis	Valkey
License	SSPL (2024~)	BSD-3-Clause
Cloud Support	Redis Cloud	AWS, GCP, Azure
Compatibility	Original	Redis 7.2 compatible
Development	Redis Ltd.	Linux Foundation
Community	Limited contributions	Open contributions

7. DynamoDB — The Serverless NoSQL Standard

Core Concepts

┌───────────────────────────────────────────┐
│              DynamoDB Table                │
│  ┌─────────────────────────────────────┐  │
│  │  Partition Key (PK)  Sort Key (SK)  │  │
│  ├─────────────────────────────────────┤  │
│  │  USER#001            PROFILE        │  │
│  │  USER#001            ORDER#001      │  │
│  │  USER#001            ORDER#002      │  │
│  │  USER#002            PROFILE        │  │
│  │  USER#002            ORDER#001      │  │
│  └─────────────────────────────────────┘  │
│                                           │
│  GSI: OrderDate-Index                     │
│  ┌─────────────────────────────────────┐  │
│  │  OrderDate (PK)     UserId (SK)     │  │
│  └─────────────────────────────────────┘  │
└───────────────────────────────────────────┘

Single-Table Design Pattern

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('EcommerceTable')

# Store user profile
table.put_item(Item={
    'PK': 'USER#001',
    'SK': 'PROFILE',
    'name': 'John Dev',
    'email': 'john@dev.com',
    'type': 'USER'
})

# Store order in the same table
table.put_item(Item={
    'PK': 'USER#001',
    'SK': 'ORDER#2025-03-15#ORD001',
    'items': [{'name': 'Keyboard', 'price': 89.00}],
    'total': 89.00,
    'status': 'completed',
    'type': 'ORDER'
})

# Query all orders for a user
response = table.query(
    KeyConditionExpression='PK = :pk AND begins_with(SK, :sk)',
    ExpressionAttributeValues={
        ':pk': 'USER#001',
        ':sk': 'ORDER#'
    }
)

DynamoDB Pricing Models

Mode	Read Cost	Write Cost	Best For
On-Demand	$0.25/1M RRU	$1.25/1M WRU	Unpredictable traffic
Provisioned	$0.00065/RCU/hour	$0.00065/WCU/hour	Stable traffic
Reserved	Up to 77% discount	Up to 77% discount	1-year/3-year commitment

8. Cassandra / ScyllaDB — The Kings of High-Volume Writes

Cassandra Architecture

Cassandra combines a Dynamo-style distributed hash table with a BigTable-style data model.

┌────────┐  ┌────────┐  ┌────────┐
│ Node A │──│ Node B │──│ Node C │
│ Token: │  │ Token: │  │ Token: │
│ 0-33   │  │ 34-66  │  │ 67-100 │
└───┬────┘  └───┬────┘  └───┬────┘
    │           │           │
    └───────────┼───────────┘
            Gossip Protocol

-- Create keyspace (with replication strategy)
CREATE KEYSPACE ecommerce WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};

-- Create table (partition key + clustering key)
CREATE TABLE ecommerce.user_activity (
    user_id UUID,
    activity_date DATE,
    activity_time TIMESTAMP,
    activity_type TEXT,
    details MAP<TEXT, TEXT>,
    PRIMARY KEY ((user_id, activity_date), activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_size': 1,
                    'compaction_window_unit': 'DAYS'};

ScyllaDB — High-Performance Cassandra-Compatible Alternative

ScyllaDB is a Cassandra-compatible database rewritten in C++. Its shard-per-core architecture delivers 3-10x higher performance on identical hardware.

Comparison	Cassandra	ScyllaDB
Language	Java	C++ (Seastar)
Architecture	Thread-based	Shard-per-core
GC Impact	GC pauses exist	No GC
Same Hardware Performance	Baseline	3-10x
Cassandra Compatibility	Original	CQL compatible

9. Performance Benchmark Comparison

YCSB Benchmark (Yahoo Cloud Serving Benchmark)

The following comparison is on identical hardware (8 vCPU, 32GB RAM, NVMe SSD):

Operation	PostgreSQL	MySQL	MongoDB	Redis	DynamoDB	ScyllaDB
Read Latency (p99)	2.1ms	1.8ms	1.5ms	0.3ms	4.2ms	0.9ms
Write Latency (p99)	3.5ms	2.9ms	2.1ms	0.4ms	5.8ms	1.2ms
Read TPS	45K	52K	58K	250K	25K	120K
Write TPS	28K	35K	42K	200K	20K	85K
Mixed (50/50) TPS	35K	42K	48K	220K	22K	100K

Note: Benchmark results vary significantly based on workload patterns, data size, index configuration, etc.

Scalability Comparison

Scaling Method	PostgreSQL	MySQL	MongoDB	Redis	DynamoDB	Cassandra
Vertical	Excellent	Excellent	Good	Good	N/A	Good
Read Scale	Replicas	Replicas	Replica Set	Cluster	Auto	Add Nodes
Write Scale	Limited	Limited	Sharding	Cluster	Auto	Add Nodes
Max Data Size	Several TB	Several TB	Several PB	Hundreds GB	Unlimited	Several PB

10. Use Case Selection Matrix

Optimal Database Selection Guide

Use Case	1st Choice	2nd Choice	Reason
E-commerce	PostgreSQL	MySQL	Transactions + JSONB flexibility
Social Media	MongoDB	Cassandra	Unstructured data + high writes
IoT Sensor Data	ScyllaDB	DynamoDB	High-volume time-series writes
Real-time Analytics	PostgreSQL	ClickHouse	Complex aggregate queries
RAG/AI Search	PostgreSQL (pgvector)	MongoDB (Atlas Vector)	Vector + full-text search
Cache	Redis	DragonflyDB	Ultra-low latency, rich data structures
Session Store	Redis	DynamoDB	Fast read/write, TTL
Chat/Messaging	MongoDB	Cassandra	Flexible schema, time-ordered
Game Leaderboard	Redis	DynamoDB	Sorted Set, fast updates
Log Management	Elasticsearch	MongoDB	Full-text search, time-series
Geospatial	PostgreSQL (PostGIS)	MongoDB	GIS capabilities, spatial indexes
Graph Data	Neo4j	PostgreSQL (AGE)	Optimized relationship traversal

11. Multi-DB Architecture Patterns

Polyglot Persistence

┌─────────────────────────────────────────────────┐
│                  Application Layer               │
└───────┬──────────┬──────────┬──────────┬────────┘
        │          │          │          │
   ┌────▼────┐ ┌──▼───┐ ┌───▼───┐ ┌───▼────┐
   │PostgreSQL│ │Redis │ │MongoDB│ │Elastic │
   │(Orders/  │ │(Cache)│ │(Prods)│ │(Search)│
   │ Payment) │ │      │ │       │ │        │
   └─────────┘ └──────┘ └───────┘ └────────┘

CQRS Pattern (Command Query Responsibility Segregation)

Write (Command):                  Read (Query):
┌──────────┐                     ┌──────────┐
│  Client   │                     │  Client   │
└────┬─────┘                     └────┬─────┘
     │                                │
┌────▼─────┐                     ┌────▼─────┐
│ Command  │                     │  Query   │
│ Service  │                     │ Service  │
└────┬─────┘                     └────┬─────┘
     │                                │
┌────▼─────┐  ──Events──▶  ┌─────▼─────┐
│PostgreSQL│              │  Redis /   │
│ (Source  │              │  MongoDB   │
│  of Truth)│              │ (Read Model)│
└──────────┘              └───────────┘

Cache-Aside Pattern

import redis
import psycopg2
import json

r = redis.Redis(host='localhost', port=6379)

def get_product(product_id):
    # 1. Check cache
    cache_key = f"product:{product_id}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    # 2. Cache miss — query DB
    conn = psycopg2.connect("dbname=ecommerce")
    cur = conn.cursor()
    cur.execute("SELECT * FROM products WHERE id = %s", (product_id,))
    product = cur.fetchone()

    if product:
        # 3. Store in cache (TTL 1 hour)
        r.setex(cache_key, 3600, json.dumps(product))

    return product

def update_product(product_id, data):
    # 1. Update DB
    conn = psycopg2.connect("dbname=ecommerce")
    cur = conn.cursor()
    cur.execute("UPDATE products SET name=%s WHERE id=%s",
                (data['name'], product_id))
    conn.commit()

    # 2. Invalidate cache
    r.delete(f"product:{product_id}")

12. Cost Comparison (Managed Services)

Monthly Cost Estimate (Medium-Scale Service: 100GB Data, 5000 TPS)

Service	Provider	Instance	Monthly Cost (USD)
RDS PostgreSQL	AWS	db.r6g.xlarge	~350
Cloud SQL PostgreSQL	GCP	db-custom-4-16384	~320
RDS MySQL	AWS	db.r6g.xlarge	~350
Atlas MongoDB	MongoDB	M40	~480
ElastiCache Redis	AWS	cache.r6g.xlarge	~450
DynamoDB On-Demand	AWS	-	~180-600
Neon PostgreSQL	Neon	Scale	~69-189
PlanetScale MySQL	PlanetScale	Scaler Pro	~39-299

Cost Optimization Strategies

Reserved Instances: 40-70% discount with 1-3 year commitments
Serverless Options: Neon, PlanetScale, DynamoDB on-demand for variable traffic
Read Replicas: Distribute read load to save on primary instance specs
Right-sizing: Check actual utilization via CloudWatch/monitoring before adjusting
Data Archiving: Move old data to S3/Glacier

13. Interview Questions — 20 Essential Questions

Fundamentals (1-5)

Q1: Explain the difference between ACID and BASE.

ACID (Atomicity, Consistency, Isolation, Durability): Traditional RDBMS transaction properties. All operations either fully succeed or fully fail, guaranteeing data consistency.

BASE (Basically Available, Soft state, Eventually consistent): The NoSQL approach. Always available but provides eventual consistency instead of immediate consistency.

Selection criteria: ACID for financial/payment systems, BASE for social media feeds/like counters.

Q2: Explain the CAP theorem with real-world examples.

CP System (MongoDB): During network partitions, sacrifices some node availability for consistency. Writes unavailable during primary election.
AP System (Cassandra, DynamoDB): All nodes respond even during network partitions, but data may temporarily differ between nodes.
CA System: Single-node systems without network partitions. Practically nonexistent in distributed environments.

Pro tip: Extending to the PACELC theorem, consider the trade-off between Latency and Consistency when there is no partition (Else).

Q3: Explain index types and their appropriate use cases.

B-Tree Index: Optimal for range searches and sorting. Most default indexes. O(log N)
Hash Index: Only for equality comparisons (=). O(1). Rarely used in PostgreSQL due to WAL support issues
GIN (Generalized Inverted Index): Suitable for arrays, JSONB, full-text search
GiST (Generalized Search Tree): Suitable for spatial data (PostGIS), range types
BRIN (Block Range Index): Effective for physically sorted large tables (time-series). Memory efficient
Bloom Index: Equality comparison filtering across many columns

Q4: What is the N+1 query problem and how do you solve it?

N+1 problem: An inefficient pattern where 1 query fetches N results, then N additional queries are executed for each result.

Solutions:

JOIN: Combine data in a single query in relational DBs
Eager Loading: Pre-load related data in ORMs (Django: select_related, prefetch_related)
DataLoader Pattern: Batch requests together in GraphQL
MongoDB: Embedding pattern to store related data in a single document

Q5: Explain the trade-offs of database normalization vs denormalization.

Normalization (1NF-BCNF):

Pros: Minimize data redundancy, easier consistency maintenance, save storage
Cons: Increased JOINs degrade read performance, complex queries

Denormalization:

Pros: Improved read performance (fewer JOINs), simpler queries
Cons: Data redundancy, update anomaly risks, increased storage

Practical guide: Normalize to 3NF+ for OLTP, consider intentional denormalization for OLAP/read-heavy workloads.

Advanced Topics (6-10)

Q6: Explain the differences between PostgreSQL MVCC and MySQL InnoDB MVCC.

PostgreSQL MVCC:

Stores multiple row versions directly in the table (xmin, xmax)
Old versions cleaned up by VACUUM
Pros: No read locks, high concurrency
Cons: Table bloat, VACUUM overhead

MySQL InnoDB MVCC:

Stores previous versions in Undo Log
Current row always has the latest version
Purge thread cleans unnecessary Undo entries
Pros: No table bloat, simpler management
Cons: Performance degradation when Undo log grows large

Q7: What are the types of sharding strategies and how do you choose?

Hash Sharding: Hash key for even distribution. Inefficient for range queries.
Range Sharding: Split by key range. Hotspot risk. Good for time-series.
Geography-based Sharding: Split by user location. Useful for regional compliance.
Directory-based Sharding: Flexible routing via mapping table. High management complexity.

Shard key selection criteria: High cardinality, even distribution, alignment with query patterns.

Q8: Redis is single-threaded — how is it so fast?

In-memory Processing: All data in RAM, no disk I/O
I/O Multiplexing: epoll/kqueue handles thousands of connections in a single thread
No Locking Needed: Single-threaded means no concurrency control overhead
Optimized Data Structures: SDS, ziplist, intset — memory-efficient structures
Redis 6.0+: I/O threading introduced (command processing remains single-threaded)

Note: Redis 7.0+ further improved multi-threaded I/O, enhancing network processing performance.

Q9: When should you choose embedding vs referencing in MongoDB?

Embedding (subdocuments):

Data queried together
1:1 or 1:few relationships
Within 16MB document size limit
Example: Orders and order items, users and addresses

Referencing (separate collections):

Data queried/updated independently
1:many or many:many relationships
Document size grows rapidly
Example: Blog posts and comments, products and reviews

Hybrid pattern: Embed frequently-used data, reference the rest.

Q10: Why is connection pooling important and how do you configure it?

DB connection creation cost: TCP handshake + authentication + session allocation = several to tens of ms. Creating connections per request severely degrades performance.

Configuration example (HikariCP):

maximumPoolSize: Typically CPU cores x 2 + number of disks
minimumIdle: Minimum idle connections
connectionTimeout: Maximum time waiting for a connection from the pool
idleTimeout: How long idle connections are maintained

PostgreSQL is process-based, so using PgBouncer in front for connection pooling management is a best practice.

Practical System Design (11-15)

Q11: Design a DB architecture for a high-traffic e-commerce service.

Product Catalog: MongoDB (flexible attributes, fast reads)
Orders/Payments: PostgreSQL (ACID transactions)
Cart/Sessions: Redis (TTL, fast read/write)
Product Search: Elasticsearch (full-text search, facets)
Recommendation System: PostgreSQL pgvector (vector similarity)
Analytics/Reporting: ClickHouse or BigQuery

Caching strategy: Cache-Aside pattern with Redis for product info caching. CDN for static content caching.

Q12: Explain database migration strategies.

Dual-Write Pattern: Write to both DBs simultaneously, gradually switch reads
CDC (Change Data Capture): Real-time data sync with tools like Debezium
Shadow Traffic: Mirror traffic to new DB for performance validation
Blue-Green: Prepare two environments and switch at once

Key points: Rollback plan is essential, automate data consistency verification, gradual traffic migration.

Q13: How do you handle replication lag between read replicas and write master?

Read-your-writes Consistency: Read from master for a short time after writes
Minimum Replication Confirmation: PostgreSQL's synchronous_commit setting
Lag Monitoring: Check replay_lag in pg_stat_replication
GTID-based Routing: Route to replicas that have replicated up to a specific GTID in MySQL
Cache Utilization: Store latest values in cache after writes, read from cache instead of replica

Q14: How do you safely perform schema changes on large tables?

PostgreSQL:

ALTER TABLE ADD COLUMN (without DEFAULT): Instant completion
ALTER TABLE ADD COLUMN DEFAULT value: Instant in pg 11+
CREATE INDEX CONCURRENTLY: Index creation without table locks

MySQL:

pt-online-schema-change (Percona): Trigger-based online changes
gh-ost (GitHub): Binary log-based online changes
MySQL 8.0+ INSTANT DDL: Metadata-only change operations

MongoDB:

Schema-less so most migrations are unnecessary
Index creation: createIndex (background build by default)

Q15: What are the most important metrics for DB performance monitoring?

Query Performance: Slow query logs, p95/p99 response times
Connection Count: Active/idle connections, connection pool utilization
Cache Hit Ratio: Buffer Pool/Shared Buffer hit ratio (target: 99%+)
Replication Lag: Master-replica delay time
Disk I/O: IOPS, latency, utilization
Lock Waits: Row lock/table lock wait time and frequency
Transaction Throughput: TPS, commit/rollback ratios

Tools: pg_stat_statements (PostgreSQL), Performance Schema (MySQL), Atlas (MongoDB), Redis INFO.

Advanced Practical Questions (16-20)

Q16: What is the Hot Partition problem and how do you solve it?

A problem where traffic concentrates on specific partitions/shards, creating performance bottlenecks.

DynamoDB example: When writes concentrate on a specific partition key, that partition gets throttled.

Solutions:

Write Sharding: Add random suffix to key (e.g., USER#001#3)
Composite Key: Include date/time in key for distribution
Caching Layer: Absorb read hotspots with Redis
Distributed Counters: Split one counter into N parts and sum

Q17: Explain backup strategy types and RPO/RTO.

RPO (Recovery Point Objective): Maximum acceptable data loss time
RTO (Recovery Time Objective): Maximum acceptable recovery time

Backup types:

Full Backup: All data. Short RTO, large storage
Incremental Backup: Changes since last backup. Small storage, complex recovery
Continuous Archiving (WAL): PostgreSQL WAL archiving enables Point-in-Time Recovery (PITR). RPO near zero
Snapshots: EBS snapshots, RDS automatic backups. Fast recovery

Q18: Describe the process for analyzing and optimizing slow queries.

Identify: Enable slow query logs (MySQL: slow_query_log, PostgreSQL: log_min_duration_statement)
Analyze: Check execution plan with EXPLAIN ANALYZE
Optimization order:
- Add/improve indexes
- Refactor queries (remove unnecessary JOINs, convert subqueries to CTEs)
- Apply partitioning
- Archive data
- Consider denormalization (read optimization)
Verify: Compare before/after performance
Monitor: Continuously observe pg_stat_statements, Performance Schema

Q19: How do you ensure data consistency in eventually consistent systems?

Read Repair: Fix inconsistencies found during reads with latest values
Anti-Entropy (Merkle Tree): Background data comparison/correction between nodes
Conflict Resolution: Last-write-wins (LWW), vector clocks, custom merge logic
Quorum Reads/Writes: R + W > N condition for strong consistency
Saga Pattern: Compensating transactions instead of distributed transactions for consistency

Q20: What are the top 3 factors to consider when selecting a database in 2025?

AI/Vector Search Support: As RAG pipelines become standard, vector search is essential. Choose between pgvector, Atlas Vector Search, Redis Vector Search.
Operational Complexity vs Cost: Convenience of managed services (RDS, Atlas, ElastiCache) vs cost savings of self-hosting. Decide based on team size and DevOps capabilities.
Data Locality and Regulations: Data sovereignty regulations like GDPR and regional privacy laws. DynamoDB Global Tables, CockroachDB's region-based distribution can be solutions.

14. Quiz

Quiz 1: What are the 2 main differences between JSONB and JSON types in PostgreSQL?

Storage Method: JSON stores as plain text, JSONB decomposes into binary format. JSONB writes slightly slower but reads/searches much faster.
Indexing: JSONB supports GIN indexes for O(log N) key/value searches. JSON cannot be indexed.

Additional: JSONB does not preserve key order and removes duplicate keys.

Quiz 2: What data structure does Redis Sorted Set use internally?

Skip List + Hash Table combination.

Skip List: Range search in sorted order O(log N)
Hash Table: Score lookup O(1)

When elements are 128 or fewer and values are 64 bytes or less, ziplist (listpack in 7.0+) is used to save memory.

Quiz 3: What is the difference between GSI and LSI in DynamoDB?

Item	GSI	LSI
Partition Key	Can differ from table	Same as table
Sort Key	Freely chosen	Freely chosen
Capacity Limit	None	10GB per partition
Creation Timing	After table creation	Only at table creation
Consistency	Eventually consistent only	Strong consistency possible
Provisioning	Separate RCU/WCU	Shared with table

Practical tip: GSI is more flexible, so prefer GSI in most cases.

Quiz 4: When should you use and avoid transactions in MongoDB?

Use when:

Atomic updates across multiple documents/collections (payment and inventory simultaneous changes)
Data consistency is a business requirement

Avoid when:

Single document updates (MongoDB is already atomic at document level)
High-throughput operations (transactions have performance overhead)
Frequent multi-shard transactions in sharded environments

Design principle: Embed related data in a single document when possible to reduce the need for transactions.

Quiz 5: Why is VACUUM important in PostgreSQL?

PostgreSQL's MVCC does not immediately remove previous versions during updates/deletes. VACUUM cleans up these "dead tuples."

Without it:

Table Bloat: Wasted disk space, degraded sequential scan performance
Index Bloat: Increased index size, degraded search performance
Transaction ID Wraparound: Risk of data loss after approximately 2 billion transactions (XID wraparound)

Autovacuum is enabled by default, but tables with high write loads need autovacuum parameter tuning.

References

PostgreSQL 17 Official Docs: https://www.postgresql.org/docs/17/
MySQL 9.0 Official Docs: https://dev.mysql.com/doc/refman/9.0/en/
MongoDB 8.0 Official Docs: https://www.mongodb.com/docs/manual/
Redis Official Docs: https://redis.io/docs/
Valkey Project: https://valkey.io/
DynamoDB Developer Guide: https://docs.aws.amazon.com/dynamodb/
ScyllaDB Docs: https://opensource.docs.scylladb.com/
DB-Engines Rankings: https://db-engines.com/en/ranking
YCSB Benchmark: https://github.com/brianfrankcooper/YCSB
"Designing Data-Intensive Applications" by Martin Kleppmann

The 8 databases covered in this article each have unique strengths. There is no "best database" — only "optimal databases." Consider workload patterns, scalability requirements, team capabilities, and cost constraints holistically. And rather than relying on a single database for everything, combining each database's strengths through a Polyglot Persistence strategy is the best practice for 2025.