Chaos and Order

💡 왼쪽 원문을 읽으면서 오른쪽에 따라 써보세요. Tab 키로 힌트를 받을 수 있습니다.

1. What Is a Database
- RDBMS vs NoSQL
- Major Database Comparison
2. SQL Basics
3. Normalization
4. Indexes
5. Transactions
6. Query Optimization
7. Advanced PostgreSQL
8. NoSQL Types and Selection Criteria
9. Redis in Practice
10. Data Modeling
11. Migrations
Conclusion

1. What Is a Database

A database is a system that stores and manages structured data. In modern software, data storage and retrieval are core capabilities, and choosing the right database and designing it properly determines application performance and scalability.

RDBMS vs NoSQL

Relational databases (RDBMS) and NoSQL have fundamentally different philosophies.

Aspect	RDBMS	NoSQL
Data Model	Tables (rows and columns)	Document, Key-Value, Column, Graph
Schema	Fixed schema	Flexible schema
Scaling	Vertical (Scale-Up)	Horizontal (Scale-Out)
Transactions	ACID guaranteed	BASE (some ACID support)
Query Language	SQL	DB-specific APIs
Best For	Complex relationships, consistency	High volume, flexible schemas

Major Database Comparison

DB	Type	Key Features	Primary Use Cases
PostgreSQL	RDBMS	JSONB, extensibility, rich types	General purpose
MySQL	RDBMS	High compatibility, InnoDB engine	Web services
MongoDB	Document	Flexible schema, aggregation pipeline	Content management
Redis	Key-Value	In-memory, ultra-fast	Cache, sessions
Cassandra	Wide Column	High write throughput	IoT, time series
Neo4j	Graph	Optimized relationship traversal	Social networks

2. SQL Basics

SQL (Structured Query Language) is the standard language for working with relational databases.

SELECT and Basic Syntax

-- Basic query
SELECT name, email, created_at
FROM users
WHERE status = 'active'
ORDER BY created_at DESC
LIMIT 10;

-- Using aliases
SELECT
    u.name AS user_name,
    COUNT(o.id) AS order_count,
    SUM(o.total_amount) AS total_spent
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.name;

Types of JOINs

JOINs are the core operation for combining data from multiple tables.

-- INNER JOIN: only matching rows from both sides
SELECT u.name, o.order_date
FROM users u
INNER JOIN orders o ON u.id = o.user_id;

-- LEFT JOIN: all rows from left table + matching right
SELECT u.name, o.order_date
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- RIGHT JOIN: all rows from right table + matching left
SELECT u.name, o.order_date
FROM users u
RIGHT JOIN orders o ON u.id = o.user_id;

-- FULL OUTER JOIN: all rows from both sides
SELECT u.name, o.order_date
FROM users u
FULL OUTER JOIN orders o ON u.id = o.user_id;

JOIN Selection Guide:

INNER JOIN -- when you only need data that exists in both tables
LEFT JOIN -- when you need to preserve all rows from the base table
RIGHT JOIN -- rarely used in practice since LEFT JOIN can substitute
FULL OUTER JOIN -- when you need to merge all data from both sides

GROUP BY and HAVING

-- Departments where average salary exceeds 50,000
SELECT
    department,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) >= 50000
ORDER BY avg_salary DESC;

WHERE filters individual rows, while HAVING filters grouped results.

Subqueries

-- Employees with salary above average
SELECT name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary) FROM employees
);

-- Subquery with EXISTS
SELECT u.name
FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.user_id = u.id
    AND o.created_at >= '2026-01-01'
);

-- Subquery with IN
SELECT name, department
FROM employees
WHERE department IN (
    SELECT department
    FROM employees
    GROUP BY department
    HAVING COUNT(*) > 10
);

3. Normalization

Normalization is the process of decomposing tables to reduce data redundancy and improve integrity.

First Normal Form (1NF)

Every column value must be atomic. A single cell cannot contain multiple values.

-- 1NF violation: multiple values in phone_numbers
-- name: 'John Doe', phone_numbers: '555-1234, 555-5678'

-- 1NF compliant: separate table
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE phone_numbers (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    phone VARCHAR(20) NOT NULL
);

Second Normal Form (2NF)

Satisfies 1NF and has no partial dependencies. Columns that depend on only part of a composite key must be separated.

-- 2NF violation: student_name depends only on student_id
-- (student_id, course_id) -> grade  (OK)
-- student_id -> student_name  (partial dependency!)

-- 2NF compliant
CREATE TABLE students (
    student_id SERIAL PRIMARY KEY,
    student_name VARCHAR(100) NOT NULL
);

CREATE TABLE enrollments (
    student_id INTEGER REFERENCES students(student_id),
    course_id INTEGER REFERENCES courses(course_id),
    grade CHAR(2),
    PRIMARY KEY (student_id, course_id)
);

Third Normal Form (3NF)

Satisfies 2NF and has no transitive dependencies.

-- 3NF violation: department_name is transitively dependent via department_id
-- employee_id -> department_id -> department_name

-- 3NF compliant
CREATE TABLE departments (
    department_id SERIAL PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL
);

CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    department_id INTEGER REFERENCES departments(department_id)
);

BCNF (Boyce-Codd Normal Form)

Satisfies 3NF and every determinant is a candidate key. It applies stricter conditions than 3NF.

When to Denormalize

Denormalization intentionally allows redundancy to improve read performance.

Consider denormalization when:

Reads vastly outnumber writes
Complex JOINs occur repeatedly
Real-time aggregation is needed
Caching alone cannot meet performance requirements

-- Denormalization example: storing user name in orders table
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    user_name VARCHAR(100),  -- Denormalized: eliminates JOIN on read
    total_amount DECIMAL(10, 2),
    created_at TIMESTAMP DEFAULT NOW()
);

4. Indexes

An index is a data structure that accelerates data retrieval. It works like a book index.

B-Tree Index

The most versatile index type. Effective for range queries, sorting, and equality comparisons.

-- Single column index
CREATE INDEX idx_users_email ON users(email);

-- Query examples with index usage
SELECT * FROM users WHERE email = 'test@example.com';  -- Uses index
SELECT * FROM users WHERE email LIKE 'test%';           -- Uses index
SELECT * FROM users WHERE email LIKE '%test';           -- No index (leading wildcard)

Hash Index

Only usable for equality comparisons (=). Cannot perform range queries but can be faster than B-Tree for equality lookups.

CREATE INDEX idx_users_email_hash ON users USING HASH (email);

GIN Index (Generalized Inverted Index)

Suitable for arrays, JSONB, and full-text search.

-- GIN index on JSONB column
CREATE INDEX idx_products_tags ON products USING GIN (tags);

-- GIN index for full-text search
CREATE INDEX idx_articles_search ON articles USING GIN (
    to_tsvector('english', title || ' ' || content)
);

GiST Index (Generalized Search Tree)

Used for spatial data, range types, and full-text search.

-- Spatial data index
CREATE INDEX idx_locations_coords ON locations USING GIST (coordinates);

Composite Index

An index combining multiple columns. Column order matters significantly.

-- Composite index: matches from leading column
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);

-- Queries where this index is effective:
SELECT * FROM orders WHERE user_id = 1;                          -- Used
SELECT * FROM orders WHERE user_id = 1 AND created_at > '2026-01-01'; -- Used
SELECT * FROM orders WHERE created_at > '2026-01-01';            -- Not used (missing leading column)

Covering Index

Includes all columns needed by a query in the index, returning results without accessing the table.

-- Covering index
CREATE INDEX idx_orders_covering ON orders(user_id, created_at, total_amount);

-- Can be processed using index only (Index Only Scan)
SELECT user_id, created_at, total_amount
FROM orders
WHERE user_id = 1 AND created_at > '2026-01-01';

Index Design Principles

Index columns with high cardinality (number of unique values)
Prioritize columns frequently used in WHERE, JOIN, and ORDER BY
Consider the trade-off between write performance and storage space
In composite indexes, place higher selectivity columns first

5. Transactions

A transaction is a single logical unit of work that changes the state of a database.

ACID Properties

Atomicity -- all operations in a transaction either succeed entirely or fail entirely
Consistency -- the database remains in a consistent state before and after a transaction
Isolation -- concurrent transactions do not interfere with each other
Durability -- results of a completed transaction are permanently preserved

-- Bank transfer transaction example
BEGIN;

UPDATE accounts SET balance = balance - 1000
WHERE account_id = 'A001';

UPDATE accounts SET balance = balance + 1000
WHERE account_id = 'B001';

-- Verify balance
SELECT balance FROM accounts WHERE account_id = 'A001';

COMMIT;  -- or ROLLBACK if there is an issue;

Isolation Levels

Higher isolation levels increase data consistency but decrease concurrency.

Isolation Level	Dirty Read	Non-Repeatable Read	Phantom Read	Performance
Read Uncommitted	Possible	Possible	Possible	Highest
Read Committed	Prevented	Possible	Possible	High
Repeatable Read	Prevented	Prevented	Possible	Moderate
Serializable	Prevented	Prevented	Prevented	Lowest

-- Setting isolation level
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN;
SELECT balance FROM accounts WHERE account_id = 'A001';
-- Changes from other transactions are not visible during this transaction
COMMIT;

Problems at each isolation level:

Dirty Read -- reading data that has not been committed
Non-Repeatable Read -- getting different results when running the same query twice
Phantom Read -- new rows appearing when querying with the same conditions

Deadlock

A state where two transactions wait forever for each other to release their locks.

-- Deadlock scenario
-- Transaction 1: locks A then B
-- Transaction 2: locks B then A

-- Solutions:
-- 1. Unify lock ordering (always lock the lower ID first)
-- 2. Set timeouts
-- 3. Minimize transaction scope

SET lock_timeout = '5s';

6. Query Optimization

Slow queries degrade overall service performance. Systematic optimization is essential.

EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at >= '2026-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 10;

Key items to check in the execution plan:

Seq Scan -- full table scan. If it occurs on large tables, an index is needed
Index Scan -- query using an index. Efficient
Index Only Scan -- returns data from index alone. Most efficient
Nested Loop -- suitable for small data JOINs
Hash Join -- suitable for medium to large data JOINs
Merge Join -- suitable for large sorted data JOINs
Actual Time -- real execution time
Rows -- check estimated vs actual row count discrepancy

The N+1 Problem

The N+1 problem is an inefficient pattern where 1 query triggers N additional queries.

-- N+1 problem occurring
-- 1 query: SELECT * FROM users LIMIT 100;
-- 100 queries: SELECT * FROM orders WHERE user_id = ?;  (for each user)

-- Solution: retrieve in one query with JOIN
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.id IN (SELECT id FROM users LIMIT 100);

-- Or use a subquery
SELECT u.*, (
    SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id
) AS order_count
FROM users u
LIMIT 100;

Query Optimization Checklist

Specify only needed columns instead of SELECT *
Functions in WHERE clauses can invalidate indexes
Avoid leading wildcards in LIKE searches
Process large INSERTs in batches
Create appropriate indexes

-- Bad example: index invalidated
SELECT * FROM users WHERE YEAR(created_at) = 2026;

-- Good example: index can be utilized
SELECT * FROM users
WHERE created_at >= '2026-01-01'
  AND created_at < '2027-01-01';

7. Advanced PostgreSQL

PostgreSQL is the most powerful open-source RDBMS in terms of extensibility and standards compliance.

JSONB

-- Creating and using JSONB columns
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200),
    attributes JSONB
);

-- Inserting data
INSERT INTO products (name, attributes)
VALUES ('Laptop', '{"brand": "Samsung", "ram": 16, "storage": "512GB", "tags": ["electronics", "portable"]}');

-- JSONB queries
SELECT name, attributes->>'brand' AS brand
FROM products
WHERE attributes->>'ram' = '16';

-- JSONB containment operator
SELECT * FROM products
WHERE attributes @> '{"brand": "Samsung"}';

-- JSONB array query
SELECT * FROM products
WHERE attributes->'tags' ? 'electronics';

CTE (Common Table Expression)

-- Recursive CTE: traversing an org chart
WITH RECURSIVE org_tree AS (
    -- Base case: top-level manager
    SELECT id, name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: traverse subordinates
    SELECT e.id, e.name, e.manager_id, t.depth + 1
    FROM employees e
    INNER JOIN org_tree t ON e.manager_id = t.id
)
SELECT * FROM org_tree ORDER BY depth, name;

Window Functions

-- Salary ranking by department
SELECT
    name,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_in_dept,
    RANK() OVER (ORDER BY salary DESC) AS overall_rank,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg_salary,
    salary - LAG(salary) OVER (PARTITION BY department ORDER BY salary DESC) AS diff_from_prev
FROM employees;

-- Running total
SELECT
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date ROWS UNBOUNDED PRECEDING) AS running_total
FROM daily_sales;

Partitioning

-- Range partitioning
CREATE TABLE events (
    id BIGSERIAL,
    event_type VARCHAR(50),
    payload JSONB,
    created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);

-- Creating monthly partitions
CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

CREATE TABLE events_2026_02 PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

CREATE TABLE events_2026_03 PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Indexes are automatically created per partition
CREATE INDEX idx_events_type ON events(event_type);

8. NoSQL Types and Selection Criteria

Document Store -- MongoDB

A document-oriented database that supports flexible JSON-like schemas.

// MongoDB query examples
db.users.insertOne({
  name: "John Doe",
  email: "john@example.com",
  address: {
    city: "New York",
    state: "NY"
  },
  tags: ["premium", "active"]
});

// Aggregation pipeline
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: {
      _id: "$user_id",
      totalSpent: { $sum: "$amount" },
      orderCount: { $sum: 1 }
  }},
  { $sort: { totalSpent: -1 } },
  { $limit: 10 }
]);

Best for: services with frequently changing schemas, content management systems, catalogs

Key-Value Store -- Redis

An in-memory data store providing ultra-fast reads and writes. (Redis details are covered in section 9.)

Wide Column Store -- Cassandra

Provides high write throughput in large-scale distributed environments.

-- Cassandra CQL
CREATE TABLE sensor_data (
    sensor_id UUID,
    event_time TIMESTAMP,
    temperature DOUBLE,
    humidity DOUBLE,
    PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

Best for: IoT data, time series data, large-scale log storage

Graph Database -- Neo4j

Stores and traverses data based on nodes and relationships (edges).

// Neo4j Cypher query
// Finding friends of friends (2-degree relationship)
MATCH (me:Person)-[:FRIEND]->(friend)-[:FRIEND]->(fof)
WHERE me.name = 'John Doe'
  AND NOT (me)-[:FRIEND]->(fof)
  AND fof <> me
RETURN DISTINCT fof.name AS recommendation
LIMIT 10;

Best for: social networks, recommendation systems, fraud detection, knowledge graphs

NoSQL Selection Guide

Requirement	Recommended DB
Flexible schema + complex queries	MongoDB
Ultra-fast cache/sessions	Redis
Massive writes + high availability	Cassandra
Relationship traversal + graph analysis	Neo4j
Full-text search	Elasticsearch
Time series data	TimescaleDB, InfluxDB

9. Redis in Practice

Redis is an in-memory data structure store that supports a variety of data structures.

Caching

# Set cache (TTL 3600 seconds)
SET user:1001:profile '{"name":"John Doe","email":"john@example.com"}' EX 3600

# Read cache
GET user:1001:profile

# Invalidate cache
DEL user:1001:profile

Caching strategies:

Cache-Aside (Lazy Loading) -- check cache on request; if missing, query DB and store in cache
Write-Through -- write to both DB and cache simultaneously
Write-Behind -- write to cache first, then asynchronously persist to DB

Session Management

# Store session (30-minute expiry)
HSET session:abc123 user_id 1001 role admin login_time "2026-04-12T10:00:00"
EXPIRE session:abc123 1800

# Retrieve session
HGETALL session:abc123

Leaderboards

# Add scores
ZADD game:leaderboard 1500 "player:001"
ZADD game:leaderboard 2300 "player:002"
ZADD game:leaderboard 1800 "player:003"

# Get top 10 (highest score first)
ZREVRANGE game:leaderboard 0 9 WITHSCORES

# Get a specific player rank
ZREVRANK game:leaderboard "player:001"

Pub/Sub

# Subscribe to channel
SUBSCRIBE notifications:user:1001

# Publish message
PUBLISH notifications:user:1001 '{"type":"order","message":"Your order has been completed"}'

Distributed Locks

# Acquire lock (NX: only if not exists, EX: expiry time)
SET lock:order:process "worker-1" NX EX 30

# Release lock (Lua script for atomic operation)
# Only release the lock if the value matches (to ensure we own it)
EVAL "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end" 1 lock:order:process "worker-1"

10. Data Modeling

ER Diagrams

Entity-Relationship diagrams are a visual representation of database design.

Key components:

Entity -- a real-world object that becomes a table
Attribute -- a characteristic of an entity (column)
Relationship -- an association between entities

Relationship Types

One-to-One (1:1)

-- User and profile: 1:1
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL
);

CREATE TABLE user_profiles (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    bio TEXT,
    avatar_url VARCHAR(500)
);

One-to-Many (1:N)

-- User and posts: 1:N
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    title VARCHAR(200) NOT NULL,
    content TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

Many-to-Many (M:N)

-- Posts and tags: M:N (junction table required)
CREATE TABLE tags (
    id SERIAL PRIMARY KEY,
    name VARCHAR(50) UNIQUE NOT NULL
);

CREATE TABLE post_tags (
    post_id INTEGER REFERENCES posts(id),
    tag_id INTEGER REFERENCES tags(id),
    PRIMARY KEY (post_id, tag_id)
);

Modeling Checklist

Analyze business requirements first
Identify entities and relationships
Apply normalization (minimum 3NF)
Review denormalization based on performance requirements
Establish an index strategy
Forecast data growth volume

11. Migrations

Why Migrations Are Necessary

Database schemas evolve alongside code. Migration tools version-control schema changes and ensure the entire team maintains the same schema.

Flyway (Java/JVM Ecosystem)

-- V1__create_users_table.sql
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(100) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- V2__add_status_to_users.sql
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
CREATE INDEX idx_users_status ON users(status);

-- V3__create_orders_table.sql
CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT REFERENCES users(id),
    total_amount DECIMAL(10, 2) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

Flyway rules:

File names follow the V-number-description.sql format
Never modify a migration that has already been applied
Add new changes as new files

Alembic (Python/SQLAlchemy Ecosystem)

# alembic/versions/001_create_users.py
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.create_table(
        'users',
        sa.Column('id', sa.Integer(), primary_key=True),
        sa.Column('email', sa.String(255), nullable=False, unique=True),
        sa.Column('name', sa.String(100), nullable=False),
        sa.Column('created_at', sa.DateTime(), server_default=sa.func.now()),
    )

def downgrade():
    op.drop_table('users')

Zero-Downtime Migrations

Methods for changing schemas without service interruption in production environments.

Adding a Column (Safe)

-- Add without NOT NULL (no impact on existing rows)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

Dropping a Column (Staged Approach)

Step 1 -- remove all code usage of the column and deploy.

Step 2 -- after sufficient time has passed, drop the column.

ALTER TABLE users DROP COLUMN phone;

Renaming a Table (Safe Method)

Step 1 -- create the new table.

Step 2 -- modify code to write to both tables.

Step 3 -- migrate the data.

Step 4 -- switch reads to the new table only.

Step 5 -- drop the old table.

Large-Scale Index Creation

-- CONCURRENTLY option: create index without table lock
CREATE INDEX CONCURRENTLY idx_orders_user ON orders(user_id);

Migration Tool Comparison

Tool	Ecosystem	Features
Flyway	Java/JVM	SQL-based, simple and intuitive
Liquibase	Java/JVM	XML/YAML/JSON/SQL support
Alembic	Python	SQLAlchemy integration, auto-detection
Prisma Migrate	Node.js	Prisma ORM integration
golang-migrate	Go	Lightweight, CLI-based
Knex.js	Node.js	JavaScript migrations

Conclusion

A database is the heart of software. You need a solid foundation in SQL basics, an understanding of normalization and index internals, and hands-on mastery of ACID transaction properties. Add query optimization skills and informed NoSQL selection criteria, and you can design a robust data layer for services of any scale.

Key takeaways:

Deeply understand SQL -- JOINs, subqueries, and Window Functions are essential
Follow normalization principles, but consider denormalization for performance
Indexes are the key to read performance, but they come with write costs
Transaction isolation levels are a trade-off between consistency and concurrency
Measure and improve query performance with EXPLAIN ANALYZE
Understand the strengths and weaknesses of RDBMS and NoSQL, and choose accordingly
Always version-control migrations and establish zero-downtime strategies