Database Engineering Complete Guide: From SQL to Vector DBs and AI RAG Systems

Even in the AI era, database engineering remains the backbone of every production system. No matter how powerful an LLM is, you cannot build reliable software without efficiently storing and retrieving data. The rise of vector search and RAG architectures has actually made database engineering more important, not less.

This guide takes you from advanced SQL through PostgreSQL internals, NoSQL systems, vector databases, distributed DB theory, and finally into LLM + DB integration patterns — all with production-ready code examples.


1. Relational DB Core: Advanced SQL

1.1 Window Functions

Window functions compute results across a set of rows related to the current row without collapsing them into a single group — indispensable for analytical queries.

-- Salary rank and rolling average per department
SELECT
    employee_id,
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
    SUM(salary) OVER (PARTITION BY department) AS dept_total,
    AVG(salary) OVER (
        PARTITION BY department
        ORDER BY hire_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS rolling_avg_3
FROM employees;

Key window functions:

  • ROW_NUMBER(): Unique sequential number with no ties
  • RANK(): Tied rows get the same rank; next rank skips numbers
  • DENSE_RANK(): Tied rows get the same rank; next rank is consecutive
  • LAG() / LEAD(): Access values from previous / next rows
  • FIRST_VALUE() / LAST_VALUE(): First / last value within the window frame

-- Month-over-month revenue growth rate
SELECT
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
    ROUND(
        (revenue - LAG(revenue, 1) OVER (ORDER BY month)) * 100.0
        / NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0),
        2
    ) AS growth_rate_pct
FROM monthly_sales;
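
Because SQLite (3.25+) also supports window functions, the difference between the three ranking functions is easy to verify locally; the table and salary values below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("alice", 300), ("bob", 300), ("carol", 200)])

# alice and bob tie on salary: watch how each function numbers them
rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk
    FROM emp
    ORDER BY rn
""").fetchall()
for row in rows:
    print(row)  # carol gets rn=3, rank=3 (gap after the tie), dense_rank=2 (no gap)
```

The tied rows both receive rank 1 and dense rank 1; only ROW_NUMBER forces an arbitrary order between them.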

1.2 CTEs (Common Table Expressions)

CTEs let you break complex queries into named, readable steps. Recursive CTEs are especially powerful for hierarchical traversal.

-- Recursive CTE for org chart traversal
WITH RECURSIVE org_tree AS (
    -- Base case: top-level employees
    SELECT
        employee_id,
        name,
        manager_id,
        0 AS depth,
        name::TEXT AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case
    SELECT
        e.employee_id,
        e.name,
        e.manager_id,
        ot.depth + 1,
        ot.path || ' > ' || e.name
    FROM employees e
    INNER JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT employee_id, name, depth, path
FROM org_tree
ORDER BY path;

-- Multi-step analysis broken into readable CTEs
WITH
top_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    WHERE created_at >= NOW() - INTERVAL '90 days'
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
),
customer_details AS (
    SELECT c.*, tc.total_spent
    FROM customers c
    JOIN top_customers tc ON c.id = tc.customer_id
),
ranked AS (
    SELECT *,
           NTILE(4) OVER (ORDER BY total_spent DESC) AS quartile
    FROM customer_details
)
SELECT * FROM ranked WHERE quartile = 1;
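
Recursive CTEs are not Postgres-specific; the same org-chart traversal runs in SQLite, which makes it easy to experiment (the three-person org below is a toy example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INT, name TEXT, manager_id INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "ceo", None), (2, "vp", 1), (3, "eng", 2),
])

rows = conn.execute("""
    WITH RECURSIVE org_tree AS (
        -- base case: top-level employees
        SELECT employee_id, name, 0 AS depth, name AS path
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- recursive case: walk down the management chain
        SELECT e.employee_id, e.name, ot.depth + 1, ot.path || ' > ' || e.name
        FROM employees e JOIN org_tree ot ON e.manager_id = ot.employee_id
    )
    SELECT name, depth, path FROM org_tree ORDER BY path
""").fetchall()
print(rows)
```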

1.3 Reading EXPLAIN ANALYZE

Understanding the query execution plan is the single most powerful skill for diagnosing performance problems.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2025-01-01'
GROUP BY u.id, u.name
ORDER BY order_count DESC
LIMIT 10;

Sample plan output:

Limit  (cost=1234.56..1234.57 rows=10 width=72) (actual time=45.123..45.125 rows=10 loops=1)
  ->  Sort  (cost=1234.56..1259.56 rows=10000 width=72) (actual time=45.120..45.121 rows=10 loops=1)
        Sort Key: (count(o.id)) DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  HashAggregate  (cost=876.00..976.00 rows=10000 width=72) (actual time=38.456..42.234 rows=8523 loops=1)
              Group Key: u.id
              ->  Hash Left Join  (cost=345.00..801.00 rows=15000 width=40)
                    Hash Cond: (o.user_id = u.id)
                    ->  Seq Scan on orders o  (cost=0.00..312.00 rows=15000 width=16)
                    ->  Hash  (cost=270.00..270.00 rows=6000 width=32)
                          ->  Index Scan using idx_users_created_at on users u
Planning Time: 1.234 ms
Execution Time: 45.456 ms

Key signals to watch:

  • Seq Scan on a large table: investigate whether an index would help
  • Large gap between estimated and actual rows: run ANALYZE to refresh statistics
  • Buffers: shared read far exceeding shared hit signals a low cache hit ratio; consider shared_buffers tuning
  • High loops on nested loop joins: consider a hash join or better index

1.4 Index Design

-- Composite index: equality-filtered columns first, then the range/sort column
CREATE INDEX idx_orders_user_status_date
ON orders (user_id, status, created_at DESC);

-- Partial index: only index rows meeting a condition
CREATE INDEX idx_active_users
ON users (email)
WHERE deleted_at IS NULL AND status = 'active';

-- Expression index: index the result of a function
CREATE INDEX idx_users_lower_email
ON users (LOWER(email));

-- BRIN index: very small, ideal for naturally ordered time-series
CREATE INDEX idx_logs_timestamp_brin
ON application_logs USING BRIN (created_at);

-- GIN index: arrays, JSONB, full-text search
CREATE INDEX idx_products_tags_gin
ON products USING GIN (tags);
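
Expression and partial indexes are not Postgres-only; SQLite supports both, so the LOWER(email) trick can be sanity-checked locally by inspecting the query plan (the index name here is just for this sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# expression index: the query must use the same expression verbatim
conn.execute("CREATE INDEX idx_lower_email ON users (lower(email))")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE lower(email) = ?",
    ("a@b.com",),
).fetchall()
for row in plan:
    print(row)  # the plan detail should mention idx_lower_email
```

A query written as `WHERE email = ?` would not use this index; the planner matches the indexed expression structurally.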

1.5 Transactions and ACID

BEGIN;
  -- SET TRANSACTION must run inside the transaction block, before any query
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  -- Atomic bank transfer
  UPDATE accounts SET balance = balance - 500 WHERE id = 1;
  UPDATE accounts SET balance = balance + 500 WHERE id = 2;

  -- Guard clause
  DO $$
  DECLARE
    bal NUMERIC;
  BEGIN
    SELECT balance INTO bal FROM accounts WHERE id = 1;
    IF bal < 0 THEN
      RAISE EXCEPTION 'Insufficient funds';
    END IF;
  END $$;

COMMIT;

Isolation level anomalies:

| Isolation Level  | Dirty Read | Non-repeatable Read | Phantom Read |
|------------------|------------|---------------------|--------------|
| READ UNCOMMITTED | Possible   | Possible            | Possible     |
| READ COMMITTED   | Prevented  | Possible            | Possible     |
| REPEATABLE READ  | Prevented  | Prevented           | Possible     |
| SERIALIZABLE     | Prevented  | Prevented           | Prevented    |

2. PostgreSQL in Practice

2.1 JSONB and Semi-Structured Data

PostgreSQL's JSONB stores JSON in a parsed binary format enabling indexing and fast operators.

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}'
);

CREATE INDEX idx_products_metadata ON products USING GIN (metadata);

-- Containment check (@>), key existence (?), path extraction (->>)
SELECT * FROM products
WHERE metadata @> '{"category": "electronics", "in_stock": true}';

SELECT
    name,
    metadata->>'brand' AS brand,
    (metadata->>'price')::NUMERIC AS price,
    metadata->'specs'->>'cpu' AS cpu
FROM products
WHERE metadata ? 'discount_pct'
  AND (metadata->>'discount_pct')::NUMERIC > 10;

-- Update a single key without replacing the whole document
UPDATE products
SET metadata = jsonb_set(metadata, '{price}', '29900'::jsonb)
WHERE id = 42;
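
If you want to sanity-check what the `@>` containment operator matches, its semantics can be modeled in plain Python. This is a simplified sketch (it mirrors object and array containment but ignores some JSONB edge cases, such as a scalar being contained in a top-level array):

```python
def jsonb_contains(left, right):
    """Simplified model of PostgreSQL's JSONB @> containment operator."""
    if isinstance(right, dict):
        # every key/value on the right must be contained in the left object
        return (isinstance(left, dict) and
                all(k in left and jsonb_contains(left[k], v)
                    for k, v in right.items()))
    if isinstance(right, list):
        # every right element must be contained in some left element
        return (isinstance(left, list) and
                all(any(jsonb_contains(l, r) for l in left) for r in right))
    return left == right  # scalars: exact match

doc = {"category": "electronics", "in_stock": True, "specs": {"cpu": "m3"}}
print(jsonb_contains(doc, {"category": "electronics", "in_stock": True}))  # True
print(jsonb_contains(doc, {"specs": {"cpu": "m3"}}))                       # True
print(jsonb_contains(doc, {"category": "toys"}))                           # False
```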

2.2 Table Partitioning

-- Range partitioning for time-series data
CREATE TABLE events (
    id BIGSERIAL,
    user_id INT,
    event_type TEXT,
    payload JSONB,
    created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

CREATE OR REPLACE FUNCTION create_monthly_partition(target_date DATE)
RETURNS VOID AS $$
DECLARE
    partition_name TEXT;
    start_date DATE;
    end_date DATE;
BEGIN
    start_date := DATE_TRUNC('month', target_date);
    end_date := start_date + INTERVAL '1 month';
    partition_name := 'events_' || TO_CHAR(start_date, 'YYYY_MM');

    EXECUTE FORMAT(
        'CREATE TABLE IF NOT EXISTS %I PARTITION OF events
         FOR VALUES FROM (%L) TO (%L)',
        partition_name, start_date, end_date
    );
END;
$$ LANGUAGE plpgsql;

SELECT create_monthly_partition('2026-03-01');

2.3 Logical Replication

-- On the primary: enable logical WAL and create a publication
ALTER SYSTEM SET wal_level = logical;  -- takes effect after a server restart

CREATE PUBLICATION app_publication
FOR TABLE users, orders, products
WITH (publish = 'insert, update, delete');

-- On the replica: subscribe
CREATE SUBSCRIPTION app_subscription
CONNECTION 'host=primary-db port=5432 dbname=myapp user=replicator'
PUBLICATION app_publication;

-- Monitor replication lag
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;

3. NoSQL Databases

3.1 Redis: Caching Patterns

import redis
import json
import hashlib
import time
from typing import Optional

r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)
# `db` in the functions below stands in for your relational data-access layer

# Cache-Aside (Lazy Loading)
def get_user_profile(user_id: int) -> dict:
    cache_key = f"user:profile:{user_id}"

    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)

    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    r.setex(cache_key, 3600, json.dumps(user))
    return user

# Write-Through: update cache synchronously on every write
def update_user_profile(user_id: int, data: dict) -> None:
    cache_key = f"user:profile:{user_id}"
    db.execute("UPDATE users SET ... WHERE id = %s", user_id)
    updated = get_user_from_db(user_id)
    r.setex(cache_key, 3600, json.dumps(updated))

# Write-Behind (Write-Back): write to cache immediately, flush DB asynchronously
class WriteBehindCache:
    def __init__(self):
        self.dirty_keys_set = "cache:dirty_keys"

    def write(self, key: str, value: dict, ttl: int = 3600):
        r.setex(key, ttl, json.dumps(value))
        r.sadd(self.dirty_keys_set, key)

    def flush_to_db(self):
        dirty_keys = r.smembers(self.dirty_keys_set)
        for key in dirty_keys:
            data = r.get(key)
            if data:
                db.upsert(json.loads(data))
                r.srem(self.dirty_keys_set, key)

# Distributed lock: single-instance SET NX EX pattern
# (the full Redlock algorithm acquires the lock on a majority of independent Redis nodes)
def acquire_lock(lock_name: str, timeout: int = 10) -> Optional[str]:
    identifier = str(time.time())  # in production, prefer a UUID for uniqueness
    acquired = r.set(f"lock:{lock_name}", identifier, nx=True, ex=timeout)
    return identifier if acquired else None

def release_lock(lock_name: str, identifier: str) -> bool:
    lua = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    return bool(r.eval(lua, 1, f"lock:{lock_name}", identifier))

Additional Redis data structures:

# Sorted Set: real-time leaderboard
def update_score(player: str, score: int):
    r.zadd("leaderboard", {player: score})

def get_top_players(n: int = 10):
    return r.zrevrange("leaderboard", 0, n - 1, withscores=True)

# HyperLogLog: approximate unique visitor count (very memory-efficient)
def track_visitor(page: str, user_id: str):
    r.pfadd(f"visitors:{page}", user_id)

def get_unique_visitors(page: str) -> int:
    return r.pfcount(f"visitors:{page}")

3.2 MongoDB: Document Modeling

from pymongo import MongoClient
from datetime import datetime

client = MongoClient('mongodb://localhost:27017/')
db = client['ecommerce']

# Embedded documents vs references
# Embed when data is frequently read together and has limited growth
sample_order = {
    "_id": "order_12345",
    "user_id": "user_67890",
    "status": "shipped",
    "created_at": datetime.utcnow(),
    "shipping_address": {           # Embedded: snapshot at time of order
        "street": "123 Main St",
        "city": "Seoul",
        "zip": "06234"
    },
    "items": [                      # Embedded: price snapshot preserved
        {"product_id": "p001", "name": "Laptop", "price": 1200, "qty": 1},
        {"product_id": "p002", "name": "Mouse",  "price":   35, "qty": 2}
    ],
    "total": 1270
}

# Aggregation Pipeline
pipeline = [
    {"$match": {
        "created_at": {"$gte": datetime(2026, 2, 17)},
        "status": {"$in": ["delivered", "shipped"]}
    }},
    {"$unwind": "$items"},
    {"$group": {
        "_id": "$items.product_id",
        "product_name": {"$first": "$items.name"},
        "total_qty":     {"$sum": "$items.qty"},
        "total_revenue": {"$sum": {"$multiply": ["$items.price", "$items.qty"]}}
    }},
    {"$sort": {"total_revenue": -1}},
    {"$limit": 10},
    {"$project": {
        "product_id":    "$_id",
        "product_name":  1,
        "total_qty":     1,
        "total_revenue": 1,
        "_id": 0
    }}
]

top_products = list(db['orders'].aggregate(pipeline))
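
The $unwind/$group logic above can be cross-checked in plain Python, which is handy for unit-testing pipeline logic without a running MongoDB; the two toy orders here are illustrative:

```python
from collections import defaultdict

orders = [
    {"items": [{"product_id": "p001", "name": "Laptop", "price": 1200, "qty": 1},
               {"product_id": "p002", "name": "Mouse", "price": 35, "qty": 2}]},
    {"items": [{"product_id": "p002", "name": "Mouse", "price": 35, "qty": 1}]},
]

# $unwind + $group equivalent: one record per item, accumulated per product_id
totals = defaultdict(lambda: {"total_qty": 0, "total_revenue": 0})
for order in orders:
    for item in order["items"]:
        t = totals[item["product_id"]]
        t["product_name"] = item["name"]                    # $first (identical here)
        t["total_qty"] += item["qty"]                       # $sum: $items.qty
        t["total_revenue"] += item["price"] * item["qty"]   # $sum: price * qty

# $sort by total_revenue descending
top = sorted(totals.items(), key=lambda kv: kv[1]["total_revenue"], reverse=True)
print(top[0][0], top[0][1]["total_revenue"])  # p001 1200
```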

3.3 Cassandra: Wide-Column Design

Cassandra forces you to design tables around query patterns, not entity relationships.

-- Query-first table design
CREATE TABLE user_timeline (
    user_id  UUID,
    created_at TIMEUUID,
    post_id  UUID,
    content  TEXT,
    PRIMARY KEY (user_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 7
  };

-- Denormalized table for tag-based search (separate query pattern)
CREATE TABLE posts_by_tag (
    tag        TEXT,
    created_at TIMEUUID,
    post_id    UUID,
    user_id    UUID,
    title      TEXT,
    PRIMARY KEY (tag, created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC);

4. Vector Databases: The AI-Era Core

4.1 pgvector: Vector Search Inside PostgreSQL

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_embeddings (
    id         BIGSERIAL PRIMARY KEY,
    content    TEXT NOT NULL,
    metadata   JSONB DEFAULT '{}',
    embedding  vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- HNSW index for fast approximate nearest-neighbor search
CREATE INDEX idx_doc_hnsw
ON document_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Cosine similarity search (lower distance = more similar)
SELECT
    id,
    content,
    metadata,
    1 - (embedding <=> '[0.1, 0.2, ...]'::vector) AS similarity
FROM document_embeddings
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 5;

-- Hybrid: metadata filter + vector search (query_vec stands for a bound query-vector parameter)
SELECT
    id,
    content,
    metadata->>'source' AS source,
    1 - (embedding <=> query_vec) AS similarity
FROM document_embeddings
WHERE metadata->>'language' = 'en'
  AND metadata->>'category' = 'technical'
ORDER BY embedding <=> query_vec
LIMIT 10;

Distance operators in pgvector:

  • <=>: Cosine distance — best for text embeddings
  • <->: L2 (Euclidean) distance — image feature vectors
  • <#>: Negative inner product — for unit-normalized vectors, it ranks results identically to cosine distance
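
The claim that inner product ranks the same as cosine on normalized vectors is easy to verify numerically; the vectors below are arbitrary test data:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def neg_inner_product(a, b):  # what pgvector's <#> computes
    return -sum(x * y for x, y in zip(a, b))

query = normalize([0.3, 0.9, 0.1])
docs = [normalize(v) for v in ([1, 0, 0], [0.2, 0.8, 0.1], [0.5, 0.5, 0.5])]

# on unit vectors, cosine distance is 1 - dot and <#> is -dot:
# both are monotone in the dot product, so the ranking is identical
by_cosine = sorted(range(3), key=lambda i: cosine_distance(query, docs[i]))
by_ip     = sorted(range(3), key=lambda i: neg_inner_product(query, docs[i]))
print(by_cosine == by_ip)  # True
```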

4.2 IVFFlat vs HNSW

-- IVFFlat: faster build, lower memory, good for large initial datasets
CREATE INDEX idx_ivfflat
ON document_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);        -- rule of thumb: rows/1000 up to ~1M rows, sqrt(rows) beyond

SET ivfflat.probes = 10;   -- more probes = higher recall, slower query

-- HNSW: higher recall, faster queries, more memory / slower build
CREATE INDEX idx_hnsw
ON document_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

SET hnsw.ef_search = 40;   -- higher = more accurate, slower

4.3 Choosing a Vector Database

| Property      | pgvector          | Pinecone         | Weaviate             | Milvus              |
|---------------|-------------------|------------------|----------------------|---------------------|
| Deployment    | PG extension      | SaaS             | Self-hosted / cloud  | Self-hosted / cloud |
| Filtering     | Full SQL          | Metadata filters | GraphQL + filters    | Rich filters        |
| Scale         | Millions          | Billions+        | Hundreds of millions | Billions+           |
| Cost          | PG infra only     | Usage-based      | Open-source free     | Open-source free    |
| Hybrid search | BM25 + vector     | Built-in         | BM25 + vector native | Rich support        |
| Best for      | Existing PG stack | Fast prototypes  | Knowledge graphs     | Massive scale       |

5. Distributed Databases: Theory and Practice

5.1 CAP Theorem

The CAP theorem states that a distributed system can satisfy at most two of:

  • C (Consistency): Every node returns the latest data
  • A (Availability): Every request gets a response
  • P (Partition Tolerance): The system keeps working despite network splits

Because network partitions are inevitable in distributed systems, the real choice is between CP and AP.

CP systems (consistency over availability):

  • Refuse requests during a partition to preserve correctness
  • Examples: ZooKeeper, HBase, MongoDB with w: majority
  • Use cases: financial transactions, inventory management

AP systems (availability over consistency):

  • Return possibly stale data during a partition
  • Examples: DynamoDB, Cassandra, CouchDB
  • Use cases: social feeds, DNS, notification systems

5.2 Consistency Models

Strong Consistency        (strongest guarantees; high latency, low throughput)
Sequential Consistency
Causal Consistency
Eventual Consistency      (weakest guarantees; low latency, high throughput)

Session guarantees such as Monotonic Reads (a client never observes older data after it has seen newer data) apply per client session and cut across this spectrum.

5.3 Sharding Strategies

# Range sharding: consecutive ID ranges per shard
def range_shard(user_id: int, num_shards: int = 4) -> int:
    shard_size = 250_000_000
    return min(user_id // shard_size, num_shards - 1)

# Hash sharding: uniform distribution, prevents hotspots
import hashlib

def hash_shard(key: str, num_shards: int = 8) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_shards

# Consistent hashing: minimal rehashing when nodes are added/removed
import bisect

class ConsistentHashRing:
    def __init__(self, nodes: list, replicas: int = 150):
        self.replicas = replicas
        self.ring: dict = {}
        self.sorted_keys: list = []
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str):
        for i in range(self.replicas):
            key = self._hash(f"{node}:{i}")
            self.ring[key] = node
            bisect.insort(self.sorted_keys, key)

    def get_node(self, key: str) -> str | None:
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
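
To see what consistent hashing buys you: with plain hash-mod sharding, changing the shard count remaps most keys, while the ring moves only roughly 1/N of them. A quick standalone measurement of the mod-N scheme (re-defining hash_shard so the snippet is self-contained):

```python
import hashlib

def hash_shard(key: str, num_shards: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

keys = [f"user:{i}" for i in range(10_000)]
# count keys whose shard assignment changes when we grow from 8 to 9 shards
moved = sum(1 for k in keys if hash_shard(k, 8) != hash_shard(k, 9))
print(f"{moved / len(keys):.0%} of keys remapped going from 8 to 9 shards")
```

The expected remap fraction for mod-N is about (N-1)/N, close to 89% here, which is exactly the cache-invalidation storm consistent hashing avoids.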

6. Data Modeling

6.1 Normalization vs Denormalization

Normalization (through 3NF):

  • Eliminates redundancy and update anomalies
  • Efficient writes: each fact is stored and updated in exactly one place
  • Requires JOINs for reads

Denormalization:

  • Optimizes read performance at the cost of redundancy
  • Well-suited for OLAP and data warehouses

-- Normalized schema
CREATE TABLE categories (id SERIAL PRIMARY KEY, name TEXT);
CREATE TABLE products (
    id          SERIAL PRIMARY KEY,
    name        TEXT,
    category_id INT REFERENCES categories(id)
);

-- Denormalized schema (eliminates the JOIN)
CREATE TABLE products_denormalized (
    id            SERIAL PRIMARY KEY,
    name          TEXT,
    category_id   INT,
    category_name TEXT   -- duplicated to avoid JOIN
);

6.2 Star Schema for Data Warehouses

-- Fact table (measurements / events)
CREATE TABLE fact_sales (
    sale_id      BIGINT,
    date_key     INT,
    product_key  INT,
    customer_key INT,
    store_key    INT,
    quantity     INT,
    unit_price   DECIMAL(10,2),
    total_amount DECIMAL(10,2)
);

-- Dimension tables (descriptive context)
CREATE TABLE dim_date (
    date_key    INT PRIMARY KEY,
    full_date   DATE,
    year        INT, quarter INT, month INT, week INT, day_of_week INT
);

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id  TEXT,
    name        TEXT, category TEXT, brand TEXT, unit_cost DECIMAL(10,2)
);

7. AI Integration: LLM + DB Patterns

7.1 Building a RAG System with LangChain + pgvector

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_postgres import PGVector
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain.schema import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

CONNECTION_STRING = "postgresql+psycopg://user:password@localhost:5432/vectordb"

vector_store = PGVector(
    embeddings=embeddings,
    collection_name="documents",
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

# Ingest: chunk, embed, and store
def ingest_documents(file_path: str, metadata: dict):
    with open(file_path, 'r', encoding='utf-8') as f:
        raw_text = f.read()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        separators=["\n\n", "\n", ".", " "]
    )
    chunks = splitter.split_text(raw_text)
    documents = [
        Document(page_content=chunk, metadata={**metadata, "chunk_index": i})
        for i, chunk in enumerate(chunks)
    ]
    return vector_store.add_documents(documents)

# Build the RAG chain
def build_rag_chain(k: int = 5, score_threshold: float = 0.7):
    retriever = vector_store.as_retriever(
        search_type="similarity_score_threshold",
        search_kwargs={"k": k, "score_threshold": score_threshold}
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )

rag_chain = build_rag_chain()
result = rag_chain.invoke({"query": "How does PostgreSQL MVCC handle concurrency?"})
print(result['result'])

7.2 Hybrid Search: BM25 + Vector via RRF

from sqlalchemy import text

def hybrid_search(
    query: str,
    query_embedding: list,
    k: int = 10,
    alpha: float = 0.5,   # 0 = BM25 only, 1 = vector only
) -> list:
    sql = text("""
    WITH
    vector_results AS (
        SELECT id, content, metadata,
               ROW_NUMBER() OVER (ORDER BY embedding <=> :embedding) AS rank
        FROM document_embeddings
        ORDER BY embedding <=> :embedding
        LIMIT :k
    ),
    bm25_results AS (
        SELECT id, content, metadata,
               ROW_NUMBER() OVER (ORDER BY ts_rank(
                   to_tsvector('english', content),
                   plainto_tsquery('english', :query)
               ) DESC) AS rank
        FROM document_embeddings
        WHERE to_tsvector('english', content) @@ plainto_tsquery('english', :query)
        LIMIT :k
    ),
    rrf AS (
        SELECT
            COALESCE(v.id, b.id)           AS id,
            COALESCE(v.content, b.content) AS content,
            COALESCE(v.metadata, b.metadata) AS metadata,
            COALESCE(1.0 / (60 + v.rank), 0) * :alpha +
            COALESCE(1.0 / (60 + b.rank), 0) * (1 - :alpha) AS score
        FROM vector_results v
        FULL OUTER JOIN bm25_results b ON v.id = b.id
    )
    SELECT id, content, metadata, score
    FROM rrf
    ORDER BY score DESC
    LIMIT :k
    """)

    with engine.connect() as conn:  # engine: a SQLAlchemy engine created elsewhere
        rows = conn.execute(sql, {
            "embedding": str(query_embedding),
            "query": query,
            "k": k,
            "alpha": alpha,
        })
        return [dict(row._mapping) for row in rows]  # SQLAlchemy 1.4+ row access
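
Reciprocal Rank Fusion itself is just arithmetic; a pure-Python version of the scoring used in the SQL above (same 1/(60 + rank) form, with the alpha weighting) is useful for unit tests:

```python
def rrf_fuse(vector_ranked: list, bm25_ranked: list,
             alpha: float = 0.5, k_const: int = 60) -> list:
    """Fuse two ranked lists of doc IDs with weighted Reciprocal Rank Fusion."""
    scores: dict = {}
    for rank, doc_id in enumerate(vector_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k_const + rank)
    for rank, doc_id in enumerate(bm25_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k_const + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears in both lists, so its two reciprocal-rank contributions
# outweigh documents that only one retriever found
print(rrf_fuse(["a", "b", "c"], ["b", "d"]))  # ['b', 'a', 'd', 'c']
```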

7.3 Embedding Cache with Redis

import hashlib, json
from typing import Optional

class EmbeddingCache:
    def __init__(self, redis_client, ttl: int = 86400 * 7):
        self.redis = redis_client
        self.ttl = ttl

    def _key(self, text: str, model: str) -> str:
        h = hashlib.sha256(f"{model}:{text}".encode()).hexdigest()
        return f"embedding:{h}"

    def get(self, text: str, model: str) -> Optional[list]:
        raw = self.redis.get(self._key(text, model))
        return json.loads(raw) if raw else None

    def set(self, text: str, model: str, embedding: list):
        self.redis.setex(self._key(text, model), self.ttl, json.dumps(embedding))

    def get_or_compute(self, text: str, model: str, compute_fn) -> list:
        cached = self.get(text, model)
        if cached:
            return cached
        embedding = compute_fn(text)
        self.set(text, model, embedding)
        return embedding

Quiz: Test Your Knowledge

Q1. When should you use a B-Tree index versus a Hash index?

Answer: B-Tree is best for range queries, sorting, and prefix LIKE patterns. Hash indexes only support equality comparisons (=).

Explanation: B-Tree stores keys in a sorted tree, making it ideal for WHERE age > 30, ORDER BY name, and BETWEEN operations. Hash indexes compute a hash of the key for O(1) point lookups (WHERE id = 42) but are completely useless for range queries or sorting. In PostgreSQL, Hash indexes gained full WAL logging support in v10, making them production-safe.

Q2. How does PostgreSQL MVCC handle concurrent reads and writes?

Answer: Each transaction takes a snapshot at its start time and only sees row versions that were committed before that snapshot. Readers never block writers and writers never block readers.

Explanation: When a row is updated, PostgreSQL writes a new version of the row tagged with xmin and xmax transaction IDs rather than overwriting the old version. Each transaction uses its own xid snapshot to determine which row version is visible. Old versions (dead tuples) are cleaned up by the VACUUM background process. This is why PostgreSQL can achieve very high read concurrency with minimal locking.

Q3. What are the trade-offs between CP and AP systems in the CAP theorem?

Answer: CP systems sacrifice availability during a network partition to preserve data consistency. AP systems stay available during a partition but may return stale data, offering only eventual consistency.

Explanation: A CP system like ZooKeeper will refuse requests or return errors during a partition, ensuring every response reflects the latest committed state — critical for distributed coordination, financial ledgers, and inventory. An AP system like Cassandra will always return a response even if some nodes are unreachable, but different clients might temporarily see different versions of data. The right choice depends on whether inconsistency or downtime is the bigger risk for your use case.

Q4. Why does HNSW outperform brute-force KNN for approximate nearest-neighbor search in vector databases?

Answer: HNSW builds a hierarchical small-world graph where each node connects to a small number of neighbors. Search navigates from coarse upper layers to fine-grained lower layers in O(log N) time, achieving high recall with far fewer distance computations than brute-force O(N).

Explanation: Exact KNN over millions of vectors requires computing the distance to every vector — prohibitively slow at scale. HNSW (Hierarchical Navigable Small World) constructs multiple graph layers: upper layers enable fast long-range navigation while the bottom layer provides precise local search. The build parameters m (edges per node) and ef_construction (beam width during construction) determine index quality. At query time, ef_search controls the recall/speed trade-off. For most production workloads, HNSW achieves 95%+ recall with millisecond latency.

Q5. What is the difference between write-through and write-behind caching in Redis?

Answer: Write-through updates the cache and the database synchronously on every write, guaranteeing consistency at the cost of write latency. Write-behind updates the cache immediately and flushes dirty entries to the database asynchronously, improving throughput at the risk of data loss on failure.

Explanation: In write-through, every write goes to both the cache and the DB before acknowledging the client, so the cache is never stale but writes are as slow as the DB. In write-behind (write-back), the application writes only to the cache and a background process batches DB writes later. This gives much lower write latency and higher throughput, but if the cache node crashes before flushing, those writes are lost. Write-behind is appropriate for high-frequency, loss-tolerant workloads like shopping carts, game state, and analytics counters.


Conclusion

Database engineering extends far beyond writing SQL. It encompasses data modeling, index strategy, distributed systems theory, and now — increasingly — vector search and AI pipeline integration. In the AI era, the engineer who can design a schema for embedding storage, tune HNSW parameters, implement a Redis caching layer, and wire everything into a LangChain RAG pipeline holds a rare and valuable skill set.

Start with pgvector for vector search inside your existing PostgreSQL infrastructure, add Redis for caching hot data, and apply the principles of proper index design and query plan analysis to keep everything fast. The foundations in this guide will serve you well regardless of which specific tools the ecosystem moves to next.