ANN Algorithms Complete Guide 2025: HNSW, IVF, Product Quantization, LSH — How Vector DBs Really Work
Introduction: The "Most Similar" Problem
A New Essential Skill in the LLM Era
Since ChatGPT (2022), RAG (Retrieval-Augmented Generation) has become core to LLM apps:
- Embed documents (convert to vectors).
- Embed the user's question.
- Search for the most similar documents.
- Feed results to the LLM as context.
Step 3 is similarity search: finding the top-K nearest vectors out of millions or billions.
The Trap of the Naive Approach
The simplest approach: compute all distances, then sort.
def naive_knn(query, vectors, k):
    # O(N * d): one distance per stored vector, then a full sort
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(enumerate(distances), key=lambda x: x[1])[:k]
- 100K vectors: 0.1s (fine).
- 1M vectors: 1s (slow).
- 100M vectors: tens of seconds (unacceptable).
- 1,000 dims x 100M: minutes (catastrophic).
Real search engines, recommenders, and RAG pipelines need answers in tens of milliseconds, not seconds. A fundamentally different approach is needed.
Trading Accuracy
The key insight:
"Close enough K neighbors" is usually sufficient. Exact K-NN is not required.
Allowing 10% error yields 1000x faster search. That is the core of ANN (Approximate Nearest Neighbor).
What We Cover
- Vector similarity: cosine, L2, inner product.
- LSH: the oldest ANN algorithm.
- IVF: clustering-based search.
- Product Quantization: massive memory savings.
- HNSW: current graph-based standard.
- DiskANN, ScaNN: the latest.
- Practical selection and tuning.
1. Vector Similarity Basics
Distance Functions
L2 (Euclidean):
d(x, y) = sqrt(sum((x_i - y_i)^2))
Cosine distance:
cos(x, y) = (x . y) / (|x| * |y|)
distance = 1 - cos(x, y)
Inner product:
d(x, y) = -x . y (negative: larger = more similar)
Normalized Vectors
If vectors are normalized to unit length:
- x . y = cos(x, y) (inner product equals cosine).
- |x - y|^2 = 2 - 2 * cos(x, y) (L2 ordering is equivalent to cosine ordering).
So normalization makes L2, cosine, and inner product equivalent. Most embedding models (OpenAI, Sentence-BERT) return normalized vectors for this reason.
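These identities are easy to check numerically. A small sketch with NumPy, using random vectors in place of real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
y = rng.standard_normal(512)
x /= np.linalg.norm(x)   # normalize to unit length
y /= np.linalg.norm(y)

cos = x @ y                    # for unit vectors, inner product == cosine
l2_sq = np.sum((x - y) ** 2)   # squared Euclidean distance

# identity for unit vectors: |x - y|^2 = 2 - 2 * cos(x, y)
print(np.isclose(l2_sq, 2 - 2 * cos))  # True
```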
The Curse of Dimensionality
In high-dim space, odd things happen:
- All points become nearly equidistant.
- The spread of distances shrinks relative to their mean.
- Intuition from 2D/3D fails.
Real data lives on a low-dim manifold, which ANN exploits.
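A quick numerical illustration of distance concentration, using random Gaussian data (roughly the worst case; real data on a manifold concentrates less):

```python
import numpy as np

rng = np.random.default_rng(42)
ratios = []
for dim in (2, 100, 10_000):
    points = rng.standard_normal((500, dim))
    q = rng.standard_normal(dim)
    d = np.linalg.norm(points - q, axis=1)
    ratios.append(d.std() / d.mean())  # relative spread of distances

# the spread-to-mean ratio shrinks as dimension grows:
# nearest and farthest points become almost equally far away
print([round(float(r), 3) for r in ratios])
```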
Recall Definition
Quality metric: Recall@k.
"How many of the true top-k did we find?"
recall@10 = |found top-10 that are truly top-10| / 10
- 1.0 = perfect, 0.9 = most users happy, 0.5 = bad.
Target: recall 0.95+ with optimized speed and memory.
2. LSH: Locality-Sensitive Hashing
The Idea
LSH (1998) is one of the earliest ANN methods:
"Put similar vectors in the same bucket, and search only that bucket at query time."
Regular hashes give different outputs for similar inputs. LSH is the opposite: similar inputs are more likely to hash the same.
Random Projection LSH (for cosine)
import numpy as np

class CosineLSH:
    def __init__(self, dim, num_hashes=32):
        self.dim = dim
        self.num_hashes = num_hashes
        # one random hyperplane per hash bit
        self.projections = np.random.randn(num_hashes, dim)

    def hash(self, vector):
        signs = np.sign(self.projections @ vector)
        return tuple((signs > 0).astype(int))
Collision probability:
P(hash_i(x) == hash_i(y)) = 1 - theta(x, y) / pi (per bit, where theta is the angle between x and y)
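This collision probability is easy to verify empirically. A sketch that samples random hyperplanes (the dimension and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, trials = 64, 20_000
x = rng.standard_normal(dim)
y = rng.standard_normal(dim)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# a collision = both vectors fall on the same side of a random hyperplane
planes = rng.standard_normal((trials, dim))
empirical = np.mean(np.sign(planes @ x) == np.sign(planes @ y))
theoretical = 1 - theta / np.pi
print(empirical, theoretical)  # the two values should be very close
```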
Building an LSH Index
from collections import defaultdict

import numpy as np

def cosine_distance(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

class LSHIndex:
    def __init__(self, dim, num_tables=10, num_hashes_per_table=16):
        # several independent tables raise the chance that a true
        # neighbor collides with the query in at least one of them
        self.tables = []
        for _ in range(num_tables):
            lsh = CosineLSH(dim, num_hashes_per_table)
            self.tables.append((lsh, defaultdict(list)))

    def add(self, vector, id):
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            buckets[h].append((id, vector))

    def query(self, vector, k):
        candidates = {}
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            for id, v in buckets.get(h, []):
                candidates[id] = v   # dedupe across tables by id
        scored = [(id, cosine_distance(vector, v)) for id, v in candidates.items()]
        return sorted(scored, key=lambda x: x[1])[:k]
Pros: simple, fast insert, distributable, streaming-friendly. Cons: low recall (0.7 to 0.85), many tables needed, tuning is tricky.
LSH has been largely superseded by HNSW/IVF for most use cases, but still shines for massive distributed and streaming workloads.
3. IVF: Inverted File Index
The Idea
IVF is widely used in FAISS:
- Use k-means to split space into clusters.
- Each vector assigned to the nearest centroid.
- At query time, search only the nprobe closest clusters.
Query Flow
- Compare query q to all centroids.
- Pick the top nprobe clusters.
- Compare q precisely against vectors in those clusters.
- Return top-k.
import numpy as np

class IVFIndex:
    def __init__(self, dim, nlist=100):
        self.nlist = nlist
        self.centroids = None
        self.lists = [[] for _ in range(nlist)]

    def train(self, vectors):
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=self.nlist)
        kmeans.fit(vectors)
        self.centroids = kmeans.cluster_centers_
        # assign each vector to its nearest centroid's inverted list
        assignments = kmeans.predict(vectors)
        for i, c in enumerate(assignments):
            self.lists[c].append((i, vectors[i]))

    def query(self, q, k=10, nprobe=10):
        centroid_dists = np.linalg.norm(self.centroids - q, axis=1)
        top_clusters = np.argsort(centroid_dists)[:nprobe]
        candidates = []
        for c in top_clusters:
            for id, v in self.lists[c]:
                candidates.append((id, np.linalg.norm(q - v)))
        return sorted(candidates, key=lambda x: x[1])[:k]
Parameters
- nlist: cluster count, typically sqrt(N).
- nprobe: clusters checked per query; larger = more accurate but slower.
Pros: dozens of times faster than linear scan, simple, well-optimized in FAISS. Cons: needs training, rebalancing on updates, boundary vector errors.
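The nprobe trade-off can be seen in a self-contained sketch. To stay dependency-free it uses randomly chosen data points as centroids instead of trained k-means (an assumption; real IVF trains k-means), so the absolute recall numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, dim, nlist, k = 5_000, 32, 50, 10
vectors = rng.standard_normal((N, dim))
centroids = vectors[rng.choice(N, nlist, replace=False)]  # k-means stand-in
assign = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

q = rng.standard_normal(dim)
exact = set(np.argsort(np.linalg.norm(vectors - q, axis=1))[:k])

def ivf_search(nprobe):
    near = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    ids = np.where(np.isin(assign, near))[0]   # candidates in probed lists
    order = np.argsort(np.linalg.norm(vectors[ids] - q, axis=1))[:k]
    return set(ids[order])

for nprobe in (1, 5, 20, nlist):
    recall = len(ivf_search(nprobe) & exact) / k
    print(nprobe, recall)  # recall rises with nprobe; nprobe=nlist is exact
```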
4. Product Quantization: King of Memory Savings
The Problem
100M x 1024-dim float32 = 400 GB. Too big for memory, too slow from disk.
PQ Idea (Jegou et al., 2011)
- Split each vector into M subvectors (e.g. 1024 -> 8 x 128).
- Run k-means per subspace (e.g. 256 centroids).
- Replace each subvector with its nearest centroid index (1 byte).
- Each vector compressed to M bytes.
Original: 1024 * 4 = 4096 bytes
PQ: 8 bytes
Compression: 512x
Encoding
import numpy as np

class ProductQuantizer:
    def __init__(self, dim, M=8, nbits=8):
        self.dim = dim
        self.M = M               # number of subspaces
        self.nbits = nbits
        self.dsub = dim // M     # dimensions per subvector
        self.K = 2 ** nbits      # centroids per subspace
        self.codebooks = None

    def train(self, vectors):
        from sklearn.cluster import KMeans
        self.codebooks = np.zeros((self.M, self.K, self.dsub))
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            kmeans = KMeans(n_clusters=self.K)
            kmeans.fit(sub)
            self.codebooks[m] = kmeans.cluster_centers_

    def encode(self, vectors):
        codes = np.zeros((len(vectors), self.M), dtype=np.uint8)
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            dists = np.linalg.norm(
                sub[:, None, :] - self.codebooks[m][None, :, :],
                axis=2,
            )
            codes[:, m] = np.argmin(dists, axis=1)
        return codes
Asymmetric Distance Computation (ADC)
- Keep the query uncompressed.
- Precompute the distance from q to every centroid per subspace: a lookup table LUT[M, K].
- Distance to any compressed vector = sum of M lookups.
def query_adc(self, q, codes):
    # LUT[m, k]: squared distance from q's m-th subvector to centroid k
    lut = np.zeros((self.M, self.K))
    for m in range(self.M):
        q_sub = q[m*self.dsub:(m+1)*self.dsub]
        for k in range(self.K):
            lut[m, k] = np.sum((q_sub - self.codebooks[m, k])**2)
    distances = np.zeros(len(codes))
    for i, code in enumerate(codes):
        for m in range(self.M):
            distances[i] += lut[m, code[m]]
    return distances
Distance becomes M lookups + additions, not dim multiplications.
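A sketch verifying that ADC is exact with respect to the compressed representation: the LUT-sum equals the true squared distance between q and the decoded vector. Random codebooks stand in for trained ones here:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, dsub = 8, 256, 16                      # 128-dim vectors, 8 subspaces
codebooks = rng.standard_normal((M, K, dsub))
codes = rng.integers(0, K, size=(1000, M))   # 1000 compressed vectors
q = rng.standard_normal(M * dsub)

# LUT[m, k] = squared distance from q's m-th subvector to centroid k
q_sub = q.reshape(M, dsub)
lut = ((q_sub[:, None, :] - codebooks) ** 2).sum(axis=2)   # (M, K)

# ADC distance = sum of M table lookups per compressed vector
adc = lut[np.arange(M), codes].sum(axis=1)                 # (1000,)

# cross-check against the decoded (reconstructed) vectors
decoded = codebooks[np.arange(M), codes].reshape(len(codes), -1)
exact = ((decoded - q) ** 2).sum(axis=1)
print(np.allclose(adc, exact))  # True
```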
IVF-PQ
FAISS's famous combination:
- IVF to narrow to a cluster.
- PQ to compress inside each cluster.
- Query: search nprobe clusters via ADC.
Result: billions of vectors searchable in tens of GB of RAM with millisecond queries — the scale of Meta/Facebook image search.
import faiss

d = 128; nlist = 1024; m = 8; bits = 8
quantizer = faiss.IndexFlatL2(d)   # coarse quantizer for the IVF step
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index.train(training_vectors)
index.add(vectors)
index.nprobe = 32
distances, indices = index.search(queries, 10)   # top-10 per query
5. HNSW: The Graph-Based Revolution
The Idea
HNSW (Hierarchical Navigable Small World) was published in 2016 (Malkov and Yashunin). It is the current best ANN algorithm for most benchmarks and powers Pinecone, Weaviate, Qdrant, Milvus, and pgvector.
- Connect vectors in a graph.
- Each node links to nearby neighbors.
- Query by walking the graph.
Greedy Search and Local Minima
Greedy walk: move to the nearest neighbor until no closer neighbor exists. Prone to local minima — neighborhood looks far but a globally closer node exists elsewhere.
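A minimal sketch of a greedy walk on a flat k-NN graph (brute-force-built, which resembles HNSW's layer 0, not HNSW itself). The walk may stop at a local minimum, which is exactly what the hierarchy mitigates:

```python
import numpy as np

rng = np.random.default_rng(3)
N, dim = 500, 16
points = rng.standard_normal((N, dim))

# brute-force 8-NN graph (illustration only; real builds are incremental)
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
np.fill_diagonal(pairwise, np.inf)
graph = np.argsort(pairwise, axis=1)[:, :8]

def greedy_search(q, entry=0):
    curr = entry
    curr_d = np.linalg.norm(points[curr] - q)
    while True:
        neigh = graph[curr]
        nd = np.linalg.norm(points[neigh] - q, axis=1)
        best = int(nd.argmin())
        if nd[best] >= curr_d:
            return curr, curr_d   # stuck: possibly only a local minimum
        curr, curr_d = int(neigh[best]), nd[best]

q = rng.standard_normal(dim)
true_d = np.linalg.norm(points - q, axis=1).min()
_, found_d = greedy_search(q)
print(found_d >= true_d)  # greedy never beats the true NN; it may fall short
```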
Multi-Layer Structure
HNSW stacks multiple graph layers:
Layer 2 (sparse): [A --- E --- I]
| | |
Layer 1: [A B D E G I J]
| | | | | | |
Layer 0 (dense): [A B C D E F G H I J K L]
- Layer 0: all vectors, dense.
- Upper layers: random subsets, sparse — act as "highways".
Search Flow
- Start at top layer.
- Greedy-search that layer.
- Descend one layer.
- Repeat down to layer 0.
- Return top-k from layer 0.
Skip-list-like structure: upper = global moves, lower = fine-grained.
Insert
- Pick a random layer: L = floor(-log(uniform(0,1)) * mL).
- Greedy-search from the top layer down.
- Connect to M nearest neighbors on each layer.
- Prune if neighbor count exceeds M_max (heuristic, keep diversity).
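The layer-sampling formula can be checked directly; with M = 16, roughly 94% of nodes (1 - 1/M) should land on layer 0. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 16
mL = 1.0 / np.log(M)

# L = floor(-log(uniform(0,1)) * mL): exponentially fewer nodes per layer up
layers = np.floor(-np.log(rng.random(100_000)) * mL).astype(int)
for level in range(int(layers.max()) + 1):
    print(level, round(float(np.mean(layers == level)), 4))
```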
Key Parameters
- M parameter: max neighbors per node (default 16).
- ef_construction: search width during insert (default 200).
- ef_search: search width during query.
Pseudo-code
import heapq
from collections import defaultdict

import numpy as np

def distance(a, b):
    return np.linalg.norm(a - b)

class HNSW:
    def __init__(self, M=16, ef_construction=200):
        self.M = M
        self.M_max0 = M * 2                  # layer 0 allows more neighbors
        self.ef_construction = ef_construction
        self.mL = 1.0 / np.log(M)            # layer-sampling scale
        self.nodes = {}                      # id -> vector
        self.layers = defaultdict(dict)      # layer -> {id: [neighbor ids]}
        self.entry_point = None
        self.max_layer = -1

    def search_layer(self, q, entry, ef, layer):
        # Beam search on one layer; ef bounds the beam width.
        visited = {entry}
        # candidates: min-heap by distance; results: max-heap via negated distance
        candidates = [(distance(q, self.nodes[entry]), entry)]
        results = [(-distance(q, self.nodes[entry]), entry)]
        while candidates:
            dist, curr = heapq.heappop(candidates)
            if -results[0][0] < dist:
                break   # nearest candidate is worse than the worst kept result
            for neighbor in self.layers[layer].get(curr, []):
                if neighbor in visited:
                    continue
                visited.add(neighbor)
                d = distance(q, self.nodes[neighbor])
                if d < -results[0][0] or len(results) < ef:
                    heapq.heappush(candidates, (d, neighbor))
                    heapq.heappush(results, (-d, neighbor))
                    if len(results) > ef:
                        heapq.heappop(results)
        return sorted([(-d, n) for d, n in results])
Performance
- Recall 0.95 with 100x to 1000x speedup over brute force.
- Memory: ~1.5x to 2x original vectors.
- Insert slower than search.
- Deletion is hard (tombstone + rebuild).
Downsides
- Memory-heavy (hard to run from disk).
- Modification-hostile.
- Parameter-sensitive.
6. DiskANN: Disk-Based ANN
Motivation
1B vectors x 768 dims in float32 is ~3 TB. Beyond RAM.
Microsoft Research's DiskANN (2019) works from SSD and handles billions on one server.
Core Ideas
- Vamana graph: single-layer, with diverse edges (pruning parameter alpha > 1).
- SSD reads minimized: tens of random reads per query.
- Prefetching of next nodes.
Setup
- Vectors on SSD.
- Compressed graph in memory (node IDs only).
- Query: walk graph in memory, read actual vectors from SSD in batches.
Performance
- 100M vectors: 8GB RAM + SSD, recall 0.95 @ 10ms.
- 1B vectors: 64GB RAM + SSD.
Memory drops to ~1/50 of original. Used in Microsoft Bing, likely OpenAI Embeddings, and Milvus's DiskANN integration.
7. ScaNN: Google's Answer
Google's ScaNN (2020) improves PQ with anisotropic quantization, optimizing quantization error with respect to inner product rather than plain L2. Combined with asymmetric hashing, it achieves top-tier recall and speed. Used in Google Search, YouTube recommendations, Google Photos. Open source via TensorFlow.
8. Vector DB Comparison
| DB | Engine | Notes | Best For |
|---|---|---|---|
| Pinecone | HNSW + custom | Managed, serverless | Production, low ops |
| Weaviate | HNSW | Schema, GraphQL | Semantic search |
| Qdrant | HNSW | Rust, strong filters | Filter + vector |
| Milvus | IVF/HNSW/DiskANN | GPU support | Large scale |
| pgvector | IVF-Flat, HNSW | PostgreSQL ext | SQL integration |
| Elasticsearch | HNSW | Full-text + vector | Hybrid search |
| FAISS | All | Library | Research, custom |
| Chroma | HNSW | Dev-friendly | Prototypes, RAG |
| Redis Stack | HNSW, Flat | In-memory | Low latency |
pgvector
CREATE EXTENSION vector;
CREATE TABLE items (
id bigserial PRIMARY KEY,
content text,
embedding vector(1536)
);
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
SELECT id, content, embedding <=> '[0.1,0.2,...]' AS distance
FROM items
ORDER BY embedding <=> '[0.1,0.2,...]'
LIMIT 10;
Good up to millions; falls behind dedicated DBs at hundreds of millions.
Qdrant — Hybrid Filter + Vector
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)  # adjust to your deployment

client.search(
    collection_name="my_collection",
    query_vector=embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="tech"))]
    ),
    limit=10,
)
9. Parameter Tuning
HNSW
M parameter:
- Low (8-16): memory-saving, low-dim (<=128).
- Mid (16-32): common.
- High (32-64): high-dim, high recall.
ef_construction:
- 100-200: fast build.
- 400-800: better recall.
- 1000+: best quality, slow build.
ef_search:
- 50-100: fast, lower recall.
- 100-200: balanced.
- 500-1000: best recall.
Recipes
High recall (>=0.99):
m = 32
ef_construction = 400
ef_search = 200
Balanced (recall 0.95):
m = 16
ef_construction = 200
ef_search = 100
Large scale, tight memory:
m = 8
ef_construction = 100
ef_search = 50
# + Product Quantization
Dimensionality Reduction
1536-dim (OpenAI ada-002) is heavy. Options: PCA to ~784, Matryoshka Embeddings, smaller models (384-dim Sentence-BERT).
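A dimensionality-reduction sketch using plain-NumPy PCA via SVD (random data stands in for real embeddings, and the dimensions are arbitrary). Reduced vectors should be re-normalized before cosine search:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 512))   # stand-in for an embedding matrix
target = 128

mean = X.mean(axis=0)
# rows of Vt are principal directions, sorted by explained variance
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
reduced = (X - mean) @ Vt[:target].T
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)  # re-normalize
print(reduced.shape)  # (1000, 128)
```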
10. Real Scenarios & Pitfalls
RAG Pipeline
- Requirements: 1M document chunks, <=100ms response, high recall.
- Choice: Qdrant or Pinecone, HNSW, M=32, ef_search=100, metadata filters.
Image Similarity
- 100M images, top-100 for re-ranking, limited memory.
- Choice: Milvus + IVF-PQ, GPU, recall 0.9 then exact re-rank.
Real-Time Recommendation
- Requirements: user + item embeddings, <=10ms latency, frequent user-embedding updates.
- Choice: Redis Stack + HNSW, or FAISS in-memory.
Pitfall 1: Cold Start
HNSW build is slow — batch inserts, background build, or swap.
Pitfall 2: Delete/Update
Tombstone + periodic rebuild. Consider Annoy or swap indexes.
Pitfall 3: Filtering
- Post-filter: may return too few hits.
- Pre-filter: HNSW graph may navigate poorly on subset.
- Integrated (Qdrant): best but complex.
Pitfall 4: Curse of Dimensionality
Over 2000 dims hurts. Use 256-768 when possible.
Pitfall 5: Wrong Recall Measurement
Measure on your own data and queries:
def measure_recall(index, queries, exact_answers, k=10):
    correct = 0
    total = 0
    for q, exact in zip(queries, exact_answers):
        approx = index.search(q, k)
        correct += len(set(approx) & set(exact))
        total += k
    return correct / total
Quiz
Q1. Why use Approximate K-NN instead of Exact K-NN?
A. Speed-accuracy trade-off is extreme. Exact K-NN is O(N*d). For 100M x 1024-dim, that is minutes per query. ANN sacrifices 1-5% recall for 1000x speedup. In practice, similarity ground truth is subjective, recall 0.95 is indistinguishable to users, speed dominates UX, and re-ranking on top-100 candidates recovers accuracy. Good-enough-and-fast wins.
Q2. How does PQ yield 500x memory savings?
A. Split the vector space into M subspaces and quantize each independently. 1024-dim float32 (4096 bytes) becomes 8 subvectors with 256 centroids each, stored as 1 byte per subvector = 8 bytes total (512x). Recall survives because 256^8 = 2^64 combinations cover the data distribution well, and ADC (asymmetric distance computation) keeps queries uncompressed while reducing distance computation to M lookup-table additions. IVF-PQ combines this with cluster filtering — the de facto standard for huge-scale search.
Q3. Why does HNSW use multiple layers?
A. To avoid local minima in greedy search. Upper layers act as highways enabling fast global moves; lower layers are dense for fine-grained search. Node layer is sampled from an exponential distribution so most nodes sit at layer 0 and few reach the top. Search starts at the top layer (quick coarse positioning) and descends layer by layer until layer 0 returns the final top-k. Result: O(log N)-like complexity and state-of-the-art recall-latency.
Q4. How does IVF's nprobe control trade-offs?
A. nprobe is how many clusters to actually search per query. nprobe=1 is fastest but misses boundary cases (recall ~0.6); nprobe=10 reaches ~0.85; nprobe=32 hits ~0.95; nprobe=nlist equals brute force. Tune by measuring recall on your own data and picking the smallest nprobe that hits your target. For nlist=1000, nprobe=32 is a typical sweet spot.
Q5. When should you NOT pick HNSW?
A. Pick something else when: (1) memory is extremely tight - use IVF-PQ; (2) disk-based is required - DiskANN; (3) high insert/delete churn - Annoy or periodic rebuild; (4) GPU environment - FAISS IVF-PQ; (5) small scale under 100K - brute force is simpler and often faster; (6) heavy streaming inserts - LSH; (7) filter-dominant - Qdrant with pre-filter; (8) existing SQL stack - pgvector. HNSW is the strong default, but knowing its limits matters at scale.
Closing: The Geometry of Vectors
Summary
- Exact -> Approximate: inevitable trade-off.
- LSH: hash-based, streaming-friendly.
- IVF: clustering; tune nprobe.
- Product Quantization: 500x memory savings.
- IVF-PQ: the large-scale standard.
- HNSW: graph-based, top recall-latency.
- DiskANN: disk-based huge scale.
- ScaNN: Google's anisotropic quantization.
Algorithm Choice Summary
| Scale | Memory | Pick |
|---|---|---|
| <100K | ample | Brute force |
| 100K-10M | ample | HNSW |
| 10M-100M | ample | HNSW or IVF-PQ |
| 100M-10B | limited | IVF-PQ or DiskANN |
| >10B | limited | DiskANN, ScaNN, distributed |
Why It Matters in the RAG Era
Almost every LLM app uses RAG, and its performance and cost hinge on vector search: slow search = bad UX, low recall = hallucinations, huge memory = exploding cost. Understanding ANN algorithms tells you which DB to pick, how to tune parameters, how to debug, and where the next optimization lies.