ANN Algorithms Complete Guide 2025: HNSW, IVF, Product Quantization, LSH — How Vector DBs Really Work
Introduction: The "Most Similar" Problem
A New Essential Skill in the LLM Era
Since ChatGPT (2022), RAG (Retrieval-Augmented Generation) has become core to LLM apps:
- Embed documents (convert to vectors).
- Embed the user's question.
- Search for the most similar documents.
- Feed results to the LLM as context.
Step 3 is similarity search: finding the top-K nearest vectors out of millions or billions.
The Trap of the Naive Approach
The simplest approach: compute all distances, then sort.
def naive_knn(query, vectors, k):
    # O(N * d): one distance per stored vector, then a full sort
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(enumerate(distances), key=lambda x: x[1])[:k]
- 100K vectors: 0.1s (fine).
- 1M vectors: 1s (slow).
- 100M vectors: tens of seconds (unacceptable).
- 1,000 dims x 100M: minutes (catastrophic).
Real search engines, recommenders, and RAG pipelines need answers in tens of milliseconds, not seconds. A fundamentally different approach is needed.
Trading Accuracy
The key insight:
"Close enough K neighbors" is usually sufficient. Exact K-NN is not required.
Allowing 10% error yields 1000x faster search. That is the core of ANN (Approximate Nearest Neighbor).
What We Cover
- Vector similarity: cosine, L2, inner product.
- LSH: the oldest ANN algorithm.
- IVF: clustering-based search.
- Product Quantization: massive memory savings.
- HNSW: current graph-based standard.
- DiskANN, ScaNN: the latest.
- Practical selection and tuning.
1. Vector Similarity Basics
Distance Functions
L2 (Euclidean):
d(x, y) = sqrt(sum((x_i - y_i)^2))
Cosine distance:
cos(x, y) = (x . y) / (|x| * |y|)
distance = 1 - cos(x, y)
Inner product:
d(x, y) = -x . y (negative: larger = more similar)
Normalized Vectors
If vectors are normalized to unit length:
- x . y = cos(x, y) (inner product equals cosine).
- |x - y|^2 = 2 - 2 * cos(x, y) (L2 ordering is equivalent to cosine ordering).
So normalization makes L2, cosine, and inner product equivalent. Most embedding models (OpenAI, Sentence-BERT) return normalized vectors for this reason.
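These identities are easy to check numerically. A small sketch with NumPy, using random vectors in place of real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)
y = rng.standard_normal(512)
x /= np.linalg.norm(x)   # normalize to unit length
y /= np.linalg.norm(y)

cos = x @ y                    # for unit vectors, inner product == cosine
l2_sq = np.sum((x - y) ** 2)   # squared Euclidean distance

# identity for unit vectors: |x - y|^2 = 2 - 2 * cos(x, y)
print(np.isclose(l2_sq, 2 - 2 * cos))  # True
```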
The Curse of Dimensionality
In high-dim space, odd things happen:
- All points become nearly equidistant.
- The spread of distances shrinks relative to their mean.
- Intuition from 2D/3D fails.
Real data lives on a low-dim manifold, which ANN exploits.
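A quick numerical illustration of distance concentration, using random Gaussian data (roughly the worst case; real data on a manifold concentrates less):

```python
import numpy as np

rng = np.random.default_rng(42)
ratios = []
for dim in (2, 100, 10_000):
    points = rng.standard_normal((500, dim))
    q = rng.standard_normal(dim)
    d = np.linalg.norm(points - q, axis=1)
    ratios.append(d.std() / d.mean())  # relative spread of distances

# the spread-to-mean ratio shrinks as dimension grows:
# nearest and farthest points become almost equally far away
print([round(float(r), 3) for r in ratios])
```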
Recall Definition
Quality metric: Recall@k.
"How many of the true top-k did we find?"
recall@10 = |found top-10 that are truly top-10| / 10
- 1.0 = perfect, 0.9 = most users happy, 0.5 = bad.
Target: recall 0.95+ with optimized speed and memory.
2. LSH: Locality-Sensitive Hashing
The Idea
LSH (1998) is one of the earliest ANN methods:
"Put similar vectors in the same bucket, and search only that bucket at query time."
Regular hashes give different outputs for similar inputs. LSH is the opposite: similar inputs are more likely to hash the same.
Random Projection LSH (for cosine)
import numpy as np

class CosineLSH:
    def __init__(self, dim, num_hashes=32):
        self.dim = dim
        self.num_hashes = num_hashes
        # one random hyperplane per hash bit
        self.projections = np.random.randn(num_hashes, dim)

    def hash(self, vector):
        signs = np.sign(self.projections @ vector)
        return tuple((signs > 0).astype(int))
Collision probability:
P(hash_i(x) == hash_i(y)) = 1 - theta(x, y) / pi (per bit, where theta is the angle between x and y)
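This collision probability is easy to verify empirically. A sketch that samples random hyperplanes (the dimension and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, trials = 64, 20_000
x = rng.standard_normal(dim)
y = rng.standard_normal(dim)
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# a collision = both vectors fall on the same side of a random hyperplane
planes = rng.standard_normal((trials, dim))
empirical = np.mean(np.sign(planes @ x) == np.sign(planes @ y))
theoretical = 1 - theta / np.pi
print(empirical, theoretical)  # the two values should be very close
```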
Building an LSH Index
from collections import defaultdict

import numpy as np

def cosine_distance(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

class LSHIndex:
    def __init__(self, dim, num_tables=10, num_hashes_per_table=16):
        # several independent tables raise the chance that a true
        # neighbor collides with the query in at least one of them
        self.tables = []
        for _ in range(num_tables):
            lsh = CosineLSH(dim, num_hashes_per_table)
            self.tables.append((lsh, defaultdict(list)))

    def add(self, vector, id):
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            buckets[h].append((id, vector))

    def query(self, vector, k):
        candidates = {}
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            for id, v in buckets.get(h, []):
                candidates[id] = v   # dedupe across tables by id
        scored = [(id, cosine_distance(vector, v)) for id, v in candidates.items()]
        return sorted(scored, key=lambda x: x[1])[:k]
Pros: simple, fast insert, distributable, streaming-friendly. Cons: low recall (0.7 to 0.85), many tables needed, tuning is tricky.
LSH has been largely superseded by HNSW/IVF for most use cases, but still shines for massive distributed and streaming workloads.
3. IVF: Inverted File Index
The Idea
IVF is widely used in FAISS:
- Use k-means to split space into clusters.
- Each vector assigned to the nearest centroid.
- At query time, search only the nprobe closest clusters.
Query Flow
- Compare query q to all centroids.
- Pick the top nprobe clusters.
- Compare q precisely against vectors in those clusters.
- Return top-k.
import numpy as np

class IVFIndex:
    def __init__(self, dim, nlist=100):
        self.nlist = nlist
        self.centroids = None
        self.lists = [[] for _ in range(nlist)]

    def train(self, vectors):
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=self.nlist)
        kmeans.fit(vectors)
        self.centroids = kmeans.cluster_centers_
        # assign each vector to its nearest centroid's inverted list
        assignments = kmeans.predict(vectors)
        for i, c in enumerate(assignments):
            self.lists[c].append((i, vectors[i]))

    def query(self, q, k=10, nprobe=10):
        centroid_dists = np.linalg.norm(self.centroids - q, axis=1)
        top_clusters = np.argsort(centroid_dists)[:nprobe]
        candidates = []
        for c in top_clusters:
            for id, v in self.lists[c]:
                candidates.append((id, np.linalg.norm(q - v)))
        return sorted(candidates, key=lambda x: x[1])[:k]
Parameters
- nlist: cluster count, typically sqrt(N).
- nprobe: clusters checked per query; larger = more accurate but slower.
Pros: dozens of times faster than linear scan, simple, well-optimized in FAISS. Cons: needs training, rebalancing on updates, boundary vector errors.
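The nprobe trade-off can be seen in a self-contained sketch. To stay dependency-free it uses randomly chosen data points as centroids instead of trained k-means (an assumption; real IVF trains k-means), so the absolute recall numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
N, dim, nlist, k = 5_000, 32, 50, 10
vectors = rng.standard_normal((N, dim))
centroids = vectors[rng.choice(N, nlist, replace=False)]  # k-means stand-in
assign = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

q = rng.standard_normal(dim)
exact = set(np.argsort(np.linalg.norm(vectors - q, axis=1))[:k])

def ivf_search(nprobe):
    near = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    ids = np.where(np.isin(assign, near))[0]   # candidates in probed lists
    order = np.argsort(np.linalg.norm(vectors[ids] - q, axis=1))[:k]
    return set(ids[order])

for nprobe in (1, 5, 20, nlist):
    recall = len(ivf_search(nprobe) & exact) / k
    print(nprobe, recall)  # recall rises with nprobe; nprobe=nlist is exact
```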
4. Product Quantization: King of Memory Savings
The Problem
100M x 1024-dim float32 = 400 GB. Too big for memory, too slow from disk.
PQ Idea (Jegou et al., 2011)
- Split each vector into M subvectors (e.g. 1024 -> 8 x 128).
- Run k-means per subspace (e.g. 256 centroids).
- Replace each subvector with its nearest centroid index (1 byte).
- Each vector compressed to M bytes.
Original: 1024 * 4 = 4096 bytes
PQ: 8 bytes
Compression: 512x
Encoding
import numpy as np

class ProductQuantizer:
    def __init__(self, dim, M=8, nbits=8):
        self.dim = dim
        self.M = M               # number of subspaces
        self.nbits = nbits
        self.dsub = dim // M     # dimensions per subvector
        self.K = 2 ** nbits      # centroids per subspace
        self.codebooks = None

    def train(self, vectors):
        from sklearn.cluster import KMeans
        self.codebooks = np.zeros((self.M, self.K, self.dsub))
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            kmeans = KMeans(n_clusters=self.K)
            kmeans.fit(sub)
            self.codebooks[m] = kmeans.cluster_centers_

    def encode(self, vectors):
        codes = np.zeros((len(vectors), self.M), dtype=np.uint8)
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            dists = np.linalg.norm(
                sub[:, None, :] - self.codebooks[m][None, :, :],
                axis=2,
            )
            codes[:, m] = np.argmin(dists, axis=1)
        return codes
Asymmetric Distance Computation (ADC)
- Keep the query uncompressed.
- Precompute the distance from q to every centroid per subspace: a lookup table LUT[M, K].
- Distance to any compressed vector = sum of M lookups.
def query_adc(self, q, codes):
    # LUT[m, k]: squared distance from q's m-th subvector to centroid k
    lut = np.zeros((self.M, self.K))
    for m in range(self.M):
        q_sub = q[m*self.dsub:(m+1)*self.dsub]
        for k in range(self.K):
            lut[m, k] = np.sum((q_sub - self.codebooks[m, k])**2)
    distances = np.zeros(len(codes))
    for i, code in enumerate(codes):
        for m in range(self.M):
            distances[i] += lut[m, code[m]]
    return distances
Distance becomes M lookups + additions, not dim multiplications.
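A sketch verifying that ADC is exact with respect to the compressed representation: the LUT-sum equals the true squared distance between q and the decoded vector. Random codebooks stand in for trained ones here:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, dsub = 8, 256, 16                      # 128-dim vectors, 8 subspaces
codebooks = rng.standard_normal((M, K, dsub))
codes = rng.integers(0, K, size=(1000, M))   # 1000 compressed vectors
q = rng.standard_normal(M * dsub)

# LUT[m, k] = squared distance from q's m-th subvector to centroid k
q_sub = q.reshape(M, dsub)
lut = ((q_sub[:, None, :] - codebooks) ** 2).sum(axis=2)   # (M, K)

# ADC distance = sum of M table lookups per compressed vector
adc = lut[np.arange(M), codes].sum(axis=1)                 # (1000,)

# cross-check against the decoded (reconstructed) vectors
decoded = codebooks[np.arange(M), codes].reshape(len(codes), -1)
exact = ((decoded - q) ** 2).sum(axis=1)
print(np.allclose(adc, exact))  # True
```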
IVF-PQ
FAISS's famous combination:
- IVF to narrow to a cluster.
- PQ to compress inside each cluster.
- Query: search nprobe clusters via ADC.
Result: billions of vectors searchable in tens of GB of RAM with millisecond queries — the scale of Meta/Facebook image search.
import faiss

d = 128; nlist = 1024; m = 8; bits = 8
quantizer = faiss.IndexFlatL2(d)   # coarse quantizer for the IVF step
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index.train(training_vectors)
index.add(vectors)
index.nprobe = 32
distances, indices = index.search(queries, 10)   # top-10 per query
5. HNSW: The Graph-Based Revolution
The Idea
HNSW (Hierarchical Navigable Small World) was published in 2016 (Malkov and Yashunin). It is the current best ANN algorithm for most benchmarks and powers Pinecone, Weaviate, Qdrant, Milvus, and pgvector.
- Connect vectors in a graph.
- Each node links to nearby neighbors.
- Query by walking the graph.
Greedy Search and Local Minima
Greedy walk: move to the nearest neighbor until no closer neighbor exists. Prone to local minima — neighborhood looks far but a globally closer node exists elsewhere.
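A minimal sketch of a greedy walk on a flat k-NN graph (brute-force-built, which resembles HNSW's layer 0, not HNSW itself). The walk may stop at a local minimum, which is exactly what the hierarchy mitigates:

```python
import numpy as np

rng = np.random.default_rng(3)
N, dim = 500, 16
points = rng.standard_normal((N, dim))

# brute-force 8-NN graph (illustration only; real builds are incremental)
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
np.fill_diagonal(pairwise, np.inf)
graph = np.argsort(pairwise, axis=1)[:, :8]

def greedy_search(q, entry=0):
    curr = entry
    curr_d = np.linalg.norm(points[curr] - q)
    while True:
        neigh = graph[curr]
        nd = np.linalg.norm(points[neigh] - q, axis=1)
        best = int(nd.argmin())
        if nd[best] >= curr_d:
            return curr, curr_d   # stuck: possibly only a local minimum
        curr, curr_d = int(neigh[best]), nd[best]

q = rng.standard_normal(dim)
true_d = np.linalg.norm(points - q, axis=1).min()
_, found_d = greedy_search(q)
print(found_d >= true_d)  # greedy never beats the true NN; it may fall short
```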
Multi-Layer Structure
HNSW stacks multiple graph layers:
Layer 2 (sparse): [A --- E --- I]
| | |
Layer 1: [A B D E G I J]
| | | | | | |
Layer 0 (dense): [A B C D E F G H I J K L]
- Layer 0: all vectors, dense.
- Upper layers: random subsets, sparse — act as "highways".
Search Flow
- Start at top layer.
- Greedy-search that layer.
- Descend one layer.
- Repeat down to layer 0.
- Return top-k from layer 0.
Skip-list-like structure: upper = global moves, lower = fine-grained.
Insert
- Pick a random layer: L = floor(-log(uniform(0,1)) * mL).
- Greedy-search from the top layer down.
- Connect to M nearest neighbors on each layer.
- Prune if neighbor count exceeds M_max (heuristic, keep diversity).
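The layer-sampling formula can be checked directly; with M = 16, roughly 94% of nodes (1 - 1/M) should land on layer 0. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 16
mL = 1.0 / np.log(M)

# L = floor(-log(uniform(0,1)) * mL): exponentially fewer nodes per layer up
layers = np.floor(-np.log(rng.random(100_000)) * mL).astype(int)
for level in range(int(layers.max()) + 1):
    print(level, round(float(np.mean(layers == level)), 4))
```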
Key Parameters
- M parameter: max neighbors per node (default 16).
- ef_construction: search width during insert (default 200).
- ef_search: search width during query.
Pseudo-code
import heapq
from collections import defaultdict

import numpy as np

def distance(a, b):
    return np.linalg.norm(a - b)

class HNSW:
    def __init__(self, M=16, ef_construction=200):
        self.M = M
        self.M_max0 = M * 2                  # layer 0 allows more neighbors
        self.ef_construction = ef_construction
        self.mL = 1.0 / np.log(M)            # layer-sampling scale
        self.nodes = {}                      # id -> vector
        self.layers = defaultdict(dict)      # layer -> {id: [neighbor ids]}
        self.entry_point = None
        self.max_layer = -1

    def search_layer(self, q, entry, ef, layer):
        # Beam search on one layer; ef bounds the beam width.
        visited = {entry}
        # candidates: min-heap by distance; results: max-heap via negated distance
        candidates = [(distance(q, self.nodes[entry]), entry)]
        results = [(-distance(q, self.nodes[entry]), entry)]
        while candidates:
            dist, curr = heapq.heappop(candidates)
            if -results[0][0] < dist:
                break   # nearest candidate is worse than the worst kept result
            for neighbor in self.layers[layer].get(curr, []):
                if neighbor in visited:
                    continue
                visited.add(neighbor)
                d = distance(q, self.nodes[neighbor])
                if d < -results[0][0] or len(results) < ef:
                    heapq.heappush(candidates, (d, neighbor))
                    heapq.heappush(results, (-d, neighbor))
                    if len(results) > ef:
                        heapq.heappop(results)
        return sorted([(-d, n) for d, n in results])
Performance
- Recall 0.95 with 100x to 1000x speedup over brute force.
- Memory: ~1.5x to 2x original vectors.
- Insert slower than search.
- Deletion is hard (tombstone + rebuild).
Downsides
- Memory-heavy (hard to run from disk).
- Modification-hostile.
- Parameter-sensitive.
6. DiskANN: Disk-Based ANN
Motivation
1B vectors x 768 dims in float32 is ~3 TB. Beyond RAM.
Microsoft Research's DiskANN (2019) works from SSD and handles billions on one server.
Core Ideas
- Vamana graph: single-layer, with diverse edges (pruning parameter alpha > 1).
- SSD reads minimized: tens of random reads per query.
- Prefetching of next nodes.
Setup
- Vectors on SSD.
- Compressed graph in memory (node IDs only).
- Query: walk graph in memory, read actual vectors from SSD in batches.
Performance
- 100M vectors: 8GB RAM + SSD, recall 0.95 @ 10ms.
- 1B vectors: 64GB RAM + SSD.
Memory drops to ~1/50 of original. Used in Microsoft Bing, likely OpenAI Embeddings, and Milvus's DiskANN integration.
7. ScaNN: Google's Answer
Google's ScaNN (2020) improves PQ with anisotropic quantization, optimizing quantization error with respect to inner product rather than plain L2. Combined with asymmetric hashing, it achieves top-tier recall and speed. Used in Google Search, YouTube recommendations, Google Photos. Open source via TensorFlow.
8. Vector DB Comparison
| DB | Engine | Notes | Best For |
|---|---|---|---|
| Pinecone | HNSW + custom | Managed, serverless | Production, low ops |
| Weaviate | HNSW | Schema, GraphQL | Semantic search |
| Qdrant | HNSW | Rust, strong filters | Filter + vector |
| Milvus | IVF/HNSW/DiskANN | GPU support | Large scale |
| pgvector | IVF-Flat, HNSW | PostgreSQL ext | SQL integration |
| Elasticsearch | HNSW | Full-text + vector | Hybrid search |
| FAISS | All | Library | Research, custom |
| Chroma | HNSW | Dev-friendly | Prototypes, RAG |
| Redis Stack | HNSW, Flat | In-memory | Low latency |
pgvector
CREATE EXTENSION vector;
CREATE TABLE items (
id bigserial PRIMARY KEY,
content text,
embedding vector(1536)
);
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
SELECT id, content, embedding <=> '[0.1,0.2,...]' AS distance
FROM items
ORDER BY embedding <=> '[0.1,0.2,...]'
LIMIT 10;
Good up to millions; falls behind dedicated DBs at hundreds of millions.
Qdrant — Hybrid Filter + Vector
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)  # adjust to your deployment

client.search(
    collection_name="my_collection",
    query_vector=embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="tech"))]
    ),
    limit=10,
)
9. Parameter Tuning
HNSW
M parameter:
- Low (8-16): memory-saving, low-dim (<=128).
- Mid (16-32): common.
- High (32-64): high-dim, high recall.
ef_construction:
- 100-200: fast build.
- 400-800: better recall.
- 1000+: best quality, slow build.
ef_search:
- 50-100: fast, lower recall.
- 100-200: balanced.
- 500-1000: best recall.
Recipes
High recall (>=0.99):
m = 32
ef_construction = 400
ef_search = 200
Balanced (recall 0.95):
m = 16
ef_construction = 200
ef_search = 100
Large scale, tight memory:
m = 8
ef_construction = 100
ef_search = 50
# + Product Quantization
Dimensionality Reduction
1536-dim (OpenAI ada-002) is heavy. Options: PCA to ~784, Matryoshka Embeddings, smaller models (384-dim Sentence-BERT).
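A dimensionality-reduction sketch using plain-NumPy PCA via SVD (random data stands in for real embeddings, and the dimensions are arbitrary). Reduced vectors should be re-normalized before cosine search:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 512))   # stand-in for an embedding matrix
target = 128

mean = X.mean(axis=0)
# rows of Vt are principal directions, sorted by explained variance
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
reduced = (X - mean) @ Vt[:target].T
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)  # re-normalize
print(reduced.shape)  # (1000, 128)
```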
10. Real Scenarios & Pitfalls
RAG Pipeline
- Requirements: 1M document chunks, <=100ms response, high recall.
- Choice: Qdrant or Pinecone, HNSW, M=32, ef_search=100, metadata filters.
Image Similarity
- 100M images, top-100 for re-ranking, limited memory.
- Choice: Milvus + IVF-PQ, GPU, recall 0.9 then exact re-rank.
Real-Time Recommendation
- Requirements: user + item embeddings, <=10ms latency, frequent user-embedding updates.
- Choice: Redis Stack + HNSW, or FAISS in-memory.
Pitfall 1: Cold Start
HNSW build is slow — batch inserts, background build, or swap.
Pitfall 2: Delete/Update
Tombstone + periodic rebuild. Consider Annoy or swap indexes.
Pitfall 3: Filtering
- Post-filter: may return too few hits.
- Pre-filter: HNSW graph may navigate poorly on subset.
- Integrated (Qdrant): best but complex.
Pitfall 4: Curse of Dimensionality
Over 2000 dims hurts. Use 256-768 when possible.
Pitfall 5: Wrong Recall Measurement
Measure on your own data and queries:
def measure_recall(index, queries, exact_answers, k=10):
    correct = 0
    total = 0
    for q, exact in zip(queries, exact_answers):
        approx = index.search(q, k)
        correct += len(set(approx) & set(exact))
        total += k
    return correct / total
Quiz
Q1. Why use Approximate K-NN instead of Exact K-NN?
A. Speed-accuracy trade-off is extreme. Exact K-NN is O(N*d). For 100M x 1024-dim, that is minutes per query. ANN sacrifices 1-5% recall for 1000x speedup. In practice, similarity ground truth is subjective, recall 0.95 is indistinguishable to users, speed dominates UX, and re-ranking on top-100 candidates recovers accuracy. Good-enough-and-fast wins.
Q2. How does PQ yield 500x memory savings?
A. Split the vector space into M subspaces and quantize each independently. 1024-dim float32 (4096 bytes) becomes 8 subvectors with 256 centroids each, stored as 1 byte per subvector = 8 bytes total (512x). Recall survives because 256^8 = 2^64 combinations cover the data distribution well, and ADC (asymmetric distance computation) keeps queries uncompressed while reducing distance computation to M lookup-table additions. IVF-PQ combines this with cluster filtering — the de facto standard for huge-scale search.
Q3. Why does HNSW use multiple layers?
A. To avoid local minima in greedy search. Upper layers act as highways enabling fast global moves; lower layers are dense for fine-grained search. Node layer is sampled from an exponential distribution so most nodes sit at layer 0 and few reach the top. Search starts at the top layer (quick coarse positioning) and descends layer by layer until layer 0 returns the final top-k. Result: O(log N)-like complexity and state-of-the-art recall-latency.
Q4. How does IVF's nprobe control trade-offs?
A. nprobe is how many clusters to actually search per query. nprobe=1 is fastest but misses boundary cases (recall ~0.6); nprobe=10 reaches ~0.85; nprobe=32 hits ~0.95; nprobe=nlist equals brute force. Tune by measuring recall on your own data and picking the smallest nprobe that hits your target. For nlist=1000, nprobe=32 is a typical sweet spot.
Q5. When should you NOT pick HNSW?
A. Pick something else when: (1) memory is extremely tight - use IVF-PQ; (2) disk-based is required - DiskANN; (3) high insert/delete churn - Annoy or periodic rebuild; (4) GPU environment - FAISS IVF-PQ; (5) small scale under 100K - brute force is simpler and often faster; (6) heavy streaming inserts - LSH; (7) filter-dominant - Qdrant with pre-filter; (8) existing SQL stack - pgvector. HNSW is the strong default, but knowing its limits matters at scale.
Closing: The Geometry of Vectors
Summary
- Exact -> Approximate: inevitable trade-off.
- LSH: hash-based, streaming-friendly.
- IVF: clustering; tune nprobe.
- Product Quantization: 500x memory savings.
- IVF-PQ: the large-scale standard.
- HNSW: graph-based, top recall-latency.
- DiskANN: disk-based huge scale.
- ScaNN: Google's anisotropic quantization.
Algorithm Choice Summary
| Scale | Memory | Pick |
|---|---|---|
| <100K | ample | Brute force |
| 100K-10M | ample | HNSW |
| 10M-100M | ample | HNSW or IVF-PQ |
| 100M-10B | limited | IVF-PQ or DiskANN |
| >10B | limited | DiskANN, ScaNN, distributed |
Why It Matters in the RAG Era
Almost every LLM app uses RAG, and its performance and cost hinge on vector search: slow search = bad UX, low recall = hallucinations, huge memory = exploding cost. Understanding ANN algorithms tells you which DB to pick, how to tune parameters, how to debug, and where the next optimization lies.