Split View: ANN 알고리즘 완전 가이드 2025: HNSW, IVF, Product Quantization, LSH — 벡터 DB의 내부는 어떻게 작동하는가

ANN 알고리즘 완전 가이드 2025: HNSW, IVF, Product Quantization, LSH — 벡터 DB의 내부는 어떻게 작동하는가

들어가며: "가장 비슷한 것"을 찾는 문제

LLM 시대의 새로운 필수 기술

2022년 ChatGPT 이후 RAG (Retrieval-Augmented Generation) 가 LLM의 핵심 기법이 되었다. RAG의 구조는 단순하다:

문서를 임베딩 (벡터로 변환).
사용자 질문도 임베딩.
가장 유사한 문서 검색.
LLM에게 검색 결과를 컨텍스트로 제공.

이 "가장 유사한 문서 검색"이 유사도 검색(Similarity Search) 이다. 수백만~수십억 개의 벡터 중에서 특정 벡터에 가장 가까운 K개를 찾아야 한다.

순진한 방법의 함정

가장 단순한 접근: 모든 벡터와 거리 계산 후 정렬.

def naive_knn(query, vectors, k):
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(enumerate(distances), key=lambda x: x[1])[:k]

10만 벡터: 0.1초 (괜찮음).
1백만 벡터: 1초 (느림).
1억 벡터: 수십 초 (불가).
1,000차원 × 1억 벡터: 수 분 (재앙).

실제 검색엔진, 추천 시스템, RAG는 모두 수초가 아닌 수십 밀리초를 요구한다. 이는 근본적으로 다른 접근이 필요하다는 뜻이다.

정확도를 양보하다

핵심 통찰:

"정확히 k개의 최근접 이웃"보다 "충분히 가까운 k개의 이웃"이면 충분하다.

10% 오차를 허용하면 1000배 빠른 검색이 가능하다. 이것이 ANN (Approximate Nearest Neighbor) 알고리즘의 핵심이다.

이 글에서 다룰 것

벡터 유사도: 코사인, L2, 내적의 차이.
LSH: 가장 오래된 ANN 알고리즘.
IVF: 클러스터링 기반 검색.
Product Quantization: 메모리 대폭 절약.
HNSW: 현재 표준, 그래프 기반.
DiskANN, ScaNN: 최신 알고리즘.
실전 선택과 튜닝.

왜 중요한가: RAG를 구현하는 모든 개발자는 이 지식이 필요하다. Pinecone/Weaviate/Qdrant/pgvector 중에서 선택하는 것, ef_construction과 M 파라미터를 튜닝하는 것, 메모리 사용을 이해하는 것 — 모두 ANN 알고리즘 이해 위에 있다.

1. 벡터 유사도의 기초

거리 함수들

L2 (Euclidean) 거리:

d(x, y) = √(Σ(x_i - y_i)²)

코사인 거리:

cos(x, y) = (x·y) / (|x| × |y|)
distance = 1 - cos(x, y)

내적 (Inner Product):

d(x, y) = -x·y   (음수: 클수록 유사)

정규화된 벡터

벡터를 단위 길이(|v|=1) 로 정규화하면:

x·y = cos(x, y) (내적 = 코사인)
|x - y|² = 2 - 2·cos(x, y) (L2 ≈ 코사인)

즉, 정규화 후 L2, 코사인, 내적이 동등하다. 대부분의 임베딩 모델(OpenAI, Sentence-BERT)이 정규화된 벡터를 반환하는 이유다.

차원의 저주

고차원 공간에서 이상한 일이 벌어진다:

모든 점이 거의 같은 거리에 있게 됨.
평균과 분산의 비율이 작아짐.
직관(2D, 3D)이 통하지 않음.

1000차원 공간에서 두 랜덤 점의 거리는 거의 상수다. 이것이 ANN 알고리즘이 어려운 이유 중 하나다. 그러나 실제 데이터는 저차원 manifold(구조)를 가지므로 활용 가능.

Recall의 정의

ANN의 품질 측정: Recall@k.

"진짜 top-k 중 몇 개를 찾았는가?"

recall@10 = 찾은 진짜 top-10 개수 / 10.
1.0이면 완벽 (exact search와 동일).
0.9면 진짜 10개 중 9개 찾음 (대부분 만족).
0.5면 절반만 (나쁨).

ANN 알고리즘들의 목표는 recall 0.95 이상을 유지하면서 속도와 메모리를 최적화하는 것이다.

2. LSH: Locality-Sensitive Hashing

아이디어

LSH는 가장 오래된 ANN 기법 중 하나다 (1998). 핵심 아이디어:

"유사한 벡터는 같은 해시 버킷에 넣고, 쿼리 시 같은 버킷만 검색하자."

일반 해시 함수는 유사한 입력에 다른 해시를 준다. LSH는 반대다: 유사한 입력이 같은 해시를 가질 확률이 높다.

코사인용 LSH: Random Projection

import numpy as np

class CosineLSH:
    def __init__(self, dim, num_hashes=32):
        self.dim = dim
        self.num_hashes = num_hashes
        # 랜덤 초평면들
        self.projections = np.random.randn(num_hashes, dim)

    def hash(self, vector):
        # 각 초평면의 어느 쪽에 있는가? (±1)
        signs = np.sign(self.projections @ vector)
        # 비트 벡터로 변환
        return tuple((signs > 0).astype(int))

# 사용 예시
lsh = CosineLSH(dim=128, num_hashes=32)
h1 = lsh.hash(vec1)  # (1, 0, 1, 1, 0, ...)
h2 = lsh.hash(vec2)

두 벡터가 코사인으로 가까우면 같은 해시 버킷에 들어갈 확률이 높다. 수학적으로:

P(hash(x) == hash(y)) = 1 - θ(x,y)/π

여기서 θ는 두 벡터 사이의 각도.

LSH 테이블 구축

from collections import defaultdict

class LSHIndex:
    def __init__(self, dim, num_tables=10, num_hashes_per_table=16):
        self.tables = []
        for _ in range(num_tables):
            lsh = CosineLSH(dim, num_hashes_per_table)
            self.tables.append((lsh, defaultdict(list)))

    def add(self, vector, id):
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            buckets[h].append((id, vector))

    def query(self, vector, k):
        candidates = set()
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            for id, v in buckets.get(h, []):
                candidates.add((id, tuple(v)))

        # 정확한 거리 재계산
        scored = [(id, cosine_distance(vector, v)) for id, v in candidates]
        return sorted(scored, key=lambda x: x[1])[:k]

여러 테이블 사용 → recall 향상.
쿼리 시 여러 버킷의 후보를 합침.
후보들만 정확히 거리 계산 → 정답 반환.

LSH의 장단점

장점:

구현 단순.
매우 빠른 삽입.
분산 환경 친화적.
새 데이터 추가 쉬움.

단점:

Recall이 낮음 (보통 0.7~0.85).
많은 테이블 필요 → 메모리 부담.
튜닝 복잡 (테이블 수, 해시 수 조정).

현대에는 HNSW, IVF 같은 그래프/트리 기반 방법이 더 우수해서 LSH 사용이 줄었다. 그러나 극대규모 분산 환경이나 스트리밍 삽입에선 여전히 유용하다.

3. IVF: Inverted File Index

아이디어

IVF (Inverted File) 는 FAISS에서 널리 쓰이는 방법. 아이디어:

k-means로 벡터 공간을 클러스터로 나눔.
각 벡터는 가장 가까운 centroid에 할당됨.
쿼리 시 가장 가까운 몇 개 클러스터만 검색.

이는 "정확성 대신 속도"를 선택하는 전형적 ANN 방식.

구조

전체 벡터를 K개 클러스터로 분할
┌──────┐
│ C_0  │ → [v1, v5, v9, v12, ...]
├──────┤
│ C_1  │ → [v2, v7, v13, ...]
├──────┤
│ C_2  │ → [v3, v4, v8, v11, ...]
│ ...  │
│ C_K-1│ → [v6, v10, v14, ...]
└──────┘

쿼리 과정

쿼리 벡터 q와 모든 centroid 비교.
가장 가까운 nprobe개 클러스터 선택.
그 클러스터들 내의 벡터만 정확히 비교.
Top-k 반환.

class IVFIndex:
    def __init__(self, dim, nlist=100):
        self.nlist = nlist
        self.centroids = None
        self.lists = [[] for _ in range(nlist)]

    def train(self, vectors):
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=self.nlist)
        kmeans.fit(vectors)
        self.centroids = kmeans.cluster_centers_

        # 각 벡터를 해당 클러스터에 할당
        assignments = kmeans.predict(vectors)
        for i, c in enumerate(assignments):
            self.lists[c].append((i, vectors[i]))

    def query(self, q, k=10, nprobe=10):
        # 가장 가까운 nprobe 클러스터 찾기
        centroid_dists = np.linalg.norm(self.centroids - q, axis=1)
        top_clusters = np.argsort(centroid_dists)[:nprobe]

        # 그 클러스터들만 검색
        candidates = []
        for c in top_clusters:
            for id, v in self.lists[c]:
                candidates.append((id, np.linalg.norm(q - v)))

        return sorted(candidates, key=lambda x: x[1])[:k]

파라미터

nlist: 클러스터 수. 일반적으로 √N (N은 벡터 수).
nprobe: 쿼리 시 확인할 클러스터 수. 1~nlist 사이.

Trade-off:

nprobe 증가 → 더 많은 벡터 비교 → 느리지만 정확.
nprobe 감소 → 적은 벡터 비교 → 빠르지만 recall 낮음.

장단점

장점:

선형 탐색보다 수십 배 빠름.
구현 단순.
FAISS에서 잘 최적화.

단점:

Training 필요 (k-means).
새 데이터 추가 시 재균형 필요.
클러스터 경계 근처 벡터 오류.

4. Product Quantization: 메모리 절약의 왕

문제

1억 개의 1024차원 float32 벡터 = 400 GB. 메모리에 못 올린다. 디스크에서 읽으면 느리다.

해결: 압축. 하지만 일반 압축은 안 된다 (랜덤 접근 필요).

Product Quantization의 아이디어

2011년 Hervé Jégou 등이 발표한 PQ의 핵심:

각 벡터를 M개의 subvector로 분할 (예: 1024차원 → 8개의 128차원).
각 subvector 공간에서 별도로 k-means 클러스터링 (예: 256 centroids).
각 subvector를 가장 가까운 centroid 인덱스로 대체 (1바이트).
벡터 하나가 M바이트로 압축.

원본: 1024 × 4 = 4096 바이트
PQ:   8 바이트 (8 subvectors × 1 byte each)
압축률: 512배!

구체 예시

class ProductQuantizer:
    def __init__(self, dim, M=8, nbits=8):
        self.dim = dim
        self.M = M  # subvector 수
        self.nbits = nbits  # subvector당 비트 (256 centroids = 8비트)
        self.dsub = dim // M  # 각 subvector 차원
        self.K = 2 ** nbits  # centroid 수
        self.codebooks = None  # [M, K, dsub]

    def train(self, vectors):
        from sklearn.cluster import KMeans
        self.codebooks = np.zeros((self.M, self.K, self.dsub))
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            kmeans = KMeans(n_clusters=self.K)
            kmeans.fit(sub)
            self.codebooks[m] = kmeans.cluster_centers_

    def encode(self, vectors):
        """각 벡터를 M바이트 코드로 변환."""
        codes = np.zeros((len(vectors), self.M), dtype=np.uint8)
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            dists = np.linalg.norm(
                sub[:, None, :] - self.codebooks[m][None, :, :],
                axis=2
            )
            codes[:, m] = np.argmin(dists, axis=1)
        return codes

거리 계산: Asymmetric Distance

쿼리 q와 압축된 데이터베이스 벡터 c의 거리를 어떻게 계산할까?

Asymmetric Distance Computation (ADC):

쿼리를 압축하지 않고 원본으로 유지.
쿼리와 각 subspace의 모든 centroid 간 거리를 미리 계산. → Lookup table LUT[M, K].
압축된 벡터의 거리 = M개 LUT 조회의 합.

def query_adc(self, q, codes):
    # 1. Lookup table 구축
    lut = np.zeros((self.M, self.K))
    for m in range(self.M):
        q_sub = q[m*self.dsub:(m+1)*self.dsub]
        for k in range(self.K):
            lut[m, k] = np.sum((q_sub - self.codebooks[m, k])**2)

    # 2. 각 코드에 대해 거리 합산 (매우 빠름)
    distances = np.zeros(len(codes))
    for i, code in enumerate(codes):
        for m in range(self.M):
            distances[i] += lut[m, code[m]]
    return distances

놀라운 점: 거리 계산이 M번의 테이블 조회 + 합으로 끝난다. 원본 벡터로 계산하면 dim번의 곱셈 + dim번의 덧셈인데, PQ는 dim 대신 M이다 (보통 1/64 ~ 1/128).

정확도

PQ는 근사 압축이므로 원본 대비 오차가 있다. 하지만:

8 subvectors × 8bit = 2^64개 코드 조합 → 대부분 벡터를 잘 구분.
실전 recall: 0.9+ 가능.

IVF-PQ 조합

FAISS의 가장 유명한 조합: IVFPQ.

IVF로 클러스터링 (후보 축소).
각 클러스터 내의 벡터를 PQ로 압축.
쿼리 시 nprobe 클러스터만 검색 + PQ ADC로 빠른 거리 계산.

결과: 100억 벡터를 수 GB에 저장하고 수 밀리초에 쿼리. Facebook/Meta 스케일의 이미지 검색이 이 방식.

구현

import faiss

d = 128
nlist = 1024  # IVF 클러스터 수
m = 8  # PQ subvectors
bits = 8  # PQ bits

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)

index.train(training_vectors)
index.add(vectors)

index.nprobe = 32
distances, indices = index.search(queries, k=10)

5. HNSW: 그래프 기반의 혁명

아이디어

HNSW (Hierarchical Navigable Small World) 는 2016년 Yu. Malkov와 D. Yashunin이 발표. 현재 최고 성능의 ANN 알고리즘으로 간주되며, Pinecone, Weaviate, Qdrant, Milvus, pgvector의 기본 엔진.

핵심 아이디어:

벡터들을 그래프로 연결.
각 벡터(노드)가 가까운 몇 개 이웃과 edge.
쿼리 시 그래프 탐색으로 가장 가까운 노드 찾기.

Small World와 Greedy Search

Small World 네트워크의 특성: 임의의 두 노드 사이 경로가 짧다 (Six Degrees of Kevin Bacon).

그래프 위에서 greedy search로 최근접 이웃을 찾는다:

시작 노드에서 출발.
현재 노드의 이웃들 중 쿼리와 가장 가까운 것으로 이동.
더 가까운 이웃이 없을 때까지 반복.

문제: Local Minimum

단순 greedy는 local minimum에 빠질 수 있다:

근처 이웃은 다 멀지만, 그래프 전체에 더 가까운 노드가 있는 경우.

HNSW의 해결책: Layer

HNSW는 여러 층(layer) 의 그래프를 쌓는다:

Layer 2 (sparse): [A ─── E ─── I]
                   │     │     │
Layer 1:          [A B D E G I J]
                   │ │ │ │ │ │ │
Layer 0 (dense):  [A B C D E F G H I J K L]

Layer 0: 모든 벡터 포함, 조밀한 연결.
상위 layer: 무작위 하위집합만 포함, 스파스 연결. "고속도로" 역할.

검색 과정

최상위 layer에서 시작.
Greedy search로 그 층의 최근접 노드 찾기.
그 노드에서 시작해 한 층 아래로 내려감.
반복하며 layer 0까지 내려옴.
Layer 0에서 최종 k개 이웃 반환.

상위 층이 전역적인 이동을, 하위 층이 세밀한 탐색을 담당한다. Skip list와 유사한 구조.

삽입

새 벡터를 추가할 때:

랜덤 layer 선택: L = floor(-log(uniform(0,1)) * mL). 지수 분포로 대부분 하위 층에 들어감.
최상위 layer부터 시작해 greedy search로 위치 찾기.
각 layer에서 가장 가까운 M개 이웃과 연결.
이웃 수가 M_max를 넘으면 heuristic pruning (다양성 유지).

주요 파라미터

M: 각 노드의 최대 이웃 수 (기본 16). 클수록 정확하지만 메모리 ↑.
ef_construction: 삽입 시 탐색 폭 (기본 200). 클수록 품질 ↑, 빌드 시간 ↑.
ef_search (또는 ef): 쿼리 시 탐색 폭. 클수록 정확하지만 느림.

의사 코드

class HNSW:
    def __init__(self, M=16, ef_construction=200):
        self.M = M
        self.M_max0 = M * 2  # layer 0 max connections
        self.ef_construction = ef_construction
        self.mL = 1.0 / np.log(M)

        self.nodes = {}  # id → vector
        self.layers = defaultdict(dict)  # layer → {id → neighbors}
        self.entry_point = None
        self.max_layer = -1

    def search_layer(self, q, entry, ef, layer):
        """한 layer에서 greedy + beam search."""
        visited = {entry}
        candidates = [(distance(q, self.nodes[entry]), entry)]
        results = [(-distance(q, self.nodes[entry]), entry)]

        while candidates:
            dist, curr = heapq.heappop(candidates)
            if -results[0][0] < dist:
                break

            for neighbor in self.layers[layer].get(curr, []):
                if neighbor in visited:
                    continue
                visited.add(neighbor)
                d = distance(q, self.nodes[neighbor])
                if d < -results[0][0] or len(results) < ef:
                    heapq.heappush(candidates, (d, neighbor))
                    heapq.heappush(results, (-d, neighbor))
                    if len(results) > ef:
                        heapq.heappop(results)

        return sorted([(-d, n) for d, n in results])

    def insert(self, id, vector):
        self.nodes[id] = vector
        layer = int(-np.log(np.random.random()) * self.mL)

        if self.entry_point is None:
            self.entry_point = id
            self.max_layer = layer
            for l in range(layer + 1):
                self.layers[l][id] = []
            return

        # 상위 layer부터 내려가며 삽입 위치 찾기
        curr = self.entry_point
        for l in range(self.max_layer, layer, -1):
            curr = self.search_layer(vector, curr, ef=1, layer=l)[0][1]

        # layer부터 0까지 이웃 연결
        for l in range(min(layer, self.max_layer), -1, -1):
            candidates = self.search_layer(vector, curr, self.ef_construction, l)
            M = self.M_max0 if l == 0 else self.M
            neighbors = [n for _, n in candidates[:M]]
            self.layers[l][id] = neighbors
            for n in neighbors:
                self.layers[l][n].append(id)
                # pruning if too many
                if len(self.layers[l][n]) > M:
                    # heuristic: remove least diverse
                    pass

        if layer > self.max_layer:
            self.max_layer = layer
            self.entry_point = id

성능 특성

HNSW는 recall-latency 트레이드오프에서 대부분의 벤치마크 우승:

Recall 0.95에서 선형 탐색 대비 100~1000배 빠름.
메모리 오버헤드: 원본 벡터 + 그래프 연결 (대략 원본의 1.5~2배).
삽입이 검색보다 훨씬 느림 (ef_construction에 따라).
삭제는 복잡. 단순 노드 제거가 어려움.

HNSW의 단점

메모리 집중: 디스크 기반이 어려움 (그래프 순회가 랜덤 접근).
수정 어려움: 삭제, 수정 시 그래프 재구성 필요.
파라미터 민감: M, ef_construction, ef_search 튜닝 필요.

6. DiskANN: 디스크 기반 ANN

동기

HNSW는 전부 메모리에서 동작한다. 10억 벡터 × 768차원 = 3TB. 메모리에 올릴 수 없다.

Microsoft Research의 DiskANN (2019)은 SSD에서 잘 작동하는 ANN을 만들었다. 한 대 서버로 수십억 벡터 검색 가능.

핵심 아이디어

Vamana 그래프 알고리즘: HNSW와 유사하지만 단일 layer, 다양한 edge.
SSD 접근 최소화: 검색 시 수십 번의 random read만.
Prefetching: 다음 노드들을 미리 예측해 읽기.

Vamana 그래프

HNSW의 다층 구조 대신 하나의 그래프에 모든 이웃을 담는다. 대신:

α > 1 파라미터로 다양성 강제 (가까운 것만 연결하지 않음).
먼 이웃도 포함해 shortcut 역할.
Greedy search가 local minimum에 덜 빠짐.

구성

SSD에 벡터 저장: 정렬되지 않은 그대로.
압축된 그래프를 메모리에 유지: 노드 ID만.
쿼리 시:
- 메모리에서 그래프 탐색 → SSD에서 실제 벡터 읽기.
- 배치로 여러 벡터 동시 prefetch.

성능

1억 벡터: 8GB 메모리 + SSD로 recall 0.95 @ 10ms.
10억 벡터: 64GB 메모리 + SSD.

메모리 크기가 원본 벡터 크기의 1/50 수준이다. 거대 규모 검색의 경제성을 완전히 바꿨다.

사용처

Microsoft Bing 검색.
OpenAI Embeddings API 내부 추정.
Milvus의 DiskANN 통합.

7. ScaNN: Google의 답

동기

Google의 ScaNN (Scalable Nearest Neighbors) 는 2020년 발표. Product Quantization을 개선한 방식으로, recall과 속도 모두에서 최고 수준.

핵심: Anisotropic Quantization

PQ는 각 subspace를 독립적으로 양자화한다. ScaNN은 이를 개선:

양자화 오차를 내적(inner product) 기반으로 최적화.
유사도 보존에 중요한 방향을 더 정밀하게 양자화.

결과: 같은 비트 수로 더 나은 recall.

AH (Asymmetric Hashing)

추가 최적화로 asymmetric hashing을 사용. 쿼리 측은 원본, 데이터베이스 측은 양자화된 상태로 거리 계산. PQ의 ADC와 유사하지만 다른 목적 함수로 학습.

사용처

Google Search 내부.
YouTube 추천.
Google Photos 유사 검색.

TensorFlow와 오픈소스로 공개되어 있다.

8. 실전 벡터 DB 비교

주요 벡터 DB

DB	엔진	특징	적합 워크로드
Pinecone	HNSW + 자체 최적화	관리형, 서버리스	프로덕션, 관리 부담 낮춤
Weaviate	HNSW	Schema, GraphQL	의미론적 검색
Qdrant	HNSW	Rust, 필터 강함	필터링 + 벡터 검색
Milvus	IVF/HNSW/DiskANN	GPU 지원	대규모, 유연함
pgvector	IVF-Flat, HNSW	PostgreSQL 확장	SQL과 통합
Elasticsearch	HNSW	전문 검색과 통합	하이브리드 검색
FAISS	모든 알고리즘	라이브러리	연구, 커스텀 구축
Chroma	HNSW	개발자 친화	프로토타입, RAG
Redis Stack	HNSW, Flat	기존 Redis + 벡터	저지연 요구

pgvector: SQL과의 통합

pgvector는 PostgreSQL 확장으로 기존 DB에 벡터 검색을 추가한다:

CREATE EXTENSION vector;

CREATE TABLE items (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536)
);

-- HNSW 인덱스
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- 유사도 검색
SELECT id, content, embedding <=> '[0.1,0.2,...]' AS distance
FROM items
ORDER BY embedding <=> '[0.1,0.2,...]'
LIMIT 10;

장점:

기존 SQL 데이터와 함께 사용.
트랜잭션.
PostgreSQL의 모든 기능 (replication, backup 등).

단점:

전용 벡터 DB보다 성능 낮음 (수억 벡터에서는).
수백만 규모까지는 적합.

Qdrant: Rust의 속도

Qdrant는 Rust로 작성, 필터 + 벡터 하이브리드 쿼리에 강하다:

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client.search(
    collection_name="my_collection",
    query_vector=embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="tech"))]
    ),
    limit=10,
)

필터 조건을 만족하는 벡터만 검색. 메타데이터 활용이 중요한 애플리케이션에 적합.

Milvus: 확장성

Milvus는 쿠버네티스 기반으로 수십억 벡터 처리 가능. 여러 알고리즘 지원 (HNSW, IVF, DiskANN). 대규모 프로덕션에 적합.

9. 파라미터 튜닝

HNSW 튜닝

가장 중요한 세 파라미터:

M (연결 수):

낮음 (8~16): 메모리 절약, 낮은 차원(≤128)에 적합.
중간 (16~32): 일반적 선택.
높음 (32~64): 고차원(≥512), 높은 recall 요구.

ef_construction (빌드 품질):

100~200: 빠른 빌드.
400~800: 더 나은 recall.
1000+: 최고 품질, 느린 빌드.

ef_search (쿼리 품질):

50~100: 빠른 검색, 낮은 recall.
100~200: 균형.
500~1000: 최고 recall, 느림.

실전 레시피

1. 높은 recall (≥0.99):

m = 32
ef_construction = 400
ef_search = 200

2. 균형 (recall 0.95, 빠른 속도):

m = 16
ef_construction = 200
ef_search = 100

3. 대규모, 메모리 제한:

m = 8
ef_construction = 100
ef_search = 50
# + Product Quantization

차원 축소 고려

1536차원 (OpenAI ada-002) 벡터는 메모리를 많이 쓴다. 옵션:

PCA: 784차원으로 축소, recall 약간 감소.
Matryoshka Embeddings: 256/512/1024차원 지원 모델 사용.
텍스트 임베딩 모델 교체: 384차원 모델 (Sentence-BERT).

10. 실전 시나리오와 함정

시나리오 1: RAG 파이프라인

요구사항:

100만 문서 청크.
사용자 질문에 100ms 이내 응답.
높은 recall (중요 문서 놓치면 안 됨).

시나리오 2: 이미지 유사도 검색

요구사항:

1억 이미지 임베딩.
재랭킹 전 단계로 top-100.
메모리 예산 제한.

시나리오 3: 실시간 추천

요구사항:

사용자별 임베딩 + 아이템 임베딩.
10ms 이내 응답.
사용자 임베딩이 자주 업데이트.

함정 1: Cold Start

HNSW는 삽입 시간이 길다. 1백만 벡터 빌드가 수십 분 걸릴 수 있다.

해결:

배치 삽입.
백그라운드 빌드 후 swap.
Pinecone 같은 관리형은 자동 처리.

함정 2: 삭제와 업데이트

HNSW는 삭제가 까다롭다:

"tombstone" 마킹만 (실제 삭제 X).
주기적 재구성 필요.

해결:

삭제가 많으면 다른 알고리즘 (예: Annoy).
또는 주기적으로 새 인덱스 빌드 후 swap.

함정 3: 필터링의 복잡성

"카테고리 X + 벡터 유사 Y"를 동시에 할 때:

옵션 1: Post-filter (먼저 벡터 검색, 그 후 필터링):

문제: 필터가 엄격하면 top-k가 너무 적을 수 있음.

옵션 2: Pre-filter (먼저 필터링, 그 후 벡터 검색):

문제: HNSW 그래프가 필터된 부분집합에서 잘 탐색 안 될 수 있음.

옵션 3: Integrated (Qdrant 방식):

벡터 검색과 필터를 동시에 적용.
가장 효율적이지만 구현 복잡.

함정 4: 차원의 저주

너무 높은 차원(> 2000)은 ANN 효과가 떨어진다. 가능하면:

적절한 임베딩 모델 선택 (256~768 권장).
차원 축소 (PCA, autoencoder).

함정 5: Recall 측정 오류

"Recall 0.95"가 실제 품질을 반영하는가? 다음 경우 주의:

사용자 쿼리 분포와 벤치마크가 다르면 recall이 다를 수 있음.
자기 데이터로 측정해라. 일반 벤치마크 믿지 말고.

def measure_recall(index, queries, exact_answers, k=10):
    correct = 0
    total = 0
    for q, exact in zip(queries, exact_answers):
        approx = index.search(q, k)
        correct += len(set(approx) & set(exact))
        total += k
    return correct / total

퀴즈로 복습하기

Q1. 왜 "정확한 K-NN" 대신 "근사 K-NN"이 실전에서 쓰이는가?

A. 속도와 정확도의 트레이드오프가 극단적이기 때문이다. 정확한 K-NN은 모든 벡터와의 거리를 계산해야 해서 O(N·d)다. 1억 벡터 × 1024차원이면 단일 쿼리에 수 분이 걸린다.

근사 K-NN은 약간의 정확도(보통 95~99% recall)를 희생해 1000배 이상의 속도를 얻는다. 실전에서:

유사도 검색의 "정답"은 애초에 주관적이다. 진짜 가장 가까운 이웃이 반드시 사용자가 원하는 답은 아니다.
Recall 0.95는 충분하다. 10개 결과 중 9.5개가 실제 top-10이면 사용자는 차이를 느끼지 않는다.
속도가 사용자 경험을 좌우한다. RAG 파이프라인에서 100ms vs 10초는 완전히 다른 서비스다.
재랭킹이 가능하다. ANN으로 top-100을 빠르게 찾고, 그 위에 정확한 재랭킹(cross-encoder, BM25 혼합 등)을 적용하는 것이 일반적 패턴.

이론적으로 완벽할 필요 없이, "충분히 좋은" 답을 빠르게 — 이것이 ANN의 철학이다.

Q2. Product Quantization이 어떻게 500배 메모리 절약을 가능하게 하는가?

A. 핵심은 "벡터 공간을 subspace로 분할 후 각각 독립적으로 양자화" 다.

수학:

원본: 1024차원 × float32 = 4096 바이트.
PQ: 1024차원을 8개의 128차원 subvector로 분할.
각 subspace에서 k-means (256 centroids).
각 subvector는 가장 가까운 centroid의 인덱스(8비트) 로 대체.
총 저장: 8 subvectors × 1 byte = 8 바이트.
압축률: 512배.

왜 recall이 유지되는가:

각 subspace의 centroid 수가 256개 → 2^64 가능한 조합 전체.
k-means가 실제 데이터 분포에 맞춰 학습되므로 손실이 제한적.
Asymmetric Distance Computation (ADC): 쿼리는 원본, 데이터만 압축. 거리 계산이 M번의 lookup table 조회 + 합으로 끝남.

실전 효과:

1억 벡터 × 1024차원: 400GB → 800MB (PQ8). 한 대 서버 메모리에 들어감.
거리 계산 속도: 원본 대비 8~16배 빠름 (lookup이 곱셈보다 빠름).
Recall: 일반적으로 0.9~0.95 유지 가능.

IVF-PQ 조합이 실전 대규모 검색의 사실상 표준이다. FAISS의 flagship 인덱스 타입이며, Meta, Google, Microsoft 등 모든 대기업이 어떤 형태로든 사용한다.

Q3. HNSW가 다층 구조를 쓰는 이유와 각 층의 역할은?

A. Greedy search의 local minimum 문제를 해결하기 위해서다.

문제: 단일 layer 그래프에서 greedy search (현재 노드의 이웃 중 가장 가까운 것으로 이동)는 local minimum에 빠질 수 있다. 근처 이웃은 다 멀지만 그래프 전체에서 더 가까운 노드가 존재하는 경우.

해결 — Skip List에서 영감: HNSW는 여러 layer를 쌓는다. 각 벡터는 지수 분포로 랜덤 layer가 결정된다:

대부분 벡터 (layer 0): 모든 벡터 포함, 조밀한 연결 (M_max0 = 32).
소수 (layer 1+): 무작위 하위집합, 스파스 연결 (M = 16).
극소수 (layer 최상위): 매우 희소.

검색 흐름:

최상위 layer에서 시작. 소수의 노드만 있으므로 빠른 전역 이동.
그 층의 최근접 노드를 찾으면 한 층 아래로 내려감.
하위 층에서 greedy search 재개 — 상위 층에서 이미 정확한 구역으로 이동했으므로 local minimum 가능성 낮음.
Layer 0까지 내려와 최종 k개 이웃 반환.

비유: 도시에서 "가장 가까운 커피숍" 찾기.

상위 layer: 고속도로로 올바른 동네까지 빠르게.
중간 layer: 동네 안에서 올바른 블록까지.
Layer 0: 블록 안에서 정확한 가게까지.

이 구조 덕분에 HNSW는 O(log N) 검색 복잡도에 가까운 성능을 보이며, 100만~10억 벡터에서 현재 벤치마크 최고 수준이다.

Q4. IVF의 nprobe 파라미터가 trade-off를 어떻게 결정하는가?

A. nprobe는 "쿼리 시 몇 개의 클러스터를 실제로 검색할 것인가" 다.

구조 복습:

IVF는 벡터 공간을 nlist개 클러스터로 분할 (k-means).
각 쿼리마다 모든 centroid와 거리 계산 후, 가장 가까운 nprobe개 클러스터만 검색.
나머지 클러스터는 완전히 건너뜀.

Trade-off:

nprobe	검색 속도	Recall	이유
1	극속	낮음 (~0.6)	가장 가까운 클러스터만. 경계 근처 벡터 놓침.
10	빠름	중간 (~0.85)	대부분 커버.
32	중간	높음 (~0.95)	경계 문제 거의 해결.
100	느림	매우 높음 (~0.99)	거의 brute force 수준.
nlist	가장 느림	1.0	모든 클러스터 검색 = 선형 탐색.

실전 튜닝 방법:

자기 데이터로 recall 측정:

for nprobe in [1, 5, 10, 20, 50, 100]:
    index.nprobe = nprobe
    recall = measure_recall(index, queries, exact_answers)
    print(f"nprobe={nprobe}: recall={recall}")

목표 recall (예: 0.95)을 달성하는 가장 작은 nprobe 선택.
그 값으로 속도 측정. 만족스러우면 채택.

경계 문제: 클러스터 경계 근처의 쿼리는 여러 클러스터를 검색해야 진짜 최근접을 찾는다. nprobe=1이면 놓칠 수 있다. nlist=1000일 때 nprobe=32 정도가 일반적 스위트스팟이다.

이런 트레이드오프를 이해하면 recall이 낮다는 불만에 "nprobe를 늘려봐라"라고 즉시 답할 수 있다.

Q5. 언제 HNSW가 아닌 다른 알고리즘을 선택해야 하는가?

A. HNSW는 현재 최고 성능이지만 모든 상황에 최적은 아니다:

1. 메모리가 극도로 제한될 때 → PQ 기반 (IVF-PQ) HNSW는 원본 벡터 + 그래프를 모두 메모리에 유지. 10억 벡터가 1TB를 넘으면 불가능. IVF-PQ는 8~16배 압축 가능.

2. 디스크 기반이 필요할 때 → DiskANN HNSW는 random access가 많아 디스크에서 극도로 느림. DiskANN은 SSD에 최적화되어 수십억 벡터를 한 서버에서 처리.

3. 삽입/삭제가 매우 빈번할 때 → Annoy, 또는 주기적 재빌드 HNSW는 삽입이 느리고 삭제가 복잡. Annoy는 다른 트레이드오프 (정적 빌드지만 로드 빠름).

4. GPU 환경 → FAISS with GPU 지원 HNSW는 GPU 가속이 어렵다 (그래프 순회가 분기 많음). IVF-PQ는 GPU에서 극도로 빠름.

5. 작은 규모 (10만 미만) + 정확도 요구 → Brute Force 진짜다. 10만 이하는 선형 탐색이 ANN보다 더 빠를 수 있다. 복잡도 없이 100% 정확.

6. 스트리밍 삽입 대량 → LSH LSH는 구조 재조정 없이 새 벡터 추가 가능. 수백만/초 삽입 시 유용.

7. 필터링이 주 조건 → Qdrant (pre-filter) "카테고리 X 내에서 벡터 검색"이면 HNSW는 post-filter라 비효율. Qdrant는 integrated filter 지원.

8. 기존 SQL DB 활용 → pgvector (IVFFlat 또는 HNSW) 별도 인프라 없이 PostgreSQL에 통합. 수백만 규모까지는 충분.

결론: HNSW는 강력한 기본값이지만 데이터 규모, 메모리, 삽입/삭제 패턴, 필터링 요구 등을 종합 고려해 선택해야 한다. "그냥 HNSW 쓰자"는 대부분 맞지만, 한계를 알아야 확장 시 다른 선택을 할 수 있다.

마치며: 벡터의 기하학

핵심 정리

Exact → Approximate: 속도를 위한 필연적 선택.
LSH: 해시 기반, 스트리밍 친화.
IVF: 클러스터링, nprobe로 튜닝.
Product Quantization: 메모리 500배 절약.
IVF-PQ: 대규모의 표준.
HNSW: 그래프 기반, 현재 최고 속도-recall.
DiskANN: 디스크 기반 거대 규모.
ScaNN: Google의 anisotropic quantization.

알고리즘 선택 요약

규모	메모리 여유	추천
< 100K	충분	Brute force
100K ~ 10M	충분	HNSW
10M ~ 100M	충분	HNSW or IVF-PQ
100M ~ 10B	제한	IVF-PQ or DiskANN
> 10B	제한	DiskANN, ScaNN 분산

RAG 시대의 중요성

ChatGPT 이후 거의 모든 LLM 애플리케이션이 RAG를 쓴다. 그리고 RAG의 성능과 비용은 벡터 검색이 크게 좌우한다:

느린 검색 → 느린 응답 → 나쁜 UX.
낮은 recall → 잘못된 컨텍스트 → 환각.
큰 메모리 → 비싼 인프라 → 비용 폭발.

ANN 알고리즘을 이해하면:

어떤 벡터 DB를 고를지 안다.
파라미터를 적절히 튜닝할 수 있다.
문제 발생 시 원인을 추측할 수 있다.
미래의 최적화 기회를 본다.

마지막 교훈

ANN은 정확도와 효율의 극단적 트레이드오프를 보여주는 분야다. 100% 정답을 원하면 O(N) 복잡도를 받아들여야 하고, O(log N)을 원하면 약간의 오차를 받아들여야 한다. 수많은 엔지니어가 이 곡선을 조금씩 밀어올려 왔다. LSH에서 IVF, PQ, HNSW, DiskANN까지 — 모두 같은 목표를 다른 각도에서 공략한 결과다.

당신이 다음에 벡터 DB를 쓸 때, 그 아래에 이런 수학과 구조가 있음을 기억하자. 단지 "Pinecone에 넣고 search"가 아니라, 왜 그렇게 빠른지를 이해할 수 있을 것이다. 그 이해가 더 좋은 시스템을 만든다.

참고 자료

ANN Algorithms Complete Guide 2025: HNSW, IVF, Product Quantization, LSH — How Vector DBs Really Work

Introduction: The "Most Similar" Problem

A New Essential Skill in the LLM Era

Since ChatGPT (2022), RAG (Retrieval-Augmented Generation) has become core to LLM apps:

Embed documents (convert to vectors).
Embed the user's question.
Search for the most similar documents.
Feed results to the LLM as context.

Step 3 is similarity search: finding the top-K nearest vectors out of millions or billions.

The Trap of the Naive Approach

The simplest approach: compute all distances, then sort.

def naive_knn(query, vectors, k):
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(enumerate(distances), key=lambda x: x[1])[:k]

100K vectors: 0.1s (fine).
1M vectors: 1s (slow).
100M vectors: tens of seconds (impossible).
1,000 dims x 100M: minutes (catastrophic).

Real search engines, recommenders, and RAG need tens of milliseconds, not seconds. Fundamentally different approach needed.

Trading Accuracy

The key insight:

"Close enough K neighbors" is usually sufficient. Exact K-NN is not required.

Allowing 10% error yields 1000x faster search. That is the core of ANN (Approximate Nearest Neighbor).

What We Cover

Vector similarity: cosine, L2, inner product.
LSH: the oldest ANN algorithm.
IVF: clustering-based search.
Product Quantization: massive memory savings.
HNSW: current graph-based standard.
DiskANN, ScaNN: the latest.
Practical selection and tuning.

1. Vector Similarity Basics

Distance Functions

L2 (Euclidean):

d(x, y) = sqrt(sum((x_i - y_i)^2))

Cosine distance:

cos(x, y) = (x . y) / (|x| * |y|)
distance = 1 - cos(x, y)

Inner product:

d(x, y) = -x . y   (negative: larger = more similar)

Normalized Vectors

If vectors are normalized to unit length:

x . y = cos(x, y) (inner product equals cosine)
|x - y|^2 = 2 - 2 * cos(x, y) (L2 equivalent to cosine)

So normalization makes L2, cosine, and inner product equivalent. Most embedding models (OpenAI, Sentence-BERT) return normalized vectors for this reason.

The Curse of Dimensionality

In high-dim space, odd things happen:

All points become nearly equidistant.
Mean-to-variance ratio shrinks.
Intuition from 2D/3D fails.

Real data lives on a low-dim manifold, which ANN exploits.

Recall Definition

Quality metric: Recall@k.

"How many of the true top-k did we find?"

recall@10 = |found top-10 that are truly top-10| / 10.
1.0 = perfect, 0.9 = most users happy, 0.5 = bad.

Target: recall 0.95+ with optimized speed and memory.

2. LSH: Locality-Sensitive Hashing

The Idea

LSH (1998) is one of the earliest ANN methods:

"Put similar vectors in the same bucket, and search only that bucket at query time."

Regular hashes give different outputs for similar inputs. LSH is the opposite: similar inputs are more likely to hash the same.

Random Projection LSH (for cosine)

import numpy as np

class CosineLSH:
    def __init__(self, dim, num_hashes=32):
        self.dim = dim
        self.num_hashes = num_hashes
        self.projections = np.random.randn(num_hashes, dim)

    def hash(self, vector):
        signs = np.sign(self.projections @ vector)
        return tuple((signs > 0).astype(int))

Collision probability:

P(hash(x) == hash(y)) = 1 - theta(x,y)/pi

Building an LSH Index

from collections import defaultdict

class LSHIndex:
    def __init__(self, dim, num_tables=10, num_hashes_per_table=16):
        self.tables = []
        for _ in range(num_tables):
            lsh = CosineLSH(dim, num_hashes_per_table)
            self.tables.append((lsh, defaultdict(list)))

    def add(self, vector, id):
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            buckets[h].append((id, vector))

    def query(self, vector, k):
        candidates = set()
        for lsh, buckets in self.tables:
            h = lsh.hash(vector)
            for id, v in buckets.get(h, []):
                candidates.add((id, tuple(v)))
        scored = [(id, cosine_distance(vector, v)) for id, v in candidates]
        return sorted(scored, key=lambda x: x[1])[:k]

Pros: simple, fast insert, distributable, streaming-friendly. Cons: low recall (0.7 to 0.85), many tables needed, tuning is tricky.

LSH has been largely superseded by HNSW/IVF for most use cases, but still shines for massive distributed and streaming workloads.

3. IVF: Inverted File Index

The Idea

IVF is widely used in FAISS:

Use k-means to split space into clusters.
Each vector assigned to the nearest centroid.
At query time, search only the nprobe closest clusters.

Query Flow

Compare query q to all centroids.
Pick top nprobe clusters.
Compare q precisely against vectors in those clusters.
Return top-k.

class IVFIndex:
    def __init__(self, dim, nlist=100):
        self.nlist = nlist
        self.centroids = None
        self.lists = [[] for _ in range(nlist)]

    def train(self, vectors):
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=self.nlist)
        kmeans.fit(vectors)
        self.centroids = kmeans.cluster_centers_
        assignments = kmeans.predict(vectors)
        for i, c in enumerate(assignments):
            self.lists[c].append((i, vectors[i]))

    def query(self, q, k=10, nprobe=10):
        centroid_dists = np.linalg.norm(self.centroids - q, axis=1)
        top_clusters = np.argsort(centroid_dists)[:nprobe]
        candidates = []
        for c in top_clusters:
            for id, v in self.lists[c]:
                candidates.append((id, np.linalg.norm(q - v)))
        return sorted(candidates, key=lambda x: x[1])[:k]

Parameters

nlist: cluster count, typically sqrt(N).
nprobe: clusters checked per query; larger = more accurate but slower.

Pros: dozens of times faster than linear scan, simple, well-optimized in FAISS. Cons: needs training, rebalancing on updates, boundary vector errors.

4. Product Quantization: King of Memory Savings

The Problem

100M x 1024-dim float32 = 400 GB. Too big for memory, too slow from disk.

PQ Idea (Jegou et al., 2011)

Split each vector into M subvectors (e.g. 1024 -> 8 x 128).
Run k-means per subspace (e.g. 256 centroids).
Replace each subvector with its nearest centroid index (1 byte).
Each vector compressed to M bytes.

Original: 1024 * 4 = 4096 bytes
PQ:       8 bytes
Compression: 512x

Encoding

class ProductQuantizer:
    def __init__(self, dim, M=8, nbits=8):
        self.dim = dim
        self.M = M
        self.nbits = nbits
        self.dsub = dim // M
        self.K = 2 ** nbits
        self.codebooks = None

    def train(self, vectors):
        from sklearn.cluster import KMeans
        self.codebooks = np.zeros((self.M, self.K, self.dsub))
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            kmeans = KMeans(n_clusters=self.K)
            kmeans.fit(sub)
            self.codebooks[m] = kmeans.cluster_centers_

    def encode(self, vectors):
        codes = np.zeros((len(vectors), self.M), dtype=np.uint8)
        for m in range(self.M):
            sub = vectors[:, m*self.dsub:(m+1)*self.dsub]
            dists = np.linalg.norm(
                sub[:, None, :] - self.codebooks[m][None, :, :],
                axis=2
            )
            codes[:, m] = np.argmin(dists, axis=1)
        return codes

Asymmetric Distance Computation (ADC)

Keep the query uncompressed.
Precompute distance from q to every centroid per subspace: lookup table LUT[M, K].
Distance to any compressed vector = sum of M lookups.

def query_adc(self, q, codes):
    lut = np.zeros((self.M, self.K))
    for m in range(self.M):
        q_sub = q[m*self.dsub:(m+1)*self.dsub]
        for k in range(self.K):
            lut[m, k] = np.sum((q_sub - self.codebooks[m, k])**2)
    distances = np.zeros(len(codes))
    for i, code in enumerate(codes):
        for m in range(self.M):
            distances[i] += lut[m, code[m]]
    return distances

Distance becomes M lookups + additions, not dim multiplications.

IVF-PQ

FAISS's famous combination:

IVF to narrow to a cluster.
PQ to compress inside each cluster.
Query: search nprobe clusters via ADC.

Result: 10B vectors in a few GB, ms-level queries. The Meta/Facebook image search scale.

import faiss
d = 128; nlist = 1024; m = 8; bits = 8
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index.train(training_vectors)
index.add(vectors)
index.nprobe = 32
distances, indices = index.search(queries, k=10)

5. HNSW: The Graph-Based Revolution

The Idea

HNSW (Hierarchical Navigable Small World) was published in 2016 (Malkov and Yashunin). It is the current best ANN algorithm for most benchmarks and powers Pinecone, Weaviate, Qdrant, Milvus, and pgvector.

Connect vectors in a graph.
Each node links to nearby neighbors.
Query by walking the graph.

Greedy Search and Local Minima

Greedy walk: move to the nearest neighbor until no closer neighbor exists. Prone to local minima — neighborhood looks far but a globally closer node exists elsewhere.

Multi-Layer Structure

HNSW stacks multiple graph layers:

Layer 2 (sparse): [A --- E --- I]
                   |     |     |
Layer 1:          [A B D E G I J]
                   | | | | | | |
Layer 0 (dense):  [A B C D E F G H I J K L]

Layer 0: all vectors, dense.
Upper layers: random subsets, sparse — act as "highways".

Search Flow

Start at top layer.
Greedy-search that layer.
Descend one layer.
Repeat down to layer 0.
Return top-k from layer 0.

Skip-list-like structure: upper = global moves, lower = fine-grained.

Insert

Pick random layer: L = floor(-log(uniform(0,1)) * mL).
Greedy-search from the top.
Connect to M nearest neighbors on each layer.
Prune if neighbor count exceeds M_max (heuristic, keep diversity).

Key Parameters

M parameter: max neighbors per node (default 16).
ef_construction: search width during insert (default 200).
ef_search: search width during query.

Pseudo-code

class HNSW:
    def __init__(self, M=16, ef_construction=200):
        self.M = M
        self.M_max0 = M * 2
        self.ef_construction = ef_construction
        self.mL = 1.0 / np.log(M)
        self.nodes = {}
        self.layers = defaultdict(dict)
        self.entry_point = None
        self.max_layer = -1

    def search_layer(self, q, entry, ef, layer):
        visited = {entry}
        candidates = [(distance(q, self.nodes[entry]), entry)]
        results = [(-distance(q, self.nodes[entry]), entry)]
        while candidates:
            dist, curr = heapq.heappop(candidates)
            if -results[0][0] < dist:
                break
            for neighbor in self.layers[layer].get(curr, []):
                if neighbor in visited:
                    continue
                visited.add(neighbor)
                d = distance(q, self.nodes[neighbor])
                if d < -results[0][0] or len(results) < ef:
                    heapq.heappush(candidates, (d, neighbor))
                    heapq.heappush(results, (-d, neighbor))
                    if len(results) > ef:
                        heapq.heappop(results)
        return sorted([(-d, n) for d, n in results])

Performance

Recall 0.95 with 100x to 1000x speedup over brute force.
Memory: ~1.5x to 2x original vectors.
Insert slower than search.
Deletion is hard (tombstone + rebuild).

Downsides

Memory-heavy (hard to run from disk).
Modification-hostile.
Parameter-sensitive.

6. DiskANN: Disk-Based ANN

Motivation

10B x 768-dim = 3TB. Beyond RAM.

Microsoft Research's DiskANN (2019) works from SSD and handles billions on one server.

Core Ideas

Vamana graph: single-layer, diverse edges (alpha > 1).
SSD minimized: tens of random reads per query.
Prefetching of next nodes.

Setup

Vectors on SSD.
Compressed graph in memory (node IDs only).
Query: walk graph in memory, read actual vectors from SSD in batches.

Performance

100M vectors: 8GB RAM + SSD, recall 0.95 @ 10ms.
1B vectors: 64GB RAM + SSD.

Memory drops to ~1/50 of original. Used in Microsoft Bing, likely OpenAI Embeddings, and Milvus's DiskANN integration.

7. ScaNN: Google's Answer

Google's ScaNN (2020) improves PQ with anisotropic quantization, optimizing quantization error with respect to inner product rather than plain L2. Combined with asymmetric hashing, it achieves top-tier recall and speed. Used in Google Search, YouTube recommendations, Google Photos. Open source via TensorFlow.

8. Vector DB Comparison

DB	Engine	Notes	Best For
Pinecone	HNSW + custom	Managed, serverless	Production, low ops
Weaviate	HNSW	Schema, GraphQL	Semantic search
Qdrant	HNSW	Rust, strong filters	Filter + vector
Milvus	IVF/HNSW/DiskANN	GPU support	Large scale
pgvector	IVF-Flat, HNSW	PostgreSQL ext	SQL integration
Elasticsearch	HNSW	Full-text + vector	Hybrid search
FAISS	All	Library	Research, custom
Chroma	HNSW	Dev-friendly	Prototypes, RAG
Redis Stack	HNSW, Flat	In-memory	Low latency

pgvector

CREATE EXTENSION vector;

CREATE TABLE items (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536)
);

CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

SELECT id, content, embedding <=> '[0.1,0.2,...]' AS distance
FROM items
ORDER BY embedding <=> '[0.1,0.2,...]'
LIMIT 10;

Good up to millions; falls behind dedicated DBs at hundreds of millions.

Qdrant — Hybrid Filter + Vector

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client.search(
    collection_name="my_collection",
    query_vector=embedding,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="tech"))]
    ),
    limit=10,
)

9. Parameter Tuning

HNSW

M parameter:

Low (8-16): memory-saving, low-dim (<=128).
Mid (16-32): common.
High (32-64): high-dim, high recall.

ef_construction:

100-200: fast build.
400-800: better recall.
1000+: best quality, slow build.

ef_search:

50-100: fast, lower recall.
100-200: balanced.
500-1000: best recall.

Recipes

High recall (>=0.99):

m = 32
ef_construction = 400
ef_search = 200

Balanced (recall 0.95):

m = 16
ef_construction = 200
ef_search = 100

Large scale, tight memory:

m = 8
ef_construction = 100
ef_search = 50
# + Product Quantization

Dimensionality Reduction

1536-dim (OpenAI ada-002) is heavy. Options: PCA to ~784, Matryoshka Embeddings, smaller models (384-dim Sentence-BERT).

10. Real Scenarios & Pitfalls

RAG Pipeline

1M document chunks, <=100ms response, high recall.
Choice: Qdrant or Pinecone, HNSW, M=32, ef_search=100, metadata filters.

Image Similarity

100M images, top-100 for re-ranking, limited memory.
Choice: Milvus + IVF-PQ, GPU, recall 0.9 then exact re-rank.

Real-Time Recommendation

User + item embeddings, <=10ms, frequent user updates.
Choice: Redis Stack + HNSW, or FAISS in-memory.

Pitfall 1: Cold Start

HNSW build is slow — batch inserts, background build, or swap.

Pitfall 2: Delete/Update

Tombstone + periodic rebuild. Consider Annoy or swap indexes.

Pitfall 3: Filtering

Post-filter: may return too few hits.
Pre-filter: HNSW graph may navigate poorly on subset.
Integrated (Qdrant): best but complex.

Pitfall 4: Curse of Dimensionality

Over 2000 dims hurts. Use 256-768 when possible.

Pitfall 5: Wrong Recall Measurement

Measure on your own data and queries:

def measure_recall(index, queries, exact_answers, k=10):
    correct = 0
    total = 0
    for q, exact in zip(queries, exact_answers):
        approx = index.search(q, k)
        correct += len(set(approx) & set(exact))
        total += k
    return correct / total

Quiz

Q1. Why use Approximate K-NN instead of Exact K-NN?

A. Speed-accuracy trade-off is extreme. Exact K-NN is O(N*d). For 100M x 1024-dim, that is minutes per query. ANN sacrifices 1-5% recall for 1000x speedup. In practice, similarity ground truth is subjective, recall 0.95 is indistinguishable to users, speed dominates UX, and re-ranking on top-100 candidates recovers accuracy. Good-enough-and-fast wins.

Q2. How does PQ yield 500x memory savings?

A. Split the vector space into M subspaces and quantize each independently. 1024-dim float32 (4096 bytes) becomes 8 subvectors with 256 centroids each, stored as 1 byte per subvector = 8 bytes total (512x). Recall survives because 256^8 = 2^64 combinations cover the data distribution well, and ADC (asymmetric distance computation) keeps queries uncompressed while reducing distance computation to M lookup-table additions. IVF-PQ combines this with cluster filtering — the de facto standard for huge-scale search.

Q3. Why does HNSW use multiple layers?

A. To avoid local minima in greedy search. Upper layers act as highways enabling fast global moves; lower layers are dense for fine-grained search. Node layer is sampled from an exponential distribution so most nodes sit at layer 0 and few reach the top. Search starts at the top layer (quick coarse positioning) and descends layer by layer until layer 0 returns the final top-k. Result: O(log N)-like complexity and state-of-the-art recall-latency.

Q4. How does IVF's nprobe control trade-offs?

A. nprobe is how many clusters to actually search per query. nprobe=1 is fastest but misses boundary cases (recall ~0.6); nprobe=10 reaches ~0.85; nprobe=32 hits ~0.95; nprobe=nlist equals brute force. Tune by measuring recall on your own data and picking the smallest nprobe that hits your target. For nlist=1000, nprobe=32 is a typical sweet spot.

Q5. When should you NOT pick HNSW?

A. Pick something else when: (1) memory is extremely tight - use IVF-PQ; (2) disk-based is required - DiskANN; (3) high insert/delete churn - Annoy or periodic rebuild; (4) GPU environment - FAISS IVF-PQ; (5) small scale under 100K - brute force is simpler and often faster; (6) heavy streaming inserts - LSH; (7) filter-dominant - Qdrant with pre-filter; (8) existing SQL stack - pgvector. HNSW is the strong default, but knowing its limits matters at scale.

Closing: The Geometry of Vectors

Summary

Exact -> Approximate: inevitable trade-off.
LSH: hash-based, streaming-friendly.
IVF: clustering; tune nprobe.
Product Quantization: 500x memory savings.
IVF-PQ: the large-scale standard.
HNSW: graph-based, top recall-latency.
DiskANN: disk-based huge scale.
ScaNN: Google's anisotropic quantization.

Algorithm Choice Summary

Scale	Memory	Pick
`<100K`	ample	Brute force
100K-10M	ample	HNSW
10M-100M	ample	HNSW or IVF-PQ
100M-10B	limited	IVF-PQ or DiskANN
>10B	limited	DiskANN, ScaNN, distributed

Why It Matters in the RAG Era

Almost every LLM app uses RAG, and its performance and cost hinge on vector search: slow search = bad UX, low recall = hallucinations, huge memory = exploding cost. Understanding ANN algorithms tells you which DB to pick, how to tune parameters, how to debug, and where the next optimization lies.