Vector Database Complete Guide 2025: Embeddings, Similarity Search, Pinecone/Weaviate/Qdrant/pgvector

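1. Why Vector Databases
2. Vector Embedding Fundamentals
3. Distance Metrics (Similarity Measurement)
4. Indexing Algorithms
5. Vector Database Comparison
6. pgvector Deep Dive
7. Hybrid Search
8. Metadata Filtering
9. Namespace and Collection Management
- 9.1 Multi-tenancy Strategies
- 9.2 Collection Lifecycle Management
10. Production Operations
11. Performance Benchmarks
12. Cost Analysis
- 12.1 Managed Service Cost Comparison
- 12.2 Self-Hosting Costs
13. Practical Implementation: Vector DB in a RAG Pipeline
14. Quiz
15. References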

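1. Why Vector Databases

1.1 Limitations of Traditional Search

Traditional databases rely on exact keyword matching. A search for "a puppy is playing in the park" will not find "a dog is playing on the lawn." The two sentences mean nearly the same thing, but they share no keywords.

Traditional search: "puppy park" → keyword matching → returns only docs containing both "puppy" and "park"
Vector search:      "puppy park" → semantic vectorization → returns docs with similar meaning
                    → "a dog is playing on the lawn" ✅ found
                    → "recommended places to walk your dog" ✅ found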

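1.2 Problems Vector Databases Solve

A vector database stores data as high-dimensional vectors (arrays of numbers) and retrieves results by vector similarity.

Core use cases:

Domain	Description	Example
RAG (Retrieval-Augmented Generation)	Provide relevant document context to LLMs	ChatGPT + internal docs
Semantic search	Meaning-based search	Natural-language question search
Recommendation systems	Find similar items	Product/content recommendations
Image search	Visual similarity	"Products similar to this outfit"
Anomaly detection	Data that deviates from normal patterns	Fraud detection
Duplicate detection	Identify similar content	Plagiarism detection, duplicate documents

1.3 Market Growth

The vector database market is projected to grow from 1.5 billion dollars in 2024 to roughly 6 billion dollars by 2028, driven mainly by the rapid adoption of RAG pipelines.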

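2. Vector Embedding Fundamentals

2.1 What Are Embeddings

An embedding converts unstructured data such as text, images, or audio into a numeric vector in a high-dimensional space. Semantically similar items end up close together in that space.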

from openai import OpenAI

client = OpenAI()

# Convert text to a vector
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Vector databases are core infrastructure for the AI era"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")    # 3072
print(f"Vector sample: {embedding[:5]}")  # e.g. [0.023, -0.041, 0.017, ...]

2.2 Text Embedding Model Comparison

Model	Provider	Dimensions	MTEB Score	Cost	Notes
text-embedding-3-large	OpenAI	3072	64.6	Paid	Supports dimension reduction
text-embedding-3-small	OpenAI	1536	62.3	Low cost	Strong performance for the price
embed-v3.0	Cohere	1024	64.5	Paid	Strong multilingual support
BGE-M3	BAAI	1024	68.2	Free	Top-scoring open-source model in this table
Jina-embeddings-v3	Jina AI	1024	65.5	Free	Multilingual focus
all-MiniLM-L6-v2	SBERT	384	56.3	Free	Lightweight and fast
nomic-embed-text	Nomic	768	62.4	Free	Long context

2.3 Image Embeddings

CLIP (Contrastive Language-Image Pre-Training) maps text and images into the same vector space.

from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

# Image embedding
img = Image.open("cat_photo.jpg")
img_embedding = model.encode(img)

# Text embedding (same vector space)
text_embedding = model.encode("a cute orange cat")

# Images can now be searched with text
from numpy import dot
from numpy.linalg import norm

similarity = dot(img_embedding, text_embedding) / (
    norm(img_embedding) * norm(text_embedding)
)
print(f"Similarity: {similarity:.4f}")  # around 0.28 or above for a related pair (higher = more related)

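2.4 Multimodal Embeddings

Recent models place text, images, and audio in a single shared vector space.

Text: "sunset over the ocean"  ──┐
                                  ├──→ same vector space → similarity can be compared
Image: 🌅 (sunset photo)        ──┘

3. Distance Metrics (Similarity Measurement)

3.1 Cosine Similarity

Measures the angle between two vectors. It ignores vector magnitude (length) and compares direction only.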

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

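# Example: 3-dimensional vectors
vec_a = np.array([1, 2, 3])
vec_b = np.array([2, 4, 6])  # same direction, different magnitude
vec_c = np.array([-1, -2, -3])  # opposite direction

print(cosine_similarity(vec_a, vec_b))  # 1.0 (exactly the same direction)
print(cosine_similarity(vec_a, vec_c))  # -1.0 (exactly the opposite direction)

When to use: text embeddings (the most common case), normalized vectors

3.2 Euclidean Distance (L2)

Measures the straight-line distance between two vectors. Smaller values mean more similar.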

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

vec_a = np.array([1, 2])
vec_b = np.array([4, 6])

print(euclidean_distance(vec_a, vec_b))  # 5.0

When to use: when vector magnitude carries meaning, clustering

3.3 Dot Product (Inner Product)

The inner product of two vectors. It takes both direction and magnitude into account.

def dot_product(a, b):
    return np.dot(a, b)

vec_a = np.array([1, 2, 3])
vec_b = np.array([4, 5, 6])

print(dot_product(vec_a, vec_b))  # 32

When to use: Maximum Inner Product Search (MIPS). On normalized (unit-length) vectors it gives the same value as cosine similarity.
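The equivalence on normalized vectors can be checked numerically. The following is a minimal sketch in plain Python (the helper names `dot`, `norm`, and `normalize` are illustrative, not from any library). It also checks the related identity that, for unit vectors, squared L2 distance equals 2 - 2 * cosine, which is why all three metrics produce the same ranking once vectors are normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
a_hat, b_hat = normalize(a), normalize(b)

cos_ab = cosine_similarity(a, b)          # cosine on the raw vectors
dot_unit = dot(a_hat, b_hat)              # dot product on the unit vectors
sq_l2_unit = sum((x - y) ** 2 for x, y in zip(a_hat, b_hat))

print(math.isclose(cos_ab, dot_unit, abs_tol=1e-12))            # True
print(math.isclose(sq_l2_unit, 2 - 2 * cos_ab, abs_tol=1e-12))  # True
```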

3.4 Manhattan Distance (L1)

The sum of the absolute differences in each dimension.

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

vec_a = np.array([1, 2, 3])
vec_b = np.array([4, 6, 3])

print(manhattan_distance(vec_a, vec_b))  # 7 (3 + 4 + 0)

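3.5 Metric Selection Guide

Text embedding search        → Cosine similarity (default recommendation)
Image similarity search      → Euclidean distance
Recommendation (MIPS)        → Dot product
High-dimensional sparse      → Cosine similarity
Clustering / classification  → Euclidean distance

4. Indexing Algorithms

With millions of vectors, comparing the query against every vector one by one (brute force) is too slow. Indexing algorithms trade a small amount of accuracy for search that can be orders of magnitude faster.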
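For reference, the brute-force baseline looks like the sketch below: one distance computation per stored vector, so cost grows linearly with collection size. The function name `brute_force_knn` and the toy data are illustrative assumptions. Exact search like this is still useful as the ground truth when measuring the recall of an approximate index.

```python
import heapq
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(query, vectors, k):
    """Exact k nearest neighbors: one distance computation per stored vector."""
    scored = ((l2(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)  # list of (distance, id), closest first

vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 2.0],
    "d": [3.0, 3.0],
}
result = brute_force_knn([0.9, 0.1], vectors, k=2)
print([vec_id for _, vec_id in result])  # ['b', 'a']
```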

4.1 HNSW (Hierarchical Navigable Small World)

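The most widely used ANN (Approximate Nearest Neighbor) algorithm.

How it works:

Layer 3: [A] ────────────────── [B]     (few nodes, long-range links)
           \                   /
Layer 2: [A] ── [C] ── [D] ── [B]      (intermediate nodes)
           \   / \   / \   / \
Layer 1: [A]-[E]-[C]-[F]-[D]-[G]-[B]   (most nodes, short-range links)
           |   |   |   |   |   |   |
Layer 0: [base layer containing every vector]  (all nodes)

The search finds a coarse position in the upper layers, then refines it in the lower layers.
Because it is graph-based it uses a lot of memory, but it is very fast.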
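The navigation step can be sketched as a greedy walk over a neighbor graph. This is a simplified single-layer illustration under assumed toy data, not a full HNSW implementation: real HNSW repeats this walk layer by layer and keeps a candidate list of size ef rather than a single current node. The graph, point IDs, and function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(points, neighbors, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = sq_dist(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for cand in neighbors[current]:
            d = sq_dist(points[cand], query)
            if d < best_dist:
                best, best_dist = cand, d
        if best == current:  # no neighbor is closer: local minimum reached
            return current
        current, current_dist = best, best_dist

# Five points on a line; the A-D edge plays the role of a long-range link.
points = {"A": [0.0], "E": [1.0], "C": [2.0], "F": [3.0], "D": [4.0]}
neighbors = {
    "A": ["E", "D"],
    "E": ["A", "C"],
    "C": ["E", "F"],
    "F": ["C", "D"],
    "D": ["F", "A"],
}
found = greedy_search(points, neighbors, entry="A", query=[2.8])
print(found)  # F  (path: A -> D -> F)
```

The long-range edge lets the walk jump from A to D in one hop instead of stepping through E and C, which is the intuition behind the sparse upper layers.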

# Example HNSW configuration in Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, HnswConfigDiff

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance="Cosine"
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                    # links per node (higher = more accurate, more memory)
        ef_construct=128,        # search breadth while building the index
        full_scan_threshold=10000  # use brute force at or below this size
    )
)

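Key parameters:

Parameter	Description	Typical value	Effect
M	Links per node	16	Higher increases accuracy and memory use
ef_construct	Search breadth during build	128	Higher improves index quality
ef_search	Search breadth at query time	64	Higher improves recall but slows queries

4.2 IVF (Inverted File Index)

Partitions vectors into clusters and, at query time, searches only the clusters nearest to the query.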

Full vector space:
┌─────────────────────────────────┐
│  ● ●   ★ ●                     │
│   ●  Cluster1  ●  ●            │
│  ●  ●          ★               │
│        ● ●  Cluster2  ●        │
│         ●     ●  ● ●           │
│              ●   ●             │
│    ● ●  ●         ★            │
│     Cluster3        Cluster4   │
│    ●  ● ●         ●  ●  ●     │
└─────────────────────────────────┘
★ = cluster center (centroid)

Search: only the nprobe clusters closest to the query are scanned
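The cluster-then-probe idea can be sketched in a few lines. This is a toy illustration with hand-picked centroids (an assumption; real IVF indexes learn centroids with k-means), and the function names are hypothetical. It also shows the recall trade-off: with nprobe=1 the true nearest neighbor is missed because it sits just across a cluster boundary.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign every vector to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in lists[c]]
    candidates.sort(key=lambda vec_id: sq_dist(query, vectors[vec_id]))
    return candidates[:k]

centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]  # assumed, not learned
vectors = {
    "v1": [1.0, 1.0], "v2": [4.8, 0.0], "v3": [6.0, 0.0],
    "v4": [9.0, 1.0], "v5": [1.0, 9.0],
}
lists = build_ivf(vectors, centroids)

query = [5.1, 0.0]
print(ivf_search(query, vectors, centroids, lists, nprobe=1, k=1))  # ['v3'] (misses v2)
print(ivf_search(query, vectors, centroids, lists, nprobe=2, k=1))  # ['v2'] (true nearest)
```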

4.3 IVF-PQ (IVF + Product Quantization)

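PQ compresses vectors to reduce memory usage.

# Create an IVF-PQ index in FAISS
import faiss

dimension = 1536
nlist = 256      # number of clusters
m_pq = 48        # number of sub-vectors (the dimension is split into m_pq parts)
nbits = 8        # bits per sub-vector code (2^8 = 256 centroids per codebook)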

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m_pq, nbits)

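# Learn the clusters and codebooks from training data
# (training_vectors / all_vectors: float32 arrays of shape (n, dimension))
index.train(training_vectors)
index.add(all_vectors)

# Set nprobe at search time
index.nprobe = 16  # number of clusters to scan
# query_vector: float32 array of shape (1, dimension)
distances, indices = index.search(query_vector, k=10)

Memory comparison:

Original (float32, 1536 dims): 1536 x 4 bytes = 6,144 bytes/vector
PQ-compressed (48 sub-vectors): 48 x 1 byte  = 48 bytes/vector
→ 128x compression. For 100 million vectors: 614GB → 4.8GB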
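The arithmetic above can be reproduced directly. This sketch counts only the stored codes; codebooks, cluster centroids, and vector IDs add some overhead that is ignored here (an assumption made to match the figures in the text). The helper name is illustrative.

```python
def pq_bytes_per_vector(num_subvectors, nbits):
    """Size of one PQ code: num_subvectors codes of nbits each."""
    return num_subvectors * nbits / 8

dim, n_vectors = 1536, 100_000_000
raw_bytes = dim * 4                     # float32 storage per vector
pq_bytes = pq_bytes_per_vector(48, 8)   # m_pq=48, nbits=8

print(raw_bytes, pq_bytes, raw_bytes / pq_bytes)  # 6144 48.0 128.0
print(n_vectors * raw_bytes / 1e9)                # 614.4 (GB)
print(n_vectors * pq_bytes / 1e9)                 # 4.8 (GB)
```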

4.4 ScaNN (Scalable Nearest Neighbors)

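An algorithm developed by Google. It combines asymmetric hashing with quantization to stay fast even at high recall.

4.5 Annoy (Approximate Nearest Neighbors Oh Yeah)

An algorithm developed by Spotify. It builds trees by recursively splitting the space with random hyperplanes. The index is read-only and memory-mapped, so it can be shared across multiple processes.

4.6 Algorithm Comparison

Algorithm	Search speed	Memory	Index build	Accuracy	Suitable scale
Flat (brute force)	Slow	High	None	100%	Up to 100K
HNSW	Very fast	High	Slow	Very high	Millions
IVF-Flat	Fast	Medium	Moderate	High	Tens of millions
IVF-PQ	Fast	Low	Moderate	Medium	Hundreds of millions
ScaNN	Very fast	Medium	Moderate	High	Hundreds of millions
Annoy	Fast	Low	Fast	Medium	Tens of millions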

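5. Vector Database Comparison

5.1 Comparison of Major Solutions

Feature	Pinecone	Weaviate	Qdrant	Milvus	Chroma	pgvector
Type	Managed SaaS	Open source/cloud	Open source/cloud	Open source	Open source	PostgreSQL extension
Index	Proprietary	HNSW	HNSW	HNSW/IVF/DiskANN	HNSW	HNSW/IVFFlat
Hybrid search	Sparse-Dense	BM25 + Vector	Sparse + Dense	Supported	Limited	Full-text + Vector
Filtering	Metadata filters	GraphQL filters	Payload filters	Attribute filters	Where clause	SQL WHERE
Multi-tenancy	Namespaces	Tenant isolation	Collections/payload	Partitions	Collections	Schemas/RLS
Max dimensions	20,000	65,535	65,535	32,768	No limit	2,000 (indexed)
Distribution	Automatic	Raft consensus	Raft consensus	Distributed architecture	Not supported	Citus extension
SDK	Python/JS/Go/Java	Python/JS/Go/Java	Python/JS/Rust/Go	Python/JS/Go/Java	Python/JS	SQL
Pricing	Pod/Serverless plans	Free (open source)	Free (open source)	Free (open source)	Free (open source)	Free
GPU support	Internal	Not supported	Not supported	NVIDIA GPU	Not supported	Not supported
Real-time updates	Supported	Supported	Supported	Supported	Supported	Supported
Backup/restore	Automatic	Collections API	Snapshot	Supported	Limited	pg_dump
Community	Moderate	Active	Active	Very active	Active	Very active
Serverless option	Supported	Supported	Supported (Cloud)	Zilliz Cloud	Not supported	Supabase/Neon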

5.2 Pinecone

A fully managed vector database. You use it purely through its API, with no infrastructure to operate.

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create an index
pc.create_index(
    name="my-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

index = pc.Index("my-index")

# Upsert vectors
index.upsert(
    vectors=[
        ("id1", [0.1, 0.2, ...], {"title": "Document title", "category": "tech"}),
        ("id2", [0.3, 0.4, ...], {"title": "Another document", "category": "science"}),
    ],
    namespace="articles"
)

# Query (filter + vector)
results = index.query(
    vector=[0.15, 0.25, ...],
    top_k=5,
    filter={"category": "tech"},
    namespace="articles",
    include_metadata=True
)

Pros: fully managed, instant scaling, Serverless option
Cons: vendor lock-in, high cost, no self-hosting

5.3 Weaviate

An open-source vector database that offers a GraphQL-based API and built-in vectorization modules.

import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Create a collection (built-in vectorization)
collection = client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ]
)

# Add data (vectorized automatically)
articles = client.collections.get("Article")
articles.data.insert(
    properties={
        "title": "Vector DB Guide",
        "content": "Vector databases are core to the AI era...",
        "category": "tech"
    }
)

# Semantic search
response = articles.query.near_text(
    query="AI infrastructure",
    limit=5,
    filters=weaviate.classes.query.Filter.by_property("category").equal("tech")
)

Pros: built-in vectorization, GraphQL API, module ecosystem
Cons: resource-hungry, learning curve

5.4 Qdrant

A high-performance vector database written in Rust. It is strong at payload filtering and quantization.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue
)

client = QdrantClient("localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    )
)

# Upsert vectors
client.upsert(
    collection_name="articles",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, ...],
            payload={"title": "Vector DB Guide", "category": "tech", "views": 1500}
        ),
    ]
)

# Filter + vector search
results = client.search(
    collection_name="articles",
    query_vector=[0.15, 0.25, ...],
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="tech")),
        ]
    ),
    limit=5
)

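Pros: Rust-based performance, fine-grained filtering, scalar/binary quantization
Cons: relatively small ecosystem

5.5 Milvus

A large-scale vector database built around a distributed architecture. It supports GPU acceleration.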

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect("default", host="localhost", port="19530")

# Define the schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields, description="Articles")
collection = Collection("articles", schema)

# Create the index
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 16, "efConstruction": 256}
}
collection.create_index("embedding", index_params)

# Search
collection.load()
results = collection.search(
    data=[[0.1, 0.2, ...]],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 128}},
    limit=5,
    output_fields=["title"]
)

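Pros: large-scale distribution, GPU support, wide choice of indexes
Cons: operational complexity, resource requirements

5.6 ChromaDB

A lightweight, developer-friendly open-source vector database. It is well suited to prototyping.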

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

collection = client.create_collection(
    name="articles",
    metadata={"hnsw:space": "cosine"}
)

# Add documents (embedded automatically)
collection.add(
    documents=["Vector DBs are core AI infrastructure", "RAG stands for retrieval-augmented generation"],
    metadatas=[{"category": "tech"}, {"category": "ai"}],
    ids=["doc1", "doc2"]
)

# Query
results = collection.query(
    query_texts=["artificial intelligence database"],
    n_results=5,
    where={"category": "tech"}
)

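Pros: very simple API, built-in embeddings, runs locally
Cons: limited production scaling, no distributed mode

6. pgvector Deep Dive

6.1 Why pgvector

If you already run PostgreSQL, you can add vector search to your existing infrastructure without a separate vector database. It combines the full power of SQL with vector search.

6.2 Installation and Setup

-- Requires PostgreSQL 14+
-- Install the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column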
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    category TEXT,
    embedding VECTOR(1536),  -- dimension of OpenAI text-embedding-3-small
    created_at TIMESTAMPTZ DEFAULT NOW()
);

6.3 Inserting and Searching Data

-- Insert a vector
INSERT INTO documents (title, content, category, embedding)
VALUES (
    'Vector DB Guide',
    'Vector databases are core to the AI era...',
    'tech',
    '[0.1, 0.2, 0.3, ...]'::vector  -- 1536-dimensional vector
);

-- Cosine distance search (5 most similar)
SELECT id, title, 1 - (embedding <=> '[0.15, 0.25, ...]'::vector) AS similarity
FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> '[0.15, 0.25, ...]'::vector
LIMIT 5;

-- L2 distance search
SELECT id, title, embedding <-> '[0.15, 0.25, ...]'::vector AS distance
FROM documents
ORDER BY embedding <-> '[0.15, 0.25, ...]'::vector
LIMIT 5;

-- Inner product search (Maximum Inner Product); <#> returns the negative inner product
SELECT id, title, (embedding <#> '[0.15, 0.25, ...]'::vector) * -1 AS inner_product
FROM documents
ORDER BY embedding <#> '[0.15, 0.25, ...]'::vector
LIMIT 5;

6.4 HNSW vs IVFFlat Indexes

-- HNSW index (recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- Set ef_search at query time
SET hnsw.ef_search = 100;

-- IVFFlat index
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- number of clusters (pgvector suggests rows / 1000 up to 1M rows, sqrt(rows) beyond that)

-- Set probes at query time
SET ivfflat.probes = 10;

HNSW vs IVFFlat comparison:

Feature	HNSW	IVFFlat
Search speed	Faster	Fast
Index build	Slow	Fast
Memory	Uses more	Uses less
Accuracy (recall)	Higher	Moderate
Real-time inserts	Good	May need a rebuild
Recommendation	Default choice	When memory is constrained
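For IVFFlat, the starting points suggested in the pgvector README (as I understand it: lists = rows / 1000 up to 1M rows and sqrt(rows) above that, probes = sqrt(lists)) can be wrapped in small helpers. The function names below are hypothetical, and the results are starting values to tune against measured recall, not fixed rules.

```python
import math

def suggest_ivfflat_lists(row_count):
    """Starting point: rows / 1000 up to 1M rows, sqrt(rows) beyond that."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return max(1, round(math.sqrt(row_count)))

def suggest_ivfflat_probes(lists):
    """Starting point: sqrt(lists)."""
    return max(1, round(math.sqrt(lists)))

for rows in (100_000, 1_000_000, 10_000_000):
    lists = suggest_ivfflat_lists(rows)
    print(rows, lists, suggest_ivfflat_probes(lists))
# 100000 100 10
# 1000000 1000 32
# 10000000 3162 56
```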

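6.5 Query Optimization

-- 1. Partial index (one category only)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WHERE category = 'tech';

-- 2. Filter + vector search (query_vec stands for the query embedding)
-- Slow pattern: with one global ANN index the filter is applied after the
-- index scan, so a selective filter can return fewer than LIMIT rows
SELECT * FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> query_vec
LIMIT 10;

-- Fast pattern: use a partitioned table so each category has its own index
-- (assumes documents was created with PARTITION BY LIST (category))
CREATE TABLE documents_tech PARTITION OF documents
FOR VALUES IN ('tech');

-- 3. Confirm index usage with EXPLAIN
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ...]'::vector
LIMIT 5;

-- Look for "Index Scan using documents_embedding_idx" in the plan

6.6 Using pgvector from Python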

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
register_vector(conn)

cur = conn.cursor()

# Vector search
query_embedding = [0.1, 0.2, ...]  # 1536 dimensions
cur.execute("""
    SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    WHERE category = %s
    ORDER BY embedding <=> %s::vector
    LIMIT %s
""", (query_embedding, "tech", query_embedding, 5))

results = cur.fetchall()
for row in results:
    print(f"ID: {row[0]}, Title: {row[1]}, Similarity: {row[2]:.4f}")

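7. Hybrid Search

7.1 Why Hybrid Search

Vector search alone is sometimes not enough.

Query: "PostgreSQL 16 release notes"

Vector search only: "MySQL 8.0 new features" also scores high (similar meaning)
Keyword search only: only documents that contain exactly "PostgreSQL 16"
Hybrid search: semantically similar + contains "PostgreSQL 16" = best results

7.2 Combining BM25 + Vector

# Hybrid search in Weaviate
from weaviate.classes.query import HybridFusion

response = articles.query.hybrid(
    query="PostgreSQL vector search performance",
    alpha=0.5,  # 0 = keyword only, 1 = vector only, 0.5 = balanced
    limit=10,
    fusion_type=HybridFusion.RELATIVE_SCORE  # or HybridFusion.RANKED
)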

7.3 Reciprocal Rank Fusion (RRF)

An algorithm that combines the rankings from two result lists.

def reciprocal_rank_fusion(keyword_results, vector_results, k=60):
    """
    RRF score = sum(1 / (k + rank_i)), with ranks starting at 1
    k = 60 is the usual choice
    """
    scores = {}
    
    for rank, doc_id in enumerate(keyword_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    
    # Sort by score, highest first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example
keyword_hits = ["doc_A", "doc_B", "doc_C", "doc_D"]
vector_hits = ["doc_C", "doc_A", "doc_E", "doc_B"]

fused = reciprocal_rank_fusion(keyword_hits, vector_hits)
# doc_A and doc_C rank near the top of both lists, so they come out on top

7.4 pgvector + Full-Text Search

-- Hybrid search (PostgreSQL); query_vec stands for the query embedding
WITH vector_search AS (
    SELECT id, title, content,
           1 - (embedding <=> query_vec) AS vector_score,
           ROW_NUMBER() OVER (ORDER BY embedding <=> query_vec) AS vector_rank
    FROM documents
    ORDER BY embedding <=> query_vec
    LIMIT 20
),
keyword_search AS (
    -- ts_rank is PostgreSQL's built-in full-text ranking, not true BM25
    SELECT id, title, content,
           ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'vector database')) AS keyword_score,
           ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'vector database')) DESC) AS keyword_rank
    FROM documents
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'vector database')
    ORDER BY keyword_rank  -- without this, LIMIT would keep an arbitrary 20 rows
    LIMIT 20
)
SELECT
    COALESCE(v.id, k.id) AS id,
    COALESCE(v.title, k.title) AS title,
    -- RRF score
    COALESCE(1.0 / (60 + v.vector_rank), 0) +
    COALESCE(1.0 / (60 + k.keyword_rank), 0) AS rrf_score
FROM vector_search v
FULL OUTER JOIN keyword_search k ON v.id = k.id
ORDER BY rrf_score DESC
LIMIT 10;

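8. Metadata Filtering

8.1 Filtering Strategies

There are three strategies for combining vector search with metadata filters.

Pre-filtering:  apply the filter → search only the matching vectors (accurate, but can be slow)
Post-filtering: vector search → filter the results (fast, but may return too few results)
In-filtering:   apply the filter during the search (best results, but complex to implement)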
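The difference between the first two strategies shows up clearly on a toy dataset. The sketch below uses exact search over four one-dimensional vectors (illustrative data and hypothetical function names); the point is the order of operations, not the search method. Post-filtering returns nothing here because the two nearest vectors both fail the filter.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, items, k):
    return sorted(items, key=lambda it: sq_dist(query, it["vec"]))[:k]

def pre_filter_search(query, items, predicate, k):
    """Filter first, then rank only the matching vectors."""
    return top_k(query, [it for it in items if predicate(it)], k)

def post_filter_search(query, items, predicate, k):
    """Rank everything, take the top k, then drop non-matching hits."""
    return [it for it in top_k(query, items, k) if predicate(it)]

items = [
    {"id": 1, "vec": [0.0], "category": "science"},
    {"id": 2, "vec": [0.1], "category": "science"},
    {"id": 3, "vec": [0.2], "category": "tech"},
    {"id": 4, "vec": [0.9], "category": "tech"},
]

def is_tech(item):
    return item["category"] == "tech"

pre = [it["id"] for it in pre_filter_search([0.0], items, is_tech, k=2)]
post = [it["id"] for it in post_filter_search([0.0], items, is_tech, k=2)]
print(pre)   # [3, 4]
print(post)  # []
```

A common workaround for post-filtering is to over-fetch (search for several times k before filtering), at the cost of extra work per query.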

8.2 Advanced Filtering in Qdrant

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Compound filter
results = client.search(
    collection_name="products",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="electronics")),
            FieldCondition(key="price", range=Range(gte=100, lte=500)),
        ],
        must_not=[
            FieldCondition(key="out_of_stock", match=MatchValue(value=True)),
        ],
        should=[
            FieldCondition(key="brand", match=MatchValue(value="Apple")),
            FieldCondition(key="brand", match=MatchValue(value="Samsung")),
        ]
    ),
    limit=10
)

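8.3 Efficient Metadata Design

Recommended:
- Create payload indexes on fields you filter on frequently
- Low-cardinality fields (category, status) benefit most
- Date range filters need an index
- Prefer flat structures over nested objects

Not recommended:
- Indexing fields with a very large number of unique values (user IDs, etc.)
- Searching on metadata alone with no vector search (a regular database is a better fit)

9. Namespace and Collection Management

9.1 Multi-tenancy Strategies

Strategy 1: Namespace/partition separation (Pinecone, Milvus)
┌──────────── Index ────────────┐
│  Namespace: tenant_A  ●●●●●  │
│  Namespace: tenant_B  ○○○○○  │
│  Namespace: tenant_C  △△△△△  │
└───────────────────────────────┘
- Pros: simple, data isolation
- Cons: cross-tenant search is difficult

Strategy 2: Metadata filter (Qdrant, Weaviate)
┌──────── Collection ───────────┐
│  ●(A) ○(B) △(C) ●(A) ○(B)   │
│  △(C) ●(A) ●(A) ○(B) △(C)   │
│  Separated by tenant_id filter│
└───────────────────────────────┘
- Pros: flexible, cross-tenant search possible
- Cons: performance degrades with many tenants

Strategy 3: Separate collections (a few large tenants)
┌─ Collection: tenant_A ─┐
│  ●●●●●●●●●●●●●●●●●●  │
└────────────────────────┘
┌─ Collection: tenant_B ─┐
│  ○○○○○○○○○○○○○○○○○○  │
└────────────────────────┘
- Pros: complete isolation, independent scaling
- Cons: management overhead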

9.2 Collection Lifecycle Management

# Qdrant example: collection management
# List collections
collections = client.get_collections()

# Collection info
info = client.get_collection("articles")
print(f"Vector count: {info.points_count}")
print(f"Index status: {info.status}")

# Delete a collection
client.delete_collection("old_articles")

# Aliases (blue-green deployment)
client.update_collection_aliases(
    change_aliases_operations=[
        {"create_alias": {"collection_name": "articles_v2", "alias_name": "articles_prod"}}
    ]
)

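10. Production Operations

10.1 Sharding Strategies

Horizontal sharding: distribute data across multiple nodes
┌─ Node 1 ──┐ ┌─ Node 2 ──┐ ┌─ Node 3 ──┐
│ Shard 1   │ │ Shard 2   │ │ Shard 3   │
│ 33% data  │ │ 33% data  │ │ 33% data  │
└───────────┘ └───────────┘ └───────────┘
        ↕ a search queries every node, then merges the results

Qdrant sharding options:
- auto sharding: points are distributed automatically according to shard_number
- custom sharding: separate data per tenant with a shard_key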
10.2 Replication

# Qdrant replication settings
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    replication_factor=3,     # replicate to 3 nodes
    write_consistency_factor=2  # a write succeeds once 2 nodes acknowledge it
)

10.3 Backup and Restore

# Create a Qdrant snapshot
curl -X POST "http://localhost:6333/collections/articles/snapshots"

# Restore from a snapshot
curl -X PUT "http://localhost:6333/collections/articles/snapshots/recover" \
  -H "Content-Type: application/json" \
  -d '{"location": "http://backup-server/snapshot.tar"}'

# pgvector: use pg_dump
pg_dump -t documents mydb > documents_backup.sql

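10.4 Monitoring

# Key metrics
monitoring_metrics = {
    "Search latency (p50, p95, p99)": "Target: p99 under 100ms",
    "QPS (queries per second)": "Monitor against load",
    "Recall@K": "Accuracy. Target 0.95 or higher",
    "Index size / memory usage": "Prevent OOM",
    "Insert latency": "Important for real-time updates",
    "Disk usage": "Capacity planning",
    "Replication lag": "Data consistency",
}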
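Two of these metrics are easy to compute offline. The sketch below measures Recall@K against exact (brute-force) results and derives latency percentiles with the nearest-rank method; the function names and sample data are illustrative assumptions, and production systems usually take percentiles from their metrics backend instead.

```python
import math

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def percentile(samples, p):
    """Nearest-rank percentile, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

exact = ["d1", "d2", "d3", "d4", "d5"]    # ground truth from exact search
approx = ["d1", "d3", "d2", "d9", "d5"]   # what the ANN index returned
print(recall_at_k(approx, exact, k=5))    # 0.8

latencies_ms = [3, 4, 4, 5, 5, 6, 7, 9, 12, 40]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 5 40
```

The single 40 ms outlier leaves p50 untouched but dominates p99, which is why tail percentiles are tracked separately from the median.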

11. Performance Benchmarks

11.1 ANN Benchmarks Results (1M vectors, 128 dimensions)

DB / Algorithm	QPS (Recall 0.95)	QPS (Recall 0.99)	Index time	Memory
Qdrant HNSW	8,500	4,200	12 min	2.1GB
Weaviate HNSW	7,800	3,900	14 min	2.3GB
Milvus HNSW	9,200	4,500	11 min	2.0GB
pgvector HNSW	3,400	1,600	25 min	2.5GB
Pinecone (p2)	5,000	2,800	N/A	N/A
FAISS IVF-PQ	12,000	5,500	8 min	0.4GB

11.2 Real-World Benchmark (10M vectors, 1536 dimensions)

Test environment: AWS r6g.2xlarge (8 vCPU, 64GB RAM)

Qdrant:
  - Index build: 45 min
  - Memory: 48GB
  - p50 latency: 3ms, p99 latency: 12ms
  - QPS: 2,100 (Recall@10 = 0.96)

pgvector (HNSW):
  - Index build: 2 h 30 min
  - Memory: 52GB
  - p50 latency: 8ms, p99 latency: 35ms
  - QPS: 800 (Recall@10 = 0.95)

Milvus (DiskANN):
  - Index build: 35 min
  - Memory: 12GB (uses disk)
  - p50 latency: 5ms, p99 latency: 18ms
  - QPS: 1,800 (Recall@10 = 0.95)

11.3 Benchmark Summary

Small scale (up to 100K) + existing PostgreSQL → pgvector
Small-scale prototyping → ChromaDB
Medium scale (100K to 10M) + self-hosting → Qdrant or Weaviate
Large scale (10M+) + self-hosting → Milvus
Want a managed service → Pinecone or Zilliz Cloud
Hybrid search matters most → Weaviate
Filtering matters most → Qdrant

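12. Cost Analysis

12.1 Managed Service Cost Comparison

Service	Free tier	Paid starting price	Estimated cost for 1M vectors
Pinecone Serverless	2GB storage	Billed per read/write	About 70 dollars/month
Pinecone Pod (s1)	-	70 dollars/month	About 140 dollars/month
Weaviate Cloud	14 days free	25 dollars/month	About 100 dollars/month
Qdrant Cloud	1GB free	25 dollars/month	About 65 dollars/month
Zilliz Cloud	Free tier	Pay-as-you-go	About 90 dollars/month

12.2 Self-Hosting Costs

Estimated infrastructure for 1M vectors (1536 dimensions, HNSW):
  - RAM required: about 12GB
  - Disk: about 20GB
  - AWS r6g.xlarge (4 vCPU, 32GB): about 150 dollars/month
  - Operations staff cost not included

10M vectors:
  - RAM required: about 100GB
  - AWS r6g.4xlarge (16 vCPU, 128GB): about 600 dollars/month

Cost-saving tips:
  - Quantization cuts memory by 60 to 75%
  - DiskANN moves the index to SSD (about 80% less memory)
  - Dimension reduction (1536 → 512) cuts vector memory to about one third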
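The effect of these tips on raw vector storage follows from simple arithmetic. The sketch below counts vector bytes only, which is a lower bound: HNSW links, payloads, and runtime overhead come on top, which is why the RAM estimates above are larger. The function name is hypothetical.

```python
def vector_memory_gb(n_vectors, dim, bytes_per_value=4):
    """Raw vector storage only; index links and payloads are extra."""
    return n_vectors * dim * bytes_per_value / 1e9

full = vector_memory_gb(1_000_000, 1536)                     # float32
int8 = vector_memory_gb(1_000_000, 1536, bytes_per_value=1)  # scalar quantization to int8
small = vector_memory_gb(1_000_000, 512)                     # reduced dimension, float32

print(f"float32: {full:.3f} GB, int8: {int8:.3f} GB, 512-dim: {small:.3f} GB")
print(f"int8 saves {1 - int8 / full:.0%}")                # int8 saves 75%
print(f"512 dims uses 1/{full / small:.0f} of the memory")  # 512 dims uses 1/3 of the memory
```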

13. Practical Implementation: Vector DB in a RAG Pipeline

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

openai_client = OpenAI()
qdrant_client = QdrantClient("localhost", port=6333)

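# Create the collection once (1536 dimensions matches text-embedding-3-small)
if not qdrant_client.collection_exists("knowledge_base"):
    qdrant_client.create_collection(
        collection_name="knowledge_base",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

# 1. Embed and store documents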
def embed_and_store(documents):
    for doc in documents:
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["content"]
        )
        embedding = response.data[0].embedding
        
        qdrant_client.upsert(
            collection_name="knowledge_base",
            points=[PointStruct(
                id=doc["id"],
                vector=embedding,
                payload={"title": doc["title"], "content": doc["content"]}
            )]
        )

# 2. Retrieve + generate with the LLM
def rag_query(question):
    # Embed the question
    q_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    q_embedding = q_response.data[0].embedding
    
    # Vector search
    results = qdrant_client.search(
        collection_name="knowledge_base",
        query_vector=q_embedding,
        limit=5
    )
    
    # Build the context
    context = "\n\n".join([
        f"[{r.payload['title']}]\n{r.payload['content']}"
        for r in results
    ])
    
    # Generate with the LLM
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on the following context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    
    return response.choices[0].message.content

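14. Quiz

Q1. What is the difference between cosine similarity and Euclidean distance?

Cosine similarity compares only the direction (angle) of two vectors. If the directions match, the result is 1 regardless of magnitude. Euclidean distance measures the straight-line distance between two vectors, so magnitude also affects the result. On normalized vectors the two metrics produce the same ranking. Cosine similarity is the default recommendation for text embeddings.

Q2. What happens when you increase the M parameter of HNSW?

M is the maximum number of links per node. Raising M makes the graph denser, which improves search accuracy (recall) but increases memory usage and index build time. M=16 is a common default, and you can raise it to M=32 to 64 when you need high recall. Very high values give diminishing returns.

Q3. What role does RRF (Reciprocal Rank Fusion) play in hybrid search?

RRF is an algorithm that merges keyword search and vector search results by rank. Each document scores 1/(k + rank) per result list, and the scores are summed, so documents that rank highly in both lists end up on top. k=60 is the standard value, and a key advantage is that no score normalization is needed.

Q4. Should you choose the HNSW index or the IVFFlat index in pgvector?

HNSW is the recommended default. It offers higher recall, faster search, and good support for real-time inserts. Consider IVFFlat when memory is tightly constrained. IVFFlat builds faster and uses less memory, but it may need a rebuild (REINDEX) when the data changes substantially.

Q5. What is a cost-effective vector DB choice at the 10M-vector scale?

If self-hosting is an option, Qdrant or Milvus is cost-effective. The DiskANN index in Milvus can cut memory use significantly. If you want a managed service, Pinecone Serverless lets you control cost through pay-as-you-go pricing. Applying quantization (scalar/binary) reduces memory by 60 to 75%, which lowers infrastructure cost considerably. Dimension reduction (from 1536 to 512) is also effective.

15. References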

Pinecone Documentation - https://docs.pinecone.io/
Weaviate Documentation - https://weaviate.io/developers/weaviate
Qdrant Documentation - https://qdrant.tech/documentation/
Milvus Documentation - https://milvus.io/docs
pgvector GitHub - https://github.com/pgvector/pgvector
ChromaDB Documentation - https://docs.trychroma.com/
ANN Benchmarks - https://ann-benchmarks.com/
FAISS Wiki - https://github.com/facebookresearch/faiss/wiki
OpenAI Embeddings Guide - https://platform.openai.com/docs/guides/embeddings
Cohere Embed Documentation - https://docs.cohere.com/reference/embed
HNSW original paper - https://arxiv.org/abs/1603.09320
Product Quantization paper - https://hal.inria.fr/inria-00514462v2/document
MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
Reciprocal Rank Fusion paper - https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf

Vector Database Complete Guide 2025: Embeddings, Similarity Search, Pinecone/Weaviate/Qdrant/pgvector

1. Why Vector Databases
2. Vector Embedding Fundamentals
3. Distance Metrics (Similarity Measurement)
4. Indexing Algorithms
5. Vector Database Comparison
6. pgvector Deep Dive
7. Hybrid Search
8. Metadata Filtering
9. Namespace and Collection Management
- 9.1 Multi-tenancy Strategies
- 9.2 Collection Lifecycle Management
10. Production Operations
11. Performance Benchmarks
12. Cost Analysis
- 12.1 Managed Service Cost Comparison
- 12.2 Self-Hosting Costs
13. Practical Implementation: Vector DB in a RAG Pipeline
14. Quiz
15. References

1. Why Vector Databases

1.1 Limitations of Traditional Search

Traditional databases rely on exact keyword matching. If you search for "a puppy playing in the park," you will not find "a dog running on the lawn." These two sentences are semantically nearly identical, but their keywords differ.

Traditional search: "puppy park" → keyword matching → only returns docs containing "puppy" AND "park"
Vector search:      "puppy park" → semantic vectorization → returns all semantically similar docs
                    → "a dog running on the lawn" ✅ found
                    → "best pet walking spots" ✅ found

1.2 Problems Vector Databases Solve

Vector databases store data as high-dimensional vectors (arrays of numbers) and search based on vector similarity.

Core Use Cases:

Domain	Description	Example
RAG (Retrieval-Augmented Generation)	Provide relevant document context to LLMs	ChatGPT + internal docs
Semantic Search	Meaning-based search	Natural language question search
Recommendation Systems	Find similar items	Product/content recommendations
Image Search	Visual similarity	"Products similar to this outfit"
Anomaly Detection	Data deviating from normal patterns	Fraud detection
Duplicate Detection	Identify similar content	Plagiarism detection, dedup

1.3 Market Growth

The Vector Database market is projected to grow from 1.5 billion dollars in 2024 to approximately 6 billion dollars by 2028. Explosive adoption of RAG pipelines is the key driver.

2. Vector Embedding Fundamentals

2.1 What Are Embeddings

Embeddings convert unstructured data such as text, images, and audio into numeric vectors in high-dimensional space. Semantically similar data is located close together in vector space.

from openai import OpenAI

client = OpenAI()

# Convert text to vector
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Vector databases are essential AI infrastructure"
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")    # 3072
print(f"Vector sample: {embedding[:5]}")  # [0.023, -0.041, 0.017, ...]

2.2 Text Embedding Model Comparison

Model	Provider	Dimensions	MTEB Score	Cost	Features
text-embedding-3-large	OpenAI	3072	64.6	Paid	Dimension reduction support
text-embedding-3-small	OpenAI	1536	62.3	Low cost	Best cost-performance ratio
embed-v3.0	Cohere	1024	64.5	Paid	Excellent multilingual
BGE-M3	BAAI	1024	68.2	Free	Best open-source
Jina-embeddings-v3	Jina AI	1024	65.5	Free	Multilingual specialized
all-MiniLM-L6-v2	SBERT	384	56.3	Free	Lightweight and fast
nomic-embed-text	Nomic	768	62.4	Free	Long context

2.3 Image Embeddings

CLIP (Contrastive Language-Image Pre-Training) maps text and images into the same vector space.

from sentence_transformers import SentenceTransformer
from PIL import Image

model = SentenceTransformer("clip-ViT-B-32")

# Image embedding
img = Image.open("cat_photo.jpg")
img_embedding = model.encode(img)

# Text embedding (same vector space)
text_embedding = model.encode("a cute orange cat")

# Cross-modal search is possible!
from numpy import dot
from numpy.linalg import norm

similarity = dot(img_embedding, text_embedding) / (
    norm(img_embedding) * norm(text_embedding)
)
print(f"Similarity: {similarity:.4f}")  # 0.28+ (higher if related)

2.4 Multimodal Embeddings

Modern models unify text, images, and audio into a single vector space.

Text: "sunset over the ocean"  ──┐
                                  ├──→ Same vector space → similarity comparison possible
Image: (sunset photo)           ──┘

3. Distance Metrics (Similarity Measurement)

3.1 Cosine Similarity

Measures the angle between two vectors. Ignores vector magnitude (length) and compares only direction.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example: 3-dimensional vectors
vec_a = np.array([1, 2, 3])
vec_b = np.array([2, 4, 6])  # Same direction, different magnitude
vec_c = np.array([-1, -2, -3])  # Opposite direction

print(cosine_similarity(vec_a, vec_b))  # 1.0 (exact same direction)
print(cosine_similarity(vec_a, vec_c))  # -1.0 (exact opposite)

When to use: Text embeddings (most common), normalized vectors

3.2 Euclidean Distance (L2)

Measures the straight-line distance between two vectors. Smaller values mean more similar.

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

vec_a = np.array([1, 2])
vec_b = np.array([4, 6])

print(euclidean_distance(vec_a, vec_b))  # 5.0

When to use: When vector magnitude is meaningful, clustering

3.3 Dot Product (Inner Product)

The inner product of two vectors. Considers both direction and magnitude.

def dot_product(a, b):
    return np.dot(a, b)

vec_a = np.array([1, 2, 3])
vec_b = np.array([4, 5, 6])

print(dot_product(vec_a, vec_b))  # 32

When to use: Maximum Inner Product Search (MIPS), equivalent to cosine similarity on normalized vectors

3.4 Manhattan Distance (L1)

Sum of absolute differences across each dimension.

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

vec_a = np.array([1, 2, 3])
vec_b = np.array([4, 6, 3])

print(manhattan_distance(vec_a, vec_b))  # 7 (3 + 4 + 0)

3.5 Metric Selection Guide

Text embedding search     → Cosine Similarity (default recommendation)
Image similarity search   → Euclidean Distance
Recommendation (MIPS)     → Dot Product
High-dimensional sparse   → Cosine Similarity
Clustering / Classification → Euclidean Distance

4. Indexing Algorithms

With millions of vectors, comparing the query against every stored vector (brute force) is too slow for interactive search. Indexing algorithms trade a small amount of accuracy for search that is orders of magnitude faster.
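For reference, the brute-force baseline that every index below tries to avoid can be sketched in a few lines. This is a minimal illustration with hypothetical names (`brute_force_search`) and toy 2-dimensional data; its cost grows linearly with the number of stored vectors.

```python
import heapq
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, vectors, k):
    """Exact k-NN: score every stored vector, keep the k closest."""
    scored = ((l2_distance(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
hits = brute_force_search([1.0, 1.0], vectors, k=2)

for distance, idx in hits:
    print(f"index={idx} distance={distance:.4f}")
```

Brute force is still useful as the ground truth when measuring the recall of an approximate index.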

4.1 HNSW (Hierarchical Navigable Small World)

The most widely used ANN (Approximate Nearest Neighbor) algorithm.

How it works:

Layer 3: [A] ────────────────── [B]     (few nodes, long-range links)
           \                   /
Layer 2: [A] ── [C] ── [D] ── [B]      (medium nodes)
           \   / \   / \   / \
Layer 1: [A]-[E]-[C]-[F]-[D]-[G]-[B]   (most nodes, short-range links)
           |   |   |   |   |   |   |
Layer 0: [All vectors exist at base layer]  (all nodes)

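Search starts in the sparse upper layers for coarse positioning, then refines the result in the denser lower layers.
Because the index is a graph held in memory, it uses more memory than cluster-based indexes, but queries are very fast.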
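The layered navigation can be sketched with a toy example. The code below is a simplified illustration, not a full HNSW implementation: the graph is hand-built (real HNSW constructs it incrementally and assigns layers randomly), and `search_layer`, `layer0`, and `layer1` are hypothetical names. It shows the greedy best-first search used within one layer and how the result of the upper layer becomes the entry point for the layer below.

```python
import heapq
import math

def search_layer(query, entry, vectors, neighbors, ef):
    """Greedy best-first search in one layer, keeping the ef closest nodes."""
    visited = {entry}
    d0 = math.dist(query, vectors[entry])
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    best = [(-d0, entry)]        # max-heap via negation: current ef best nodes
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break                # nothing left that can improve the result set
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = math.dist(query, vectors[nb])
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst
    return sorted((-neg_d, n) for neg_d, n in best)

# Six points on a line; layer 0 links every neighbor, layer 1 is a sparse subset
vectors = {i: [float(i), 0.0] for i in range(6)}
layer0 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
layer1 = {0: [3], 3: [0, 5], 5: [3]}

query = [4.2, 0.0]
# Coarse step: one best node in the upper layer, starting from node 0
_, entry = search_layer(query, 0, vectors, layer1, ef=1)[0]
# Fine step: continue from that node in the base layer with a wider beam
result = search_layer(query, entry, vectors, layer0, ef=2)
nearest_ids = [node for _, node in result]
print(entry, nearest_ids)
```

The `ef` argument plays the role of `ef_search`: a wider beam explores more nodes and is less likely to get stuck in a local minimum, at the cost of more distance computations.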

# HNSW configuration in Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, HnswConfigDiff

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,
        distance="Cosine"
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                    # connections per node (higher = more accurate, more memory)
        ef_construct=128,        # search scope during index building
        full_scan_threshold=10000  # size threshold in KB: smaller segments use a full scan instead of HNSW
    )
)

Key Parameters (actual defaults differ by engine, so check your database's documentation):

Parameter	Description	Typical Value	Effect
M	Connections per node	16	Higher increases accuracy and memory
ef_construct	Build search scope	128	Higher increases index quality and build time
ef_search	Query search scope	64	Higher increases accuracy, decreases speed

4.2 IVF (Inverted File Index)

Divides vectors into clusters and searches only the nearest clusters at query time.

Vector space:
┌─────────────────────────────────┐
│  * *   X *                      │
│   *  Cluster1  *  *             │
│  *  *          X                │
│        * *  Cluster2  *         │
│         *     *  * *            │
│              *   *              │
│    * *  *         X             │
│     Cluster3        Cluster4    │
│    *  * *         *  *  *       │
└─────────────────────────────────┘
X = cluster centroid

Search: only search within the nprobe closest clusters to the query
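The effect of nprobe can be sketched in plain Python. This is an illustration under simplifying assumptions: the centroids are fixed by hand (a real IVF index learns them with k-means), and `build_ivf` and `ivf_search` are hypothetical names. The query sits near a cluster boundary, so probing a single cluster misses the true nearest neighbor.

```python
import math

def nearest_centroids(point, centroids, n):
    """Indices of the n centroids closest to point."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))
    return order[:n]

def build_ivf(vectors, centroids):
    """Assign every vector to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        lists[nearest_centroids(vec, centroids, 1)[0]].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe inverted lists closest to the query."""
    candidates = []
    for c in nearest_centroids(query, centroids, nprobe):
        candidates.extend(lists[c])
    candidates.sort(key=lambda idx: math.dist(query, vectors[idx]))
    return candidates[:k]

centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
vectors = [
    [1.0, 1.0], [0.0, 2.0],     # fall into cluster 0
    [9.0, 1.0], [10.0, 2.0],    # cluster 1
    [1.0, 9.0], [2.0, 10.0],    # cluster 2
    [9.0, 9.0],                 # cluster 3
    [5.2, 0.5],                 # cluster 1, but close to the boundary with cluster 0
]
lists = build_ivf(vectors, centroids)

query = [4.8, 0.5]  # nearest centroid is cluster 0, true nearest vector is index 7
narrow = ivf_search(query, vectors, centroids, lists, k=1, nprobe=1)
wide = ivf_search(query, vectors, centroids, lists, k=1, nprobe=2)
print(narrow, wide)
```

Raising nprobe recovers the missed neighbor but scans more vectors, which is the recall versus latency trade-off that IVF exposes.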

4.3 IVF-PQ (IVF + Product Quantization)

PQ compresses vectors to reduce memory usage.

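# Creating an IVF-PQ index with FAISS
import faiss
import numpy as np

dimension = 1536
nlist = 256      # number of clusters (inverted lists)
m_pq = 48        # number of sub-vectors; must divide dimension (1536 / 48 = 32 dims each)
nbits = 8        # bits per sub-vector code (2^8 = 256 centroids per codebook)

# Placeholder data for illustration; FAISS expects float32 arrays of shape (n, dimension)
all_vectors = np.random.rand(100_000, dimension).astype("float32")
training_vectors = all_vectors[:50_000]
query_vector = np.random.rand(1, dimension).astype("float32")

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m_pq, nbits)

# Train clusters + codebooks on training data, then add all vectors
index.train(training_vectors)
index.add(all_vectors)

# Set nprobe at search time
index.nprobe = 16  # number of clusters to probe
distances, indices = index.search(query_vector, 10)  # top 10 neighbors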

Memory comparison:

Original (float32, 1536 dims): 1536 x 4 bytes = 6,144 bytes/vector
PQ compressed (48 sub-vectors): 48 x 1 byte  = 48 bytes/vector
→ 128x compression! 100M vectors: 614GB → 4.8GB
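The arithmetic above can be reproduced with a small helper. This is a back-of-the-envelope sketch with hypothetical function names; it counts only the vector payload and ignores codebooks, stored IDs, and inverted-list overhead, so real index sizes will be somewhat larger.

```python
def raw_bytes_per_vector(dimension, bytes_per_float=4):
    """Uncompressed float32 storage."""
    return dimension * bytes_per_float

def pq_bytes_per_vector(num_subvectors, nbits=8):
    """Each sub-vector is replaced by one codebook index of nbits bits."""
    return num_subvectors * nbits // 8

raw = raw_bytes_per_vector(1536)     # 1536 x 4 bytes
pq = pq_bytes_per_vector(48)         # 48 x 1 byte
ratio = raw / pq

num_vectors = 100_000_000
raw_gb = raw * num_vectors / 1e9
pq_gb = pq * num_vectors / 1e9

print(f"{raw} bytes -> {pq} bytes per vector ({ratio:.0f}x)")
print(f"100M vectors: {raw_gb:.1f} GB -> {pq_gb:.1f} GB")
```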

4.4 ScaNN (Scalable Nearest Neighbors)

Developed by Google. Combines space partitioning, anisotropic vector quantization (scored with asymmetric hashing), and optional exact re-ranking to stay fast even at high recall levels.

4.5 Annoy (Approximate Nearest Neighbors Oh Yeah)

Developed by Spotify. Recursively partitions space using random hyperplanes to build a forest of trees. Once built, the index is read-only and stored as a memory-mapped file, so multiple processes can share it.

4.6 Algorithm Comparison

Algorithm	Search Speed	Memory	Index Build	Accuracy	Suitable Scale
Flat (Brute-force)	Slow	High	None	100%	Under 100K
HNSW	Very Fast	High	Slow	Very High	Millions
IVF-Flat	Fast	Medium	Medium	High	Tens of millions
IVF-PQ	Fast	Low	Medium	Medium	Hundreds of millions
ScaNN	Very Fast	Medium	Medium	High	Hundreds of millions
Annoy	Fast	Low	Fast	Medium	Tens of millions

5. Vector Database Comparison

5.1 Major Solutions Comparison

Feature	Pinecone	Weaviate	Qdrant	Milvus	Chroma	pgvector
Type	Managed SaaS	Open-source/Cloud	Open-source/Cloud	Open-source	Open-source	PostgreSQL extension
Index	Proprietary	HNSW	HNSW	HNSW/IVF/DiskANN	HNSW	HNSW/IVFFlat
Hybrid Search	Sparse-Dense	BM25 + Vector	Sparse + Dense	Supported	Limited	Full-text + Vector
Filtering	Metadata filter	GraphQL filter	Payload filter	Attribute filter	Where clause	SQL WHERE
Multi-tenancy	Namespaces	Tenant isolation	Collection/Payload	Partitions	Collections	Schema/RLS
Max Dimensions	20,000	65,535	65,535	32,768	Unlimited	2,000 indexed (16,000 stored)
Distributed	Automatic	Raft consensus	Raft consensus	Distributed arch	Not supported	Citus extension
SDKs	Python/JS/Go/Java	Python/JS/Go/Java	Python/JS/Rust/Go	Python/JS/Go/Java	Python/JS	SQL
Pricing	Pod/Serverless plans	Open-source free	Open-source free	Open-source free	Open-source free	Free
GPU Support	Internal	No	No	NVIDIA GPU	No	No
Real-time Updates	Yes	Yes	Yes	Yes	Yes	Yes
Backup/Restore	Automatic	Collections API	Snapshot	Supported	Limited	pg_dump
Community	Medium	Active	Active	Very Active	Active	Very Active
Serverless	Yes	Yes	Yes (Cloud)	Zilliz Cloud	No	Supabase/Neon

5.2 Pinecone

Fully managed vector database. Use via API without any infrastructure management.

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_KEY")

# Create index
pc.create_index(
    name="my-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

index = pc.Index("my-index")

# Upsert vectors
index.upsert(
    vectors=[
        ("id1", [0.1, 0.2, ...], {"title": "Document Title", "category": "tech"}),
        ("id2", [0.3, 0.4, ...], {"title": "Another Doc", "category": "science"}),
    ],
    namespace="articles"
)

# Search (filter + vector)
results = index.query(
    vector=[0.15, 0.25, ...],
    top_k=5,
    filter={"category": "tech"},
    namespace="articles",
    include_metadata=True
)

Pros: Fully managed, instant scaling, Serverless option
Cons: Vendor lock-in, higher cost, no self-hosting

5.3 Weaviate

Open-source vector database with GraphQL-based API and built-in vectorization modules.

import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Create collection (built-in vectorization)
collection = client.collections.create(
    name="Article",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ]
)

# Add data (automatic vectorization)
articles = client.collections.get("Article")
articles.data.insert(
    properties={
        "title": "Vector DB Guide",
        "content": "Vector databases are essential for AI...",
        "category": "tech"
    }
)

# Semantic search
response = articles.query.near_text(
    query="artificial intelligence infrastructure",
    limit=5,
    filters=weaviate.classes.query.Filter.by_property("category").equal("tech")
)

Pros: Built-in vectorization, GraphQL API, module ecosystem
Cons: Resource heavy, learning curve

5.4 Qdrant

High-performance vector database written in Rust. Strong in payload filtering and quantization.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue
)

client = QdrantClient("localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(
        size=1536,
        distance=Distance.COSINE
    )
)

# Upsert vectors
client.upsert(
    collection_name="articles",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, ...],
            payload={"title": "Vector DB Guide", "category": "tech", "views": 1500}
        ),
    ]
)

# Filter + vector search
results = client.search(
    collection_name="articles",
    query_vector=[0.15, 0.25, ...],
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="tech")),
        ]
    ),
    limit=5
)

Pros: Rust-based performance, fine-grained filtering, Scalar/Binary quantization
Cons: Relatively smaller ecosystem

5.5 Milvus

Large-scale vector database specialized for distributed architectures. Supports GPU acceleration.

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
]
schema = CollectionSchema(fields, description="Articles")
collection = Collection("articles", schema)

# Create index
index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 16, "efConstruction": 256}
}
collection.create_index("embedding", index_params)

# Search
collection.load()
results = collection.search(
    data=[[0.1, 0.2, ...]],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 128}},
    limit=5,
    output_fields=["title"]
)

Pros: Large-scale distributed, GPU support, diverse indexes
Cons: Operational complexity, resource requirements

5.6 ChromaDB

Lightweight, developer-friendly open-source vector database. Ideal for prototyping.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

collection = client.create_collection(
    name="articles",
    metadata={"hnsw:space": "cosine"}
)

# Add documents (automatic embedding)
collection.add(
    documents=["Vector DBs are core AI infrastructure", "RAG is retrieval augmented generation"],
    metadatas=[{"category": "tech"}, {"category": "ai"}],
    ids=["doc1", "doc2"]
)

# Search
results = collection.query(
    query_texts=["artificial intelligence database"],
    n_results=5,
    where={"category": "tech"}
)

Pros: Ultra-simple API, built-in embeddings, local execution
Cons: Limited production scaling, no distributed support

6. pgvector Deep Dive

6.1 Why pgvector

If you already use PostgreSQL, you can add vector search to your existing infrastructure without a separate vector database. Combine the rich features of SQL with vector search.

6.2 Installation and Setup

-- Requires a PostgreSQL version supported by pgvector (the pgvector README lists 13+)
-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with vector column
CREATE TABLE documents (
    id BIGSERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    category TEXT,
    embedding VECTOR(1536),  -- OpenAI text-embedding-3-small dimensions
    created_at TIMESTAMPTZ DEFAULT NOW()
);

6.3 Data Insertion and Search

-- Insert vector
INSERT INTO documents (title, content, category, embedding)
VALUES (
    'Vector DB Guide',
    'Vector databases are essential for AI...',
    'tech',
    '[0.1, 0.2, 0.3, ...]'::vector  -- 1536-dimensional vector
);

-- Cosine distance search (top 5 most similar)
SELECT id, title, 1 - (embedding <=> '[0.15, 0.25, ...]'::vector) AS similarity
FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> '[0.15, 0.25, ...]'::vector
LIMIT 5;

-- L2 distance search
SELECT id, title, embedding <-> '[0.15, 0.25, ...]'::vector AS distance
FROM documents
ORDER BY embedding <-> '[0.15, 0.25, ...]'::vector
LIMIT 5;

-- Inner product search (Maximum Inner Product)
SELECT id, title, (embedding <#> '[0.15, 0.25, ...]'::vector) * -1 AS inner_product
FROM documents
ORDER BY embedding <#> '[0.15, 0.25, ...]'::vector
LIMIT 5;

6.4 HNSW vs IVFFlat Index

-- HNSW index (recommended)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);

-- Set ef_search at query time
SET hnsw.ef_search = 100;

-- IVFFlat index
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- number of clusters (start with rows / 1000 up to 1M rows, sqrt(rows) beyond that)

-- Set probes at query time
SET ivfflat.probes = 10;

HNSW vs IVFFlat comparison:

Feature	HNSW	IVFFlat
Search Speed	Faster	Fast
Index Build	Slow	Fast
Memory	Higher	Lower
Accuracy (Recall)	Higher	Medium
Real-time Insert	Excellent	May need rebuild
Recommendation	Default choice	When memory constrained
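The IVFFlat settings depend on table size. The helper below is a sketch of the starting points suggested in the pgvector README (lists = rows / 1000 up to 1M rows, sqrt(rows) above that, and probes = sqrt(lists)); the function name `suggest_ivfflat_params` is hypothetical, and the values are starting points to tune against measured recall, not guarantees.

```python
import math

def suggest_ivfflat_params(row_count):
    """Return (lists, probes) starting values for a pgvector IVFFlat index."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = max(1, round(math.sqrt(row_count)))
    probes = max(1, round(math.sqrt(lists)))
    return lists, probes

for rows in (100_000, 1_000_000, 4_000_000):
    lists, probes = suggest_ivfflat_params(rows)
    print(f"rows={rows:>9,}  lists={lists:>5}  probes={probes}")
```

For a 100,000-row table this yields lists = 100 and probes = 10, which matches the values used in the SQL above.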

6.5 Query Optimization

-- 1. Partial index (specific category only)
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WHERE category = 'tech';

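-- 2. Filter + vector search
-- Potentially slow or incomplete: with an approximate index the WHERE filter is
-- applied after the index scan, so selective filters may return fewer than LIMIT rows
SELECT * FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> query_vec  -- query_vec: placeholder for the query vector
LIMIT 10;

-- Faster pattern: partition by the filter column
-- (requires documents to be created with PARTITION BY LIST (category))
CREATE TABLE documents_tech PARTITION OF documents
FOR VALUES IN ('tech');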

-- 3. Confirm index usage with EXPLAIN
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, title
FROM documents
ORDER BY embedding <=> '[0.15, 0.25, ...]'::vector
LIMIT 5;

-- Verify: Index Scan using documents_embedding_idx

6.6 Using pgvector with Python

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
register_vector(conn)

cur = conn.cursor()

# Vector search
query_embedding = [0.1, 0.2, ...]  # 1536 dimensions
cur.execute("""
    SELECT id, title, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    WHERE category = %s
    ORDER BY embedding <=> %s::vector
    LIMIT %s
""", (query_embedding, "tech", query_embedding, 5))

results = cur.fetchall()
for row in results:
    print(f"ID: {row[0]}, Title: {row[1]}, Similarity: {row[2]:.4f}")

7. Hybrid Search

7.1 Why Hybrid Search

Vector search alone is insufficient in some cases.

Query: "PostgreSQL 16 release notes"

Vector only: "MySQL 8.0 new features" may also rank high (similar semantics)
Keyword only: only documents containing exactly "PostgreSQL 16"
Hybrid: semantically similar + contains "PostgreSQL 16" = optimal results

7.2 BM25 + Vector Combination

# Hybrid search in Weaviate
from weaviate.classes.query import HybridFusion

response = articles.query.hybrid(
    query="PostgreSQL vector search performance",
    alpha=0.5,  # 0 = keyword only, 1 = vector only, 0.5 = balanced
    limit=10,
    fusion_type=HybridFusion.RELATIVE_SCORE  # or HybridFusion.RANKED
)
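To build intuition for what alpha does, here is a simplified sketch of score-based fusion: each result list is min-max normalized, then blended with weight alpha for the vector side and 1 - alpha for the keyword side. This illustrates the idea only and is not Weaviate's actual implementation; the function names and the sample scores are made up for the example.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] so keyword and vector scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(keyword_scores, vector_scores, alpha=0.5):
    """alpha = 0 keeps only keyword scores, alpha = 1 only vector scores."""
    kw = min_max_normalize(keyword_scores)
    vec = min_max_normalize(vector_scores)
    fused = {
        doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    # Sort by score descending, then by id so ties are deterministic
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

keyword_scores = {"doc_A": 12.0, "doc_B": 7.0, "doc_C": 2.0}   # e.g. BM25 scores
vector_scores = {"doc_B": 0.91, "doc_C": 0.88, "doc_D": 0.70}  # e.g. cosine similarity

balanced = [doc for doc, _ in weighted_fusion(keyword_scores, vector_scores, alpha=0.5)]
keyword_only = [doc for doc, _ in weighted_fusion(keyword_scores, vector_scores, alpha=0.0)]
vector_only = [doc for doc, _ in weighted_fusion(keyword_scores, vector_scores, alpha=1.0)]
print(balanced, keyword_only, vector_only, sep="\n")
```

With balanced weights, doc_B wins because it scores well in both lists, even though it is not the top keyword hit.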

7.3 Reciprocal Rank Fusion (RRF)

An algorithm that merges ranked lists from multiple searches using only each document's rank, so the raw scores never need to be normalized.

def reciprocal_rank_fusion(keyword_results, vector_results, k=60):
    """
    RRF score = sum(1 / (k + rank_i))
    k = 60 is standard
    """
    scores = {}
    
    for rank, doc_id in enumerate(keyword_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    
    # Sort by score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Example
keyword_hits = ["doc_A", "doc_B", "doc_C", "doc_D"]
vector_hits = ["doc_C", "doc_A", "doc_E", "doc_B"]

fused = reciprocal_rank_fusion(keyword_hits, vector_hits)
# doc_A and doc_C appear top in both → final top results

7.4 pgvector + Full-Text Search

-- Hybrid search (PostgreSQL)
WITH vector_search AS (
    SELECT id, title, content,
           1 - (embedding <=> query_vec) AS vector_score,
           ROW_NUMBER() OVER (ORDER BY embedding <=> query_vec) AS vector_rank
    FROM documents
    ORDER BY embedding <=> query_vec
    LIMIT 20
),
keyword_search AS (
    -- Note: ts_rank is PostgreSQL's built-in text ranking, not true BM25
    SELECT id, title, content,
           ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'vector database')) AS text_score,
           ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', content), plainto_tsquery('english', 'vector database')) DESC) AS keyword_rank
    FROM documents
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'vector database')
    ORDER BY keyword_rank
    LIMIT 20
)
SELECT
    COALESCE(v.id, k.id) AS id,
    COALESCE(v.title, k.title) AS title,
    -- RRF score
    COALESCE(1.0 / (60 + v.vector_rank), 0) +
    COALESCE(1.0 / (60 + k.keyword_rank), 0) AS rrf_score
FROM vector_search v
FULL OUTER JOIN keyword_search k ON v.id = k.id
ORDER BY rrf_score DESC
LIMIT 10;

8. Metadata Filtering

8.1 Filtering Strategies

Three strategies for combining vector search with metadata filters.

Pre-filtering:  Apply filter → search only filtered vectors (accurate but can be slow)
Post-filtering: Vector search → apply filter to results (fast but may return fewer results)
In-filtering:   Apply filter during search simultaneously (optimal but complex to implement)
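The difference between the first two strategies shows up even on a toy dataset. The sketch below uses exact search and hypothetical helper names (`pre_filter_search`, `post_filter_search`) to show why post-filtering can return fewer than k results: the filter runs on a top-k list that was chosen without knowing about the filter.

```python
import math

def top_k(query, items, k):
    return sorted(items, key=lambda it: math.dist(query, it["vector"]))[:k]

def pre_filter_search(query, items, predicate, k):
    """Filter first, then rank only the matching items."""
    return top_k(query, [it for it in items if predicate(it)], k)

def post_filter_search(query, items, predicate, k):
    """Rank everything, then drop the top-k hits that fail the filter."""
    return [it for it in top_k(query, items, k) if predicate(it)]

items = [
    {"id": 1, "category": "tech", "vector": [0.15, 0.0]},
    {"id": 2, "category": "science", "vector": [0.10, 0.0]},
    {"id": 3, "category": "science", "vector": [0.20, 0.0]},
    {"id": 4, "category": "tech", "vector": [0.90, 0.0]},
]

def is_tech(item):
    return item["category"] == "tech"

query = [0.0, 0.0]
pre_ids = [it["id"] for it in pre_filter_search(query, items, is_tech, k=2)]
post_ids = [it["id"] for it in post_filter_search(query, items, is_tech, k=2)]
print(pre_ids, post_ids)
```

Pre-filtering returns both tech items, while post-filtering returns only one because a science item occupied a slot in the unfiltered top 2. Engines that post-filter often compensate by over-fetching (searching for more than k candidates).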

8.2 Advanced Filtering in Qdrant

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Complex filter
results = client.search(
    collection_name="products",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="electronics")),
            FieldCondition(key="price", range=Range(gte=100, lte=500)),
        ],
        must_not=[
            FieldCondition(key="out_of_stock", match=MatchValue(value=True)),
        ],
        should=[
            FieldCondition(key="brand", match=MatchValue(value="Apple")),
            FieldCondition(key="brand", match=MatchValue(value="Samsung")),
        ]
    ),
    limit=10
)

8.3 Efficient Metadata Design

Recommended:
- Create payload indexes on frequently filtered fields
- Works best for low-cardinality fields (category, status)
- Date range filters need indexes
- Prefer flat structures over nested objects

Not recommended:
- Indexing fields with very high cardinality (user IDs, etc.)
- Using only metadata for search without vectors (a regular DB is better)

9. Namespace and Collection Management

9.1 Multi-tenancy Strategies

Strategy 1: Namespace/Partition Separation (Pinecone, Milvus)
┌──────────── Index ────────────┐
│  Namespace: tenant_A  *****   │
│  Namespace: tenant_B  ooooo   │
│  Namespace: tenant_C  ^^^^^   │
└───────────────────────────────┘
- Pros: Simple, data isolation
- Cons: Cross-tenant search difficult

Strategy 2: Metadata Filter (Qdrant, Weaviate)
┌──────── Collection ───────────┐
│  *(A) o(B) ^(C) *(A) o(B)    │
│  ^(C) *(A) *(A) o(B) ^(C)    │
│  Separated by tenant_id filter│
└───────────────────────────────┘
- Pros: Flexible, cross-search possible
- Cons: Performance degrades with many tenants

Strategy 3: Collection Separation (few large tenants)
┌─ Collection: tenant_A ─┐
│  ******************     │
└────────────────────────┘
┌─ Collection: tenant_B ─┐
│  oooooooooooooooooo     │
└────────────────────────┘
- Pros: Perfect isolation, independent scaling
- Cons: Management overhead

9.2 Collection Lifecycle Management

# Qdrant example: collection management
# List collections
collections = client.get_collections()

# Collection info
info = client.get_collection("articles")
print(f"Vector count: {info.points_count}")
print(f"Index status: {info.status}")

# Delete collection
client.delete_collection("old_articles")

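# Aliases (blue-green deployment): point the production alias at the new collection
from qdrant_client import models

client.update_collection_aliases(
    change_aliases_operations=[
        models.CreateAliasOperation(
            create_alias=models.CreateAlias(
                collection_name="articles_v2",
                alias_name="articles_prod"
            )
        )
    ]
)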

10. Production Operations

10.1 Sharding Strategy

Horizontal Sharding: distribute data across multiple nodes
┌─ Node 1 ─┐ ┌─ Node 2 ─┐ ┌─ Node 3 ─┐
│ Shard 1   │ │ Shard 2   │ │ Shard 3   │
│ 33% data  │ │ 33% data  │ │ 33% data  │
└───────────┘ └───────────┘ └───────────┘
        ↕ Query all nodes at search time, then merge

Qdrant sharding config:
- auto sharding: automatically distributes based on shard_number
- custom sharding: tenant-level separation via shard_key
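The "query all nodes, then merge" step is a scatter-gather pattern. The following sketch simulates it in a single process with hypothetical names (`search_shard`, `scatter_gather`); in a real cluster each shard search would be a network call executed in parallel.

```python
import heapq
import math

def search_shard(query, shard, k):
    """Local top-k on one shard: list of (distance, doc_id)."""
    scored = ((math.dist(query, vec), doc_id) for doc_id, vec in shard.items())
    return heapq.nsmallest(k, scored)

def scatter_gather(query, shards, k):
    """Ask every shard for its local top-k, then merge into a global top-k."""
    partial_results = [search_shard(query, shard, k) for shard in shards]
    merged = (hit for hits in partial_results for hit in hits)
    return heapq.nsmallest(k, merged)

shards = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [0.5, 0.0]},
    {"e": [9.0, 9.0], "f": [0.2, 0.0]},
]

top = scatter_gather([0.0, 0.0], shards, k=3)
top_ids = [doc_id for _, doc_id in top]
print(top_ids)
```

Each shard must return its full local top-k, because the global top-k could in principle come entirely from one shard. Query latency is therefore bounded by the slowest shard.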

10.2 Replication

# Qdrant replication setup
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    replication_factor=3,        # replicate to 3 nodes
    write_consistency_factor=2   # write succeeds after 2 nodes confirm
)

10.3 Backup and Restore

# Qdrant snapshot creation
curl -X POST "http://localhost:6333/collections/articles/snapshots"

# Snapshot recovery
curl -X PUT "http://localhost:6333/collections/articles/snapshots/recover" \
  -H "Content-Type: application/json" \
  -d '{"location": "http://backup-server/snapshot.tar"}'

# pgvector: use pg_dump
pg_dump -t documents mydb > documents_backup.sql

10.4 Monitoring

# Key metrics
monitoring_metrics = {
    "Search latency (p50, p95, p99)": "Target: p99 under 100ms",
    "QPS (queries per second)": "Monitor under load",
    "Recall@K": "Accuracy. Target 0.95+",
    "Index size / memory usage": "Prevent OOM",
    "Insert latency": "Critical for real-time updates",
    "Disk usage": "Capacity planning",
    "Replication lag": "Data consistency",
}
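Two of these metrics are easy to compute from logged data. The sketch below shows one way to calculate Recall@K against exact (brute-force) results and nearest-rank latency percentiles; the function names and sample numbers are illustrative only.

```python
import math

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

exact = [7, 3, 9, 1, 4]    # ground truth from a brute-force search
approx = [7, 9, 3, 8, 4]   # result from the ANN index
recall = recall_at_k(approx, exact, k=5)

latencies_ms = [3, 4, 4, 5, 5, 6, 7, 9, 12, 40]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"Recall@5={recall:.2f}  p50={p50}ms  p99={p99}ms")
```

The example also shows why p99 matters: a single slow query dominates the tail even though the median looks healthy.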

11. Performance Benchmarks

11.1 ANN Benchmark Results (1M vectors, 128 dimensions)

DB / Algorithm	QPS (Recall 0.95)	QPS (Recall 0.99)	Index Time	Memory
Qdrant HNSW	8,500	4,200	12 min	2.1GB
Weaviate HNSW	7,800	3,900	14 min	2.3GB
Milvus HNSW	9,200	4,500	11 min	2.0GB
pgvector HNSW	3,400	1,600	25 min	2.5GB
Pinecone (p2)	5,000	2,800	N/A	N/A
FAISS IVF-PQ	12,000	5,500	8 min	0.4GB

11.2 Real-World Benchmark (10M vectors, 1536 dimensions)

Test environment: AWS r6g.2xlarge (8vCPU, 64GB RAM)

Qdrant:
  - Index build: 45 min
  - Memory: 48GB
  - p50 latency: 3ms, p99 latency: 12ms
  - QPS: 2,100 (Recall@10 = 0.96)

pgvector (HNSW):
  - Index build: 2h 30min
  - Memory: 52GB
  - p50 latency: 8ms, p99 latency: 35ms
  - QPS: 800 (Recall@10 = 0.95)

Milvus (DiskANN):
  - Index build: 35 min
  - Memory: 12GB (disk-based)
  - p50 latency: 5ms, p99 latency: 18ms
  - QPS: 1,800 (Recall@10 = 0.95)

11.3 Benchmark Summary

Small scale (under 100K) + existing PostgreSQL → pgvector
Small scale prototyping → ChromaDB
Medium scale (100K-10M) + self-hosted → Qdrant or Weaviate
Large scale (10M+) + self-hosted → Milvus
Want managed service → Pinecone or Zilliz Cloud
Hybrid search priority → Weaviate
Filtering priority → Qdrant

12. Cost Analysis

12.1 Managed Service Cost Comparison

Service	Free Tier	Starting Price	Est. Cost for 1M Vectors
Pinecone Serverless	2GB storage	Pay per read/write	~70 USD/month
Pinecone Pod (s1)	-	70 USD/month	~140 USD/month
Weaviate Cloud	14-day free	25 USD/month	~100 USD/month
Qdrant Cloud	1GB free	25 USD/month	~65 USD/month
Zilliz Cloud	Free tier	Pay-as-you-go	~90 USD/month

12.2 Self-Hosting Costs

1M vectors (1536 dims, HNSW) estimated infrastructure:
  - RAM needed: ~12GB
  - Disk: ~20GB
  - AWS r6g.xlarge (4vCPU, 32GB): ~150 USD/month
  - Operations personnel cost separate

10M vectors:
  - RAM needed: ~100GB
  - AWS r6g.4xlarge (16vCPU, 128GB): ~600 USD/month

Cost reduction tips:
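  - Quantization saves 60-75% of vector memory (int8 scalar quantization stores 1 byte per dimension instead of 4)
  - DiskANN leverages SSD (80% memory savings)
  - Dimension reduction (1536 to 512) cuts vector memory to about one third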
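The raw-vector part of these estimates can be worked out directly. The sketch below is a rough lower bound under stated assumptions: float32 vectors, about 2 x M four-byte links per vector on the HNSW base layer, and no allowance for payloads, upper graph layers, index build buffers, or the operating system. The RAM figures above are larger because they include that kind of headroom. All function names are hypothetical.

```python
def vector_bytes(num_vectors, dimension, bytes_per_component=4):
    """Raw vector storage (float32 by default)."""
    return num_vectors * dimension * bytes_per_component

def hnsw_link_bytes(num_vectors, m=16, bytes_per_link=4):
    """Assumption: roughly 2 * M links per vector on the base layer."""
    return num_vectors * 2 * m * bytes_per_link

GB = 1e9
n, d = 1_000_000, 1536

float32_gb = vector_bytes(n, d) / GB        # original vectors
int8_gb = vector_bytes(n, d, 1) / GB        # scalar quantization to int8
reduced_gb = vector_bytes(n, 512) / GB      # dimension reduction to 512
links_gb = hnsw_link_bytes(n) / GB          # graph overhead estimate

print(f"float32: {float32_gb:.3f} GB, int8: {int8_gb:.3f} GB, "
      f"512 dims: {reduced_gb:.3f} GB, HNSW links: {links_gb:.3f} GB")
print(f"int8 saving: {1 - int8_gb / float32_gb:.0%}")
```

Under these assumptions the vectors dominate memory and the graph links are a small fraction, which is why quantization and dimension reduction are the most effective levers.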

13. Practical Implementation: Vector DB in a RAG Pipeline

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

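openai_client = OpenAI()
qdrant_client = QdrantClient("localhost", port=6333)

# 0. Create the collection once (text-embedding-3-small returns 1536-dimensional vectors)
if not qdrant_client.collection_exists("knowledge_base"):
    qdrant_client.create_collection(
        collection_name="knowledge_base",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
    )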

# 1. Embed and store documents
def embed_and_store(documents):
    for doc in documents:
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=doc["content"]
        )
        embedding = response.data[0].embedding
        
        qdrant_client.upsert(
            collection_name="knowledge_base",
            points=[PointStruct(
                id=doc["id"],
                vector=embedding,
                payload={"title": doc["title"], "content": doc["content"]}
            )]
        )

# 2. Search + LLM generation
def rag_query(question):
    # Embed question
    q_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    )
    q_embedding = q_response.data[0].embedding
    
    # Vector search
    results = qdrant_client.search(
        collection_name="knowledge_base",
        query_vector=q_embedding,
        limit=5
    )
    
    # Build context
    context = "\n\n".join([
        f"[{r.payload['title']}]\n{r.payload['content']}"
        for r in results
    ])
    
    # LLM generation
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on the following context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    
    return response.choices[0].message.content

14. Quiz

Q1. What is the difference between cosine similarity and Euclidean distance?

Cosine similarity compares only the direction (angle) of two vectors. It equals 1 when vectors point in the same direction regardless of magnitude. Euclidean distance measures the straight-line distance between two vectors, where magnitude also affects the result. For normalized vectors, both metrics produce the same ordering. Cosine similarity is the default recommendation for text embeddings.

Q2. What happens when you increase the M parameter in the HNSW algorithm?

M is the maximum number of connections per node. Increasing M makes the graph denser, which improves search accuracy (recall), but increases memory usage and lengthens index build time. The default M=16 works well generally. For higher recall requirements, you can increase to M=32-64. Very high values show diminishing returns.

Q3. What is the role of RRF (Reciprocal Rank Fusion) in hybrid search?

RRF is an algorithm that combines results from keyword search and vector search based on ranking. A document's RRF score is the sum of 1/(k + rank) over every result list it appears in, so documents that rank highly in both searches rise to the top of the final results. k=60 is the standard value, and because only ranks are used, no score normalization is required.

Q4. Should you choose HNSW or IVFFlat index for pgvector?

HNSW is the recommended default choice. HNSW offers higher recall, faster search speed, and real-time insert support. IVFFlat should be considered when memory constraints are severe. IVFFlat has faster index build time and uses less memory, but may require rebuilding (REINDEX) when data changes frequently.

Q5. What is a cost-effective Vector DB choice at 10M vector scale?

If self-hosting is possible: Qdrant or Milvus are cost-effective. Using Milvus with DiskANN index can significantly reduce memory usage. If you want managed: Pinecone Serverless offers pay-per-use cost management. Applying quantization (Scalar/Binary) can reduce memory by 60-75%, greatly cutting infrastructure costs. Dimension reduction (1536 to 512) is also effective.

15. References

Pinecone Documentation - https://docs.pinecone.io/
Weaviate Documentation - https://weaviate.io/developers/weaviate
Qdrant Documentation - https://qdrant.tech/documentation/
Milvus Documentation - https://milvus.io/docs
pgvector GitHub - https://github.com/pgvector/pgvector
ChromaDB Documentation - https://docs.trychroma.com/
ANN Benchmarks - https://ann-benchmarks.com/
FAISS Wiki - https://github.com/facebookresearch/faiss/wiki
OpenAI Embeddings Guide - https://platform.openai.com/docs/guides/embeddings
Cohere Embed Documentation - https://docs.cohere.com/reference/embed
HNSW Original Paper - https://arxiv.org/abs/1603.09320
Product Quantization Paper - https://hal.inria.fr/inria-00514462v2/document
MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
Reciprocal Rank Fusion Paper - https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf