Chaos and Order


Intro — In May 2026, the vector DB has become commodity infra

Back in 2023, vector databases were the new category that answered "where do I put my embeddings if I want to build RAG?" By May 2026 that question is settled. RAG has matured, vector search has shown up as a default option in almost every OLTP/OLAP DB, and hybrid search (BM25 + dense vector + reranker) is the new standard.

This post is not a marketing matrix. It is an honest read on which vector DBs occupy which slot in production today. We compare Pinecone Serverless v3, Weaviate 1.30, Qdrant 1.13, Milvus 2.5 + Zilliz Cloud, Chroma, LanceDB, pgvector 0.8 + pgvectorscale 0.6 + ParadeDB, Vespa, OpenSearch k-NN, Redis Vector, MongoDB Atlas Vector, Couchbase, SingleStore, Turbopuffer, Marqo, Vald, NGT, FAISS, ScaNN, DiskANN, Annoy, DuckDB VSS, sqlite-vec, and jina HnswLib usage with concrete API examples.

The vector DB landscape in 2026 — five tracks

Here is the big picture. The 2026 market splits into five tracks:

1. **Pure-play vector DBs**: Pinecone, Weaviate, Qdrant, Milvus, Chroma, LanceDB, Turbopuffer, Marqo

2. **Relational DB vector extensions**: pgvector + pgvectorscale, ParadeDB, SingleStore, Oracle 23ai, SQL Server 2025

3. **Search engine vector extensions**: Elasticsearch dense_vector, OpenSearch k-NN, Vespa, Solr

4. **General-purpose NoSQL extensions**: MongoDB Atlas Vector, Redis Vector, Couchbase, DynamoDB (GA 2025)

5. **Embedded / library**: FAISS, ScaNN, DiskANN, Annoy, NGT, HnswLib, Vald, DuckDB VSS, sqlite-vec

They use the same ANN algorithm in many cases (HNSW dominates) but diverge widely in operational model, pricing, multi-tenancy, and hybrid search support. We walk through each track below.

ANN algorithm 1 — why HNSW became the de facto standard

The 2026 default ANN algorithm is **HNSW (Hierarchical Navigable Small World)**, from the 2016 paper by Yu. Malkov and D. Yashunin. It is greedy search over a multi-layer proximity graph.

- **Layer 0**: every node, densest neighbor links.

- **Upper layers**: a probabilistic subset of nodes promoted, sparser as you go up.

- **Search**: start at the top layer, descend greedily, maintain an `ef`-size candidate queue per layer.
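To make the "probabilistic subset" concrete, here is a small sketch of how a node's top layer can be drawn. It assumes the level rule from the HNSW paper, `floor(-ln(U) * mL)` with `mL = 1 / ln(M)`; the function name `sample_level` and the sample size are illustrative, not taken from any particular library.

```python
import math
import random
from collections import Counter


def sample_level(m: int, rng: random.Random) -> int:
    """Draw the top layer for a new node: floor(-ln(U) * mL), with mL = 1 / ln(M)."""
    m_l = 1.0 / math.log(m)
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)


rng = random.Random(0)
levels = Counter(sample_level(16, rng) for _ in range(100_000))

# Each layer keeps roughly 1/M of the nodes of the layer below it.
for level in sorted(levels):
    print(level, levels[level])
```

With M=16, about 15 out of 16 nodes live only in layer 0, which is why the upper layers stay cheap to traverse.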

The strength is that **insert, delete, and search are all graph operations**, so it copes with dynamic data. The weakness is **memory footprint**: 100M float32 vectors at 768d with HNSW (M=16) costs roughly 350 GB of RAM.
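A back-of-the-envelope estimate shows where that memory goes. The sketch below assumes 4-byte neighbor ids, up to 2M links per node at layer 0 and M links per node in each upper layer (the convention hnswlib-style implementations use); `hnsw_memory_gb` is a hypothetical helper, and real engines add per-node metadata and allocator overhead on top.

```python
def hnsw_memory_gb(n: int, dim: int, m: int,
                   bytes_per_float: int = 4, bytes_per_link: int = 4) -> float:
    """Rough HNSW footprint: raw vectors plus graph links, in decimal GB."""
    vectors = n * dim * bytes_per_float
    layer0_links = n * 2 * m * bytes_per_link
    # Expected number of upper layers a node appears in: sum over l >= 1 of (1/m)^l = 1/(m-1).
    upper_links = n * m * bytes_per_link / (m - 1)
    return (vectors + layer0_links + upper_links) / 1e9


estimate = hnsw_memory_gb(100_000_000, 768, 16)
print(f"{estimate:.0f} GB")
```

Under these assumptions the raw float32 vectors account for about 307 GB and the graph for about 13 GB, so the vectors, not the links, dominate. That is why quantization is the usual lever for cutting HNSW memory.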

Pinecone, Weaviate, Qdrant, Milvus, pgvector 0.8, Elasticsearch, OpenSearch, Redis, Chroma, LanceDB, and Vespa all expose HNSW as default or option.

ANN algorithm 2 — the place for IVF, IVF-PQ, DiskANN, and ScaNN

HNSW may be the default, but the alternatives are still alive.

- **IVF (Inverted File Index)**: partition space into `nlist` cells via k-means, probe only `nprobe` cells at query time. Classic Milvus and FAISS option.

- **IVF-PQ**: IVF + Product Quantization. Compress vectors to PQ codes and cut memory by 8x to 32x. Essential beyond ~1B vectors.

- **DiskANN** (Microsoft Research): NVMe-SSD-friendly graph index, billion-scale on a single node. pgvectorscale's StreamingDiskANN is built on this.

- **ScaNN** (Google): anisotropic vector quantization plus a partitioning tree. TensorFlow-friendly. Excellent recall-vs-latency curve.

- **Annoy** (Spotify): random projection forest, mmap-friendly, well-suited to static indexes.

- **NGT** (Yahoo Japan): ONNG graph, the core of the Vald cluster.
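The IVF idea from the list above fits in a few lines of plain Python. This is a toy sketch, not FAISS or Milvus code: the helper names (`kmeans`, `build_ivf`, `ivf_search`), the data sizes, and the five k-means iterations are all assumptions chosen to keep it short.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, nlist, iters, rng):
    """A few Lloyd iterations; returns nlist centroids."""
    centroids = rng.sample(points, nlist)
    for _ in range(iters):
        cells = [[] for _ in range(nlist)]
        for p in points:
            cells[min(range(nlist), key=lambda c: sq_dist(p, centroids[c]))].append(p)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivf(points, centroids):
    """Inverted lists: for each cell, the ids of the points assigned to it."""
    lists = [[] for _ in centroids]
    for idx, p in enumerate(points):
        lists[min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))].append(idx)
    return lists


def ivf_search(query, points, centroids, lists, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, points[i]))[:k]


rng = random.Random(0)
points = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids = kmeans(points, nlist=16, iters=5, rng=rng)
lists = build_ivf(points, centroids)
query = [rng.random() for _ in range(8)]

exact = sorted(range(len(points)), key=lambda i: sq_dist(query, points[i]))[:5]
approx = ivf_search(query, points, centroids, lists, nprobe=4, k=5)
full = ivf_search(query, points, centroids, lists, nprobe=16, k=5)
print(len(set(exact) & set(approx)), "of 5 true neighbors found with nprobe=4")
```

Raising `nprobe` trades latency for recall; at `nprobe == nlist` the search degenerates into an exact scan.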

Simplified picker:

- **Under 10M vectors, dynamic writes**: HNSW

- **100M+ with memory pressure**: IVF-PQ or DiskANN

- **Read-only at scale**: ScaNN, Annoy

Recall vs latency vs cost — the ann-benchmarks reality

Compressed numbers from ann-benchmarks.com as of May 2026 on GIST-960-1M:

- HNSW (M=16, ef=200): recall@10 0.99, p99 1.2 ms, RAM 4 GB

- IVF-PQ (nlist=4096, nprobe=64, m=32): recall@10 0.93, p99 0.9 ms, RAM 0.5 GB

- DiskANN (R=64, L=100): recall@10 0.98, p99 4 ms, RAM 1 GB + SSD

- ScaNN (reorder=200): recall@10 0.99, p99 0.6 ms, RAM 3 GB

The point is "there is no free lunch." Lifting recall from 0.95 to 0.99 usually costs 2x to 5x latency and 1.5x to 3x memory. In production `recall@10 = 0.95` is often plenty.
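For reference, recall@k is just the overlap between the approximate result and the exact top-k. A minimal sketch, with made-up ids for illustration:

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors that the ANN index returned in its top-k."""
    truth = set(exact_ids[:k])
    return len(truth & set(approx_ids[:k])) / len(truth)


exact = list(range(10))                      # ground truth from a brute-force scan
approx = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]     # ANN result that missed one neighbor
print(recall_at_k(approx, exact))            # 0.9
```

Measuring this on a sample of your own queries, rather than relying on public benchmark numbers, is the only way to know whether 0.95 is enough for your workload.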

Pinecone Serverless v3 — the managed default

Pinecone launched Serverless in public preview in January 2024, reached general availability later that year, and finished v3 in Q4 2025, fully splitting **storage and compute**. As of May 2026, it is the "just use it" managed option.

- Index creation only needs dimension, metric (cosine / dotproduct / euclidean), cloud (aws / gcp / azure), and region.

- The backend is proprietary ANN (HNSW variant + partitioning). Users do not see it.

- Pricing is **storage + read units (RU) + write units (WU)**. Cold indexes are nearly free.

- Multi-tenancy is per **namespace**. Each namespace is an isolated index for RU accounting.

- Since 2025, sparse (BM25) + dense hybrid is a first-class concept.

- Q1 2026 added ColBERT-style late interaction in beta.

Typical Python usage:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="pc-...")

# Create a serverless index: only dimension, metric, cloud, and region are required.
pc.create_index(
    name="rag-prod-2026",
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("rag-prod-2026")

# Upsert into a per-tenant namespace.
index.upsert(
    vectors=[
        {"id": "doc-1", "values": [0.01] * 1024, "metadata": {"tenant": "acme", "lang": "en"}},
    ],
    namespace="tenant-acme",
)

# Query the same namespace with a metadata filter.
result = index.query(
    vector=[0.01] * 1024,
    top_k=10,
    namespace="tenant-acme",
    filter={"lang": {"$eq": "en"}},
    include_metadata=True,
)
```

Pinecone's biggest strength is "zero ops." The weakness is unpredictable pricing — RU/WU billing means a traffic spike turns into a bill spike.

Qdrant — the Rust-based self-hosting champion

Qdrant is a Rust-written OSS vector DB and is the most popular pick on the self-hosted track. Highlights of the 1.13 line as of May 2026:

- **HNSW + scalar / product quantization** (int8, binary, PQ). Scalar int8 cuts vector memory about 4x, and binary quantization with rescoring cuts it up to 32x.

- **Payload indexes**: keyword, integer, float, geo, full-text, and datetime indexes are built in. Pre-filters behave like an RDB.

- **Sparse + dense vector hybrid search** with RRF or DBSF fusion.

- **Multi-vector** (ColBERT-style late interaction) is first class.

- **Distributed cluster** (sharding + Raft) and **Qdrant Cloud** managed.
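The binary quantization bullet is easier to trust once you see the mechanics. The sketch below is a plain-Python illustration of the general idea (sign bits, Hamming shortlist, full-precision rescoring), not Qdrant's implementation; the helper names and the `oversample` factor are assumptions.

```python
import math
import random


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def binarize(vec):
    """Keep only the sign of each dimension: 1 bit instead of a 32-bit float."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, codes, k, oversample=4):
    """Shortlist by Hamming distance on the codes, then rescore with the original vectors."""
    q_code = binarize(query)
    shortlist = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))[: k * oversample]
    return sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]


rng = random.Random(0)
vectors = [normalize([rng.gauss(0, 1) for _ in range(64)]) for _ in range(1000)]
codes = [binarize(v) for v in vectors]

top = search(vectors[123], vectors, codes, k=3)
print(top)  # vector 123 should rank first for its own query
```

Going from 32 bits to 1 bit per dimension is where the 32x ceiling comes from. The rescoring step needs the original vectors, which can stay on disk while only the codes sit in RAM.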

Collection setup and query:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://qdrant:6333", api_key="q-...")

# Create a collection of 1024-dimensional vectors compared by cosine similarity.
client.create_collection(
    collection_name="rag_prod",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

client.upsert(
    collection_name="rag_prod",
    points=[PointStruct(id=1, vector=[0.01] * 1024, payload={"lang": "en", "tenant": "acme"})],
)

# query_points supersedes the deprecated search() in recent qdrant-client releases.
hits = client.query_points(
    collection_name="rag_prod",
    query=[0.01] * 1024,
    query_filter=Filter(must=[FieldCondition(key="lang", match=MatchValue(value="en"))]),
    limit=10,
).points
```

Qdrant Cloud is managed on GCP / AWS / Azure with memory/disk-based pricing that is more predictable than Pinecone. Self-host with the Helm chart works well in K8s.

Weaviate — the modules + hybrid leader

Weaviate is a Go-written OSS vector DB. Its strengths are the **module system** and **first-class hybrid search**. As of May 2026, 1.30:

- **Modules**: text2vec-openai, text2vec-cohere, text2vec-jina, multi2vec-clip, generative-openai, reranker-cohere, and many more. Embeddings happen inside the DB.

- **Hybrid search**: BM25 + dense + Reciprocal Rank Fusion in a single call.

- **Multi-tenancy**: each tenant has its own index directory, friendly up to tens of thousands of tenants.

- **Weaviate Cloud Services (WCS)** is the managed flavor. Hybrid Cloud (BYOC) went GA in Q4 2025.

- **Generative search** ships query results directly into an LLM for one-line RAG.

Schema and hybrid query:

```python
import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# The vectorizer and generative modules are configured per collection.
client.collections.create(
    name="Doc",
    properties=[
        Property(name="content", data_type=DataType.TEXT),
        Property(name="lang", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(model="text-embedding-3-large"),
    generative_config=Configure.Generative.openai(model="gpt-4.1-mini"),
)

docs = client.collections.get("Doc")

# alpha=0.5 weights keyword and vector scores equally (0 = keyword only, 1 = vector only).
result = docs.query.hybrid(
    query="vector database comparison",
    alpha=0.5,
    limit=10,
)

client.close()  # the v4 client keeps connections open until closed
```
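The `alpha` value in the hybrid query above controls how keyword and vector scores are mixed. As a rough mental model, here is a sketch of a normalized weighted blend. It is an approximation of the idea behind score-based fusion, not Weaviate's actual code, and the scores and document ids are invented.

```python
def min_max(scores):
    """Rescale raw scores to [0, 1] so BM25 and vector scores become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def blend(bm25_scores, vector_scores, alpha=0.5):
    """alpha=1 ranks by vector score only, alpha=0 by keyword score only."""
    bm25_n, vec_n = min_max(bm25_scores), min_max(vector_scores)
    docs = set(bm25_n) | set(vec_n)
    fused = {d: alpha * vec_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)


bm25 = {"a": 12.0, "b": 7.0, "c": 1.0}
vector = {"b": 0.91, "c": 0.88, "d": 0.40}
print(blend(bm25, vector, alpha=0.5))  # ['b', 'a', 'c', 'd']
```

Document `b` wins at `alpha=0.5` because it scores well in both lists, which is the behavior hybrid search is meant to reward.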

Weaviate's strength is doing "embedding + search + generation" in one system. The weakness is module sprawl, which raises ops cost for self-hosting.

Milvus 2.5 + Zilliz Cloud — the large-scale and GPU champion

Milvus is the LF AI & Data Foundation's large-scale vector DB and is the most battle-tested option at billion-to-tens-of-billions scale. May 2026, 2.5 line:

- **Distributed architecture**: query node, data node, index node, and coord are separate. K8s-native.

- **Index options**: HNSW, IVF, IVF-PQ, DiskANN, GPU-CAGRA, GPU-IVF-PQ.

- **GPU indexing**: NVIDIA cuVS / RAFT-based GPU-CAGRA, 10x to 100x throughput vs CPU.

- **Hybrid search**: BM25 + dense + RRF, multiple vector fields, partition_key-based multi-tenancy.

- **Zilliz Cloud**: managed by the Milvus authors, all regions on AWS / GCP / Azure.

Collection creation with dense and sparse indexes:

```python
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://milvus:19530", token="root:Milvus")

# One dense and one sparse vector field, plus a tenant column for filtering.
schema = client.create_schema()
schema.add_field("id", DataType.INT64, is_primary=True, auto_id=True)
schema.add_field("dense", DataType.FLOAT_VECTOR, dim=1024)
schema.add_field("sparse", DataType.SPARSE_FLOAT_VECTOR)
schema.add_field("tenant", DataType.VARCHAR, max_length=64)

# Each vector field gets its own index.
idx = client.prepare_index_params()
idx.add_index("dense", index_type="HNSW", metric_type="COSINE", params={"M": 16, "efConstruction": 200})
idx.add_index("sparse", index_type="SPARSE_INVERTED_INDEX", metric_type="IP")

client.create_collection("rag", schema=schema, index_params=idx)
```

Milvus's identity is "enterprise scale." For workloads under ~100M vectors the operational overhead may not pay off — Pinecone / Qdrant / Weaviate fit better there.

pgvector + pgvectorscale + ParadeDB — Postgres is "good enough"

The interesting trend of 2026 is that **Postgres + pgvector** has settled in as the "good-enough default" when you do not have special requirements.

- **pgvector 0.8**: HNSW + IVFFlat. halfvec (16-bit) and binary quantization arrived in 0.7 (2024); 0.8 added iterative index scans, which help filtered queries return enough rows.

- **pgvectorscale 0.6** (Timescale): StreamingDiskANN index + statistical binary quantization, 10x to 100x faster filtered search than vanilla pgvector.

- **ParadeDB**: BM25 (`pg_search`) plus vector inside Postgres. Hybrid search never has to leave Postgres.

- **Supabase, Neon, Aurora, Cloud SQL**: all ship pgvector by default.

Typical usage:

```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vectorscale;

CREATE TABLE doc (
    id bigserial PRIMARY KEY,
    tenant_id uuid NOT NULL,
    content text NOT NULL,
    embedding vector(1024) NOT NULL
);

CREATE INDEX ON doc USING diskann (embedding vector_cosine_ops);
CREATE INDEX ON doc (tenant_id);

SELECT id, content, 1 - (embedding <=> $1) AS score
FROM doc
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 10;
```
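In the query above, `<=>` is pgvector's cosine distance operator, so `1 - (embedding <=> $1)` turns it back into cosine similarity. A plain-Python sketch of what that operator computes (the function name is illustrative):

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


query = [1.0, 0.0]
for name, emb in {"same": [2.0, 0.0], "orthogonal": [0.0, 3.0], "opposite": [-1.0, 0.0]}.items():
    distance = cosine_distance(query, emb)
    print(name, "distance:", distance, "score:", 1 - distance)
```

Ordering by the distance ascending and reporting `1 - distance` as the score gives the same ranking, with higher scores meaning closer matches.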

The upside is that **transactions, joins, permissions, and backup** are simply Postgres. The downside is that HNSW memory pressure hurts above ~100M vectors, and RLS-based multi-tenancy can degrade index efficiency.

Chroma and LanceDB — the embedded / local track

The two most common picks for RAG prototypes on a laptop or single production node.

- **Chroma**: Python / JS first. Embedded in-process or HTTP server. SQLite + HNSWLib under the hood (early versions also used DuckDB). Distributed mode went into beta in Q1 2026.

- **LanceDB**: Rust-written. Apache Arrow + Lance file format. Runs straight off S3/GCS. Embedded or server. Strong multimodal story.

LanceDB's pitch is "one `.lance` file on S3, done." The managed product, LanceDB Cloud, is still in beta, and SageMaker integration is expanding.

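```python
import lancedb

# Connect straight to object storage; no server process is required.
db = lancedb.connect("s3://my-bucket/lance-db")

tbl = db.create_table(
    "docs",
    data=[{"id": 1, "vector": [0.01] * 1024, "lang": "en"}],
    mode="overwrite",
)

# Index training needs a realistically sized table; the single row here is only illustrative.
tbl.create_index(metric="cosine", index_type="IVF_HNSW_SQ", num_partitions=256)

hits = tbl.search([0.01] * 1024).where("lang = 'en'").limit(10).to_list()
```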

Elasticsearch, OpenSearch, Vespa — the search engine track

Existing BM25 engines now ship serious dense vector support, so many teams keep one cluster for both keyword and hybrid search instead of adding a dedicated vector DB.

- **Elasticsearch 8.x to 9.x**: `dense_vector` field with HNSW, `knn` + `query` combined in `_search` for hybrid. ELSER (Elastic Learned Sparse Encoder) provides an in-house sparse vector model.

- **OpenSearch 2.x to 3.x**: `knn` plugin, FAISS / Lucene / NMSLib backend choices. Managed via AWS OpenSearch Service.

- **Vespa**: built by Yahoo and open-sourced. First-class tensor types, multi-vector, and multi-stage ranking. Used at Spotify and in Yahoo products such as Yahoo Mail.

This is the most natural path when an organization already runs a search cluster. The trade-off is that they do not match dedicated vector DBs on indexing throughput and RAM efficiency.

NoSQL vector extensions — Mongo, Redis, Couchbase, SingleStore, DuckDB, SQLite

When operational data already lives in a DB, having that DB do vector search is the simplest setup.

- **MongoDB Atlas Vector Search**: dedicated Lucene-based search index in Atlas, `$vectorSearch` aggregation stage.

- **Redis 8 Vector Search**: HNSW + flat. Redis Stack / Redis Enterprise. Sub-millisecond responses since it is in-memory.

- **Couchbase Vector Search**: GA in 2024. Vector queries inside SQL++.

- **SingleStore**: distributed SQL DB plus vectors. Analytics, search, and transactions in one cluster.

- **DuckDB VSS extension**: in-process analytics with HNSW. Index over DataFrames and Parquet.

- **sqlite-vec**: by Alex Garcia. A SQLite extension. Made for mobile and edge RAG.

The common thread is "no separate cluster for vectors." When you are under 100M vectors and your business is OLTP, this is often the most rational choice.

Emerging track — Turbopuffer, Marqo, Vald, NGT, Tair Vector

Newer entrants are also growing in managed and specialized markets.

- **Turbopuffer**: serverless vector DB on S3. Very low price (storage plus per-query) and strong for cold data. Notion adopted it in 2025 and drew major attention.

- **Marqo**: bundles embeddings and search. Multimodal-friendly. Cloud and self-host.

- **Vald** (Yahoo Japan / LY): K8s-native distributed vector DB built on NGT. Used inside LINE search.

- **NGT** (Yahoo Japan): graph-based ANN library with ONNG, PANNG, and QG variants.

- **Tair Vector** (Alibaba Cloud): Redis-compatible plus vector. The default in the Chinese market.

FAISS, ScaNN, DiskANN, Annoy, HnswLib — the library track

Not databases but libraries you wire in yourself. Still core for research and embedded use.

- **FAISS** (Meta): C++/Python. IVF, HNSW, IVF-PQ, GPU FAISS. Default inside Meta and in academia.

- **ScaNN** (Google): TensorFlow-friendly, anisotropic vector quantization. Excellent accuracy / latency ratio.

- **DiskANN** (Microsoft): SSD-friendly graph. Used by pgvectorscale and Milvus.

- **Annoy** (Spotify): random projection, mmap-friendly. Static index use case.

- **HnswLib**: the simple HNSW library by the algorithm's authors. Early backend for Chroma and Jina; Weaviate ships its own Go HNSW implementation.

- **NGT** (Yahoo Japan): graph-based, common adoption in Japan and Korea.

Library example with FAISS:

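```python
import faiss
import numpy as np

d = 1024       # vector dimension
nb = 100000    # number of database vectors

xb = np.random.random((nb, d)).astype("float32")  # database
xq = np.random.random((1, d)).astype("float32")   # single query

index = faiss.IndexHNSWFlat(d, 32)  # 32 = M, neighbors per node
index.hnsw.efConstruction = 200     # build-time candidate queue size
index.add(xb)

index.hnsw.efSearch = 64            # query-time candidate queue size
D, I = index.search(xq, 10)         # distances and ids of the 10 nearest neighbors
print(I, D)
```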

Hybrid search — why BM25 + dense + RRF + reranker became standard

A clear lesson since 2024: **dense vectors alone are not enough**. They struggle with keyword (names, IDs, code) matches and break down on out-of-domain queries. The 2026 standard is this four-step pipeline:

1. **BM25 search** (sparse) for top-50 to 100.

2. **Dense vector search** for top-50 to 100.

3. **RRF (Reciprocal Rank Fusion)** or weighted blend to merge 100 to 200 candidates.

4. **Reranker** (Cohere Rerank 3.5, Voyage Reranker 2, BGE Reranker v2-m3, Jina Reranker v2) to top-10.

The RRF formula is simple:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores
```
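A quick usage sketch with two hypothetical rankings. The function is repeated so the snippet runs on its own, and the document ids are made up.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


bm25_ranking = ["d1", "d2", "d3"]    # keyword search order
dense_ranking = ["d3", "d1", "d4"]   # vector search order

scores = rrf([bm25_ranking, dense_ranking])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

RRF only looks at ranks, never at raw scores, so it needs no normalization between BM25 and cosine similarity. Documents that appear in both lists (`d1`, `d3`) rise above those found by only one retriever.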

The reranker is a cross-encoder that sees query and document together. Typical NDCG@10 lift is 5 to 15 percent. Cohere Rerank 3.5 reranks 100 docs in roughly 80 ms at about $2 per 1,000 queries.

Multi-vector, ColBERT, and ColPali — the new accuracy bar

Compressing a document into a single vector throws away detail, so **multi-vector** retrieval emerged.

- **ColBERT v2** (Stanford, 2022, with the PLAID engine optimization the same year): store a vector per document token, late-interaction (MaxSim) against query tokens. Large NDCG gains over single-vector dense.

- **ColPali** (2024): a vision-language model (PaliGemma-based) embeds each page image as a bundle of patch vectors. Game-changer for PDF and slide RAG.

- **ColQwen2**: a Qwen2-VL backbone variant of ColPali.

The cost is storage: per-token vectors balloon the dataset 50x to 200x. Qdrant, Vespa, Weaviate, and ParadeDB ship multi-vector natively. Pinecone added a beta in Q1 2026.
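Late interaction itself is a small formula: for each query token vector, take its best match among the document's token vectors, then sum those maxima. A minimal sketch with toy 2-d vectors (real models emit 128-d or larger token embeddings):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """ColBERT-style score: sum over query tokens of the max similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]               # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # covers both query tokens
doc_b = [[1.0, 0.0], [0.7, 0.1]]               # covers only the first

print(maxsim(query, doc_a), maxsim(query, doc_b))
```

Because every document token keeps its own vector, a document can match different query tokens in different places. A single pooled vector loses exactly that detail, and keeping it is also why storage grows with token count.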

Filtered search — pre-filter vs post-filter, and metadata indexes

Vector search is almost always paired with metadata filters ("tenant_id = X AND lang = en").

- **Post-filter**: ANN search first, filter after. Risks not enough candidates.

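- **Pre-filter**: apply the filter first, then ANN. Qdrant, Weaviate, and Milvus do this well. pgvector applies the `WHERE` clause after the index scan, so it relies on iterative index scans (0.8) or a partial index to return enough rows.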

- **Indexed filter**: build a real index on metadata to support combined search.

Pinecone, Weaviate, Qdrant, Milvus, and Elasticsearch all build native metadata indexes. pgvector leans on Postgres B-tree / GIN indexes. If multi-tenancy dominates the workload, always benchmark pre-filter performance.
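The "not enough candidates" risk is easy to reproduce. The sketch below uses exact search as a stand-in for the ANN step, and the data, tenant split, and helper names are all invented for illustration.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, items, k):
    """Exact nearest neighbors; stands in for an ANN index in this sketch."""
    return sorted(items, key=lambda it: sq_dist(query, it[1]))[:k]


def post_filter(query, items, k, pred):
    """Search first, then drop results that fail the metadata predicate."""
    return [it for it in top_k(query, items, k) if pred(it[2])]


def pre_filter(query, items, k, pred):
    """Restrict to matching rows first, then search only those."""
    return top_k(query, [it for it in items if pred(it[2])], k)


rng = random.Random(1)
items = [
    (i, [rng.random() for _ in range(4)], {"tenant": "acme" if i % 20 == 0 else "other"})
    for i in range(1000)
]
query = [0.5, 0.5, 0.5, 0.5]


def is_acme(meta):
    return meta["tenant"] == "acme"


post = post_filter(query, items, 10, is_acme)
pre = pre_filter(query, items, 10, is_acme)
print(len(post), "results with post-filter,", len(pre), "with pre-filter")
```

With a tenant that owns 5% of the rows, a post-filtered top-10 usually keeps at most a result or two, while the pre-filtered search always fills the page. The more selective the filter, the bigger the gap.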

Embedding model choice — OpenAI vs Cohere vs Voyage vs BGE vs E5 vs Jina

A great vector DB cannot save bad embeddings. Top of the MTEB leaderboard as of May 2026:

- **OpenAI text-embedding-3-large** (3072d, matryoshka 256 / 1024 / 3072). Well balanced.

- **Cohere Embed v4** (1536d by default, matryoshka down to 256d, multimodal, multilingual).

- **Voyage 3 / Voyage 3-large** (1024d, strong on both domain-specific and general).

- **BGE-M3** (1024d, emits dense + sparse + ColBERT simultaneously, OSS).

- **E5-Mistral-7B-instruct** (4096d, OSS).

- **gte-Qwen2-7B-instruct** (3584d, OSS).

- **Jina Embeddings v3** (1024d, multilingual, late chunking support).

The picker is (1) do you need multilingual, (2) do you need matryoshka dimensions, (3) do you need to self-host, (4) what is your cost ceiling. For Korean and Japanese, BGE-M3, Cohere v4, Jina v3, and multilingual-e5-large are the safe picks.
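Matryoshka dimensions matter because they let you shrink vectors after the fact. A sketch of the usual recipe, assuming the model was trained with Matryoshka representation learning (truncating an ordinary embedding this way degrades quality badly); the random vector stands in for a real 1024-d embedding.

```python
import math
import random


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize to unit length for cosine search."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


rng = random.Random(0)
full = [rng.gauss(0, 1) for _ in range(1024)]  # placeholder for a model output
short = truncate_embedding(full, 256)

print(len(short), round(math.sqrt(sum(x * x for x in short)), 6))
```

Dropping from 1024 to 256 dimensions cuts vector storage and distance computation by 4x, so it is worth measuring recall at the smaller size before paying for the larger one.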

RAG-specific patterns — chunking, contextual retrieval, parent-child

Standard patterns layered on top of a vector DB for RAG in 2026:

- **Chunking**: 300 to 800 tokens with about 50 tokens of overlap is most common. Semantic chunking is popular.

- **Contextual retrieval** (Anthropic, 2024): have an LLM prepend a short "this chunk is part of which document and where" before embedding. Anthropic reported about 35% fewer failed retrievals with contextual embeddings alone and about 49% fewer when combined with contextual BM25.

- **Parent-child**: search on small chunks, return the parent chunk or document to the LLM.

- **Query expansion**: HyDE (generate hypothetical answers and embed those), multi-query, query rewriting.

- **Reranking**: top-100 -> top-10 via a cross-encoder reranker is recommended for almost every RAG.

These patterns are DB-agnostic, but multi-vector and sparse support decide which patterns are easy to wire up.
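The fixed-size chunking pattern from the list above can be sketched as a sliding window. The function name and defaults are illustrative, and integers stand in for tokens; a real pipeline would use the embedding model's tokenizer.

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = list(range(1200))  # placeholder for a tokenized document
chunks = chunk_tokens(tokens, size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

For the parent-child pattern, store each chunk's `start` offset or parent document id as metadata so the retriever can hand the larger span back to the LLM.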

GPU vector indexing — RAFT, cuVS, and NVIDIA acceleration

NVIDIA RAPIDS' **cuVS** (RAFT's successor) went GA in 2025 and brought GPU vector indexing into mainstream practice.

- **CAGRA**: GPU-friendly graph ANN. Builds 10x+ faster than HNSW and queries 5x to 50x faster on GPU.

- **GPU IVF-PQ**: turns billion-vector index builds from hours into minutes.

- **Milvus** GPU CAGRA, **FAISS** GPU module, and **RAPIDS cuVS** all adopted it. Pinecone, Weaviate, and Qdrant do not use GPU as their core engine — CPU still dominates query-time.

The 2025 to 2026 window is the inflection where GPUs matter beyond training. Cost-wise, under 100M vectors on a single node still favors CPU + HNSW.

Cost economics — Pinecone vs Weaviate Cloud vs self-hosted Qdrant

Rough monthly price comparison (May 2026, 100M vectors at 1024d, p50 target 50 ms):

- **Pinecone Serverless**: about $2,500 to $4,000 (storage + average traffic).

- **Weaviate Cloud Serverless**: about $2,000 to $3,500.

- **Zilliz Cloud (Milvus)**: about $1,500 to $3,000.

- **Qdrant Cloud**: about $1,200 to $2,500.

- **Turbopuffer**: about $300 to $800 (cold workload).

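- **Self-hosted Qdrant on K8s** (EKS, 4x r6i.2xlarge): about $1,200 infra + ops headcount. Raw float32 vectors alone are about 410 GB against 256 GB of RAM in this cluster, so the figure assumes quantization or on-disk vectors.

- **pgvector on Aurora Serverless v2**: about $700 to $1,500.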

Numbers fan out 4x to 10x depending on traffic shape. For cold RAG, Turbopuffer or pgvector wins. For always-hot traffic, self-hosted Qdrant is usually cheapest.

Adoption in Korea — Naver, Kakao, Toss, Karrot

Korean tech market snapshot as of May 2026:

- **Naver Clova**: HyperCLOVA X RAG infra. Likely a Milvus variant plus in-house embeddings.

- **Kakao Brain**: pgvector with in-house embeddings for KoGPT RAG.

- **Toss**: the search team uses Vespa-based multi-stage ranking. Vectors live in internal indexes.

- **Karrot (당근)**: marketplace recommendations on FAISS plus in-house embeddings. Talks about Qdrant adoption have surfaced.

- **Samsung SDS, LG CNS**: managed Pinecone or Weaviate for internal RAG in published cases.

- **Korean startups (chatbots, search)**: Qdrant, Weaviate, and pgvector are the most common.

For Korean embeddings, BGE-M3, Cohere Embed v4, and multilingual-e5-large are de facto standards. Korean-specialized models (KoSimCSE, etc.) still shine on narrow domains.

Adoption in Japan — Vald (LY), NGT (Yahoo), Mercari, LINE Search

- **Yahoo Japan / LY Corporation**: builds NGT and OSS Vald. Used in LINE search and recommendations.

- **Mercari**: semantic search on Elasticsearch + in-house embeddings, with some workloads on Vespa.

- **CyberAgent**: ad recommendations on a FAISS-based internal system.

- **Rakuten**: product search blends Solr / Elastic with in-house embeddings.

- **Preferred Networks**: published a case study using Qdrant for internal RAG.

NGT is known in Korea too, but running Vald clusters directly is mostly limited to LY / Naver-level operators inside Japan. Mercari's Vespa adoption is frequently cited in the search community.

Decision guide — recommendations by workload

A simplified picker by workload:

- **Laptop or single-node RAG prototype**: Chroma, LanceDB, DuckDB VSS, sqlite-vec

- **Monolithic SaaS, under 100M vectors, Postgres already there**: pgvector + pgvectorscale (+ ParadeDB)

- **K8s self-host, 100M to 1B vectors**: Qdrant, Weaviate, Milvus

- **Minimal ops managed**: Pinecone Serverless, Qdrant Cloud, Zilliz Cloud, Weaviate Cloud

- **Above 1B vectors, GPU indexing needed**: Milvus GPU (CAGRA), Vespa, self-built FAISS

- **Cold workload, price-optimized**: Turbopuffer, pgvector, LanceDB on S3

- **Already running a search cluster**: Elasticsearch dense_vector, OpenSearch k-NN, Vespa

- **Finish it inside the OLTP DB**: MongoDB Atlas Vector, Redis Vector, Couchbase, SingleStore

- **Embedded or mobile**: sqlite-vec, LanceDB embedded

- **Need first-class hybrid + reranking**: Qdrant, Weaviate, Vespa, Milvus, ParadeDB

- **Need first-class multi-vector (ColBERT / ColPali)**: Qdrant, Vespa, Weaviate, ParadeDB

If you treat the defaults as **(A) Postgres + pgvector + pgvectorscale** or **(B) managed Qdrant** and move only the exceptions, you have the safest path for a fresh RAG project in May 2026.

References

- [Pinecone](https://www.pinecone.io/) — Serverless v3 managed vector DB

- [Weaviate](https://weaviate.io/) — modules + hybrid + generative search

- [Qdrant](https://qdrant.tech/) — Rust-based OSS vector DB

- [Milvus](https://milvus.io/) — large-scale distributed vector DB

- [Zilliz Cloud](https://zilliz.com/cloud) — managed Milvus

- [Chroma](https://www.trychroma.com/) — embedded vector DB

- [LanceDB](https://lancedb.com/) — Arrow-based multimodal vector DB

- [pgvector](https://github.com/pgvector/pgvector) — Postgres vector extension

- [pgvectorscale](https://github.com/timescale/pgvectorscale) — StreamingDiskANN

- [ParadeDB](https://www.paradedb.com/) — Postgres BM25 + vector

- [Vespa](https://vespa.ai/) — Yahoo search engine

- [OpenSearch k-NN](https://opensearch.org/docs/latest/search-plugins/knn/) — k-NN plugin

- [Elasticsearch dense_vector](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html) — vector search

- [Redis Vector Search](https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/) — in-memory vector search

- [MongoDB Atlas Vector Search](https://www.mongodb.com/products/platform/atlas-vector-search) — Atlas vector search

- [Turbopuffer](https://turbopuffer.com/) — serverless vector DB on S3

- [Marqo](https://www.marqo.ai/) — embeddings + search

- [Vald](https://vald.vdaas.org/) — Yahoo Japan distributed vector DB

- [NGT](https://github.com/yahoojapan/NGT) — Yahoo Japan graph ANN

- [FAISS](https://github.com/facebookresearch/faiss) — Meta vector library

- [DiskANN](https://github.com/microsoft/DiskANN) — Microsoft SSD graph index

- [ScaNN](https://github.com/google-research/google-research/tree/master/scann) — Google ANN

- [Annoy](https://github.com/spotify/annoy) — Spotify static ANN

- [HnswLib](https://github.com/nmslib/hnswlib) — HNSW reference implementation

- [Cohere Rerank](https://cohere.com/rerank) — Rerank 3.5

- [Voyage Reranker](https://docs.voyageai.com/docs/reranker) — Voyage Reranker 2

- [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) — embedding benchmark

- [ann-benchmarks](https://ann-benchmarks.com/) — ANN algorithm benchmark

- [ColBERT](https://github.com/stanford-futuredata/ColBERT) — late interaction

- [ColPali](https://github.com/illuin-tech/colpali) — vision multi-vector

- [Anthropic Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval) — chunk contextualization