Chaos and Order

Prologue — In 2026, Vector DBs Are No Longer an "AI Accessory"

In the spring of 2023, when ChatGPT first exploded, the term "vector database" appeared as an emerging category bundled with OpenAI embeddings. Pinecone was nearly the only managed option, and even the acronym "RAG" was unfamiliar. By 2024, Weaviate, Qdrant, and Milvus took firm positions, and pgvector started catching up at a frightening speed from the Postgres side. In 2025, new serverless entrants like Turbopuffer arrived, and in 2026, this market has **clearly entered maturity.**

- **Vector DBs are no longer optional.** If you build an AI product, you must do vector search in some form. The question is "which one."

- **The era of single vector search is over.** The core of 2026 is hybrid retrieval — dense + sparse + lexical + filtering in a single query.

- **Cost structures have become frighteningly varied.** Pinecone serverless auto-scaling, Turbopuffer's S3 backing, pgvector's "nearly free" cost — the same workload can be operated at a 100x cost difference.

What this article covers:

1. The 2026 vector DB map — who builds what, who uses what

2. Embedding and vector search fundamentals

3. HNSW, DiskANN, IVF, PQ — index algorithm comparison

4. Pinecone — the original managed contender

5. Weaviate — the modular powerhouse

6. Milvus and Zilliz Cloud — the China-Silicon Valley camp

7. Qdrant — the Rust-based rising star

8. Chroma — the embedded standard

9. LanceDB — Arrow-native columnar

10. pgvector — Postgres strikes back

11. Vespa.ai — the Yahoo veteran

12. Turbopuffer — the serverless dark horse

13. Elasticsearch and OpenSearch — the search camp joins

14. MongoDB, Redis, SingleStore — general DBs with vector mode

15. Sparse vectors, BM25, hybrid retrieval

16. Quantization — int8, scalar, binary, ternary

17. Multi-vector retrieval — ColBERT v2 and late interaction

18. Korean and Japanese vendors and cloud region issues

19. Cost comparison and scale matrix

20. Which DB for which workload

21. Operational anti-patterns

22. References

Chapter 1 - The 2026 Vector DB Map

First, the big picture. The market splits into five camps.

**Managed SaaS Camp**

- **Pinecone**. Founded in 2019, the oldest managed vector DB. Serverless became GA in 2024, completely changing the pricing structure. It currently offers both pod-based and serverless modes.

- **Zilliz Cloud**. The managed version of Milvus, whose vector engine is called Knowhere.

- **Turbopuffer**. Appeared in 2024. Adopted by AWS Bedrock's vector backend, Cursor, and Notion, becoming instantly famous. Its core is the serverless architecture layered on S3.

**Open Source + Cloud Camp (both BYO and managed)**

- **Weaviate**. GraphQL as first-class, BM25 hybrid supported by default from the start. Integrates embedding models into the DB through a module system.

- **Milvus**. A graduated project of the LF AI & Data Foundation, with strength in large-scale distributed architecture. A multi-index engine supporting DiskANN, IVF, and HNSW.

- **Qdrant**. Rust-based, with strengths in memory efficiency and quantization. Payload filtering is first-class.

**Embedded Camp**

- **Chroma**. The lightest start. One line `import chromadb` in Python to begin.

- **LanceDB**. Based on the Lance columnar format, Arrow-native. Data lives directly in object storage.

**Traditional DBs with Vector Mode**

- **pgvector** (PostgreSQL extension). Adds vector search to Postgres. It became a serious option when halfvec and sparsevec were added in 0.7 (2024).

- **Elasticsearch 8.18+** and **OpenSearch 2.18+**. Layered ANN over existing search indexes.

- **MongoDB Atlas Vector Search**. Integrates vectors into a document DB.

- **Redis Stack**. Vector search via the RediSearch module.

- **SingleStore**, **Couchbase Capella**, **CockroachDB**. SQL DB camp joining vectors.

**Search Engine Camp**

- **Vespa.ai**. An old engine made by Yahoo. Tensors, ANN, and ranking expressions are first-class. Strongest in large-scale ranking.

What this map means is simple: **There is no "single correct vector DB."** The right answer differs by workload size, query patterns, and operational team maturity.

Chapter 2 - Embedding and Vector Search Fundamentals

To understand vector DBs, you must first grasp the nature of embeddings.

**What is an embedding?** Mapping unstructured data like text, images, or audio to fixed-length real number vectors. OpenAI `text-embedding-3-large` is 3072 dimensions, Cohere `embed-v4` defaults to 1536 dimensions (with shorter outputs available), and BGE-M3 is 1024 dimensions. Dimensionality is a model design decision, not a guarantee that "bigger is better."

**Why is vector search needed?** Embedding models are trained so that "if meanings are similar, distances are close." This is a learned tendency rather than a hard guarantee, but it is reliable enough that finding "the k nearest vectors" to a new query is the core operation. This is called **k-NN (k Nearest Neighbors)** search.

**The problem is the curse of dimensionality.** In high dimensions, exact index structures such as k-d trees degrade to a full scan, so finding true nearest neighbors among 100 million 1024-dim vectors requires 100 million 1024-dim dot products, roughly 10^11 multiply-adds per query. Simple brute force is limited to about 1 query per second even on a GPU. That is why **ANN (Approximate Nearest Neighbors)** emerged: algorithms that give up exact results, return about 99% of the true neighbors, and run up to 1000x faster.
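
To make the baseline concrete, here is a minimal sketch of exact (brute-force) k-NN in plain Python. The vectors are made-up toy values and the function names are illustrative only; the point is that every query touches every stored vector, which is the O(n · d) cost that ANN indexes are designed to avoid.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(query, vectors, k):
    """Exact k-NN: score every stored vector and keep the k best.

    Work per query is proportional to len(vectors) * dimensions.
    """
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]


# Toy 3-dimensional "embeddings" (hypothetical values).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
nearest = brute_force_knn([1.0, 0.05, 0.0], corpus, k=2)
print(nearest)  # prints [0, 1]
```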

**Distance function choices**

- **Cosine similarity**: the most common. Ignores vector length, compares only direction. OpenAI recommends cosine for its embeddings.

- **Inner product (IP)** — includes length. Some embedding models are IP-based.

- **Euclidean (L2)** — absolute distance. Used in some image embeddings.

- **Hamming** — for binary vectors. Used on binary quantization results.

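Most vector DBs support all four. But the **distance function the model was trained on** is the correct one: on embeddings that are not unit-normalized, switching from cosine to L2 changes the ranking. OpenAI embeddings are normalized to length 1, so cosine, inner product, and L2 rank them identically, but that is a property of the model, not something to assume in general.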
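
The effect of the metric choice can be checked on a toy example. The sketch below uses hypothetical 2-dimensional vectors of very different lengths: on the raw vectors, cosine and L2 disagree about the ranking, while on unit-normalized copies cosine, inner product, and L2 all agree.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return inner_product(normalize(a), normalize(b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranking(query, vectors, score, higher_is_better):
    """Return vector indices ordered from best to worst match."""
    return sorted(
        range(len(vectors)),
        key=lambda i: score(query, vectors[i]),
        reverse=higher_is_better,
    )


# Hypothetical vectors with very different lengths.
query = [1.0, 0.2]
raw = [[3.0, 0.0], [1.0, 1.0], [0.0, 0.5]]

by_cosine = ranking(query, raw, cosine, higher_is_better=True)
by_l2_raw = ranking(query, raw, l2_distance, higher_is_better=False)

# After normalization, all three metrics produce the same order.
unit_query = normalize(query)
unit = [normalize(v) for v in raw]
by_l2_unit = ranking(unit_query, unit, l2_distance, higher_is_better=False)
by_ip_unit = ranking(unit_query, unit, inner_product, higher_is_better=True)
```

Here `by_cosine` is `[0, 1, 2]` while `by_l2_raw` is `[1, 2, 0]`, because the long vector at index 0 points in the right direction but sits far away in absolute terms.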

Chapter 3 - Index Algorithms: HNSW, DiskANN, IVF, PQ

The index determines the character of a vector DB. In 2026, four mainstream techniques dominate (HNSW, DiskANN, IVF, and PQ/SQ compression), with ScaNN as a notable fifth.

**HNSW (Hierarchical Navigable Small World)**

- A graph-based algorithm proposed by Yury Malkov in 2016. Builds a multi-layer graph that narrows from top to bottom during search.

- The fastest in-memory search. Can achieve 99% recall in 1ms on 1 million 1024-dim vectors.

- Downside: The entire index must fit in memory. 100 million vectors × 1024 dimensions × 4 bytes (float32) is about 410 GB of RAM for the raw vectors alone, before graph links.

- Supported by Pinecone pod, Weaviate, Qdrant, Milvus, pgvector (0.5+), and Chroma.
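
The memory figure above is simple arithmetic, sketched here as a back-of-the-envelope estimator. The graph-link estimate is an assumption (about 2·M neighbor ids of 4 bytes each per vector on the base layer, upper layers ignored); real implementations differ in their overhead.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_component=4):
    """Bytes needed to hold the uncompressed vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_component


def hnsw_link_bytes(num_vectors, m=16, bytes_per_link=4):
    """Rough graph overhead.

    Assumption: roughly 2 * M neighbor ids per vector on the base layer;
    the sparse upper layers are ignored.
    """
    return num_vectors * 2 * m * bytes_per_link


n, d = 100_000_000, 1024
vectors_gb = raw_vector_bytes(n, d) / 1e9
links_gb = hnsw_link_bytes(n) / 1e9
print(f"vectors: {vectors_gb:.1f} GB, links: {links_gb:.1f} GB")
```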

**DiskANN (Microsoft Research, 2019)**

- Builds an SSD-friendly graph on disk. The key is enabling billion-scale vectors on a single server.

- Memory usage is about 5% of HNSW, latency is about 2 to 3 times slower.

- The recommended algorithm in Milvus 2.5, the internal index of Turbopuffer, and one of Vespa's options.

**IVF (Inverted File Index)**

- A clustering-based method that originated in 1990s information retrieval. Divides the vector space into N clusters and searches only the m clusters near the query.

- Made famous by Faiss (Facebook), and Milvus supports various IVF, IVF-PQ, and IVF-SQ variants.

- Builds faster than HNSW and uses less memory, but has higher latency.
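
A toy version of the IVF idea is sketched below. The centroids are supplied by hand (an assumption for brevity; real systems learn them with k-means), and the names `build_ivf`, `ivf_search`, and `nprobe` are illustrative rather than any particular library's API.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe clusters closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in probe[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]


# Hypothetical 2-dimensional data with three obvious clusters.
vectors = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0], [9.8, 0.3]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
lists = build_ivf(vectors, centroids)
hits = ivf_search([4.8, 5.2], vectors, centroids, lists, k=2, nprobe=1)
```

With `nprobe=1` only two of the six vectors are compared against the query. Raising `nprobe` trades latency for recall, which is the main tuning knob of this family.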

**PQ (Product Quantization) and SQ (Scalar Quantization)**

- These are compression techniques rather than indexes. PQ splits the vector into sub-vectors and maps each to a codebook, while SQ reduces float32 to int8.

- Combined with HNSW or IVF, they reduce memory by a factor of 4 to 32. Almost all large indexes in 2026 use PQ or binary quantization.
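
The PQ encoding step can be sketched in a few lines. The codebooks here are hand-written toy values (in practice each codebook is learned with k-means over the corresponding sub-vectors), and the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codeword."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for j, codebook in enumerate(codebooks):
        sub = vector[j * sub_dim:(j + 1) * sub_dim]
        best = min(
            range(len(codebook)),
            key=lambda c: squared_distance(sub, codebook[c]),
        )
        codes.append(best)
    return codes


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen codewords."""
    return [x for j, c in enumerate(codes) for x in codebooks[j][c]]


# Two sub-spaces of 2 dimensions each, two codewords per sub-space (toy values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.8, 0.1], codebooks)
approx = pq_decode(codes, codebooks)
```

The four float32 components are stored as two small integers. With 256 codewords per sub-space, each code fits in one byte, which is where the large compression ratios come from.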

**ScaNN (Google, 2020)**

- An algorithm Google built for its own data centers. It was SOTA for 2 to 3 years. Now open-sourced, with some DBs supporting it as an option.

**Selection Guide**

| Workload | Recommendation |
| --- | --- |
| Under 1 million vectors, latency-first | HNSW (in-memory) |
| 100 million+ vectors, cost-optimized | DiskANN or IVF-PQ |
| 1 billion+ vectors, low query frequency | IVF-PQ + binary quantization |
| Embedded, single machine | HNSW + SQ |

Chapter 4 - Pinecone: The Original Managed Contender

Pinecone was founded in 2019 by Edo Liberty (formerly at AWS and Yahoo). The company first commercialized the concept of "managed vector search."

**Two operational modes as of 2026**

- **Pod-based**. The traditional operational mode. You pick a pod type like p1, s1, p2 and spin up instances. The user decides memory, disk, and CPU tradeoffs.

- **Serverless**. GA in 2024. The user only creates an index, and it auto-scales with data volume. Pricing is **$0.40/M vectors** plus a per-query cost. Most attractive for low-usage workloads.

**Core features**

- **Namespaces**. Per-user or per-tenant isolation within a single index. A standard pattern for multi-tenant RAG.

- **Hybrid search**. Dense + sparse vectors in a single query. SPLADE-style sparse embeddings indexed alongside.

- **Metadata filtering**. Arbitrary JSON metadata as a pre-filter. However, overly selective filters can degrade ANN recall and latency.

**Pinecone's strengths**

- Zero operational burden. Create an index and you are done.

- Available on AWS, GCP, and Azure, with the region chosen per index, so data residency is clear.

- 99.99% SLA. The easiest option for enterprises to adopt.

**Pinecone's weaknesses**

- Prices climb quickly. Operating 100 million vectors on serverless costs around $400+/month (storage alone). Heavy query load easily exceeds $1000.

- No self-hosted option. Limiting for Japanese and Korean customers where data sovereignty matters.

- Migration is hard after entry — the index format is closed.

**When to choose?** First RAG prototypes, startups with limited operational headcount, places where enterprise SLAs matter.

Chapter 5 - Weaviate: The Modular Powerhouse

Weaviate is an open-source vector DB started by SeMI Technologies (now Weaviate B.V.) in 2019. As of 2026, it is at version 1.27+, with "modules" as its core concept.

**Core concept: Modules**

- Activating modules like `text2vec-openai`, `text2vec-cohere`, or `text2vec-huggingface` makes the DB **automatically generate embeddings** when you insert data.

- In other words, you do not need to call the embedding model directly from the client. "Insert data, then search" just works.

- Downside: If you change the embedding model, you must rebuild the index.

**Hybrid search as first-class**

- BM25 (lexical) plus vector search combined by weights. This pattern was established earliest by Weaviate.

- Use the `hybrid` search operator in place of `nearText`; its `alpha` parameter sets the balance between BM25 and vector scores.

**Multi-tenancy**

- Officially supported from 1.20. Per-tenant isolation within a class. A standard pattern for SaaS RAG.

**Dynamic index (1.25+)**

- Automatically switches the index type based on data volume. When small, uses flat (brute force); once it exceeds a certain size, switches automatically to HNSW.

**Weaviate's strengths**

- Both GraphQL and REST APIs. Natural for GraphQL-friendly teams.

- Embedding model integration. Less client code.

- BM25 hybrid by default from the start.

**Weaviate's weaknesses**

- GraphQL learning curve. Hard to enter if unfamiliar.

- Cluster operation tends to be complex. Weaviate Cloud Services resolves this but at increased cost.

**When to choose?** Workloads requiring hybrid search, when you want to delegate embedding to the DB, and GraphQL-friendly teams.

Chapter 6 - Milvus and Zilliz Cloud: The Distributed Camp

Milvus is an open-source vector DB that originated in China, started by Zilliz in 2019. As of 2026, version 2.5+ is mainstream, and the managed offering is Zilliz Cloud.

**Architecture — the most distributed design**

- A microservice structure separating Coordinator, Query Node, Data Node, and Index Node.

- Uses Pulsar or Kafka as a message queue to process writes asynchronously.

- Result: A design that works well at the largest scale (1 billion+ vectors).

**Knowhere engine**

- The core vector engine of Milvus. Originated from Faiss (Facebook's ANN library), but has evolved independently since.

- Offers nearly every algorithm — HNSW, IVF-Flat, IVF-PQ, IVF-SQ, DiskANN, ScaNN — as options.

**Multi-index support**

- Multiple index types can coexist in the same collection. Different indexes can be used for different workloads.

**Milvus's strengths**

- The widest variety of index algorithms supported. Most suitable for research and experimentation.

- The most validated open source at large scale (1 billion+).

- A graduated project of the LF AI & Data Foundation, so governance is solid.

**Milvus's weaknesses**

- Operation is the most complex. Helm chart-based K8s deployment is standard, but with many components, the learning curve is steep.

- Overkill for small scale (under 1 million).

**Zilliz Cloud**

- The managed version of Milvus. Multi-cloud on AWS, GCP, and Azure. The AWS Tokyo region is officially supported, making it popular with Japanese customers.

**When to choose?** Workloads of 1 billion+ vectors, when you want to experiment with index algorithms, and teams with K8s operational expertise.

Chapter 7 - Qdrant: The Rust-Based Rising Star

Qdrant is a Rust-based vector DB started by Andrey Vasnetsov in 2021. As of 2026, version 1.13+ is current, with Qdrant Cloud as the managed offering.

**Why Rust?**

- Memory safety + performance. Memory usage is 30 to 50% lower than Go or Python-based competitors.

- No GC, so latency variance is small. Favorable for workloads where P99 latency matters.

**Core features**

- **Payload filtering**. Attach arbitrary JSON payloads to every point and filter in queries. The difference from other DBs is the "filter-first" index structure — ANN works well even when filter results are small.

- **Scalar and binary quantization**. Scalar quantization (float32 to int8, 4x compression) has been supported since 1.1, and binary quantization (float32 to 1 bit, 32x compression) since 1.5. Binary quantization is nearly standard in 2026.

- **Multivector**. Multiple vectors per point. Supports late interaction models like ColBERT.

- **Sparse vectors**. Supports SPLADE-style sparse embeddings from 1.7.

**Qdrant's strengths**

- The best memory efficiency.

- The cleanest combination of filtering and ANN.

- Official clients for Rust, Python, TypeScript, and Go.

- Reasonable pricing (Qdrant Cloud costs roughly 30 to 50% of what Pinecone does).

**Qdrant's weaknesses**

- Fewer people with operational experience. The community is smaller than Pinecone's or Weaviate's.

- Distributed mode stability is still being validated. Milvus has an edge at 10 billion+ scale.

**When to choose?** Workloads where P99 latency matters, search with heavy payload filtering, and cost-sensitive places.

Chapter 8 - Chroma: The Embedded Standard

Chroma is an embedded vector DB started by Jeff Huber and Anton Troynikov in 2022. As of 2026, it is at version 0.5+.

**Concept — "AI-native open-source embedding database"**

- One line `import chromadb; client = chromadb.Client()` in Python to start.

- Supports both local disk mode (`PersistentClient`) and client-server mode.

**What is different**

- The lightest start. The fastest prototyping.

- Embedding functions can be bound to collections — embeddings are generated automatically on insert.

- Multi-modal collections (text and images in the same collection).

**Chroma's strengths**

- The easiest to start. Practically the default in LangChain and LlamaIndex tutorials.

- Solid embedded mode. Optimal for small RAG demos.

**Chroma's weaknesses**

- Unsuitable for large scale (100 million+). Single-machine limits.

- Distributed operational options are weak.

**When to choose?** RAG prototyping, notebook-level demos, embedded use under 1 million.

Chapter 9 - LanceDB: Arrow-Native Columnar

LanceDB is an embedded vector DB started by Eto Labs (now LanceDB Inc.) in 2023. As of 2026, version 0.20+ is current.

**Core — the Lance columnar format**

- Lance is a columnar format aiming to be the successor to Parquet. It is Arrow-native and optimized for vector search.

- Data lives directly on object storage (S3 or GCS). No separate DB instance is needed.

**Why columnar?**

- Embeddings + metadata + original text in a single file. No need to sync with a separate RDBMS.

- The Lance format embeds the vector index directly in the file.

**Blob storage mode (0.18+)**

- Data lives in object storage, and only required partitions are fetched at query time.

- A concept similar to Turbopuffer, but embedded.

**LanceDB's strengths**

- Data + vector + metadata in one system. Natural integration with ML pipelines.

- Arrow-native — directly compatible with pandas, Polars, and DuckDB.

- Open source + Rust core (stability).

**LanceDB's weaknesses**

- The managed SaaS is small (LanceDB Cloud is still developing).

- Few case studies of large-scale distributed operations.

**When to choose?** Places where ML pipelines are Arrow-based, object-storage-friendly workloads, and when you want to manage data and vectors in one system.

Chapter 10 - pgvector: Postgres Strikes Back

pgvector is a Postgres extension started by Andrew Kane in 2021. As of 2026, version 0.8+ is current. One of the most shocking trends is the rise of the opinion that "instead of a dedicated vector DB, just use Postgres."

**Why pgvector?**

- No additional infrastructure if Postgres is already running.

- Add vector search while keeping transactions, foreign keys, and JOINs.

- Zero learning curve if your ops team is already familiar with Postgres.

**Key features as of 0.8**

- **halfvec**. A float16 vector type, added in 0.7. 50% memory savings.

- **sparsevec**. A sparse vector type, also added in 0.7. Supports SPLADE-style sparse embeddings.

- Both **HNSW** (since 0.5) and **IVFFlat** indexes supported.

**Extension ecosystem**

- **pg_vectorize** (Tembo). A wrapper that automates embedding generation and search inside Postgres.

- **pgvectorscale** (TigerData / Timescale). Brings a DiskANN-style index to Postgres. Key for large-scale workloads.

- **pgvecto.rs**. A compatible extension rewritten in Rust. Claims to be faster.

**pgvector's strengths**

- The simplest operation. Postgres is already there.

- The cheapest cost. No separate DB licensing or SaaS cost.

- Natural transactions, JOINs, and complex filtering.

**pgvector's weaknesses**

- Perfect for workloads under 1 million vectors, but latency rises steeply beyond 10 million.

- High load when writes and index builds happen simultaneously.

- Cost of Postgres operation itself — backups, HA, tuning.

**When to choose?** When Postgres is already the main DB, for workloads under 1 million vectors, and when transactions and vector search must coexist.

Chapter 11 - Vespa.ai: The Yahoo Veteran

Vespa was started by Yahoo in 2003 as an internal search engine and was open-sourced in 2017. It remains actively updated in 2026.

**Vespa's identity — "Not just a vector DB, but a full-stack search engine"**

- ANN (HNSW) + tensor operations + ranking expressions + distributed indexes + real-time indexing in one system.

- "Ranking" is a first-class citizen. Learning-to-rank models, GBDT, and neural rankers are evaluated directly on the index.

**Tensor model**

- Vespa treats all data as tensors. Vectors are 1D tensors, embedding groups are 2D tensors. Multi-vectors like ColBERT are naturally expressed.

**Vespa's strengths**

- The strongest ranking. Multi-stage retrieval (sparse to dense to reranker) is the cleanest.

- Validation at the largest scale. Yahoo used it to search tens of billions of documents.

- Real-time indexing as first-class.

**Vespa's weaknesses**

- The steepest learning curve. Lots of documentation, but many core concepts.

- Overkill for small workloads.

- The highest operational complexity.

**When to choose?** Places where search ranking is a core differentiator, 1 billion+ document scale, and multi-stage retrieval pipelines.

Chapter 12 - Turbopuffer: The 2024 Dark Horse

Turbopuffer is a serverless vector DB that emerged in 2024. The reason it quickly became famous is simple: **AWS Bedrock, Cursor, and Notion adopted it.**

**Concept — "Vector search layered on S3"**

- Stores index data on object storage (S3). Memory is used only as cache.

- Result: Storage costs are overwhelmingly cheap. You can operate 100 million vectors at almost disk-only cost.

- Downside: Cold query latency is high (first query fetches from S3).

**Why adoption was fast**

- Adopted as one of the AWS Bedrock Knowledge Base backends. That is, Bedrock customers automatically use Turbopuffer.

- Cursor uses it for codebase indexing. Efficiently searches tens of millions of code chunks per company.

- Notion adopted it for workspace search. Multi-tenancy is core.

**Turbopuffer's strengths**

- The cheapest price. Inactive indexes are nearly free.

- Multi-tenancy-friendly — per-namespace indexes hibernate independently.

- AWS-friendly.

**Turbopuffer's weaknesses**

- Serverless cold start. First queries are slow.

- Features are still narrow. Advanced features like multi-vector and sparse arrive later than other camps.

- No self-hosted option.

**When to choose?** Multi-tenant RAG, when you must operate many inactive indexes, and AWS-region-friendly workloads.

Chapter 13 - Elasticsearch and OpenSearch: The Search Camp Joins

Traditional search engine camps have quickly absorbed vector search.

**Elasticsearch 8.18+**

- Store embeddings in the `dense_vector` field type. HNSW index supported.

- BM25 and hybrid queries are natural (Elasticsearch is the home of BM25).

- **Byte vectors** and quantization officially supported from 8.13.

- The greatest strength: integration with the existing ELK stack.

**OpenSearch 2.18+**

- AWS-forked Elasticsearch. Strong managed offerings on AWS regions.

- k-NN plugin as first-class. Three engine options: NMSLIB, Faiss, and Lucene.

- One of the most common vector DB options on Korean KT Cloud and AWS Tokyo for Japan.

**When to choose?** When ELK is already in operation, when you want to combine lexical and vector in a single query, and when AWS's OpenSearch Service is natural.

**Weaknesses**

- ANN performance is slightly weaker than dedicated vector DBs.

- Index build times are long.

- Operational complexity follows the search engine.

Chapter 14 - MongoDB, Redis, SingleStore: General DBs with Vector Mode

**MongoDB Atlas Vector Search**

- Announced in 2023, GA in 2024. Works only on MongoDB Atlas (managed).

- As a document DB, embeddings + original document + metadata fit in a single document.

- HNSW-based. Integrated into the aggregation pipeline as the `$vectorSearch` stage.

- Weakness: Self-hosted MongoDB is unsupported. Prices climb quickly.

**Redis Stack (RediSearch + Vector)**

- In-memory — the fastest latency.

- Weakness: Data must fit in memory. 100 million vectors is unrealistic.

- Strength: Places needing sub-millisecond latency (e.g., recommendation systems).

**SingleStore**

- HTAP DB (OLTP + OLAP). Integrates vector search into SQL.

- SQL JOINs and vector search mix naturally.

- Weakness: Operational cost. Managed is expensive.

**Couchbase Capella**

- Vector search supported from 7.6. The vector mode of a JSON document DB.

- Mobile-friendly (syncs with Couchbase Lite).

**CockroachDB and ClickHouse**

- CockroachDB has experimental vector indexes from 24.1.

- ClickHouse has experimental ANN indexes from 24.x. Vector search in OLAP context.

The common message of this camp: **Vector search is now a basic feature of databases.**

Chapter 15 - Sparse Vector, BM25, Hybrid Retrieval

RAG in 2026 does not end with a single dense vector.

**What is a sparse vector?**

- A high-dimensional vector where most dimensions are 0 and only some dimensions have values.

- Example: SPLADE generates sparse vectors of roughly 30,000 dimensions, one per token in BERT's WordPiece vocabulary.

- Advantages: Lexical matching and semantic matching in one model — keyword search accuracy + embedding semantic understanding.
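
Because almost every dimension is zero, sparse vectors are usually stored as a mapping from dimension (here, a token) to weight. A minimal sketch of the resulting dot product, with made-up weights:

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts.

    Only dimensions present in both vectors contribute, so the cost depends
    on the number of non-zero entries, not on the nominal dimensionality.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)


# Hypothetical term weights, keyed by token for readability.
query_vec = {"vector": 1.2, "database": 0.8}
doc_vec = {"vector": 0.9, "search": 0.4, "database": 0.5}
score = sparse_dot(query_vec, doc_vec)
```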

**BM25**

- A search standard since the 1990s. A real-valued lexical score in the tf-idf family, with term-frequency saturation and document-length normalization.

- Still powerful. It often does better than embeddings on queries with clear keywords.
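
For reference, a compact sketch of BM25 scoring over pre-tokenized documents follows. It uses the common defaults `k1=1.2` and `b=0.75` and the non-negative idf variant `ln(1 + (N - n + 0.5) / (n + 0.5))`; both are assumptions of this sketch, since engines differ in the exact formula.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            # Term frequency saturates; long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    ["vector", "database", "search"],
    ["postgres", "extension"],
    ["vector", "vector", "index"],
]
scores = bm25_scores(["vector"], docs)
```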

**Hybrid retrieval patterns**

1. **Reciprocal Rank Fusion (RRF)**. Merges two search results, dense and sparse, by rank. The simplest and frequently works well.

2. **Weighted sum**. `score = alpha · dense_score + (1-alpha) · sparse_score`. Needs alpha tuning.

3. **Two-stage (cascade)**. Sparse for 1000 candidates, then dense reranker for 100, then cross-encoder for 10.
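
The first two patterns are small enough to sketch directly. RRF below uses the conventional constant `k=60`. The weighted sum min-max normalizes each score set first, which is an assumption of this sketch: dense and sparse scores live on different scales, and some normalization is needed before they can be mixed. Ties are broken by document id so the output is deterministic.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def min_max(scores):
    """Rescale scores to the range 0..1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def weighted_fusion(dense_scores, sparse_scores, alpha=0.5):
    """score = alpha * dense + (1 - alpha) * sparse, on normalized scores."""
    dense, sparse = min_max(dense_scores), min_max(sparse_scores)
    fused = {
        d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)
        for d in set(dense) | set(sparse)
    }
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical results from a dense and a sparse retriever.
rrf_order = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
dense_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
sparse_scores = {"b": 12.0, "d": 7.0, "a": 2.0}
weighted_order = weighted_fusion(dense_scores, sparse_scores, alpha=0.5)
```

Both calls return `["b", "a", "d", "c"]` here: document `b` is near the top of both lists, so it beats `a`, which is first in only one.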

**Which DBs do hybrid well?**

- Weaviate, Vespa — first-class from the start.

- Qdrant, Pinecone — sparse vector support added.

- Elasticsearch — the home of BM25, hybrid is natural.

- pgvector — added with the sparsevec type, but hybrid scoring must be written by hand.

Chapter 16 - Quantization: int8, scalar, binary, ternary

A core technique that cuts vector storage cost to between 1/4 and 1/32 of the original.

**Scalar quantization (int8)**

- Maps float32 to int8 (256 levels). 4x memory savings.

- Recall loss is usually 1 to 3 percentage points. The first option for most workloads.

- Officially supported by Qdrant, Milvus, pgvector, and Weaviate.
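
A minimal sketch of scalar quantization follows, assuming a single global min/max range for the whole collection (real engines often use per-dimension or quantile-based ranges). Codes are stored as unsigned bytes here, which carries the same 8 bits per component as int8.

```python
def sq_fit(vectors):
    """Find the value range to quantize over (one global range, for simplicity)."""
    flat = [x for v in vectors for x in v]
    return min(flat), max(flat)


def sq_encode(vector, lo, hi):
    """Map each float to one of 256 levels; 1 byte instead of 4 per component."""
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero on constant data
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vector)


def sq_decode(code, lo, hi):
    """Approximate reconstruction of the original floats."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in code]


vec = [-1.0, -0.5, 0.0, 0.5, 1.0]
lo, hi = sq_fit([vec])
code = sq_encode(vec, lo, hi)
restored = sq_decode(code, lo, hi)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

The reconstruction error per component is bounded by half a quantization step, `(hi - lo) / 255 / 2`, which is why recall loss is usually small.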

**Product quantization (PQ)**

- Splits vectors into sub-vectors and maps each to a codebook. 8 to 32x memory savings.

- Recall loss is large (5 to 15 percentage points). Unavoidable for large indexes.

**Binary quantization (1-bit)**

- Each dimension to 1 bit. 32x savings versus float32.

- Sentence embedding models released in 2024 to 2025 were trained to be binary-friendly, greatly reducing recall loss.

- Cohere `embed-v4` binary mode and OpenAI `text-embedding-3` MRL (Matryoshka) naturally combine dimension reduction and quantization.
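
Binary quantization and Hamming distance fit in a few lines. The sketch below thresholds each component at zero (a common choice, but an assumption here) and packs the bits into a Python integer.

```python
def binarize(vector):
    """Keep only the sign of each component: 1 bit per dimension."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two binary codes differ."""
    return bin(a ^ b).count("1")


# Hypothetical 4-dimensional embeddings.
code_a = binarize([0.3, -0.2, 0.7, -0.9])  # bits 1010
code_b = binarize([0.1, 0.4, 0.6, -0.5])   # bits 1110
distance = hamming(code_a, code_b)
```

A common pattern is to use the cheap Hamming comparison to collect candidates and then rescore them with the original float vectors.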

**Ternary quantization (1.58-bit)**

- Rising after the 2024 BitNet paper. Each dimension to one of three values: -1, 0, +1.

- Official vector DB support is still rare, but experimental options are growing.

**MRL (Matryoshka Representation Learning)**

- Popularized by OpenAI's `text-embedding-3` models. Keeping only the first N dimensions of a vector (and re-normalizing) preserves most of its meaning.

- Cutting 3072 to 1024 dimensions yields about 67% memory savings with minimal recall loss.
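
Truncation itself is trivial; the step that is easy to forget is re-normalizing afterwards so that cosine and inner product remain comparable. A small sketch with hypothetical helper names, valid only for models trained with MRL:

```python
import math


def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` components, then rescale to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def memory_saving(full_dims, kept_dims):
    """Fraction of vector storage saved by truncation."""
    return 1 - kept_dims / full_dims


short = truncate_and_renormalize([3.0, 4.0, 12.0], dims=2)
saving = memory_saving(3072, 1024)
print(f"{saving:.1%}")  # prints 66.7%
```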

**Practical recommendations**

- First step: float32 to int8 (scalar). Nearly lossless.

- Next: HNSW + int8.

- 100 million+ scale: IVF-PQ or binary.

Chapter 17 - Multi-vector Retrieval: ColBERT v2 and Late Interaction

Traditional dense retrieval was "one document to one vector." A big change in 2024 to 2025 is the rise of **multi-vector retrieval**.

**ColBERT v2 (Stanford, 2022)**

- Preserves embeddings at the token level. A document with 100 tokens has 100 vectors.

- Queries are also broken down to tokens, matched by max-sim operation.

- Result: Naturally combines semantic matching and lexical matching.
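
The max-sim operation can be written out directly: for each query token, take its best-matching document token, then sum those maxima. The token vectors below are toy values chosen for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best doc-token match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Hypothetical 2-dimensional token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.9, 0.1]]              # covers only the first

score_a = maxsim(query_tokens, doc_a)
score_b = maxsim(query_tokens, doc_b)
```

`doc_a` scores 2.0 and `doc_b` about 1.1: a document is rewarded for matching every query token somewhere, which a single pooled vector cannot express.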

**Why stronger than single vectors?**

- Nothing is lost to averaging. In a document containing "machine learning," both "machine" and "learning" survive as individual vectors.

- Especially strong for queries with distributed intent.

**Late interaction**

- The core mechanism of ColBERT v2. The token-by-token interaction between query and document is computed last.

- Downside: Storage cost grows with token count, about 100x for a 100-token document. The index becomes huge.

**Which DBs support multi-vector?**

- Vespa — most natural with the tensor model.

- Qdrant — officially supports the `Multivector` type.

- Weaviate — supported via named vectors from 1.25.

- Milvus — supports multi-vector collections from 2.4.

**Practical use**

- Code search (what Cursor does).

- Legal and medical search — domains where precise term matching matters.

- Multilingual search — token-level matching preserves meaning.

Chapter 18 - Korean and Japanese Vendors, Cloud Region Issues

For Korean and Japanese enterprises choosing a vector DB, data residency and region issues are decisive.

**Korea**

- **Naver Cloud Platform**. NCP officially offers managed Milvus. The de facto standard for Korean government and financial customers.

- **KT Cloud**. Provides OpenSearch-based vector search.

- **Kakao i Cloud**. A solution combining its own embedding model and vector DB.

- **KoSearch** (spun off from Naver Search team) — A Korean-specialized search engine with first-class vector search.

- Self-hosted options — Weaviate, Qdrant, and Milvus are frequently adopted.

**Japan**

- **AWS Tokyo (ap-northeast-1)** — the most standard. Pinecone, Qdrant, and Zilliz all officially support this region.

- **AWS Osaka (ap-northeast-3)** — used as a DR pair.

- **GCP Tokyo (asia-northeast1)** — the Japanese region of Weaviate Cloud.

- **Azure Japan East** — preferred by some enterprises (financial and public sector).

- Japan-only clouds like Sakura Internet and IDC Frontier have nearly no managed vector DB options → self-hosting is common.

**Regulatory issues**

- **Personal Information Protection Act** (Korea) and **APPI** (Japan) — embeddings may be classified as PII. Some regulators view embeddings as PII because they may allow recovering the original text.

- Medical and financial domains — data residency is stricter. Self-hosting is a natural choice.

Chapter 19 - Cost Comparison and Scale Matrix

Prices change quickly, so verify exact numbers each time. The following is a rough comparison as of May 2026 (assuming 10 million vectors, 1024 dimensions, 10 million queries per month).

| Option | Deployment | Storage / instance | Queries | Monthly total |
| --- | --- | --- | --- | --- |
| Pinecone Serverless | SaaS | $100 | $200 | $300 |
| Weaviate Cloud (sandbox) | SaaS | $50 | $100 | $150 |
| Qdrant Cloud | SaaS | $80 | $80 | $160 |
| Zilliz Cloud (Standard) | SaaS | $120 | $150 | $270 |
| Turbopuffer | SaaS | $30 | $50 | $80 |
| Self-hosted Qdrant (1 vCPU, 8GB) | EC2 | $50 | included | $50 |

**Interpretation**

- The cheapest options are Turbopuffer or self-hosted Qdrant.

- The most expensive options are Pinecone pod or MongoDB Atlas.

- Once operational headcount cost is included, SaaS is often cheaper.

- pgvector is favorable only when Postgres is already there. Spinning it up new makes RDS costs significant.

**Scale matrix**

| Vector count | Recommendation | Reason |
| --- | --- | --- |
| Under 100,000 | Chroma, pgvector | Embedded/single machine is enough |
| 1 million | pgvector, Qdrant Cloud | Works well on a single instance |
| 10 million | Qdrant, Weaviate, Pinecone Serverless | When distributed indexing becomes necessary |
| 100 million | Milvus, Pinecone Pod, Vespa | Serious distributed operations needed |
| 1 billion+ | Milvus, Vespa, Zilliz Cloud Enterprise | Multi-node operational experience essential |

Chapter 20 - Which DB for Which Workload

**Scenario 1: First RAG prototype (100,000 vectors)**

- Recommendation: Chroma or pgvector

- Reason: Fastest start. Zero decision cost.

**Scenario 2: Startup RAG SaaS (1 million vectors, multi-tenant)**

- Recommendation: Pinecone Serverless or Turbopuffer

- Reason: Managed operational ease + multi-tenancy friendliness.

**Scenario 3: Enterprise RAG (10 million vectors, SLA important)**

- Recommendation: Pinecone Pod or Weaviate Cloud

- Reason: 99.99% SLA, well-equipped enterprise contracts.

**Scenario 4: Large-scale search system (100 million+ vectors, ranking is core)**

- Recommendation: Vespa or Milvus

- Reason: Serious distributed operations, multi-stage ranking.

**Scenario 5: Code search (10 million vectors, multi-vector needed)**

- Recommendation: Qdrant or Vespa

- Reason: Multivector and payload filtering are first-class.

**Scenario 6: Cost optimization (low-usage workloads)**

- Recommendation: Turbopuffer or self-hosted Qdrant

- Reason: Usage-based pricing or sufficient on a single EC2.

**Scenario 7: Postgres already exists, want to add vectors**

- Recommendation: pgvector + pgvectorscale

- Reason: No separate infrastructure, handled in the same transaction.

**Scenario 8: Japanese data residency requirements**

- Recommendation: Pinecone or Zilliz Cloud in AWS Tokyo, or self-hosted Qdrant

- Reason: Region guarantee, APPI compliance.

Chapter 21 - Operational Anti-patterns

Mistakes frequently seen in vector DB operations. All from real cases.

**Anti-pattern 1: Changing embedding models frequently**

- When the embedding model changes, you must rebuild the entire index. Re-embedding 100 million vectors costs thousands of dollars.

- Response: Revisit the model choice at most once a year, and record the model name and version explicitly with the index, the way frontmatter versions a document.
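
One way to make that versioning explicit is to store the model identifier next to every vector and reject mixed writes. The following is a minimal sketch, not any vendor's API: `EmbeddingRecord`, `VersionedIndex`, and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EmbeddingRecord:
    doc_id: str
    vector: tuple[float, ...]
    model: str          # name of the embedding model that produced the vector
    model_version: str  # a label you control, e.g. a date or release tag


@dataclass
class VersionedIndex:
    """Toy in-memory index pinned to exactly one model and version."""
    model: str
    model_version: str
    records: dict[str, EmbeddingRecord] = field(default_factory=dict)

    def upsert(self, record: EmbeddingRecord) -> None:
        # Vectors from different models live in different spaces, so refuse to mix them.
        if (record.model, record.model_version) != (self.model, self.model_version):
            raise ValueError(
                f"index is pinned to {self.model}@{self.model_version}, "
                f"got {record.model}@{record.model_version}"
            )
        self.records[record.doc_id] = record
```

With a guard like this, a model change fails at write time with a clear error, instead of silently degrading search quality.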

**Anti-pattern 2: Filter selectivity too high**

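- A filter like "search only documents of user A" can leave only about 100 matching vectors. Graph traversal then passes mostly through filtered-out nodes, so recall drops or latency climbs.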

- Response: For multi-tenancy, use namespace separation. For high selectivity, use a dedicated index.
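
Namespace separation can be pictured as one partition per tenant, so a query never touches another tenant's vectors. The sketch below is a toy in-memory illustration; `NamespacedStore` is a hypothetical name, and the exact scan inside a partition stands in for whatever per-namespace index a real engine would maintain.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class NamespacedStore:
    """Toy store with one partition per tenant."""

    def __init__(self):
        self._partitions = defaultdict(dict)  # tenant -> {doc_id: vector}

    def upsert(self, tenant, doc_id, vector):
        self._partitions[tenant][doc_id] = vector

    def query(self, tenant, vector, k=10):
        # Only the tenant's own partition is scanned; no post-filtering is needed.
        part = self._partitions.get(tenant, {})
        ranked = sorted(part.items(), key=lambda kv: cosine(vector, kv[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]
```

Because the tenant boundary is applied before the search rather than as a filter during it, a small tenant costs a small scan instead of a degraded graph traversal.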

**Anti-pattern 3: Not measuring ANN recall**

- Most teams do not actually know recall@10 of their operational index.

- Response: Compute brute-force ground truth for a sample of about 100 queries, then measure recall@10 against it on a regular schedule.
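
The measurement itself is small. The sketch below assumes Euclidean distance and a callable `ann_search(query, k)` that returns a list of vector ids; both are assumptions for illustration, so swap in your own metric and client call.

```python
import math


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k nearest vectors by Euclidean distance."""
    dists = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(dists)[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate result also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def mean_recall(queries, vectors, ann_search, k=10):
    """Average recall@k over a query sample; ann_search(query, k) returns ids."""
    scores = [
        recall_at_k(ann_search(q, k), exact_top_k(q, vectors, k))
        for q in queries
    ]
    return sum(scores) / len(scores)
```

The brute-force pass is slow on a large corpus, which is why it runs on a query sample rather than on live traffic.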

**Anti-pattern 4: Not tuning hybrid score weights**

- Weights start at 0.5 / 0.5 and are never adjusted, so they rarely fit the data distribution.

- Response: Grid search on a golden dataset.
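
A grid search over one fusion weight can be sketched as follows. The assumptions are mine, not the article's: dense and sparse scores are already normalized to a comparable range, fusion is a plain weighted sum, the golden set holds one relevant document per query, and the metric is hit rate at k. All function names are hypothetical.

```python
def fuse(dense, sparse, alpha):
    """Weighted sum of two score dicts {doc_id: score}; a missing score counts as 0."""
    ids = set(dense) | set(sparse)
    return {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0) for d in ids}


def hit_rate_at_k(golden, alpha, k=3):
    """Share of golden queries whose relevant doc appears in the fused top-k.

    golden is a list of (dense_scores, sparse_scores, relevant_doc_id) tuples.
    """
    hits = 0
    for dense, sparse, relevant in golden:
        fused = fuse(dense, sparse, alpha)
        top = sorted(fused, key=lambda d: (-fused[d], d))[:k]  # ties broken by id
        hits += relevant in top
    return hits / len(golden)


def grid_search_alpha(golden, k=3, steps=11):
    """Try evenly spaced alpha values in [0, 1]; return (best_alpha, best_score)."""
    candidates = [i / (steps - 1) for i in range(steps)]
    scored = [(a, hit_rate_at_k(golden, a, k)) for a in candidates]
    return max(scored, key=lambda pair: pair[1])
```

On ties, `max` keeps the first candidate, so the smallest winning alpha is returned. A larger golden set makes ties less likely.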

**Anti-pattern 5: Arbitrary vector dimension truncation**

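- "I cut 1536 to 768 dimensions to save memory." → If the model was not trained with MRL, the leading dimensions do not carry the full meaning, and truncation degrades retrieval quality.

- Response: Check the model card for MRL support. OpenAI's text-embedding-3 models are MRL-trained and can be shortened through the API's dimensions parameter.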
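
For a model that does support MRL, manual truncation is a prefix slice followed by re-normalization. The sketch below assumes unit-length vectors are wanted for cosine or dot-product search; `truncate_and_renormalize` is a hypothetical helper name.

```python
import math


def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` components and rescale the result to unit L2 norm."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]
```

The function runs on any vector, but the result is only meaningful when the model was trained so that prefixes are usable embeddings on their own.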

**Anti-pattern 6: Storing embeddings and original text separately**

- Embeddings in the DB, original text in S3. JOIN every time a human views results.

- Response: Embedding + original + metadata in one record. Or integrated options like LanceDB or MongoDB.

**Anti-pattern 7: Index build and queries on the same instance**

- Query latency rises 5x during index rebuild.

- Response: Dedicated build instance, blue-green index swap.
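
A blue-green swap amounts to readers resolving the live index through an alias while the new index is built elsewhere. The sketch below is a toy in-process version; `IndexAlias` is a hypothetical class, and the index objects are assumed to expose a `search(query, k)` method. Several engines offer collection or index aliases that play the same role on the server side.

```python
import threading


class IndexAlias:
    """Toy alias: queries go to whichever index is currently live."""

    def __init__(self, live_index):
        self._live = live_index
        self._lock = threading.Lock()

    def search(self, query, k):
        # Grab a reference under the lock, then search outside it so a slow
        # query never blocks a swap.
        with self._lock:
            index = self._live
        return index.search(query, k)

    def swap(self, new_index):
        """Point the alias at a fully built index and return the old one."""
        with self._lock:
            old, self._live = self._live, new_index
        return old
```

The old index is returned rather than dropped, so it can be kept around for a rollback until the new one has proven itself.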

**Anti-pattern 8: HNSW M and ef parameters at defaults**

- The optimal values differ by data distribution. With defaults (for example M=16, ef=64), recall can land anywhere from about 90% to 99% depending on the dataset.

- Response: Tune M and ef to the dataset.
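
Tuning usually means sweeping a few (M, ef) pairs, measuring recall and latency for each on your own data, and keeping the cheapest pair that meets the target. The sketch below covers only the selection step; `SweepResult` and `pick_config` are hypothetical names, and the measurements must come from your own benchmark runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepResult:
    m: int          # HNSW graph degree used for the build
    ef: int         # search-time candidate list size
    recall: float   # measured recall@10 on your query sample
    p95_ms: float   # measured p95 query latency in milliseconds


def pick_config(results, target_recall):
    """Return the lowest-latency result that meets the recall target, or None."""
    ok = [r for r in results if r.recall >= target_recall]
    return min(ok, key=lambda r: r.p95_ms) if ok else None
```

Returning `None` when nothing qualifies is deliberate: it signals that the sweep range needs widening, not that the least bad option should ship.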

Chapter 22 - References

Mostly official documentation, plus major papers and public announcements.

**Vector DB official docs**

- Pinecone — https://docs.pinecone.io/

- Weaviate — https://weaviate.io/developers/weaviate

- Milvus — https://milvus.io/docs

- Zilliz Cloud — https://docs.zilliz.com/

- Qdrant — https://qdrant.tech/documentation/

- Chroma — https://docs.trychroma.com/

- LanceDB — https://lancedb.github.io/lancedb/

- pgvector — https://github.com/pgvector/pgvector

- pg_vectorize — https://github.com/tembo-io/pg_vectorize

- pgvectorscale — https://github.com/timescale/pgvectorscale

- pgvecto.rs — https://github.com/tensorchord/pgvecto.rs

- Vespa.ai — https://docs.vespa.ai/

- Turbopuffer — https://turbopuffer.com/docs

**Existing search/DB camps' vector docs**

- Elasticsearch dense_vector — https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html

- OpenSearch k-NN — https://opensearch.org/docs/latest/search-plugins/knn/index/

- MongoDB Atlas Vector Search — https://www.mongodb.com/docs/atlas/atlas-vector-search/

- Redis Vector Search — https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/

- SingleStore Vector — https://docs.singlestore.com/

**Core papers**

- HNSW (Malkov and Yashunin, 2016) — https://arxiv.org/abs/1603.09320

- DiskANN (Subramanya et al., 2019) — https://www.microsoft.com/en-us/research/publication/diskann-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node/

- ScaNN (Guo et al., 2020) — https://arxiv.org/abs/1908.10396

- ColBERT v2 (Santhanam et al., 2022) — https://arxiv.org/abs/2112.01488

- SPLADE (Formal et al., 2021) — https://arxiv.org/abs/2107.05720

- MRL — Matryoshka Representation Learning (Kusupati et al., 2022) — https://arxiv.org/abs/2205.13147

- BitNet b1.58 (Ma et al., 2024) — https://arxiv.org/abs/2402.17764

**Reference articles**

- Anthropic — Contextual Retrieval (2024) — https://www.anthropic.com/news/contextual-retrieval

- Cohere — Binary Embeddings — https://cohere.com/blog/int8-binary-embeddings

- OpenAI — New embedding models — https://openai.com/index/new-embedding-models-and-api-updates/

- Microsoft — DiskANN announcement — https://www.microsoft.com/en-us/research/blog/diskann-billion-scale-similarity-search/

Epilogue — Choosing an Opinion

This article's one-sentence summary: **A vector DB is not a tool but an opinion.** Pinecone believes "managed is the answer for vector search." Milvus believes "a vector DB is a distributed system." Qdrant believes "Rust efficiency is everything." pgvector believes "Postgres is already there — why buy a new DB?" Vespa believes "vector search is a subproblem of ranking." Turbopuffer believes "layering on object storage makes costs 100x cheaper."

The same problem yields different solutions when opinions differ. So, just as when choosing a model, **be conscious of the opinion you are adopting when you choose a vector DB.**

Next article candidates: **RAG evaluation systems deep dive (Ragas, DeepEval, TruLens)**, **embedding model comparison (OpenAI vs Cohere vs Voyage vs BGE)**, **hybrid search ranking tuning guide**.

> "A vector DB is not a library but an opinion. Knowing that you are choosing an opinion is the first step in tool selection."

— Vector Databases 2026, end.