Complete Guide to Qdrant Vector DB Operations — From Collection Design to RAG Integration
- Introduction
- 1. Core Concepts
- 2. Architecture
- 3. Collection Design and Index Strategy
- 4. CRUD and Search
- 5. Filtering and Payloads
- 6. RAG Pipeline Integration
- 7. Operations Checklist
- 8. Common Mistakes
- 9. Summary
- Quiz
Introduction
As LLM-powered applications continue to proliferate, vector databases have become essential infrastructure rather than optional tooling. Among them, Qdrant stands out: written in Rust for high performance, supporting both gRPC and REST APIs, and offering production-grade features like payload filtering and multi-tenancy. This guide covers everything you need to deploy Qdrant in production — from collection design, CRUD operations, and filtering to RAG integration and operational monitoring.
1. Core Concepts
What Is a Vector Database?
A vector database stores high-dimensional vectors (embeddings) and performs similarity search in milliseconds. Text, images, and audio are transformed into vectors via embedding models, stored in the database, and then queried by finding the nearest vectors to a given query vector.
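To make "finding the nearest vectors" concrete, here is a minimal, self-contained sketch of similarity search in plain Python. The 3-dimensional toy vectors stand in for real embeddings; in practice an embedding model produces the vectors and a database like Qdrant performs the search at scale:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — in reality these come from an embedding model
documents = {
    "doc_rust": [0.9, 0.1, 0.0],
    "doc_search": [0.7, 0.6, 0.1],
    "doc_cooking": [0.0, 0.2, 0.9],
}
query = [0.8, 0.4, 0.0]

# Nearest-neighbor search = rank documents by similarity to the query
ranked = sorted(documents, key=lambda d: cosine_similarity(query, documents[d]),
                reverse=True)
print(ranked)  # most similar document first
```

A real vector database does the same ranking, but over millions of high-dimensional vectors using an approximate index (such as HNSW) instead of an exhaustive scan.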
Key Features of Qdrant
- Rust-based: Achieves memory safety and high throughput simultaneously
- Payload: Store JSON metadata alongside vectors for filtering
- Multi-tenancy: Isolate tenant data within a single collection
- Quantization: Reduce memory usage by up to 4x (Scalar) or 32x (Binary)
- Distributed mode: Cluster support with Raft consensus protocol
Distance Metric Comparison
| Metric | Formula | Use Case | Range |
|---|---|---|---|
| Cosine | 1 - cos(a, b) | Text embeddings (OpenAI, Cohere) | 0 to 2 |
| Euclid | sqrt(sum((a_i - b_i)^2)) | Image feature vectors, coordinate-based | 0 to +inf |
| Dot Product | -a . b | Normalized embeddings, recommendation systems | -inf to +inf |
| Manhattan | sum(|a_i - b_i|) | Sparse vectors, specialized domains | 0 to +inf |
Tip: OpenAI's text-embedding-3-small/large models return normalized vectors, so Cosine and Dot Product yield identical results. In that case, Dot Product is computationally faster.
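The equivalence is easy to verify in plain Python: for unit-length vectors, cosine distance is exactly `1 - dot(a, b)`, so both metrics produce the same ranking (the vectors below are made up for illustration):

```python
import math

def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Qdrant's Cosine metric: 1 - cos(a, b)
    return 1 - sum(x * y for x, y in zip(normalize(a), normalize(b)))

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

query = normalize([0.3, 0.8, 0.5])
vectors = [normalize(v) for v in ([1.0, 0.2, 0.1], [0.2, 0.9, 0.4], [0.1, 0.1, 1.0])]

by_cosine = sorted(range(3), key=lambda i: cosine_distance(query, vectors[i]))
by_dot = sorted(range(3), key=lambda i: dot(query, vectors[i]), reverse=True)
print(by_cosine == by_dot)  # True — identical ranking on unit vectors
```

The Dot Product path skips the per-comparison normalization, which is where the speed advantage comes from.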
2. Architecture
Key Components
- Segment: The physical unit that stores vectors and payloads separately. Acts as the basic unit for parallel search
- WAL (Write-Ahead Log): Ensures durability for write operations. Replayed during crash recovery
- HNSW Index: Approximate Nearest Neighbor index based on Hierarchical Navigable Small World graphs
- Payload Index: Secondary index for metadata filtering (keyword, integer, geo, etc.)
Cluster Mode
In cluster mode, data is distributed across Shards, and the Raft consensus protocol maintains metadata consistency. Each shard can have replicas across multiple nodes for high availability. To set up a cluster with Docker Compose, set the environment variable QDRANT__CLUSTER__ENABLED=true and configure P2P ports. From the second node onward, use the --bootstrap flag pointing to the first node.
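A minimal two-node sketch using plain docker run (the network and container names here are illustrative, and the same settings translate directly into a Docker Compose file):

```shell
docker network create qdrant-net

# Node 1: enable cluster mode and announce its own P2P address (port 6335)
docker run -d --name qdrant-node1 --network qdrant-net \
  -p 6333:6333 -p 6334:6334 -p 6335:6335 \
  -e QDRANT__CLUSTER__ENABLED=true \
  qdrant/qdrant ./qdrant --uri "http://qdrant-node1:6335"

# Node 2 onward: bootstrap against the first node's P2P address
docker run -d --name qdrant-node2 --network qdrant-net \
  -e QDRANT__CLUSTER__ENABLED=true \
  qdrant/qdrant ./qdrant --bootstrap "http://qdrant-node1:6335"
```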
3. Collection Design and Index Strategy
Creating a Collection
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff, OptimizersConfigDiff
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType
client = QdrantClient(host="localhost", port=6333)
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(
size=1536, # OpenAI text-embedding-3-small dimensions
distance=Distance.COSINE,
),
hnsw_config=HnswConfigDiff(
m=16, # Graph connections
ef_construct=128, # Search width during index build
),
optimizers_config=OptimizersConfigDiff(
indexing_threshold=20000,
),
quantization_config=ScalarQuantization(
scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True),
),
)
HNSW Parameter Guide
| Parameter | Default | Recommended Range | Description |
|---|---|---|---|
| m | 16 | 8 to 64 | Graph connections per node. Higher improves recall but increases memory |
| ef_construct | 100 | 64 to 512 | Index build quality. Higher is more accurate but slower |
| ef (search) | 128 | 64 to 512 | Search-time width. Recall vs. latency trade-off |
Quantization Strategy
# Scalar Quantization: float32 -> int8 (4x memory reduction)
# Suitable for general text search, recall loss < 1%
# Binary Quantization: float32 -> 1-bit (32x memory reduction)
# Very aggressive. Only use with high dimensions (1536+). Requires oversampling
from qdrant_client.models import BinaryQuantization, BinaryQuantizationConfig
client.update_collection(
collection_name="documents",
quantization_config=BinaryQuantization(
binary=BinaryQuantizationConfig(
always_ram=True,
),
),
)
4. CRUD and Search
Inserting Vectors (Upsert)
from qdrant_client.models import PointStruct
import openai
def get_embedding(text: str) -> list[float]:
resp = openai.embeddings.create(model="text-embedding-3-small", input=text)
return resp.data[0].embedding
# Single upsert
client.upsert(
collection_name="documents",
points=[
PointStruct(
id=1,
vector=get_embedding("Qdrant is a vector DB written in Rust."),
payload={"title": "Qdrant Introduction", "category": "database", "tenant_id": "team-alpha"},
),
],
)
# Batch upsert — `points` is a list of PointStruct built as above; process in batches of 100
BATCH_SIZE = 100
for i in range(0, len(points), BATCH_SIZE):
    client.upsert(collection_name="documents", points=points[i : i + BATCH_SIZE])
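The slicing loop can be factored into a small, reusable helper (pure Python, independent of the Qdrant client):

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def batched(items: Sequence[T], batch_size: int = 100) -> Iterator[Sequence[T]]:
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

# Usage with the client above:
# for batch in batched(points, 100):
#     client.upsert(collection_name="documents", points=batch)

sizes = [len(b) for b in batched(list(range(250)), 100)]
print(sizes)  # [100, 100, 50]
```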
Similarity Search
from qdrant_client.models import SearchParams
results = client.search(
collection_name="documents",
query_vector=get_embedding("vector database performance optimization"),
limit=10,
score_threshold=0.7,
search_params=SearchParams(
hnsw_ef=128, # Search-time ef value (controls accuracy)
exact=False, # True for brute-force (accurate but slow)
),
with_payload=True,
with_vectors=False, # Don't return vectors (reduce response size)
)
for result in results:
print(f"ID: {result.id}, Score: {result.score:.4f}")
print(f" Title: {result.payload['title']}")
Search via REST API
curl -X POST "http://localhost:6333/collections/documents/points/search" \
-H "Content-Type: application/json" \
-d '{
"vector": [0.1, 0.2, ...],
"limit": 10,
"with_payload": true,
"params": {
"hnsw_ef": 128
}
}'
Update and Delete
from qdrant_client.models import PointIdsList, FilterSelector, Filter, FieldCondition, MatchValue
# Update payload
client.set_payload(collection_name="documents", payload={"category": "vector-database"}, points=[1, 2, 3])
# Delete by ID
client.delete(collection_name="documents", points_selector=PointIdsList(points=[10, 11, 12]))
# Delete by filter
client.delete(
collection_name="documents",
points_selector=FilterSelector(
filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="deprecated"))])
),
)
5. Filtering and Payloads
Creating Payload Indexes
To achieve good filtering performance, you must create payload indexes for filtered fields.
from qdrant_client.models import PayloadSchemaType, TextIndexParams, TokenizerType
# Keyword index (exact matching)
client.create_payload_index(collection_name="documents", field_name="category", field_schema=PayloadSchemaType.KEYWORD)
# Integer index (range queries)
client.create_payload_index(collection_name="documents", field_name="page_count", field_schema=PayloadSchemaType.INTEGER)
# Text index (full-text search with multilingual tokenizer)
client.create_payload_index(
collection_name="documents",
field_name="content",
field_schema=TextIndexParams(type="text", tokenizer=TokenizerType.MULTILINGUAL, min_token_len=2, max_token_len=20),
)
Compound Filter Search
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
# Search for documents where category is "database" and page_count >= 10
results = client.search(
collection_name="documents",
query_vector=get_embedding("vector index optimization"),
query_filter=Filter(
must=[
FieldCondition(
key="category",
match=MatchValue(value="database"),
),
FieldCondition(
key="page_count",
range=Range(gte=10),
),
],
must_not=[
FieldCondition(
key="status",
match=MatchValue(value="archived"),
),
],
),
limit=5,
)
Hybrid Search (Vector + Text)
Qdrant supports Sparse Vectors to combine BM25-style keyword search with dense vector search.
from qdrant_client.models import SparseVector, SparseVectorParams, Prefetch, FusionQuery, Fusion
# Create collection with dense + sparse vector configuration
client.create_collection(
collection_name="hybrid_docs",
vectors_config={"dense": VectorParams(size=1536, distance=Distance.COSINE)},
sparse_vectors_config={"sparse": SparseVectorParams()},
)
# Hybrid search with Reciprocal Rank Fusion
results = client.query_points(
collection_name="hybrid_docs",
prefetch=[
Prefetch(query=get_embedding("vector search performance"), using="dense", limit=20),
Prefetch(query=SparseVector(indices=[1, 42, 1337], values=[0.9, 0.3, 0.7]), using="sparse", limit=20),
],
query=FusionQuery(fusion=Fusion.RRF),
limit=10,
)
6. RAG Pipeline Integration
LangChain Integration
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
# 1. Split documents
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(documents)
# 2. Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = QdrantVectorStore.from_documents(
documents=chunks,
embedding=embeddings,
url="http://localhost:6333",
collection_name="rag_documents",
force_recreate=True,
)
# 3. Build RAG chain
retriever = vectorstore.as_retriever(
search_type="similarity_score_threshold",
search_kwargs={
"k": 5,
"score_threshold": 0.7,
},
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
return_source_documents=True,
)
# 4. Query
response = qa_chain.invoke({"query": "How do I tune HNSW index parameters in Qdrant?"})
print(response["result"])
LlamaIndex Integration
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
import qdrant_client
# 1. Qdrant client & vector store
qclient = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(
client=qclient,
collection_name="llamaindex_docs",
)
# 2. Configure settings
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.chunk_size = 512
Settings.chunk_overlap = 50
# 3. Load and index documents — the Qdrant store must be wired in
#    through a StorageContext, not passed directly to from_documents
from llama_index.core import StorageContext
documents = SimpleDirectoryReader("./data").load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
# 4. Query engine
query_engine = index.as_query_engine(
similarity_top_k=5,
)
response = query_engine.query("How can I combine filtering and vector search in Qdrant?")
print(response)
7. Operations Checklist
Monitoring
Qdrant exposes Prometheus-format metrics at the /metrics endpoint.
# Check metrics
curl http://localhost:6333/metrics
# Key metrics
# qdrant_grpc_responses_total - gRPC request count
# qdrant_rest_responses_total - REST request count
# qdrant_collection_points_total - Points per collection
# qdrant_collection_search_latency - Search latency
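A tiny sketch of consuming that endpoint without a full Prometheus stack. The parser below handles simple `name value` exposition lines; the sample text is illustrative, and the actual HTTP fetch (e.g. via `urllib`) is left as a comment so the snippet stays self-contained:

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse simple `name value` lines from Prometheus exposition format."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore lines that are not name/value pairs
    return metrics

# In production:
# text = urllib.request.urlopen("http://localhost:6333/metrics").read().decode()
sample = """
# HELP qdrant_collections_total number of collections
qdrant_collections_total 3
qdrant_rest_responses_total{method="POST"} 1024
"""
metrics = parse_prometheus_text(sample)
print(metrics["qdrant_collections_total"])  # 3.0
```

From here, the parsed values can feed alert thresholds like the ones in the checklist table below.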
Operations Checklist Table
| Item | What to Check | Frequency |
|---|---|---|
| Memory | Keep RAM usage below 80% | Daily |
| Disk | Maintain at least 20% free storage | Daily |
| Search Latency | p99 latency < 100ms | Real-time |
| Index Status | optimizer_status = "ok" | Hourly |
| Snapshots | Verify automatic snapshots are created | Daily |
| Replicas | All shard replicas in active state | Hourly |
| WAL Size | Monitor WAL directory for abnormal growth | Daily |
| API Error Rate | 5xx response ratio < 0.1% | Real-time |
Snapshots and Backup
# Create and list snapshots
client.create_snapshot(collection_name="documents")
snapshots = client.list_snapshots(collection_name="documents")
for snap in snapshots:
print(f"Name: {snap.name}, Size: {snap.size}")
# Download snapshot via REST API
curl -o snapshot.tar "http://localhost:6333/collections/documents/snapshots/snapshot-2026-03-09.snapshot"
# Full storage snapshot (all collections)
curl -X POST "http://localhost:6333/snapshots"
8. Common Mistakes
Using filter search without payload indexes: Without payload indexes, Qdrant sequentially scans all points. Always create indexes for fields used in filters.
Setting HNSW ef too low: Lowering it below the default (128) causes a sharp drop in recall. Keep the default unless you have specific latency requirements.
Vector dimension mismatch: If the collection's size doesn't match the embedding model's dimensions, upsert operations will fail. text-embedding-3-small outputs 1536 dimensions; text-embedding-3-large outputs 3072.
Batch size too large: Upserting more than 10,000 points at once can cause memory spikes. Process in batches of 100 to 500.
Skipping tests after applying quantization: Scalar Quantization has minimal recall loss, but Binary Quantization requires mandatory recall benchmarking before production deployment.
Storing all data in a single collection: Mixing data from different embedding models or dimensions in one collection degrades search quality. Separate collections by use case.
No snapshot backup configured: WAL alone cannot recover from disk failures. Store regular snapshots in external storage (S3, etc.).
Using only REST instead of gRPC: For bulk data operations, gRPC is 2-3x faster than REST. In the Python client, set prefer_grpc=True to automatically use gRPC.
# Recommended: use gRPC
client = QdrantClient(
host="localhost",
port=6333,
grpc_port=6334,
prefer_grpc=True,
)
9. Summary
Qdrant combines Rust's performance with a rich feature set, making it particularly well-suited for production RAG pipelines thanks to payload filtering, quantization, and hybrid search capabilities. When designing collections, carefully choose HNSW parameters and quantization strategies, and make full use of payload indexes. For operations, always configure Prometheus metrics monitoring and regular snapshots, and plan your cluster mode and sharding strategy ahead of traffic growth.
Quiz
Q1: What programming language is Qdrant written in, and what advantages does this provide?
Qdrant is written in Rust. Rust guarantees memory safety without a garbage collector, enabling both high throughput and low latency simultaneously. This is particularly beneficial for large-scale vector search workloads.
Q2: What is the difference between Cosine and Dot Product distance metrics, and when should you choose Dot Product?
Cosine distance measures directional similarity between two vectors and normalizes for magnitude. Dot Product computes the inner product without normalization. When using pre-normalized vectors (e.g., OpenAI embeddings), both metrics produce identical results, so Dot Product is preferred because it is computationally faster.
Q3: What does the HNSW m parameter control, and what trade-offs occur when increasing it?
The m parameter determines the number of connections (neighbors) per node in the HNSW graph. Increasing it improves search recall (accuracy) but increases memory consumption and index build time. It is typically set within the range of 8 to 64.
Q4: What is the difference between Scalar Quantization and Binary Quantization?
Scalar Quantization converts float32 to int8, reducing memory by approximately 4x with less than 1% recall loss, making it suitable for general text search. Binary Quantization converts float32 to 1-bit, reducing memory by 32x but with significant recall loss, requiring mandatory oversampling and benchmarking. It is recommended only for high-dimensional vectors (1536+).
Q5: What happens if you perform filtered searches without creating payload indexes?
Without payload indexes, Qdrant must sequentially scan the payload of every point to evaluate filter conditions. As data volume grows, search latency increases linearly, causing severe performance degradation in production environments.
Q6: How do you implement hybrid search (vector + keyword) in Qdrant?
Use Qdrant's Sparse Vector feature to combine BM25-style keyword search with dense vector search. Configure both dense and sparse vectors in the collection, use Prefetch to retrieve results from each search, then merge results using a fusion strategy like Reciprocal Rank Fusion (RRF).
Q7: How are data distribution and consistency managed in Qdrant's cluster mode?
In cluster mode, data is distributed across nodes at the Shard level. Metadata consistency is maintained through the Raft consensus protocol, and each shard can have replicas across multiple nodes to ensure high availability.
Q8: What are the benefits of using gRPC instead of REST, and how do you configure it in the Python client?
gRPC uses Protocol Buffers binary serialization, delivering 2-3x faster processing compared to REST (JSON). In the Python client, set grpc_port=6334 and prefer_grpc=True when creating the QdrantClient to automatically use gRPC.