Complete Guide to Qdrant Vector DB Operations — From Collection Design to RAG Integration


Introduction

As LLM-powered applications continue to proliferate, vector databases have become essential infrastructure rather than optional tooling. Among them, Qdrant stands out: written in Rust for high performance, supporting both gRPC and REST APIs, and offering production-grade features like payload filtering and multi-tenancy. This guide covers everything you need to deploy Qdrant in production — from collection design, CRUD operations, and filtering to RAG integration and operational monitoring.


1. Core Concepts

What Is a Vector Database?

A vector database stores high-dimensional vectors (embeddings) and performs similarity search in milliseconds. Text, images, and audio are transformed into vectors via embedding models, stored in the database, and then queried by finding the nearest vectors to a given query vector.
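The core operation can be illustrated with a toy brute-force search in plain Python (the corpus, vectors, and `k` here are made up for illustration; a real vector database uses approximate indexes like HNSW instead of scanning every point):

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query: list[float], corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    # Rank stored vectors by similarity to the query (brute force, O(n))
    scored = sorted(corpus.items(), key=lambda kv: cosine_sim(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
print(nearest([1.0, 0.05, 0.0], corpus, k=2))  # -> ['doc_a', 'doc_b']
```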

Key Features of Qdrant

  • Rust-based: Achieves memory safety and high throughput simultaneously
  • Payload: Store JSON metadata alongside vectors for filtering
  • Multi-tenancy: Isolate tenant data within a single collection
  • Quantization: Reduce memory usage by up to 4x (Scalar) or 32x (Binary)
  • Distributed mode: Cluster support with Raft consensus protocol

Distance Metric Comparison

| Metric | Formula | Use Case | Range |
|---|---|---|---|
| Cosine | 1 - cos(a, b) | Text embeddings (OpenAI, Cohere) | 0 to 2 |
| Euclid | sqrt(sum((a_i - b_i)^2)) | Image feature vectors, coordinate-based | 0 to +inf |
| Dot Product | -(a · b) | Normalized embeddings, recommendation systems | -inf to +inf |
| Manhattan | sum(abs(a_i - b_i)) | Sparse vectors, specialized domains | 0 to +inf |

Tip: OpenAI's text-embedding-3-small/large returns normalized vectors, so Cosine and Dot Product yield identical results. In this case, Dot Product is computationally faster.
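A quick sanity check of that tip in plain Python: for unit-length vectors, cosine distance (1 - cos) equals 1 minus the dot product, so both metrics produce the same ranking (the vectors here are arbitrary examples):

```python
import math

def normalize(v: list[float]) -> list[float]:
    # Scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine_distance(a: list[float], b: list[float]) -> float:
    # 1 - cos(a, b), matching the table above
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 1.0, 2.0])
b = normalize([2.0, 2.0, 1.0])

# For unit vectors, cosine distance is exactly 1 - dot product
print(abs(cosine_distance(a, b) - (1 - dot_product(a, b))) < 1e-9)  # True
```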

2. Architecture

Key Components

  • Segment: The physical unit that stores vectors and payloads separately. Acts as the basic unit for parallel search
  • WAL (Write-Ahead Log): Ensures durability for write operations. Replayed during crash recovery
  • HNSW Index: Approximate Nearest Neighbor index based on Hierarchical Navigable Small World graphs
  • Payload Index: Secondary index for metadata filtering (keyword, integer, geo, etc.)

Cluster Mode

In cluster mode, data is distributed across Shards, and the Raft consensus protocol maintains metadata consistency. Each shard can have replicas across multiple nodes for high availability. To set up a cluster with Docker Compose, set the environment variable QDRANT__CLUSTER__ENABLED=true and configure P2P ports. From the second node onward, use the --bootstrap flag pointing to the first node.
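As a rough sketch, a two-node cluster can be started with plain `docker run` commands like the following. Only the `QDRANT__CLUSTER__ENABLED` variable and the `--bootstrap` flag come from the description above; the image tag, container names, network, and the first node's `--uri` flag are illustrative assumptions to verify against the Qdrant distributed-deployment docs:

```shell
# Shared network so nodes can reach each other's P2P port (6335)
docker network create qdrant-net

# First node: enable cluster mode and advertise its own P2P URI
# (--uri value is an assumption based on the container name)
docker run -d --name qdrant-node-1 --network qdrant-net \
  -p 6333:6333 \
  -e QDRANT__CLUSTER__ENABLED=true \
  qdrant/qdrant ./qdrant --uri http://qdrant-node-1:6335

# From the second node onward: join via --bootstrap pointing at node 1
docker run -d --name qdrant-node-2 --network qdrant-net \
  -e QDRANT__CLUSTER__ENABLED=true \
  qdrant/qdrant ./qdrant --bootstrap http://qdrant-node-1:6335
```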

3. Collection Design and Index Strategy

Creating a Collection

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff, OptimizersConfigDiff
from qdrant_client.models import ScalarQuantization, ScalarQuantizationConfig, ScalarType

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=1536,              # OpenAI text-embedding-3-small dimensions
        distance=Distance.COSINE,
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                   # Graph connections
        ef_construct=128,       # Search width during index build
    ),
    optimizers_config=OptimizersConfigDiff(
        indexing_threshold=20000,
    ),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, quantile=0.99, always_ram=True),
    ),
)

HNSW Parameter Guide

| Parameter | Default | Recommended Range | Description |
|---|---|---|---|
| m | 16 | 8 to 64 | Graph connections per node. Higher improves recall but increases memory |
| ef_construct | 100 | 64 to 512 | Index build quality. Higher is more accurate but slower |
| ef (search) | 128 | 64 to 512 | Search-time width. Recall vs. latency trade-off |

Quantization Strategy

# Scalar Quantization: float32 -> int8 (4x memory reduction)
# Suitable for general text search, recall loss < 1%

# Binary Quantization: float32 -> 1-bit (32x memory reduction)
# Very aggressive. Only use with high dimensions (1536+). Requires oversampling
from qdrant_client.models import BinaryQuantization, BinaryQuantizationConfig

client.update_collection(
    collection_name="documents",
    quantization_config=BinaryQuantization(
        binary=BinaryQuantizationConfig(
            always_ram=True,
        ),
    ),
)
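Because Binary Quantization requires oversampling, a typical query fetches extra quantized candidates and rescores them against the original float32 vectors. A hedged sketch using qdrant-client's `QuantizationSearchParams` (the oversampling value is illustrative, and `query_vector` is assumed to be a prepared 1536-dim embedding):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import SearchParams, QuantizationSearchParams

client = QdrantClient(host="localhost", port=6333)

# query_vector: assumed to be a prepared 1536-dim embedding
results = client.search(
    collection_name="documents",
    query_vector=query_vector,
    limit=10,
    search_params=SearchParams(
        quantization=QuantizationSearchParams(
            ignore=False,      # use the quantized index
            rescore=True,      # rescore candidates with original vectors
            oversampling=2.0,  # fetch limit * 2 candidates before rescoring
        ),
    ),
)
```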

4. CRUD Operations

Inserting Vectors (Upsert)

from qdrant_client.models import PointStruct
import openai

def get_embedding(text: str) -> list[float]:
    resp = openai.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

# Single upsert
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=get_embedding("Qdrant is a vector DB written in Rust."),
            payload={"title": "Qdrant Introduction", "category": "database", "tenant_id": "team-alpha"},
        ),
    ],
)

# Batch upsert — process in batches of 100
# (assumes `points` is a prepared list of PointStruct objects)
BATCH_SIZE = 100
for i in range(0, len(points), BATCH_SIZE):
    client.upsert(collection_name="documents", points=points[i : i + BATCH_SIZE])

Vector Search

from qdrant_client.models import SearchParams

results = client.search(
    collection_name="documents",
    query_vector=get_embedding("vector database performance optimization"),
    limit=10,
    score_threshold=0.7,
    search_params=SearchParams(
        hnsw_ef=128,       # Search-time ef value (controls accuracy)
        exact=False,       # True for brute-force (accurate but slow)
    ),
    with_payload=True,
    with_vectors=False,    # Don't return vectors (reduce response size)
)

for result in results:
    print(f"ID: {result.id}, Score: {result.score:.4f}")
    print(f"  Title: {result.payload['title']}")

Search via REST API

curl -X POST "http://localhost:6333/collections/documents/points/search" \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.1, 0.2, ...],
    "limit": 10,
    "with_payload": true,
    "params": {
      "hnsw_ef": 128
    }
  }'

Update and Delete

from qdrant_client.models import PointIdsList, FilterSelector, Filter, FieldCondition, MatchValue

# Update payload
client.set_payload(collection_name="documents", payload={"category": "vector-database"}, points=[1, 2, 3])

# Delete by ID
client.delete(collection_name="documents", points_selector=PointIdsList(points=[10, 11, 12]))

# Delete by filter
client.delete(
    collection_name="documents",
    points_selector=FilterSelector(
        filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="deprecated"))])
    ),
)

5. Filtering and Payloads

Creating Payload Indexes

To achieve good filtering performance, you must create payload indexes for filtered fields.

from qdrant_client.models import PayloadSchemaType, TextIndexParams, TokenizerType

# Keyword index (exact matching)
client.create_payload_index(collection_name="documents", field_name="category", field_schema=PayloadSchemaType.KEYWORD)

# Integer index (range queries)
client.create_payload_index(collection_name="documents", field_name="page_count", field_schema=PayloadSchemaType.INTEGER)

# Text index (full-text search with multilingual tokenizer)
client.create_payload_index(
    collection_name="documents",
    field_name="content",
    field_schema=TextIndexParams(type="text", tokenizer=TokenizerType.MULTILINGUAL, min_token_len=2, max_token_len=20),
)
Filtered Search

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Search for documents where category is "database" and page_count >= 10
results = client.search(
    collection_name="documents",
    query_vector=get_embedding("vector index optimization"),
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="database"),
            ),
            FieldCondition(
                key="page_count",
                range=Range(gte=10),
            ),
        ],
        must_not=[
            FieldCondition(
                key="status",
                match=MatchValue(value="archived"),
            ),
        ],
    ),
    limit=5,
)

Hybrid Search (Vector + Text)

Qdrant supports Sparse Vectors to combine BM25-style keyword search with dense vector search.

from qdrant_client.models import SparseVector, SparseVectorParams, Prefetch, FusionQuery, Fusion

# Create collection with dense + sparse vector configuration
client.create_collection(
    collection_name="hybrid_docs",
    vectors_config={"dense": VectorParams(size=1536, distance=Distance.COSINE)},
    sparse_vectors_config={"sparse": SparseVectorParams()},
)

# Hybrid search with Reciprocal Rank Fusion
results = client.query_points(
    collection_name="hybrid_docs",
    prefetch=[
        Prefetch(query=get_embedding("vector search performance"), using="dense", limit=20),
        Prefetch(query=SparseVector(indices=[1, 42, 1337], values=[0.9, 0.3, 0.7]), using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10,
)
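To make the fusion step concrete, here is a minimal pure-Python version of Reciprocal Rank Fusion, the same strategy that `Fusion.RRF` applies server-side (the document IDs and the conventional damping constant k = 60 are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
    # where rank is 1-based and k dampens the weight of top positions.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc1", "doc2", "doc3"]   # ranking from the dense vector search
sparse = ["doc3", "doc1", "doc4"]  # ranking from the sparse/keyword search

# doc1 and doc3 appear in both lists, so they rise to the top
print(rrf_fuse([dense, sparse]))  # -> ['doc1', 'doc3', 'doc2', 'doc4']
```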

6. RAG Pipeline Integration

LangChain Integration

from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(documents)  # `documents`: a list of loaded Document objects

# 2. Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = QdrantVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="rag_documents",
    force_recreate=True,
)

# 3. Build RAG chain
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={
        "k": 5,
        "score_threshold": 0.7,
    },
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

# 4. Query
response = qa_chain.invoke({"query": "How do I tune HNSW index parameters in Qdrant?"})
print(response["result"])

LlamaIndex Integration

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
import qdrant_client

# 1. Qdrant client & vector store
qclient = qdrant_client.QdrantClient(host="localhost", port=6333)

vector_store = QdrantVectorStore(
    client=qclient,
    collection_name="llamaindex_docs",
)

# 2. Configure settings
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# 3. Load and index documents
documents = SimpleDirectoryReader("./data").load_data()

from llama_index.core import StorageContext

# Attach the Qdrant vector store via a StorageContext
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
)

# 4. Query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
)

response = query_engine.query("How can I combine filtering and vector search in Qdrant?")
print(response)

7. Operations Checklist

Monitoring

Qdrant exposes Prometheus-format metrics at the /metrics endpoint.

# Check metrics
curl http://localhost:6333/metrics

# Key metrics
# qdrant_grpc_responses_total        - gRPC request count
# qdrant_rest_responses_total        - REST request count
# qdrant_collection_points_total     - Points per collection
# qdrant_collection_search_latency   - Search latency

Operations Checklist Table

| Item | What to Check | Frequency |
|---|---|---|
| Memory | Keep RAM usage below 80% | Daily |
| Disk | Maintain at least 20% free storage | Daily |
| Search Latency | p99 latency < 100ms | Real-time |
| Index Status | optimizer_status = "ok" | Hourly |
| Snapshots | Verify automatic snapshots are created | Daily |
| Replicas | All shard replicas in active state | Hourly |
| WAL Size | Monitor WAL directory for abnormal growth | Daily |
| API Error Rate | 5xx response ratio < 0.1% | Real-time |
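Several of these checks can be automated with a small helper. The function below evaluates checklist thresholds against a flat metrics dict; the field names are hypothetical and would need to be mapped from your Prometheus scrape or `client.get_collection()` output:

```python
def check_health(metrics: dict) -> list[str]:
    # Evaluate the operations-checklist thresholds; returns a list of violations.
    # Field names are illustrative placeholders, not Qdrant metric names.
    alerts = []
    if metrics.get("ram_used_pct", 0) > 80:
        alerts.append("RAM usage above 80%")
    if metrics.get("disk_free_pct", 100) < 20:
        alerts.append("free storage below 20%")
    if metrics.get("p99_latency_ms", 0) >= 100:
        alerts.append("p99 search latency >= 100ms")
    if metrics.get("optimizer_status") != "ok":
        alerts.append("optimizer not in 'ok' state")
    if metrics.get("error_rate_5xx", 0) > 0.001:
        alerts.append("5xx error rate above 0.1%")
    return alerts

sample = {"ram_used_pct": 85, "disk_free_pct": 30, "p99_latency_ms": 40,
          "optimizer_status": "ok", "error_rate_5xx": 0.0}
print(check_health(sample))  # -> ['RAM usage above 80%']
```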

Snapshots and Backup

# Create and list snapshots
client.create_snapshot(collection_name="documents")
snapshots = client.list_snapshots(collection_name="documents")
for snap in snapshots:
    print(f"Name: {snap.name}, Size: {snap.size}")

# Download snapshot via REST API
curl -o snapshot.tar "http://localhost:6333/collections/documents/snapshots/snapshot-2026-03-09.snapshot"

# Full storage snapshot (all collections)
curl -X POST "http://localhost:6333/snapshots"

8. Common Mistakes

  1. Using filter search without payload indexes: Without payload indexes, Qdrant sequentially scans all points. Always create indexes for fields used in filters.

  2. Setting HNSW ef too low: Lowering it below the default (128) causes a sharp drop in recall. Keep the default unless you have specific latency requirements.

  3. Vector dimension mismatch: If the collection's size doesn't match the embedding model's dimensions, upsert operations will fail. text-embedding-3-small outputs 1536 dimensions; text-embedding-3-large outputs 3072.

  4. Batch size too large: Upserting more than 10,000 points at once can cause memory spikes. Process in batches of 100 to 500.

  5. Skipping tests after applying quantization: Scalar Quantization has minimal recall loss, but Binary Quantization requires mandatory recall benchmarking before production deployment.

  6. Storing all data in a single collection: Mixing data from different embedding models or dimensions in one collection degrades search quality. Separate collections by use case.

  7. No snapshot backup configured: WAL alone cannot recover from disk failures. Store regular snapshots in external storage (S3, etc.).

  8. Using only REST instead of gRPC: For bulk data operations, gRPC is 2-3x faster than REST. In the Python client, set prefer_grpc=True to automatically use gRPC.

# Recommended: use gRPC
client = QdrantClient(
    host="localhost",
    port=6333,
    grpc_port=6334,
    prefer_grpc=True,
)

9. Summary

Qdrant combines Rust's performance with a rich feature set, making it particularly well-suited for production RAG pipelines thanks to payload filtering, quantization, and hybrid search capabilities. When designing collections, carefully choose HNSW parameters and quantization strategies, and make full use of payload indexes. For operations, always configure Prometheus metrics monitoring and regular snapshots, and plan your cluster mode and sharding strategy ahead of traffic growth.

Quiz

Q1: What programming language is Qdrant written in, and what advantages does this provide?

Qdrant is written in Rust. Rust guarantees memory safety without a garbage collector, enabling both high throughput and low latency simultaneously. This is particularly beneficial for large-scale vector search workloads.

Q2: What is the difference between Cosine and Dot Product distance metrics, and when should you choose Dot Product?

Cosine distance measures directional similarity between two vectors and normalizes for magnitude. Dot Product computes the inner product without normalization. When using pre-normalized vectors (e.g., OpenAI embeddings), both metrics produce identical results, so Dot Product is preferred because it is computationally faster.

Q3: What does the HNSW m parameter control, and what trade-offs occur when increasing it?

The m parameter determines the number of connections (neighbors) per node in the HNSW graph. Increasing it improves search recall (accuracy) but increases memory consumption and index build time. It is typically set within the range of 8 to 64.

Q4: What is the difference between Scalar Quantization and Binary Quantization?

Scalar Quantization converts float32 to int8, reducing memory by approximately 4x with less than 1% recall loss, making it suitable for general text search. Binary Quantization converts float32 to 1-bit, reducing memory by 32x but with significant recall loss, requiring mandatory oversampling and benchmarking. It is recommended only for high-dimensional vectors (1536+).

Q5: What happens if you perform filtered searches without creating payload indexes?

Without payload indexes, Qdrant must sequentially scan the payload of every point to evaluate filter conditions. As data volume grows, search latency increases linearly, causing severe performance degradation in production environments.

Q6: How do you implement hybrid search (vector + keyword) in Qdrant?

Use Qdrant's Sparse Vector feature to combine BM25-style keyword search with dense vector search. Configure both dense and sparse vectors in the collection, use Prefetch to retrieve results from each search, then merge results using a fusion strategy like Reciprocal Rank Fusion (RRF).

Q7: How are data distribution and consistency managed in Qdrant's cluster mode?

In cluster mode, data is distributed across nodes at the Shard level. Metadata consistency is maintained through the Raft consensus protocol, and each shard can have replicas across multiple nodes to ensure high availability.

Q8: What are the benefits of using gRPC instead of REST, and how do you configure it in the Python client?

gRPC uses Protocol Buffers binary serialization, delivering 2-3x faster processing compared to REST (JSON). In the Python client, set grpc_port=6334 and prefer_grpc=True when creating the QdrantClient to automatically use gRPC.