Complete Guide to Embedding Models: Vector Search, RAG, and Sentence Transformers in Practice
- Introduction
- Fundamentals of Embeddings
- Comparing Key Embedding Models
- Using Sentence Transformers
- Vector Databases and Indexing
- Similarity Search and Semantic Search
- Embeddings in RAG Pipelines
- Fine-tuning Embedding Models
- Performance Optimization and Evaluation
- Conclusion
- References

Introduction
Embeddings are a foundational technology of modern AI systems. By converting unstructured data such as text, images, and audio into numerical vectors, they enable machines to understand and compare "meaning." With RAG (Retrieval-Augmented Generation) pipelines becoming the core architecture of LLM applications, the quality of embedding models has become a critical factor that determines overall system performance.
Since the advent of Word2Vec in 2013, the field has evolved rapidly through GloVe and FastText, then BERT-based sentence embeddings, and recently to instruction-tuned large-scale embedding models. In 2024-2025, models with significantly improved performance and multilingual support emerged, including OpenAI text-embedding-3, Cohere embed-v3, BGE-M3, and GTE-Qwen2, with fierce competition on the MTEB (Massive Text Embedding Benchmark) leaderboard.
This article systematically covers everything about embedding models, from fundamental principles to the latest model comparisons, vector database utilization, RAG integration, fine-tuning, and performance evaluation, all accompanied by practical code examples.
Fundamentals of Embeddings
What Are Embeddings?
An embedding is a technique that maps high-dimensional discrete data into a lower-dimensional continuous vector space. The core idea is to learn representations where semantically similar items are positioned close together in the vector space.
# Intuitive understanding: representing words as vectors
# "king" = [0.2, 0.8, 0.1, ...]
# "queen" = [0.3, 0.9, 0.1, ...]
# "man" = [0.1, 0.2, 0.8, ...]
# The famous vector arithmetic: king - man + woman ≈ queen
import numpy as np
king = np.array([0.2, 0.8, 0.1, 0.5])
man = np.array([0.1, 0.2, 0.8, 0.4])
woman = np.array([0.15, 0.25, 0.85, 0.6])
queen = np.array([0.3, 0.9, 0.1, 0.7])
result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen = {queen}")
# The two vectors are very similar
Geometric Intuition of Embeddings
In vector space, embeddings exhibit the following properties:
- Distance = Semantic Difference: Words/sentences with similar meanings are positioned at close distances
- Direction = Relationship: Specific directions encode specific semantic relationships (e.g., gender, tense, size)
- Clustering: Items belonging to the same topic or category naturally form clusters
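These three properties can be illustrated with toy vectors. Note the numbers below are illustrative, not outputs of a real embedding model:

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (illustrative values only)
cat = np.array([0.9, 0.1, 0.0])   # same topic as dog: animals
dog = np.array([0.8, 0.2, 0.1])
car = np.array([0.1, 0.9, 0.2])   # different topic

# Distance = semantic difference: the animal pair is closer than cat/car
assert np.linalg.norm(cat - dog) < np.linalg.norm(cat - car)
# Direction = relationship: cosine similarity captures shared direction
print(f"cos(cat, dog) = {cos(cat, dog):.3f}, cos(cat, car) = {cos(cat, car):.3f}")
# Clustering: same-topic vectors sit near their cluster centroid
centroid = (cat + dog) / 2
assert cos(cat, centroid) > cos(car, centroid)
```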
Evolution of Embeddings
| Generation | Model | Characteristics | Dimensions |
|---|---|---|---|
| 1st Gen (2013) | Word2Vec, GloVe | Static word embeddings, context-independent | 50-300 |
| 2nd Gen (2018) | ELMo, BERT | Context-dependent embeddings, bidirectional | 768-1024 |
| 3rd Gen (2019) | Sentence-BERT | Sentence-level embeddings, efficient similarity computation | 384-768 |
| 4th Gen (2023-) | E5, BGE, GTE | Instruction-tuned, multilingual, large-scale | 768-4096 |
| 5th Gen (2024-) | text-embedding-3, Matryoshka | Variable dimensions, multilingual, high-performance | 256-3072 |
Comparing Key Embedding Models
Commercial Embedding Models
| Model | Provider | Max Tokens | Dimensions | MTEB Average | Price (1M tokens) |
|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 8,191 | 3,072 | 64.6 | ~0.13 USD |
| text-embedding-3-small | OpenAI | 8,191 | 1,536 | 62.3 | ~0.02 USD |
| embed-v3.0 (English) | Cohere | 512 | 1,024 | 64.5 | ~0.10 USD |
| embed-v3.0 (Multilingual) | Cohere | 512 | 1,024 | 64.0 | ~0.10 USD |
| Voyage-3 | Voyage AI | 32,000 | 1,024 | 67.3 | ~0.06 USD |
Open-Source Embedding Models
| Model | Developer | Parameters | Dimensions | MTEB Average | Features |
|---|---|---|---|---|---|
| BGE-M3 | BAAI | 568M | 1,024 | 66.1 | Multilingual, Dense+Sparse+ColBERT |
| BGE-large-en-v1.5 | BAAI | 335M | 1,024 | 64.2 | English-optimized |
| E5-mistral-7b-instruct | Microsoft | 7B | 4,096 | 66.6 | LLM-based, high-performance |
| GTE-Qwen2-7B-instruct | Alibaba | 7B | 3,584 | 70.2 | Top MTEB ranking |
| Jina-embeddings-v3 | Jina AI | 572M | 1,024 | 65.5 | Multilingual, Task LoRA |
| nomic-embed-text-v1.5 | Nomic | 137M | 768 | 62.3 | Lightweight, 8192 tokens |
| mxbai-embed-large-v1 | Mixedbread | 335M | 1,024 | 64.7 | Matryoshka support |
Model Selection Criteria
# Model selection decision tree
def select_embedding_model(requirements):
    """Embedding model selection guide based on requirements"""
    if requirements.get("budget") == "unlimited":
        if requirements.get("max_performance"):
            return "GTE-Qwen2-7B-instruct (self-hosted) or Voyage-3 (API)"
        return "text-embedding-3-large (OpenAI API)"
    if requirements.get("multilingual"):
        if requirements.get("self_hosted"):
            return "BGE-M3 (Dense+Sparse hybrid)"
        return "Cohere embed-v3 multilingual"
    if requirements.get("low_latency"):
        if requirements.get("self_hosted"):
            return "nomic-embed-text-v1.5 (lightweight 137M)"
        return "text-embedding-3-small (OpenAI API)"
    if requirements.get("domain_specific"):
        return "Sentence Transformers + fine-tuning (base model: BGE or E5)"
    # Default recommendation
    return "text-embedding-3-small (cost-effective general-purpose choice)"
Using Sentence Transformers
Basic Usage
Sentence Transformers is the most widely used Python library for generating sentence-level embeddings.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load model
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
# Single sentence embedding
sentence = "Embedding models convert text into numerical vectors."
embedding = model.encode(sentence)
print(f"Dimensions: {embedding.shape}") # (1024,)
# Batch embeddings
sentences = [
"Embedding models convert text into numerical vectors.",
"Vector search quickly finds similar documents.",
"RAG is a retrieval-augmented generation technique.",
"The weather is very nice today.",
]
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(f"Embedding matrix shape: {embeddings.shape}") # (4, 1024)
# Similarity computation
from sentence_transformers.util import cos_sim
similarity_matrix = cos_sim(embeddings, embeddings)
print("Similarity matrix:")
print(similarity_matrix.numpy().round(3))
BGE-M3 Multilingual Embeddings
from sentence_transformers import SentenceTransformer
# BGE-M3: A multilingual embedding model supporting 100+ languages
model = SentenceTransformer('BAAI/bge-m3')
# Multilingual sentence embeddings
sentences = [
"Machine learning is transforming the world.", # English
"머신러닝이 세상을 변화시키고 있다.", # Korean
"機械学習が世界を変えている。", # Japanese
"机器学习正在改变世界。", # Chinese
]
embeddings = model.encode(sentences, normalize_embeddings=True)
# Cross-lingual similarity check
from sentence_transformers.util import cos_sim
similarities = cos_sim(embeddings, embeddings)
print("Cross-lingual similarity matrix:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"  '{s1[:30]}...' <-> '{s2[:30]}...': {similarities[i][j]:.4f}")
# Sentences with the same meaning in different languages show high similarity
Using the OpenAI Embedding API
from openai import OpenAI
import numpy as np
client = OpenAI()
def get_openai_embeddings(texts, model="text-embedding-3-small", dimensions=None):
    """Generate OpenAI embeddings (supports Matryoshka dimension reduction)"""
    kwargs = {"input": texts, "model": model}
    if dimensions:
        kwargs["dimensions"] = dimensions
    response = client.embeddings.create(**kwargs)
    return np.array([item.embedding for item in response.data])
# Basic usage
texts = ["Principles of embedding models", "Building vector search systems"]
embeddings_full = get_openai_embeddings(texts, model="text-embedding-3-large")
print(f"Full dimensions: {embeddings_full.shape}") # (2, 3072)
# Matryoshka: optimize cost/speed through dimension reduction
embeddings_256 = get_openai_embeddings(
texts, model="text-embedding-3-large", dimensions=256
)
print(f"Reduced dimensions: {embeddings_256.shape}") # (2, 256)
# Cosine similarity comparison
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sim_full = cosine_similarity(embeddings_full[0], embeddings_full[1])
sim_256 = cosine_similarity(embeddings_256[0], embeddings_256[1])
print(f"Full dimension similarity: {sim_full:.4f}")
print(f"256-dimension similarity: {sim_256:.4f}")
Vector Databases and Indexing
Vector Database Comparison
| Database | Type | Index Algorithm | Scalability | Filtering | Features |
|---|---|---|---|---|---|
| Pinecone | Fully managed | Proprietary | High | Metadata | Serverless, simple API |
| Weaviate | Open-source/Cloud | HNSW | High | GraphQL | Hybrid search, modular |
| Milvus | Open-source | HNSW, IVF, DiskANN | Very high | Attribute | GPU acceleration, large-scale |
| Chroma | Open-source | HNSW | Medium | Metadata | Lightweight, developer-friendly |
| FAISS | Library | IVF, PQ, HNSW | High | None (separate impl.) | Meta-developed, top performance |
| Qdrant | Open-source/Cloud | HNSW | High | Payload | Rust-based, high-performance |
| pgvector | PostgreSQL extension | IVFFlat, HNSW | Medium | SQL | Leverages existing PostgreSQL |
Understanding Indexing Algorithms
import faiss
import numpy as np
import time
# Generate test data
np.random.seed(42)
dimension = 1024
num_vectors = 1_000_000
num_queries = 100
# Generate normalized random vectors
data = np.random.randn(num_vectors, dimension).astype('float32')
faiss.normalize_L2(data)
queries = np.random.randn(num_queries, dimension).astype('float32')
faiss.normalize_L2(queries)
# 1. Flat Index (Exact Search, Brute-force)
print("=== Flat Index (Exact Search) ===")
index_flat = faiss.IndexFlatIP(dimension) # Inner Product (cosine similarity)
index_flat.add(data)
start = time.time()
D_exact, I_exact = index_flat.search(queries, 10)
flat_time = time.time() - start
print(f"Search time: {flat_time:.3f}s")
print(f"Recall@10: 1.000 (exact search)")
# 2. IVF (Inverted File Index)
print("\n=== IVF Index ===")
nlist = 1024 # Number of clusters
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(data)
index_ivf.add(data)
index_ivf.nprobe = 32 # Number of clusters to search
start = time.time()
D_ivf, I_ivf = index_ivf.search(queries, 10)
ivf_time = time.time() - start
# Calculate recall
recall = np.mean([len(set(I_ivf[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {ivf_time:.3f}s (x{flat_time/ivf_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")
# 3. HNSW (Hierarchical Navigable Small World)
print("\n=== HNSW Index ===")
# Pass the metric at construction time; assigning metric_type after the fact is unreliable
index_hnsw = faiss.IndexHNSWFlat(dimension, 32, faiss.METRIC_INNER_PRODUCT)  # M=32
index_hnsw.hnsw.efConstruction = 200
index_hnsw.hnsw.efSearch = 64
index_hnsw.add(data)
start = time.time()
D_hnsw, I_hnsw = index_hnsw.search(queries, 10)
hnsw_time = time.time() - start
recall = np.mean([len(set(I_hnsw[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {hnsw_time:.3f}s (x{flat_time/hnsw_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")
# 4. IVF-PQ (Product Quantization)
print("\n=== IVF-PQ Index (Memory Optimized) ===")
m = 64 # Number of sub-vectors
nbits = 8 # Codebook bits
index_ivfpq = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index_ivfpq.train(data)
index_ivfpq.add(data)
index_ivfpq.nprobe = 32
start = time.time()
D_pq, I_pq = index_ivfpq.search(queries, 10)
pq_time = time.time() - start
recall = np.mean([len(set(I_pq[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {pq_time:.3f}s (x{flat_time/pq_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")
print(f"Memory: Flat={data.nbytes/1e9:.1f}GB, PQ={index_ivfpq.sa_code_size()*num_vectors/1e9:.3f}GB")
Indexing Algorithm Comparison
| Algorithm | Search Speed | Recall | Memory Usage | Build Time | Best For |
|---|---|---|---|---|---|
| Flat | Slow | 100% | High | Instant | Small scale (under 100K) |
| IVF | Medium | 95-99% | High | Medium | Medium scale, frequent updates |
| HNSW | Fast | 97-99% | High+overhead | Slow | High-performance, read-heavy |
| IVF-PQ | Fast | 90-95% | Low | Medium | Large scale, memory-constrained |
| ScaNN | Very fast | 95-98% | Medium | Medium | Large scale, Google ecosystem |
Building a Vector Store with Chroma
import chromadb
from chromadb.utils import embedding_functions
# Create Chroma client
client = chromadb.PersistentClient(path="./chroma_db")
# Set up Sentence Transformers embedding function
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="BAAI/bge-m3"
)
# Create collection
collection = client.get_or_create_collection(
name="tech_documents",
embedding_function=embedding_fn,
metadata={"hnsw:space": "cosine"} # Use cosine similarity
)
# Add documents
documents = [
"Kubernetes is a container orchestration platform.",
"Docker is a tool for packaging applications into containers.",
"Prometheus is a metrics-based monitoring system.",
"Grafana is a data visualization and dashboard tool.",
"Terraform is an IaC tool for managing infrastructure as code.",
"Embedding models convert text into vectors.",
"RAG is a retrieval-augmented generation technique that reduces LLM hallucinations.",
]
collection.add(
documents=documents,
ids=[f"doc_{i}" for i in range(len(documents))],
metadatas=[
{"category": "kubernetes"},
{"category": "docker"},
{"category": "monitoring"},
{"category": "monitoring"},
{"category": "iac"},
{"category": "ai"},
{"category": "ai"},
]
)
# Semantic search
results = collection.query(
query_texts=["What are container-related technologies?"],
n_results=3
)
print("Search results:")
for doc, score in zip(results["documents"][0], results["distances"][0]):
    print(f"  [{score:.4f}] {doc}")
# Metadata filtering + semantic search
results_filtered = collection.query(
query_texts=["monitoring tools"],
n_results=2,
where={"category": "monitoring"}
)
print("\nFiltered search results:")
for doc in results_filtered["documents"][0]:
    print(f"  {doc}")
Similarity Search and Semantic Search
Comparing Similarity Metrics
import numpy as np
def cosine_similarity(a, b):
    """Cosine similarity: measures directional similarity of vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def dot_product(a, b):
    """Dot product: equivalent to cosine similarity for normalized vectors"""
    return np.dot(a, b)
def euclidean_distance(a, b):
    """Euclidean distance: straight-line distance between vectors"""
    return np.linalg.norm(a - b)
def manhattan_distance(a, b):
    """Manhattan distance: sum of absolute differences per dimension"""
    return np.sum(np.abs(a - b))
# Example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.1])
c = np.array([-1.0, -2.0, -3.0])
print("Vectors a and b (very similar):")
print(f" Cosine similarity: {cosine_similarity(a, b):.4f}")
print(f" Euclidean distance: {euclidean_distance(a, b):.4f}")
print(f" Dot product: {dot_product(a, b):.4f}")
print("\nVectors a and c (opposite direction):")
print(f" Cosine similarity: {cosine_similarity(a, c):.4f}")
print(f" Euclidean distance: {euclidean_distance(a, c):.4f}")
print(f" Dot product: {dot_product(a, c):.4f}")
Similarity Metric Selection Guide
| Metric | Formula | Range | Normalization Required | Use Case |
|---|---|---|---|---|
| Cosine Similarity | cos(a,b) | -1 to 1 | Not required | Text similarity (most common) |
| Dot Product | a . b | -inf to inf | Required | Normalized embeddings, search ranking |
| Euclidean Distance (L2) | L2 norm of vector difference | 0 to inf | Recommended | Clustering, anomaly detection |
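For L2-normalized vectors the three metrics agree on ranking: the dot product equals cosine similarity, and squared Euclidean distance is 2 − 2·cos(a, b), so nearest-by-L2 is the same as most-similar-by-cosine. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(8), rng.standard_normal(8)
a /= np.linalg.norm(a)   # L2-normalize both vectors
b /= np.linalg.norm(b)

cos_ab = np.dot(a, b)                 # cosine == dot product for unit vectors
l2_sq = np.linalg.norm(a - b) ** 2    # squared Euclidean distance

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(a, b)
assert np.isclose(l2_sq, 2 - 2 * cos_ab)
print(f"cosine = {cos_ab:.4f}, L2^2 = {l2_sq:.4f}, 2 - 2*cos = {2 - 2 * cos_ab:.4f}")
```

This identity is why vector databases can offer "cosine" as a distance option while internally using L2 or inner-product indexes on normalized embeddings.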
Implementing a Semantic Search Pipeline
from sentence_transformers import SentenceTransformer, util
import torch
class SemanticSearchEngine:
    def __init__(self, model_name="BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None
    def index_documents(self, documents):
        """Index documents by generating embeddings"""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            normalize_embeddings=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents (dimensions: {self.embeddings.shape[1]})")
    def search(self, query, top_k=5):
        """Perform semantic search"""
        query_embedding = self.model.encode(
            query,
            convert_to_tensor=True,
            normalize_embeddings=True
        )
        # Calculate cosine similarity
        scores = util.cos_sim(query_embedding, self.embeddings)[0]
        # Return top k results
        top_results = torch.topk(scores, k=min(top_k, len(self.documents)))
        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                "document": self.documents[idx],
                "score": score.item(),
                "index": idx.item()
            })
        return results
    def search_with_reranking(self, query, top_k=5, initial_k=20):
        """Two-stage search: embedding retrieval + reranking"""
        from sentence_transformers import CrossEncoder
        # Stage 1: Embedding-based candidate retrieval
        candidates = self.search(query, top_k=initial_k)
        # Stage 2: Reranking with cross-encoder
        reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
        pairs = [(query, c["document"]) for c in candidates]
        rerank_scores = reranker.predict(pairs)
        # Return reranked results
        for i, score in enumerate(rerank_scores):
            candidates[i]["rerank_score"] = float(score)
        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]
# Usage example
engine = SemanticSearchEngine()
documents = [
"Python is the most widely used programming language for data science and machine learning.",
"JavaScript is the core language for web development, also used server-side through Node.js.",
"Kubernetes automates deployment, scaling, and management of containerized applications.",
"PostgreSQL is a powerful open-source relational database management system.",
"TensorFlow and PyTorch are the most widely used frameworks for deep learning model development.",
"Redis is an in-memory data structure store used as a cache and message broker.",
"Docker packages applications and their dependencies into containers for portability.",
"GraphQL is an alternative to REST that allows clients to request only the data they need.",
]
engine.index_documents(documents)
# Semantic search
query = "What tools should I use for deep learning development?"
results = engine.search(query, top_k=3)
print(f"\nQuery: {query}")
for r in results:
    print(f"  [{r['score']:.4f}] {r['document']}")
Embeddings in RAG Pipelines
RAG Architecture Overview
In a RAG (Retrieval-Augmented Generation) pipeline, embeddings play a central role in the retrieval stage. The overall flow is as follows:
- Document Preprocessing: Split source documents into appropriately sized chunks
- Embedding Generation: Convert each chunk into an embedding vector and store in a vector database
- Query Retrieval: Embed the user query and search for similar chunks
- Reranking: Reorder search results using a cross-encoder
- Generation: Pass the retrieved context along with the query to the LLM for answer generation
RAG Pipeline Implementation
from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import chromadb
from chromadb.utils import embedding_functions
import tiktoken
from typing import List, Dict
class RAGPipeline:
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-m3",
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
        llm_model: str = "gpt-4o",
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.reranker = CrossEncoder(reranker_model)
        self.llm_client = OpenAI()
        self.llm_model = llm_model
        # Initialize Chroma vector DB
        self.chroma_client = chromadb.PersistentClient(path="./rag_db")
        self.collection = self.chroma_client.get_or_create_collection(
            name="rag_documents",
            metadata={"hnsw:space": "cosine"}
        )
    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        """Token-based text chunking"""
        tokenizer = tiktoken.get_encoding("cl100k_base")
        tokens = tokenizer.encode(text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunk_tokens = tokens[start:end]
            chunks.append(tokenizer.decode(chunk_tokens))
            start = end - overlap  # Apply overlap
        return chunks
    def ingest_documents(self, documents: List[Dict[str, str]]):
        """Chunk documents and store in vector DB"""
        all_chunks = []
        all_ids = []
        all_metadatas = []
        for doc_idx, doc in enumerate(documents):
            chunks = self.chunk_text(doc["content"])
            for chunk_idx, chunk in enumerate(chunks):
                all_chunks.append(chunk)
                all_ids.append(f"doc{doc_idx}_chunk{chunk_idx}")
                all_metadatas.append({
                    "source": doc.get("source", "unknown"),
                    "doc_index": doc_idx,
                    "chunk_index": chunk_idx,
                })
        # Generate embeddings and store
        embeddings = self.embedder.encode(all_chunks, normalize_embeddings=True)
        self.collection.add(
            documents=all_chunks,
            embeddings=embeddings.tolist(),
            ids=all_ids,
            metadatas=all_metadatas,
        )
        print(f"{len(documents)} documents -> {len(all_chunks)} chunks indexed")
    def retrieve(self, query: str, top_k: int = 10) -> List[Dict]:
        """Vector similarity-based retrieval"""
        query_embedding = self.embedder.encode(
            [query], normalize_embeddings=True
        ).tolist()
        results = self.collection.query(
            query_embeddings=query_embedding,
            n_results=top_k,
        )
        retrieved = []
        for i in range(len(results["documents"][0])):
            retrieved.append({
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "distance": results["distances"][0][i],
            })
        return retrieved
    def rerank(self, query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
        """Cross-encoder based reranking"""
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.reranker.predict(pairs)
        for i, score in enumerate(scores):
            candidates[i]["rerank_score"] = float(score)
        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]
    def generate(self, query: str, context_docs: List[Dict]) -> str:
        """Generate LLM response based on retrieved context"""
        context = "\n\n---\n\n".join([doc["text"] for doc in context_docs])
        messages = [
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the question based on "
                    "the provided context. If the context doesn't contain "
                    "relevant information, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ]
        response = self.llm_client.chat.completions.create(
            model=self.llm_model,
            messages=messages,
            temperature=0.1,
        )
        return response.choices[0].message.content
    def query(self, question: str, top_k_retrieve: int = 10, top_k_rerank: int = 5) -> str:
        """Execute full RAG pipeline"""
        # Step 1: Retrieve
        candidates = self.retrieve(question, top_k=top_k_retrieve)
        print(f"Step 1 retrieval: {len(candidates)} candidates")
        # Step 2: Rerank
        reranked = self.rerank(question, candidates, top_k=top_k_rerank)
        print(f"Step 2 reranking: top {len(reranked)} selected")
        # Step 3: Generate
        answer = self.generate(question, reranked)
        return answer
# Usage example
rag = RAGPipeline()
# Ingest documents
documents = [
{"content": "Long technical document content...", "source": "tech_doc_1.pdf"},
{"content": "Another document content...", "source": "tech_doc_2.pdf"},
]
rag.ingest_documents(documents)
# Query
answer = rag.query("How does embedding dimension size affect performance?")
print(f"\nAnswer: {answer}")
Hybrid Search Strategy
from sentence_transformers import SentenceTransformer, util
from rank_bm25 import BM25Okapi
import numpy as np
class HybridSearchEngine:
    """Dense (embedding) + Sparse (BM25) hybrid search"""
    def __init__(self, embedding_model="BAAI/bge-m3"):
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = []
        self.embeddings = None
        self.bm25 = None
    def index(self, documents):
        self.documents = documents
        # Dense: generate embeddings
        self.embeddings = self.embedder.encode(
            documents, normalize_embeddings=True, convert_to_tensor=True
        )
        # Sparse: build BM25 index
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
    def search(self, query, top_k=5, alpha=0.7):
        """Hybrid search (alpha: dense weight, 1-alpha: sparse weight)"""
        # Dense search
        query_emb = self.embedder.encode(
            query, normalize_embeddings=True, convert_to_tensor=True
        )
        dense_scores = util.cos_sim(query_emb, self.embeddings)[0].cpu().numpy()
        # Sparse search (BM25)
        sparse_scores = self.bm25.get_scores(query.split())
        # Normalize both score ranges to [0, 1] before mixing
        if dense_scores.max() > 0:
            dense_scores = dense_scores / dense_scores.max()
        if sparse_scores.max() > 0:
            sparse_scores = sparse_scores / sparse_scores.max()
        # Weighted combination
        hybrid_scores = alpha * dense_scores + (1 - alpha) * sparse_scores
        # Return top k
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
        return [
            {
                "document": self.documents[i],
                "score": hybrid_scores[i],
                "dense_score": dense_scores[i],
                "sparse_score": sparse_scores[i],
            }
            for i in top_indices
        ]
Fine-tuning Embedding Models
Why Fine-tuning Is Necessary
General-purpose embedding models perform well on general text similarity tasks, but they may underperform on specific domains (medical, legal, financial, etc.) or specialized search patterns. Fine-tuning can significantly improve domain-specific performance.
Contrastive Learning-Based Fine-tuning
from sentence_transformers import (
SentenceTransformer,
InputExample,
losses,
evaluation,
)
from torch.utils.data import DataLoader
# Load base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')
# Prepare training data (anchor, positive, negative)
train_examples = [
# (query, relevant document, irrelevant document)
InputExample(texts=[
"How to deploy a Kubernetes pod?",
"kubectl apply -f pod.yaml creates a new pod in the cluster.",
"Python is a popular programming language for data science."
]),
InputExample(texts=[
"What is a Docker container?",
"A Docker container is a lightweight, standalone executable package.",
"Machine learning models require large datasets for training."
]),
InputExample(texts=[
"How does Redis caching work?",
"Redis stores data in memory for fast read/write access as a cache layer.",
"Kubernetes orchestrates containerized applications across clusters."
]),
# ... thousands to tens of thousands of training examples
]
# Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Loss function: TripletLoss (anchor, positive, negative)
train_loss = losses.TripletLoss(model=model)
# Evaluation data
eval_examples = [
InputExample(texts=["query1", "relevant_doc1"], label=1.0),
InputExample(texts=["query2", "irrelevant_doc2"], label=0.0),
]
evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
eval_examples, name="domain-eval"
)
# Run fine-tuning
model.fit(
train_objectives=[(train_dataloader, train_loss)],
evaluator=evaluator,
epochs=3,
warmup_steps=100,
evaluation_steps=500,
output_path="./finetuned_embedding_model",
save_best_model=True,
)
print("Fine-tuning complete!")
# Load and use the fine-tuned model
finetuned_model = SentenceTransformer('./finetuned_embedding_model')
embeddings = finetuned_model.encode(["domain-specific query"])
Efficient Training with MultipleNegativesRankingLoss
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer('BAAI/bge-base-en-v1.5')
# Training with just (query, positive_passage) pairs
# Automatically leverages in-batch negatives
train_examples = [
InputExample(texts=["What is embedding?", "An embedding is a vector representation of data."]),
InputExample(texts=["How does HNSW work?", "HNSW builds a hierarchical graph for approximate nearest neighbor search."]),
InputExample(texts=["What is RAG?", "RAG retrieves relevant documents and uses them to augment LLM generation."]),
# ... more (query, positive) pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# MultipleNegativesRankingLoss: uses other positives in the batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model=model)
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=3,
warmup_steps=100,
output_path="./mnrl_finetuned_model",
)
Training Data Preparation Strategies
| Data Type | Description | Example |
|---|---|---|
| Natural Pairs | Real user queries and clicked documents | Search log data |
| LLM-Generated | Synthesized query-document pairs using GPT-4 etc. | Auto-generating questions from documents |
| Hard Negatives | Semantically similar but incorrect documents | Non-relevant docs from BM25 search results |
| Cross-Encoder Distillation | Using cross-encoder scores as training targets | Automatic high-quality label generation |
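The hard-negative strategy from the table can be sketched as follows. Everything here (`lexical_score`, `mine_hard_negatives`, the toy corpus) is illustrative: a crude token-overlap scorer stands in for BM25 so the example stays dependency-free, whereas a real pipeline would rank candidates with rank_bm25 or a production search index.

```python
import re
from collections import Counter

def lexical_score(query, doc):
    """Crude overlap score standing in for BM25 (tokens longer than 2 chars)."""
    tok = lambda s: [t for t in re.findall(r"[a-z0-9]+", s.lower()) if len(t) > 2]
    q, d = Counter(tok(query)), Counter(tok(doc))
    return sum(min(q[t], d[t]) for t in q)

def mine_hard_negatives(query, positive, corpus, k=2):
    """Top-k highest-scoring documents that are NOT the known positive."""
    candidates = [doc for doc in corpus if doc != positive]
    return sorted(candidates, key=lambda doc: lexical_score(query, doc), reverse=True)[:k]

corpus = [
    "Docker containers package applications with their dependencies.",
    "Docker Compose defines multi-container applications in YAML.",
    "PostgreSQL is a relational database system.",
    "Kubernetes schedules containers across a cluster.",
]
query = "What is a Docker container?"
positive = corpus[0]
# Lexically similar but non-relevant documents make the hardest negatives
print(mine_hard_negatives(query, positive, corpus))
```

The top-ranked non-positive document (here, the Docker Compose sentence) shares vocabulary with the query without answering it, which is exactly what forces the model to learn distinctions beyond surface overlap.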
Performance Optimization and Evaluation
MTEB Benchmark
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for comprehensively evaluating embedding model performance. It evaluates models across various task categories:
| Task Category | Description | Representative Datasets |
|---|---|---|
| Classification | Text classification | AmazonReviews, TweetSentiment |
| Clustering | Text clustering | ArXiv, Reddit |
| Pair Classification | Sentence pair relation classification | TwitterPara, SprintDuplicateQuestions |
| Reranking | Search result reordering | AskUbuntuDupQuestions, StackOverflowDupQuestions |
| Retrieval | Document retrieval | MSMarco, NQ, HotpotQA |
| STS | Sentence semantic similarity | STSBenchmark, SICK-R |
| Summarization | Summary quality evaluation | SummEval |
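MTEB's retrieval tasks are primarily scored with nDCG@10 alongside recall. A minimal sketch of these two metrics (binary relevance, illustrative function names) is useful for sanity-checking your own retrieval pipeline against the same yardstick:

```python
import numpy as np

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents found in the top-k results."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: discounted gain normalized by the ideal ranking."""
    gains = [1.0 if doc in relevant_ids else 0.0 for doc in retrieved_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2", "d9"]   # ranked system output
relevant = {"d1", "d2"}                       # gold labels
print(f"Recall@5 = {recall_at_k(retrieved, relevant, 5):.3f}")
print(f"nDCG@5   = {ndcg_at_k(retrieved, relevant, 5):.3f}")
```

Unlike recall, nDCG rewards placing relevant documents higher: the same two hits at ranks 1-2 would score 1.0, while at ranks 2 and 4 (as above) they score about 0.65.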
Dimensionality Reduction and Matryoshka Representation Learning
from sentence_transformers import SentenceTransformer
import numpy as np
# Model supporting Matryoshka Representation Learning (MRL)
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
texts = [
"Vector databases store embeddings for similarity search.",
"Embedding models convert text into numerical representations.",
"RAG systems combine retrieval with language generation.",
]
# Full-dimension embeddings
full_embeddings = model.encode(texts)
print(f"Full dimensions: {full_embeddings.shape[1]}") # 768
# Matryoshka: truncate to desired dimension and normalize
def truncate_embeddings(embeddings, target_dim):
    """Dimension reduction using Matryoshka approach"""
    truncated = embeddings[:, :target_dim]
    # L2 normalization
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms
# Compare similarity at various dimensions
for dim in [64, 128, 256, 512, 768]:
    reduced = truncate_embeddings(full_embeddings, dim)
    sim = np.dot(reduced[0], reduced[1])  # Dot product of normalized vectors = cosine similarity
    print(f"  Dimension {dim:>4}: similarity = {sim:.4f}")
Memory Optimization through Quantization
import numpy as np
def scalar_quantize_uint8(embeddings):
    """Scalar quantization: float32 -> uint8 (4x memory reduction)"""
    min_val = embeddings.min(axis=0)
    max_val = embeddings.max(axis=0)
    scale = (max_val - min_val) / 255.0
    # uint8 covers the full 0-255 range; int8 would overflow above 127
    quantized = np.round((embeddings - min_val) / scale).astype(np.uint8)
    return quantized, min_val, scale
def scalar_dequantize_uint8(quantized, min_val, scale):
    """Dequantize: uint8 -> float32"""
    return quantized.astype(np.float32) * scale + min_val
def binary_quantize(embeddings):
    """Binary quantization: 1 bit per dimension, packed (32x memory reduction)"""
    return np.packbits(embeddings > 0, axis=1)
# Memory comparison
num_vectors = 1_000_000
dimension = 1024
embeddings = np.random.randn(num_vectors, dimension).astype(np.float32)
print(f"Original (float32): {embeddings.nbytes / 1e9:.2f} GB")
quantized, _, _ = scalar_quantize_uint8(embeddings)
print(f"uint8 quantized: {quantized.nbytes / 1e9:.2f} GB")
binary = binary_quantize(embeddings)
print(f"Binary quantized: {binary.nbytes / 1e9:.2f} GB")
# Original (float32): 4.10 GB
# uint8 quantized: 1.02 GB
# Binary quantized: 0.13 GB (1 bit per dimension via np.packbits)
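Binary-quantized embeddings can also be searched directly: the Hamming distance between two bit-packed codes approximates their angular distance. A self-contained sketch on synthetic data (sizes and the noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
docs = rng.standard_normal((10_000, 1024)).astype(np.float32)
# Query: a noisy copy of document 123
query = docs[123] + 0.1 * rng.standard_normal(1024).astype(np.float32)

doc_codes = np.packbits(docs > 0, axis=1)   # (10000, 128) uint8
query_code = np.packbits(query > 0)          # (128,) uint8

# Hamming distance = number of differing bits after XOR
xor = np.bitwise_xor(doc_codes, query_code)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)

top5 = np.argsort(hamming)[:5]
print(f"Nearest by Hamming distance: {top5}")  # doc 123 should rank first
```

In production this is typically used as a coarse first stage: retrieve a generous candidate set by Hamming distance over the packed codes, then rescore those candidates with full-precision cosine similarity.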
Production Optimization Checklist
| Optimization | Technique | Effect |
|---|---|---|
| Batch Processing | Batch embedding requests together | 3-5x throughput improvement |
| Caching | Cache frequently used query embeddings | 90% latency reduction |
| Dimension Reduction | Apply Matryoshka or PCA | 2-4x memory/speed improvement |
| Quantization | int8/binary quantization | 4-32x memory reduction |
| GPU Inference | ONNX Runtime or TensorRT | 2-3x inference speed improvement |
| Async Processing | asyncio-based parallel embedding | Overall throughput improvement |
| Model Selection | Choose appropriate model for requirements | Cost-performance optimization |
from sentence_transformers import SentenceTransformer
import hashlib

class OptimizedEmbeddingService:
    def __init__(self, model_name="BAAI/bge-m3", cache_size=10000):
        self.model = SentenceTransformer(model_name)
        self.cache = {}
        self.cache_size = cache_size

    def _get_cache_key(self, text):
        return hashlib.md5(text.encode()).hexdigest()

    def encode_with_cache(self, texts, batch_size=64):
        """Generate embeddings with caching"""
        uncached_texts = []
        uncached_indices = []
        results = [None] * len(texts)

        # Check cache hits
        for i, text in enumerate(texts):
            key = self._get_cache_key(text)
            if key in self.cache:
                results[i] = self.cache[key]
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        # Batch-embed only the cache misses
        if uncached_texts:
            new_embeddings = self.model.encode(
                uncached_texts,
                batch_size=batch_size,
                normalize_embeddings=True,
            )
            for idx, emb in zip(uncached_indices, new_embeddings):
                key = self._get_cache_key(texts[idx])
                self.cache[key] = emb
                results[idx] = emb

        # Evict oldest entries (FIFO via dict insertion order) until within limit
        while len(self.cache) > self.cache_size:
            oldest_key = next(iter(self.cache))
            del self.cache[oldest_key]

        return results

    def get_cache_stats(self):
        return {"cache_size": len(self.cache), "max_size": self.cache_size}
Conclusion
Embedding models are core infrastructure of modern AI systems, playing essential roles in diverse applications including semantic search, RAG, recommendation systems, and anomaly detection. Here is a summary of the key points covered in this article:
- **Model selection matters**: Reference the MTEB benchmark, but evaluating on your actual data is the most accurate approach. Consider BGE-M3 for multilingual support, GTE-Qwen2-7B for top performance, and text-embedding-3-small for cost efficiency.
- **Choose vector databases based on requirements**: Chroma for rapid prototyping, Milvus or Pinecone at production scale, and pgvector to leverage existing PostgreSQL infrastructure.
- **Hybrid search outperforms single approaches**: Combining dense (embedding) and sparse (BM25) retrieval with reranking significantly improves search quality.
- **Fine-tuning is key for domain-specific performance**: MultipleNegativesRankingLoss with hard negative mining can deliver significant gains even with limited data.
- **Optimization is essential**: Apply dimension reduction (Matryoshka), quantization, caching, and batch processing to control cost and latency in production.
Embedding technology is rapidly evolving, with new techniques such as Matryoshka Representation Learning, multimodal embeddings, and task-specific LoRA adapters continually emerging. By understanding the core principles and building practical experience, you can construct optimal embedding strategies for your own projects.
References
- Reimers, N. and Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019.
- Wang, L. et al. (2022). "Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)." arXiv:2212.03533.
- Chen, J. et al. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation."
- Kusupati, A. et al. (2022). "Matryoshka Representation Learning." NeurIPS 2022.
- Muennighoff, N. et al. (2023). "MTEB: Massive Text Embedding Benchmark." EACL 2023.
- MTEB Leaderboard: https://huggingface.co/spaces/mteb/leaderboard
- Sentence Transformers Documentation: https://www.sbert.net/
- FAISS Documentation: https://github.com/facebookresearch/faiss
- Pinecone Learning Center: https://www.pinecone.io/learn/
- Chroma Documentation: https://docs.trychroma.com/