Complete Guide to Embedding Models: Vector Search, RAG, and Sentence Transformers in Practice



Introduction

Embeddings are a foundational technology of modern AI systems. By converting unstructured data such as text, images, and audio into numerical vectors, they enable machines to understand and compare "meaning." With RAG (Retrieval-Augmented Generation) pipelines becoming the core architecture of LLM applications, the quality of embedding models has become a critical factor that determines overall system performance.

Since the advent of Word2Vec in 2013, the field has evolved rapidly through GloVe and FastText, then BERT-based sentence embeddings, and recently to instruction-tuned large-scale embedding models. In 2024-2025, models with significantly improved performance and multilingual support emerged, including OpenAI text-embedding-3, Cohere embed-v3, BGE-M3, and GTE-Qwen2, with fierce competition on the MTEB (Massive Text Embedding Benchmark) leaderboard.

This article systematically covers everything about embedding models, from fundamental principles to the latest model comparisons, vector database utilization, RAG integration, fine-tuning, and performance evaluation, all accompanied by practical code examples.

Fundamentals of Embeddings

What Are Embeddings?

An embedding maps discrete, high-dimensional data into a lower-dimensional continuous vector space; the resulting vectors are themselves called embeddings. The core idea is to learn representations in which semantically similar items are positioned close together in the vector space.

# Intuitive understanding: representing words as vectors
# "king" = [0.2, 0.8, 0.1, ...]
# "queen" = [0.3, 0.9, 0.1, ...]
# "man" = [0.1, 0.2, 0.8, ...]

# The famous vector arithmetic: king - man + woman ≈ queen
import numpy as np

king = np.array([0.2, 0.8, 0.1, 0.5])
man = np.array([0.1, 0.2, 0.8, 0.4])
woman = np.array([0.15, 0.25, 0.85, 0.6])
queen = np.array([0.3, 0.9, 0.1, 0.7])

result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen              = {queen}")
# The two vectors are very similar

Geometric Intuition of Embeddings

In vector space, embeddings exhibit the following properties:

  • Distance = Semantic Difference: Words/sentences with similar meanings are positioned at close distances
  • Direction = Relationship: Specific directions encode specific semantic relationships (e.g., gender, tense, size)
  • Clustering: Items belonging to the same topic or category naturally form clusters
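These properties show up even with toy vectors. The minimal numpy sketch below uses invented values (not produced by a real model) to illustrate that same-topic vectors score a higher cosine similarity than cross-topic ones:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings (illustrative values only):
# two "animal" vectors and one "vehicle" vector
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

# Distance = semantic difference: the same-topic pair scores much higher
print(f"cat-dog: {cos(cat, dog):.3f}, cat-car: {cos(cat, car):.3f}")
```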

Evolution of Embeddings

| Generation | Model | Characteristics | Dimensions |
| --- | --- | --- | --- |
| 1st Gen (2013) | Word2Vec, GloVe | Static word embeddings, context-independent | 50-300 |
| 2nd Gen (2018) | ELMo, BERT | Context-dependent embeddings, bidirectional | 768-1024 |
| 3rd Gen (2019) | Sentence-BERT | Sentence-level embeddings, efficient similarity computation | 384-768 |
| 4th Gen (2023-) | E5, BGE, GTE | Instruction-tuned, multilingual, large-scale | 768-4096 |
| 5th Gen (2024-) | text-embedding-3, Matryoshka | Variable dimensions, multilingual, high-performance | 256-3072 |

Comparing Key Embedding Models

Commercial Embedding Models

| Model | Provider | Max Tokens | Dimensions | MTEB Average | Price (1M tokens) |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-large | OpenAI | 8,191 | 3,072 | 64.6 | ~0.13 USD |
| text-embedding-3-small | OpenAI | 8,191 | 1,536 | 62.3 | ~0.02 USD |
| embed-v3.0 (English) | Cohere | 512 | 1,024 | 64.5 | ~0.10 USD |
| embed-v3.0 (Multilingual) | Cohere | 512 | 1,024 | 64.0 | ~0.10 USD |
| Voyage-3 | Voyage AI | 32,000 | 1,024 | 67.3 | ~0.06 USD |
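The price column translates directly into a back-of-the-envelope budget. A small sketch using the approximate prices from the table above (actual billing depends on the provider's tokenizer counts):

```python
# Rough embedding cost estimate for a corpus, using the approximate
# per-million-token prices from the table above.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
}

def estimate_cost_usd(num_docs, avg_tokens_per_doc, model):
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * PRICE_PER_M_TOKENS[model]

# 100,000 documents at ~500 tokens each = 50M tokens
for model in PRICE_PER_M_TOKENS:
    print(f"{model}: ${estimate_cost_usd(100_000, 500, model):.2f}")
# text-embedding-3-large: $6.50
# text-embedding-3-small: $1.00
```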

Open-Source Embedding Models

| Model | Developer | Parameters | Dimensions | MTEB Average | Features |
| --- | --- | --- | --- | --- | --- |
| BGE-M3 | BAAI | 568M | 1,024 | 66.1 | Multilingual, Dense+Sparse+ColBERT |
| BGE-large-en-v1.5 | BAAI | 335M | 1,024 | 64.2 | English-optimized |
| E5-mistral-7b-instruct | Microsoft | 7B | 4,096 | 66.6 | LLM-based, high-performance |
| GTE-Qwen2-7B-instruct | Alibaba | 7B | 3,584 | 70.2 | Top MTEB ranking |
| Jina-embeddings-v3 | Jina AI | 572M | 1,024 | 65.5 | Multilingual, Task LoRA |
| nomic-embed-text-v1.5 | Nomic | 137M | 768 | 62.3 | Lightweight, 8192 tokens |
| mxbai-embed-large-v1 | Mixedbread | 335M | 1,024 | 64.7 | Matryoshka support |

Model Selection Criteria

# Model selection decision tree
def select_embedding_model(requirements):
    """Embedding model selection guide based on requirements"""

    if requirements.get("budget") == "unlimited":
        if requirements.get("max_performance"):
            return "GTE-Qwen2-7B-instruct (self-hosted) or Voyage-3 (API)"
        return "text-embedding-3-large (OpenAI API)"

    if requirements.get("multilingual"):
        if requirements.get("self_hosted"):
            return "BGE-M3 (Dense+Sparse hybrid)"
        return "Cohere embed-v3 multilingual"

    if requirements.get("low_latency"):
        if requirements.get("self_hosted"):
            return "nomic-embed-text-v1.5 (lightweight 137M)"
        return "text-embedding-3-small (OpenAI API)"

    if requirements.get("domain_specific"):
        return "Sentence Transformers + fine-tuning (base model: BGE or E5)"

    # Default recommendation
    return "text-embedding-3-small (cost-effective general-purpose choice)"

Using Sentence Transformers

Basic Usage

Sentence Transformers is the most widely used Python library for generating sentence-level embeddings.

from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Single sentence embedding
sentence = "Embedding models convert text into numerical vectors."
embedding = model.encode(sentence)
print(f"Dimensions: {embedding.shape}")  # (1024,)

# Batch embeddings
sentences = [
    "Embedding models convert text into numerical vectors.",
    "Vector search quickly finds similar documents.",
    "RAG is a retrieval-augmented generation technique.",
    "The weather is very nice today.",
]

embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(f"Embedding matrix shape: {embeddings.shape}")  # (4, 1024)

# Similarity computation
from sentence_transformers.util import cos_sim

similarity_matrix = cos_sim(embeddings, embeddings)
print("Similarity matrix:")
print(similarity_matrix.numpy().round(3))

BGE-M3 Multilingual Embeddings

from sentence_transformers import SentenceTransformer

# BGE-M3: A multilingual embedding model supporting 100+ languages
model = SentenceTransformer('BAAI/bge-m3')

# Multilingual sentence embeddings
sentences = [
    "Machine learning is transforming the world.",        # English
    "머신러닝이 세상을 변화시키고 있다.",                      # Korean
    "機械学習が世界を変えている。",                           # Japanese
    "机器学习正在改变世界。",                                # Chinese
]

embeddings = model.encode(sentences, normalize_embeddings=True)

# Cross-lingual similarity check
from sentence_transformers.util import cos_sim

similarities = cos_sim(embeddings, embeddings)
print("Cross-lingual similarity matrix:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"  '{s1[:30]}...' <-> '{s2[:30]}...': {similarities[i][j]:.4f}")
# Sentences with the same meaning in different languages show high similarity

Using the OpenAI Embedding API

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_openai_embeddings(texts, model="text-embedding-3-small", dimensions=None):
    """Generate OpenAI embeddings (supports Matryoshka dimension reduction)"""
    kwargs = {"input": texts, "model": model}
    if dimensions:
        kwargs["dimensions"] = dimensions

    response = client.embeddings.create(**kwargs)
    return np.array([item.embedding for item in response.data])

# Basic usage
texts = ["Principles of embedding models", "Building vector search systems"]
embeddings_full = get_openai_embeddings(texts, model="text-embedding-3-large")
print(f"Full dimensions: {embeddings_full.shape}")  # (2, 3072)

# Matryoshka: optimize cost/speed through dimension reduction
embeddings_256 = get_openai_embeddings(
    texts, model="text-embedding-3-large", dimensions=256
)
print(f"Reduced dimensions: {embeddings_256.shape}")  # (2, 256)

# Cosine similarity comparison
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_full = cosine_similarity(embeddings_full[0], embeddings_full[1])
sim_256 = cosine_similarity(embeddings_256[0], embeddings_256[1])
print(f"Full dimension similarity: {sim_full:.4f}")
print(f"256-dimension similarity: {sim_256:.4f}")

Vector Databases and Indexing

Vector Database Comparison

| Database | Type | Index Algorithm | Scalability | Filtering | Features |
| --- | --- | --- | --- | --- | --- |
| Pinecone | Fully managed | Proprietary | High | Metadata | Serverless, simple API |
| Weaviate | Open-source/Cloud | HNSW | High | GraphQL | Hybrid search, modular |
| Milvus | Open-source | HNSW, IVF, DiskANN | Very high | Attribute | GPU acceleration, large-scale |
| Chroma | Open-source | HNSW | Medium | Metadata | Lightweight, developer-friendly |
| FAISS | Library | IVF, PQ, HNSW | High | None (separate impl.) | Meta-developed, top performance |
| Qdrant | Open-source/Cloud | HNSW | High | Payload | Rust-based, high-performance |
| pgvector | PostgreSQL extension | IVFFlat, HNSW | Medium | SQL | Leverages existing PostgreSQL |

Understanding Indexing Algorithms

import faiss
import numpy as np
import time

# Generate test data
np.random.seed(42)
dimension = 1024
num_vectors = 1_000_000
num_queries = 100

# Generate normalized random vectors
data = np.random.randn(num_vectors, dimension).astype('float32')
faiss.normalize_L2(data)
queries = np.random.randn(num_queries, dimension).astype('float32')
faiss.normalize_L2(queries)

# 1. Flat Index (Exact Search, Brute-force)
print("=== Flat Index (Exact Search) ===")
index_flat = faiss.IndexFlatIP(dimension)  # Inner Product (cosine similarity)
index_flat.add(data)

start = time.time()
D_exact, I_exact = index_flat.search(queries, 10)
flat_time = time.time() - start
print(f"Search time: {flat_time:.3f}s")
print(f"Recall@10: 1.000 (exact search)")

# 2. IVF (Inverted File Index)
print("\n=== IVF Index ===")
nlist = 1024  # Number of clusters
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(data)
index_ivf.add(data)
index_ivf.nprobe = 32  # Number of clusters to search

start = time.time()
D_ivf, I_ivf = index_ivf.search(queries, 10)
ivf_time = time.time() - start

# Calculate recall
recall = np.mean([len(set(I_ivf[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {ivf_time:.3f}s (x{flat_time/ivf_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")

# 3. HNSW (Hierarchical Navigable Small World)
print("\n=== HNSW Index ===")
# Pass the metric at construction time; setting metric_type after the fact
# does not change the distance function the graph is built with
index_hnsw = faiss.IndexHNSWFlat(dimension, 32, faiss.METRIC_INNER_PRODUCT)  # M=32
index_hnsw.hnsw.efConstruction = 200
index_hnsw.hnsw.efSearch = 64
index_hnsw.add(data)

start = time.time()
D_hnsw, I_hnsw = index_hnsw.search(queries, 10)
hnsw_time = time.time() - start

recall = np.mean([len(set(I_hnsw[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {hnsw_time:.3f}s (x{flat_time/hnsw_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")

# 4. IVF-PQ (Product Quantization)
print("\n=== IVF-PQ Index (Memory Optimized) ===")
m = 64  # Number of sub-vectors
nbits = 8  # Codebook bits
index_ivfpq = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index_ivfpq.train(data)
index_ivfpq.add(data)
index_ivfpq.nprobe = 32

start = time.time()
D_pq, I_pq = index_ivfpq.search(queries, 10)
pq_time = time.time() - start

recall = np.mean([len(set(I_pq[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {pq_time:.3f}s (x{flat_time/pq_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")
print(f"Memory: Flat={data.nbytes/1e9:.1f}GB, PQ={index_ivfpq.sa_code_size()*num_vectors/1e9:.3f}GB")

Indexing Algorithm Comparison

| Algorithm | Search Speed | Recall | Memory Usage | Build Time | Best For |
| --- | --- | --- | --- | --- | --- |
| Flat | Slow | 100% | High | Instant | Small scale (under 100K) |
| IVF | Medium | 95-99% | High | Medium | Medium scale, frequent updates |
| HNSW | Fast | 97-99% | High + overhead | Slow | High-performance, read-heavy |
| IVF-PQ | Fast | 90-95% | Low | Medium | Large scale, memory-constrained |
| ScaNN | Very fast | 95-98% | Medium | Medium | Large scale, Google ecosystem |

Building a Vector Store with Chroma

import chromadb
from chromadb.utils import embedding_functions

# Create Chroma client
client = chromadb.PersistentClient(path="./chroma_db")

# Set up Sentence Transformers embedding function
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-m3"
)

# Create collection
collection = client.get_or_create_collection(
    name="tech_documents",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)

# Add documents
documents = [
    "Kubernetes is a container orchestration platform.",
    "Docker is a tool for packaging applications into containers.",
    "Prometheus is a metrics-based monitoring system.",
    "Grafana is a data visualization and dashboard tool.",
    "Terraform is an IaC tool for managing infrastructure as code.",
    "Embedding models convert text into vectors.",
    "RAG is a retrieval-augmented generation technique that reduces LLM hallucinations.",
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[
        {"category": "kubernetes"},
        {"category": "docker"},
        {"category": "monitoring"},
        {"category": "monitoring"},
        {"category": "iac"},
        {"category": "ai"},
        {"category": "ai"},
    ]
)

# Semantic search
results = collection.query(
    query_texts=["What are container-related technologies?"],
    n_results=3
)
print("Search results:")
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"  [distance={dist:.4f}] {doc}")  # cosine distance: lower = more similar

# Metadata filtering + semantic search
results_filtered = collection.query(
    query_texts=["monitoring tools"],
    n_results=2,
    where={"category": "monitoring"}
)
print("\nFiltered search results:")
for doc in results_filtered["documents"][0]:
    print(f"  {doc}")

Comparing Similarity Metrics

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: measures directional similarity of vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    """Dot product: equivalent to cosine similarity for normalized vectors"""
    return np.dot(a, b)

def euclidean_distance(a, b):
    """Euclidean distance: straight-line distance between vectors"""
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    """Manhattan distance: sum of absolute differences per dimension"""
    return np.sum(np.abs(a - b))

# Example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.1])
c = np.array([-1.0, -2.0, -3.0])

print("Vectors a and b (very similar):")
print(f"  Cosine similarity:   {cosine_similarity(a, b):.4f}")
print(f"  Euclidean distance:  {euclidean_distance(a, b):.4f}")
print(f"  Dot product:         {dot_product(a, b):.4f}")

print("\nVectors a and c (opposite direction):")
print(f"  Cosine similarity:   {cosine_similarity(a, c):.4f}")
print(f"  Euclidean distance:  {euclidean_distance(a, c):.4f}")
print(f"  Dot product:         {dot_product(a, c):.4f}")

Similarity Metric Selection Guide

| Metric | Formula | Range | Normalization Required | Use Case |
| --- | --- | --- | --- | --- |
| Cosine Similarity | a·b / (‖a‖ ‖b‖) | -1 to 1 | Not required | Text similarity (most common) |
| Dot Product | a·b | -inf to inf | Required | Normalized embeddings, search ranking |
| Euclidean Distance (L2) | ‖a − b‖₂ | 0 to inf | Recommended | Clustering, anomaly detection |

Implementing a Semantic Search Pipeline

from sentence_transformers import SentenceTransformer, util
import torch

class SemanticSearchEngine:
    def __init__(self, model_name="BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index_documents(self, documents):
        """Index documents by generating embeddings"""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            normalize_embeddings=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents (dimensions: {self.embeddings.shape[1]})")

    def search(self, query, top_k=5):
        """Perform semantic search"""
        query_embedding = self.model.encode(
            query,
            convert_to_tensor=True,
            normalize_embeddings=True
        )

        # Calculate cosine similarity
        scores = util.cos_sim(query_embedding, self.embeddings)[0]

        # Return top k results
        top_results = torch.topk(scores, k=min(top_k, len(self.documents)))

        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                "document": self.documents[idx],
                "score": score.item(),
                "index": idx.item()
            })
        return results

    def search_with_reranking(self, query, top_k=5, initial_k=20):
        """Two-stage search: embedding retrieval + reranking"""
        from sentence_transformers import CrossEncoder

        # Stage 1: Embedding-based candidate retrieval
        candidates = self.search(query, top_k=initial_k)

        # Stage 2: Reranking with cross-encoder
        # (in production, load the reranker once in __init__ rather than per call)
        reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
        pairs = [(query, c["document"]) for c in candidates]
        rerank_scores = reranker.predict(pairs)

        # Return reranked results
        for i, score in enumerate(rerank_scores):
            candidates[i]["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]

# Usage example
engine = SemanticSearchEngine()

documents = [
    "Python is the most widely used programming language for data science and machine learning.",
    "JavaScript is the core language for web development, also used server-side through Node.js.",
    "Kubernetes automates deployment, scaling, and management of containerized applications.",
    "PostgreSQL is a powerful open-source relational database management system.",
    "TensorFlow and PyTorch are the most widely used frameworks for deep learning model development.",
    "Redis is an in-memory data structure store used as a cache and message broker.",
    "Docker packages applications and their dependencies into containers for portability.",
    "GraphQL is an alternative to REST that allows clients to request only the data they need.",
]

engine.index_documents(documents)

# Semantic search
query = "What tools should I use for deep learning development?"
results = engine.search(query, top_k=3)
print(f"\nQuery: {query}")
for r in results:
    print(f"  [{r['score']:.4f}] {r['document']}")

Embeddings in RAG Pipelines

RAG Architecture Overview

In a RAG (Retrieval-Augmented Generation) pipeline, embeddings play a central role in the retrieval stage. The overall flow is as follows:

  1. Document Preprocessing: Split source documents into appropriately sized chunks
  2. Embedding Generation: Convert each chunk into an embedding vector and store in a vector database
  3. Query Retrieval: Embed the user query and search for similar chunks
  4. Reranking: Reorder search results using a cross-encoder
  5. Generation: Pass the retrieved context along with the query to the LLM for answer generation

RAG Pipeline Implementation

from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import chromadb
from chromadb.utils import embedding_functions
import tiktoken
from typing import List, Dict

class RAGPipeline:
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-m3",
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
        llm_model: str = "gpt-4o",
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.reranker = CrossEncoder(reranker_model)
        self.llm_client = OpenAI()
        self.llm_model = llm_model

        # Initialize Chroma vector DB
        self.chroma_client = chromadb.PersistentClient(path="./rag_db")
        self.collection = self.chroma_client.get_or_create_collection(
            name="rag_documents",
            metadata={"hnsw:space": "cosine"}
        )

    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        """Token-based text chunking"""
        tokenizer = tiktoken.get_encoding("cl100k_base")
        tokens = tokenizer.encode(text)
        chunks = []

        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunk_str = tokenizer.decode(tokens[start:end])
            chunks.append(chunk_str)
            start = end - overlap  # step forward, keeping `overlap` tokens of context

        return chunks

    def ingest_documents(self, documents: List[Dict[str, str]]):
        """Chunk documents and store in vector DB"""
        all_chunks = []
        all_ids = []
        all_metadatas = []

        for doc_idx, doc in enumerate(documents):
            chunks = self.chunk_text(doc["content"])
            for chunk_idx, chunk in enumerate(chunks):
                all_chunks.append(chunk)
                all_ids.append(f"doc{doc_idx}_chunk{chunk_idx}")
                all_metadatas.append({
                    "source": doc.get("source", "unknown"),
                    "doc_index": doc_idx,
                    "chunk_index": chunk_idx,
                })

        # Generate embeddings and store
        embeddings = self.embedder.encode(all_chunks, normalize_embeddings=True)

        self.collection.add(
            documents=all_chunks,
            embeddings=embeddings.tolist(),
            ids=all_ids,
            metadatas=all_metadatas,
        )
        print(f"{len(documents)} documents -> {len(all_chunks)} chunks indexed")

    def retrieve(self, query: str, top_k: int = 10) -> List[Dict]:
        """Vector similarity-based retrieval"""
        query_embedding = self.embedder.encode(
            [query], normalize_embeddings=True
        ).tolist()

        results = self.collection.query(
            query_embeddings=query_embedding,
            n_results=top_k,
        )

        retrieved = []
        for i in range(len(results["documents"][0])):
            retrieved.append({
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "distance": results["distances"][0][i],
            })
        return retrieved

    def rerank(self, query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
        """Cross-encoder based reranking"""
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.reranker.predict(pairs)

        for i, score in enumerate(scores):
            candidates[i]["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]

    def generate(self, query: str, context_docs: List[Dict]) -> str:
        """Generate LLM response based on retrieved context"""
        context = "\n\n---\n\n".join([doc["text"] for doc in context_docs])

        messages = [
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the question based on "
                    "the provided context. If the context doesn't contain "
                    "relevant information, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ]

        response = self.llm_client.chat.completions.create(
            model=self.llm_model,
            messages=messages,
            temperature=0.1,
        )
        return response.choices[0].message.content

    def query(self, question: str, top_k_retrieve: int = 10, top_k_rerank: int = 5) -> str:
        """Execute full RAG pipeline"""
        # Step 1: Retrieve
        candidates = self.retrieve(question, top_k=top_k_retrieve)
        print(f"Step 1 retrieval: {len(candidates)} candidates")

        # Step 2: Rerank
        reranked = self.rerank(question, candidates, top_k=top_k_rerank)
        print(f"Step 2 reranking: top {len(reranked)} selected")

        # Step 3: Generate
        answer = self.generate(question, reranked)
        return answer

# Usage example
rag = RAGPipeline()

# Ingest documents
documents = [
    {"content": "Long technical document content...", "source": "tech_doc_1.pdf"},
    {"content": "Another document content...", "source": "tech_doc_2.pdf"},
]
rag.ingest_documents(documents)

# Query
answer = rag.query("How does embedding dimension size affect performance?")
print(f"\nAnswer: {answer}")

Hybrid Search Strategy

from sentence_transformers import SentenceTransformer, util
from rank_bm25 import BM25Okapi
import numpy as np

class HybridSearchEngine:
    """Dense (embedding) + Sparse (BM25) hybrid search"""

    def __init__(self, embedding_model="BAAI/bge-m3"):
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = []
        self.embeddings = None
        self.bm25 = None

    def index(self, documents):
        self.documents = documents

        # Dense: generate embeddings
        self.embeddings = self.embedder.encode(
            documents, normalize_embeddings=True, convert_to_tensor=True
        )

        # Sparse: build BM25 index
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, top_k=5, alpha=0.7):
        """Hybrid search (alpha: dense weight, 1-alpha: sparse weight)"""
        # Dense search
        query_emb = self.embedder.encode(
            query, normalize_embeddings=True, convert_to_tensor=True
        )
        dense_scores = util.cos_sim(query_emb, self.embeddings)[0].cpu().numpy()

        # Sparse search (BM25)
        sparse_scores = self.bm25.get_scores(query.split())

        # Normalize
        if dense_scores.max() > 0:
            dense_scores = dense_scores / dense_scores.max()
        if sparse_scores.max() > 0:
            sparse_scores = sparse_scores / sparse_scores.max()

        # Weighted combination
        hybrid_scores = alpha * dense_scores + (1 - alpha) * sparse_scores

        # Return top k
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
        return [
            {
                "document": self.documents[i],
                "score": hybrid_scores[i],
                "dense_score": dense_scores[i],
                "sparse_score": sparse_scores[i],
            }
            for i in top_indices
        ]
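The weighted-sum fusion above needs score normalization because dense and BM25 scores live on different scales. Reciprocal Rank Fusion (RRF) is a common alternative that combines only the ranks, so no normalization is required. A minimal sketch with hypothetical document ids:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids with RRF.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant commonly used in the RRF literature.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense and BM25 each return doc ids in ranked order
dense_ranking = ["d3", "d1", "d7", "d2"]
sparse_ranking = ["d3", "d5", "d1", "d7"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# d3 comes out on top: both retrievers rank it first
```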

Fine-tuning Embedding Models

Why Fine-tuning Is Necessary

General-purpose embedding models perform well on general text similarity tasks, but they may underperform on specific domains (medical, legal, financial, etc.) or specialized search patterns. Fine-tuning can significantly improve domain-specific performance.

Contrastive Learning-Based Fine-tuning

from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
    evaluation,
)
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Prepare training data (anchor, positive, negative)
train_examples = [
    # (query, relevant document, irrelevant document)
    InputExample(texts=[
        "How to deploy a Kubernetes pod?",
        "kubectl apply -f pod.yaml creates a new pod in the cluster.",
        "Python is a popular programming language for data science."
    ]),
    InputExample(texts=[
        "What is a Docker container?",
        "A Docker container is a lightweight, standalone executable package.",
        "Machine learning models require large datasets for training."
    ]),
    InputExample(texts=[
        "How does Redis caching work?",
        "Redis stores data in memory for fast read/write access as a cache layer.",
        "Kubernetes orchestrates containerized applications across clusters."
    ]),
    # ... thousands to tens of thousands of training examples
]

# Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Loss function: TripletLoss (anchor, positive, negative)
train_loss = losses.TripletLoss(model=model)

# Evaluation data
eval_examples = [
    InputExample(texts=["query1", "relevant_doc1"], label=1.0),
    InputExample(texts=["query2", "irrelevant_doc2"], label=0.0),
]
evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    eval_examples, name="domain-eval"
)

# Run fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,
    warmup_steps=100,
    evaluation_steps=500,
    output_path="./finetuned_embedding_model",
    save_best_model=True,
)

print("Fine-tuning complete!")

# Load and use the fine-tuned model
finetuned_model = SentenceTransformer('./finetuned_embedding_model')
embeddings = finetuned_model.encode(["domain-specific query"])

Efficient Training with MultipleNegativesRankingLoss

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Training with just (query, positive_passage) pairs
# Automatically leverages in-batch negatives
train_examples = [
    InputExample(texts=["What is embedding?", "An embedding is a vector representation of data."]),
    InputExample(texts=["How does HNSW work?", "HNSW builds a hierarchical graph for approximate nearest neighbor search."]),
    InputExample(texts=["What is RAG?", "RAG retrieves relevant documents and uses them to augment LLM generation."]),
    # ... more (query, positive) pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss: uses other positives in the batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./mnrl_finetuned_model",
)

Training Data Preparation Strategies

| Data Type | Description | Example |
| --- | --- | --- |
| Natural Pairs | Real user queries and clicked documents | Search log data |
| LLM-Generated | Synthesized query-document pairs using GPT-4 etc. | Auto-generating questions from documents |
| Hard Negatives | Semantically similar but incorrect documents | Non-relevant docs from BM25 search results |
| Cross-Encoder Distillation | Using cross-encoder scores as training targets | Automatic high-quality label generation |
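As a rough illustration of the Hard Negatives row, the sketch below ranks a corpus by token overlap with the query (a simple stand-in for a BM25 retriever) and keeps the top matches that are not the labeled positive. The function and corpus here are hypothetical:

```python
def mine_hard_negatives(query, positive, corpus, top_k=2):
    """Rank corpus docs by token overlap with the query (a stand-in for
    BM25) and keep the best-scoring docs that are not the labeled positive:
    lexically close to the query, but not actually relevant."""
    q_tokens = set(query.lower().split())

    def overlap(doc):
        return len(q_tokens & set(doc.lower().split()))

    ranked = sorted(corpus, key=overlap, reverse=True)
    return [doc for doc in ranked if doc != positive][:top_k]

corpus = [
    "kubectl apply creates Kubernetes resources from a manifest file.",
    "Kubernetes pods can be deleted with kubectl delete pod.",
    "Redis is an in-memory key-value store.",
]
negatives = mine_hard_negatives(
    "How to create a Kubernetes pod?",
    positive=corpus[0],
    corpus=corpus,
)
print(negatives)  # lexically similar but non-relevant docs, led by the kubectl delete doc
```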

Performance Optimization and Evaluation

MTEB Benchmark

MTEB (Massive Text Embedding Benchmark) is the standard benchmark for comprehensively evaluating embedding model performance. It evaluates models across various task categories:

| Task Category | Description | Representative Datasets |
| --- | --- | --- |
| Classification | Text classification | AmazonReviews, TweetSentiment |
| Clustering | Text clustering | ArXiv, Reddit |
| Pair Classification | Sentence pair relation classification | TwitterPara, SprintDuplicateQuestions |
| Reranking | Search result reordering | AskUbuntuDupQuestions, StackOverflowDupQuestions |
| Retrieval | Document retrieval | MSMarco, NQ, HotpotQA |
| STS | Sentence semantic similarity | STSBenchmark, SICK-R |
| Summarization | Summary quality evaluation | SummEval |
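MTEB's Retrieval tasks are scored with ranking metrics such as nDCG@10; for a home-grown evaluation set, Recall@k and MRR are easy to compute yourself. A minimal sketch for a single query, with hypothetical document ids:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs found in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical system output for one query
retrieved = ["d4", "d1", "d9", "d2"]   # ranked by similarity
relevant = {"d1", "d2"}                # ground-truth labels

print(f"Recall@2 = {recall_at_k(retrieved, relevant, 2):.2f}")  # 0.50
print(f"Recall@4 = {recall_at_k(retrieved, relevant, 4):.2f}")  # 1.00
print(f"MRR      = {mrr(retrieved, relevant):.2f}")             # 0.50
```

In practice these are averaged over all queries in the evaluation set.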

Dimensionality Reduction and Matryoshka Representation Learning

from sentence_transformers import SentenceTransformer
import numpy as np

# Model supporting Matryoshka Representation Learning (MRL)
# (this checkpoint ships custom modeling code, so trust_remote_code is required)
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)

texts = [
    "Vector databases store embeddings for similarity search.",
    "Embedding models convert text into numerical representations.",
    "RAG systems combine retrieval with language generation.",
]

# Full-dimension embeddings
full_embeddings = model.encode(texts)
print(f"Full dimensions: {full_embeddings.shape[1]}")  # 768

# Matryoshka: truncate to desired dimension and normalize
def truncate_embeddings(embeddings, target_dim):
    """Dimension reduction using Matryoshka approach"""
    truncated = embeddings[:, :target_dim]
    # L2 normalization
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Compare similarity at various dimensions
for dim in [64, 128, 256, 512, 768]:
    reduced = truncate_embeddings(full_embeddings, dim)
    sim = np.dot(reduced[0], reduced[1])  # Dot product of normalized vectors = cosine similarity
    print(f"  Dimension {dim:>4}: similarity = {sim:.4f}")

Memory Optimization through Quantization

import numpy as np

def scalar_quantize_int8(embeddings):
    """Scalar quantization: float32 -> int8 (75% memory reduction)"""
    min_val = embeddings.min(axis=0)
    max_val = embeddings.max(axis=0)
    scale = (max_val - min_val) / 255.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant dimensions

    # Map to [0, 255], then shift to [-128, 127] so the values fit in int8
    levels = np.clip(np.round((embeddings - min_val) / scale), 0, 255)
    quantized = (levels - 128).astype(np.int8)
    return quantized, min_val, scale

def scalar_dequantize_int8(quantized, min_val, scale):
    """Dequantize: int8 -> float32"""
    return (quantized.astype(np.float32) + 128.0) * scale + min_val

def binary_quantize(embeddings):
    """Binary quantization: float32 -> 1 bit per dimension (32x reduction with bit packing)"""
    return (embeddings > 0).astype(np.uint8)

# Memory comparison
num_vectors = 1_000_000
dimension = 1024
embeddings = np.random.randn(num_vectors, dimension).astype(np.float32)

print(f"Original (float32): {embeddings.nbytes / 1e9:.2f} GB")

quantized, _, _ = scalar_quantize_int8(embeddings)
print(f"int8 quantized:     {quantized.nbytes / 1e9:.2f} GB")

binary = binary_quantize(embeddings)
print(f"Binary quantized:   {binary.nbytes / 1e9:.2f} GB")
# Original (float32): 4.10 GB
# int8 quantized:     1.02 GB
# Binary quantized:   1.02 GB (0.13 GB with actual bit packing)
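The parenthetical 0.13 GB figure comes from actually packing eight sign bits into each byte, which NumPy's `packbits` does directly. A small sketch of packing plus Hamming-distance search over the packed codes (`hamming_search` is an illustrative helper, not a library function):

```python
import numpy as np

# Pack 1-bit codes into bytes: 1024 bits -> 128 bytes per vector
embeddings = np.random.randn(1000, 1024).astype(np.float32)
binary = (embeddings > 0).astype(np.uint8)
packed = np.packbits(binary, axis=1)  # shape: (1000, 128)

print(f"Unpacked: {binary.nbytes / 1e6:.2f} MB, packed: {packed.nbytes / 1e6:.2f} MB")

def hamming_search(query_packed, db_packed, top_k=5):
    """Return indices of the top_k nearest vectors by Hamming distance.

    XOR of packed codes marks differing bits; counting them gives the
    Hamming distance, the natural metric for binary-quantized vectors."""
    xor = np.bitwise_xor(db_packed, query_packed)
    distances = np.unpackbits(xor, axis=1).sum(axis=1)
    return np.argsort(distances)[:top_k]

top = hamming_search(packed[0], packed)
print(top[0])  # the query vector is its own nearest neighbor -> 0
```

A common production pattern is to retrieve a generous candidate set with the cheap binary codes, then rescore only those candidates with the full float32 vectors.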

Production Optimization Checklist

| Optimization | Technique | Effect |
| --- | --- | --- |
| Batch Processing | Batch embedding requests together | 3-5x throughput improvement |
| Caching | Cache frequently used query embeddings | 90% latency reduction |
| Dimension Reduction | Apply Matryoshka or PCA | 2-4x memory/speed improvement |
| Quantization | int8/binary quantization | 4-32x memory reduction |
| GPU Inference | ONNX Runtime or TensorRT | 2-3x inference speed improvement |
| Async Processing | asyncio-based parallel embedding | Overall throughput improvement |
| Model Selection | Choose appropriate model for requirements | Cost-performance optimization |
The batching and caching items from the checklist can be combined into a simple embedding service:

from sentence_transformers import SentenceTransformer
import hashlib

class OptimizedEmbeddingService:
    def __init__(self, model_name="BAAI/bge-m3", cache_size=10000):
        self.model = SentenceTransformer(model_name)
        self.cache = {}
        self.cache_size = cache_size

    def _get_cache_key(self, text):
        return hashlib.md5(text.encode()).hexdigest()

    def encode_with_cache(self, texts, batch_size=64):
        """Generate embeddings with caching"""
        uncached_texts = []
        uncached_indices = []
        results = [None] * len(texts)

        # Check cache hits
        for i, text in enumerate(texts):
            key = self._get_cache_key(text)
            if key in self.cache:
                results[i] = self.cache[key]
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        # Batch embed only cache misses
        if uncached_texts:
            new_embeddings = self.model.encode(
                uncached_texts,
                batch_size=batch_size,
                normalize_embeddings=True,
            )

            for idx, emb in zip(uncached_indices, new_embeddings):
                key = self._get_cache_key(texts[idx])
                self.cache[key] = emb
                results[idx] = emb

                # Manage cache size
                if len(self.cache) > self.cache_size:
                    oldest_key = next(iter(self.cache))
                    del self.cache[oldest_key]

        return results

    def get_cache_stats(self):
        return {"cache_size": len(self.cache), "max_size": self.cache_size}
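The checklist's async row can be sketched with nothing but the standard library: since `model.encode` is a blocking call, running each batch in a thread pool lets batches overlap. Here `encode_batch` is a dummy stand-in for a real encoder, used only to keep the sketch self-contained.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a blocking model.encode call (illustrative only)
def encode_batch(texts):
    return [[float(len(t))] for t in texts]

async def embed_batches(batches, max_workers=4):
    """Overlap blocking encode calls by running each batch in a worker thread."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, encode_batch, b) for b in batches)
        )
    # gather preserves input order, so results line up with batches
    return [emb for batch in results for emb in batch]

batches = [["alpha", "beta"], ["gamma"], ["delta", "epsilon"]]
embeddings = asyncio.run(embed_batches(batches))
print(len(embeddings))  # 5
```

This pattern mainly pays off when the encoder releases the GIL (GPU inference, ONNX Runtime) or when embedding calls go over the network to an API.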

Conclusion

Embedding models are core infrastructure of modern AI systems, playing essential roles in diverse applications including semantic search, RAG, recommendation systems, and anomaly detection. Here is a summary of the key points covered in this article:

  1. Model selection matters: Reference the MTEB benchmark, but evaluating on your actual data is the most accurate approach. Consider BGE-M3 for multilingual support, GTE-Qwen2-7B for top performance, and text-embedding-3-small for cost efficiency.

  2. Choose vector databases based on requirements: Chroma for rapid prototyping, Milvus or Pinecone for production scale, and pgvector for leveraging existing PostgreSQL infrastructure.

  3. Hybrid search outperforms single approaches: Combining Dense (embedding) + Sparse (BM25) with reranking significantly improves search quality.

  4. Fine-tuning is key for domain-specific performance: Using MultipleNegativesRankingLoss with hard negative mining can achieve significant performance improvements even with limited data.

  5. Optimization is essential: Apply dimension reduction (Matryoshka), quantization, caching, and batch processing to optimize cost and latency in production environments.

Embedding technology is rapidly evolving, with new techniques such as Matryoshka Representation Learning, multimodal embeddings, and task-specific LoRA adapters continually emerging. By understanding the core principles and building practical experience, you can construct optimal embedding strategies for your own projects.

References