
Embedding Model Selection Guide 2025: From OpenAI to Open-Source Options

Introduction

When building a RAG system, one of the first decisions you make is which embedding model to use. Choose poorly and retrieval quality suffers, costs run 10x higher than necessary, or your Korean and Japanese text gets mediocre representations. Yet a surprising number of teams settle this decision with "let's just use OpenAI" and call it done.

This guide compares the major embedding models available in 2025 on real performance, cost, multilingual support, and infrastructure requirements — and gives you a clear decision framework for picking the right one.


What Is an Embedding?

An embedding converts text into a dense vector — a list of numbers that captures semantic meaning. Text with similar meaning maps to nearby points in this vector space.

from openai import OpenAI
import numpy as np

client = OpenAI()

# Text → vector
text = "machine learning is transforming the world"
response = client.embeddings.create(
    input=text,
    model="text-embedding-3-small"
)
embedding = response.data[0].embedding
print(f"Vector dimension: {len(embedding)}")  # 1536

# Semantic similarity via cosine similarity
def cosine_similarity(a: list, b: list) -> float:
    a_np, b_np = np.array(a), np.array(b)
    return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))

text1 = "machine learning is a subfield of AI"
text2 = "deep learning is an artificial intelligence technique"
text3 = "the weather is sunny today"

emb1 = client.embeddings.create(input=text1, model="text-embedding-3-small").data[0].embedding
emb2 = client.embeddings.create(input=text2, model="text-embedding-3-small").data[0].embedding
emb3 = client.embeddings.create(input=text3, model="text-embedding-3-small").data[0].embedding

print(f"AI sentences similarity: {cosine_similarity(emb1, emb2):.3f}")  # ~0.85
print(f"AI vs weather similarity: {cosine_similarity(emb1, emb3):.3f}")  # ~0.30

RAG systems exploit this property to find "semantically most relevant documents" to a query — not exact keyword matches, but meaning-level matches.
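
That retrieval step is just "embed everything once, rank by cosine similarity." A minimal sketch with hand-made toy vectors standing in for real model output (no API calls; the numbers are made up for illustration):

```python
import numpy as np

# Toy 4-dim "embeddings" — stand-ins for real model output
docs = [
    "intro to machine learning",
    "neural network basics",
    "weekend weather forecast",
]
doc_vectors = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.2],
])
query_vector = np.array([0.85, 0.15, 0.05, 0.05])  # "what is ML?"

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize so cosine similarity reduces to a dot product."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

scores = normalize(doc_vectors) @ normalize(query_vector)
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.3f}  {docs[i]}")
```

The two ML documents score near the top and the weather document falls to the bottom, despite zero keyword overlap with the query.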


Major Embedding Models: 2025 Comparison

OpenAI Embeddings

The most widely used commercial embeddings.

from openai import OpenAI
client = OpenAI()

# text-embedding-3-small: fast, cheap, good for most use cases
small_emb = client.embeddings.create(
    input="your text here",
    model="text-embedding-3-small"
).data[0].embedding
# Dimensions: 1536 | Cost: $0.020/1M tokens

# text-embedding-3-large: highest quality, 6.5x more expensive
large_emb = client.embeddings.create(
    input="your text here",
    model="text-embedding-3-large"
).data[0].embedding
# Dimensions: 3072 | Cost: $0.130/1M tokens

# Dimension reduction via Matryoshka Representation Learning
reduced_emb = client.embeddings.create(
    input="your text here",
    model="text-embedding-3-small",
    dimensions=256  # Reduced from 1536 — 6x storage savings
).data[0].embedding

Pros: Zero ops, consistent quality, tight integration with OpenAI ecosystem.

Cons: Cost spikes at scale, data leaves your infrastructure, weaker multilingual performance than dedicated models.
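
To make the small-vs-large cost difference concrete, a quick back-of-envelope using the prices listed above (the document volume and average token count are illustrative assumptions, not benchmarks):

```python
# Prices from the comments above, per 1M tokens
PRICE_PER_1M = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
}

def monthly_cost(model: str, docs_per_month: int, avg_tokens_per_doc: int) -> float:
    """Estimated embedding spend for a month of indexing."""
    tokens = docs_per_month * avg_tokens_per_doc
    return tokens / 1_000_000 * PRICE_PER_1M[model]

# Hypothetical workload: 5M chunks/month at ~300 tokens each
for model in PRICE_PER_1M:
    print(model, f"${monthly_cost(model, 5_000_000, 300):,.2f}")
# → $30.00 for small vs $195.00 for large: the 6.5x price ratio shows up directly
```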

Cohere Embed

The strongest commercial option for multilingual workloads.

import cohere
co = cohere.Client(api_key="your-key")

# 100+ languages including Korean and Japanese
texts = [
    "machine learning is transforming the world",
    "AI is changing how we work",
    "the future of technology is bright"
]

# Note: use different input_type for documents vs queries!
doc_response = co.embed(
    texts=texts,
    model="embed-multilingual-v3.0",
    input_type="search_document"  # For indexing
)

query_response = co.embed(
    texts=["what is machine learning?"],
    model="embed-multilingual-v3.0",
    input_type="search_query"  # For querying — matters for quality!
)

# Compression support: dramatically reduce storage costs
compressed = co.embed(
    texts=texts,
    model="embed-multilingual-v3.0",
    input_type="search_document",
    embedding_types=["float", "int8", "binary"]
)
# binary: 32x storage reduction vs float32!
# Quality drop for binary: minimal for most use cases

Pros: Excellent multilingual (100+ languages), int8/binary compression for 4-32x storage savings, separated document/query encoding improves retrieval quality.

Cons: Slightly more expensive than OpenAI at float32, data goes to Cohere servers.
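
The binary variant deserves a closer look: once embeddings are sign-quantized to one bit per dimension, search becomes XOR plus popcount instead of floating-point dot products. A numpy sketch of the search side, using random stand-in vectors rather than real Cohere output:

```python
import numpy as np

rng = np.random.default_rng(0)
float_embs = rng.normal(size=(1000, 1024)).astype(np.float32)  # stand-in corpus
query = rng.normal(size=1024).astype(np.float32)

def to_binary(x: np.ndarray) -> np.ndarray:
    """Sign-quantize: 1 bit per dimension, packed 8 dims per byte."""
    return np.packbits(x > 0, axis=-1)

doc_bits = to_binary(float_embs)   # (1000, 128) uint8 — 32x smaller than float32
query_bits = to_binary(query)

# Hamming distance via XOR + bit count; smaller = more similar
hamming = np.unpackbits(doc_bits ^ query_bits, axis=-1).sum(axis=-1)
top10 = np.argsort(hamming)[:10]
print(doc_bits.nbytes, "bytes vs", float_embs.nbytes)  # the 32x reduction
```

A common production pattern is binary search for a wide candidate set, then rescoring the candidates with the float embeddings.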

Open-Source Models (Self-Hosted)

When data privacy matters or you're operating at scale, self-hosting wins.

from sentence_transformers import SentenceTransformer
import numpy as np

# BGE-M3: Best open-source multilingual embedding model (2025)
# Strong for Korean, Japanese, Chinese + English
bge_model = SentenceTransformer("BAAI/bge-m3")

texts = ["RAG is retrieval-augmented generation", "embeddings convert text to vectors"]
# BGE-M3 needs no instruction prefix on documents or queries
embeddings = bge_model.encode(texts, normalize_embeddings=True)
print(f"Shape: {embeddings.shape}")  # (2, 1024)

# E5-mistral-7b: Uses a 7B LLM as encoder — highest quality open-source
# Requires 16GB+ GPU RAM
e5_model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
# Requires "Instruct: ...\nQuery: " prefix for queries
query = "Instruct: Retrieve relevant documents\nQuery: how does RAG work?"
doc = "RAG combines retrieval with language model generation"
q_emb = e5_model.encode(query, normalize_embeddings=True)
d_emb = e5_model.encode(doc, normalize_embeddings=True)

# nomic-embed-text-v1.5: Great performance/size ratio
nomic_model = SentenceTransformer(
    "nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True
)
# 768 dimensions, excellent performance, easy to self-host

# multilingual-e5-large: Solid multilingual for CJK languages
ml_e5 = SentenceTransformer("intfloat/multilingual-e5-large")
# Strong for Japanese, Korean, Chinese
# Requires "query: " / "passage: " prefixes on inputs

Reading MTEB Benchmark Results

MTEB (Massive Text Embedding Benchmark) is the standard for evaluating embedding models — 56 datasets across retrieval, classification, clustering, STS, and reranking tasks.

MTEB task types:
- Retrieval:    Most important for RAG. Metric: nDCG@10
- STS:          Sentence similarity. Metric: Spearman correlation
- Classification: Text classification. Metric: Accuracy
- Clustering:   Document grouping. Metric: V-measure
- Reranking:    Result reordering. Metric: MAP

For RAG systems → prioritize Retrieval score
For text classification → prioritize Classification score
For semantic similarity → prioritize STS score
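
The retrieval metric is worth understanding: nDCG@10 rewards placing relevant documents near the top of the ranking, with a logarithmic penalty for each position further down. A minimal implementation to make it concrete (binary relevance, base-2 discount):

```python
import math

def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the best possible ordering (1.0 = perfect ranking)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# The single relevant doc ranked 1st vs ranked 3rd
print(ndcg_at_k([1, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5
```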

2025 MTEB Scores (English-focused)

Model                   Avg    Retrieval   STS    Dim    Cost
E5-mistral-7b           66.6   56.9        84.7   4096   Self-hosted
text-embedding-3-large  64.6   55.4        81.7   3072   $0.13/1M
Cohere embed-v3         64.5   55.0        82.1   1024   $0.10/1M
BGE-M3                  63.2   54.9        79.8   1024   Self-hosted
text-embedding-3-small  62.3   53.2        80.4   1536   $0.02/1M
nomic-embed-text-v1.5   62.0   53.5        79.3    768   Self-hosted

Note: These are English-focused scores. Korean/Japanese performance requires separate evaluation.


Language-Specific Recommendations

Language             Top Choice               Runner-up                Notes
Korean               BGE-M3 (OSS)             Cohere multilingual      BGE-M3 particularly strong for Korean
Japanese             multilingual-e5-large    BGE-M3                   Both excellent
English              E5-mistral-7b            text-embedding-3-large   Open-source beats commercial
Mixed multilingual   Cohere multilingual      BGE-M3                   100+ languages uniformly
Code                 text-embedding-3-large   voyage-code-2            Specialized code models help

Practical Decision Framework

def choose_embedding_model(
    languages: list,         # e.g., ['en'], ['ko', 'en'], ['ja', 'en', 'zh']
    budget: str,             # 'low', 'medium', 'high'
    privacy_required: bool,  # Can data leave your infrastructure?
    scale: str,              # 'small'(<100k/day), 'medium', 'large'(>1M/day)
    use_case: str = 'rag'    # 'rag', 'classification', 'similarity'
) -> dict:
    """
    Returns the optimal embedding model for your situation.
    """

    # Privacy or large scale → self-hosting
    if privacy_required or scale == 'large':
        if 'ko' in languages or 'ja' in languages:
            return {
                "model": "BAAI/bge-m3",
                "reason": "Best OSS multilingual + privacy guaranteed",
                "est_monthly_cost": "GPU server only (~$200-500/mo)"
            }
        if use_case == 'rag' and budget == 'high':
            return {
                "model": "intfloat/e5-mistral-7b-instruct",
                "reason": "Best OSS quality overall, especially English retrieval",
                "est_monthly_cost": "GPU server only (A100 ~$1000/mo)"
            }
        return {
            "model": "nomic-ai/nomic-embed-text-v1.5",
            "reason": "Lightweight, strong performance, easy to self-host",
            "est_monthly_cost": "GPU server only (~$100-200/mo)"
        }

    # 3+ languages → Cohere
    if len(languages) > 2:
        return {
            "model": "cohere/embed-multilingual-v3.0",
            "reason": "Uniformly high quality across 100+ languages",
            "est_monthly_cost": "Usage-based, $100-1000/mo typical"
        }

    # Cost-sensitive + English-primary
    if budget == 'low' and languages == ['en']:
        return {
            "model": "text-embedding-3-small",
            "reason": "Cheapest commercial option, sufficient for most English tasks",
            "est_monthly_cost": "~$20/mo at 1M requests"
        }

    # Default: best commercial quality, no ops
    return {
        "model": "text-embedding-3-large",
        "reason": "Best commercial quality, zero infrastructure burden",
        "est_monthly_cost": "~$130/mo at 1M requests"
    }

# Example usage
recommendation = choose_embedding_model(
    languages=['en'],
    budget='low',
    privacy_required=False,
    scale='small',
    use_case='rag'
)
print(f"Model: {recommendation['model']}")
print(f"Reason: {recommendation['reason']}")

Dimension Reduction: Cut Storage Costs Without Sacrificing Quality

OpenAI's text-embedding-3 models and several open-source models support Matryoshka Representation Learning (MRL). You can truncate to fewer dimensions while retaining most quality.

from openai import OpenAI
client = OpenAI()

# Quality vs cost tradeoff for text-embedding-3-small
# (MTEB Retrieval scores, approximate)
# 1536 dims: 53.2 (baseline)
# 1024 dims: 52.8 (-0.4)
#  512 dims: 52.1 (-1.1)
#  256 dims: 50.9 (-2.3)
# → 256 dims = <5% quality drop, 6x storage savings!

# Storage impact at 1M documents:
# 1536 dims: 1M × 1536 × 4 bytes = 6.14 GB
#  256 dims: 1M × 256 × 4 bytes = 1.02 GB
# Plus faster vector search proportional to dimension count

def get_reduced_embedding(text: str, dimensions: int = 256) -> list:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
        dimensions=dimensions
    )
    return response.data[0].embedding

# For most RAG applications, 256-512 dimensions is the sweet spot:
# - Storage cost 3-6x lower
# - Search speed 2-3x faster
# - Quality loss typically <3%
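
OpenAI's `dimensions` parameter does the truncation server-side. If you instead store full vectors and want to shrink them later, an MRL-trained model's vectors can be truncated client-side, as long as you renormalize afterward so cosine and dot-product scores stay meaningful. A numpy sketch (valid only for MRL-trained models; arbitrary embeddings lose quality badly when truncated):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dimensions: int = 256) -> np.ndarray:
    """Keep the first `dimensions` values, then L2-renormalize."""
    truncated = np.asarray(embedding, dtype=np.float32)[:dimensions]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a stored 1536-dim vector
full = np.random.default_rng(42).normal(size=1536).astype(np.float32)
reduced = truncate_embedding(full, 256)
print(reduced.shape, round(float(np.linalg.norm(reduced)), 3))  # (256,) 1.0
```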

Production Embedding Pipeline

from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingPipeline:
    """Production-ready embedding pipeline"""

    def __init__(self, model_name: str = "BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)
        self.model_name = model_name
        # Model-specific input conventions (BGE-M3 itself needs no prefix)
        self.is_e5_mistral = "e5-mistral" in model_name.lower()
        self.is_e5 = "e5" in model_name.lower() and not self.is_e5_mistral

    def embed_documents(self, texts: list, batch_size: int = 32) -> np.ndarray:
        """Batch embed documents for indexing"""
        all_embeddings = []

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]

            # Model-specific prefixes matter for quality:
            # E5-family models expect "passage: " on documents; BGE-M3 needs none
            if self.is_e5:
                batch = [f"passage: {t}" for t in batch]

            embeddings = self.model.encode(
                batch,
                normalize_embeddings=True,
                show_progress_bar=False
            )
            all_embeddings.append(embeddings)

        return np.vstack(all_embeddings)

    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single query for search"""
        if self.is_e5:
            query = f"query: {query}"
        elif self.is_e5_mistral:
            query = f"Instruct: Retrieve relevant passages\nQuery: {query}"

        return self.model.encode(query, normalize_embeddings=True)

    def search(self, query: str, doc_embeddings: np.ndarray,
               documents: list, top_k: int = 5) -> list:
        """Semantic search over pre-computed document embeddings"""
        query_emb = self.embed_query(query)
        scores = np.dot(doc_embeddings, query_emb)
        top_indices = np.argsort(scores)[::-1][:top_k]

        return [
            {"document": documents[i], "score": float(scores[i])}
            for i in top_indices
        ]

# Usage
pipeline = EmbeddingPipeline("BAAI/bge-m3")

docs = [
    "RAG combines retrieval with language model generation",
    "Embeddings map text to dense vector representations",
    "Vector databases support approximate nearest neighbor search",
    "Prompt engineering optimizes LLM outputs"
]

# Index documents
doc_embeddings = pipeline.embed_documents(docs)

# Search
results = pipeline.search(
    "how do vector databases work?",
    doc_embeddings,
    docs,
    top_k=2
)
for r in results:
    print(f"Score: {r['score']:.3f} | {r['document']}")
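
In production you also want to avoid recomputing embeddings on every restart. A sketch of a simple disk cache keyed by model name and input texts (the file layout, directory name, and `fake_encode` stand-in are illustrative, not a standard API):

```python
import hashlib
from pathlib import Path

import numpy as np

def cache_key(model_name: str, texts: list[str]) -> str:
    """Fingerprint of the model plus the exact inputs."""
    h = hashlib.sha256(model_name.encode())
    for t in texts:
        h.update(t.encode())
    return h.hexdigest()[:16]

def load_or_compute(model_name: str, texts: list[str], compute_fn,
                    cache_dir: str = ".emb_cache") -> np.ndarray:
    path = Path(cache_dir) / f"{cache_key(model_name, texts)}.npz"
    if path.exists():
        return np.load(path)["embeddings"]  # cache hit: no model call
    embeddings = compute_fn(texts)
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, embeddings=embeddings)
    return embeddings

# Fake "model" for illustration: each text maps to its length
fake_encode = lambda texts: np.array([[float(len(t))] for t in texts])
first = load_or_compute("demo-model", ["abc", "de"], fake_encode)
second = load_or_compute("demo-model", ["abc", "de"], fake_encode)  # from cache
print(first.tolist())  # [[3.0], [2.0]]
```

In real use, `compute_fn` would be `pipeline.embed_documents`; hashing the texts means any edit to the corpus produces a fresh cache entry.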

Quick Decision Summary

Starting immediately: text-embedding-3-small — easiest, sufficient for most cases, $0.02/1M tokens.

Korean or Japanese is important: BGE-M3 (self-hosted) or Cohere embed-multilingual-v3.0.

Cost is the top constraint: Self-hosted BGE-M3 or nomic-embed-text-v1.5 — dramatically cheaper at scale.

Maximum quality needed: E5-mistral-7b-instruct — best open-source quality, but requires 16GB+ GPU.

Data privacy is non-negotiable: No commercial APIs. Self-hosted only.

One critical note: switching embedding models requires re-indexing your entire document store. Evaluate thoroughly before committing — changing models mid-project is expensive.
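
One way to make that constraint safer is to store the model name and dimension alongside every vector, so a mismatched model fails loudly at write time instead of silently degrading retrieval. A minimal sketch (the class and field names are illustrative, not any particular vector database's API):

```python
from dataclasses import dataclass

@dataclass
class VectorRecord:
    doc_id: str
    vector: list[float]
    model: str        # e.g. "BAAI/bge-m3"
    dimensions: int

class VectorStore:
    """Toy store that refuses vectors from the wrong embedding model."""

    def __init__(self, model: str, dimensions: int):
        self.model, self.dimensions = model, dimensions
        self.records: list[VectorRecord] = []

    def add(self, record: VectorRecord) -> None:
        if record.model != self.model or record.dimensions != self.dimensions:
            raise ValueError(
                f"index built with {self.model}/{self.dimensions}d, "
                f"got {record.model}/{record.dimensions}d — re-index required"
            )
        self.records.append(record)

store = VectorStore("BAAI/bge-m3", 1024)
store.add(VectorRecord("doc-1", [0.0] * 1024, "BAAI/bge-m3", 1024))
# Adding a text-embedding-3-small vector (1536d) would raise ValueError
```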