
SOTA Text Embedding Models: The Heart of Search and RAG


Why Embeddings Matter

No matter how capable a large language model (LLM) is, handling internal documents or fresh information the model never saw requires pulling in external knowledge. Deciding which documents to pull is the job of **retrieval**, and the core tool of modern retrieval is the **text embedding**.

An embedding converts text — a word, sentence, or passage — into a fixed-length real-valued vector. Turning a sentence into a 768-dimensional vector means compressing its "meaning" into 768 numbers. In this vector space, sentences with similar meaning cluster together and dissimilar ones drift apart.

This article covers the principles behind text embedding models, the heart of search and RAG. We look at contrastive learning as a training method, the dual-encoder architecture, hard negative mining, the open-source families represented by E5/BGE/GTE, Matryoshka representation learning, the MTEB benchmark, and finally the relationship with reranking. Because AI SOTA changes fast, we focus on concepts and architectural principles rather than specific rankings or numbers.

Semantic Vectors: The Core Idea

Meaning in Vector Space

Traditional text representations like TF-IDF and BM25 rely on word frequency. Such methods cannot link words that are spelled differently but mean the same thing, such as "puppy" and "dog." This is the lexical mismatch problem.

Dense embeddings represent meaning as vectors, so even with different spellings, close meaning yields close vectors. The vector space can capture relationships like the following.

- "A puppy runs in the park" and "A dog plays outside": close
- "Impact of rate hikes on stocks" and either sentence above: far

Measuring Similarity

How close two embedding vectors are is usually measured by cosine similarity — the cosine of the angle between them. It approaches 1 when directions align and -1 when opposite.

cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

- Range: -1 (opposite) to 1 (same direction)

- Many embedding models output L2-normalized vectors (or are used with normalization enabled), and for unit-length vectors cosine similarity equals the dot product.

A search system embeds the query and computes similarity against precomputed document vectors to find the closest documents. This is nearest neighbor search. With many documents, Approximate Nearest Neighbor (ANN) algorithms and vector databases keep it fast.
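As a minimal sketch of the similarity formula and the exact nearest neighbor search described above, the following pure-Python code scores every document against a query. The function names (`cosine_similarity`, `normalize`, `top_k`) and the 3-dimensional toy vectors are illustrative assumptions, not output from a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def normalize(v):
    """Scale a vector to unit length so that dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def top_k(query_vec, doc_vecs, k):
    """Exact search: score every document, return the k best (index, score)."""
    scored = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Toy vectors with made-up values (assumption for illustration only).
docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.2, 0.1]]
query = [1.0, 0.0, 0.0]
results = top_k(query, docs, k=2)
```

This brute-force loop costs one similarity computation per document, which is the cost that ANN indexes are designed to avoid.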

Contrastive Learning: How to Train Embeddings

The Intuition

A good embedding places texts that should be close near each other and texts that should be far apart far from each other. The main method for this is **contrastive learning**.

Its basic unit is a triplet: an anchor, a positive that shares meaning with the anchor, and a negative unrelated to the anchor.

- anchor (query) → pull together → positive (answer doc)
- anchor (query) → push apart → negative (unrelated doc)

As training proceeds, anchor and positive vectors attract and move closer, while anchor and negative vectors repel. Repeated over vast data, this forms a vector space that reflects semantic structure.

The InfoNCE Loss

The most widely used loss is InfoNCE, which builds on Noise Contrastive Estimation (NCE). It frames training as picking the positive out of many candidates.

InfoNCE loss (anchor q, positive p, negative set N):

$$
L = -\log \frac{\exp(\mathrm{sim}(q, p) / T)}{\exp(\mathrm{sim}(q, p) / T) + \sum_{n \in N} \exp(\mathrm{sim}(q, n) / T)}
$$

- sim: similarity (usually dot product or cosine)
- T: temperature; smaller values sharpen the distribution
- numerator: positive; denominator: positive + all negatives

Intuitively the loss says "raise similarity to the positive, lower it to negatives." Being softmax-based, what matters is how much higher the positive scores relative to negatives.
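To make the loss concrete, here is a minimal sketch that evaluates InfoNCE for a single anchor from precomputed similarity scores. The function name `info_nce` and the default temperature of 0.05 are assumptions chosen for illustration; real training code would compute this on tensors with automatic differentiation.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE loss for one anchor.

    sim_pos:     similarity between the anchor and its positive.
    sim_negs:    list of similarities between the anchor and each negative.
    temperature: T in the formula (0.05 is an assumed example value).
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # -log(exp(pos) / sum(exp(all))) == log_denominator - pos_logit
    return log_denominator - logits[0]

loss_good = info_nce(0.9, [0.1, 0.0])  # positive scores well above negatives
loss_bad = info_nce(0.2, [0.1, 0.0])   # positive barely above negatives
```

Because the loss is a softmax over candidates, `loss_good` comes out smaller than `loss_bad`: only the margin between the positive and the negatives matters, not the absolute scores.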

In-Batch Negatives

How you obtain negatives drives training efficiency. The most common trick is in-batch negatives. With several (query, answer) pairs in a batch, for any query all other pairs' answers act as negatives.

Batch of 4 (query, answer) pairs:

| | ans1 | ans2 | ans3 | ans4 |
| --- | --- | --- | --- | --- |
| q1 | pos | neg | neg | neg |
| q2 | neg | pos | neg | neg |
| q3 | neg | neg | pos | neg |
| q4 | neg | neg | neg | pos |

The diagonal is positive and the rest are negative, all learned at once.

This yields negatives proportional to batch size for free, so larger batch sizes tend to help embedding training.
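The matrix view above can be sketched as a batch loss in which row i treats column i as the positive and every other column as a negative. The function name `in_batch_info_nce` and the temperature value are assumptions for illustration, and the similarity matrix is supplied directly rather than produced by an encoder.

```python
import math

def in_batch_info_nce(sim_matrix, temperature=0.05):
    """Mean InfoNCE loss over a batch with in-batch negatives.

    sim_matrix[i][j] is the similarity between query i and answer j.
    The diagonal holds the positives; off-diagonal entries are negatives.
    """
    n = len(sim_matrix)
    losses = []
    for i, row in enumerate(sim_matrix):
        if len(row) != n:
            raise ValueError("similarity matrix must be square")
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / n

# Toy similarity matrices with made-up values.
well_separated = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
indistinguishable = [[0.5] * 4 for _ in range(4)]
loss_separated = in_batch_info_nce(well_separated)
loss_flat = in_batch_info_nce(indistinguishable)
```

When all candidates score the same, the loss equals log(batch size), which is the value of a uniform guess among the 4 answers.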

Dual-Encoder Architecture

What a Dual Encoder Is

Retrieval embedding models mostly use a **dual encoder (bi-encoder)**. Query and document are encoded independently into vectors, and relevance comes solely from their similarity.

- query → [encoder A] → query vec
- document → [encoder B] → doc vec
- similarity(query vec, doc vec) → relevance score

The key point is that query and document become vectors without seeing each other. This lets you precompute document vectors and store them, so at search time you only encode the query. That precomputability is what makes large-scale retrieval practical.

Often encoders A and B share weights — one encoder handles both. To distinguish roles, models may add a prefix or instruction. The E5 family's use of "query:" and "passage:" prefixes is a well-known example.
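A small helper can keep role prefixes consistent between indexing and querying. This is a sketch only: the helper name `add_role_prefix` is hypothetical, and the exact prefix strings should be confirmed against the model card of whichever model is used.

```python
# Prefix strings in the style described for the E5 family (verify per model).
ROLE_PREFIXES = {"query": "query: ", "passage": "passage: "}

def add_role_prefix(text, role):
    """Prepend the role prefix so one shared encoder can tell roles apart."""
    if role not in ROLE_PREFIXES:
        raise ValueError(f"unknown role: {role!r}")
    return ROLE_PREFIXES[role] + text

query_input = add_role_prefix("how to sort a list in Python", "query")
passage_input = add_role_prefix("Use sorted() or list.sort() to order items.", "passage")
```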

Contrast with Cross Encoders

A frequent point of comparison is the cross encoder. It concatenates query and document into a single input and computes their interaction directly via attention.

[query + [SEP] + document] → [encoder] → single relevance score

The cross encoder sees both together, so it is accurate, but it must be run per query-document pair and cannot precompute, making it slow. Hence cross encoders usually serve reranking while dual encoders handle first-stage retrieval. We revisit this in the reranking section.

Hard Negatives

Why Hard Negatives Help

In-batch negatives are mostly documents wholly unrelated to the query, so they are easy to tell apart. If a model only sees easy negatives, it struggles to filter documents that look similar on the surface but are not actually correct.

So we deliberately construct **hard negatives** — documents that look superficially related to the query but are not the answer.

Query: "How to sort a list in Python"

- Easy negative: "Best cat food recommendations" (unrelated, easy)
- Hard negative: "How to create a list in Python" (same topic but not the answer)

Hard Negative Mining

Hard negatives are usually gathered by mining. A common recipe: use an already-trained retriever to fetch top documents for a query, then take the high-ranking non-answers as hard negatives. Feeding these back into training sharpens the model's ability to spot subtle distinctions.

Beware that mined documents may include false negatives — unlabeled but actually correct answers. Training these as negatives hurts performance, so it is common to skip the very top few or apply extra filtering.
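The mining recipe, including the precaution against false negatives, might look like the sketch below. It assumes a ranked list of document ids has already been produced by an existing retriever; the function name `mine_hard_negatives` and the parameters `skip_top` and `num_negatives` are hypothetical.

```python
def mine_hard_negatives(ranked_doc_ids, positive_ids, skip_top=1, num_negatives=3):
    """Select hard negatives for one query from a retriever's ranking.

    ranked_doc_ids: document ids ordered best-first by an existing retriever.
    positive_ids:   set of ids labeled as correct answers for the query.
    skip_top:       number of top ranks to skip, since the very top results
                    are the most likely to be unlabeled correct answers.
    num_negatives:  how many hard negatives to keep.
    """
    candidates = ranked_doc_ids[skip_top:]
    non_answers = [d for d in candidates if d not in positive_ids]
    return non_answers[:num_negatives]

# Toy ranking with made-up ids.
ranked = ["d7", "d2", "d9", "d4", "d1", "d5"]
hard_negatives = mine_hard_negatives(ranked, positive_ids={"d2"})
```

Skipping by rank is only one heuristic. Filtering by a score threshold or by a stronger model's judgment are alternatives that this sketch does not cover.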

Training Data and Stages

Kinds of Data

Embedding quality depends heavily on data volume and diversity. Common forms:

- Query-document pairs: (query, relevant doc) from search logs, QA datasets

- Natural Language Inference (NLI) data: entailment/contradiction defines positives and negatives

- Sentence-pair similarity data: human-scored similarity pairs

- Naturally paired web data: title-body, question-answer, citation-source

Contrastive Pre-training and Supervised Fine-tuning

Many recent open-source models train in two stages.

1. Stage 1, large-scale weakly-supervised contrastive pre-training: contrastive learning on a vast number of naturally paired texts. Low label quality, very high volume.
2. Stage 2, high-quality supervised fine-tuning: contrastive learning on curated retrieval/QA/NLI data plus hard negatives. Low volume, high quality.

The E5 family is a well-known example that clarified this "weakly-supervised pre-training then supervised fine-tuning" flow. Stage 1 learns broad semantic structure; Stage 2 tunes for retrieval.

E5, BGE, GTE Family Concepts

Below are the concepts of major open-source embedding families. Specific version numbers and benchmark rankings shift with time and version, so we focus on ideas.

E5 Family

E5 reportedly stands for "EmbEddings from bidirectional Encoder rEpresentations" and systematized the two-stage training described above. It marks roles by prefixing queries and documents with "query:" and "passage:". Later variants added multilingual models, larger models, and instruction-based embeddings built on LLM backbones.

BGE Family

BGE stands for BAAI General Embedding, released by the Beijing Academy of Artificial Intelligence. It combines large-scale contrastive pre-training and supervised fine-tuning, broadly supporting many languages and tasks. Notably, some versions extended to combine dense, sparse, and multi-vector representations in a single model, aiming to unify multiple retrieval styles.

GTE Family

GTE stands for General Text Embeddings, released by an Alibaba-affiliated group. It also uses multi-stage contrastive learning and mixes broad-domain data to boost generality.

Family Summary

| Family | Known Origin | Concept | Role Prefix |
| --- | --- | --- | --- |
| E5 | Microsoft research | Weak pre-training + supervised fine-tune | query/passage |
| BGE | BAAI | Multilingual, multi-representation fusion | instruction variants exist |
| GTE | Alibaba-affiliated | Multi-domain mixed contrastive learning | instruction variants exist |

Details here can change with time and version, so check each model card and official docs before adopting.

Matryoshka Representation Learning

Truncatable Dimensions

Larger embedding dimensions improve expressiveness but raise storage and search cost. Enter **Matryoshka Representation Learning (MRL)**, named after Russian nesting dolls.

The core idea: train one large embedding such that using only the leading dimensions still preserves meaning. The first 256 or even first 128 dims of a 768-dim vector remain a decent embedding.

Full embedding (e.g., 768 dims), with nested prefixes:

- first 64
- first 128
- first 256
- first 512
- first 768 (full)

Each prefix contains the one before it, and the model is trained so each truncation point is independently usable.

Training and Benefits

MRL computes a contrastive loss at several truncation points and minimizes their sum. One model then serves embeddings at many dimensions.

In practice: retrieve a large candidate set quickly with short dims (e.g., first 128), then rescore only the top candidates at full dimension. This saves storage and time while minimizing accuracy loss. Truncate too aggressively and accuracy drops, so find the right point by experiment.
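The coarse-then-fine pattern can be sketched as follows, assuming the vectors come from a model trained with MRL so that leading dimensions are meaningful on their own. The function names and toy 4-dimensional vectors are illustrative assumptions. Truncated vectors are re-normalized here so that the dot product still behaves like cosine similarity.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, docs, short_dims, shortlist_size, k):
    """Shortlist with truncated vectors, then rescore at full dimension."""
    q_short = truncate_and_normalize(query, short_dims)
    # Stage 1: cheap scoring on the leading dimensions only.
    # (A real system would store the truncated vectors in the index.)
    shortlist = sorted(
        range(len(docs)),
        key=lambda i: dot(q_short, truncate_and_normalize(docs[i], short_dims)),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: precise scoring at full dimension for the shortlist only.
    q_full = truncate_and_normalize(query, len(query))
    reranked = sorted(
        shortlist,
        key=lambda i: dot(q_full, truncate_and_normalize(docs[i], len(docs[i]))),
        reverse=True,
    )
    return reranked[:k]

# Toy vectors with made-up values: docs 0 and 1 agree on the first 2 dims
# and differ only in the trailing dims.
query = [1.0, 0.0, 0.0, 1.0]
docs = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0]]
best = two_stage_search(query, docs, short_dims=2, shortlist_size=2, k=1)
```

In this toy case the short vectors cannot separate documents 0 and 1, and the full-dimension rescoring resolves the tie, which is the reason for keeping the second stage.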

The MTEB Benchmark

What MTEB Is

Comparing embedding models needs a shared standard. The main one is **MTEB (Massive Text Embedding Benchmark)**, which aggregates many task types into a comprehensive evaluation.

Representative MTEB task types:

- Retrieval: find relevant docs for a query
- Reranking
- Clustering
- Classification
- STS: semantic textual similarity
- Pair Classification, Summarization, etc.

Aggregating across tasks makes it hard for a model that overfits to one task to top the overall ranking. It has since expanded to broadly cover multilingual settings.

Reading Benchmarks Carefully

The MTEB leaderboard is useful, but taking rankings as absolute is risky.

- Rankings and scores shift with version, time, and eval setup. "Currently #1" quickly becomes outdated.

- Benchmark scores do not always match performance on your production data. Different domains can flip rankings.

- Model size, embedding dimension, inference speed, and license also matter in practice.

Treat MTEB as a starting point to shortlist candidates, then evaluate on your own data.

Applying to Search and RAG

Embeddings in a RAG Pipeline

RAG (Retrieval-Augmented Generation) retrieves relevant documents and appends them to the LLM input to generate an answer. Embeddings are the heart of retrieval.

[Index preparation (offline)]

docs → chunking → embedding → vector DB

[Query time (online)]

user query → query embedding → similarity search in vector DB (top-k) → (optional) reranking → docs + query to LLM → generate answer

Chunking and Embedding

Embedding an entire long document blurs meaning, so we split into chunks and embed each. Chunk size and overlap heavily affect quality. Too large mixes in irrelevant content; too small cuts context and weakens meaning.

Models also have different maximum input lengths, so chunk within those limits. Long-context embedding models are increasingly common, but longer input can dilute meaning, which remains a consideration.
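A minimal sketch of fixed-size chunking with overlap is shown below. It splits on whitespace-separated words purely for simplicity. That unit, the function name `chunk_words`, and the default sizes are assumptions; production systems often count tokens with the embedding model's own tokenizer so that chunks respect the model's maximum input length.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of whitespace-separated words.

    chunk_size: maximum words per chunk.
    overlap:    words shared between consecutive chunks, which reduces the
                chance of cutting a relevant passage at a boundary.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

With 10 words, a chunk size of 4, and an overlap of 1, consecutive chunks share exactly one word at their boundary.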

Hybrid Search

Dense embeddings can be weak on exact keyword matches (proper nouns, code, numbers). Hybrid search combines dense retrieval with sparse retrieval like BM25, taking the strengths of both.

- [dense] semantic similarity → candidate list A
- [sparse] keyword match → candidate list B
- A + B → score fusion (e.g., RRF) → final candidate list

RRF is Reciprocal Rank Fusion, a simple, robust method that scores each document by summing 1 / (k + rank) over the result lists it appears in, where k is a smoothing constant.
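Reciprocal Rank Fusion is short enough to sketch in full. Each document receives 1 / (k + rank) from every list it appears in, and the contributions are summed. The function name and the document ids are illustrative; k = 60 is a commonly used default, not a value stated in this article.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    result_lists: iterable of lists, each ordered best-first.
    k:            smoothing constant that dampens the weight of top ranks.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_results = ["a", "b", "c"]   # from semantic similarity
sparse_results = ["b", "d", "a"]  # from keyword match such as BM25
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is a large part of why the method is robust.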

Relationship with Reranking

We contrasted dual and cross encoders earlier. Production search is often two-stage.

- [Stage 1: retrieval (dual encoder)] quickly fetch top-k candidates from the vector DB (e.g., top 100)
- [Stage 2: reranking (cross encoder)] precisely rescore and reorder only those 100 candidates, then pass only the top few to the LLM (e.g., top 5)

Stage 1 prioritizes speed and uses the precomputable dual encoder; Stage 2 prioritizes accuracy and uses the cross encoder that sees query and document together. This avoids scoring every document with a cross encoder while boosting final accuracy. Embedding retrieval and reranking are complementary, not competing.
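The control flow of the two stages can be sketched independently of any particular model. Here `fast_score` and `precise_score` are hypothetical callables standing in for dual-encoder similarity and a cross-encoder score; the toy scoring functions at the bottom are made up to show the flow, not to model real relevance.

```python
def retrieve_then_rerank(query, doc_ids, fast_score, precise_score,
                         retrieve_k=100, final_k=5):
    """Two-stage search: cheap scoring for all docs, costly scoring for few.

    fast_score(query, doc_id):    cheap score, e.g. vector similarity.
    precise_score(query, doc_id): expensive score, e.g. a cross encoder.
    """
    # Stage 1: keep the retrieve_k best candidates by the cheap score.
    candidates = sorted(doc_ids, key=lambda d: fast_score(query, d),
                        reverse=True)[:retrieve_k]
    # Stage 2: the expensive scorer runs only on those candidates.
    reranked = sorted(candidates, key=lambda d: precise_score(query, d),
                      reverse=True)
    return reranked[:final_k]

# Toy setup: 10 documents identified by integers, with made-up scorers.
precise_calls = []

def toy_fast(query, doc_id):
    return -abs(doc_id - 5)

def toy_precise(query, doc_id):
    precise_calls.append(doc_id)
    return doc_id

top = retrieve_then_rerank("q", list(range(10)), toy_fast, toy_precise,
                           retrieve_k=3, final_k=2)
```

In the toy run the expensive scorer is called 3 times rather than 10, which mirrors how a cross encoder avoids scoring the whole corpus.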

Vector Databases and Approximate Nearest Neighbor

Why Approximation Is Needed

With millions or tens of millions of documents, computing similarity between the query vector and every document vector one by one is too slow. In practice we use Approximate Nearest Neighbor (ANN) algorithms. Trading a little accuracy, they boost search speed by tens to hundreds of times.

- [exact search] compare the query vector with every document vector, one by one → accurate but slow, with cost proportional to document count
- [approximate nearest neighbor (ANN)] organize vectors into an index in advance and search only nearby regions → slight chance of misses, much faster

Representative Index Types

ANN indexes come in several forms. The graph-based HNSW (Hierarchical Navigable Small World) connects vectors into a multi-layer graph and traverses toward near neighbors for fast search. Another approach, Product Quantization (PQ), splits vectors into pieces and compresses them, saving much memory. Production vector databases build in such indexes and support search alongside filter conditions (e.g., only a certain document type).

Indexes involve trade-offs among recall, speed, and memory. Tuning parameters to balance "how fast, how accurate" is the heart of practical tuning.

Retrieval Quality Metrics

Quantitatively evaluating embedding and retrieval quality requires metrics. Key ones:

[key retrieval metrics]

- Recall@k: fraction of queries whose answer appears within the top k
- MRR: mean over queries of the reciprocal of the rank where the answer first appears
- nDCG@k: cumulative gain discounted by rank position, normalized by the ideal ordering
- MAP: mean over queries of average precision, useful when a query has multiple answers

Recall@k asks "did we miss the answer," while MRR and nDCG ask "how high did we place it." When only a few top results go to the LLM, as in RAG, top-rank quality (nDCG, MRR) matters especially. Measuring these on your own data to pick models and parameters is far more trustworthy than leaderboards.
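The first three metrics can be computed per query with a few lines each and then averaged over an evaluation set. The sketch below uses binary relevance (a document is either an answer or not) and the hit-style definition of Recall@k given above. The function names and toy ids are assumptions.

```python
import math

def recall_at_k(ranked, relevant, k):
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# One toy query: the single answer "d1" was ranked second.
ranked = ["d3", "d1", "d2"]
relevant = {"d1"}
r1 = recall_at_k(ranked, relevant, 1)
r3 = recall_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
ndcg3 = ndcg_at_k(ranked, relevant, 3)
```

Averaging `reciprocal_rank` over all queries gives MRR, and averaging the other two gives Recall@k and nDCG@k for the evaluation set.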

Fine-Tuning on Your Own Data

If a general-purpose model falls short in your domain, consider fine-tuning on your own data. A common procedure:

[custom fine-tuning flow]

1. collect (query, answer doc) pairs (search logs, manual labels)

2. mine hard negatives (top wrong answers from an existing model)

3. fine-tune with contrastive learning such as InfoNCE

4. measure Recall@k, nDCG on your own eval set

5. deploy if improved, else adjust data/settings

Fine-tuning can greatly boost domain performance, but with little data there is overfitting risk, and changing the model means rebuilding the index. So a sensible order is to first try low-cost improvements like adjusting prefixes or instructions, hybrid search, and reranking, and only move to fine-tuning when those are insufficient.

Deployment Checklist

Points to consider when adopting an embedding model.

- Language: confirm the model supports your target languages well.

- Domain: evaluate on your own data; do not trust leaderboards alone.

- Dimension and cost: dimension drives storage/search cost. MRL support lets you tune dimension to save cost.

- Prefix/instruction: strictly follow required prefixes or instructions; omitting them can hurt a lot.

- Normalization: check whether vectors must be normalized for cosine similarity.

- Reranking: add a cross-encoder reranker if accuracy matters.

- License: verify commercial-use terms.

Limitations and Caveats

Embedding-based retrieval has clear limits.

- Domain shift: performance can drop on domains unlike the training data. Self-evaluation matters most in jargon-heavy fields.

- Exact-match weakness: dense embeddings can miss proper nouns, code, and numbers, so hybrid search is often needed.

- Chunking sensitivity: results vary greatly with chunk size and boundaries.

- Recency: rankings and latest-model info change very fast. The family notes here are for understanding; verify specifics in official docs.

- Bias: training-data bias can reflect in embeddings, requiring validation in sensitive applications.

Closing

Text embeddings are the heart of search and RAG. Contrastive learning and InfoNCE as a training principle, the dual-encoder architecture, hard negatives, Matryoshka representation learning, hybrid search, and reranking — these pieces mesh to form modern semantic search.

Key takeaways: first, an embedding compresses meaning into a vector, trained by contrastive learning to place "close things close." Second, a two-stage design — fast retrieval with dual encoders, precise reranking with cross encoders — is practical. Third, choose models on your own data, not leaderboards. SOTA moves fast, but these principles endure.

References

- InfoNCE / Contrastive Predictive Coding (arXiv 1807.03748): [arxiv.org/abs/1807.03748](https://arxiv.org/abs/1807.03748)

- Dense Passage Retrieval (arXiv 2004.04906): [arxiv.org/abs/2004.04906](https://arxiv.org/abs/2004.04906)

- Sentence-BERT (arXiv 1908.10084): [arxiv.org/abs/1908.10084](https://arxiv.org/abs/1908.10084)

- Text Embeddings by Weakly-Supervised Contrastive Pre-training, E5 (arXiv 2212.03533): [arxiv.org/abs/2212.03533](https://arxiv.org/abs/2212.03533)

- Matryoshka Representation Learning (arXiv 2205.13147): [arxiv.org/abs/2205.13147](https://arxiv.org/abs/2205.13147)

- MTEB: Massive Text Embedding Benchmark (arXiv 2210.07316): [arxiv.org/abs/2210.07316](https://arxiv.org/abs/2210.07316)

- MTEB leaderboard (Hugging Face): [huggingface.co/spaces/mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard)

- BGE (FlagEmbedding) repo: [github.com/FlagOpen/FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)

- Sentence Transformers docs: [sbert.net](https://www.sbert.net)
