임베딩 모델 완전 가이드: 벡터 검색·RAG·Sentence Transformers 실전 활용

Embedding Model Complete Guide

들어가며

임베딩(Embedding)은 현대 AI 시스템의 기반 기술이다. 텍스트, 이미지, 오디오 등 비정형 데이터를 수치 벡터로 변환하여 기계가 "의미"를 이해하고 비교할 수 있게 한다. 특히 RAG(Retrieval-Augmented Generation) 파이프라인이 LLM 애플리케이션의 핵심 아키텍처로 자리 잡으면서, 임베딩 모델의 품질이 전체 시스템 성능을 좌우하는 핵심 요소가 되었다.

2013년 Word2Vec의 등장 이후, GloVe, FastText를 거쳐 BERT 기반 문장 임베딩, 그리고 최근의 Instruction-tuned 대규모 임베딩 모델까지, 이 분야는 빠르게 진화해 왔다. 2024-2025년에는 OpenAI text-embedding-3, Cohere embed-v3, BGE-M3, GTE-Qwen2 등 성능과 다국어 지원이 크게 향상된 모델들이 등장했으며, MTEB(Massive Text Embedding Benchmark) 리더보드에서 치열한 경쟁이 벌어지고 있다.

이 글에서는 임베딩의 기본 원리부터 최신 모델 비교, 벡터 데이터베이스 활용, RAG 통합, 파인튜닝, 성능 평가까지 임베딩 모델의 모든 것을 실전 코드와 함께 체계적으로 다룬다.

임베딩의 기본 개념

임베딩이란 무엇인가

임베딩은 고차원의 이산적(discrete) 데이터를 저차원의 연속적(continuous) 벡터 공간에 매핑하는 기법이다. 핵심 아이디어는 의미적으로 유사한 항목이 벡터 공간에서도 가까이 위치하도록 학습하는 것이다.

# 직관적 이해: 단어를 벡터로 표현
# "왕" = [0.2, 0.8, 0.1, ...]
# "여왕" = [0.3, 0.9, 0.1, ...]
# "남자" = [0.1, 0.2, 0.8, ...]

# 유명한 벡터 산술: king - man + woman ≈ queen
import numpy as np

king = np.array([0.2, 0.8, 0.1, 0.5])
man = np.array([0.1, 0.2, 0.8, 0.4])
woman = np.array([0.15, 0.25, 0.85, 0.6])
queen = np.array([0.3, 0.9, 0.1, 0.7])

result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen              = {queen}")
# 두 벡터가 매우 유사함을 확인할 수 있다

임베딩의 기하학적 직관

벡터 공간에서 임베딩은 다음과 같은 특성을 갖는다:

  • 거리 = 의미 차이: 유사한 의미의 단어/문장은 가까운 거리에 위치
  • 방향 = 관계: 특정 방향이 특정 의미 관계를 인코딩 (예: 성별, 시제, 크기)
  • 클러스터링: 같은 주제나 카테고리의 항목들이 자연스럽게 군집 형성
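
아래는 이러한 직관을 장난감 벡터로 확인해 보는 최소 스케치다. 벡터 값은 실제 모델 출력이 아니라 설명을 위해 임의로 가정한 것이다.

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 장난감 벡터: 같은 주제(과일) 2개와 다른 주제(탈것) 1개
apple = np.array([0.9, 0.8, 0.1])
banana = np.array([0.8, 0.9, 0.2])
truck = np.array([0.1, 0.2, 0.9])

print(f"apple-banana 유사도: {cos(apple, banana):.3f}")  # 높음 -> 같은 군집
print(f"apple-truck  유사도: {cos(apple, truck):.3f}")   # 낮음 -> 다른 군집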

임베딩의 진화

| 세대 | 모델 | 특징 | 차원 |
| --- | --- | --- | --- |
| 1세대 (2013) | Word2Vec, GloVe | 정적 단어 임베딩, 문맥 무시 | 50-300 |
| 2세대 (2018) | ELMo, BERT | 문맥 의존 임베딩, 양방향 | 768-1024 |
| 3세대 (2019) | Sentence-BERT | 문장 수준 임베딩, 효율적 유사도 계산 | 384-768 |
| 4세대 (2023-) | E5, BGE, GTE | Instruction-tuned, 다국어, 대규모 | 768-4096 |
| 5세대 (2024-) | text-embedding-3, Matryoshka | 가변 차원, 다국어, 고성능 | 256-3072 |

주요 임베딩 모델 비교

상용 임베딩 모델

| 모델 | 제공사 | 최대 토큰 | 차원 | MTEB 평균 | 가격 (1M 토큰) |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-large | OpenAI | 8,191 | 3,072 | 64.6 | 약 0.13 달러 |
| text-embedding-3-small | OpenAI | 8,191 | 1,536 | 62.3 | 약 0.02 달러 |
| embed-v3.0 (English) | Cohere | 512 | 1,024 | 64.5 | 약 0.10 달러 |
| embed-v3.0 (Multilingual) | Cohere | 512 | 1,024 | 64.0 | 약 0.10 달러 |
| Voyage-3 | Voyage AI | 32,000 | 1,024 | 67.3 | 약 0.06 달러 |

오픈소스 임베딩 모델

| 모델 | 개발사 | 파라미터 | 차원 | MTEB 평균 | 특징 |
| --- | --- | --- | --- | --- | --- |
| BGE-M3 | BAAI | 568M | 1,024 | 66.1 | 다국어, Dense+Sparse+ColBERT |
| BGE-large-en-v1.5 | BAAI | 335M | 1,024 | 64.2 | 영어 특화 |
| E5-mistral-7b-instruct | Microsoft | 7B | 4,096 | 66.6 | LLM 기반, 고성능 |
| GTE-Qwen2-7B-instruct | Alibaba | 7B | 3,584 | 70.2 | MTEB 최상위권 |
| Jina-embeddings-v3 | Jina AI | 572M | 1,024 | 65.5 | 다국어, Task LoRA |
| nomic-embed-text-v1.5 | Nomic | 137M | 768 | 62.3 | 경량, 8192 토큰 |
| mxbai-embed-large-v1 | Mixedbread | 335M | 1,024 | 64.7 | Matryoshka 지원 |

모델 선택 기준

# 모델 선택 의사결정 트리
def select_embedding_model(requirements):
    """요구사항에 따른 임베딩 모델 선택 가이드"""

    if requirements.get("budget") == "unlimited":
        if requirements.get("max_performance"):
            return "GTE-Qwen2-7B-instruct (자체 호스팅) 또는 Voyage-3 (API)"
        return "text-embedding-3-large (OpenAI API)"

    if requirements.get("multilingual"):
        if requirements.get("self_hosted"):
            return "BGE-M3 (Dense+Sparse 하이브리드)"
        return "Cohere embed-v3 multilingual"

    if requirements.get("low_latency"):
        if requirements.get("self_hosted"):
            return "nomic-embed-text-v1.5 (경량 137M)"
        return "text-embedding-3-small (OpenAI API)"

    if requirements.get("domain_specific"):
        return "Sentence Transformers + 파인튜닝 (기본 모델: BGE or E5)"

    # 기본 추천
    return "text-embedding-3-small (비용 효율적 범용 선택)"

Sentence Transformers 활용

기본 사용법

Sentence Transformers는 문장 수준의 임베딩을 생성하는 가장 널리 사용되는 Python 라이브러리다.

from sentence_transformers import SentenceTransformer
import numpy as np

# 모델 로드
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# 단일 문장 임베딩
sentence = "Embedding models convert text into numerical vectors."
embedding = model.encode(sentence)
print(f"차원: {embedding.shape}")  # (1024,)

# 배치 임베딩
sentences = [
    "임베딩 모델은 텍스트를 수치 벡터로 변환한다.",
    "벡터 검색은 유사한 문서를 빠르게 찾아준다.",
    "RAG는 검색 기반 생성 기법이다.",
    "오늘 날씨가 매우 좋다.",
]

embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(f"임베딩 행렬 크기: {embeddings.shape}")  # (4, 1024)

# 유사도 계산
from sentence_transformers.util import cos_sim

similarity_matrix = cos_sim(embeddings, embeddings)
print("유사도 행렬:")
print(similarity_matrix.numpy().round(3))

BGE-M3 다국어 임베딩

from sentence_transformers import SentenceTransformer

# BGE-M3: 100개 이상의 언어를 지원하는 다국어 임베딩 모델
model = SentenceTransformer('BAAI/bge-m3')

# 다국어 문장 임베딩
sentences = [
    "Machine learning is transforming the world.",        # 영어
    "머신러닝이 세상을 변화시키고 있다.",                      # 한국어
    "機械学習が世界を変えている。",                           # 일본어
    "机器学习正在改变世界。",                                # 중국어
]

embeddings = model.encode(sentences, normalize_embeddings=True)

# 다국어 간 유사도 확인
from sentence_transformers.util import cos_sim

similarities = cos_sim(embeddings, embeddings)
print("다국어 유사도 행렬:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"  '{s1[:20]}...' <-> '{s2[:20]}...': {similarities[i][j]:.4f}")
# 같은 의미의 다른 언어 문장들이 높은 유사도를 보인다

OpenAI 임베딩 API 활용

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_openai_embeddings(texts, model="text-embedding-3-small", dimensions=None):
    """OpenAI 임베딩 생성 (Matryoshka 차원 축소 지원)"""
    kwargs = {"input": texts, "model": model}
    if dimensions:
        kwargs["dimensions"] = dimensions

    response = client.embeddings.create(**kwargs)
    return np.array([item.embedding for item in response.data])

# 기본 사용
texts = ["임베딩 모델의 원리", "벡터 검색 시스템 구축"]
embeddings_full = get_openai_embeddings(texts, model="text-embedding-3-large")
print(f"전체 차원: {embeddings_full.shape}")  # (2, 3072)

# Matryoshka: 차원 축소로 비용/속도 최적화
embeddings_256 = get_openai_embeddings(
    texts, model="text-embedding-3-large", dimensions=256
)
print(f"축소 차원: {embeddings_256.shape}")  # (2, 256)

# 코사인 유사도 비교
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_full = cosine_similarity(embeddings_full[0], embeddings_full[1])
sim_256 = cosine_similarity(embeddings_256[0], embeddings_256[1])
print(f"전체 차원 유사도: {sim_full:.4f}")
print(f"256차원 유사도: {sim_256:.4f}")

벡터 데이터베이스와 인덱싱

벡터 데이터베이스 비교

| 데이터베이스 | 유형 | 인덱스 알고리즘 | 확장성 | 필터링 | 특징 |
| --- | --- | --- | --- | --- | --- |
| Pinecone | 완전 관리형 | 독자 구현 | 높음 | 메타데이터 | 서버리스, 간편한 API |
| Weaviate | 오픈소스/클라우드 | HNSW | 높음 | GraphQL | 하이브리드 검색, 모듈형 |
| Milvus | 오픈소스 | HNSW, IVF, DiskANN | 매우 높음 | 속성 | GPU 가속, 대규모 처리 |
| Chroma | 오픈소스 | HNSW | 중간 | 메타데이터 | 경량, 개발 친화적 |
| FAISS | 라이브러리 | IVF, PQ, HNSW | 높음 | 없음 (별도 구현) | Meta 개발, 최고 성능 |
| Qdrant | 오픈소스/클라우드 | HNSW | 높음 | 페이로드 | Rust 기반, 고성능 |
| pgvector | PostgreSQL 확장 | IVFFlat, HNSW | 중간 | SQL | 기존 PostgreSQL 활용 |

인덱싱 알고리즘 이해

import faiss
import numpy as np
import time

# 테스트 데이터 생성
np.random.seed(42)
dimension = 1024
num_vectors = 1_000_000
num_queries = 100

# 정규화된 랜덤 벡터 생성
data = np.random.randn(num_vectors, dimension).astype('float32')
faiss.normalize_L2(data)
queries = np.random.randn(num_queries, dimension).astype('float32')
faiss.normalize_L2(queries)

# 1. Flat Index (정확 검색, Brute-force)
print("=== Flat Index (Exact Search) ===")
index_flat = faiss.IndexFlatIP(dimension)  # Inner Product (코사인 유사도)
index_flat.add(data)

start = time.time()
D_exact, I_exact = index_flat.search(queries, 10)
flat_time = time.time() - start
print(f"검색 시간: {flat_time:.3f}초")
print(f"Recall@10: 1.000 (정확 검색)")

# 2. IVF (Inverted File Index)
print("\n=== IVF Index ===")
nlist = 1024  # 클러스터 수
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(data)
index_ivf.add(data)
index_ivf.nprobe = 32  # 검색할 클러스터 수

start = time.time()
D_ivf, I_ivf = index_ivf.search(queries, 10)
ivf_time = time.time() - start

# Recall 계산
recall = np.mean([len(set(I_ivf[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"검색 시간: {ivf_time:.3f}초 (x{flat_time/ivf_time:.1f} 빠름)")
print(f"Recall@10: {recall:.3f}")

# 3. HNSW (Hierarchical Navigable Small World)
print("\n=== HNSW Index ===")
# 내적 메트릭은 생성자에서 지정해야 실제 거리 계산에 반영된다
index_hnsw = faiss.IndexHNSWFlat(dimension, 32, faiss.METRIC_INNER_PRODUCT)  # M=32
index_hnsw.hnsw.efConstruction = 200
index_hnsw.hnsw.efSearch = 64
index_hnsw.add(data)

start = time.time()
D_hnsw, I_hnsw = index_hnsw.search(queries, 10)
hnsw_time = time.time() - start

recall = np.mean([len(set(I_hnsw[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"검색 시간: {hnsw_time:.3f}초 (x{flat_time/hnsw_time:.1f} 빠름)")
print(f"Recall@10: {recall:.3f}")

# 4. IVF-PQ (Product Quantization)
print("\n=== IVF-PQ Index (메모리 최적화) ===")
m = 64  # 서브벡터 수
nbits = 8  # 코드북 비트 수
index_ivfpq = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index_ivfpq.train(data)
index_ivfpq.add(data)
index_ivfpq.nprobe = 32

start = time.time()
D_pq, I_pq = index_ivfpq.search(queries, 10)
pq_time = time.time() - start

recall = np.mean([len(set(I_pq[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"검색 시간: {pq_time:.3f}초 (x{flat_time/pq_time:.1f} 빠름)")
print(f"Recall@10: {recall:.3f}")
print(f"메모리: Flat={data.nbytes/1e9:.1f}GB, PQ={index_ivfpq.sa_code_size()*num_vectors/1e9:.3f}GB")

인덱싱 알고리즘 비교

| 알고리즘 | 검색 속도 | Recall | 메모리 사용 | 구축 시간 | 적합 사례 |
| --- | --- | --- | --- | --- | --- |
| Flat | 느림 | 100% | 높음 | 즉시 | 소규모 (10만 이하) |
| IVF | 중간 | 95-99% | 높음 | 중간 | 중규모, 업데이트 빈번 |
| HNSW | 빠름 | 97-99% | 높음+α | 느림 | 고성능 요구, 읽기 위주 |
| IVF-PQ | 빠름 | 90-95% | 낮음 | 중간 | 대규모, 메모리 제한 |
| ScaNN | 매우 빠름 | 95-98% | 중간 | 중간 | 대규모, Google 환경 |

Chroma를 이용한 벡터 저장소 구축

import chromadb
from chromadb.utils import embedding_functions

# Chroma 클라이언트 생성
client = chromadb.PersistentClient(path="./chroma_db")

# Sentence Transformers 임베딩 함수 설정
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-m3"
)

# 컬렉션 생성
collection = client.get_or_create_collection(
    name="tech_documents",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}  # 코사인 유사도 사용
)

# 문서 추가
documents = [
    "Kubernetes는 컨테이너 오케스트레이션 플랫폼이다.",
    "Docker는 애플리케이션을 컨테이너로 패키징하는 도구다.",
    "Prometheus는 메트릭 기반의 모니터링 시스템이다.",
    "Grafana는 데이터 시각화 및 대시보드 도구다.",
    "Terraform은 인프라를 코드로 관리하는 IaC 도구다.",
    "임베딩 모델은 텍스트를 벡터로 변환한다.",
    "RAG는 검색 증강 생성 기법으로 LLM의 환각을 줄인다.",
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[
        {"category": "kubernetes"},
        {"category": "docker"},
        {"category": "monitoring"},
        {"category": "monitoring"},
        {"category": "iac"},
        {"category": "ai"},
        {"category": "ai"},
    ]
)

# 시맨틱 검색
results = collection.query(
    query_texts=["컨테이너 관련 기술은?"],
    n_results=3
)
print("검색 결과:")
for doc, score in zip(results["documents"][0], results["distances"][0]):
    print(f"  [{score:.4f}] {doc}")

# 메타데이터 필터링 + 시맨틱 검색
results_filtered = collection.query(
    query_texts=["모니터링 도구"],
    n_results=2,
    where={"category": "monitoring"}
)
print("\n필터링된 검색 결과:")
for doc in results_filtered["documents"][0]:
    print(f"  {doc}")

유사도 검색과 시맨틱 서치

유사도 메트릭 비교

import numpy as np

def cosine_similarity(a, b):
    """코사인 유사도: 벡터 방향의 유사성 측정"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    """내적: 정규화된 벡터에서는 코사인 유사도와 동일"""
    return np.dot(a, b)

def euclidean_distance(a, b):
    """유클리드 거리: 벡터 간 직선 거리"""
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    """맨하탄 거리: 차원별 절대 차이의 합"""
    return np.sum(np.abs(a - b))

# 예시 벡터
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.1])
c = np.array([-1.0, -2.0, -3.0])

print("벡터 a와 b (매우 유사):")
print(f"  코사인 유사도:  {cosine_similarity(a, b):.4f}")
print(f"  유클리드 거리:  {euclidean_distance(a, b):.4f}")
print(f"  내적:          {dot_product(a, b):.4f}")

print("\n벡터 a와 c (반대 방향):")
print(f"  코사인 유사도:  {cosine_similarity(a, c):.4f}")
print(f"  유클리드 거리:  {euclidean_distance(a, c):.4f}")
print(f"  내적:          {dot_product(a, c):.4f}")

유사도 메트릭 선택 가이드

| 메트릭 | 수식 | 범위 | 정규화 필요 | 사용 사례 |
| --- | --- | --- | --- | --- |
| 코사인 유사도 | cos(a,b) | -1 ~ 1 | 불필요 | 텍스트 유사도 (가장 보편적) |
| 내적 (Dot Product) | a · b | -inf ~ inf | 필요 | 정규화된 임베딩, 검색 랭킹 |
| 유클리드 거리 (L2) | 벡터 차이의 L2 노름 | 0 ~ inf | 권장 | 클러스터링, 이상 탐지 |
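
표의 "정규화 필요" 열이 말해 주듯, L2 정규화된 벡터에서는 세 메트릭이 사실상 같은 순위를 만든다. 아래는 ||a - b||^2 = 2 - 2*cos(a, b) 관계를 확인해 보는 간단한 스케치다.

import numpy as np

a = np.random.randn(1024)
a /= np.linalg.norm(a)
b = np.random.randn(1024)
b /= np.linalg.norm(b)

cos_ab = np.dot(a, b)          # 정규화된 벡터의 내적 = 코사인 유사도
l2_ab = np.linalg.norm(a - b)  # 유클리드 거리

# ||a - b||^2 = 2 - 2*cos(a, b) 이므로 세 메트릭의 순위는 동일하다
print(np.isclose(l2_ab ** 2, 2 - 2 * cos_ab))  # True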

시맨틱 검색 파이프라인 구현

from sentence_transformers import SentenceTransformer, util
import torch

class SemanticSearchEngine:
    def __init__(self, model_name="BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index_documents(self, documents):
        """문서를 인덱싱하여 임베딩 생성"""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            normalize_embeddings=True,
            show_progress_bar=True
        )
        print(f"{len(documents)}개 문서 인덱싱 완료 (차원: {self.embeddings.shape[1]})")

    def search(self, query, top_k=5):
        """시맨틱 검색 수행"""
        query_embedding = self.model.encode(
            query,
            convert_to_tensor=True,
            normalize_embeddings=True
        )

        # 코사인 유사도 계산
        scores = util.cos_sim(query_embedding, self.embeddings)[0]

        # 상위 k개 결과 반환
        top_results = torch.topk(scores, k=min(top_k, len(self.documents)))

        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                "document": self.documents[idx],
                "score": score.item(),
                "index": idx.item()
            })
        return results

    def search_with_reranking(self, query, top_k=5, initial_k=20):
        """2단계 검색: 임베딩 검색 + 리랭킹"""
        from sentence_transformers import CrossEncoder

        # 1단계: 임베딩 기반 후보 검색
        candidates = self.search(query, top_k=initial_k)

        # 2단계: Cross-encoder로 리랭킹
        reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
        pairs = [(query, c["document"]) for c in candidates]
        rerank_scores = reranker.predict(pairs)

        # 리랭킹 결과 반환
        for i, score in enumerate(rerank_scores):
            candidates[i]["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]

# 사용 예시
engine = SemanticSearchEngine()

documents = [
    "Python은 데이터 과학과 머신러닝에 가장 많이 사용되는 프로그래밍 언어다.",
    "JavaScript는 웹 개발의 핵심 언어로 Node.js를 통해 서버 사이드에서도 사용된다.",
    "Kubernetes는 컨테이너화된 애플리케이션의 배포, 스케일링, 관리를 자동화한다.",
    "PostgreSQL은 강력한 오픈소스 관계형 데이터베이스 관리 시스템이다.",
    "TensorFlow와 PyTorch는 딥러닝 모델 개발에 가장 널리 사용되는 프레임워크다.",
    "Redis는 인메모리 데이터 구조 저장소로 캐싱과 메시지 브로커로 활용된다.",
    "Docker는 애플리케이션과 종속성을 컨테이너로 패키징하여 이식성을 제공한다.",
    "GraphQL은 REST의 대안으로 클라이언트가 필요한 데이터만 요청할 수 있게 한다.",
]

engine.index_documents(documents)

# 시맨틱 검색
query = "딥러닝 개발에 어떤 도구를 사용해야 하나요?"
results = engine.search(query, top_k=3)
print(f"\n쿼리: {query}")
for r in results:
    print(f"  [{r['score']:.4f}] {r['document']}")

RAG 파이프라인에서의 임베딩

RAG 아키텍처 개요

RAG(Retrieval-Augmented Generation) 파이프라인에서 임베딩은 검색 단계의 핵심 역할을 한다. 전체 흐름은 다음과 같다:

  1. 문서 전처리: 원본 문서를 적절한 크기의 청크로 분할
  2. 임베딩 생성: 각 청크를 임베딩 벡터로 변환하여 벡터 데이터베이스에 저장
  3. 쿼리 검색: 사용자 질의를 임베딩하여 유사한 청크 검색
  4. 리랭킹: Cross-encoder 등으로 검색 결과 재정렬
  5. 생성: 검색된 컨텍스트와 함께 LLM에 전달하여 답변 생성

RAG 파이프라인 구현

from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import chromadb
from chromadb.utils import embedding_functions
import tiktoken
from typing import List, Dict

class RAGPipeline:
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-m3",
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
        llm_model: str = "gpt-4o",
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.reranker = CrossEncoder(reranker_model)
        self.llm_client = OpenAI()
        self.llm_model = llm_model

        # Chroma 벡터 DB 초기화
        self.chroma_client = chromadb.PersistentClient(path="./rag_db")
        self.collection = self.chroma_client.get_or_create_collection(
            name="rag_documents",
            metadata={"hnsw:space": "cosine"}
        )

    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        """텍스트를 토큰 기반으로 청크 분할"""
        tokenizer = tiktoken.get_encoding("cl100k_base")
        tokens = tokenizer.encode(text)
        chunks = []

        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunk_tokens = tokens[start:end]
            chunk_text = tokenizer.decode(chunk_tokens)
            chunks.append(chunk_text)
            start = end - overlap  # 오버랩 적용

        return chunks

    def ingest_documents(self, documents: List[Dict[str, str]]):
        """문서를 청크 분할하고 벡터 DB에 저장"""
        all_chunks = []
        all_ids = []
        all_metadatas = []

        for doc_idx, doc in enumerate(documents):
            chunks = self.chunk_text(doc["content"])
            for chunk_idx, chunk in enumerate(chunks):
                all_chunks.append(chunk)
                all_ids.append(f"doc{doc_idx}_chunk{chunk_idx}")
                all_metadatas.append({
                    "source": doc.get("source", "unknown"),
                    "doc_index": doc_idx,
                    "chunk_index": chunk_idx,
                })

        # 임베딩 생성 및 저장
        embeddings = self.embedder.encode(all_chunks, normalize_embeddings=True)

        self.collection.add(
            documents=all_chunks,
            embeddings=embeddings.tolist(),
            ids=all_ids,
            metadatas=all_metadatas,
        )
        print(f"{len(documents)}개 문서 -> {len(all_chunks)}개 청크 인덱싱 완료")

    def retrieve(self, query: str, top_k: int = 10) -> List[Dict]:
        """벡터 유사도 기반 검색"""
        query_embedding = self.embedder.encode(
            [query], normalize_embeddings=True
        ).tolist()

        results = self.collection.query(
            query_embeddings=query_embedding,
            n_results=top_k,
        )

        retrieved = []
        for i in range(len(results["documents"][0])):
            retrieved.append({
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "distance": results["distances"][0][i],
            })
        return retrieved

    def rerank(self, query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
        """Cross-encoder 기반 리랭킹"""
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.reranker.predict(pairs)

        for i, score in enumerate(scores):
            candidates[i]["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]

    def generate(self, query: str, context_docs: List[Dict]) -> str:
        """검색된 컨텍스트를 기반으로 LLM 응답 생성"""
        context = "\n\n---\n\n".join([doc["text"] for doc in context_docs])

        messages = [
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the question based on "
                    "the provided context. If the context doesn't contain "
                    "relevant information, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ]

        response = self.llm_client.chat.completions.create(
            model=self.llm_model,
            messages=messages,
            temperature=0.1,
        )
        return response.choices[0].message.content

    def query(self, question: str, top_k_retrieve: int = 10, top_k_rerank: int = 5) -> str:
        """전체 RAG 파이프라인 실행"""
        # 1. 검색
        candidates = self.retrieve(question, top_k=top_k_retrieve)
        print(f"1단계 검색: {len(candidates)}개 후보")

        # 2. 리랭킹
        reranked = self.rerank(question, candidates, top_k=top_k_rerank)
        print(f"2단계 리랭킹: 상위 {len(reranked)}개 선택")

        # 3. 생성
        answer = self.generate(question, reranked)
        return answer

# 사용 예시
rag = RAGPipeline()

# 문서 인제스트
documents = [
    {"content": "긴 기술 문서 내용...", "source": "tech_doc_1.pdf"},
    {"content": "또 다른 문서 내용...", "source": "tech_doc_2.pdf"},
]
rag.ingest_documents(documents)

# 질의
answer = rag.query("임베딩 모델의 차원 수는 성능에 어떤 영향을 미치나요?")
print(f"\n답변: {answer}")

하이브리드 검색 전략

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

class HybridSearchEngine:
    """Dense(임베딩) + Sparse(BM25) 하이브리드 검색"""

    def __init__(self, embedding_model="BAAI/bge-m3"):
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = []
        self.embeddings = None
        self.bm25 = None

    def index(self, documents):
        self.documents = documents

        # Dense: 임베딩 생성
        self.embeddings = self.embedder.encode(
            documents, normalize_embeddings=True, convert_to_tensor=True
        )

        # Sparse: BM25 인덱스 구축
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, top_k=5, alpha=0.7):
        """하이브리드 검색 (alpha: dense 가중치, 1-alpha: sparse 가중치)"""
        # Dense 검색
        query_emb = self.embedder.encode(
            query, normalize_embeddings=True, convert_to_tensor=True
        )
        dense_scores = util.cos_sim(query_emb, self.embeddings)[0].cpu().numpy()

        # Sparse 검색 (BM25)
        sparse_scores = self.bm25.get_scores(query.split())

        # 정규화
        if dense_scores.max() > 0:
            dense_scores = dense_scores / dense_scores.max()
        if sparse_scores.max() > 0:
            sparse_scores = sparse_scores / sparse_scores.max()

        # 가중 합산
        hybrid_scores = alpha * dense_scores + (1 - alpha) * sparse_scores

        # 상위 k개 반환
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
        return [
            {
                "document": self.documents[i],
                "score": hybrid_scores[i],
                "dense_score": dense_scores[i],
                "sparse_score": sparse_scores[i],
            }
            for i in top_indices
        ]
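
아래는 위 클래스의 사용 예시 스케치다. 문서와 쿼리는 설명을 위해 임의로 가정한 값이다.

# 사용 예시 (가정한 문서와 쿼리)
hybrid = HybridSearchEngine()
hybrid.index([
    "Kubernetes는 컨테이너 오케스트레이션 플랫폼이다.",
    "Redis는 인메모리 캐시로 자주 사용된다.",
    "BM25는 키워드 기반의 희소 검색 알고리즘이다.",
])

for r in hybrid.search("키워드 검색 알고리즘 BM25", top_k=2, alpha=0.5):
    print(f"[{r['score']:.3f}] dense={r['dense_score']:.3f} sparse={r['sparse_score']:.3f} {r['document']}")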

임베딩 모델 파인튜닝

왜 파인튜닝이 필요한가

범용 임베딩 모델은 일반적인 텍스트 유사도에서는 좋은 성능을 보이지만, 특정 도메인(의료, 법률, 금융 등)이나 특수한 검색 패턴에서는 성능이 떨어질 수 있다. 파인튜닝을 통해 도메인 특화 성능을 크게 향상시킬 수 있다.

대조 학습(Contrastive Learning) 기반 파인튜닝

from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
    evaluation,
)
from torch.utils.data import DataLoader

# 기본 모델 로드
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# 학습 데이터 준비 (anchor, positive, negative)
train_examples = [
    # (쿼리, 관련 문서, 비관련 문서)
    InputExample(texts=[
        "How to deploy a Kubernetes pod?",
        "kubectl apply -f pod.yaml creates a new pod in the cluster.",
        "Python is a popular programming language for data science."
    ]),
    InputExample(texts=[
        "What is a Docker container?",
        "A Docker container is a lightweight, standalone executable package.",
        "Machine learning models require large datasets for training."
    ]),
    InputExample(texts=[
        "How does Redis caching work?",
        "Redis stores data in memory for fast read/write access as a cache layer.",
        "Kubernetes orchestrates containerized applications across clusters."
    ]),
    # ... 수천 ~ 수만 개의 학습 예시
]

# DataLoader 생성
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# 손실 함수: TripletLoss (anchor, positive, negative)
train_loss = losses.TripletLoss(model=model)

# 평가 데이터
eval_examples = [
    InputExample(texts=["query1", "relevant_doc1"], label=1.0),
    InputExample(texts=["query2", "irrelevant_doc2"], label=0.0),
]
evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    eval_examples, name="domain-eval"
)

# 파인튜닝 실행
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,
    warmup_steps=100,
    evaluation_steps=500,
    output_path="./finetuned_embedding_model",
    save_best_model=True,
)

print("파인튜닝 완료!")

# 파인튜닝된 모델 로드 및 사용
finetuned_model = SentenceTransformer('./finetuned_embedding_model')
embeddings = finetuned_model.encode(["도메인 특화 쿼리"])

MultipleNegativesRankingLoss를 활용한 효율적 학습

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# (query, positive_passage) 쌍만으로 학습 가능
# in-batch negatives를 자동으로 활용
train_examples = [
    InputExample(texts=["What is embedding?", "An embedding is a vector representation of data."]),
    InputExample(texts=["How does HNSW work?", "HNSW builds a hierarchical graph for approximate nearest neighbor search."]),
    InputExample(texts=["What is RAG?", "RAG retrieves relevant documents and uses them to augment LLM generation."]),
    # ... 더 많은 (query, positive) 쌍
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss: 배치 내 다른 positive를 negative로 활용
train_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./mnrl_finetuned_model",
)

학습 데이터 준비 전략

| 데이터 유형 | 설명 | 예시 |
| --- | --- | --- |
| 자연 쌍 | 실제 사용자 쿼리와 클릭한 문서 | 검색 로그 데이터 |
| LLM 생성 | GPT-4 등으로 쿼리-문서 쌍 합성 | 문서에서 자동 질문 생성 |
| Hard Negative | 의미적으로 비슷하지만 정답이 아닌 문서 | BM25 검색 결과 중 비관련 문서 |
| 크로스 인코더 증류 | Cross-encoder 스코어를 학습 타겟으로 활용 | 고품질 레이블 자동 생성 |
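
아래는 표의 Hard Negative 전략을 구현해 보는 최소 스케치다. 앞서 사용한 rank_bm25 라이브러리를 활용하며, 코퍼스와 쿼리는 설명을 위해 임의로 가정한 것이다.

from rank_bm25 import BM25Okapi
from sentence_transformers import InputExample

def mine_hard_negatives(query, positive, corpus, num_negatives=2):
    """BM25 상위 결과 중 정답이 아닌 문서를 hard negative로 선택"""
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

    negatives = [corpus[i] for i in ranked if corpus[i] != positive][:num_negatives]
    # (anchor, positive, negative) 삼중쌍으로 만들어 TripletLoss 학습에 사용
    return [InputExample(texts=[query, positive, neg]) for neg in negatives]

corpus = [
    "kubectl apply -f pod.yaml creates a new pod in the cluster.",
    "Kubernetes pods can be deleted with kubectl delete pod.",
    "Redis stores data in memory for fast access.",
]
examples = mine_hard_negatives("How to deploy a Kubernetes pod?", positive=corpus[0], corpus=corpus)
print(f"{len(examples)}개의 학습 예시 생성")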

성능 최적화와 평가

MTEB 벤치마크

MTEB(Massive Text Embedding Benchmark)는 임베딩 모델의 성능을 종합적으로 평가하는 표준 벤치마크다. 다양한 태스크 카테고리에서 모델을 평가한다:

| 태스크 카테고리 | 설명 | 대표 데이터셋 |
| --- | --- | --- |
| Classification | 텍스트 분류 | AmazonReviews, TweetSentiment |
| Clustering | 텍스트 클러스터링 | ArXiv, Reddit |
| Pair Classification | 문장 쌍 관계 분류 | TwitterPara, SprintDuplicateQuestions |
| Reranking | 검색 결과 재정렬 | AskUbuntuDupQuestions, StackOverflowDupQuestions |
| Retrieval | 문서 검색 | MSMarco, NQ, HotpotQA |
| STS | 문장 의미 유사도 | STSBenchmark, SICK-R |
| Summarization | 요약 품질 평가 | SummEval |
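
아래는 mteb 라이브러리로 일부 태스크만 골라 모델을 직접 평가해 보는 스케치다. 태스크 이름과 출력 경로는 예시로 가정한 값이며, 라이브러리 버전에 따라 API가 다를 수 있다.

# pip install mteb
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# 전체 벤치마크 대신 일부 태스크만 선택해 평가
evaluation = MTEB(tasks=["STSBenchmark", "Banking77Classification"])
results = evaluation.run(model, output_folder="results/bge-base-en-v1.5")
print(results)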

차원 축소와 Matryoshka Representation Learning

from sentence_transformers import SentenceTransformer
import numpy as np

# Matryoshka Representation Learning (MRL) 지원 모델
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5')

texts = [
    "Vector databases store embeddings for similarity search.",
    "Embedding models convert text into numerical representations.",
    "RAG systems combine retrieval with language generation.",
]

# 전체 차원 임베딩
full_embeddings = model.encode(texts)
print(f"전체 차원: {full_embeddings.shape[1]}")  # 768

# Matryoshka: 원하는 차원으로 절삭 후 정규화
def truncate_embeddings(embeddings, target_dim):
    """Matryoshka 방식으로 차원 축소"""
    truncated = embeddings[:, :target_dim]
    # L2 정규화
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# 다양한 차원에서의 유사도 비교
for dim in [64, 128, 256, 512, 768]:
    reduced = truncate_embeddings(full_embeddings, dim)
    sim = np.dot(reduced[0], reduced[1])  # 정규화된 벡터의 내적 = 코사인 유사도
    print(f"  차원 {dim:>4}: 유사도 = {sim:.4f}")

양자화를 통한 메모리 최적화

import numpy as np

def scalar_quantize_int8(embeddings):
    """스칼라 양자화: float32 -> 8비트 정수 (메모리 75% 절감)"""
    min_val = embeddings.min(axis=0)
    max_val = embeddings.max(axis=0)
    scale = (max_val - min_val) / 255.0

    # 0~255 범위로 매핑하므로 int8이 아닌 uint8을 사용해야 오버플로가 없다
    quantized = np.round((embeddings - min_val) / scale).astype(np.uint8)
    return quantized, min_val, scale

def scalar_dequantize_int8(quantized, min_val, scale):
    """역양자화: int8 -> float32"""
    return quantized.astype(np.float32) * scale + min_val

def binary_quantize(embeddings):
    """이진 양자화: float32 -> 1bit (메모리 32배 절감)"""
    return (embeddings > 0).astype(np.uint8)

# 메모리 비교
num_vectors = 1_000_000
dimension = 1024
embeddings = np.random.randn(num_vectors, dimension).astype(np.float32)

print(f"원본 (float32): {embeddings.nbytes / 1e9:.2f} GB")

quantized, _, _ = scalar_quantize_int8(embeddings)
print(f"int8 양자화:    {quantized.nbytes / 1e9:.2f} GB")

binary = binary_quantize(embeddings)
print(f"이진 양자화:    {binary.nbytes / 1e9:.2f} GB")
# 원본 (float32): 4.10 GB
# int8 양자화:    1.02 GB
# 이진 양자화:    1.02 GB (실제 비트 패킹 시 0.13 GB)
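
위 주석처럼 이진 양자화의 32배 절감 효과는 0/1 값을 비트 단위로 패킹해야 실제 메모리에 반영된다. 아래는 앞 예제의 binary, dimension 변수를 그대로 이어서 np.packbits로 패킹하는 최소 스케치다.

# 비트 패킹: 0/1 값 8개를 uint8 1바이트로 압축
packed = np.packbits(binary, axis=1)  # (1_000_000, 1024) -> (1_000_000, 128)
print(f"비트 패킹 후:   {packed.nbytes / 1e9:.2f} GB")  # 약 0.13 GB

# 필요 시 다시 풀어서 사용
unpacked = np.unpackbits(packed, axis=1)[:, :dimension]
assert np.array_equal(unpacked, binary)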

프로덕션 최적화 체크리스트

| 최적화 항목 | 기법 | 효과 |
| --- | --- | --- |
| 배치 처리 | 임베딩 요청을 배치로 묶어 처리 | 처리량 3-5배 향상 |
| 캐싱 | 자주 사용되는 쿼리 임베딩 캐시 | 지연 시간 90% 감소 |
| 차원 축소 | Matryoshka 또는 PCA 적용 | 메모리/속도 2-4배 향상 |
| 양자화 | int8/이진 양자화 | 메모리 4-32배 절감 |
| GPU 추론 | ONNX Runtime 또는 TensorRT | 추론 속도 2-3배 향상 |
| 비동기 처리 | asyncio 기반 병렬 임베딩 | 전체 처리량 향상 |
| 모델 선택 | 요구사항에 맞는 적정 모델 | 비용 대비 성능 최적화 |

from sentence_transformers import SentenceTransformer
import hashlib

class OptimizedEmbeddingService:
    def __init__(self, model_name="BAAI/bge-m3", cache_size=10000):
        self.model = SentenceTransformer(model_name)
        self.cache = {}
        self.cache_size = cache_size

    def _get_cache_key(self, text):
        return hashlib.md5(text.encode()).hexdigest()

    def encode_with_cache(self, texts, batch_size=64):
        """캐싱을 적용한 임베딩 생성"""
        uncached_texts = []
        uncached_indices = []
        results = [None] * len(texts)

        # 캐시 히트 확인
        for i, text in enumerate(texts):
            key = self._get_cache_key(text)
            if key in self.cache:
                results[i] = self.cache[key]
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        # 캐시 미스 텍스트만 배치 임베딩
        if uncached_texts:
            new_embeddings = self.model.encode(
                uncached_texts,
                batch_size=batch_size,
                normalize_embeddings=True,
            )

            for idx, emb in zip(uncached_indices, new_embeddings):
                key = self._get_cache_key(texts[idx])
                self.cache[key] = emb
                results[idx] = emb

                # 캐시 크기 관리
                if len(self.cache) > self.cache_size:
                    oldest_key = next(iter(self.cache))
                    del self.cache[oldest_key]

        return results

    def get_cache_stats(self):
        return {"cache_size": len(self.cache), "max_size": self.cache_size}

마치며

임베딩 모델은 현대 AI 시스템의 핵심 인프라로, 시맨틱 검색, RAG, 추천 시스템, 이상 탐지 등 다양한 응용 분야에서 필수적인 역할을 한다. 이 글에서 다룬 핵심 사항을 정리하면 다음과 같다:

  1. 모델 선택이 중요하다: MTEB 벤치마크를 참고하되, 실제 데이터로 평가하는 것이 가장 정확하다. 다국어 지원이 필요하면 BGE-M3, 최고 성능이 필요하면 GTE-Qwen2-7B, 비용 효율이 중요하면 text-embedding-3-small을 고려하라.

  2. 벡터 데이터베이스는 요구사항에 맞게 선택하라: 빠른 프로토타이핑에는 Chroma, 프로덕션 규모에는 Milvus나 Pinecone, 기존 PostgreSQL 활용에는 pgvector가 적합하다.

  3. 하이브리드 검색이 단일 방식보다 우수하다: Dense(임베딩) + Sparse(BM25) 조합에 리랭킹을 추가하면 검색 품질이 크게 향상된다.

  4. 파인튜닝은 도메인 특화 성능의 핵심이다: MultipleNegativesRankingLoss와 hard negative mining을 활용하면 적은 데이터로도 상당한 성능 향상을 얻을 수 있다.

  5. 최적화는 필수다: 차원 축소(Matryoshka), 양자화, 캐싱, 배치 처리 등을 적용하여 프로덕션 환경에서의 비용과 지연 시간을 최적화하라.

임베딩 기술은 빠르게 발전하고 있으며, Matryoshka Representation Learning, 멀티모달 임베딩, Task-specific LoRA 어댑터 등 새로운 기법이 계속 등장하고 있다. 핵심 원리를 이해하고 실전 경험을 쌓아, 자신의 프로젝트에 최적의 임베딩 전략을 구축하기 바란다.


Complete Guide to Embedding Models: Vector Search, RAG, and Sentence Transformers in Practice

Embedding Model Complete Guide

Introduction

Embeddings are a foundational technology of modern AI systems. By converting unstructured data such as text, images, and audio into numerical vectors, they enable machines to understand and compare "meaning." With RAG (Retrieval-Augmented Generation) pipelines becoming the core architecture of LLM applications, the quality of embedding models has become a critical factor that determines overall system performance.

Since the advent of Word2Vec in 2013, the field has evolved rapidly through GloVe and FastText, then BERT-based sentence embeddings, and recently to instruction-tuned large-scale embedding models. In 2024-2025, models with significantly improved performance and multilingual support emerged, including OpenAI text-embedding-3, Cohere embed-v3, BGE-M3, and GTE-Qwen2, with fierce competition on the MTEB (Massive Text Embedding Benchmark) leaderboard.

This article systematically covers everything about embedding models, from fundamental principles to the latest model comparisons, vector database utilization, RAG integration, fine-tuning, and performance evaluation, all accompanied by practical code examples.

Fundamentals of Embeddings

What Are Embeddings?

An embedding is a technique that maps high-dimensional discrete data into a lower-dimensional continuous vector space. The core idea is to learn representations where semantically similar items are positioned close together in the vector space.

# Intuitive understanding: representing words as vectors
# "king" = [0.2, 0.8, 0.1, ...]
# "queen" = [0.3, 0.9, 0.1, ...]
# "man" = [0.1, 0.2, 0.8, ...]

# The famous vector arithmetic: king - man + woman ≈ queen
import numpy as np

king = np.array([0.2, 0.8, 0.1, 0.5])
man = np.array([0.1, 0.2, 0.8, 0.4])
woman = np.array([0.15, 0.25, 0.85, 0.6])
queen = np.array([0.3, 0.9, 0.1, 0.7])

result = king - man + woman
print(f"king - man + woman = {result}")
print(f"queen              = {queen}")
# The two vectors are very similar

Geometric Intuition of Embeddings

In vector space, embeddings exhibit the following properties:

  • Distance = Semantic Difference: Words/sentences with similar meanings are positioned at close distances
  • Direction = Relationship: Specific directions encode specific semantic relationships (e.g., gender, tense, size)
  • Clustering: Items belonging to the same topic or category naturally form clusters
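
The following is a minimal sketch of this intuition using toy vectors. The values are made up for illustration, not actual model outputs.

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors: two "fruit" items and one "vehicle" item
apple = np.array([0.9, 0.8, 0.1])
banana = np.array([0.8, 0.9, 0.2])
truck = np.array([0.1, 0.2, 0.9])

print(f"apple-banana similarity: {cos(apple, banana):.3f}")  # high -> same cluster
print(f"apple-truck  similarity: {cos(apple, truck):.3f}")   # low  -> different cluster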

Evolution of Embeddings

| Generation | Model | Characteristics | Dimensions |
| --- | --- | --- | --- |
| 1st Gen (2013) | Word2Vec, GloVe | Static word embeddings, context-independent | 50-300 |
| 2nd Gen (2018) | ELMo, BERT | Context-dependent embeddings, bidirectional | 768-1024 |
| 3rd Gen (2019) | Sentence-BERT | Sentence-level embeddings, efficient similarity computation | 384-768 |
| 4th Gen (2023-) | E5, BGE, GTE | Instruction-tuned, multilingual, large-scale | 768-4096 |
| 5th Gen (2024-) | text-embedding-3, Matryoshka | Variable dimensions, multilingual, high-performance | 256-3072 |

Comparing Key Embedding Models

Commercial Embedding Models

| Model | Provider | Max Tokens | Dimensions | MTEB Average | Price (1M tokens) |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-large | OpenAI | 8,191 | 3,072 | 64.6 | ~0.13 USD |
| text-embedding-3-small | OpenAI | 8,191 | 1,536 | 62.3 | ~0.02 USD |
| embed-v3.0 (English) | Cohere | 512 | 1,024 | 64.5 | ~0.10 USD |
| embed-v3.0 (Multilingual) | Cohere | 512 | 1,024 | 64.0 | ~0.10 USD |
| Voyage-3 | Voyage AI | 32,000 | 1,024 | 67.3 | ~0.06 USD |

Open-Source Embedding Models

| Model | Developer | Parameters | Dimensions | MTEB Average | Features |
| --- | --- | --- | --- | --- | --- |
| BGE-M3 | BAAI | 568M | 1,024 | 66.1 | Multilingual, Dense+Sparse+ColBERT |
| BGE-large-en-v1.5 | BAAI | 335M | 1,024 | 64.2 | English-optimized |
| E5-mistral-7b-instruct | Microsoft | 7B | 4,096 | 66.6 | LLM-based, high-performance |
| GTE-Qwen2-7B-instruct | Alibaba | 7B | 3,584 | 70.2 | Top MTEB ranking |
| Jina-embeddings-v3 | Jina AI | 572M | 1,024 | 65.5 | Multilingual, Task LoRA |
| nomic-embed-text-v1.5 | Nomic | 137M | 768 | 62.3 | Lightweight, 8192 tokens |
| mxbai-embed-large-v1 | Mixedbread | 335M | 1,024 | 64.7 | Matryoshka support |

Model Selection Criteria

# Model selection decision tree
def select_embedding_model(requirements):
    """Embedding model selection guide based on requirements"""

    if requirements.get("budget") == "unlimited":
        if requirements.get("max_performance"):
            return "GTE-Qwen2-7B-instruct (self-hosted) or Voyage-3 (API)"
        return "text-embedding-3-large (OpenAI API)"

    if requirements.get("multilingual"):
        if requirements.get("self_hosted"):
            return "BGE-M3 (Dense+Sparse hybrid)"
        return "Cohere embed-v3 multilingual"

    if requirements.get("low_latency"):
        if requirements.get("self_hosted"):
            return "nomic-embed-text-v1.5 (lightweight 137M)"
        return "text-embedding-3-small (OpenAI API)"

    if requirements.get("domain_specific"):
        return "Sentence Transformers + fine-tuning (base model: BGE or E5)"

    # Default recommendation
    return "text-embedding-3-small (cost-effective general-purpose choice)"

Using Sentence Transformers

Basic Usage

Sentence Transformers is the most widely used Python library for generating sentence-level embeddings.

from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Single sentence embedding
sentence = "Embedding models convert text into numerical vectors."
embedding = model.encode(sentence)
print(f"Dimensions: {embedding.shape}")  # (1024,)

# Batch embeddings
sentences = [
    "Embedding models convert text into numerical vectors.",
    "Vector search quickly finds similar documents.",
    "RAG is a retrieval-augmented generation technique.",
    "The weather is very nice today.",
]

embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(f"Embedding matrix shape: {embeddings.shape}")  # (4, 1024)

# Similarity computation
from sentence_transformers.util import cos_sim

similarity_matrix = cos_sim(embeddings, embeddings)
print("Similarity matrix:")
print(similarity_matrix.numpy().round(3))

BGE-M3 Multilingual Embeddings

from sentence_transformers import SentenceTransformer

# BGE-M3: A multilingual embedding model supporting 100+ languages
model = SentenceTransformer('BAAI/bge-m3')

# Multilingual sentence embeddings
sentences = [
    "Machine learning is transforming the world.",        # English
    "머신러닝이 세상을 변화시키고 있다.",                      # Korean
    "機械学習が世界を変えている。",                           # Japanese
    "机器学习正在改变世界。",                                # Chinese
]

embeddings = model.encode(sentences, normalize_embeddings=True)

# Cross-lingual similarity check
from sentence_transformers.util import cos_sim

similarities = cos_sim(embeddings, embeddings)
print("Cross-lingual similarity matrix:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"  '{s1[:30]}...' <-> '{s2[:30]}...': {similarities[i][j]:.4f}")
# Sentences with the same meaning in different languages show high similarity

Using the OpenAI Embedding API

from openai import OpenAI
import numpy as np

client = OpenAI()

def get_openai_embeddings(texts, model="text-embedding-3-small", dimensions=None):
    """Generate OpenAI embeddings (supports Matryoshka dimension reduction)"""
    kwargs = {"input": texts, "model": model}
    if dimensions:
        kwargs["dimensions"] = dimensions

    response = client.embeddings.create(**kwargs)
    return np.array([item.embedding for item in response.data])

# Basic usage
texts = ["Principles of embedding models", "Building vector search systems"]
embeddings_full = get_openai_embeddings(texts, model="text-embedding-3-large")
print(f"Full dimensions: {embeddings_full.shape}")  # (2, 3072)

# Matryoshka: optimize cost/speed through dimension reduction
embeddings_256 = get_openai_embeddings(
    texts, model="text-embedding-3-large", dimensions=256
)
print(f"Reduced dimensions: {embeddings_256.shape}")  # (2, 256)

# Cosine similarity comparison
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

sim_full = cosine_similarity(embeddings_full[0], embeddings_full[1])
sim_256 = cosine_similarity(embeddings_256[0], embeddings_256[1])
print(f"Full dimension similarity: {sim_full:.4f}")
print(f"256-dimension similarity: {sim_256:.4f}")

Vector Databases and Indexing

Vector Database Comparison

| Database | Type | Index Algorithm | Scalability | Filtering | Features |
| --- | --- | --- | --- | --- | --- |
| Pinecone | Fully managed | Proprietary | High | Metadata | Serverless, simple API |
| Weaviate | Open-source/Cloud | HNSW | High | GraphQL | Hybrid search, modular |
| Milvus | Open-source | HNSW, IVF, DiskANN | Very high | Attribute | GPU acceleration, large-scale |
| Chroma | Open-source | HNSW | Medium | Metadata | Lightweight, developer-friendly |
| FAISS | Library | IVF, PQ, HNSW | High | None (separate impl.) | Meta-developed, top performance |
| Qdrant | Open-source/Cloud | HNSW | High | Payload | Rust-based, high-performance |
| pgvector | PostgreSQL extension | IVFFlat, HNSW | Medium | SQL | Leverages existing PostgreSQL |

Understanding Indexing Algorithms

import faiss
import numpy as np
import time

# Generate test data
np.random.seed(42)
dimension = 1024
num_vectors = 1_000_000
num_queries = 100

# Generate normalized random vectors
data = np.random.randn(num_vectors, dimension).astype('float32')
faiss.normalize_L2(data)
queries = np.random.randn(num_queries, dimension).astype('float32')
faiss.normalize_L2(queries)

# 1. Flat Index (Exact Search, Brute-force)
print("=== Flat Index (Exact Search) ===")
index_flat = faiss.IndexFlatIP(dimension)  # Inner Product (cosine similarity)
index_flat.add(data)

start = time.time()
D_exact, I_exact = index_flat.search(queries, 10)
flat_time = time.time() - start
print(f"Search time: {flat_time:.3f}s")
print(f"Recall@10: 1.000 (exact search)")

# 2. IVF (Inverted File Index)
print("\n=== IVF Index ===")
nlist = 1024  # Number of clusters
quantizer = faiss.IndexFlatIP(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)
index_ivf.train(data)
index_ivf.add(data)
index_ivf.nprobe = 32  # Number of clusters to search

start = time.time()
D_ivf, I_ivf = index_ivf.search(queries, 10)
ivf_time = time.time() - start

# Calculate recall
recall = np.mean([len(set(I_ivf[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {ivf_time:.3f}s (x{flat_time/ivf_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")

# 3. HNSW (Hierarchical Navigable Small World)
print("\n=== HNSW Index ===")
# The metric must be passed to the constructor to affect the actual distance computation
index_hnsw = faiss.IndexHNSWFlat(dimension, 32, faiss.METRIC_INNER_PRODUCT)  # M=32
index_hnsw.hnsw.efConstruction = 200
index_hnsw.hnsw.efSearch = 64
index_hnsw.add(data)

start = time.time()
D_hnsw, I_hnsw = index_hnsw.search(queries, 10)
hnsw_time = time.time() - start

recall = np.mean([len(set(I_hnsw[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {hnsw_time:.3f}s (x{flat_time/hnsw_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")

# 4. IVF-PQ (Product Quantization)
print("\n=== IVF-PQ Index (Memory Optimized) ===")
m = 64  # Number of sub-vectors
nbits = 8  # Codebook bits
index_ivfpq = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)
index_ivfpq.train(data)
index_ivfpq.add(data)
index_ivfpq.nprobe = 32

start = time.time()
D_pq, I_pq = index_ivfpq.search(queries, 10)
pq_time = time.time() - start

recall = np.mean([len(set(I_pq[i]) & set(I_exact[i])) / 10 for i in range(num_queries)])
print(f"Search time: {pq_time:.3f}s (x{flat_time/pq_time:.1f} faster)")
print(f"Recall@10: {recall:.3f}")
print(f"Memory: Flat={data.nbytes/1e9:.1f}GB, PQ={index_ivfpq.sa_code_size()*num_vectors/1e9:.3f}GB")

Indexing Algorithm Comparison

| Algorithm | Search Speed | Recall | Memory Usage | Build Time | Best For |
| --- | --- | --- | --- | --- | --- |
| Flat | Slow | 100% | High | Instant | Small scale (under 100K) |
| IVF | Medium | 95-99% | High | Medium | Medium scale, frequent updates |
| HNSW | Fast | 97-99% | High+overhead | Slow | High-performance, read-heavy |
| IVF-PQ | Fast | 90-95% | Low | Medium | Large scale, memory-constrained |
| ScaNN | Very fast | 95-98% | Medium | Medium | Large scale, Google ecosystem |

Building a Vector Store with Chroma

import chromadb
from chromadb.utils import embedding_functions

# Create Chroma client
client = chromadb.PersistentClient(path="./chroma_db")

# Set up Sentence Transformers embedding function
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-m3"
)

# Create collection
collection = client.get_or_create_collection(
    name="tech_documents",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}  # Use cosine similarity
)

# Add documents
documents = [
    "Kubernetes is a container orchestration platform.",
    "Docker is a tool for packaging applications into containers.",
    "Prometheus is a metrics-based monitoring system.",
    "Grafana is a data visualization and dashboard tool.",
    "Terraform is an IaC tool for managing infrastructure as code.",
    "Embedding models convert text into vectors.",
    "RAG is a retrieval-augmented generation technique that reduces LLM hallucinations.",
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[
        {"category": "kubernetes"},
        {"category": "docker"},
        {"category": "monitoring"},
        {"category": "monitoring"},
        {"category": "iac"},
        {"category": "ai"},
        {"category": "ai"},
    ]
)

# Semantic search
results = collection.query(
    query_texts=["What are container-related technologies?"],
    n_results=3
)
print("Search results:")
for doc, score in zip(results["documents"][0], results["distances"][0]):
    print(f"  [{score:.4f}] {doc}")

# Metadata filtering + semantic search
results_filtered = collection.query(
    query_texts=["monitoring tools"],
    n_results=2,
    where={"category": "monitoring"}
)
print("\nFiltered search results:")
for doc in results_filtered["documents"][0]:
    print(f"  {doc}")

Similarity Search and Semantic Search

Comparing Similarity Metrics

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: measures directional similarity of vectors"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    """Dot product: equivalent to cosine similarity for normalized vectors"""
    return np.dot(a, b)

def euclidean_distance(a, b):
    """Euclidean distance: straight-line distance between vectors"""
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    """Manhattan distance: sum of absolute differences per dimension"""
    return np.sum(np.abs(a - b))

# Example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.1])
c = np.array([-1.0, -2.0, -3.0])

print("Vectors a and b (very similar):")
print(f"  Cosine similarity:   {cosine_similarity(a, b):.4f}")
print(f"  Euclidean distance:  {euclidean_distance(a, b):.4f}")
print(f"  Dot product:         {dot_product(a, b):.4f}")

print("\nVectors a and c (opposite direction):")
print(f"  Cosine similarity:   {cosine_similarity(a, c):.4f}")
print(f"  Euclidean distance:  {euclidean_distance(a, c):.4f}")
print(f"  Dot product:         {dot_product(a, c):.4f}")

Similarity Metric Selection Guide

| Metric | Formula | Range | Normalization Required | Use Case |
| --- | --- | --- | --- | --- |
| Cosine Similarity | cos(a,b) | -1 to 1 | Not required | Text similarity (most common) |
| Dot Product | a · b | -inf to inf | Required | Normalized embeddings, search ranking |
| Euclidean Distance (L2) | L2 norm of vector difference | 0 to inf | Recommended | Clustering, anomaly detection |
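
As the "Normalization Required" column suggests, the three metrics produce essentially the same ranking for L2-normalized vectors. A quick sketch verifying the identity ||a - b||^2 = 2 - 2*cos(a, b):

import numpy as np

a = np.random.randn(1024)
a /= np.linalg.norm(a)
b = np.random.randn(1024)
b /= np.linalg.norm(b)

cos_ab = np.dot(a, b)          # dot product of normalized vectors = cosine similarity
l2_ab = np.linalg.norm(a - b)  # Euclidean distance

# ||a - b||^2 = 2 - 2*cos(a, b), so all three metrics induce the same ranking
print(np.isclose(l2_ab ** 2, 2 - 2 * cos_ab))  # True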

Implementing a Semantic Search Pipeline

from sentence_transformers import SentenceTransformer, util
import torch

class SemanticSearchEngine:
    def __init__(self, model_name="BAAI/bge-m3"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index_documents(self, documents):
        """Index documents by generating embeddings"""
        self.documents = documents
        self.embeddings = self.model.encode(
            documents,
            convert_to_tensor=True,
            normalize_embeddings=True,
            show_progress_bar=True
        )
        print(f"Indexed {len(documents)} documents (dimensions: {self.embeddings.shape[1]})")

    def search(self, query, top_k=5):
        """Perform semantic search"""
        query_embedding = self.model.encode(
            query,
            convert_to_tensor=True,
            normalize_embeddings=True
        )

        # Calculate cosine similarity
        scores = util.cos_sim(query_embedding, self.embeddings)[0]

        # Return top k results
        top_results = torch.topk(scores, k=min(top_k, len(self.documents)))

        results = []
        for score, idx in zip(top_results.values, top_results.indices):
            results.append({
                "document": self.documents[idx],
                "score": score.item(),
                "index": idx.item()
            })
        return results

    def search_with_reranking(self, query, top_k=5, initial_k=20):
        """Two-stage search: embedding retrieval + reranking"""
        from sentence_transformers import CrossEncoder

        # Stage 1: Embedding-based candidate retrieval
        candidates = self.search(query, top_k=initial_k)

        # Stage 2: Reranking with cross-encoder
        reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
        pairs = [(query, c["document"]) for c in candidates]
        rerank_scores = reranker.predict(pairs)

        # Return reranked results
        for i, score in enumerate(rerank_scores):
            candidates[i]["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]

# Usage example
engine = SemanticSearchEngine()

documents = [
    "Python is the most widely used programming language for data science and machine learning.",
    "JavaScript is the core language for web development, also used server-side through Node.js.",
    "Kubernetes automates deployment, scaling, and management of containerized applications.",
    "PostgreSQL is a powerful open-source relational database management system.",
    "TensorFlow and PyTorch are the most widely used frameworks for deep learning model development.",
    "Redis is an in-memory data structure store used as a cache and message broker.",
    "Docker packages applications and their dependencies into containers for portability.",
    "GraphQL is an alternative to REST that allows clients to request only the data they need.",
]

engine.index_documents(documents)

# Semantic search
query = "What tools should I use for deep learning development?"
results = engine.search(query, top_k=3)
print(f"\nQuery: {query}")
for r in results:
    print(f"  [{r['score']:.4f}] {r['document']}")

Embeddings in RAG Pipelines

RAG Architecture Overview

In a RAG (Retrieval-Augmented Generation) pipeline, embeddings play a central role in the retrieval stage. The overall flow is as follows:

  1. Document Preprocessing: Split source documents into appropriately sized chunks
  2. Embedding Generation: Convert each chunk into an embedding vector and store in a vector database
  3. Query Retrieval: Embed the user query and search for similar chunks
  4. Reranking: Reorder search results using a cross-encoder
  5. Generation: Pass the retrieved context along with the query to the LLM for answer generation

RAG Pipeline Implementation

from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import chromadb
from chromadb.utils import embedding_functions
import tiktoken
from typing import List, Dict

class RAGPipeline:
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-m3",
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
        llm_model: str = "gpt-4o",
    ):
        self.embedder = SentenceTransformer(embedding_model)
        self.reranker = CrossEncoder(reranker_model)
        self.llm_client = OpenAI()
        self.llm_model = llm_model

        # Initialize Chroma vector DB
        self.chroma_client = chromadb.PersistentClient(path="./rag_db")
        self.collection = self.chroma_client.get_or_create_collection(
            name="rag_documents",
            metadata={"hnsw:space": "cosine"}
        )

    def chunk_text(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
        """Token-based text chunking"""
        tokenizer = tiktoken.get_encoding("cl100k_base")
        tokens = tokenizer.encode(text)
        chunks = []

        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunk_tokens = tokens[start:end]
            chunk_text = tokenizer.decode(chunk_tokens)
            chunks.append(chunk_text)
            start = end - overlap  # Apply overlap

        return chunks

    def ingest_documents(self, documents: List[Dict[str, str]]):
        """Chunk documents and store in vector DB"""
        all_chunks = []
        all_ids = []
        all_metadatas = []

        for doc_idx, doc in enumerate(documents):
            chunks = self.chunk_text(doc["content"])
            for chunk_idx, chunk in enumerate(chunks):
                all_chunks.append(chunk)
                all_ids.append(f"doc{doc_idx}_chunk{chunk_idx}")
                all_metadatas.append({
                    "source": doc.get("source", "unknown"),
                    "doc_index": doc_idx,
                    "chunk_index": chunk_idx,
                })

        # Generate embeddings and store
        embeddings = self.embedder.encode(all_chunks, normalize_embeddings=True)

        self.collection.add(
            documents=all_chunks,
            embeddings=embeddings.tolist(),
            ids=all_ids,
            metadatas=all_metadatas,
        )
        print(f"{len(documents)} documents -> {len(all_chunks)} chunks indexed")

    def retrieve(self, query: str, top_k: int = 10) -> List[Dict]:
        """Vector similarity-based retrieval"""
        query_embedding = self.embedder.encode(
            [query], normalize_embeddings=True
        ).tolist()

        results = self.collection.query(
            query_embeddings=query_embedding,
            n_results=top_k,
        )

        retrieved = []
        for i in range(len(results["documents"][0])):
            retrieved.append({
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "distance": results["distances"][0][i],
            })
        return retrieved

    def rerank(self, query: str, candidates: List[Dict], top_k: int = 5) -> List[Dict]:
        """Cross-encoder based reranking"""
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.reranker.predict(pairs)

        for i, score in enumerate(scores):
            candidates[i]["rerank_score"] = float(score)

        reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
        return reranked[:top_k]

    def generate(self, query: str, context_docs: List[Dict]) -> str:
        """Generate LLM response based on retrieved context"""
        context = "\n\n---\n\n".join([doc["text"] for doc in context_docs])

        messages = [
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the question based on "
                    "the provided context. If the context doesn't contain "
                    "relevant information, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ]

        response = self.llm_client.chat.completions.create(
            model=self.llm_model,
            messages=messages,
            temperature=0.1,
        )
        return response.choices[0].message.content

    def query(self, question: str, top_k_retrieve: int = 10, top_k_rerank: int = 5) -> str:
        """Execute full RAG pipeline"""
        # Step 1: Retrieve
        candidates = self.retrieve(question, top_k=top_k_retrieve)
        print(f"Step 1 retrieval: {len(candidates)} candidates")

        # Step 2: Rerank
        reranked = self.rerank(question, candidates, top_k=top_k_rerank)
        print(f"Step 2 reranking: top {len(reranked)} selected")

        # Step 3: Generate
        answer = self.generate(question, reranked)
        return answer

# Usage example
rag = RAGPipeline()

# Ingest documents
documents = [
    {"content": "Long technical document content...", "source": "tech_doc_1.pdf"},
    {"content": "Another document content...", "source": "tech_doc_2.pdf"},
]
rag.ingest_documents(documents)

# Query
answer = rag.query("How does embedding dimension size affect performance?")
print(f"\nAnswer: {answer}")

Hybrid Search Strategy

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
import numpy as np

class HybridSearchEngine:
    """Dense (embedding) + Sparse (BM25) hybrid search"""

    def __init__(self, embedding_model="BAAI/bge-m3"):
        self.embedder = SentenceTransformer(embedding_model)
        self.documents = []
        self.embeddings = None
        self.bm25 = None

    def index(self, documents):
        self.documents = documents

        # Dense: generate embeddings
        self.embeddings = self.embedder.encode(
            documents, normalize_embeddings=True, convert_to_tensor=True
        )

        # Sparse: build BM25 index
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query, top_k=5, alpha=0.7):
        """Hybrid search (alpha: dense weight, 1-alpha: sparse weight)"""
        # Dense search
        query_emb = self.embedder.encode(
            query, normalize_embeddings=True, convert_to_tensor=True
        )
        dense_scores = util.cos_sim(query_emb, self.embeddings)[0].cpu().numpy()

        # Sparse search (BM25)
        sparse_scores = self.bm25.get_scores(query.split())

        # Normalize
        if dense_scores.max() > 0:
            dense_scores = dense_scores / dense_scores.max()
        if sparse_scores.max() > 0:
            sparse_scores = sparse_scores / sparse_scores.max()

        # Weighted combination
        hybrid_scores = alpha * dense_scores + (1 - alpha) * sparse_scores

        # Return top k
        top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
        return [
            {
                "document": self.documents[i],
                "score": hybrid_scores[i],
                "dense_score": dense_scores[i],
                "sparse_score": sparse_scores[i],
            }
            for i in top_indices
        ]
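
A short usage sketch follows; the corpus and query strings here are purely illustrative:

engine = HybridSearchEngine()
engine.index([
    "HNSW builds a multi-layer proximity graph for approximate nearest neighbor search.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Redis is often used as an in-memory cache in front of a relational database.",
])

# alpha=0.7 gives the dense (embedding) score more weight than BM25
for hit in engine.search("approximate nearest neighbor search", top_k=2, alpha=0.7):
    print(f"{hit['score']:.3f}  {hit['document']}")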

Fine-tuning Embedding Models

Why Fine-tuning Is Necessary

General-purpose embedding models perform well on broad text-similarity tasks, but they may underperform in specialized domains (medical, legal, financial, etc.) or on domain-specific search patterns. Fine-tuning on in-domain data can significantly improve performance in these cases.

Contrastive Learning-Based Fine-tuning

from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
    evaluation,
)
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Prepare training data (anchor, positive, negative)
train_examples = [
    # (query, relevant document, irrelevant document)
    InputExample(texts=[
        "How to deploy a Kubernetes pod?",
        "kubectl apply -f pod.yaml creates a new pod in the cluster.",
        "Python is a popular programming language for data science."
    ]),
    InputExample(texts=[
        "What is a Docker container?",
        "A Docker container is a lightweight, standalone executable package.",
        "Machine learning models require large datasets for training."
    ]),
    InputExample(texts=[
        "How does Redis caching work?",
        "Redis stores data in memory for fast read/write access as a cache layer.",
        "Kubernetes orchestrates containerized applications across clusters."
    ]),
    # ... thousands to tens of thousands of training examples
]

# Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Loss function: TripletLoss (anchor, positive, negative)
train_loss = losses.TripletLoss(model=model)

# Evaluation data
eval_examples = [
    InputExample(texts=["query1", "relevant_doc1"], label=1.0),
    InputExample(texts=["query2", "irrelevant_doc2"], label=0.0),
]
evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    eval_examples, name="domain-eval"
)

# Run fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,
    warmup_steps=100,
    evaluation_steps=500,
    output_path="./finetuned_embedding_model",
    save_best_model=True,
)

print("Fine-tuning complete!")

# Load and use the fine-tuned model
finetuned_model = SentenceTransformer('./finetuned_embedding_model')
embeddings = finetuned_model.encode(["domain-specific query"])

Efficient Training with MultipleNegativesRankingLoss

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Training with just (query, positive_passage) pairs
# Automatically leverages in-batch negatives
train_examples = [
    InputExample(texts=["What is embedding?", "An embedding is a vector representation of data."]),
    InputExample(texts=["How does HNSW work?", "HNSW builds a hierarchical graph for approximate nearest neighbor search."]),
    InputExample(texts=["What is RAG?", "RAG retrieves relevant documents and uses them to augment LLM generation."]),
    # ... more (query, positive) pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# MultipleNegativesRankingLoss: uses other positives in the batch as negatives
train_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path="./mnrl_finetuned_model",
)

Training Data Preparation Strategies

Data Type | Description | Example
Natural Pairs | Real user queries and clicked documents | Search log data
LLM-Generated | Query-document pairs synthesized with GPT-4 or similar | Auto-generating questions from documents
Hard Negatives | Semantically similar but non-relevant documents | Non-relevant docs from BM25 search results
Cross-Encoder Distillation | Using cross-encoder scores as training targets | Automatic high-quality label generation
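
To make the Hard Negatives row concrete, the sketch below mines BM25-similar but non-relevant passages and wraps them as triplets for the TripletLoss setup above; the query, gold passage, and corpus are assumed to come from your own data:

from rank_bm25 import BM25Okapi
from sentence_transformers import InputExample

def mine_hard_negatives(query, positive, corpus, num_negatives=3):
    """Select passages that rank highly under BM25 but are not the gold passage."""
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

    negatives = [corpus[i] for i in ranked if corpus[i] != positive][:num_negatives]
    # One (anchor, positive, negative) triplet per mined hard negative
    return [InputExample(texts=[query, positive, neg]) for neg in negatives]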

Performance Optimization and Evaluation

MTEB Benchmark

MTEB (Massive Text Embedding Benchmark) is the standard benchmark for comprehensively evaluating embedding model performance. It evaluates models across various task categories:

Task Category | Description | Representative Datasets
Classification | Text classification | AmazonReviews, TweetSentiment
Clustering | Text clustering | ArXiv, Reddit
Pair Classification | Sentence pair relation classification | TwitterPara, SprintDuplicateQuestions
Reranking | Search result reordering | AskUbuntuDupQuestions, StackOverflowDupQuestions
Retrieval | Document retrieval | MSMarco, NQ, HotpotQA
STS | Sentence semantic similarity | STSBenchmark, SICK-R
Summarization | Summary quality evaluation | SummEval
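
These tasks can be run locally with the mteb package. A minimal sketch using two tasks as examples; the exact API differs slightly across mteb versions, and a full benchmark run takes many hours:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Evaluate on a small task subset instead of the full benchmark
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")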

Dimensionality Reduction and Matryoshka Representation Learning

from sentence_transformers import SentenceTransformer
import numpy as np

# Model supporting Matryoshka Representation Learning (MRL)
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)  # this model requires trust_remote_code

texts = [
    "Vector databases store embeddings for similarity search.",
    "Embedding models convert text into numerical representations.",
    "RAG systems combine retrieval with language generation.",
]

# Full-dimension embeddings
full_embeddings = model.encode(texts)
print(f"Full dimensions: {full_embeddings.shape[1]}")  # 768

# Matryoshka: truncate to desired dimension and normalize
def truncate_embeddings(embeddings, target_dim):
    """Dimension reduction using Matryoshka approach"""
    truncated = embeddings[:, :target_dim]
    # L2 normalization
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Compare similarity at various dimensions
for dim in [64, 128, 256, 512, 768]:
    reduced = truncate_embeddings(full_embeddings, dim)
    sim = np.dot(reduced[0], reduced[1])  # Dot product of normalized vectors = cosine similarity
    print(f"  Dimension {dim:>4}: similarity = {sim:.4f}")

Memory Optimization through Quantization

import numpy as np

def scalar_quantize_int8(embeddings):
    """Scalar quantization: float32 -> int8 (75% memory reduction)"""
    min_val = embeddings.min(axis=0)
    max_val = embeddings.max(axis=0)
    scale = np.maximum((max_val - min_val) / 255.0, 1e-12)  # avoid division by zero

    # Map each value to 0..255, then shift into the int8 range -128..127
    quantized = (np.round((embeddings - min_val) / scale) - 128).astype(np.int8)
    return quantized, min_val, scale

def scalar_dequantize_int8(quantized, min_val, scale):
    """Dequantize: int8 -> float32"""
    return (quantized.astype(np.float32) + 128) * scale + min_val

def binary_quantize(embeddings):
    """Binary quantization: float32 -> 1 bit per dimension via bit packing (32x memory reduction)"""
    return np.packbits(embeddings > 0, axis=1)

# Memory comparison
num_vectors = 1_000_000
dimension = 1024
embeddings = np.random.randn(num_vectors, dimension).astype(np.float32)

print(f"Original (float32): {embeddings.nbytes / 1e9:.2f} GB")

quantized, _, _ = scalar_quantize_int8(embeddings)
print(f"int8 quantized:     {quantized.nbytes / 1e9:.2f} GB")

binary = binary_quantize(embeddings)
print(f"Binary quantized:   {binary.nbytes / 1e9:.2f} GB")
# Original (float32): 4.10 GB
# int8 quantized:     1.02 GB
# Binary quantized:   0.13 GB
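
Quantized vectors are also searched differently: with the bit-packed binary embeddings above, similarity search becomes a Hamming-distance ranking. A minimal sketch, restricted to a 10,000-vector subset so the unpacking step stays cheap:

def hamming_search(query_bits, corpus_bits, top_k=5):
    """Rank bit-packed binary vectors by Hamming distance (smaller = more similar)."""
    xor = np.bitwise_xor(query_bits, corpus_bits)        # differing bits, still packed
    distances = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per vector
    return np.argsort(distances)[:top_k]

query_bits = binary_quantize(embeddings[:1])   # shape: (1, dimension // 8)
top = hamming_search(query_bits, binary[:10_000])
print(f"Nearest neighbors by Hamming distance: {top}")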

Production Optimization Checklist

Optimization | Technique | Effect
Batch Processing | Batch embedding requests together | 3-5x throughput improvement
Caching | Cache frequently used query embeddings | 90% latency reduction
Dimension Reduction | Apply Matryoshka or PCA | 2-4x memory/speed improvement
Quantization | int8/binary quantization | 4-32x memory reduction
GPU Inference | ONNX Runtime or TensorRT | 2-3x inference speed improvement
Async Processing | asyncio-based parallel embedding | Overall throughput improvement
Model Selection | Choose appropriate model for requirements | Cost-performance optimization

The service below applies two items from this checklist, batch processing and an in-memory FIFO cache keyed by a content hash:

import hashlib

from sentence_transformers import SentenceTransformer

class OptimizedEmbeddingService:
    def __init__(self, model_name="BAAI/bge-m3", cache_size=10000):
        self.model = SentenceTransformer(model_name)
        self.cache = {}
        self.cache_size = cache_size

    def _get_cache_key(self, text):
        return hashlib.md5(text.encode()).hexdigest()

    def encode_with_cache(self, texts, batch_size=64):
        """Generate embeddings with caching"""
        uncached_texts = []
        uncached_indices = []
        results = [None] * len(texts)

        # Check cache hits
        for i, text in enumerate(texts):
            key = self._get_cache_key(text)
            if key in self.cache:
                results[i] = self.cache[key]
            else:
                uncached_texts.append(text)
                uncached_indices.append(i)

        # Batch embed only cache misses
        if uncached_texts:
            new_embeddings = self.model.encode(
                uncached_texts,
                batch_size=batch_size,
                normalize_embeddings=True,
            )

            for idx, emb in zip(uncached_indices, new_embeddings):
                key = self._get_cache_key(texts[idx])
                self.cache[key] = emb
                results[idx] = emb

                # Manage cache size
                if len(self.cache) > self.cache_size:
                    oldest_key = next(iter(self.cache))
                    del self.cache[oldest_key]

        return results

    def get_cache_stats(self):
        return {"cache_size": len(self.cache), "max_size": self.cache_size}

Conclusion

Embedding models are core infrastructure of modern AI systems, playing essential roles in diverse applications including semantic search, RAG, recommendation systems, and anomaly detection. Here is a summary of the key points covered in this article:

  1. Model selection matters: Reference the MTEB benchmark, but evaluating on your actual data is the most accurate approach. Consider BGE-M3 for multilingual support, GTE-Qwen2-7B for top performance, and text-embedding-3-small for cost efficiency.

  2. Choose vector databases based on requirements: Chroma for rapid prototyping, Milvus or Pinecone for production scale, and pgvector for leveraging existing PostgreSQL infrastructure.

  3. Hybrid search outperforms single approaches: Combining Dense (embedding) + Sparse (BM25) with reranking significantly improves search quality.

  4. Fine-tuning is key for domain-specific performance: Using MultipleNegativesRankingLoss with hard negative mining can achieve significant performance improvements even with limited data.

  5. Optimization is essential: Apply dimension reduction (Matryoshka), quantization, caching, and batch processing to optimize cost and latency in production environments.

Embedding technology is rapidly evolving, with new techniques such as Matryoshka Representation Learning, multimodal embeddings, and task-specific LoRA adapters continually emerging. By understanding the core principles and building practical experience, you can construct optimal embedding strategies for your own projects.

References