Embeddings & Vector Representations

Before reading: you should understand tensors, the Transformer architecture, and self-supervised learning — all covered in Machine Learning.

You search for “fast Python web framework” and get the Flask docs. Not because the page contains those exact words — it doesn’t — but because an embedding model understood that “Flask” is a fast Python web framework and placed its vector near that query in semantic space.

This is the power of embeddings: they capture meaning, not just text matching. They are the backbone of most modern search, recommendation, and RAG systems.

What Is an Embedding?

An embedding is a dense vector of floating-point numbers, typically a few hundred to a few thousand dimensions (the models listed below range from 256 to 4096), that represents a piece of text, an image, or other data in a continuous vector space. The key property: semantically similar items are close together, so the distance between two embeddings encodes how related their meanings are.

The classic example: in a well-trained word embedding space, vec("king") − vec("man") + vec("woman") ≈ vec("queen"). Vector arithmetic captures semantic relationships as geometric relationships.
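The analogy can be reproduced on a toy scale. The sketch below uses hand-picked 3-dimensional vectors (hypothetical values, not output from any trained model) so that the arithmetic is easy to follow; the `analogy` helper is an illustrative name, not a library function.

```python
import math

# Hypothetical 3-d "embeddings", chosen by hand for illustration only.
# Assumed axes: [royalty, maleness, femaleness]
vocab = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", vocab)
print(result)
```

Real embedding spaces are far noisier than this, which is why the relationship in the text is written with ≈ rather than =.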

Embeddings are the bridge between discrete symbols (words, tokens) and continuous mathematics (gradients, optimization). Without embeddings, neural networks would operate on integer IDs with no notion of similarity — “cat” would be as different from “kitten” as from “concrete.”
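To make the "cat" versus "concrete" point concrete, here is a minimal sketch comparing one-hot vectors (equivalent to bare integer IDs) with a small embedding table. The 2-dimensional values are made up for illustration.

```python
import math

vocab = ["cat", "kitten", "concrete"]

def one_hot(word):
    """One dimension per vocabulary entry; no notion of similarity."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical 2-d embedding table (values invented for this example).
embedding_table = {
    "cat":      [0.90, 0.10],
    "kitten":   [0.85, 0.20],
    "concrete": [0.05, 0.95],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# With one-hot vectors every pair of distinct words is equally far apart.
one_hot_cat_kitten = euclidean(one_hot("cat"), one_hot("kitten"))
one_hot_cat_concrete = euclidean(one_hot("cat"), one_hot("concrete"))

# With learned dense vectors, related words can sit close together.
emb_cat_kitten = euclidean(embedding_table["cat"], embedding_table["kitten"])
emb_cat_concrete = euclidean(embedding_table["cat"], embedding_table["concrete"])
```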

How Embeddings Are Trained

Modern text embedding models use a dual-encoder architecture:

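Two encoders process the two sides of a pair (query/document, sentence/sentence, or text/image). For text pairs this is usually one Transformer with shared weights; multimodal models such as CLIP use a separate encoder per modality.
The final-layer hidden states are pooled (mean, CLS token, or last token) into a single embedding vector.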
A contrastive loss function pulls similar pairs closer in vector space and pushes dissimilar pairs apart.
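The pooling and loss steps above can be sketched in a few lines. This is a simplified illustration, assuming mean pooling and an InfoNCE-style loss with in-batch negatives (one common contrastive objective, not necessarily the one any particular model uses). The temperature value and the input vectors are assumptions.

```python
import math

def mean_pool(token_vectors):
    """Average per-token hidden states into one vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """In-batch contrastive loss: query i's positive is doc i, all other docs are negatives."""
    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(queries)

# Made-up "hidden states" for a two-token query.
pooled_query = mean_pool([[1.0, 0.0], [0.8, 0.2]])

# Loss is near zero when each query is closest to its own document...
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
# ...and large when the pairs are swapped.
misaligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

In real training the gradient of this loss flows back through the encoders, which is what moves matching pairs together and non-matching pairs apart.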

The training data creates the signal. For search embeddings: (query, relevant document) pairs. For sentence similarity: naturally occurring paraphrases, or pairs generated synthetically by LLMs. For code embeddings: (docstring, function body) pairs. For multimodal embeddings (CLIP): (image, caption) pairs, from which the model learns to map images and text into a shared embedding space.

Popular Embedding Models

Model	Dimensions	Max Tokens	Strengths	Weaknesses
OpenAI text-embedding-3-small	512/1536	8191	Cheap, easy, Matryoshka-compatible	Closed-source, tied to OpenAI API
OpenAI text-embedding-3-large	256–3072	8191	Strong general-purpose MTEB scores, Matryoshka	Closed-source, expensive (~$0.13/1M tokens)
Voyage voyage-3	1024	32000	Long context, strong retrieval	Closed-source, fewer dimensions
Jina embeddings v3	1024	8192	Task-specific LoRA adapters, multilingual	Newer, smaller community
BGE-M3 (BAAI)	1024	8192	Dense + sparse + ColBERT, multilingual, open-weight	Needs careful batching for throughput
Cohere Embed v3	1024	512	Compression-aware (int8 and binary output)	Closed-source, short context (long docs must be chunked)
E5-mistral-7b-instruct	4096	32768	Synthetic data trained, open-weight	Huge embed dim → expensive vector DB

The MTEB (Massive Text Embedding Benchmark) leaderboard tracks performance across bitext mining, classification, clustering, pair classification, reranking, retrieval, STS, and summarization. No single model dominates all tasks, and leaderboard positions change often. Benchmark on your specific data and task type.
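Benchmarking on your own data can start very small. The sketch below computes recall@k for two candidate models over a handful of labeled queries; the document IDs, relevance labels, and rankings are hypothetical placeholders for whatever your retrieval runs produce.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

# Hypothetical rankings for the same two queries from two candidate models.
model_a_runs = [(["d1", "d7", "d3"], {"d1", "d3"}), (["d9", "d2", "d4"], {"d2"})]
model_b_runs = [(["d7", "d8", "d1"], {"d1", "d3"}), (["d4", "d9", "d5"], {"d2"})]

score_a = mean_recall_at_k(model_a_runs, k=3)
score_b = mean_recall_at_k(model_b_runs, k=3)
```

A few dozen representative queries with known relevant documents is often enough to separate candidate models before committing to one.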

Similarity Measures

Once you have vectors, you need a way to compare them:

Measure	Formula	Range	When to Use	When It Fails
Cosine Similarity	`A·B / (‖A‖‖B‖)`	[-1, 1]	Default for text embeddings	Magnitude matters (rare)
Dot Product	`A·B`	(-∞, ∞)	Vector DBs with normalized vectors	Unnormalized — longer docs score higher
Euclidean Distance	`‖A − B‖`	[0, ∞)	Clustering, anomaly detection	Embeddings at different scales

Most embedding models normalize output vectors to unit length, making cosine similarity and dot product equivalent. Use cosine as the default. For unnormalized embeddings (rare), prefer Euclidean distance in clustering where absolute position matters.
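The three measures in the table are short enough to write out directly. This is a plain-Python sketch for clarity; production code would normally use a vectorized library. The example vectors are arbitrary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return norm([x - y for x, y in zip(a, b)])

def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

cos_ab = cosine_similarity(a, b)     # ignores magnitude
dot_ab = dot(a, b)                   # grows with magnitude
dist_ab = euclidean_distance(a, b)   # nonzero although the directions match
dot_unit = dot(l2_normalize(a), l2_normalize(b))  # matches cosine after normalization
```

The two vectors point the same way, so cosine treats them as identical while the dot product and Euclidean distance both react to the difference in length.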

L2-normalize your embeddings before storing them in a vector database. Dot-product search over unit vectors returns the same ranking as cosine similarity search and is cheaper, because the norms in the denominator no longer have to be computed at query time.
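For unit-length vectors, the three measures in the table above are tied together by a short identity. The derivation below assumes ‖A‖ = ‖B‖ = 1:

```latex
\begin{aligned}
\lVert A - B \rVert^2 &= \lVert A \rVert^2 + \lVert B \rVert^2 - 2\, A \cdot B \\
                      &= 2 - 2\, A \cdot B \\
                      &= 2 - 2 \cos(A, B)
\end{aligned}
```

So on normalized embeddings, ranking by highest dot product, highest cosine similarity, or smallest Euclidean distance returns the same neighbors.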

Dimensionality Tradeoffs

Higher-dimensional embeddings capture more nuance — but cost more:

Dimensions	Storage per 1M vectors (FP32)	Approx. retrieval quality (relative to full-dim)	Use Case
256	~1 GB	95–97%	Budget-sensitive, high-volume
768	~3 GB	98–99%	Good default for most tasks
1024	~4 GB	98–99%	Open-weight model sweet spot
1536	~6 GB	99%+	Best-in-class retrieval
3072	~12 GB	99%+	Diminishing returns past 1536 for most tasks

Storage in a vector database isn’t just the raw vectors: add index overhead (HNSW graphs, IVF clusters) and metadata. Budget 1.5–2× the raw vector size for the full index. For 10M vectors at 1536 dimensions (FP32): ~60 GB raw + ~30 GB index overhead = ~90 GB total at the 1.5× end, and up to ~120 GB at 2×.
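The sizing arithmetic above is easy to script. This sketch assumes FP32 values (4 bytes each) and decimal gigabytes (10^9 bytes); the function names and the overhead factor parameter are illustrative, and real index overhead depends on the index type and its settings.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Size of the raw vectors alone (FP32 by default)."""
    return num_vectors * dims * bytes_per_value

def index_budget_gb(num_vectors, dims, overhead_factor=1.5, bytes_per_value=4):
    """Rough total budget in GB: raw vectors times an assumed overhead factor."""
    return raw_vector_bytes(num_vectors, dims, bytes_per_value) * overhead_factor / 1e9

raw_gb = raw_vector_bytes(10_000_000, 1536) / 1e9           # about 61 GB
low_gb = index_budget_gb(10_000_000, 1536, overhead_factor=1.5)   # about 92 GB
high_gb = index_budget_gb(10_000_000, 1536, overhead_factor=2.0)  # about 123 GB
```

Passing `bytes_per_value=2` or `1` gives a quick estimate for FP16 or int8 storage.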

Matryoshka embeddings: train once, use at any dimension. A Matryoshka embedding model produces a single full-size vector (3072 dimensions, for example), but you can truncate it to 1536, 768, or 256 dimensions, re-normalize, and keep strong performance. OpenAI's text-embedding-3 models use this technique, as do some newer Voyage models such as voyage-3-large; the base voyage-3 model outputs a fixed 1024 dimensions. This means you can store full-dim embeddings in cold storage and truncate to 256 dims for a fast approximate index, with no separate model or re-embedding needed.
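Truncation itself is a one-liner, but the shortened vector is no longer unit length, so it should be re-normalized before cosine or dot-product search. A minimal sketch follows (the helper name is hypothetical); it only preserves quality for models that were trained with a Matryoshka objective.

```python
import math

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` values, then rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # stand-in for a full-size unit vector
short = truncate_and_renormalize(full, 2)
```

Truncating a vector from a model without Matryoshka training will still run, but the leading dimensions carry no special weight there and retrieval quality can drop sharply.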

Practical: Storing and Querying

For search and RAG, embeddings are only useful once you can retrieve them. The vector database handles indexing and similarity search at scale (see Specialized Databases for pgvector/Pinecone/Milvus).

The pipeline:

Embed documents → store vectors + metadata in vector DB
Embed user query → search vector DB for nearest neighbors (k-NN or ANN)
Retrieve top-k results → feed into LLM context (if RAG) or return directly (if search)
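The three pipeline steps can be sketched end to end with an in-memory store and exact (brute-force) search. Everything here is a stand-in: `make_toy_embedder` is a word-count placeholder for a real embedding model, so it matches shared words rather than meaning, and `ToyVectorStore` is a hypothetical class, not the API of any vector database.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def make_toy_embedder(corpus):
    """Placeholder for a real model: normalized word counts over a fixed vocabulary."""
    words = sorted({w for doc in corpus for w in tokenize(doc)})
    vocab = {w: i for i, w in enumerate(words)}

    def embed(text):
        vec = [0.0] * len(vocab)
        for word in tokenize(text):
            if word in vocab:  # unknown words are ignored
                vec[vocab[word]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec

    return embed

class ToyVectorStore:
    """In-memory store with exact nearest-neighbor search over unit vectors."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.rows = []  # (vector, text, metadata)

    def add(self, text, metadata=None):
        self.rows.append((self.embed_fn(text), text, metadata or {}))

    def search(self, query, k=3):
        q = self.embed_fn(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), text, meta)
                  for vec, text, meta in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]

docs = [
    "Flask is a lightweight Python web framework.",
    "Sourdough bread needs a long, slow fermentation.",
    "HNSW is a graph index for approximate vector search.",
]

# Step 1: embed documents and store vectors + metadata.
store = ToyVectorStore(make_toy_embedder(docs))
for i, doc in enumerate(docs):
    store.add(doc, {"doc_id": i})

# Step 2: embed the query and search for nearest neighbors.
top = store.search("fast Python web framework", k=2)

# Step 3: return the hits directly, or join their text into an LLM prompt for RAG.
context = "\n".join(text for _, text, _ in top)
```

Swapping the placeholder for a real embedding model and the list scan for an ANN index changes the quality and speed, but not the shape of the pipeline.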

Chunking can matter more than embedding model choice. A single 10,000-token document exceeds the input limit of many models, and even when it fits, it produces one embedding that averages all its topics into a single point, which is of little use for retrieval. Split into chunks of 256–1024 tokens with 10–20% overlap. Semantic chunking (splitting at natural boundaries like paragraphs or sentence groups) often outperforms fixed-size chunking. Bad chunking makes even text-embedding-3-large look bad.
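A fixed-size chunker with overlap is a reasonable baseline to compare semantic chunking against. The sketch below works on a pre-tokenized list; the placeholder tokens stand in for the output of whatever tokenizer your embedding model uses, and the default sizes (512 tokens, 64 overlap, which is 12.5%) are assumptions within the ranges given above.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# Placeholder tokens; a real pipeline would use the model's tokenizer.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
```

Each chunk is then embedded and stored separately, usually with metadata pointing back to the source document and position.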

Code, Image, and Multimodal Embeddings

Code embeddings: Models like voyage-code-2 or CodeBERT embed code snippets for semantic search. “Find all functions that handle file uploads” works because the embedding captures what the code does, not what it’s named. Useful for codebase-wide search and RAG over code and its documentation.

Image embeddings: CLIP embeds images and text into the same space. An image of a dog and the text “a photo of a dog” produce similar vectors. This enables text-to-image search, zero-shot image classification, and multimodal RAG.

Multimodal embeddings: Jina CLIP v2 and similar models extend this shared image-text space to multilingual text. Embed a PDF's diagrams alongside their surrounding text, store both in a vector DB, and you can search the whole document using natural language.
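Zero-shot image classification with a shared image-text space reduces to a nearest-neighbor lookup over label prompts. The vectors below are hypothetical; in a real system the image vector would come from the model's image encoder and the label vectors from its text encoder.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_vec, label_vecs):
    """Pick the label whose text embedding is closest to the image embedding."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

# Hypothetical text embeddings for three label prompts.
label_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.2],
    "a photo of a cat": [0.2, 0.9, 0.1],
    "a photo of a car": [0.1, 0.2, 0.9],
}
# Hypothetical image embedding for a picture of a dog.
image_vec = [0.8, 0.3, 0.1]

predicted = zero_shot_classify(image_vec, label_vecs)
```

No classifier is trained here: adding a new class only requires embedding one more label prompt.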

Key Takeaways

No single best embedding model. text-embedding-3-large scores well on MTEB, but it is closed-source and costs ~$0.13/1M tokens, while open-weight alternatives such as BGE-M3 cost only your own compute. Benchmark on your data and task.
Chunking is the hidden variable. Bad chunking ruins retrieval regardless of embedding quality. Spend time on chunk boundaries and overlap.
Normalize your vectors. Most models output unit vectors. If yours doesn’t, L2-normalize before storing so that cosine and dot-product search give the same ranking.
Matryoshka gives you dimensionality for free. Use a Matryoshka-compatible model and tune dimensionality as a hyperparameter without re-embedding.
Dimensionality hits a wall around 1536. Going from 1536 to 3072 dims typically improves retrieval <1% for 2× the storage cost. Profile before scaling up.
Vector DB index overhead is real. Budget 1.5–2× the raw vector size for a production search index.

References

MTEB: Muennighoff et al., 2023. MTEB: Massive Text Embedding Benchmark. arXiv.
BGE: Xiao et al., 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv.
BGE-M3: Chen et al., 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv.
Matryoshka: Kusupati et al., 2022. Matryoshka Representation Learning. arXiv.
Jina embeddings v3: Sturua et al., 2024. jina-embeddings-v3: Multilingual Embeddings With Task LoRA. arXiv.
E5: Wang et al., 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv.
CLIP: Radford et al., 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv.