Embeddings & Vector Representations
Before reading: you should understand tensors, the Transformer architecture, and self-supervised learning — all covered in Machine Learning.
You search for “fast Python web framework” and get the Flask docs. Not because the page contains those exact words — it doesn’t — but because an embedding model understood that “Flask” is a fast Python web framework and placed its vector near that query in semantic space.
This is the power of embeddings: they capture meaning, not just text matching. And they’re the backbone of most modern search, recommendation, and RAG systems.
What Is an Embedding?
An embedding is a dense vector of floating-point numbers, typically a few hundred to a few thousand dimensions, that represents a piece of text, an image, or any other data in a continuous vector space. The key property: semantically similar items are close together. The distance between two embeddings encodes how related their meanings are.
The classic example: in a well-trained word embedding space, vec("king") − vec("man") + vec("woman") ≈ vec("queen"). Vector arithmetic captures semantic relationships as geometric relationships.
Embeddings are the bridge between discrete symbols (words, tokens) and continuous mathematics (gradients, optimization). Without embeddings, neural networks would operate on integer IDs with no notion of similarity — “cat” would be as different from “kitten” as from “concrete.”
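The king/queen analogy can be reproduced with a toy example. This is only a sketch: the 3-dimensional vectors below are hand-made (axes loosely meaning royalty, maleness, femaleness) rather than output from a trained model, and `cosine` and `analogy` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors for illustration only (not from a real model).
toy = {
    "king":     [0.9, 0.9, 0.1],
    "queen":    [0.9, 0.1, 0.9],
    "man":      [0.1, 0.9, 0.1],
    "woman":    [0.1, 0.1, 0.9],
    "prince":   [0.8, 0.9, 0.1],
    "concrete": [0.0, 0.1, 0.1],
}

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))

result = analogy("king", "man", "woman", toy)
print(result)  # queen
```

Real embedding spaces are noisier than this, so the nearest neighbor of the arithmetic result is approximate rather than exact.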
How Embeddings Are Trained
Modern text embedding models use a dual-encoder architecture:
- Two Transformer encoders, usually sharing weights for text pairs and using separate towers for text/image, process a pair of inputs (query/document, sentence/sentence, or text/image).
- The final-layer hidden states are pooled (mean, CLS token, or last token) into a single embedding vector.
- A contrastive loss function pulls similar pairs closer in vector space and pushes dissimilar pairs apart.
The training data creates the signal. For search embeddings: (query, relevant document) pairs. For sentence similarity: naturally occurring paraphrases, or synthetically generated by LLMs. For code embeddings: (docstring, function body) pairs. For multimodal embeddings (CLIP): (image, caption) pairs — the model learns to map images and text into a shared embedding space.
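The pooling and contrastive steps can be sketched in a few lines. The text above does not say which loss a given model uses; the sketch below assumes InfoNCE with in-batch negatives, one common choice, and uses plain Python lists where a real training loop would use a tensor library. The function names and the `temperature` value are illustrative assumptions.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch contrastive loss: docs[i] is the positive for queries[i],
    and every other doc in the batch acts as a negative."""
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(x * y for x, y in zip(qi, dj)) / temperature for dj in d]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(q)

pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each query already sits next to its own document and large when the pairs are mismatched, which is the gradient signal that pulls positives together and pushes negatives apart.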
Popular Embedding Models
| Model | Dimensions | Max Tokens | Strengths | Weaknesses |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 512/1536 | 8191 | Cheap, easy, Matryoshka-compatible | Closed-source, tied to OpenAI API |
| OpenAI text-embedding-3-large | 256–3072 | 8191 | Strong MTEB scores, Matryoshka | Closed-source, expensive (~$0.13/1M tokens) |
| Voyage voyage-3 | 1024 | 32000 | Long context, strong retrieval | Closed-source, fewer dimensions |
| Jina embeddings v3 | 1024 | 8192 | Task-specific LoRA adapters, multilingual | Newer, smaller community |
| BGE-M3 (BAAI) | 1024 | 8192 | Dense + sparse + ColBERT, multilingual, open-weight | Needs careful batching for throughput |
| Cohere Embed v3 | 1024 | 512 | Compression-aware | Closed-source; 512-token limit means long docs must be chunked |
| E5-mistral-7b-instruct | 4096 | 32768 (model card advises against inputs over 4096) | Trained on synthetic data, open-weight | Large embedding dim makes vector storage expensive |
The MTEB (Massive Text Embedding Benchmark) leaderboard tracks performance across classification, clustering, pair classification, reranking, retrieval, STS, and summarization. No single model dominates all tasks. Benchmark on your specific data and task type.
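Benchmarking on your own data can start with a small labeled set of queries and a recall@k measurement. The sketch below assumes you already have, for each test query, the ranked document IDs returned by a candidate model and the set of IDs a human judged relevant; the function names and sample IDs are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_recall_at_k(runs, k):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per test query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Made-up results for two test queries.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # one of two relevant docs in top 2
    (["d4", "d2", "d9"], {"d4"}),        # the only relevant doc is ranked first
]
score = mean_recall_at_k(runs, k=2)
```

Running the same labeled queries through each candidate model gives a like-for-like number that reflects your documents rather than the leaderboard's.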
Similarity Measures
Once you have vectors, you need a way to compare them:
| Measure | Formula | Range | When to Use | When It Fails |
|---|---|---|---|---|
| Cosine Similarity | A·B / (‖A‖‖B‖) | [-1, 1] | Default for text embeddings | Magnitude matters (rare) |
| Dot Product | A·B | (-∞, ∞) | Vector DBs with normalized vectors | Unnormalized — longer docs score higher |
| Euclidean Distance | ‖A − B‖ | [0, ∞) | Clustering, anomaly detection | Embeddings at different scales |
Most embedding models normalize output vectors to unit length, making cosine similarity and dot product equivalent. Use cosine as the default. For unnormalized embeddings (rare), prefer Euclidean distance in clustering where absolute position matters.
L2-normalize your embeddings before storing them in a vector database. Dot-product search then produces the same ranking as cosine similarity search and is cheaper to compute, because it skips the norm denominator.
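The three measures, and the effect of normalization, can be checked directly. This is a minimal sketch in plain Python with made-up 2-dimensional vectors; the helper names are illustrative, and production code would normally use a numeric library or the vector database's built-in metric.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return norm([x - y for x, y in zip(a, b)])

def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

cos_ab = cosine_similarity(a, b)          # 24 / 25 = 0.96
dot_unit = dot(a_hat, b_hat)              # equals cos_ab once vectors are unit length
dist_unit = euclidean_distance(a_hat, b_hat)
```

For unit vectors the squared Euclidean distance equals 2 − 2·cos, so all three measures rank neighbors identically after normalization.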
Dimensionality Tradeoffs
Higher-dimensional embeddings capture more nuance — but cost more:
| Dimensions | Storage per 1M vectors | Approximate Recall | Use Case |
|---|---|---|---|
| 256 | ~1 GB (FP32) | 95–97% of full-dim | Budget-sensitive, high-volume |
| 768 | ~3 GB | 98–99% | Good default for most tasks |
| 1024 | ~4 GB | 98–99% | Open-weight model sweet spot |
| 1536 | ~6 GB | 99%+ | Best-in-class retrieval |
| 3072 | ~12 GB | 99%+ | Diminishing returns past 1536 for most tasks |
Storage in a vector database isn’t just the raw vectors — add index overhead (HNSW graphs, IVF clusters) and metadata. Budget 1.5–2× the raw vector size for the full index. For 10M vectors at 1536 dimensions: ~60 GB raw + ~30 GB index = ~90 GB total.
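The storage arithmetic is easy to script for your own corpus. The sketch below assumes FP32 values (4 bytes each), decimal gigabytes, and an overhead factor taken from the 1.5–2× rule of thumb above; `estimate_storage_gb` is a hypothetical helper, not part of any vector database API.

```python
def estimate_storage_gb(num_vectors, dims, overhead_factor=1.5, bytes_per_value=4):
    """Return (raw_gb, total_gb) for a vector index.

    raw_gb   = vectors * dims * bytes per value
    total_gb = raw_gb * overhead_factor (index structures plus metadata)
    """
    raw_gb = num_vectors * dims * bytes_per_value / 1e9
    return raw_gb, raw_gb * overhead_factor

raw, total = estimate_storage_gb(10_000_000, 1536)
print(f"raw ~{raw:.1f} GB, with index ~{total:.1f} GB")  # raw ~61.4 GB, with index ~92.2 GB
```

Actual overhead depends on the index type and its parameters, so treat the factor as a planning number and measure on a sample before provisioning.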
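Matryoshka embeddings: train once, use at any dimension. A Matryoshka embedding model produces a single full-size vector (3072 dimensions, say), and you can truncate it to 1536, 768, or 256 dimensions, re-normalize, and keep strong performance. OpenAI's text-embedding-3 models are trained this way and expose it through a dimensions parameter; voyage-3 outputs a fixed 1024 dimensions, so check each model's documentation before truncating. With a Matryoshka model you can store full-dim embeddings in cold storage and truncate to 256 dims for a fast approximate index, with no separate model or re-embedding needed.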
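Truncating a Matryoshka embedding is a slice followed by re-normalization. The sketch below assumes the vector came from a model trained with Matryoshka Representation Learning; applying it to an ordinary embedding will usually hurt quality badly. The function name and the sample vector are illustrative.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if dims <= 0 or dims > len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # stand-in for a full-size unit vector
short = truncate_embedding(full, 2)  # about [0.7071, 0.7071]
```

Re-normalizing matters because a truncated prefix is no longer unit length, which would break the equivalence between dot product and cosine similarity.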
Practical: Storing and Querying
Embeddings are worthless without retrieval. The vector database handles indexing and similarity search at scale (see Specialized Databases for pgvector/Pinecone/Milvus).
The pipeline:
- Embed documents → store vectors + metadata in vector DB
- Embed user query → search vector DB for nearest neighbors (k-NN or ANN)
- Retrieve top-k results → feed into LLM context (if RAG) or return directly (if search)
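The store-and-search steps of the pipeline can be sketched with an exact, in-memory index. This stands in for a real vector database: `InMemoryIndex` is a hypothetical class, the vectors are made up rather than produced by an embedding model, and the search is brute-force k-NN rather than ANN.

```python
import heapq
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class InMemoryIndex:
    """Exact (brute-force) nearest-neighbor search over unit vectors."""

    def __init__(self):
        self._rows = []  # (doc_id, unit_vector, metadata)

    def add(self, doc_id, vector, metadata=None):
        """Step 1: store the normalized vector alongside its metadata."""
        self._rows.append((doc_id, l2_normalize(vector), metadata or {}))

    def search(self, query_vector, k=3):
        """Step 2: score every stored vector by dot product and keep the top k."""
        q = l2_normalize(query_vector)
        scored = (
            (sum(a * b for a, b in zip(q, vec)), doc_id, meta)
            for doc_id, vec, meta in self._rows
        )
        return heapq.nlargest(k, scored, key=lambda row: row[0])

index = InMemoryIndex()
index.add("flask", [0.9, 0.1], {"title": "Flask docs"})
index.add("django", [0.8, 0.3], {"title": "Django docs"})
index.add("pandas", [0.1, 0.9], {"title": "pandas docs"})

hits = index.search([1.0, 0.0], k=2)  # pretend this is the embedded user query
top_ids = [doc_id for _, doc_id, _ in hits]
```

Step 3 would pass the text behind `top_ids` to an LLM prompt or return it directly. Brute force scans every vector on each query, which is why production systems switch to ANN indexes as the collection grows.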
Chunking often matters more than embedding model choice. A single 10,000-token document produces one embedding that averages all its topics into a single point, which is of little use for retrieval. Split into chunks of 256–1024 tokens with 10–20% overlap. Semantic chunking (splitting at natural boundaries like paragraphs or sentence groups) usually outperforms fixed-size chunking. Bad chunking makes even text-embedding-3-large look bad.
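Fixed-size chunking with overlap is a sliding window over the token sequence. The sketch below assumes the text has already been tokenized (a whitespace split is used here as a rough stand-in for a real tokenizer); `chunk_tokens` and its defaults are illustrative, and semantic chunking would replace the fixed step with boundary detection.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

text = "one two three four five six seven eight nine ten"
tokens = text.split()  # crude tokenizer, for illustration only
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
# [['one', 'two', 'three', 'four'],
#  ['four', 'five', 'six', 'seven'],
#  ['seven', 'eight', 'nine', 'ten']]
```

The overlap repeats the boundary tokens in both neighbors, so a sentence that straddles a cut still appears whole in at least one chunk.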
Code, Image, and Multimodal Embeddings
Code embeddings — Models like Voyage-code-2 or CodeBERT embed code snippets for semantic search. “Find all functions that handle file uploads” works because the embedding captures what the code does, not what it’s named. Useful for codebase-wide search and RAG over documentation.
Image embeddings — CLIP embeds images and text into the same space. An image of a dog and the text “a photo of a dog” produce similar vectors. This enables text-to-image search, zero-shot image classification, and multimodal RAG.
Multimodal embeddings: Jina CLIP v2 and similar models embed images and text into one shared space, so a page image or diagram and its surrounding text can live in the same index. Store these in a vector DB and you can search a PDF with diagrams using natural language.
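Zero-shot image classification with a CLIP-style model reduces to comparing one image embedding against the embeddings of candidate label prompts. The sketch below assumes those embeddings already exist; the vectors are made up, and `zero_shot_classify` and its `temperature` value are hypothetical rather than part of any CLIP library.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_vec, label_vecs, temperature=0.05):
    """Softmax over image-to-prompt cosine similarities."""
    labels = list(label_vecs)
    logits = [cosine(image_vec, label_vecs[label]) / temperature for label in labels]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Made-up vectors standing in for real image and text encoder outputs.
image_of_dog = [0.8, 0.1, 0.2]
prompts = {
    "a photo of a dog": [0.9, 0.2, 0.1],
    "a photo of a cat": [0.2, 0.9, 0.1],
    "a photo of a car": [0.1, 0.1, 0.9],
}
probs = zero_shot_classify(image_of_dog, prompts)
best = max(probs, key=probs.get)
```

Text-to-image search is the same comparison in the other direction: embed the query text once and rank stored image embeddings by similarity.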
Key Takeaways
- No single best embedding model. text-embedding-3-large scores well on MTEB but costs about $0.13 per 1M tokens, while open-weight models swap API fees for your own hosting costs. Benchmark on your data and task.
- Chunking is the hidden variable. Bad chunking ruins retrieval regardless of embedding quality. Spend time on chunk boundaries and overlap.
- Normalize your vectors. Most models output unit vectors. If yours doesn’t, L2-normalize before storing, which makes cosine and dot-product rankings equivalent.
- Matryoshka gives you dimensionality for free. Use a Matryoshka-compatible model and tune dimensionality as a hyperparameter without re-embedding.
- Dimensionality hits a wall around 1536. Going from 1536 to 3072 dims typically improves retrieval <1% for 2× the storage cost. Profile before scaling up.
- Vector DB index overhead is real. Budget 1.5–2× the raw vector size for a production search index.
References
- MTEB: Muennighoff et al., 2023 — MTEB: Massive Text Embedding Benchmark — arXiv
- BGE: Xiao et al., 2023 — C-Pack: Packaged Resources To Advance General Chinese Embedding — arXiv
- BGE-M3: Chen et al., 2024 — BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity — arXiv
- Matryoshka: Kusupati et al., 2022 — Matryoshka Representation Learning — arXiv
- Jina embeddings v3: Günther et al., 2024 — jina-embeddings-v3 — arXiv
- E5: Wang et al., 2022 — Text Embeddings by Weakly-Supervised Contrastive Pre-training — arXiv
- CLIP: Radford et al., 2021 — Learning Transferable Visual Models From Natural Language Supervision — arXiv