Chandra

Embeddings & Vector Representations

Before reading: you should understand tensors, the Transformer architecture, and self-supervised learning — all covered in Machine Learning.

You search for “fast Python web framework” and get the Flask docs. Not because the page contains those exact words — it doesn’t — but because an embedding model understood that “Flask” is a fast Python web framework and placed its vector near that query in semantic space.

This is the power of embeddings: they match on meaning, not just on exact words. They are the backbone of most modern search, recommendation, and RAG systems.

What Is an Embedding?

An embedding is a dense vector of floating-point numbers, typically a few hundred to a few thousand dimensions (256 to 4096 for the models compared below), that represents a piece of text, an image, or any other data in a continuous vector space. The key property: semantically similar items are close together. The distance between two embeddings encodes how related their meanings are.

The classic example: in a well-trained word embedding space, vec("king") − vec("man") + vec("woman") ≈ vec("queen"). Vector arithmetic captures semantic relationships as geometric relationships.
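A toy illustration of that arithmetic, assuming hand-made 3-dimensional vectors. The axes and values are invented for this sketch; real word embeddings are learned and have hundreds of dimensions.

```python
import numpy as np

# Hand-made toy vectors; axes are (royalty, maleness, femaleness).
# These values are invented for illustration, not taken from a real model.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vocab[a] - vocab[b] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", vocab)
print(result)  # queen
```

Excluding the three input words from the candidates is the usual convention for this test; without it, the nearest vector is often one of the inputs.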

Embeddings are the bridge between discrete symbols (words, tokens) and continuous mathematics (gradients, optimization). Without embeddings, neural networks would operate on integer IDs with no notion of similarity — “cat” would be as different from “kitten” as from “concrete.”

How Embeddings Are Trained

Modern text embedding models use a dual-encoder architecture:

  1. Two encoders, usually the same Transformer with shared weights (or separate text and image encoders for text/image pairs), process a pair of inputs (query/document, sentence/sentence, or text/image).
  2. The final hidden state is pooled (mean, CLS token, or last token) into a single embedding vector.
  3. A contrastive loss function pulls similar pairs closer in vector space and pushes dissimilar pairs apart.
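Steps 2 and 3 can be sketched in NumPy. This is a minimal sketch, assuming mean pooling and an InfoNCE-style loss with in-batch negatives, which is one common contrastive objective and not necessarily what any particular model uses. The function names and the temperature value of 0.05 are illustrative choices.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors, ignoring padding.
    hidden: (batch, seq, dim) floats; mask: (batch, seq), 1 for real tokens."""
    mask = mask[..., None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(queries, docs, temperature=0.05):
    """In-batch contrastive loss: row i of `docs` is the positive for row i
    of `queries`; every other row in the batch is treated as a negative."""
    q, d = l2_normalize(queries), l2_normalize(docs)
    logits = q @ d.T / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on the diagonal
```

The loss is low when each query is most similar to its own document and high when another document in the batch scores higher, which is what pulls positives together and pushes negatives apart during training.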

The training data creates the signal. For search embeddings: (query, relevant document) pairs. For sentence similarity: naturally occurring paraphrases, or paraphrase pairs generated synthetically by LLMs. For code embeddings: (docstring, function body) pairs. For multimodal embeddings (CLIP): (image, caption) pairs, from which the model learns to map images and text into a shared embedding space.

| Model | Dimensions | Max Tokens | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| OpenAI text-embedding-3-small | 512/1536 | 8191 | Cheap, easy, Matryoshka-compatible | Closed-source, tied to OpenAI API |
| OpenAI text-embedding-3-large | 256–3072 | 8191 | Best-in-class on MTEB, Matryoshka | Expensive (~$0.13/1M tokens) |
| Voyage voyage-3 | 1024 | 32000 | Long context, strong retrieval | Closed-source, fewer dimensions |
| Jina embeddings v3 | 1024 | 8192 | Task-specific LoRA adapters, multilingual | Newer, smaller community |
| BGE-M3 (BAAI) | 1024 | 8192 | Dense + sparse + ColBERT, multilingual, open-weight | Needs careful batching for throughput |
| Cohere Embed v3 | 1024 | 512 | Compression-aware, good for long docs | Closed-source |
| E5-mistral-7b-instruct | 4096 | 32768 | Synthetic data trained, open-weight | Huge embed dim → expensive vector DB |

The MTEB (Massive Text Embedding Benchmark) leaderboard tracks performance across classification, clustering, pair classification, reranking, retrieval, STS, and summarization. No single model dominates all tasks. Benchmark on your specific data and task type.

Similarity Measures

Once you have vectors, you need a way to compare them:

| Measure | Formula | Range | When to Use | When It Fails |
| --- | --- | --- | --- | --- |
| Cosine Similarity | A·B / (‖A‖‖B‖) | [-1, 1] | Default for text embeddings | Magnitude matters (rare) |
| Dot Product | A·B | (-∞, ∞) | Vector DBs with normalized vectors | Unnormalized vectors: longer docs score higher |
| Euclidean Distance | ‖A − B‖ | [0, ∞) | Clustering, anomaly detection | Embeddings at different scales |

Most embedding models normalize output vectors to unit length, making cosine similarity and dot product equivalent. Use cosine as the default. For unnormalized embeddings (rare), prefer Euclidean distance in clustering where absolute position matters.
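The three measures are short enough to write out directly. The sketch below uses NumPy and two made-up vectors that point in the same direction but differ in length, to show which measures react to magnitude.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    return float(a @ b)

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a   # same direction, ten times the magnitude

cos_ab = cosine_similarity(a, b)     # direction only: unaffected by scale
dot_ab = dot_product(a, b)           # grows with magnitude
dist_ab = euclidean_distance(a, b)   # also sensitive to magnitude
```

Here cosine similarity reports the two vectors as identical in direction, while the dot product and Euclidean distance both change with the scale factor. That is the failure mode the table describes for unnormalized vectors.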

L2-normalize your embeddings before storing them in a vector database. Dot-product search then gives the same ranking as cosine similarity search, and it is faster to compute because it skips the normalization denominator.
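A quick way to convince yourself of the equivalence, sketched with NumPy on random stand-in vectors (not real model output). The helper name `l2_normalize` and the `eps` guard are choices made for this example.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

rng = np.random.default_rng(0)
# Random stand-in "documents" with deliberately varied magnitudes.
docs = rng.normal(size=(5, 8)) * rng.uniform(0.1, 10, size=(5, 1))
query = rng.normal(size=8)

# Cosine similarity computed the long way on the raw vectors.
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Plain dot product on unit vectors: no per-query division needed.
unit_docs = l2_normalize(docs)
unit_query = l2_normalize(query[None, :])[0]
dot = unit_docs @ unit_query
```

For unit vectors, squared Euclidean distance equals 2 − 2·(dot product), so all three measures produce the same ranking once the vectors are normalized.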

Dimensionality Tradeoffs

Higher-dimensional embeddings capture more nuance — but cost more:

| Dimensions | Storage per 1M vectors (FP32) | Approximate recall (relative to full-dim) | Use Case |
| --- | --- | --- | --- |
| 256 | ~1 GB | 95–97% | Budget-sensitive, high-volume |
| 768 | ~3 GB | 98–99% | Good default for most tasks |
| 1024 | ~4 GB | 98–99% | Open-weight model sweet spot |
| 1536 | ~6 GB | 99%+ | Best-in-class retrieval |
| 3072 | ~12 GB | 99%+ | Diminishing returns past 1536 for most tasks |

Storage in a vector database isn’t just the raw vectors — add index overhead (HNSW graphs, IVF clusters) and metadata. Budget 1.5–2× the raw vector size for the full index. For 10M vectors at 1536 dimensions: ~60 GB raw + ~30 GB index = ~90 GB total.
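The arithmetic behind these estimates is simple enough to script. This is a back-of-the-envelope sketch, assuming FP32 values (4 bytes each), decimal gigabytes, and an overhead factor taken from the 1.5–2× range above. The function names are made up for this example.

```python
def raw_vector_gb(num_vectors, dims, bytes_per_value=4):
    """Raw storage in GB (10^9 bytes); FP32 values by default."""
    return num_vectors * dims * bytes_per_value / 1e9

def index_budget_gb(num_vectors, dims, overhead_factor=1.5, bytes_per_value=4):
    """Raw size multiplied by an assumed index-overhead factor."""
    return raw_vector_gb(num_vectors, dims, bytes_per_value) * overhead_factor

raw = raw_vector_gb(10_000_000, 1536)            # about 61.4 GB
low = index_budget_gb(10_000_000, 1536, 1.5)     # about 92 GB
high = index_budget_gb(10_000_000, 1536, 2.0)    # about 123 GB
```

The unrounded figures land slightly above the rounded ~60 GB and ~90 GB quoted in the text. Actual overhead depends on the index type and its parameters, so measure it on a sample before provisioning.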

Matryoshka embeddings: train once, use at any dimension. A Matryoshka embedding model produces a single full-size vector (3072 dims for text-embedding-3-large), and you can truncate it to its first 1536, 768, or 256 components and keep strong performance, provided you re-normalize the truncated vector. OpenAI’s text-embedding-3 models and Jina embeddings v3 support this technique. This means you can store full-dim embeddings in cold storage and truncate to 256 dims for a fast approximate index, with no separate model or re-embedding needed.
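Truncation itself is a slice followed by re-normalization. The sketch below uses random vectors as a stand-in for model output, so it only demonstrates the mechanics. The quality retention comes from how a Matryoshka model is trained, and truncating embeddings from a model that was not trained this way will generally hurt retrieval much more.

```python
import numpy as np

def truncate_embedding(vectors, dims):
    """Keep the first `dims` components of each row, then re-normalize
    so dot-product and cosine scores stay equivalent."""
    truncated = vectors[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 3072))                    # random stand-in, not model output
full /= np.linalg.norm(full, axis=1, keepdims=True)  # unit length, like most models

small = truncate_embedding(full, 256)                # shape (4, 256), unit length again
```

A common pattern is to search the truncated index first and then re-score the top candidates with the full-dimension vectors.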

Practical: Storing and Querying

Embeddings are worthless without retrieval. The vector database handles indexing and similarity search at scale (see Specialized Databases for pgvector/Pinecone/Milvus).

The pipeline:

  1. Embed documents → store vectors + metadata in vector DB
  2. Embed user query → search vector DB for nearest neighbors (k-NN or ANN)
  3. Retrieve top-k results → feed into LLM context (if RAG) or return directly (if search)
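The pipeline above can be sketched end to end in a few lines. This is a minimal sketch and makes two loud assumptions: `BagOfWordsEmbedder` is a hypothetical stand-in for a real embedding model (it matches shared words, not meaning), and `InMemoryIndex` does exact brute-force search where a real vector database would use an ANN index.

```python
import numpy as np

class BagOfWordsEmbedder:
    """Hypothetical stand-in for an embedding model. It only matches shared
    words, not meaning; swap in a real model's embed call in practice."""
    def __init__(self, corpus):
        words = sorted({w for text in corpus for w in text.lower().split()})
        self.vocab = {w: i for i, w in enumerate(words)}

    def embed(self, text):
        vec = np.zeros(len(self.vocab))
        for word in text.lower().split():
            if word in self.vocab:
                vec[self.vocab[word]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec   # unit length when non-empty

class InMemoryIndex:
    """Exact k-NN over unit vectors, stored alongside their metadata."""
    def __init__(self, embedder):
        self.embedder = embedder
        self.vectors, self.metadata = [], []

    def add(self, text, **meta):
        self.vectors.append(self.embedder.embed(text))
        self.metadata.append({"text": text, **meta})

    def search(self, query, k=3):
        scores = np.stack(self.vectors) @ self.embedder.embed(query)  # cosine
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.metadata[i]) for i in top]

corpus = {
    "flask": "flask is a lightweight python web framework",
    "postgres": "postgres is a relational database",
    "numpy": "numpy provides fast array math",
}
index = InMemoryIndex(BagOfWordsEmbedder(corpus.values()))
for source, text in corpus.items():
    index.add(text, source=source)            # step 1: embed and store

results = index.search("fast python web framework", k=2)   # steps 2 and 3
```

Only the embedder and the index internals change when moving to a real system; the add/search shape of the pipeline stays the same.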

Chunking often matters more than embedding model choice. A single 10,000-token document exceeds the input limit of most models in the table above, and even when it fits, one embedding averages all its topics into a single point that matches no specific query well. Split into chunks of 256–1024 tokens with 10–20% overlap. Semantic chunking (splitting at natural boundaries such as paragraphs or sentence groups) usually outperforms fixed-size chunking. Bad chunking makes even text-embedding-3-large look bad.
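As a baseline to compare semantic chunking against, fixed-size chunking with overlap looks roughly like this. The sketch operates on a list of tokens and uses placeholder strings in place of a real tokenizer. The sizes (512 tokens, 64 overlap, or 12.5%) are example values within the ranges above.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # this window already reaches the end of the document
    return chunks

# Placeholder tokens stand in for the output of a real tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
print([len(c) for c in chunks])  # [512, 512, 304]
```

Each chunk starts 448 tokens after the previous one, so the last 64 tokens of one chunk reappear at the start of the next. That overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.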

Code, Image, and Multimodal Embeddings

Code embeddings — Models like Voyage-code-2 or CodeBERT embed code snippets for semantic search. “Find all functions that handle file uploads” works because the embedding captures what the code does, not what it’s named. Useful for codebase-wide search and RAG over documentation.

Image embeddings — CLIP embeds images and text into the same space. An image of a dog and the text “a photo of a dog” produce similar vectors. This enables text-to-image search, zero-shot image classification, and multimodal RAG.
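Zero-shot classification in a shared space amounts to comparing one image embedding against the text embeddings of candidate labels. The sketch below assumes the embeddings already exist; the 3-dimensional vectors are invented for illustration and the temperature of 0.01 is an assumed value, not a setting taken from any specific model.

```python
import numpy as np

def zero_shot_classify(image_vec, label_vecs, labels, temperature=0.01):
    """Pick the label whose text embedding is most similar to the image embedding."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    label_vecs = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    logits = label_vecs @ image_vec / temperature   # scaled cosine similarities
    probs = np.exp(logits - logits.max())           # softmax, shifted for stability
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Invented toy embeddings; a real system would get these from the model.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
label_vecs = np.array([[0.9, 0.1, 0.0],
                       [0.1, 0.9, 0.0],
                       [0.0, 0.1, 0.9]])
image_vec = np.array([0.8, 0.3, 0.1])   # stands in for an embedded dog photo

predicted, probs = zero_shot_classify(image_vec, label_vecs, labels)
```

Text-to-image search is the same computation in the other direction: embed the text query once and rank stored image embeddings by similarity.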

Multimodal embeddings: Jina CLIP v2 and similar models embed images and text into one shared space, so an image and its surrounding text can be indexed together. Store these in a vector DB and you can search a PDF with diagrams using natural language.

Key Takeaways

  1. No single best embedding model. text-embedding-3-large scores well on MTEB but costs ~$0.13 per 1M tokens, while open-weight alternatives cost only the compute you run them on. Benchmark on your data and task.
  2. Chunking is the hidden variable. Bad chunking ruins retrieval regardless of embedding quality. Spend time on chunk boundaries and overlap.
  3. Normalize your vectors. Most models output unit vectors. If yours doesn’t, L2-normalize before storing so that cosine and dot-product scores are equivalent.
  4. Matryoshka gives you dimensionality for free. Use a Matryoshka-compatible model and tune dimensionality as a hyperparameter without re-embedding.
  5. Dimensionality hits a wall around 1536. Going from 1536 to 3072 dims typically improves retrieval <1% for 2× the storage cost. Profile before scaling up.
  6. Vector DB index overhead is real. Budget 1.5–2× the raw vector size for a production search index.

References

  • MTEB: Muennighoff et al., 2023. MTEB: Massive Text Embedding Benchmark. arXiv.
  • BGE: Xiao et al., 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv.
  • BGE-M3: Chen et al., 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity. arXiv.
  • Matryoshka: Kusupati et al., 2022. Matryoshka Representation Learning. arXiv.
  • Jina embeddings v3: Günther et al., 2024. jina-embeddings-v3. arXiv.
  • E5: Wang et al., 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv.
  • CLIP: Radford et al., 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv.