Specialized Databases
Vector Databases
Vector databases are optimized for storing and querying embeddings — dense vector representations of data (text, images, audio). They enable approximate nearest neighbor (ANN) search for AI/ML applications.
Core Algorithm: ANN Search
Exact nearest neighbor search is O(n*d) — too slow for millions of vectors. ANN sacrifices a small amount of accuracy for massive speed gains:
| Algorithm | Type | Build Time | Search Time | Memory | Accuracy |
|---|---|---|---|---|---|
| HNSW | Hierarchical navigable small world graph | O(n log n) | O(log n) | O(n) | 95-99% |
| IVF | Inverted file index | O(n) | O(sqrt(n)) | O(n) | 90-95% |
| IVF + PQ | IVF with product quantization | O(n) | O(sqrt(n)) | Compressed | 85-95% |
| DiskANN | Vamana graph on disk | O(n log n) | O(log n) | 10GB/TB on disk | 95-99% |
HNSW: The most popular algorithm. Builds a multi-layer graph:
- Layer 0: All vectors (dense connections)
- Layer 1: Random subset (sparser)
- Higher layers: Progressively sparser
- Search traverses from top layer down, refining at each level
Database Comparison
| Database | Algorithm | Cloud-Native | Features |
|---|---|---|---|
| pgvector (PostgreSQL extension) | IVFFlat, HNSW | Yes (via RDS) | SQL interface, ACID, 2000 dims max |
| Pinecone | Custom (HNSW-based) | Yes (SaaS) | Serverless, metadata filtering, namespaces |
| Milvus | HNSW, IVF, DiskANN | Yes (K8s) | Multi-modal, hybrid search, GPU acceleration |
| Weaviate | Custom (HNSW-based) | Yes (K8s) | GraphQL, hybrid search, modules |
| Qdrant | Custom (HNSW-based) | Yes (SaaS, self-hosted) | Payload filtering, snapshots |
Use cases: RAG (Retrieval-Augmented Generation), semantic search, recommendation systems, anomaly detection, image similarity.
Key Metrics
| Metric | Description |
|---|---|
| Recall | Fraction of true nearest neighbors found |
| QPS | Queries per second |
| Indexing time | Time to build the index |
| Memory usage | RAM needed to serve queries |
Search Engines
Search engines provide full-text search with relevance scoring, faceted aggregation, and near-real-time indexing.
Inverted Index
The core data structure for full-text search:
Document 1: "the quick brown fox"
Document 2: "the lazy dog"
Document 3: "quick fox jumps"
Inverted Index:
brown → {1}
dog → {2}
fox → {1, 3}
jumps → {3}
lazy → {2}
quick → {1, 3}
the → {1, 2}
Behind the scenes: Term dictionary → posting list (document IDs + positions + offsets). Compressed using delta encoding, bit packing, and skip lists.
Lucene / Elasticsearch Architecture
graph TD
subgraph "Lucene Index"
D1[Segment 1<br/>in-memory]
D2[Segment 2<br/>on disk]
D3[Segment 3<br/>on disk]
end
Write[Document] -->|1. Buffer| D1
D1 -->|2. Flush<br/>(commit)| D2
D1 & D2 & D3 -->|3. Merge| Merged[(Merged Segment)]
Query -->|4. Search all segments| Searcher
Near-real-time: Documents are searchable almost immediately after indexing (refresh interval, default 1s in Elasticsearch).
Scoring (BM25): The default relevance algorithm:
score(D, Q) = Σ (IDF(t) * TF(t, D) * (k1 + 1)) / (TF(t, D) + k1 * (1 - b + b * |D| / avgdl))
Where:
IDF(t): Inverse document frequency of term tTF(t, D): Term frequency in document Dk1,b: Tunable parameters (defaults: k1=1.2, b=0.75)|D|: Document lengthavgdl: Average document length
| Database | Engine | Strengths |
|---|---|---|
| Elasticsearch | Lucene | Full-text, analytics, aggregation, monitoring |
| Meilisearch | Custom (Rust) | Developer-friendly, typo-tolerant, instant |
| Typesense | Custom (C++) | Low-latency, simple deployment, geo-search |
Use cases: Site search, log analytics (ELK stack), product search, e-commerce filtering.
Embedded Databases
Embedded databases run in-process with the application (no separate server). They sacrifice scalability for simplicity and speed.
SQLite
The most widely deployed database engine in the world (every smartphone, browser, many embedded systems).
graph TD
App[Application<br/>Process] -->|SQL queries| SQLite
SQLite -->|B-Tree pages| DB[(database.sqlite)]
SQLite -->|WAL| WAL[(.sqlite-wal)]
SQLite -->|Rollback journal| RJ[(.sqlite-journal)]
Page structure: 4KB pages (default), B-Tree for tables (leaf = data), B+Tree for indexes.
Journal modes:
| Mode | Durability | Performance |
|---|---|---|
DELETE (default) | Safe (crash = rollback) | Slower (delete journal at end) |
TRUNCATE | Safe | Faster (truncate journal) |
PERSIST | Safe | Same as TRUNCATE |
WAL (Write-Ahead Log) | Safe | Much faster (concurrent reads + write) |
MEMORY | Not durable (lost on crash) | Fastest |
OFF | Not durable | Fastest |
WAL mode: The write-ahead log allows concurrent readers and a single writer. Readers don’t block the writer. The WAL is periodically checkpointed into the main database file.
Concurrency: Limited — only one writer at a time (table-level locking). MVCC via WAL allows concurrent reads during writes.
Best for: Mobile apps, desktop apps, small servers, embedded systems, prototyping.
DuckDB
DuckDB is an in-process columnar OLAP database — the SQLite equivalent for analytical workloads.
Columnar storage: Data is stored by column, not by row. Enables:
- Only read relevant columns (skip others)
- Better compression (same datatype per column)
- Vectorized execution (SIMD-friendly)
Vectorized execution: Operates on batches of 2048 values at a time (vectors), not individual rows. Minimizes function call overhead and exploits CPU cache / SIMD instructions.
Performance:
- Analytical queries: 10-100x faster than SQLite for aggregations
- Data loading: Millions of rows per second
- Memory: Efficient compression (run-length, dictionary, constant encoding)
Use cases: Data science, analytics, ETL, ad-hoc queries on parquet/CSV files.
Streaming Databases
Streaming databases process real-time data streams with SQL semantics.
| Database | Model | Storage | Consistency |
|---|---|---|---|
| Materialize | SQL materialized views | Persistent (Kafka-backed) | Eventually consistent |
| RisingWave | SQL streaming, ETL | Object storage (S3) | Exactly-once |
| ksqlDB | Kafka-native streaming | Kafka topics | Exactly-once |
Key concept: A materialized view that is continuously updated as new data arrives, rather than recomputed on query.
Time-Series Databases
While time-series is covered in the taxonomy, key implementation details:
| Database | Storage Engine | Compression | Query Language |
|---|---|---|---|
| InfluxDB | TSM (Time-Structured Merge Tree) | 10-100x (float64 → XOR, timestamps → delta-of-delta) | Flux, InfluxQL |
| TimescaleDB | PostgreSQL (hypertables) | Native compression (per chunk) | SQL |
| VictoriaMetrics | LSM with custom merge | 10-100x (integer delta, string dictionary) | PromQL, MetricsQL |
InfluxDB TSM: An LSM-tree variant optimized for time-series:
- WAL for recent writes
- MemTable (in-memory, sorted by time + tag)
- Flush to TSM files (compressed, read-only)
- Merge / compaction
TimescaleDB hypertables: Automatic partitioning by time into chunks. SQL interface with full PostgreSQL compatibility.
| Feature | InfluxDB | TimescaleDB | VictoriaMetrics |
|---|---|---|---|
| SQL? | No (Flux) | Yes (PostgreSQL) | No (PromQL) |
| Joins | Limited | Full SQL joins | None |
| High-availability | InfluxDB Clustered | Patroni, streaming rep | VictoriaMetrics cluster |
| Compression ratio | 10x-100x | 6x-10x | 10x-100x |