Chandra
Database / Specialized Databases

Specialized Databases

Vector Databases

Vector databases are optimized for storing and querying embeddings — dense vector representations of data (text, images, audio). They enable approximate nearest neighbor (ANN) search for AI/ML applications.

Exact nearest neighbor search is O(n*d) — too slow for millions of vectors. ANN sacrifices a small amount of accuracy for massive speed gains:

AlgorithmTypeBuild TimeSearch TimeMemoryAccuracy
HNSWHierarchical navigable small world graphO(n log n)O(log n)O(n)95-99%
IVFInverted file indexO(n)O(sqrt(n))O(n)90-95%
IVF + PQIVF with product quantizationO(n)O(sqrt(n))Compressed85-95%
DiskANNVamana graph on diskO(n log n)O(log n)10GB/TB on disk95-99%

HNSW: The most popular algorithm. Builds a multi-layer graph:

  • Layer 0: All vectors (dense connections)
  • Layer 1: Random subset (sparser)
  • Higher layers: Progressively sparser
  • Search traverses from top layer down, refining at each level

Database Comparison

DatabaseAlgorithmCloud-NativeFeatures
pgvector (PostgreSQL extension)IVFFlat, HNSWYes (via RDS)SQL interface, ACID, 2000 dims max
PineconeCustom (HNSW-based)Yes (SaaS)Serverless, metadata filtering, namespaces
MilvusHNSW, IVF, DiskANNYes (K8s)Multi-modal, hybrid search, GPU acceleration
WeaviateCustom (HNSW-based)Yes (K8s)GraphQL, hybrid search, modules
QdrantCustom (HNSW-based)Yes (SaaS, self-hosted)Payload filtering, snapshots

Use cases: RAG (Retrieval-Augmented Generation), semantic search, recommendation systems, anomaly detection, image similarity.

Key Metrics

MetricDescription
RecallFraction of true nearest neighbors found
QPSQueries per second
Indexing timeTime to build the index
Memory usageRAM needed to serve queries

Search Engines

Search engines provide full-text search with relevance scoring, faceted aggregation, and near-real-time indexing.

Inverted Index

The core data structure for full-text search:

Document 1: "the quick brown fox"
Document 2: "the lazy dog"
Document 3: "quick fox jumps"

Inverted Index:
brown  → {1}
dog    → {2}
fox    → {1, 3}
jumps  → {3}
lazy   → {2}
quick  → {1, 3}
the    → {1, 2}

Behind the scenes: Term dictionary → posting list (document IDs + positions + offsets). Compressed using delta encoding, bit packing, and skip lists.

Lucene / Elasticsearch Architecture

graph TD
    subgraph "Lucene Index"
        D1[Segment 1<br/>in-memory]
        D2[Segment 2<br/>on disk]
        D3[Segment 3<br/>on disk]
    end
    Write[Document] -->|1. Buffer| D1
    D1 -->|2. Flush<br/>(commit)| D2
    D1 & D2 & D3 -->|3. Merge| Merged[(Merged Segment)]
    Query -->|4. Search all segments| Searcher

Near-real-time: Documents are searchable almost immediately after indexing (refresh interval, default 1s in Elasticsearch).

Scoring (BM25): The default relevance algorithm:

score(D, Q) = Σ (IDF(t) * TF(t, D) * (k1 + 1)) / (TF(t, D) + k1 * (1 - b + b * |D| / avgdl))

Where:

  • IDF(t): Inverse document frequency of term t
  • TF(t, D): Term frequency in document D
  • k1, b: Tunable parameters (defaults: k1=1.2, b=0.75)
  • |D|: Document length
  • avgdl: Average document length
DatabaseEngineStrengths
ElasticsearchLuceneFull-text, analytics, aggregation, monitoring
MeilisearchCustom (Rust)Developer-friendly, typo-tolerant, instant
TypesenseCustom (C++)Low-latency, simple deployment, geo-search

Use cases: Site search, log analytics (ELK stack), product search, e-commerce filtering.

Embedded Databases

Embedded databases run in-process with the application (no separate server). They sacrifice scalability for simplicity and speed.

SQLite

The most widely deployed database engine in the world (every smartphone, browser, many embedded systems).

graph TD
    App[Application<br/>Process] -->|SQL queries| SQLite
    SQLite -->|B-Tree pages| DB[(database.sqlite)]
    SQLite -->|WAL| WAL[(.sqlite-wal)]
    SQLite -->|Rollback journal| RJ[(.sqlite-journal)]

Page structure: 4KB pages (default), B-Tree for tables (leaf = data), B+Tree for indexes.

Journal modes:

ModeDurabilityPerformance
DELETE (default)Safe (crash = rollback)Slower (delete journal at end)
TRUNCATESafeFaster (truncate journal)
PERSISTSafeSame as TRUNCATE
WAL (Write-Ahead Log)SafeMuch faster (concurrent reads + write)
MEMORYNot durable (lost on crash)Fastest
OFFNot durableFastest

WAL mode: The write-ahead log allows concurrent readers and a single writer. Readers don’t block the writer. The WAL is periodically checkpointed into the main database file.

Concurrency: Limited — only one writer at a time (table-level locking). MVCC via WAL allows concurrent reads during writes.

Best for: Mobile apps, desktop apps, small servers, embedded systems, prototyping.

DuckDB

DuckDB is an in-process columnar OLAP database — the SQLite equivalent for analytical workloads.

Columnar storage: Data is stored by column, not by row. Enables:

  • Only read relevant columns (skip others)
  • Better compression (same datatype per column)
  • Vectorized execution (SIMD-friendly)

Vectorized execution: Operates on batches of 2048 values at a time (vectors), not individual rows. Minimizes function call overhead and exploits CPU cache / SIMD instructions.

Performance:

  • Analytical queries: 10-100x faster than SQLite for aggregations
  • Data loading: Millions of rows per second
  • Memory: Efficient compression (run-length, dictionary, constant encoding)

Use cases: Data science, analytics, ETL, ad-hoc queries on parquet/CSV files.

Streaming Databases

Streaming databases process real-time data streams with SQL semantics.

DatabaseModelStorageConsistency
MaterializeSQL materialized viewsPersistent (Kafka-backed)Eventually consistent
RisingWaveSQL streaming, ETLObject storage (S3)Exactly-once
ksqlDBKafka-native streamingKafka topicsExactly-once

Key concept: A materialized view that is continuously updated as new data arrives, rather than recomputed on query.

Time-Series Databases

While time-series is covered in the taxonomy, key implementation details:

DatabaseStorage EngineCompressionQuery Language
InfluxDBTSM (Time-Structured Merge Tree)10-100x (float64 → XOR, timestamps → delta-of-delta)Flux, InfluxQL
TimescaleDBPostgreSQL (hypertables)Native compression (per chunk)SQL
VictoriaMetricsLSM with custom merge10-100x (integer delta, string dictionary)PromQL, MetricsQL

InfluxDB TSM: An LSM-tree variant optimized for time-series:

  • WAL for recent writes
  • MemTable (in-memory, sorted by time + tag)
  • Flush to TSM files (compressed, read-only)
  • Merge / compaction

TimescaleDB hypertables: Automatic partitioning by time into chunks. SQL interface with full PostgreSQL compatibility.

FeatureInfluxDBTimescaleDBVictoriaMetrics
SQL?No (Flux)Yes (PostgreSQL)No (PromQL)
JoinsLimitedFull SQL joinsNone
High-availabilityInfluxDB ClusteredPatroni, streaming repVictoriaMetrics cluster
Compression ratio10x-100x6x-10x10x-100x