System Design
Building and architecting distributed systems at scale. Every concept here is a response to a real failure in production — someone built a system, it broke under load, and the pattern emerged.
Core Design Principles
Scalability
Handling growth without rewriting the system. Two approaches:
- Vertical scaling: buy a bigger machine. Simple, hits a ceiling fast. AWS’s largest instance can’t serve 100 million users.
- Horizontal scaling: add more machines. Complex — now you have to coordinate them — but unbounded. Netflix runs thousands of instances.
- The hard part isn’t adding servers. It’s that your database, your cache, and your message queue all need to scale too, and they scale differently.
Reliability & Availability
A system that works today but fails tomorrow isn’t reliable. A system that’s down right now isn’t available. The distinction matters because the fixes are different:
- Redundancy: if one server dies, another takes over. But redundancy alone isn’t enough — you need failover detection. A dead server in the pool silently dropping requests is worse than a server that’s visibly down.
- Uptime percentages: 99.9% (“three nines”) = ~8.7 hours of downtime per year. 99.99% = ~52 minutes. The jump from three to four nines often doubles the engineering effort.
- Real example: GitHub’s 2018 outage — a network partition split a MySQL cluster. The system detected failure but the failover mechanism itself had a bug. Availability requires testing the failure paths, not just the happy path.
Performance
Speed and capacity are two different things:
- Latency: how fast one request completes. A user typing in a search box cares about latency.
- Throughput: how many requests the system handles per second. A payment processor during Black Friday cares about throughput.
- Optimizing one often helps the other — caching reduces both latency and server load — but they can conflict. Batching requests improves throughput at the cost of individual request latency.
Fault Tolerance
Systems fail. The question is whether they fail gracefully:
- Retry with backoff: a downstream service times out. Retrying immediately just adds load to an already struggling service. Exponential backoff (1s → 2s → 4s → 8s) gives it room to recover.
- Circuit breaker: after N consecutive failures, stop trying and fail fast. Amazon’s “Blast Radius” principle: a failing recommendation service shouldn’t take down checkout.
- Graceful degradation: if the recommendation engine is down, show the top-10 list instead of an error page. Users tolerate reduced functionality more than they tolerate errors.
Topics
- Infrastructure — Kafka (high-throughput event streaming, LinkedIn processes 7 trillion messages/day), RabbitMQ (flexible routing, when you need per-message acknowledgements), event processing patterns (dead letter queues, exactly-once semantics, ordering guarantees).
- Patterns — consistent hashing (add a cache node without invalidating everything), ID generation (Twitter Snowflake: 64-bit IDs, no coordination), cache stampede (expiring a hot key → 1000 servers all hit the DB simultaneously), rate limiting (token bucket — smooth bursts, sliding window — precise limits).
- Systems — distributed cache (memcached at Facebook, Redis at Twitter), task scheduler (ensure a job runs exactly once, not zero times and not twice), notification system (APNs/FCM + WebSocket fallback, fan-out to millions), real-time chat (WebSocket state, message ordering, presence detection).
Building Blocks
Load Balancers
Distribute traffic across servers. Layer 4 (TCP) is fast but blind. Layer 7 (HTTP) can route /checkout to a bigger instance pool and /static to a CDN. Example: a single NGINX instance can handle ~10k concurrent connections. At scale, you need a tiered approach — DNS → hardware LB → software LB → application.
SQL vs NoSQL
Not a religious war. A decision tree:
- Need ACID transactions, complex joins, data integrity above all? → PostgreSQL, MySQL.
- Need flexible schema, rapid iteration, nested documents? → MongoDB.
- Need sub-millisecond latency, simple key-value? → Redis.
- Need massive write throughput, time-series? → Cassandra.
- Example: Uber started with PostgreSQL. When trip volume exploded, they moved hot data to Redis (driver locations, sub-ms updates) and kept trip history in schemaless document stores. Different data, different database.
Caching
The fastest database query is the one you never make. Example: at Reddit, a popular post’s comment count might be requested 100k times per minute. Cache it in Redis with a 1-second TTL and you’ve eliminated 99,999 database queries. The stale-1-second count is imperceptible to users.
- Redis / Memcached: in-memory, sub-ms latency. Redis adds data structures (lists, sets, sorted sets) beyond simple key-value.
- Cache invalidation: the famously hard problem. Strategies: TTL (simple, eventually consistent), write-through (cache updated on every write, consistent but slow), write-behind (cache updated async, fast but can lose data).
Message Queues
Decouple producers from consumers. A user uploads a video — instead of processing it synchronously (user waits 30 seconds), you drop a message in a queue and return “processing.” Example: Instagram uploads. The upload API accepts the file, queues a processing job, and returns immediately. The user sees a spinner, not a timeout.
CAP Theorem
In a network partition, you can have Consistency (all nodes see the same data) or Availability (every request gets a response) — not both. Example: a banking system during a network split. Choose Consistency (reject writes until partition heals — no overdrafts but service is degraded) or Availability (accept writes on both sides, merge later — 24/7 service but risk of conflicting balances). Most systems choose CP for financial data and AP for user profiles.
Database Sharding
Split data across multiple databases by a shard key. Example: shard by user_id % 1000 → 1000 shards. User 42’s data lives on shard 42. The problem: queries that cross shards (all users in California) become expensive scatter-gather operations. Shard key choice is irreversible — changing it requires migrating all data.
Database Replication
Keep copies for redundancy. Primary-replica: writes go to primary, reads go to replicas. Example: YouTube’s comments. The write volume is moderate, read volume is enormous. One primary handles writes, 10 replicas handle reads. Replication lag: a comment posted 200ms ago might not appear in a read replica yet. Your UX needs to handle that.
DNS
More than just name resolution. DNS does routing, load balancing, and failover. Example: api.example.com resolves to different IPs based on geography (geo-DNS) and health (remove unhealthy IPs from DNS response). Tradeoff: DNS changes propagate slowly (TTL), so it’s the slowest failover mechanism, not suitable for sub-second cutover.
CDNs
Serve static content from edge locations close to users. Example: an image hosted on cdn.example.com is cached in 200+ edge locations. A user in Sydney fetches it from the Sydney edge in 10ms instead of from Virginia in 200ms. CDNs also absorb DDoS attacks — a 1 Tbps attack on Cloudflare’s 200+ edge locations is a Tuesday.
Data Consistency Models
- Strong consistency: after a write, all subsequent reads see it. Costs latency (consensus required).
- Eventual consistency: after a write, reads eventually see it. Amazon DynamoDB default: replicas converge over time.
- Real example: a social media “like” count. Strong consistency (every like is immediately reflected) costs ~100ms per operation. Eventual consistency (like count updates within 5 seconds) costs ~5ms. For a like button, 5-second staleness is fine. For a bank balance, it’s not.
API Design
- REST: resource-oriented, stateless, cacheable. Good for CRUD, public APIs. Use cursor-based pagination for large datasets (stable under inserts), not offset-based (drifts when rows are added).
- gRPC: binary protocol, strongly typed contracts via protobuf. Good for internal service-to-service communication where latency matters.
- Rate limiting: protect your API from abuse. Token bucket (steady rate with burst tolerance), sliding window log (precise, expensive), fixed window (simple, burst at boundary).
- Versioning: never break a live API. Deploy v2 alongside v1, deprecate v1 with headers, remove when traffic reaches zero.
Observability
You can’t fix what you can’t see. Three pillars:
- Logging: structured, searchable record of events. “User 42 completed checkout. Total: $53.00.” ELK stack (Elasticsearch + Logstash + Kibana) or Grafana Loki.
- Metrics: numeric time-series data. “P99 latency for
/checkout: 230ms.” Prometheus + Grafana. - Tracing: follows a single request across service boundaries. “
POST /ordersspent 12ms in auth service, 45ms in inventory service, 200ms in payment service.” Jaeger, Zipkin, or OpenTelemetry.
Example: Stripe’s observability. When a payment fails, the engineer traces the exact request through 15 microservices, sees the inventory service returned a 500 at timestamp X, and correlates it with a deployment 3 minutes earlier. Without tracing, that’s a multi-hour investigation.