System Design Fundamentals

Definition

System design is the process of defining architecture, components, interfaces, and data flows to satisfy specified requirements. In an interview or production context it encompasses both the decomposition of a problem into services and the selection of appropriate infrastructure primitives (load balancers, caches, queues, databases) to meet scale and reliability targets.

Intuition

Every system design decision is a trade-off between competing constraints: consistency vs. availability, cost vs. latency, simplicity vs. flexibility. The goal is not to find a perfect design—it does not exist—but to articulate which trade-offs are acceptable given the requirements and to build a system that can evolve when those requirements change.

Formal Description

Requirements Gathering

Functional requirements define what the system does:

User-facing features (create post, send message, search)
Data ingestion and processing pipelines
External integrations and APIs

Non-functional requirements define how well it does it:

Latency: p50/p99 read/write targets (e.g., < 100 ms p99 read)
Throughput: requests per second, messages per second
Availability: target uptime SLA (99.9% = 8.7 h/year downtime; 99.99% = 52 min/year)
Durability: tolerance for data loss (RPO)
Consistency: whether all users see the same data at the same time
Scalability: expected growth (10x in 2 years?)

Capacity Estimation (Back-of-Envelope)

Useful constants to memorize:

Unit	Value
1 million requests/day	~12 req/s
1 billion requests/day	~12,000 req/s
1 KB × 1M users	~1 GB
HDD read	~100 MB/s
SSD read	~500 MB/s–3 GB/s
Network (datacenter)	~10 Gbps
RTT same region	~1 ms
RTT cross-continent	~100–150 ms

Storage estimation example (Twitter-scale):

500 M tweets/day × 300 bytes/tweet = 150 GB/day raw text
Over 5 years = ~270 TB
With replication (3×) and indexes (~2×): ~1.6 PB

Bandwidth estimation: peak traffic ≈ 3× average. If average is 1 Gbps, provision for 3 Gbps.

Core Components

Load Balancer

Distributes traffic across a pool of application servers
Algorithms: round-robin, least-connections, IP-hash (session affinity), weighted
Layer 4 (TCP) vs Layer 7 (HTTP, can route by URL/headers)
Introduces a single point of failure → use active-active or active-passive pairs
Hardware (F5) vs software (HAProxy, Nginx, AWS ALB)

CDN (Content Delivery Network)

Caches static and cacheable content at edge nodes geographically close to users
Reduces origin load and cross-continent latency
Pull CDN: content fetched on first miss, cached with TTL
Push CDN: content uploaded to CDN proactively (suitable for large files, video)
Use for: images, JS/CSS bundles, large media, and APIs with high read:write ratio

Cache

In-memory key-value store between application and database
Redis: persistent, rich data structures (sorted sets, hashes), pub/sub, Lua scripting
Memcached: simpler, horizontally scalable, no persistence
Cache-aside (lazy loading): app checks cache, on miss fetches from DB and populates cache
Write-through: writes go to cache and DB synchronously
Write-back (write-behind): writes go to cache immediately, async flush to DB
Eviction policies: LRU (most common), LFU, TTL-based
Cache stampede / thundering herd: many simultaneous misses → use mutex or probabilistic early refresh

Message Queue

Decouples producers from consumers; enables async processing and buffering
Kafka: durable, distributed, replay-capable log; high throughput; consumer groups
RabbitMQ: traditional message broker; routing, fanout, dead-letter queues
SQS: managed, at-least-once delivery, FIFO variant for ordering
Use cases: order processing, event sourcing, notification delivery, log aggregation

Database: SQL vs NoSQL

Dimension	SQL (RDBMS)	NoSQL
Schema	Fixed, normalized	Flexible / schema-on-read
Transactions	ACID by default	Varies (Cassandra: lightweight txns; DynamoDB: limited)
Query flexibility	Arbitrary JOINs, aggregations	Limited (scan is expensive)
Horizontal scale	Hard (sharding complex)	Designed for it
Consistency	Strong by default	Often eventual
Examples	PostgreSQL, MySQL	Cassandra, DynamoDB, MongoDB, Redis

Choose SQL when: transactions across entities matter, schema is stable, relational queries are needed. Choose NoSQL when: write throughput is extreme, schema is dynamic, denormalization is acceptable.

CAP Theorem

A distributed system can guarantee at most two of:

Consistency (C): every read returns the most recent write
Availability (A): every request receives a (non-error) response
Partition Tolerance (P): system continues to operate despite network partitions

In practice, partitions will occur, so the real choice is between CP and AP:

CP systems: HBase, Zookeeper, etcd — return errors under partition to preserve consistency
AP systems: Cassandra, DynamoDB, CouchDB — return stale data under partition to preserve availability
CA systems: only possible in single-node or non-distributed settings

See also PACELC: even when no partition exists, there is a latency vs. consistency trade-off.

System Design Interview Framework

Clarify requirements (5 min): functional scope, scale, constraints
Estimate capacity (3 min): QPS, storage, bandwidth
Define the API (3 min): endpoints, request/response shapes
High-level design (10 min): draw boxes — clients, LB, app servers, cache, DB, queue
Deep dive (15 min): pick 2–3 components to detail; discuss trade-offs
Identify bottlenecks (5 min): SPOFs, hot partitions, cascading failures; propose mitigations

Applications

Designing URL shorteners (TinyURL): hash function, collision handling, redirect latency
Designing news feeds (Twitter/Facebook): fan-out on write vs fan-out on read
Designing rate limiters: token bucket per user, sliding window counters in Redis
Designing search indexes: inverted index, sharding by term or document
Designing notification systems: push vs pull, webhooks, websockets

Trade-offs

Choice	Pro	Con
More caching	Low latency, reduced DB load	Stale data, memory cost, invalidation complexity
Async via queues	Decoupling, resilience	Eventual consistency, harder debugging
Microservices	Independent scaling and deployment	Network overhead, distributed tracing complexity
Denormalization	Faster reads	Write amplification, data inconsistency risk
More replicas	High availability, read scaling	Replication lag, cost

Notes

Explorer

system_design_fundamentals