Scalability Patterns
Definition
Scalability is a system’s ability to handle increasing load without degrading performance beyond acceptable bounds. Scalability patterns are architectural decisions and algorithms that enable a system to grow — more users, more data, more requests — while preserving low latency and high availability.
Intuition
You scale a system by either making each component more powerful (vertical) or by adding more components (horizontal). Horizontal scaling is preferable at cloud scale because commodity hardware is cheaper than specialized hardware and it enables redundancy. The prerequisite for horizontal scaling is that components are stateless or that state is managed externally and partitioned effectively.
Formal Description
Horizontal vs Vertical Scaling
Vertical scaling (scale-up)
- Replace a node with a larger one: more CPU cores, RAM, faster disk
- Requires downtime for hardware changes (unless live migration is possible)
- Has a hard physical ceiling
- Simple: no distributed coordination required
- Appropriate for: databases in early stage, legacy single-threaded applications
Horizontal scaling (scale-out)
- Add more nodes of the same size to a pool
- Requires workload distribution (load balancer, consistent hashing, partitioning)
- Theoretically unbounded; adds redundancy automatically
- Appropriate for: stateless web/app tier, caches, distributed databases, queues
- Challenge: distributed state — sessions, locks, aggregations must be externalized
Stateless Services
A stateless service holds no client-specific state in application memory. All state lives in an external store (database, Redis, object storage). Any instance can handle any request.
Benefits:
- Horizontal scaling is trivial: add instances behind a load balancer
- Rolling deploys with zero downtime: drain old instances, spin up new ones
- No sticky sessions required
Implementation:
- Store session tokens in Redis or JWTs (self-contained, signed)
- Store uploaded files in S3 / object storage, not local disk
- Externalize in-memory caches to Redis or Memcached
- Configuration via environment variables or secret stores (not local files)
Read Replicas
A primary database handles all writes; one or more replicas receive an asynchronous stream of changes and serve read traffic.
- Replication lag: replicas trail primary by milliseconds to seconds depending on write volume and network
- Read-your-writes problem: a user writes then immediately reads; if routed to a lagging replica they see stale data. Fix: route a user’s reads to primary for a short window after a write, or use synchronous replication for critical paths
- Scaling reads: typical OLTP workloads are 80–95% reads. Read replicas can multiply read capacity almost linearly
- Failover: if primary fails, promote a replica. Semi-synchronous replication reduces data loss risk
When read replicas are insufficient (write-heavy workload), move to sharding.
Sharding Strategies
Sharding (horizontal partitioning) divides data across multiple database nodes, each owning a subset of the keyspace.
Range sharding
- Each shard owns a contiguous range of keys (e.g., user IDs 0–1M, 1M–2M)
- Natural for range queries: scan a single shard
- Risk: hot partitions if data is not uniformly distributed (e.g., all new users go to the last shard)
- Solution: pre-split ranges, dynamic range splitting (used in HBase, Google Bigtable)
Hash sharding
- Shard = hash(key) mod N
- Uniform distribution of keys across shards (assuming a good hash function)
- Range queries are inefficient: must scatter-gather across all shards
- Problem: resharding requires moving ~half of all keys when N changes
Consistent hashing
- Keys and nodes are mapped onto a ring by hash value
- Each key is owned by the first node clockwise on the ring
- Adding/removing a node affects only its immediate neighbor’s keys (~1/N fraction)
- Virtual nodes (vnodes): each physical node owns multiple positions on the ring, improving balance
- Used in: Amazon DynamoDB, Cassandra, Apache Riak, Akamai CDN
Directory-based sharding
- A lookup service maps each key (or key range) to its shard
- Maximum flexibility: any partition scheme, easy resharding (update the directory)
- The directory itself becomes a bottleneck / SPOF; must be cached and replicated
Sharding challenges:
- Cross-shard joins require scatter-gather or application-level JOINs
- Cross-shard transactions require distributed transaction protocols (2PC, sagas)
- Re-sharding (changing N) is operationally complex; plan ahead for growth
Fan-Out Patterns
Fan-out on write (push model)
- When a user creates content, immediately push it to all followers’ feeds/inboxes
- Write amplification: a celebrity with 50M followers creates 50M write operations
- Reads are O(1): just read the pre-computed feed
- Appropriate for users with small/medium follower counts
- Used by: Twitter (for non-celebrities), Facebook
Fan-out on read (pull model)
- Each user’s feed is computed at read time by fetching posts from followed accounts and merging
- No write amplification; scales well for celebrities
- Expensive reads: O(followers) queries per feed load, requires heavy caching
- Used for: celebrity accounts (Twitter’s hybrid model)
Hybrid approach (Twitter)
- Fan-out on write for users with < N followers (say, 1M)
- Fan-out on read for celebrities (> N followers); results merged at read time
Rate Limiting Algorithms
Rate limiting protects services from abuse and ensures fair resource sharing.
Token bucket
- A bucket holds up to
capacitytokens; tokens are added at ratertokens/second - Each request consumes one token; if bucket is empty, request is rejected or queued
- Allows bursts up to
capacity; sustained rate cannot exceedr - Easy to implement in Redis with a counter and timestamp
- Used by: AWS API Gateway, Stripe
Leaky bucket
- Requests enter a queue (the bucket); they are processed at a fixed output rate
- Excess requests overflow (are dropped)
- Enforces a strict output rate; no burst allowed
- Suitable for smoothing traffic to a downstream service with a fixed capacity
Fixed window counter
- Count requests in the current time window (e.g., per minute)
- Problem: a user can send 2× the limit by placing requests at the end of one window and the start of the next
- Simple and memory-efficient
Sliding window log
- Store timestamps of all requests in a sorted set (e.g., Redis ZSET)
- On each request, remove entries older than window, count remaining, add current timestamp
- Accurate but memory-intensive for high-rate users
Sliding window counter (approximate)
- Combine current and previous window counters weighted by position in the current window
count = prev_count × (1 - elapsed/window) + curr_count- Accurate enough for most purposes; O(1) memory per user
Distributed rate limiting requires atomic operations. Redis INCR + EXPIRE for simple cases; Lua scripts for compound operations; Redis Cluster for multi-shard scenarios.
Caching Strategies
Write-through
- Write goes to cache and database synchronously before acknowledging to the client
- Cache always has fresh data; no stale reads
- Higher write latency (two writes per operation)
- Best when reads follow writes closely and stale data is unacceptable
Write-back (write-behind)
- Write acknowledged after writing to cache only; database updated asynchronously
- Very low write latency
- Risk of data loss if cache node fails before flush
- Best for write-heavy workloads where some loss is tolerable (session data, counters)
Write-around
- Write goes directly to database, bypassing cache
- Cache is populated only on subsequent reads (cache-aside)
- Good when data is written once and rarely read (large files, audit logs)
Cache-aside (lazy loading)
- Application checks cache; on miss, reads from DB, populates cache with TTL
- Cache holds only data that has been requested at least once
- Risk: cold start performance, thundering herd on popular keys expiring simultaneously
- Mitigation: randomize TTLs, use “dog-pile” prevention (single-flight, mutex locks)
Read-through
- Cache sits transparently in front of DB; on miss, cache fetches from DB and returns result
- Application only talks to cache
- Simplifies application code; cache provider handles invalidation logic
Cache invalidation strategies:
- TTL (time-to-live): simplest; accepts bounded staleness
- Event-driven invalidation: DB triggers or application events explicitly delete/update cache keys
- Versioned keys:
user:123:v42; increment version on update; old keys expire naturally - Cache-aside with write invalidation: on write to DB, delete cache key (not update — avoids race condition)
Applications
- URL shortener: consistent hashing across redirect servers; Redis cache for hot URLs
- Social media feed: fan-out on write + read replicas + Redis sorted set per user
- E-commerce flash sale: token bucket rate limiting + write-through cache for inventory
- Multi-region deployment: geographically distributed shards + eventual consistency across regions
- Search service: sharding by document ID, read replicas for search load
Trade-offs
| Pattern | Pro | Con |
|---|---|---|
| Horizontal scaling | Redundancy, unbounded growth | State externalization, coordination overhead |
| Hash sharding | Uniform distribution | No range queries, resharding complexity |
| Fan-out on write | O(1) reads | Write amplification for popular users |
| Token bucket | Burst-friendly | Slightly complex to implement atomically |
| Write-through cache | Strong consistency | Higher write latency |
| Write-back cache | Low write latency | Data loss risk |