Scalability Patterns

Definition

Scalability is a system’s ability to handle increasing load without degrading performance beyond acceptable bounds. Scalability patterns are architectural decisions and algorithms that enable a system to grow — more users, more data, more requests — while preserving low latency and high availability.

Intuition

You scale a system by either making each component more powerful (vertical) or by adding more components (horizontal). Horizontal scaling is preferable at cloud scale because commodity hardware is cheaper than specialized hardware and it enables redundancy. The prerequisite for horizontal scaling is that components are stateless or that state is managed externally and partitioned effectively.

Formal Description

Horizontal vs Vertical Scaling

Vertical scaling (scale-up)

  • Replace a node with a larger one: more CPU cores, RAM, faster disk
  • Requires downtime for hardware changes (unless live migration is possible)
  • Has a hard physical ceiling
  • Simple: no distributed coordination required
  • Appropriate for: databases in early stage, legacy single-threaded applications

Horizontal scaling (scale-out)

  • Add more nodes of the same size to a pool
  • Requires workload distribution (load balancer, consistent hashing, partitioning)
  • Theoretically unbounded; adds redundancy automatically
  • Appropriate for: stateless web/app tier, caches, distributed databases, queues
  • Challenge: distributed state — sessions, locks, aggregations must be externalized

Stateless Services

A stateless service holds no client-specific state in application memory. All state lives in an external store (database, Redis, object storage). Any instance can handle any request.

Benefits:

  • Horizontal scaling is trivial: add instances behind a load balancer
  • Rolling deploys with zero downtime: drain old instances, spin up new ones
  • No sticky sessions required

Implementation:

  • Store session tokens in Redis or JWTs (self-contained, signed)
  • Store uploaded files in S3 / object storage, not local disk
  • Externalize in-memory caches to Redis or Memcached
  • Configuration via environment variables or secret stores (not local files)

Read Replicas

A primary database handles all writes; one or more replicas receive an asynchronous stream of changes and serve read traffic.

  • Replication lag: replicas trail primary by milliseconds to seconds depending on write volume and network
  • Read-your-writes problem: a user writes then immediately reads; if routed to a lagging replica they see stale data. Fix: route a user’s reads to primary for a short window after a write, or use synchronous replication for critical paths
  • Scaling reads: typical OLTP workloads are 80–95% reads. Read replicas can multiply read capacity almost linearly
  • Failover: if primary fails, promote a replica. Semi-synchronous replication reduces data loss risk

When read replicas are insufficient (write-heavy workload), move to sharding.

Sharding Strategies

Sharding (horizontal partitioning) divides data across multiple database nodes, each owning a subset of the keyspace.

Range sharding

  • Each shard owns a contiguous range of keys (e.g., user IDs 0–1M, 1M–2M)
  • Natural for range queries: scan a single shard
  • Risk: hot partitions if data is not uniformly distributed (e.g., all new users go to the last shard)
  • Solution: pre-split ranges, dynamic range splitting (used in HBase, Google Bigtable)

Hash sharding

  • Shard = hash(key) mod N
  • Uniform distribution of keys across shards (assuming a good hash function)
  • Range queries are inefficient: must scatter-gather across all shards
  • Problem: resharding requires moving ~half of all keys when N changes

Consistent hashing

  • Keys and nodes are mapped onto a ring by hash value
  • Each key is owned by the first node clockwise on the ring
  • Adding/removing a node affects only its immediate neighbor’s keys (~1/N fraction)
  • Virtual nodes (vnodes): each physical node owns multiple positions on the ring, improving balance
  • Used in: Amazon DynamoDB, Cassandra, Apache Riak, Akamai CDN

Directory-based sharding

  • A lookup service maps each key (or key range) to its shard
  • Maximum flexibility: any partition scheme, easy resharding (update the directory)
  • The directory itself becomes a bottleneck / SPOF; must be cached and replicated

Sharding challenges:

  • Cross-shard joins require scatter-gather or application-level JOINs
  • Cross-shard transactions require distributed transaction protocols (2PC, sagas)
  • Re-sharding (changing N) is operationally complex; plan ahead for growth

Fan-Out Patterns

Fan-out on write (push model)

  • When a user creates content, immediately push it to all followers’ feeds/inboxes
  • Write amplification: a celebrity with 50M followers creates 50M write operations
  • Reads are O(1): just read the pre-computed feed
  • Appropriate for users with small/medium follower counts
  • Used by: Twitter (for non-celebrities), Facebook

Fan-out on read (pull model)

  • Each user’s feed is computed at read time by fetching posts from followed accounts and merging
  • No write amplification; scales well for celebrities
  • Expensive reads: O(followers) queries per feed load, requires heavy caching
  • Used for: celebrity accounts (Twitter’s hybrid model)

Hybrid approach (Twitter)

  • Fan-out on write for users with < N followers (say, 1M)
  • Fan-out on read for celebrities (> N followers); results merged at read time

Rate Limiting Algorithms

Rate limiting protects services from abuse and ensures fair resource sharing.

Token bucket

  • A bucket holds up to capacity tokens; tokens are added at rate r tokens/second
  • Each request consumes one token; if bucket is empty, request is rejected or queued
  • Allows bursts up to capacity; sustained rate cannot exceed r
  • Easy to implement in Redis with a counter and timestamp
  • Used by: AWS API Gateway, Stripe

Leaky bucket

  • Requests enter a queue (the bucket); they are processed at a fixed output rate
  • Excess requests overflow (are dropped)
  • Enforces a strict output rate; no burst allowed
  • Suitable for smoothing traffic to a downstream service with a fixed capacity

Fixed window counter

  • Count requests in the current time window (e.g., per minute)
  • Problem: a user can send 2× the limit by placing requests at the end of one window and the start of the next
  • Simple and memory-efficient

Sliding window log

  • Store timestamps of all requests in a sorted set (e.g., Redis ZSET)
  • On each request, remove entries older than window, count remaining, add current timestamp
  • Accurate but memory-intensive for high-rate users

Sliding window counter (approximate)

  • Combine current and previous window counters weighted by position in the current window
  • count = prev_count × (1 - elapsed/window) + curr_count
  • Accurate enough for most purposes; O(1) memory per user

Distributed rate limiting requires atomic operations. Redis INCR + EXPIRE for simple cases; Lua scripts for compound operations; Redis Cluster for multi-shard scenarios.

Caching Strategies

Write-through

  • Write goes to cache and database synchronously before acknowledging to the client
  • Cache always has fresh data; no stale reads
  • Higher write latency (two writes per operation)
  • Best when reads follow writes closely and stale data is unacceptable

Write-back (write-behind)

  • Write acknowledged after writing to cache only; database updated asynchronously
  • Very low write latency
  • Risk of data loss if cache node fails before flush
  • Best for write-heavy workloads where some loss is tolerable (session data, counters)

Write-around

  • Write goes directly to database, bypassing cache
  • Cache is populated only on subsequent reads (cache-aside)
  • Good when data is written once and rarely read (large files, audit logs)

Cache-aside (lazy loading)

  • Application checks cache; on miss, reads from DB, populates cache with TTL
  • Cache holds only data that has been requested at least once
  • Risk: cold start performance, thundering herd on popular keys expiring simultaneously
  • Mitigation: randomize TTLs, use “dog-pile” prevention (single-flight, mutex locks)

Read-through

  • Cache sits transparently in front of DB; on miss, cache fetches from DB and returns result
  • Application only talks to cache
  • Simplifies application code; cache provider handles invalidation logic

Cache invalidation strategies:

  • TTL (time-to-live): simplest; accepts bounded staleness
  • Event-driven invalidation: DB triggers or application events explicitly delete/update cache keys
  • Versioned keys: user:123:v42; increment version on update; old keys expire naturally
  • Cache-aside with write invalidation: on write to DB, delete cache key (not update — avoids race condition)

Applications

  • URL shortener: consistent hashing across redirect servers; Redis cache for hot URLs
  • Social media feed: fan-out on write + read replicas + Redis sorted set per user
  • E-commerce flash sale: token bucket rate limiting + write-through cache for inventory
  • Multi-region deployment: geographically distributed shards + eventual consistency across regions
  • Search service: sharding by document ID, read replicas for search load

Trade-offs

PatternProCon
Horizontal scalingRedundancy, unbounded growthState externalization, coordination overhead
Hash shardingUniform distributionNo range queries, resharding complexity
Fan-out on writeO(1) readsWrite amplification for popular users
Token bucketBurst-friendlySlightly complex to implement atomically
Write-through cacheStrong consistencyHigher write latency
Write-back cacheLow write latencyData loss risk