Distributed Systems

Definition

A distributed system is a collection of autonomous computing nodes that communicate over a network and coordinate to appear as a single coherent system to end users. The defining properties are: nodes are physically separated, communication is unreliable (messages can be lost, delayed, or reordered), and nodes can fail independently.

Intuition

The core challenge of distributed systems is that you cannot have simultaneity: there is no shared clock, no instantaneous communication, no global observer. Every assertion like “node A and node B agree on X” must be achieved through a protocol that tolerates message delays and failures. Most distributed systems problems reduce to: how do we make nodes agree on something (consensus) while tolerating failures?

Formal Description

Consistency Models

Consistency models define what guarantees a distributed database gives about the order in which operations are observed across nodes.

Strong consistency (linearizability)

  • Every operation appears to take effect instantaneously at some point between its invocation and completion
  • All clients observe the same total order of operations
  • Equivalent to a single-machine system from the client’s perspective
  • Cost: high latency (requires coordination on every write), reduced availability under partitions
  • Examples: etcd, Zookeeper, Google Spanner (with TrueTime)

Sequential consistency

  • All operations appear in some sequential order that is consistent with the order seen by each individual process
  • Weaker than linearizability: the chosen order need not align with real-time
  • Used in some CPU memory models

Causal consistency

  • Operations that are causally related (A happened-before B) are seen in that order by all nodes
  • Concurrent operations (no causal relation) may be seen in different orders by different nodes
  • Stronger than eventual, weaker than strong; often a practical sweet spot
  • Examples: MongoDB causal sessions, some CRDTs

Eventual consistency

  • If no new updates are made, all replicas will eventually converge to the same value
  • No bound on how long convergence takes
  • High availability and low latency at the cost of temporary divergence
  • Requires conflict resolution: last-write-wins (LWW), CRDTs, application-level merging
  • Examples: Cassandra, DynamoDB (default), S3

Read-your-writes consistency

  • A client always reads its own most recent write (even if other clients may not yet see it)
  • Implemented by routing a client’s reads to the replica that received its writes, or by version vectors

Monotonic reads

  • Once a client reads a value, subsequent reads will never return an older value
  • Prevents a confusing “time traveling” effect where a page load shows older data than a previous one

Consensus: Paxos and Raft

Consensus protocols allow a cluster of nodes to agree on a single value (or sequence of values) even in the presence of failures, as long as a quorum (majority) of nodes is reachable.

Paxos (Lamport, 1989)

  • Roles: Proposer, Acceptor, Learner
  • Two phases: Prepare/Promise (elect a leader), Accept/Accepted (commit a value)
  • Single-decree Paxos agrees on one value; Multi-Paxos extends to a log by reusing the leader
  • Notoriously hard to implement correctly; leaves many details underspecified

Raft (Ongaro & Ousterhout, 2014)

  • Designed for understandability; equivalent safety guarantees to Multi-Paxos
  • Leader-based: one leader per term, elected by majority vote
  • Log replication: leader appends entries, replicates to followers, commits once majority acknowledges
  • Key invariant: a node cannot become leader unless its log is at least as up-to-date as the majority
  • Leader election uses randomized timeouts to avoid split votes
  • Used in: etcd, CockroachDB, TiKV, Consul

Quorum writes/reads

  • With N replicas, a write quorum W and read quorum R, requiring W + R > N guarantees read-your-writes
  • Example: N=3, W=2, R=2. Any pair of read and write sets must overlap by at least one replica
  • Tunable in Cassandra: ALL, QUORUM, LOCAL_QUORUM, ONE

Leader Election

Beyond Raft, simpler leader election can use:

  • Bully algorithm: highest-ID node wins; O(N²) messages
  • Ring algorithm: token passed around ring; O(N) messages
  • Zookeeper ephemeral nodes: first to create /leader wins; watchers for failure detection
  • Distributed locks (Redlock on Redis): controversial; Martin Kleppmann’s critique argues it cannot guarantee safety under clock skew

Failure Modes

Partial failure — the defining characteristic of distributed systems. Node A may fail while B continues. Unlike a single machine where a crash is total, partial failures create ambiguity: did the remote call succeed or fail? The receiving node might have processed the request before crashing.

Split-brain — a network partition causes two sub-clusters each believing they are the sole authoritative partition. Without a quorum requirement, both halves may accept writes, leading to divergent state. Solution: require a majority quorum for writes (sacrificing availability in the minority partition).

Network partitions — network link failure separates nodes. Messages may be dropped, delayed indefinitely, or reordered. TCP guarantees ordering and delivery within a connection but not across connections or after reconnect.

Byzantine failures — nodes behave arbitrarily maliciously or send contradictory messages to different nodes. Requires Byzantine Fault Tolerant (BFT) protocols (PBFT, Tendermint). Most datacenter systems assume crash-stop failures only.

Clock skew — physical clocks on different machines drift. NTP reduces but does not eliminate drift (typically ±100 ms). Logical clocks (Lamport timestamps, vector clocks) capture causality without relying on wall clocks.

Cascading failures — a struggling node responds slowly, causing upstream timeouts and retries, increasing load further. Circuit breakers, bulkheads, and load shedding break the cascade.

Patterns

Saga pattern

  • Manages distributed transactions across services without two-phase commit
  • A saga is a sequence of local transactions; each step publishes an event or message
  • On failure, compensating transactions undo the preceding steps
  • Choreography: services react to events (decentralized, harder to monitor)
  • Orchestration: a central saga orchestrator issues commands and handles failures
  • Example: booking a flight + hotel + car — if car fails, cancel flight and hotel

Circuit Breaker

  • Wraps remote calls; tracks failure rate over a sliding window
  • States: Closed (normal) → Open (failing fast, no calls attempted) → Half-Open (probe)
  • Prevents wasted resources on calls to a down service and gives the service time to recover
  • Hystrix (Netflix), Resilience4j, Polly (.NET)

Bulkhead

  • Isolates thread pools (or connection pools) per downstream service
  • A slow or failing service exhausts only its own pool, not the shared pool
  • Analogy: watertight compartments on a ship; one flooded compartment does not sink the ship

Retry with exponential backoff and jitter

  • On transient failure, retry after 2^attempt × base_delay + random_jitter
  • Jitter prevents synchronized retry storms (all clients retrying simultaneously)
  • Idempotency keys are essential: safe to retry only if the operation is idempotent
  • Distinguish retryable errors (503, network timeout) from non-retryable (400, 401)

Two-Phase Commit (2PC)

  • Coordinator sends Prepare to all participants; waits for all votes
  • If all vote Yes, coordinator sends Commit; otherwise Abort
  • Blocking protocol: if coordinator crashes after Prepare, participants are locked until it recovers
  • Use distributed sagas or XA transactions only where truly necessary

CAP and PACELC

CAP theorem (Brewer 2000, proved by Gilbert & Lynch 2002): under a network Partition, choose between Consistency and Availability.

PACELC (Abadi 2012): extends CAP by noting that even when there is no partition (the else case), there is always a trade-off between Latency and Consistency:

  • DynamoDB/Cassandra: PA/EL (sacrifice consistency for availability and low latency)
  • etcd/Zookeeper: PC/EC (sacrifice availability and accept higher latency for consistency)
  • Google Spanner: PC/EL (strong consistency even at high latency, using TrueTime)

Applications

  • Distributed databases and key-value stores (Cassandra, DynamoDB, CockroachDB)
  • Distributed coordination services (Zookeeper, etcd)
  • Distributed messaging systems (Kafka — partition leaders, ISR, offset commits)
  • Microservice architectures with saga-based transaction management
  • Distributed caches (Redis Cluster with hash slots)
  • Stream processing systems (Flink, Kafka Streams — state management and checkpointing)

Trade-offs

ApproachProCon
Strong consistencySimple application logicHigher latency, lower availability
Eventual consistencyHigh availability, low latencyComplex conflict resolution, surprising behavior
Saga over 2PCNo distributed lock, resilientCompensating transactions are complex to write
Leader-based replicationSimple reasoningLeader is a bottleneck and SPOF risk
Leaderless (Dynamo-style)No single bottleneckConflict resolution required (vector clocks, LWW)
Synchronous replicationDurability guaranteedWrite latency = slowest replica
Asynchronous replicationLow write latencyRisk of data loss if primary fails before replica catches up