Raft Consensus

Timing

  • Election timeout: 150-300ms (randomized)
  • Heartbeat interval: 50ms
  • Peer liveness window: 3s — a node is reported as is_alive: false by GetClusterStatus if it has not acknowledged an AppendEntries heartbeat within this window

Log Replication

  1. Client sends command to leader
  2. Leader appends to log and replicates via AppendEntries
  3. Majority acknowledgment → committed
  4. Applied to state machine
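Step 3 can be made concrete: the leader tracks, per follower, the highest log index known to be replicated there, and an entry is committed once a majority of replicas (counting the leader itself) hold it. A sketch, with hypothetical names — in full Raft the leader additionally only commits entries from its own current term by counting:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index stored on a majority of
// nodes. matchIndex[i] is the highest index known to be replicated
// on peer i; the leader's own log counts as one replica.
func commitIndex(leaderLast int, matchIndex []int) int {
	all := append([]int{leaderLast}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(all)))
	// The value at position len(all)/2 is held by a strict majority.
	return all[len(all)/2]
}

func main() {
	// 5-node cluster: leader at index 7, followers at 7, 7, 4, 3.
	// Index 7 is on 3 of 5 nodes, so it is committed.
	fmt.Println(commitIndex(7, []int{7, 7, 4, 3})) // 7
}
```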

Proposal backpressure

The leader’s proposal channel holds at most 256 pending commands. If the channel is full, SubmitJob returns RESOURCE_EXHAUSTED immediately — the caller should retry with exponential backoff. Additionally, if the Raft loop accepts the command but quorum is lost before the entry commits, SubmitJob returns DEADLINE_EXCEEDED after 5 seconds. Both guarantees ensure clients always get a bounded response rather than hanging indefinitely.

State Machine Commands

Each log entry carries one of the following commands applied to the job queue on every node:

Command                                   Description
SubmitJob                                 Adds a new job in Pending state
AssignJob                                 Assigns a pending job to a specific worker, setting it to Running on all nodes
RegisterWorker                            Records a worker in the cluster-wide registry
UpdateJobStatus / BatchUpdateJobStatus    Marks one or more jobs Completed or Failed

AssignJob is proposed by the leader’s scheduler loop after it selects the least-loaded live worker. Because this command travels through the Raft log, every follower applies the same assignment atomically — workers discover their jobs by querying the local job queue rather than a node-local in-memory map.
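The selection step the scheduler loop performs might look like the following sketch; `worker` and `pickWorker` are illustrative names, and "load" is simplified to a running-job count:

```go
package main

import "fmt"

type worker struct {
	id      string
	alive   bool // within the 3s liveness window
	running int  // jobs currently assigned
}

// pickWorker returns the least-loaded live worker, or "" if none.
// Only the result is proposed through the Raft log; followers never
// run this selection themselves, they just apply the AssignJob entry.
func pickWorker(workers []worker) string {
	best := ""
	bestLoad := int(^uint(0) >> 1) // max int
	for _, w := range workers {
		if w.alive && w.running < bestLoad {
			best, bestLoad = w.id, w.running
		}
	}
	return best
}

func main() {
	ws := []worker{
		{"w1", true, 3},
		{"w2", false, 0}, // dead: skipped even though idle
		{"w3", true, 1},
	}
	fmt.Println(pickWorker(ws)) // w3
}
```

Keeping the choice on the leader and replicating only the outcome is what makes the assignment deterministic across nodes: two followers never need to agree on load metrics, only on the log.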

Log Compaction

When the in-memory log exceeds 1000 entries, the committed prefix is replaced with a snapshot of the current state (job queue + worker registrations). This bounds memory usage regardless of throughput. Followers that fall too far behind receive the snapshot via InstallSnapshot RPC instead of replaying individual entries.
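The trigger condition is a simple length check after each apply. A minimal sketch of the bookkeeping, assuming simplified entry and state types (the real snapshot would serialize the job queue and worker registrations before discarding entries):

```go
package main

import "fmt"

const compactThreshold = 1000 // matches the limit described above

type raftLog struct {
	snapshotIndex int      // last log index covered by the snapshot
	entries       []string // entries after snapshotIndex
}

// maybeCompact folds the committed prefix into the snapshot once the
// in-memory log exceeds the threshold. Uncommitted entries are never
// discarded.
func (l *raftLog) maybeCompact(commitIndex int) {
	if len(l.entries) <= compactThreshold {
		return
	}
	keep := commitIndex - l.snapshotIndex // committed entries to fold in
	if keep <= 0 {
		return
	}
	// Real code would serialize the state machine here before truncating.
	l.entries = append([]string(nil), l.entries[keep:]...)
	l.snapshotIndex = commitIndex
}

func main() {
	l := &raftLog{entries: make([]string, 1500)}
	l.maybeCompact(1200)
	fmt.Println(l.snapshotIndex, len(l.entries)) // 1200 300
}
```

Copying into a fresh slice (rather than re-slicing) matters in Go: a re-slice would keep the discarded prefix reachable and defeat the memory bound.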

State Persistence

By default, all Raft state (log, term, voted-for, snapshot) is held in memory and lost on restart. Pass --data-dir <path> to enable a local RocksDB store. On startup the node reloads its persisted term, voted-for, and committed log entries and replays any stored snapshot into the job queue before rejoining the cluster. Storage failures cause an immediate crash (crash-fail-safe) rather than silent data loss.

Safety Guarantees

  • Election safety: at most one leader can be elected in a given term
  • Leader append-only: a leader never overwrites or deletes entries in its own log
  • Log matching: if two logs hold an entry with the same index and term, they are identical up through that index
  • Leader completeness: a committed entry appears in the log of every future leader

Cluster Sizing

Nodes    Majority    Fault Tolerance
3        2           1 failure
5        3           2 failures
7        4           3 failures

Use odd numbers—even numbers add overhead without improving fault tolerance.
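The table follows from quorum arithmetic: a majority of n nodes is ⌊n/2⌋ + 1, and the cluster tolerates losing everything beyond that. A quick sketch that also shows why even sizes are wasteful:

```go
package main

import "fmt"

// majority returns the quorum size for an n-node cluster;
// faultTolerance is how many simultaneous failures it survives.
func majority(n int) int       { return n/2 + 1 }
func faultTolerance(n int) int { return n - majority(n) }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("%d nodes: majority %d, tolerates %d failures\n",
			n, majority(n), faultTolerance(n))
	}
}
```

Note that 4 nodes tolerate only 1 failure, the same as 3, while requiring a larger quorum for every commit; that is the overhead the advice above refers to.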