# Raft Consensus

## Timing
- Election timeout: 150-300ms (randomized)
- Heartbeat interval: 50ms
- Peer liveness window: 3s — a node is reported with `is_alive: false` in `GetClusterStatus` if no successful AppendEntries/heartbeat has been received from it within this window
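To make the timing rules concrete, here is a minimal Go sketch of the election-timeout randomization and liveness check described above; the constant and function names are illustrative, not the identifiers used in this codebase.

```go
package raft

import (
	"math/rand"
	"time"
)

// Timing constants matching the values above; the names are hypothetical.
const (
	ElectionTimeoutMin = 150 * time.Millisecond
	ElectionTimeoutMax = 300 * time.Millisecond
	HeartbeatInterval  = 50 * time.Millisecond
	PeerLivenessWindow = 3 * time.Second
)

// randomElectionTimeout returns a value in [150ms, 300ms). Randomizing the
// timeout keeps followers from starting elections at the same moment and
// splitting the vote.
func randomElectionTimeout() time.Duration {
	return ElectionTimeoutMin +
		time.Duration(rand.Int63n(int64(ElectionTimeoutMax-ElectionTimeoutMin)))
}

// isPeerAlive reports whether the peer's last successful AppendEntries
// exchange falls within the liveness window used by GetClusterStatus.
func isPeerAlive(lastAck time.Time) bool {
	return time.Since(lastAck) <= PeerLivenessWindow
}
```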
## Log Replication
- Client sends command to leader
- Leader appends the command to its log and replicates it via `AppendEntries`
- Majority acknowledgment → committed
- Committed entries are applied to the state machine
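The commit step reduces to a majority count over replication progress. A minimal sketch, assuming the leader tracks a match index per peer (the names `matchIndex` and `advanceCommitIndex` are assumptions, not the actual code):

```go
package raft

import "sort"

// advanceCommitIndex applies the majority rule: an entry is committed once it
// is replicated on a majority of nodes. matchIndex holds, per peer, the
// highest log index known to be replicated there; the leader's own log counts
// toward the majority. (Raft additionally requires the entry's term to equal
// the current term before committing; that check is omitted here.)
func advanceCommitIndex(matchIndex []int64, leaderLastIndex, commitIndex int64) int64 {
	indexes := append([]int64{leaderLastIndex}, matchIndex...)
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] > indexes[j] })

	// With the indexes sorted descending, the value at position len/2 is
	// replicated on at least a majority of nodes.
	if majority := indexes[len(indexes)/2]; majority > commitIndex {
		return majority
	}
	return commitIndex
}
```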
### Proposal Backpressure
The leader’s proposal channel holds at most 256 pending commands. If the channel is full, SubmitJob returns RESOURCE_EXHAUSTED immediately — the caller should retry with exponential backoff. Additionally, if the Raft loop accepts the command but quorum is lost before the entry commits, SubmitJob returns DEADLINE_EXCEEDED after 5 seconds. Both guarantees ensure clients always get a bounded response rather than hanging indefinitely.
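From the client's perspective, only RESOURCE_EXHAUSTED is worth retrying; DEADLINE_EXCEEDED means quorum was lost mid-proposal and the command's fate is unknown. A client-side sketch of the suggested backoff, where the `submit` callback stands in for the generated SubmitJob gRPC stub:

```go
package client

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// submitWithBackoff retries when the leader's proposal channel is full
// (RESOURCE_EXHAUSTED). Any other outcome (success, DEADLINE_EXCEEDED, or a
// different error) is returned to the caller unchanged.
func submitWithBackoff(ctx context.Context, submit func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second

	for {
		err := submit(ctx)
		if status.Code(err) != codes.ResourceExhausted {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```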
## State Machine Commands
Each log entry carries one of the following commands applied to the job queue on every node:
| Command | Description |
|---|---|
| `SubmitJob` | Adds a new job in Pending state |
| `AssignJob` | Assigns a pending job to a specific worker, setting it to Running on all nodes |
| `RegisterWorker` | Records a worker in the cluster-wide registry |
| `UpdateJobStatus` / `BatchUpdateJobStatus` | Marks one or more jobs Completed or Failed |
AssignJob is proposed by the leader’s scheduler loop after it selects the least-loaded live worker. Because this command travels through the Raft log, every follower applies the same assignment atomically — workers discover their jobs by querying the local job queue rather than a node-local in-memory map.
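Because every node applies the same commands in log order, the state machine can be viewed as a deterministic apply switch over the job queue. A sketch under assumed type names (`Command`, `JobQueue`, and the status strings are illustrative, not the real wire format):

```go
package raft

import "fmt"

// CommandType mirrors the command table above.
type CommandType int

const (
	CmdSubmitJob CommandType = iota
	CmdAssignJob
	CmdRegisterWorker
	CmdUpdateJobStatus
)

type Command struct {
	Type     CommandType
	JobID    string
	WorkerID string
	Status   string // "Completed" or "Failed" for status updates
}

// JobQueue stands in for the replicated state machine. apply runs on every
// node in log order, so all nodes converge on the same job and worker state.
type JobQueue struct {
	jobs    map[string]string // job ID -> state
	workers map[string]bool   // worker ID -> registered
}

func NewJobQueue() *JobQueue {
	return &JobQueue{jobs: map[string]string{}, workers: map[string]bool{}}
}

func (q *JobQueue) apply(cmd Command) error {
	switch cmd.Type {
	case CmdSubmitJob:
		q.jobs[cmd.JobID] = "Pending"
	case CmdAssignJob:
		q.jobs[cmd.JobID] = "Running" // the assignment is visible on every node
	case CmdRegisterWorker:
		q.workers[cmd.WorkerID] = true
	case CmdUpdateJobStatus:
		q.jobs[cmd.JobID] = cmd.Status
	default:
		return fmt.Errorf("unknown command type %d", cmd.Type)
	}
	return nil
}
```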
## Log Compaction
When the in-memory log exceeds 1000 entries, the committed prefix is replaced with a snapshot of the current state (job queue + worker registrations). This bounds memory usage regardless of throughput. Followers that fall too far behind receive the snapshot via InstallSnapshot RPC instead of replaying individual entries.
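The trigger is a simple length check over the committed prefix. A sketch with assumed names (`Entry`, `Snapshot`, `snapshotThreshold`); the real snapshot format is whatever the job queue serializes to:

```go
package raft

const snapshotThreshold = 1000 // entries kept in memory before compacting

type Entry struct {
	Index int64
	Term  int64
	Data  []byte
}

type Snapshot struct {
	LastIncludedIndex int64
	LastIncludedTerm  int64
	State             []byte // serialized job queue + worker registrations
}

// maybeCompact replaces the committed prefix of the log with a snapshot once
// the log grows past the threshold. Only committed entries may be discarded;
// the uncommitted suffix is kept so it can still be replicated or truncated.
func maybeCompact(log []Entry, commitIndex int64, serializeState func() []byte) ([]Entry, *Snapshot) {
	if len(log) <= snapshotThreshold {
		return log, nil
	}
	cut := 0
	for i, e := range log {
		if e.Index <= commitIndex {
			cut = i + 1
		}
	}
	if cut == 0 {
		return log, nil
	}
	snap := &Snapshot{
		LastIncludedIndex: log[cut-1].Index,
		LastIncludedTerm:  log[cut-1].Term,
		State:             serializeState(),
	}
	// Followers behind the snapshot boundary receive snap via InstallSnapshot
	// instead of replaying the discarded entries.
	return append([]Entry(nil), log[cut:]...), snap
}
```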
## State Persistence
By default, all Raft state (log, term, voted-for, snapshot) is held in memory and lost on restart. Pass `--data-dir <path>` to enable a local RocksDB store. On startup the node reloads its persisted term, voted-for, and committed log entries and replays any stored snapshot into the job queue before rejoining the cluster. Storage failures cause an immediate crash (fail-stop) rather than silent data loss.
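A sketch of that recovery sequence against a hypothetical `Storage` interface (the interface and method names are assumptions, not the real RocksDB bindings); restoring term and voted-for first keeps the node from double-voting in a term it already participated in.

```go
package raft

import "log"

// Storage abstracts the store enabled by --data-dir. The interface and
// method names are hypothetical, not the real RocksDB bindings.
type Storage interface {
	LoadTerm() (uint64, error)
	LoadVotedFor() (string, error)
	LoadSnapshot() ([]byte, error)  // nil if no snapshot has been taken
	LoadEntries() ([][]byte, error) // encoded committed entries after the snapshot
}

// recoverState reloads persisted Raft state before the node rejoins the
// cluster. Any storage error is fatal: crashing is preferred to running with
// silently missing state.
func recoverState(s Storage, restoreSnapshot func([]byte) error) (term uint64, votedFor string, entries [][]byte) {
	var err error
	if term, err = s.LoadTerm(); err != nil {
		log.Fatalf("loading term: %v", err)
	}
	if votedFor, err = s.LoadVotedFor(); err != nil {
		log.Fatalf("loading voted-for: %v", err)
	}
	snap, err := s.LoadSnapshot()
	if err != nil {
		log.Fatalf("loading snapshot: %v", err)
	}
	if snap != nil {
		if err := restoreSnapshot(snap); err != nil {
			log.Fatalf("replaying snapshot into job queue: %v", err)
		}
	}
	if entries, err = s.LoadEntries(); err != nil {
		log.Fatalf("loading log entries: %v", err)
	}
	return term, votedFor, entries
}
```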
## Safety Guarantees
- Election safety: at most one leader can be elected in a given term
- Leader append-only: a leader never overwrites or deletes entries in its own log
- Log matching: if two logs contain an entry with the same index and term, they are identical in all entries up through that index
- Leader completeness: entries committed in a term appear in the logs of all leaders of later terms
## Cluster Sizing
| Nodes | Majority | Fault Tolerance |
|---|---|---|
| 3 | 2 | 1 failure |
| 5 | 3 | 2 failures |
| 7 | 4 | 3 failures |
Use an odd number of nodes: an even-sized cluster adds replication overhead without improving fault tolerance (4 nodes need 3 for a majority, so they still tolerate only 1 failure, the same as 3 nodes).
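The majority column follows directly from the quorum formula floor(n/2) + 1, shown here for completeness:

```go
package raft

// quorum returns the majority size for an n-node cluster: floor(n/2) + 1.
// Fault tolerance is n - quorum(n), which is why 4 nodes are no better than
// 3: quorum(4) = 3, so only one failure is survivable.
func quorum(n int) int {
	return n/2 + 1
}
```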