Introduction

nomad-lite is a distributed job scheduler built on a custom Raft consensus implementation, similar in spirit to HashiCorp Nomad or the Kubernetes scheduler. Jobs are shell commands executed in isolated Docker containers across a cluster.

Features

  • Custom Raft Consensus - Leader election, log replication, and fault tolerance from scratch
  • Distributed Scheduling - Jobs executed across cluster with automatic failover
  • Unified CLI - Single binary for both server and client with automatic leader redirect
  • mTLS Security - Mutual TLS for all gRPC communication
  • Docker Sandboxing - Jobs run in isolated containers with restricted capabilities
  • Web Dashboard - Real-time monitoring and job management
  • gRPC + REST APIs - Type-safe client communication
  • Job Cancellation - Cancel pending or running jobs via CLI, gRPC, or REST dashboard
  • Health Endpoints - /health/live (process alive) and /health/ready (leader known) for orchestrator integration
  • Graceful Shutdown - SIGTERM/SIGINT handling with drain period for in-flight work
  • Leader Draining & Transfer - Voluntary leadership transfer and node draining for safe maintenance
  • Batch Replication - Multiple job status updates batched into a single Raft log entry for reduced consensus overhead
  • State Persistence - Optional RocksDB-backed Raft log, term, and snapshot storage via --data-dir; nodes survive restarts and rejoin without losing committed state
  • Log Compaction - Automatic in-memory log prefix truncation with snapshot transfer to slow followers
  • Proposal Backpressure - Bounded proposal queue (256 slots); clients get an immediate RESOURCE_EXHAUSTED when the leader is overloaded and a DEADLINE_EXCEEDED if a commit stalls, rather than hanging indefinitely

Requirements

| Dependency | Version | Installation |
|------------|---------|--------------|
| Rust | 1.56+ | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs \| sh |
| protoc | 3.0+ | apt install protobuf-compiler / brew install protobuf |
| Docker | 20.0+ | apt install docker.io / brew install --cask docker |

Nomad vs nomad-lite

nomad-lite draws direct inspiration from HashiCorp Nomad, a production-grade workload orchestrator. This page documents where the two systems overlap and where they diverge.

nomad-lite is not a drop-in replacement for Nomad. It is a purpose-built implementation of the core distributed scheduling concepts — Raft consensus, distributed job assignment, worker liveness, graceful failover — written from scratch in Rust as a learning and experimentation platform. Understanding where Nomad sets the bar makes it easier to reason about which parts of that bar nomad-lite has already cleared and which remain ahead.


Scheduler Types

Nomad ships three built-in schedulers, each with a fundamentally different execution model:

| Scheduler | Nomad | nomad-lite |
|-----------|-------|------------|
| batch | Run-to-completion jobs; retries on failure | ✅ Core model |
| service | Long-running daemons; keeps N instances alive, restarts on crash | Not supported |
| system | Runs exactly one instance on every node (log shippers, exporters) | Not supported |

nomad-lite is a pure batch scheduler. Every job is expected to start, do work, and exit. Long-running services and system-wide daemons are outside its current scope.


Task Drivers (How Jobs Run)

Nomad abstracts the execution runtime through pluggable task drivers:

| Driver | Nomad | nomad-lite |
|--------|-------|------------|
| docker | Full image + entrypoint + env control | Partial: default image with per-job --image override, sh -c only |
| exec / raw_exec | Direct process execution, no container | Not supported |
| java | JVM workloads with classpath management | Not supported |
| qemu | Full virtual machine execution | Not supported |
| podman, containerd | OCI-compatible runtimes via plugins | Not supported |

nomad-lite runs every job as docker run --rm <image> sh -c <command>. The image defaults to the node-level --image setting (alpine:latest) and can be overridden per job with job submit --image; submitters cannot set a custom entrypoint.


Scheduling Features

| Feature | Nomad | nomad-lite |
|---------|-------|------------|
| Placement algorithm | Bin-packing (CPU + memory aware) | Least-loaded (running job count) |
| Resource constraints | CPU, memory, GPU, disk declared per job | Not supported |
| Node constraints | Attribute expressions (kernel.name == linux) | Not supported |
| Affinities | Soft preferences for placement | Not supported |
| Spread | Distribute across failure domains (AZ, rack) | Not supported |
| Job priorities | Integer 1–100; high priority preempts low | Not supported |
| Preemption | Evict lower-priority running jobs to place high-priority ones | Not supported |

nomad-lite’s assigner selects the worker with the fewest running jobs. This is a reasonable approximation of load balancing but ignores actual resource consumption — a node running one heavy job looks the same as one running one trivial job.
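The selection rule itself is simple enough to sketch in shell. The worker/count pairs below are made-up illustration data; the real assigner reads running-job counts from the replicated job queue:

```shell
# Pick the worker with the fewest running jobs (illustrative data, not real state).
workers='1 4
2 1
3 3'

# Sort by the running-job count (column 2) and take the first worker id.
least_loaded=$(printf '%s\n' "$workers" | sort -k2 -n | head -n 1 | cut -d' ' -f1)
echo "assign to worker $least_loaded"   # assign to worker 2
```

As the document notes, this treats one heavy job and one trivial job identically; the count is the only signal.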


Job Lifecycle

| Feature | Nomad | nomad-lite |
|---------|-------|------------|
| Job submission | HCL job spec file | Single string command (gRPC / CLI) |
| Job timeouts | kill_timeout, max_kill_timeout per task | Node-level 30 s wall-clock timeout (kills container + marks Failed); per-job configurable timeout not supported |
| Retry policy | restart stanza: attempts, delay, mode | Not supported |
| Reschedule on node loss | reschedule stanza with backoff | Not supported |
| Job cancellation | nomad job stop | nomad-lite job cancel <job-id> |
| Job priorities | 1–100 integer field | Not supported |
| Parameterized jobs | parameterized stanza; dispatch via CLI/API | Not supported |
| Periodic jobs | Cron expression in job spec | Not supported |
| Job dependencies (DAG) | Not native; requires external tooling | Not supported |
| Rolling updates | update stanza with canary, max_parallel | Not supported |
| Job versioning | Tracks job spec history, auto-reverts on failure | Not supported |

Consensus and State

| Feature | Nomad | nomad-lite |
|---------|-------|------------|
| Consensus protocol | Raft (via HashiCorp Raft library) | Custom Raft implementation in Rust |
| State persistence | Durable BoltDB-backed Raft log | ✅ Optional RocksDB-backed log + snapshot via --data-dir; in-memory if omitted |
| Log compaction | Raft snapshots to BoltDB | In-memory prefix truncation + snapshot; persisted to RocksDB when --data-dir is set |
| Multi-region | Federation across regions with replication | Single cluster only |
| Leader election | Randomized timeouts | Randomized timeouts (150–300 ms) |
| Leadership transfer | nomad operator raft transfer-leadership | nomad-lite cluster transfer-leader |
| Node drain | nomad node drain | nomad-lite cluster drain |

nomad-lite now supports optional state persistence via --data-dir. When enabled, every Raft log entry, term change, voted-for value, and snapshot is written to a local RocksDB store before being acknowledged. A crashed node can rejoin, replay its log, and resume without losing any committed state. Without --data-dir the node runs in-memory only.


Security

| Feature | Nomad | nomad-lite |
|---------|-------|------------|
| Transport encryption | mTLS for all RPC | ✅ mTLS implemented |
| ACL system | Token-based with policies and roles | Not supported |
| Vault integration | Secrets injection at job start | Not supported |
| Namespaces | Multi-tenant isolation | Not supported |
| Sentinel policies | Fine-grained job submission governance | Not supported |

nomad-lite implements the transport layer (mTLS) but has no authorization model. Any client that presents a valid certificate can submit jobs, drain nodes, or transfer leadership. In a controlled environment this is acceptable; in a shared environment it is not.


Observability

| Feature | Nomad | nomad-lite |
|---------|-------|------------|
| Metrics | Prometheus endpoint (/v1/metrics) | Not supported |
| Distributed tracing | OpenTelemetry support | Not supported |
| Health endpoints | /v1/agent/health (live + ready) | /health/live + /health/ready |
| Audit logging | Immutable audit log (Enterprise) | Not supported |
| Web UI | Full job and cluster management UI | ✅ Basic dashboard (status + job list) |
| Log streaming | nomad alloc logs -f | Not supported |

Client API

| Feature | Nomad | nomad-lite |
|---------|-------|------------|
| Protocol | HTTP/JSON REST + gRPC | gRPC (primary) + HTTP (dashboard only) |
| Job submission | HCL / JSON job spec | string command field |
| Streaming | Event stream, log tailing | StreamJobs gRPC streaming |
| Pagination | Cursor-based | ✅ Cursor-based ListJobs |
| Batch submission | Multiple allocations per job | Not supported |
| Leader redirect | X-Nomad-Index + redirect hints | ✅ CLI auto-redirects to leader |

Feature Summary

                    ┌─────────────────────────────────────────┐
                    │         nomad-lite feature coverage     │
                    └─────────────────────────────────────────┘

 Consensus layer
   ✅  Leader election (randomized timeouts)
   ✅  Log replication (AppendEntries)
   ✅  Log compaction + snapshots
   ✅  Leadership transfer
   ✅  InstallSnapshot for lagging followers
   ✅  Dedicated heartbeat channel (no HOL blocking)
   ✅  Proposal backpressure (bounded queue, immediate RESOURCE_EXHAUSTED)

 Cluster operations
   ✅  Node drain (stop accepting, finish in-flight, transfer leadership)
   ✅  Graceful shutdown (SIGTERM/SIGINT with 30 s drain window)
   ✅  mTLS for all cluster and client communication
   ✅  Peer liveness tracking

 Job scheduling
   ✅  Distributed job assignment via Raft (all nodes agree on assignments)
   ✅  Least-loaded worker selection
   ✅  Worker heartbeat liveness (5 s window)
   ✅  Batch status update replication
   ✅  Job output stored on executing node, fetched on demand

 Client-facing
   ✅  SubmitJob / CancelJob / GetJobStatus / ListJobs (paginated) / StreamJobs
   ✅  GetClusterStatus
   ✅  GetRaftLogEntries (debug)
   ✅  CLI with automatic leader redirect
   ✅  Web dashboard
   ✅  /health/live + /health/ready endpoints

 Persistence
   ✅  State persistence (RocksDB-backed Raft log + snapshot via --data-dir)

 Not supported / Out of scope
   ⬜  Job timeouts, retries, priorities
   ⬜  Reschedule on node failure
   ⬜  Typed job payloads (HttpCallback, WasmModule, DockerImage)
   ⬜  Parameterized job templates
   ⬜  Periodic / cron jobs
   ⬜  Resource-aware placement
   ⬜  Job dependency graphs (DAG)
   ⬜  Shared output storage
   ⬜  Prometheus metrics / OpenTelemetry tracing
   ⬜  Authorization (ACL tokens)

Getting Started

Quick Start

# Build and install
cargo install --path .

# Run single node with dashboard
nomad-lite server --node-id 1 --port 50051 --dashboard-port 8080

# Open http://localhost:8080

# Submit a job (in another terminal)
nomad-lite job submit "echo hello"

# Check job status
nomad-lite job status <job-id>

# View cluster status
nomad-lite cluster status

Running a Cluster

The easiest way to run a 3-node cluster is with Docker Compose. Choose between a production (mTLS) or a development (no TLS) setup.

Production Setup (mTLS)

# Generate certificates (one-time setup)
./scripts/gen-test-certs.sh ./certs

# Start cluster
docker-compose up --build

# Stop cluster
docker-compose down

Using the CLI with mTLS:

# Check cluster status
nomad-lite cluster \
  --addr "https://127.0.0.1:50051" \
  --ca-cert ./certs/ca.crt \
  --cert ./certs/client.crt \
  --key ./certs/client.key \
  status

# Find the leader node from cluster status output, then submit to it
# Example: if Node 3 is leader, use port 50053
nomad-lite job \
  --addr "https://127.0.0.1:50053" \
  --ca-cert ./certs/ca.crt \
  --cert ./certs/client.crt \
  --key ./certs/client.key \
  submit "echo hello from mTLS"

# Get job status
nomad-lite job \
  --addr "https://127.0.0.1:50051" \
  --ca-cert ./certs/ca.crt \
  --cert ./certs/client.crt \
  --key ./certs/client.key \
  status <job-id>

# List all jobs
nomad-lite job \
  --addr "https://127.0.0.1:50051" \
  --ca-cert ./certs/ca.crt \
  --cert ./certs/client.crt \
  --key ./certs/client.key \
  list

Development Setup (No TLS)

# Start cluster
docker-compose -f docker-compose.dev.yml up --build

# Stop cluster
docker-compose -f docker-compose.dev.yml down

Using the CLI without TLS:

# Check cluster status
nomad-lite cluster --addr "http://127.0.0.1:50051" status

# Find the leader node from cluster status output, then submit to it
# Example: if Node 1 is leader, use port 50051
nomad-lite job --addr "http://127.0.0.1:50051" submit "echo hello"

# Get job status
nomad-lite job --addr "http://127.0.0.1:50051" status <job-id>

# List all jobs
nomad-lite job --addr "http://127.0.0.1:50051" list

Cluster Endpoints:

| Node | gRPC | Dashboard |
|------|------|-----------|
| 1 | localhost:50051 | localhost:8081 |
| 2 | localhost:50052 | localhost:8082 |
| 3 | localhost:50053 | localhost:8083 |

Note: Auto-redirect doesn’t work with Docker Compose because nodes report internal addresses (e.g., node1:50051) that aren’t accessible from the host. Always check which node is leader using cluster status and connect directly to it for job submissions.
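When scripting against a Compose cluster, the leader can be discovered from any node's dashboard REST API instead of reading cluster status by hand. The helper below is a hypothetical convenience, and the submit line assumes the host-port mapping from the endpoint table above (gRPC on port 5005<node-id>):

```shell
# Extract leader_id from any node's /api/cluster JSON response.
leader_from_cluster_json() {
  grep -o '"leader_id": *[0-9]*' | grep -o '[0-9]*$'
}

# Usage against a running cluster:
#   leader_id=$(curl -s http://localhost:8081/api/cluster | leader_from_cluster_json)
#   nomad-lite job --addr "http://127.0.0.1:5005${leader_id}" submit "echo hello"
```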

Option 2: Local Multi-Node

Run nodes directly without Docker:

# Terminal 1
nomad-lite server --node-id 1 --port 50051 --dashboard-port 8081 \
  --peers "2:127.0.0.1:50052,3:127.0.0.1:50053"

# Terminal 2
nomad-lite server --node-id 2 --port 50052 --dashboard-port 8082 \
  --peers "1:127.0.0.1:50051,3:127.0.0.1:50053"

# Terminal 3
nomad-lite server --node-id 3 --port 50053 --dashboard-port 8083 \
  --peers "1:127.0.0.1:50051,2:127.0.0.1:50052"

With state persistence (survives restarts):

# Add --data-dir to each node; state is preserved across restarts
nomad-lite server --node-id 1 --port 50051 --dashboard-port 8081 \
  --peers "2:127.0.0.1:50052,3:127.0.0.1:50053" \
  --data-dir /var/lib/nomad-lite/node1

nomad-lite server --node-id 2 --port 50052 --dashboard-port 8082 \
  --peers "1:127.0.0.1:50051,3:127.0.0.1:50053" \
  --data-dir /var/lib/nomad-lite/node2

nomad-lite server --node-id 3 --port 50053 --dashboard-port 8083 \
  --peers "1:127.0.0.1:50051,2:127.0.0.1:50052" \
  --data-dir /var/lib/nomad-lite/node3

Each node stores its Raft log, term, voted-for, and job snapshot in a local RocksDB instance. Omit --data-dir to run with in-memory state (useful for development and testing).

With mTLS:

# Generate certificates first
./scripts/gen-test-certs.sh ./certs

# Add TLS flags to each node
--tls --ca-cert ./certs/ca.crt --cert ./certs/node1.crt --key ./certs/node1.key

CLI Reference

The nomad-lite binary provides both server and client functionality.

nomad-lite
├── server                        # Start a server node
├── job                           # Job management
│   ├── submit <COMMAND>         # Submit a new job
│   ├── status <JOB_ID>          # Get job status
│   ├── cancel <JOB_ID>          # Cancel a pending or running job
│   └── list                     # List all jobs
├── cluster                       # Cluster management
│   ├── status                   # Get cluster info
│   ├── transfer-leader          # Transfer leadership to another node
│   └── drain                    # Drain node for maintenance
└── log                           # Raft log inspection
    └── list                     # View committed log entries

Server Options

| Flag | Default | Description |
|------|---------|-------------|
| --node-id | 1 | Unique node identifier |
| --port | 50051 | gRPC server port |
| --dashboard-port | - | Web dashboard port (optional) |
| --peers | "" | Peer addresses: "id:host:port,..." |
| --image | alpine:latest | Default Docker image for jobs |
| --tls | false | Enable mTLS |
| --ca-cert | - | CA certificate path |
| --cert | - | Node certificate path |
| --key | - | Node private key path |
| --allow-insecure | false | Run without TLS if certs fail |
| --data-dir | - | RocksDB data directory for persistence (optional; omit for in-memory state) |
| --advertise-addr | 127.0.0.1:<port> | Address reported to peers and shown in cluster status; override when nodes must be reachable at a specific hostname or external IP (e.g., 192.168.1.10:50051) |

Client Options

| Flag | Default | Description |
|------|---------|-------------|
| -a, --addr | http://127.0.0.1:50051 | Server address |
| -o, --output | table | Output format: table or json |
| --ca-cert | - | CA certificate for TLS |
| --cert | - | Client certificate for mTLS |
| --key | - | Client private key for mTLS |

job submit Options

| Flag | Default | Description |
|------|---------|-------------|
| --image IMAGE | server default (alpine:latest) | Docker image to run this job in; overrides the server-wide --image setting for this job only |

Command Examples

Submit a job:

nomad-lite job submit "echo hello"
# Job submitted successfully!
# Job ID: ef319e40-c888-490d-8349-e9c05f78cf5a

# Override the Docker image for this specific job
nomad-lite job submit --image python:3.12-alpine "python3 -c 'print(42)'"
# Job submitted successfully!
# Job ID: 3b7a1c22-...

Get job status:

nomad-lite job status ef319e40-c888-490d-8349-e9c05f78cf5a
# Job ID:          ef319e40-c888-490d-8349-e9c05f78cf5a
# Status:          COMPLETED
# Exit Code:       0
# Assigned Worker: 1
# Executed By:     1
# Output:
#   hello

Cancel a job:

nomad-lite job cancel ef319e40-c888-490d-8349-e9c05f78cf5a
# job ef319e40-c888-490d-8349-e9c05f78cf5a cancelled

Cancelling a job that is already terminal returns an error:

nomad-lite job cancel ef319e40-c888-490d-8349-e9c05f78cf5a
# Error: job is already completed

List all jobs:

nomad-lite job list
# JOB ID                                 STATUS       WORKER   COMMAND
# ------------------------------------------------------------------------------
# ef319e40-c888-490d-8349-e9c05f78cf5a   COMPLETED    1        echo hello
#
# Showing 1 of 1 jobs

List jobs with pagination:

nomad-lite job list --page-size 50 --all  # Fetch all pages
nomad-lite job list --stream              # Use streaming API

Filter jobs:

# Filter by status
nomad-lite job list --status pending
nomad-lite job list --status completed

# Filter by worker node
nomad-lite job list --worker 2

# Filter by command substring (case-insensitive)
nomad-lite job list --command-substr echo

# Filter by creation time range (Unix ms timestamps)
nomad-lite job list --created-after-ms 1700000000000 --created-before-ms 1710000000000

# Combine filters
nomad-lite job list --status pending --command-substr echo --worker 1

Filter flags for job list:

| Flag | Description |
|------|-------------|
| --status STATUS | Only jobs with this status: pending, running, completed, failed, cancelled |
| --worker WORKER_ID | Only jobs whose assigned_worker or executed_by matches this node ID |
| --command-substr SUBSTR | Case-insensitive substring match on the job command |
| --created-after-ms MS | Only jobs created at or after this Unix timestamp (milliseconds) |
| --created-before-ms MS | Only jobs created at or before this Unix timestamp (milliseconds) |
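The millisecond timestamps can be computed with GNU date; this is a convenience sketch, not part of the CLI, and %N (sub-second precision) is a GNU extension not available on macOS:

```shell
# Unix time in milliseconds, for --created-after-ms / --created-before-ms.
now_ms=$(date +%s%3N)                               # GNU date only
one_hour_ago_ms=$(( ($(date +%s) - 3600) * 1000 ))  # portable: seconds * 1000

# Jobs created in the last hour:
#   nomad-lite job list --created-after-ms "$one_hour_ago_ms" --created-before-ms "$now_ms"
echo "$one_hour_ago_ms .. $now_ms"
```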

Note: Filters are not supported with --stream (only --status applies to the streaming API).

Get cluster status:

nomad-lite cluster status
# Cluster Status
# ========================================
# Term:   5
# Leader: Node 1
#
# Nodes:
# ID       ADDRESS                   STATUS
# ---------------------------------------------
# 1        0.0.0.0:50051             [+] alive
# 2        127.0.0.1:50052           [+] alive
# 3        127.0.0.1:50053           [+] alive

Transfer leadership:

# Transfer to a specific node
nomad-lite cluster -a http://127.0.0.1:50051 transfer-leader --to 2
# Leadership transferred successfully!
# New leader: Node 2

# Auto-select best candidate
nomad-lite cluster -a http://127.0.0.1:50051 transfer-leader
# Leadership transferred successfully!
# New leader: Node 3

Drain a node for maintenance:

# Drain the node: stops accepting jobs, waits for running jobs, transfers leadership
nomad-lite cluster -a http://127.0.0.1:50051 drain
# Draining node...
# Node drained successfully.

# Verify leadership moved
nomad-lite cluster -a http://127.0.0.1:50052 status

View Raft log entries:

nomad-lite log list
# Raft Log Entries
# ================================================================================
# Commit Index: 6  |  Last Log Index: 6
#
# INDEX  TERM   COMMITTED  TYPE                 DETAILS
# --------------------------------------------------------------------------------
# 1      1      yes        Noop
# 2      1      yes        SubmitJob            job_id=bd764021-..., cmd=echo job1
# 3      1      yes        SubmitJob            job_id=1cce681f-..., cmd=echo job2
# 4      1      yes        SubmitJob            job_id=26694755-..., cmd=echo job3
# 5      1      yes        BatchUpdateJobStatus 3 updates
# 6      1      yes        UpdateJobStatus      job_id=17cc39b2-..., status=Completed
#
# Showing 6 entries

View log entries after compaction:

When log compaction has occurred, earlier entries are replaced by a snapshot:

nomad-lite log list
# Raft Log Entries
# ================================================================================
# First Available: 1001  |  Commit Index: 1050  |  Last Log Index: 1050
# (Entries 1-1000 were compacted into a snapshot)
#
# INDEX  TERM   COMMITTED  TYPE                 DETAILS
# ...

View log entries with pagination:

nomad-lite log list --start-index 1 --limit 50  # Start from index 1, max 50 entries
nomad-lite log list --start-index 10            # Start from index 10

JSON output:

nomad-lite job -o json list
nomad-lite cluster -o json status
nomad-lite log -o json list

Automatic Leader Redirect

For local clusters (non-Docker), the CLI automatically redirects to the leader if you connect to a follower. This works for both submit and cancel:

# Connect to follower node (port 50052), CLI auto-redirects to leader
nomad-lite job -a http://127.0.0.1:50052 submit "echo hello"
# Redirecting to leader at http://127.0.0.1:50051...
# Job submitted successfully!

nomad-lite job -a http://127.0.0.1:50052 cancel <job-id>
# Redirecting to leader at http://127.0.0.1:50051...
# job <job-id> cancelled

API Reference

REST API (Dashboard)

The web dashboard exposes a REST API for job and cluster management.

Get cluster status:

curl http://localhost:8081/api/cluster
# Response:
# {
#   "node_id": 1,
#   "role": "leader",
#   "current_term": 5,
#   "leader_id": 1,
#   "commit_index": 3,
#   "last_applied": 3,
#   "log_length": 3,
#   "nodes": [
#     { "node_id": 1, "address": "0.0.0.0:50051", "is_alive": true },
#     { "node_id": 2, "address": "127.0.0.1:50052", "is_alive": true },
#     { "node_id": 3, "address": "127.0.0.1:50053", "is_alive": false }
#   ]
# }

Submit a job:

curl -X POST http://localhost:8081/api/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "echo hello"}'
# Response:
# {
#   "job_id": "ef319e40-c888-490d-8349-e9c05f78cf5a",
#   "status": "pending"
# }

# With a specific Docker image (overrides the server default for this job)
curl -X POST http://localhost:8081/api/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "python3 -c '\''print(42)'\''", "image": "python:3.12-alpine"}'

Cancel a job:

curl -X DELETE http://localhost:8081/api/jobs/ef319e40-c888-490d-8349-e9c05f78cf5a
# Response (success):
# {
#   "success": true,
#   "error": null
# }

# Response (already terminal):
# HTTP 400
# {
#   "success": false,
#   "error": "job is already completed"
# }

List all jobs:

curl http://localhost:8081/api/jobs
# Response:
# [
#   {
#     "id": "ef319e40-c888-490d-8349-e9c05f78cf5a",
#     "command": "echo hello",
#     "status": "completed",
#     "executed_by": 1,
#     "output": "hello\n",
#     "error": null,
#     "created_at": "2026-01-28T12:45:41.231558433+00:00",
#     "completed_at": "2026-01-28T12:45:41.678341558+00:00"
#   }
# ]

Liveness probe:

curl http://localhost:8081/health/live
# Response (always 200 while the process is alive):
# {
#   "status": "ok"
# }

Readiness probe:

curl http://localhost:8081/health/ready
# Response when a leader has been elected (200):
# {
#   "status": "ok",
#   "leader_id": 1
# }

# Response during startup or mid-election (503):
# {
#   "status": "no_leader",
#   "leader_id": null
# }
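These two endpoints map directly onto orchestrator liveness/readiness probes. For plain scripts, a small polling helper works too; wait_for_ready below is a hypothetical helper, not part of nomad-lite:

```shell
# Poll /health/ready until it returns 200 (leader elected) or give up.
# curl -f exits non-zero on the 503 returned while no leader is known.
wait_for_ready() {
  url=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    if curl -sf "$url" > /dev/null; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_ready http://localhost:8081/health/ready && nomad-lite job submit "echo hello"
```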

gRPC API

SchedulerService (client-facing)

| Method | Description | Leader Only |
|--------|-------------|-------------|
| SubmitJob(command, image?) | Submit a job; image overrides the server-default Docker image for this job | Yes |
| CancelJob(job_id) | Cancel a pending or running job | Yes |
| GetJobStatus(job_id) | Get job status | No |
| ListJobs(page_size, page_token, status_filter, worker_id_filter, command_filter, created_after_ms, created_before_ms) | List jobs (paginated, filterable) | No |
| StreamJobs() | Stream jobs | No |
| GetClusterStatus() | Cluster info | Forwarded to leader |
| GetRaftLogEntries() | View Raft log entries | Forwarded to leader |
| TransferLeadership(target) | Transfer leadership | Yes |
| DrainNode() | Drain node for maintenance | No |

ListJobs request fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| page_size | uint32 | 100 | Max results per page (capped at 1000) |
| page_token | string | "" | Token from the previous response for the next page |
| status_filter | JobStatus | UNSPECIFIED | Only return jobs with this status; 0/UNSPECIFIED = no filter |
| worker_id_filter | uint64 | 0 | Only return jobs whose assigned_worker or executed_by matches; 0 = no filter |
| command_filter | string | "" | Case-insensitive substring match on the command; empty = no filter |
| created_after_ms | int64 | 0 | Only return jobs created at or after this Unix timestamp (ms); 0 = no bound |
| created_before_ms | int64 | 0 | Only return jobs created at or before this Unix timestamp (ms); 0 = no bound |

total_count in the response reflects the filtered result set size (not the total queue size).

SubmitJob error codes

| gRPC status | Meaning | Client action |
|-------------|---------|---------------|
| OK | Job accepted and committed | - |
| FAILED_PRECONDITION | Node is not the leader | Redirect to the node ID in the message |
| RESOURCE_EXHAUSTED | Leader proposal queue is full (>256 pending) | Retry with exponential backoff |
| DEADLINE_EXCEEDED | Raft did not commit the entry within 5 seconds | Retry; may indicate a degraded cluster |
| UNAVAILABLE | Node is draining, or the Raft loop has stopped | Retry on a different node |
| INVALID_ARGUMENT | Empty command string, or command exceeds 1024 bytes | Fix the request |
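The two retryable statuses suggest a standard client-side loop. A shell sketch follows; retry_with_backoff is a hypothetical helper, not a nomad-lite command, and it assumes the CLI exits non-zero on any gRPC error:

```shell
# Run a command, retrying on failure with exponential backoff: 1s, 2s, 4s, ...
retry_with_backoff() {
  max_attempts=$1
  shift
  delay=1
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage:
#   retry_with_backoff 5 nomad-lite job submit "echo hello"
```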

CancelJob error codes

| gRPC status | Meaning | Client action |
|-------------|---------|---------------|
| OK | Job cancelled and committed | - |
| FAILED_PRECONDITION | Node is not the leader, or job is already in a terminal state | Redirect to leader / check job status |
| NOT_FOUND | Job ID does not exist | - |
| RESOURCE_EXHAUSTED | Leader proposal queue is full | Retry with exponential backoff |
| DEADLINE_EXCEEDED | Raft did not commit within 5 seconds | Retry |
| INVALID_ARGUMENT | Malformed job UUID | Fix the request |

InternalService (node-to-node, not client-facing)

| Method | Description |
|--------|-------------|
| GetJobOutput(job_id) | Fetch job output from the node that executed it |
| WorkerHeartbeat(node_id) | Worker liveness signal sent every 2 s to the leader; auto-registers on first call; workers not seen for 5 s are excluded from job assignment |
| ForwardJobStatus(updates) | Follower worker forwards completed job status to the leader for Raft replication |

RaftService (node-to-node, consensus protocol)

| Method | Description |
|--------|-------------|
| AppendEntries | Log replication and heartbeats |
| RequestVote | Leader election voting |
| TimeoutNow | Trigger immediate election on the target node (used by TransferLeadership) |
| InstallSnapshot | Transfer compacted state to slow followers |

Architecture

Cluster Overview

graph LR
    subgraph Clients
        CLI[CLI]
        REST[REST]
        GRPC[gRPC]
    end

    subgraph Cluster[3-Node Raft Cluster]
        N1[Node 1<br/>gRPC :50051<br/>Dashboard :8081]
        N2[Node 2<br/>gRPC :50052<br/>Dashboard :8082]
        N3[Node 3<br/>gRPC :50053<br/>Dashboard :8083]

        N1 <-->|Raft RPCs| N2
        N1 <-->|Raft RPCs| N3
        N2 <-->|Raft RPCs| N3
    end

    subgraph Docker[Docker Containers]
        D1[Container]
        D2[Container]
        D3[Container]
    end

    CLI -->|Submit to leader| N1
    REST -->|HTTP API| N1
    GRPC -->|gRPC API| N1

    N1 -->|docker run| D1
    N2 -->|docker run| D2
    N3 -->|docker run| D3

    classDef leader fill:#90EE90,stroke:#006400,stroke-width:2px
    classDef follower fill:#FFE4B5,stroke:#FF8C00,stroke-width:2px
    class N1 leader
    class N2,N3 follower

Node Internal Architecture

graph TB
    subgraph External[External APIs]
        GRPC[gRPC Server<br/>SubmitJob, GetStatus, ListJobs]
        DASH[Dashboard<br/>REST API + Web UI]
    end

    subgraph Core[Core Components]
        RAFT[Raft Module<br/>Leader Election<br/>Log Replication]
        LOG[Raft Log<br/>In-Memory or RocksDB]
    end

    subgraph Loops[Background Loops]
        SCHED[Scheduler Loop<br/>Event-driven<br/>Applies commits<br/>Assigns jobs on leader]
        WORKER[Worker Loop<br/>Event-driven + 2s heartbeat<br/>Executes assigned jobs]
    end

    QUEUE[Job Queue<br/>State Machine]
    DOCKER[Docker Container<br/>Sandboxed Execution]

    %% External to Core
    GRPC -->|Commands| RAFT
    DASH -->|Commands| RAFT
    DASH -->|Query| QUEUE

    %% Core
    RAFT <--> LOG

    %% Loops subscribe/interact
    SCHED -.->|Subscribe commits| RAFT
    SCHED -->|Apply entries & Assign jobs| QUEUE
    WORKER -->|Notified & Update| QUEUE
    WORKER -->|Execute| DOCKER

    classDef api fill:#87CEEB,stroke:#4682B4
    classDef core fill:#90EE90,stroke:#006400
    classDef loop fill:#FFE4B5,stroke:#FF8C00
    classDef state fill:#E6E6FA,stroke:#9370DB

    class GRPC,DASH api
    class RAFT,LOG core
    class SCHED,WORKER loop
    class QUEUE,DOCKER state

Key Points:

  • Every node runs all components (gRPC, Dashboard, Raft, Scheduler, Worker)
  • Only the leader’s Scheduler Loop assigns jobs; followers just apply committed entries
  • Job assignments travel through Raft (AssignJob command) so every node’s queue reflects the same state
  • Workers send WorkerHeartbeat gRPC to the leader every 2 s to signal liveness; workers that miss heartbeats for more than 5 s are excluded from job assignment
  • Workers discover assigned jobs by querying the local job queue directly (no node-local in-memory map needed)
  • Job output is stored only on the executing node (fetched via GetJobOutput RPC when queried)

Data Flow

sequenceDiagram
    participant Client as CLI Client
    participant N1 as Node 1 (Leader)<br/>gRPC Server
    participant Raft1 as Node 1<br/>Raft Module
    participant N2 as Node 2 (Follower)<br/>Raft Module
    participant N3 as Node 3 (Follower)<br/>Raft Module
    participant Log1 as Node 1<br/>Raft Log
    participant Queue1 as Node 1<br/>Job Queue
    participant Sched as Scheduler Loop<br/>(Leader Only)
    participant Worker as Worker Loop<br/>(Node 2)
    participant Docker as Docker Container

    Note over Client,Docker: Job Submission Flow (Write Operation)

    Client->>N1: 1. SubmitJob("echo hello")
    N1->>N1: 2. Create Job object<br/>job_id = uuid
    N1->>Raft1: 3. Propose SubmitJob command
    Raft1->>Log1: 4. Append to local log<br/>(index=N, term=T)

    par Replicate to Followers
        Raft1->>N2: 5a. AppendEntries RPC<br/>(log entry)
        Raft1->>N3: 5b. AppendEntries RPC<br/>(log entry)
    end

    N2-->>Raft1: 6a. ACK (success)
    N3-->>Raft1: 6b. ACK (success)

    Note over Raft1: Majority reached (2/3)

    Raft1-->>N1: 7. Commit confirmed
    N1->>Queue1: 8. Add job to queue<br/>Status: PENDING
    N1-->>Client: 9. Response: job_id

    Note over Client,Docker: Job Assignment Flow (Leader Scheduler Loop - event-driven)

    Sched->>Raft1: 10. Subscribe to commits
    Raft1-->>Sched: 11. Commit notification
    Sched->>Queue1: 12. Apply committed entry<br/>(idempotent add)

    Sched->>Queue1: 13. Check pending jobs
    Queue1-->>Sched: 14. Job found (PENDING)
    Sched->>Sched: 15. Select least-loaded live worker<br/>(Node 2, via WorkerHeartbeat liveness)
    Sched->>Queue1: 16. Optimistic local assign<br/>Status: RUNNING (leader only)
    Sched->>Raft1: 17. Propose AssignJob(job_id, worker=2)

    par Replicate AssignJob to Followers
        Raft1->>N2: 18a. AppendEntries (AssignJob)
        Raft1->>N3: 18b. AppendEntries (AssignJob)
    end

    N2-->>Raft1: 19a. ACK
    N3-->>Raft1: 19b. ACK

    Note over N2,N3: Followers apply AssignJob<br/>Status: RUNNING on all nodes

    Note over Client,Docker: Job Execution Flow (Worker Loop - notified on assignment)

    Worker->>Queue1: 20. Notified, query jobs_assigned_to(node2)
    Queue1-->>Worker: 21. Job assigned to me<br/>(RUNNING status)

    Worker->>Docker: 22. docker run alpine:latest<br/>--network=none --read-only<br/>--memory=256m --cpus=0.5<br/>echo hello

    Docker-->>Worker: 23. Output: "hello\n"<br/>Exit code: 0

    Worker->>Queue1: 24. Update local job result<br/>Status: COMPLETED<br/>Output stored locally

    Note over Worker,Raft1: Follower worker calls ForwardJobStatus → leader proposes to Raft

    Worker->>Raft1: 25. ForwardJobStatus (job done, exit=0)
    Raft1->>N2: 26. AppendEntries (UpdateJobStatus)
    Raft1->>N3: 26. AppendEntries (UpdateJobStatus)

    Note over Client,Docker: Job Status Query Flow (Read Operation)

    Client->>N1: 27. GetJobStatus(job_id)
    N1->>Queue1: 28. Query job queue
    Queue1-->>N1: 29. Job metadata<br/>(executed_by: Node 2)
    N1->>N2: 30. GetJobOutput RPC<br/>(fetch from executor)
    N2-->>N1: 31. Output: "hello"
    N1-->>Client: 32. Response: COMPLETED<br/>Output: "hello"

    Note over Client,Docker: Leader Election Flow (Failure Scenario)

    Note over Raft1: Leader crashes!

    Note over N2,N3: Election timeout (150-300ms)<br/>No heartbeat received

    N2->>N2: Election timeout triggered
    N2->>N2: Increment term, become candidate
    N2->>N3: RequestVote RPC (term=T+1)
    N3-->>N2: VoteGranted

    Note over N2: Won election with majority

    N2->>N2: Become Leader
    Note over N2: Scheduler loop now active<br/>(assigns jobs)

    Note over N2,N3: Node 2 now leads cluster

Security

mTLS

All gRPC communication (node-to-node, client-to-node) can be secured with mutual TLS:

  • Both parties authenticate via certificates signed by cluster CA
  • All traffic encrypted with TLS 1.2+
  • Generate certs: ./scripts/gen-test-certs.sh ./certs

Docker Sandboxing

Jobs run in isolated containers with:

| Restriction | Setting |
|---|---|
| Network | `--network=none` |
| Capabilities | `--cap-drop=ALL` |
| Filesystem | `--read-only` |
| Privileges | `--security-opt=no-new-privileges` |
| Memory | `--memory=256m` |
| CPU | `--cpus=0.5` |
| Wall-clock timeout | 30s (hardcoded; jobs exceeding this are killed and marked `Failed`) |
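These restrictions translate directly into `docker run` flags. A minimal sketch of how a worker might assemble them (the function name and shape are illustrative, not the project's actual API):

```rust
/// Illustrative sketch: build the `docker run` argument vector that
/// applies the sandbox restrictions from the table above.
fn sandbox_args(image: &str, cmd: &str) -> Vec<String> {
    [
        "run", "--rm",
        "--network=none",                    // no network access
        "--cap-drop=ALL",                    // drop all Linux capabilities
        "--read-only",                       // read-only root filesystem
        "--security-opt=no-new-privileges",  // block privilege escalation
        "--memory=256m",                     // memory cap
        "--cpus=0.5",                        // CPU cap
        image, "sh", "-c", cmd,
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}

fn main() {
    let args = sandbox_args("alpine:latest", "echo hello");
    // The worker would hand these to std::process::Command::new("docker")
    // and enforce the 30 s wall-clock timeout around the spawned child.
    println!("docker {}", args.join(" "));
}
```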

Raft Consensus

Timing

  • Election timeout: 150-300ms (randomized)
  • Heartbeat interval: 50ms
  • Peer liveness window: 3s — a node is shown as is_alive: false in GetClusterStatus if no successful AppendEntries/heartbeat has been received from it within this window
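Randomizing the election timeout is what keeps two followers from timing out and splitting the vote at the same instant. A dependency-free sketch of picking a timeout in the 150-300 ms window (the real implementation would likely use a proper RNG):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::time::Duration;

/// Illustrative sketch: choose a randomized election timeout in the
/// 150..=300 ms window. Uses the randomly seeded std hasher as a cheap
/// entropy source to stay dependency-free.
fn election_timeout() -> Duration {
    let seed = RandomState::new().build_hasher().finish();
    Duration::from_millis(150 + (seed % 151)) // 150..=300 ms
}

fn main() {
    println!("election timeout: {:?}", election_timeout());
}
```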

Log Replication

  1. Client sends command to leader
  2. Leader appends to log and replicates via AppendEntries
  3. Majority acknowledgment → committed
  4. Applied to state machine
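Step 3 reduces to simple arithmetic over replication progress: counting the leader's own log, the highest index present on a strict majority is the lower median of all match indexes. A sketch (function name and types are assumptions):

```rust
/// Illustrative sketch: an entry is committed once a majority of nodes
/// hold it. With the leader's own last index included, that is the lower
/// median of all match indexes.
fn commit_index(leader_last: u64, follower_match: &[u64]) -> u64 {
    let mut all: Vec<u64> = follower_match.to_vec();
    all.push(leader_last);   // the leader always has its own entries
    all.sort_unstable();
    all[(all.len() - 1) / 2] // highest index replicated on a strict majority
}

fn main() {
    // 3 nodes: leader at 10, one follower caught up, one lagging -> commit 10.
    assert_eq!(commit_index(10, &[10, 4]), 10);
    // Both followers lagging -> only index 4 is majority-replicated.
    assert_eq!(commit_index(10, &[4, 4]), 4);
}
```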

Proposal backpressure

The leader’s proposal channel holds at most 256 pending commands. If the channel is full, SubmitJob returns RESOURCE_EXHAUSTED immediately — the caller should retry with exponential backoff. Additionally, if the Raft loop accepts the command but quorum is lost before the entry commits, SubmitJob returns DEADLINE_EXCEEDED after 5 seconds. Both guarantees ensure clients always get a bounded response rather than hanging indefinitely.
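The "fail fast when full" behavior maps naturally onto a bounded channel with a non-blocking send. A sketch using a std bounded channel (capacity shrunk from 256 to 2 for the demo):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Illustrative sketch of the backpressure rule: `try_send` on a bounded
// channel fails immediately when the queue is full, which is what lets
// the leader map a full proposal queue to RESOURCE_EXHAUSTED instead of
// blocking the RPC handler.
fn main() {
    let (tx, _rx) = sync_channel::<&str>(2); // real capacity: 256
    tx.try_send("job-1").unwrap();
    tx.try_send("job-2").unwrap();
    match tx.try_send("job-3") {
        Err(TrySendError::Full(_)) => println!("RESOURCE_EXHAUSTED: retry with backoff"),
        _ => unreachable!("capacity 2 is exhausted"),
    }
}
```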

State Machine Commands

Each log entry carries one of the following commands applied to the job queue on every node:

| Command | Description |
|---|---|
| `SubmitJob` | Adds a new job in `Pending` state |
| `AssignJob` | Assigns a pending job to a specific worker, setting it to `Running` on all nodes |
| `RegisterWorker` | Records a worker in the cluster-wide registry |
| `UpdateJobStatus` / `BatchUpdateJobStatus` | Marks one or more jobs `Completed` or `Failed` |

AssignJob is proposed by the leader’s scheduler loop after it selects the least-loaded live worker. Because this command travels through the Raft log, every follower applies the same assignment atomically — workers discover their jobs by querying the local job queue rather than a node-local in-memory map.
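The commands above can be sketched as an enum applied deterministically to each node's job queue; the exact field names here are assumptions, but the shape follows the table:

```rust
use std::collections::HashMap;

/// Illustrative sketch of the replicated state machine: each committed
/// log entry carries one command, applied to the local job queue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState { Pending, Running, Completed, Failed }

enum Command {
    SubmitJob { job_id: u64 },
    AssignJob { job_id: u64, worker: u32 },
    UpdateJobStatus { job_id: u64, success: bool },
}

fn apply(queue: &mut HashMap<u64, (JobState, Option<u32>)>, cmd: &Command) {
    match cmd {
        Command::SubmitJob { job_id } => {
            // Idempotent: re-applying after a snapshot restore is a no-op.
            queue.entry(*job_id).or_insert((JobState::Pending, None));
        }
        Command::AssignJob { job_id, worker } => {
            if let Some(slot) = queue.get_mut(job_id) {
                *slot = (JobState::Running, Some(*worker));
            }
        }
        Command::UpdateJobStatus { job_id, success } => {
            if let Some((state, _)) = queue.get_mut(job_id) {
                *state = if *success { JobState::Completed } else { JobState::Failed };
            }
        }
    }
}

fn main() {
    let mut queue = HashMap::new();
    apply(&mut queue, &Command::SubmitJob { job_id: 1 });
    apply(&mut queue, &Command::AssignJob { job_id: 1, worker: 2 });
    apply(&mut queue, &Command::UpdateJobStatus { job_id: 1, success: true });
    assert_eq!(queue[&1], (JobState::Completed, Some(2)));
}
```

Because every node replays the same command sequence, workers can discover their assignments from the local queue alone.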

Log Compaction

When the in-memory log exceeds 1000 entries, the committed prefix is replaced with a snapshot of the current state (job queue + worker registrations). This bounds memory usage regardless of throughput. Followers that fall too far behind receive the snapshot via InstallSnapshot RPC instead of replaying individual entries.
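The trigger logic amounts to swapping the committed prefix for a snapshot once the log crosses the threshold. A minimal sketch, with the threshold lowered from 1000 for the demo and `Vec<String>` standing in for real entry and snapshot types:

```rust
const COMPACT_THRESHOLD: usize = 4; // real threshold: 1000 entries

/// Illustrative sketch: fold the committed log prefix into a snapshot.
/// `snapshot` stands in for the serialized job queue + worker registry.
fn maybe_compact(log: &mut Vec<String>, commit_index: usize, snapshot: &mut Vec<String>) {
    if log.len() > COMPACT_THRESHOLD {
        // Committed prefix moves into the snapshot; the tail stays in the
        // log for normal AppendEntries replication.
        snapshot.extend(log.drain(..commit_index));
    }
}

fn main() {
    let mut log: Vec<String> = (1..=5).map(|i| format!("entry-{i}")).collect();
    let mut snapshot = Vec::new();
    maybe_compact(&mut log, 3, &mut snapshot);
    assert_eq!(log.len(), 2);      // entries after the snapshot point remain
    assert_eq!(snapshot.len(), 3); // committed prefix now lives in the snapshot
}
```

A follower whose next needed index falls inside the compacted prefix can no longer be served by AppendEntries, which is when the leader falls back to InstallSnapshot.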

State Persistence

By default, all Raft state (log, term, voted-for, snapshot) is held in memory and lost on restart. Pass --data-dir <path> to enable a local RocksDB store. On startup the node reloads its persisted term, voted-for, and committed log entries and replays any stored snapshot into the job queue before rejoining the cluster. Storage failures cause an immediate crash (crash-fail-safe) rather than silent data loss.
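One plausible way to structure this (a sketch under assumed names, not the project's actual API) is a storage trait that the Raft core writes hard state through, so the default in-memory store and the RocksDB store behind `--data-dir` are interchangeable:

```rust
/// Illustrative sketch of the persistence boundary.
trait RaftStorage {
    /// Must be durable before the node answers RPCs; a RocksDB-backed
    /// implementation would crash the process on a write failure.
    fn save_hard_state(&mut self, term: u64, voted_for: Option<u32>);
    /// Returns None on a fresh store (first boot).
    fn load_hard_state(&self) -> Option<(u64, Option<u32>)>;
}

#[derive(Default)]
struct MemStorage {
    hard: Option<(u64, Option<u32>)>,
}

impl RaftStorage for MemStorage {
    fn save_hard_state(&mut self, term: u64, voted_for: Option<u32>) {
        self.hard = Some((term, voted_for));
    }
    fn load_hard_state(&self) -> Option<(u64, Option<u32>)> {
        self.hard
    }
}

fn main() {
    let mut store = MemStorage::default();
    assert_eq!(store.load_hard_state(), None); // fresh store
    store.save_hard_state(7, Some(2));
    assert_eq!(store.load_hard_state(), Some((7, Some(2))));
}
```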

Safety Guarantees

  • Election safety: One leader per term
  • Leader append-only: Never overwrites log
  • Log matching: Same index/term = identical
  • Leader completeness: Committed entries persist

Cluster Sizing

| Nodes | Majority | Fault Tolerance |
|---|---|---|
| 3 | 2 | 1 failure |
| 5 | 3 | 2 failures |
| 7 | 4 | 3 failures |

Use odd numbers—even numbers add overhead without improving fault tolerance.
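The table follows directly from quorum arithmetic: a majority is `n / 2 + 1`, and the cluster tolerates the loss of everything beyond that majority.

```rust
/// Quorum arithmetic behind the cluster-sizing table.
fn majority(n: u32) -> u32 { n / 2 + 1 }
fn fault_tolerance(n: u32) -> u32 { n - majority(n) }

fn main() {
    assert_eq!((majority(3), fault_tolerance(3)), (2, 1));
    assert_eq!((majority(5), fault_tolerance(5)), (3, 2));
    assert_eq!((majority(7), fault_tolerance(7)), (4, 3));
    // A 4th node raises the quorum to 3 yet still tolerates only 1 failure,
    // which is why even cluster sizes add overhead without benefit.
    assert_eq!(fault_tolerance(4), fault_tolerance(3));
}
```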

Testing

```sh
cargo test                # Run all tests
cargo test --lib          # Unit tests only
cargo test --test <name>  # Specific test suite
```

Test Suites

| Suite | Description |
|---|---|
| `lib` (unit) | Config, Raft state machine, RPC serialization |
| `scheduler_tests` | Job queue, worker assignment, heartbeats |
| `raft_rpc_tests` | AppendEntries, RequestVote, InstallSnapshot, term handling, compaction boundary cases, re-election vote reset |
| `integration_tests` | Multi-node election, replication, consistency |
| `failover_tests` | Leader crash, re-election, quorum loss |
| `partition_tests` | Network partitions, split-brain prevention, healing |
| `chaos_tests` | Rapid leader churn, network flapping, cascading failures, full isolation recovery |
| `tls_tests` | mTLS certificate loading, encrypted cluster communication |
| `executor_tests` | Docker sandbox command execution |
| `internal_service_tests` | Internal node-to-node API, job output fetching, ForwardJobStatus to leader |
| `distribution_tests` | Jobs spread across nodes (not just leader), WorkerHeartbeat RPC, dead-worker exclusion |
| `dashboard_tests` | REST API endpoints |
| `leadership_transfer_tests` | Voluntary leadership transfer, auto-select, non-leader rejection |
| `drain_tests` | Node draining, job rejection during drain, leadership handoff |
| `compaction_tests` | Log compaction trigger, snapshot transfer to slow followers, multiple compaction rounds, AppendEntries fallback when below compaction threshold, state consistency |
| `backpressure_tests` | Full proposal channel returns RESOURCE_EXHAUSTED immediately; stalled Raft loop returns DEADLINE_EXCEEDED after 5 s |
| `persistence_tests` | RocksDB storage: fresh DB returns None, hard state/log reload, log truncation, snapshot compaction deletes old entries, post-compaction state reconstruction on restart, node term survives restart |