Introduction
nomad-lite is a distributed job scheduler with a custom Raft consensus implementation, similar in spirit to HashiCorp Nomad or the Kubernetes scheduler. Jobs are shell commands executed in isolated Docker containers across a cluster.
Features
- Custom Raft Consensus - Leader election, log replication, and fault tolerance from scratch
- Distributed Scheduling - Jobs executed across cluster with automatic failover
- Unified CLI - Single binary for both server and client with automatic leader redirect
- mTLS Security - Mutual TLS for all gRPC communication
- Docker Sandboxing - Jobs run in isolated containers with restricted capabilities
- Web Dashboard - Real-time monitoring and job management
- gRPC + REST APIs - Type-safe client communication
- Job Cancellation - Cancel pending or running jobs via CLI, gRPC, or REST dashboard
- Health Endpoints - /health/live (process alive) and /health/ready (leader known) for orchestrator integration
- Graceful Shutdown - SIGTERM/SIGINT handling with drain period for in-flight work
- Leader Draining & Transfer - Voluntary leadership transfer and node draining for safe maintenance
- Batch Replication - Multiple job status updates batched into a single Raft log entry for reduced consensus overhead
- State Persistence - Optional RocksDB-backed Raft log, term, and snapshot storage via --data-dir; nodes survive restarts and rejoin without losing committed state
- Log Compaction - Automatic in-memory log prefix truncation with snapshot transfer to slow followers
- Proposal Backpressure - Bounded proposal queue (256 slots); clients get an immediate RESOURCE_EXHAUSTED when the leader is overloaded and a DEADLINE_EXCEEDED if a commit stalls, rather than hanging indefinitely
Requirements
| Dependency | Version | Installation |
|---|---|---|
| Rust | 1.56+ | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh |
| protoc | 3.0+ | apt install protobuf-compiler / brew install protobuf |
| Docker | 20.0+ | apt install docker.io / brew install --cask docker |
Nomad vs nomad-lite
nomad-lite draws direct inspiration from HashiCorp Nomad, a production-grade workload orchestrator. This page documents where the two systems overlap and where they diverge.
nomad-lite is not a drop-in replacement for Nomad. It is a purpose-built implementation of the core distributed scheduling concepts — Raft consensus, distributed job assignment, worker liveness, graceful failover — written from scratch in Rust as a learning and experimentation platform. Understanding where Nomad sets the bar makes it easier to reason about which parts of that bar nomad-lite has already cleared and which remain ahead.
Scheduler Types
Nomad ships three built-in schedulers, each with a fundamentally different execution model:
| Scheduler | Nomad | nomad-lite |
|---|---|---|
| batch | Run-to-completion jobs; retries on failure | ✅ Core model |
| service | Long-running daemons; keeps N instances alive, restarts on crash | Not supported |
| system | Runs exactly one instance on every node (log shippers, exporters) | Not supported |
nomad-lite is a pure batch scheduler. Every job is expected to start, do work, and exit. Long-running services and system-wide daemons are outside its current scope.
Task Drivers (How Jobs Run)
Nomad abstracts the execution runtime through pluggable task drivers:
| Driver | Nomad | nomad-lite |
|---|---|---|
| docker | Full image + entrypoint + env control | Partial — configurable image (default alpine:latest), sh -c only |
| exec / raw_exec | Direct process execution, no container | Not supported |
| java | JVM workloads with classpath management | Not supported |
| qemu | Full virtual machine execution | Not supported |
| podman, containerd | OCI-compatible runtimes via plugins | Not supported |
nomad-lite runs every job as docker run --rm <image> sh -c <command>. The default image is set per node with --image and can be overridden per job with job submit --image; a custom entrypoint or per-job environment cannot be specified.
Scheduling Features
| Feature | Nomad | nomad-lite |
|---|---|---|
| Placement algorithm | Bin-packing (CPU + memory aware) | Least-loaded (running job count) |
| Resource constraints | CPU, memory, GPU, disk declared per job | Not supported |
| Node constraints | Attribute expressions (kernel.name == linux) | Not supported |
| Affinities | Soft preferences for placement | Not supported |
| Spread | Distribute across failure domains (AZ, rack) | Not supported |
| Job priorities | Integer 1–100; high priority preempts low | Not supported |
| Preemption | Evict lower-priority running jobs to place high-priority ones | Not supported |
nomad-lite’s assigner selects the worker with the fewest running jobs. This is a reasonable approximation of load balancing but ignores actual resource consumption — a node running one heavy job looks the same as one running one trivial job.
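As a sketch, the least-loaded pick reduces to a min-by over per-worker running-job counts. The tie-break by lowest node ID here is an assumption for determinism, not confirmed behavior:

```rust
use std::collections::HashMap;

/// Pick the worker with the fewest running jobs.
/// Ties broken by lowest node ID (illustrative assumption).
fn least_loaded(running: &HashMap<u64, usize>) -> Option<u64> {
    running
        .iter()
        .min_by_key(|(id, count)| (**count, **id))
        .map(|(id, _)| *id)
}

fn main() {
    let mut running = HashMap::new();
    running.insert(1u64, 3); // node 1 runs 3 jobs
    running.insert(2, 1);    // node 2 runs 1 job
    running.insert(3, 2);    // node 3 runs 2 jobs
    // Node 2 has the fewest running jobs, so it wins.
    assert_eq!(least_loaded(&running), Some(2));
}
```

As the surrounding text notes, the count is job cardinality only: a node running one heavy job and a node running one trivial job look identical to this function.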
Job Lifecycle
| Feature | Nomad | nomad-lite |
|---|---|---|
| Job submission | HCL job spec file | Single string command (gRPC / CLI) |
| Job timeouts | kill_timeout, max_kill_timeout per task | Node-level 30 s wall-clock timeout (kills container + marks Failed); per-job configurable timeout not supported |
| Retry policy | restart stanza: attempts, delay, mode | Not supported |
| Reschedule on node loss | reschedule stanza with backoff | Not supported |
| Job cancellation | nomad job stop | ✅ nomad-lite job cancel <job-id> |
| Job priorities | 1–100 integer field | Not supported |
| Parameterized jobs | parameterized stanza; dispatch via CLI/API | Not supported |
| Periodic jobs | Cron expression in job spec | Not supported |
| Job dependencies (DAG) | Not native; requires external tooling | Not supported |
| Rolling updates | update stanza with canary, max_parallel | Not supported |
| Job versioning | Tracks job spec history, auto-reverts on failure | Not supported |
Consensus and State
| Feature | Nomad | nomad-lite |
|---|---|---|
| Consensus protocol | Raft (via HashiCorp Raft library) | Custom Raft implementation in Rust |
| State persistence | Durable BoltDB-backed Raft log | ✅ Optional RocksDB-backed log + snapshot via --data-dir; in-memory if omitted |
| Log compaction | Raft snapshots to BoltDB | In-memory prefix truncation + snapshot; persisted to RocksDB when --data-dir is set |
| Multi-region | Federation across regions with replication | Single cluster only |
| Leader election | Randomized timeouts | Randomized timeouts (150–300 ms) |
| Leadership transfer | nomad operator raft transfer-leadership | ✅ nomad-lite cluster transfer-leader |
| Node drain | nomad node drain | ✅ nomad-lite cluster drain |
nomad-lite now supports optional state persistence via --data-dir. When enabled, every
Raft log entry, term change, voted-for value, and snapshot is written to a local RocksDB store
before being acknowledged. A crashed node can rejoin, replay its log, and resume without losing
any committed state. Without --data-dir the node runs in-memory only.
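The ordering rule behind this guarantee — make the write durable, then acknowledge — can be illustrated with a plain file standing in for RocksDB. Everything here (file name, text layout) is purely illustrative, not the actual on-disk format:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Illustrative sketch of the persistence ordering rule: the term and
/// voted-for value hit stable storage (fsync) before the node is allowed
/// to acknowledge a vote or an append. The real store is RocksDB.
fn persist_hard_state(dir: &Path, term: u64, voted_for: Option<u64>) -> std::io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut f = fs::File::create(dir.join("hard_state"))?;
    writeln!(f, "term={}", term)?;
    writeln!(f, "voted_for={:?}", voted_for)?;
    f.sync_all() // durable first; only after this may the node ACK
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("nomad-lite-demo");
    persist_hard_state(&dir, 5, Some(2))?;
    let s = fs::read_to_string(dir.join("hard_state"))?;
    assert!(s.contains("term=5"));
    Ok(())
}
```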
Security
| Feature | Nomad | nomad-lite |
|---|---|---|
| Transport encryption | mTLS for all RPC | ✅ mTLS implemented |
| ACL system | Token-based with policies and roles | Not supported |
| Vault integration | Secrets injection at job start | Not supported |
| Namespaces | Multi-tenant isolation | Not supported |
| Sentinel policies | Fine-grained job submission governance | Not supported |
nomad-lite implements the transport layer (mTLS) but has no authorization model. Any client that presents a valid certificate can submit jobs, drain nodes, or transfer leadership. In a controlled environment this is acceptable; in a shared environment it is not.
Observability
| Feature | Nomad | nomad-lite |
|---|---|---|
| Metrics | Prometheus endpoint (/v1/metrics) | Not supported |
| Distributed tracing | OpenTelemetry support | Not supported |
| Health endpoints | /v1/agent/health (live + ready) | ✅ /health/live + /health/ready |
| Audit logging | Immutable audit log (Enterprise) | Not supported |
| Web UI | Full job and cluster management UI | ✅ Basic dashboard (status + job list) |
| Log streaming | nomad alloc logs -f | Not supported |
Client API
| Feature | Nomad | nomad-lite |
|---|---|---|
| Protocol | HTTP/JSON REST + gRPC | gRPC (primary) + HTTP (dashboard only) |
| Job submission | HCL / JSON job spec | string command field |
| Streaming | Event stream, log tailing | ✅ StreamJobs gRPC streaming |
| Pagination | Cursor-based | ✅ Cursor-based ListJobs |
| Batch submission | Multiple allocations per job | Not supported |
| Leader redirect | X-Nomad-Index + redirect hints | ✅ CLI auto-redirects to leader |
Feature Summary
┌─────────────────────────────────────────┐
│ nomad-lite feature coverage │
└─────────────────────────────────────────┘
Consensus layer
✅ Leader election (randomized timeouts)
✅ Log replication (AppendEntries)
✅ Log compaction + snapshots
✅ Leadership transfer
✅ InstallSnapshot for lagging followers
✅ Dedicated heartbeat channel (no HOL blocking)
✅ Proposal backpressure (bounded queue, immediate RESOURCE_EXHAUSTED)
Cluster operations
✅ Node drain (stop accepting, finish in-flight, transfer leadership)
✅ Graceful shutdown (SIGTERM/SIGINT with 30 s drain window)
✅ mTLS for all cluster and client communication
✅ Peer liveness tracking
Job scheduling
✅ Distributed job assignment via Raft (all nodes agree on assignments)
✅ Least-loaded worker selection
✅ Worker heartbeat liveness (5 s window)
✅ Batch status update replication
✅ Job output stored on executing node, fetched on demand
Client-facing
✅ SubmitJob / CancelJob / GetJobStatus / ListJobs (paginated) / StreamJobs
✅ GetClusterStatus
✅ GetRaftLogEntries (debug)
✅ CLI with automatic leader redirect
✅ Web dashboard
✅ /health/live + /health/ready endpoints
Persistence
✅ State persistence (RocksDB-backed Raft log + snapshot via --data-dir)
Not supported / Out of scope
⬜ Job timeouts, retries, priorities
⬜ Reschedule on node failure
⬜ Typed job payloads (HttpCallback, WasmModule, DockerImage)
⬜ Parameterized job templates
⬜ Periodic / cron jobs
⬜ Resource-aware placement
⬜ Job dependency graphs (DAG)
⬜ Shared output storage
⬜ Prometheus metrics / OpenTelemetry tracing
⬜ Authorization (ACL tokens)
Getting Started
Quick Start
# Build and install
cargo install --path .
# Run single node with dashboard
nomad-lite server --node-id 1 --port 50051 --dashboard-port 8080
# Open http://localhost:8080
# Submit a job (in another terminal)
nomad-lite job submit "echo hello"
# Check job status
nomad-lite job status <job-id>
# View cluster status
nomad-lite cluster status
Running a Cluster
Option 1: Docker Compose (Recommended)
The easiest way to run a 3-node cluster. Choose between production (mTLS) or development (no TLS) setup.
Production Setup (mTLS)
# Generate certificates (one-time setup)
./scripts/gen-test-certs.sh ./certs
# Start cluster
docker-compose up --build
# Stop cluster
docker-compose down
Using the CLI with mTLS:
# Check cluster status
nomad-lite cluster \
--addr "https://127.0.0.1:50051" \
--ca-cert ./certs/ca.crt \
--cert ./certs/client.crt \
--key ./certs/client.key \
status
# Find the leader node from cluster status output, then submit to it
# Example: if Node 3 is leader, use port 50053
nomad-lite job \
--addr "https://127.0.0.1:50053" \
--ca-cert ./certs/ca.crt \
--cert ./certs/client.crt \
--key ./certs/client.key \
submit "echo hello from mTLS"
# Get job status
nomad-lite job \
--addr "https://127.0.0.1:50051" \
--ca-cert ./certs/ca.crt \
--cert ./certs/client.crt \
--key ./certs/client.key \
status <job-id>
# List all jobs
nomad-lite job \
--addr "https://127.0.0.1:50051" \
--ca-cert ./certs/ca.crt \
--cert ./certs/client.crt \
--key ./certs/client.key \
list
Development Setup (No TLS)
# Start cluster
docker-compose -f docker-compose.dev.yml up --build
# Stop cluster
docker-compose -f docker-compose.dev.yml down
Using the CLI without TLS:
# Check cluster status
nomad-lite cluster --addr "http://127.0.0.1:50051" status
# Find the leader node from cluster status output, then submit to it
# Example: if Node 1 is leader, use port 50051
nomad-lite job --addr "http://127.0.0.1:50051" submit "echo hello"
# Get job status
nomad-lite job --addr "http://127.0.0.1:50051" status <job-id>
# List all jobs
nomad-lite job --addr "http://127.0.0.1:50051" list
Cluster Endpoints:
| Node | gRPC | Dashboard |
|---|---|---|
| 1 | localhost:50051 | localhost:8081 |
| 2 | localhost:50052 | localhost:8082 |
| 3 | localhost:50053 | localhost:8083 |
Note: Auto-redirect doesn’t work with Docker Compose because nodes report internal addresses (e.g., node1:50051) that aren’t accessible from the host. Always check which node is leader using cluster status and connect directly to it for job submissions.
Option 2: Local Multi-Node
Run nodes directly without Docker:
# Terminal 1
nomad-lite server --node-id 1 --port 50051 --dashboard-port 8081 \
--peers "2:127.0.0.1:50052,3:127.0.0.1:50053"
# Terminal 2
nomad-lite server --node-id 2 --port 50052 --dashboard-port 8082 \
--peers "1:127.0.0.1:50051,3:127.0.0.1:50053"
# Terminal 3
nomad-lite server --node-id 3 --port 50053 --dashboard-port 8083 \
--peers "1:127.0.0.1:50051,2:127.0.0.1:50052"
With state persistence (survives restarts):
# Add --data-dir to each node; state is preserved across restarts
nomad-lite server --node-id 1 --port 50051 --dashboard-port 8081 \
--peers "2:127.0.0.1:50052,3:127.0.0.1:50053" \
--data-dir /var/lib/nomad-lite/node1
nomad-lite server --node-id 2 --port 50052 --dashboard-port 8082 \
--peers "1:127.0.0.1:50051,3:127.0.0.1:50053" \
--data-dir /var/lib/nomad-lite/node2
nomad-lite server --node-id 3 --port 50053 --dashboard-port 8083 \
--peers "1:127.0.0.1:50051,2:127.0.0.1:50052" \
--data-dir /var/lib/nomad-lite/node3
Each node stores its Raft log, term, voted-for, and job snapshot in a local RocksDB instance.
Omit --data-dir to run with in-memory state (useful for development and testing).
With mTLS:
# Generate certificates first
./scripts/gen-test-certs.sh ./certs
# Add TLS flags to each node
--tls --ca-cert ./certs/ca.crt --cert ./certs/node1.crt --key ./certs/node1.key
CLI Reference
The nomad-lite binary provides both server and client functionality.
nomad-lite
├── server # Start a server node
├── job # Job management
│ ├── submit <COMMAND> # Submit a new job
│ ├── status <JOB_ID> # Get job status
│ ├── cancel <JOB_ID> # Cancel a pending or running job
│ └── list # List all jobs
├── cluster # Cluster management
│ ├── status # Get cluster info
│ ├── transfer-leader # Transfer leadership to another node
│ └── drain # Drain node for maintenance
└── log # Raft log inspection
└── list # View committed log entries
Server Options
| Flag | Default | Description |
|---|---|---|
| --node-id | 1 | Unique node identifier |
| --port | 50051 | gRPC server port |
| --dashboard-port | - | Web dashboard port (optional) |
| --peers | "" | Peer addresses: "id:host:port,..." |
| --image | alpine:latest | Default Docker image for jobs |
| --tls | false | Enable mTLS |
| --ca-cert | - | CA certificate path |
| --cert | - | Node certificate path |
| --key | - | Node private key path |
| --allow-insecure | false | Run without TLS if certs fail |
| --data-dir | - | Path to RocksDB data directory for persistence (optional; omit for in-memory) |
| --advertise-addr | 127.0.0.1:<port> | Address reported to peers and shown in cluster status; defaults to the listen port on loopback; override when nodes must be reachable at a specific hostname or external IP (e.g., 192.168.1.10:50051) |
Client Options
| Flag | Default | Description |
|---|---|---|
| -a, --addr | http://127.0.0.1:50051 | Server address |
| -o, --output | table | Output format: table or json |
| --ca-cert | - | CA certificate for TLS |
| --cert | - | Client certificate for mTLS |
| --key | - | Client private key for mTLS |
job submit Options
| Flag | Default | Description |
|---|---|---|
| --image IMAGE | server default (alpine:latest) | Docker image to run this job in; overrides the server-wide --image setting for this job only |
Command Examples
Submit a job:
nomad-lite job submit "echo hello"
# Job submitted successfully!
# Job ID: ef319e40-c888-490d-8349-e9c05f78cf5a
# Override the Docker image for this specific job
nomad-lite job submit --image python:3.12-alpine "python3 -c 'print(42)'"
# Job submitted successfully!
# Job ID: 3b7a1c22-...
Get job status:
nomad-lite job status ef319e40-c888-490d-8349-e9c05f78cf5a
# Job ID: ef319e40-c888-490d-8349-e9c05f78cf5a
# Status: COMPLETED
# Exit Code: 0
# Assigned Worker: 1
# Executed By: 1
# Output:
# hello
Cancel a job:
nomad-lite job cancel ef319e40-c888-490d-8349-e9c05f78cf5a
# job ef319e40-c888-490d-8349-e9c05f78cf5a cancelled
Cancelling a job that is already terminal returns an error:
nomad-lite job cancel ef319e40-c888-490d-8349-e9c05f78cf5a
# Error: job is already completed
List all jobs:
nomad-lite job list
# JOB ID STATUS WORKER COMMAND
# ------------------------------------------------------------------------------
# ef319e40-c888-490d-8349-e9c05f78cf5a COMPLETED 1 echo hello
#
# Showing 1 of 1 jobs
List jobs with pagination:
nomad-lite job list --page-size 50 --all # Fetch all pages
nomad-lite job list --stream # Use streaming API
Filter jobs:
# Filter by status
nomad-lite job list --status pending
nomad-lite job list --status completed
# Filter by worker node
nomad-lite job list --worker 2
# Filter by command substring (case-insensitive)
nomad-lite job list --command-substr echo
# Filter by creation time range (Unix ms timestamps)
nomad-lite job list --created-after-ms 1700000000000 --created-before-ms 1710000000000
# Combine filters
nomad-lite job list --status pending --command-substr echo --worker 1
Filter flags for job list:
| Flag | Description |
|---|---|
| --status STATUS | Only jobs with this status: pending, running, completed, failed, cancelled |
| --worker WORKER_ID | Only jobs whose assigned_worker or executed_by matches this node ID |
| --command-substr SUBSTR | Case-insensitive substring match on the job command |
| --created-after-ms MS | Only jobs created at or after this Unix timestamp (milliseconds) |
| --created-before-ms MS | Only jobs created at or before this Unix timestamp (milliseconds) |
Note: Filters are not supported with --stream (only --status applies to the streaming API).
Get cluster status:
nomad-lite cluster status
# Cluster Status
# ========================================
# Term: 5
# Leader: Node 1
#
# Nodes:
# ID ADDRESS STATUS
# ---------------------------------------------
# 1 0.0.0.0:50051 [+] alive
# 2 127.0.0.1:50052 [+] alive
# 3 127.0.0.1:50053 [+] alive
Transfer leadership:
# Transfer to a specific node
nomad-lite cluster -a http://127.0.0.1:50051 transfer-leader --to 2
# Leadership transferred successfully!
# New leader: Node 2
# Auto-select best candidate
nomad-lite cluster -a http://127.0.0.1:50051 transfer-leader
# Leadership transferred successfully!
# New leader: Node 3
Drain a node for maintenance:
# Drain the node: stops accepting jobs, waits for running jobs, transfers leadership
nomad-lite cluster -a http://127.0.0.1:50051 drain
# Draining node...
# Node drained successfully
# Verify leadership moved
nomad-lite cluster -a http://127.0.0.1:50052 status
View Raft log entries:
nomad-lite log list
# Raft Log Entries
# ================================================================================
# Commit Index: 6 | Last Log Index: 6
#
# INDEX TERM COMMITTED TYPE DETAILS
# --------------------------------------------------------------------------------
# 1 1 yes Noop
# 2 1 yes SubmitJob job_id=bd764021-..., cmd=echo job1
# 3 1 yes SubmitJob job_id=1cce681f-..., cmd=echo job2
# 4 1 yes SubmitJob job_id=26694755-..., cmd=echo job3
# 5 1 yes BatchUpdateJobStatus 3 updates
# 6 1 yes UpdateJobStatus job_id=17cc39b2-..., status=Completed
#
# Showing 6 entries
View log entries after compaction:
When log compaction has occurred, earlier entries are replaced by a snapshot:
nomad-lite log list
# Raft Log Entries
# ================================================================================
# First Available: 1001 | Commit Index: 1050 | Last Log Index: 1050
# (Entries 1-1000 were compacted into a snapshot)
#
# INDEX TERM COMMITTED TYPE DETAILS
# ...
View log entries with pagination:
nomad-lite log list --start-index 1 --limit 50 # Start from index 1, max 50 entries
nomad-lite log list --start-index 10 # Start from index 10
JSON output:
nomad-lite job -o json list
nomad-lite cluster -o json status
nomad-lite log -o json list
Automatic Leader Redirect
For local clusters (non-Docker), the CLI automatically redirects to the leader if you connect to a follower. This works for both submit and cancel:
# Connect to follower node (port 50052), CLI auto-redirects to leader
nomad-lite job -a http://127.0.0.1:50052 submit "echo hello"
# Redirecting to leader at http://127.0.0.1:50051...
# Job submitted successfully!
nomad-lite job -a http://127.0.0.1:50052 cancel <job-id>
# Redirecting to leader at http://127.0.0.1:50051...
# job <job-id> cancelled
API Reference
REST API (Dashboard)
The web dashboard exposes a REST API for job and cluster management.
Get cluster status:
curl http://localhost:8081/api/cluster
# Response:
# {
# "node_id": 1,
# "role": "leader",
# "current_term": 5,
# "leader_id": 1,
# "commit_index": 3,
# "last_applied": 3,
# "log_length": 3,
# "nodes": [
# { "node_id": 1, "address": "0.0.0.0:50051", "is_alive": true },
# { "node_id": 2, "address": "127.0.0.1:50052", "is_alive": true },
# { "node_id": 3, "address": "127.0.0.1:50053", "is_alive": false }
# ]
# }
Submit a job:
curl -X POST http://localhost:8081/api/jobs \
-H "Content-Type: application/json" \
-d '{"command": "echo hello"}'
# Response:
# {
# "job_id": "ef319e40-c888-490d-8349-e9c05f78cf5a",
# "status": "pending"
# }
# With a specific Docker image (overrides the server default for this job)
curl -X POST http://localhost:8081/api/jobs \
-H "Content-Type: application/json" \
-d '{"command": "python3 -c '\''print(42)'\''", "image": "python:3.12-alpine"}'
Cancel a job:
curl -X DELETE http://localhost:8081/api/jobs/ef319e40-c888-490d-8349-e9c05f78cf5a
# Response (success):
# {
# "success": true,
# "error": null
# }
# Response (already terminal):
# HTTP 400
# {
# "success": false,
# "error": "job is already completed"
# }
List all jobs:
curl http://localhost:8081/api/jobs
# Response:
# [
# {
# "id": "ef319e40-c888-490d-8349-e9c05f78cf5a",
# "command": "echo hello",
# "status": "completed",
# "executed_by": 1,
# "output": "hello\n",
# "error": null,
# "created_at": "2026-01-28T12:45:41.231558433+00:00",
# "completed_at": "2026-01-28T12:45:41.678341558+00:00"
# }
# ]
Liveness probe:
curl http://localhost:8081/health/live
# Response (always 200 while the process is alive):
# {
# "status": "ok"
# }
Readiness probe:
curl http://localhost:8081/health/ready
# Response when a leader has been elected (200):
# {
# "status": "ok",
# "leader_id": 1
# }
# Response during startup or mid-election (503):
# {
# "status": "no_leader",
# "leader_id": null
# }
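The readiness decision reduces to whether a leader is currently known. A minimal sketch of the status-code and body mapping (the function name and return shape are illustrative, not the actual handler):

```rust
/// Readiness: Some(leader) -> 200 "ok", None -> 503 "no_leader".
/// Sketch of the mapping only; the real handler lives in the dashboard server.
fn ready_response(leader_id: Option<u64>) -> (u16, String) {
    match leader_id {
        Some(id) => (200, format!(r#"{{"status":"ok","leader_id":{}}}"#, id)),
        None => (503, r#"{"status":"no_leader","leader_id":null}"#.to_string()),
    }
}

fn main() {
    // Leader known: orchestrator may route traffic to this node.
    assert_eq!(ready_response(Some(1)).0, 200);
    // Mid-election: report not-ready so probes back off.
    assert_eq!(ready_response(None).0, 503);
}
```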
gRPC API
SchedulerService (client-facing)
| Method | Description | Leader Only |
|---|---|---|
| SubmitJob(command, image?) | Submit a job; image overrides the server-default Docker image for this job | Yes |
| CancelJob(job_id) | Cancel a pending or running job | Yes |
| GetJobStatus(job_id) | Get job status | No |
| ListJobs(page_size, page_token, status_filter, worker_id_filter, command_filter, created_after_ms, created_before_ms) | List jobs (paginated, filterable) | No |
| StreamJobs() | Stream jobs | No |
| GetClusterStatus() | Cluster info | Forwarded to leader |
| GetRaftLogEntries() | View Raft log entries | Forwarded to leader |
| TransferLeadership(target) | Transfer leadership | Yes |
| DrainNode() | Drain node for maintenance | No |
ListJobs request fields
| Field | Type | Default | Description |
|---|---|---|---|
| page_size | uint32 | 100 | Max results per page (capped at 1000) |
| page_token | string | "" | Token from the previous response for the next page |
| status_filter | JobStatus | UNSPECIFIED | Only return jobs with this status; 0/UNSPECIFIED = no filter |
| worker_id_filter | uint64 | 0 | Only return jobs whose assigned_worker or executed_by matches; 0 = no filter |
| command_filter | string | "" | Case-insensitive substring match on the command; empty = no filter |
| created_after_ms | int64 | 0 | Only return jobs created at or after this Unix timestamp (ms); 0 = no bound |
| created_before_ms | int64 | 0 | Only return jobs created at or before this Unix timestamp (ms); 0 = no bound |
total_count in the response reflects the filtered result set size (not the total queue size).
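A client drains all pages by looping until the returned page token comes back empty. This sketch drives a stand-in `fetch` closure in place of the real gRPC call (the closure signature is hypothetical):

```rust
/// Generic cursor-pagination loop, as a ListJobs client would drive it:
/// start with an empty token, keep requesting with the returned token,
/// stop when the next token is empty.
fn list_all<F>(mut fetch: F) -> Vec<String>
where
    F: FnMut(&str) -> (Vec<String>, String), // (jobs, next_page_token)
{
    let mut all = Vec::new();
    let mut token = String::new();
    loop {
        let (page, next) = fetch(&token);
        all.extend(page);
        if next.is_empty() {
            break; // empty token signals the last page
        }
        token = next;
    }
    all
}

fn main() {
    // Fake server serving two pages.
    let pages: Vec<(Vec<String>, String)> = vec![
        (vec!["job-a".into(), "job-b".into()], "t1".to_string()),
        (vec!["job-c".into()], String::new()),
    ];
    let mut i = 0;
    let jobs = list_all(|_token| { let p = pages[i].clone(); i += 1; p });
    assert_eq!(jobs, vec!["job-a", "job-b", "job-c"]);
}
```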
SubmitJob error codes
| gRPC status | Meaning | Client action |
|---|---|---|
| OK | Job accepted and committed | — |
| FAILED_PRECONDITION | Node is not the leader | Redirect to the node ID in the message |
| RESOURCE_EXHAUSTED | Leader proposal queue is full (>256 pending) | Retry with exponential backoff |
| DEADLINE_EXCEEDED | Raft did not commit the entry within 5 seconds | Retry; may indicate a degraded cluster |
| UNAVAILABLE | Node is draining, or the Raft loop has stopped | Retry on a different node |
| INVALID_ARGUMENT | Empty command string, or command exceeds 1024 bytes | Fix the request |
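A client-side sketch of the retry policy the table implies: capped exponential backoff for the transient codes, no retry otherwise. The 100 ms base and 5 s cap are illustrative choices, not values from the codebase:

```rust
use std::time::Duration;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Code {
    Ok,
    FailedPrecondition,
    ResourceExhausted,
    DeadlineExceeded,
    Unavailable,
    InvalidArgument,
}

/// Return Some(delay) if this failure is worth retrying after a backoff,
/// None if the caller should do something else (redirect, fix the request).
fn retry_after(code: Code, attempt: u32) -> Option<Duration> {
    match code {
        Code::ResourceExhausted | Code::DeadlineExceeded | Code::Unavailable => {
            // Exponential backoff: 100 ms, 200 ms, 400 ms, ... capped at 5 s.
            let ms = (100u64 << attempt.min(6)).min(5_000);
            Some(Duration::from_millis(ms))
        }
        // OK needs no retry; FAILED_PRECONDITION means redirect to the leader;
        // INVALID_ARGUMENT means the request itself must change.
        _ => None,
    }
}

fn main() {
    assert_eq!(retry_after(Code::ResourceExhausted, 0), Some(Duration::from_millis(100)));
    assert_eq!(retry_after(Code::ResourceExhausted, 2), Some(Duration::from_millis(400)));
    assert_eq!(retry_after(Code::InvalidArgument, 0), None);
}
```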
CancelJob error codes
| gRPC status | Meaning | Client action |
|---|---|---|
| OK | Job cancelled and committed | — |
| FAILED_PRECONDITION | Node is not the leader, or job is already in a terminal state | Redirect to leader / check job status |
| NOT_FOUND | Job ID does not exist | — |
| RESOURCE_EXHAUSTED | Leader proposal queue is full | Retry with exponential backoff |
| DEADLINE_EXCEEDED | Raft did not commit within 5 seconds | Retry |
| INVALID_ARGUMENT | Malformed job UUID | Fix the request |
InternalService (node-to-node, not client-facing)
| Method | Description |
|---|---|
| GetJobOutput(job_id) | Fetch job output from the node that executed it |
| WorkerHeartbeat(node_id) | Worker liveness signal sent every 2 s to the leader; auto-registers on first call; workers not seen for 5 s are excluded from job assignment |
| ForwardJobStatus(updates) | Follower worker forwards completed job status to the leader for Raft replication |
RaftService (node-to-node, consensus protocol)
| Method | Description |
|---|---|
| AppendEntries | Log replication and heartbeats |
| RequestVote | Leader election voting |
| TimeoutNow | Trigger immediate election on the target node (used by TransferLeadership) |
| InstallSnapshot | Transfer compacted state to slow followers |
Architecture
Cluster Overview
graph LR
subgraph Clients
CLI[CLI]
REST[REST]
GRPC[gRPC]
end
subgraph Cluster[3-Node Raft Cluster]
N1[Node 1<br/>gRPC :50051<br/>Dashboard :8081]
N2[Node 2<br/>gRPC :50052<br/>Dashboard :8082]
N3[Node 3<br/>gRPC :50053<br/>Dashboard :8083]
N1 <-->|Raft RPCs| N2
N1 <-->|Raft RPCs| N3
N2 <-->|Raft RPCs| N3
end
subgraph Docker[Docker Containers]
D1[Container]
D2[Container]
D3[Container]
end
CLI -->|Submit to leader| N1
REST -->|HTTP API| N1
GRPC -->|gRPC API| N1
N1 -->|docker run| D1
N2 -->|docker run| D2
N3 -->|docker run| D3
classDef leader fill:#90EE90,stroke:#006400,stroke-width:2px
classDef follower fill:#FFE4B5,stroke:#FF8C00,stroke-width:2px
class N1 leader
class N2,N3 follower
Node Internal Architecture
graph TB
subgraph External[External APIs]
GRPC[gRPC Server<br/>SubmitJob, GetStatus, ListJobs]
DASH[Dashboard<br/>REST API + Web UI]
end
subgraph Core[Core Components]
RAFT[Raft Module<br/>Leader Election<br/>Log Replication]
LOG[Raft Log<br/>In-Memory or RocksDB]
end
subgraph Loops[Background Loops]
SCHED[Scheduler Loop<br/>Event-driven<br/>Applies commits<br/>Assigns jobs on leader]
WORKER[Worker Loop<br/>Event-driven + 2s heartbeat<br/>Executes assigned jobs]
end
QUEUE[Job Queue<br/>State Machine]
DOCKER[Docker Container<br/>Sandboxed Execution]
%% External to Core
GRPC -->|Commands| RAFT
DASH -->|Commands| RAFT
DASH -->|Query| QUEUE
%% Core
RAFT <--> LOG
%% Loops subscribe/interact
SCHED -.->|Subscribe commits| RAFT
SCHED -->|Apply entries & Assign jobs| QUEUE
WORKER -->|Notified & Update| QUEUE
WORKER -->|Execute| DOCKER
classDef api fill:#87CEEB,stroke:#4682B4
classDef core fill:#90EE90,stroke:#006400
classDef loop fill:#FFE4B5,stroke:#FF8C00
classDef state fill:#E6E6FA,stroke:#9370DB
class GRPC,DASH api
class RAFT,LOG core
class SCHED,WORKER loop
class QUEUE,DOCKER state
Key Points:
- Every node runs all components (gRPC, Dashboard, Raft, Scheduler, Worker)
- Only the leader’s Scheduler Loop assigns jobs; followers just apply committed entries
- Job assignments travel through Raft (AssignJob command) so every node’s queue reflects the same state
- Workers send WorkerHeartbeat gRPC to the leader every 2 s to signal liveness; workers that miss heartbeats for more than 5 s are excluded from job assignment
- Workers discover assigned jobs by querying the local job queue directly (no node-local in-memory map needed)
- Job output is stored only on the executing node (fetched via GetJobOutput RPC when queried)
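The liveness rule for assignment can be sketched as a filter over last-heartbeat timestamps with the 5 s window from the text (the map-of-Instants data structure is illustrative):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Workers whose last heartbeat is older than `window` are excluded
/// from assignment. Sketch of the rule only, not the actual code.
fn live_workers(last_seen: &HashMap<u64, Instant>, now: Instant, window: Duration) -> Vec<u64> {
    let mut live: Vec<u64> = last_seen
        .iter()
        .filter(|(_, t)| now.duration_since(**t) <= window)
        .map(|(id, _)| *id)
        .collect();
    live.sort(); // deterministic ordering for display/tests
    live
}

fn main() {
    let now = Instant::now();
    let mut seen = HashMap::new();
    seen.insert(1u64, now - Duration::from_secs(1)); // fresh heartbeat
    seen.insert(2, now - Duration::from_secs(9));    // stale, beyond 5 s
    // Only node 1 remains eligible for assignment.
    assert_eq!(live_workers(&seen, now, Duration::from_secs(5)), vec![1]);
}
```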
Data Flow
sequenceDiagram
participant Client as CLI Client
participant N1 as Node 1 (Leader)<br/>gRPC Server
participant Raft1 as Node 1<br/>Raft Module
participant N2 as Node 2 (Follower)<br/>Raft Module
participant N3 as Node 3 (Follower)<br/>Raft Module
participant Log1 as Node 1<br/>Raft Log
participant Queue1 as Node 1<br/>Job Queue
participant Sched as Scheduler Loop<br/>(Leader Only)
participant Worker as Worker Loop<br/>(Node 2)
participant Docker as Docker Container
Note over Client,Docker: Job Submission Flow (Write Operation)
Client->>N1: 1. SubmitJob("echo hello")
N1->>N1: 2. Create Job object<br/>job_id = uuid
N1->>Raft1: 3. Propose SubmitJob command
Raft1->>Log1: 4. Append to local log<br/>(index=N, term=T)
par Replicate to Followers
Raft1->>N2: 5a. AppendEntries RPC<br/>(log entry)
Raft1->>N3: 5b. AppendEntries RPC<br/>(log entry)
end
N2-->>Raft1: 6a. ACK (success)
N3-->>Raft1: 6b. ACK (success)
Note over Raft1: Majority reached (2/3)
Raft1-->>N1: 7. Commit confirmed
N1->>Queue1: 8. Add job to queue<br/>Status: PENDING
N1-->>Client: 9. Response: job_id
Note over Client,Docker: Job Assignment Flow (Leader Scheduler Loop - event-driven)
Sched->>Raft1: 10. Subscribe to commits
Raft1-->>Sched: 11. Commit notification
Sched->>Queue1: 12. Apply committed entry<br/>(idempotent add)
Sched->>Queue1: 13. Check pending jobs
Queue1-->>Sched: 14. Job found (PENDING)
Sched->>Sched: 15. Select least-loaded live worker<br/>(Node 2, via WorkerHeartbeat liveness)
Sched->>Queue1: 16. Optimistic local assign<br/>Status: RUNNING (leader only)
Sched->>Raft1: 17. Propose AssignJob(job_id, worker=2)
par Replicate AssignJob to Followers
Raft1->>N2: 18a. AppendEntries (AssignJob)
Raft1->>N3: 18b. AppendEntries (AssignJob)
end
N2-->>Raft1: 19a. ACK
N3-->>Raft1: 19b. ACK
Note over N2,N3: Followers apply AssignJob<br/>Status: RUNNING on all nodes
Note over Client,Docker: Job Execution Flow (Worker Loop - notified on assignment)
Worker->>Queue1: 20. Notified, query jobs_assigned_to(node2)
Queue1-->>Worker: 21. Job assigned to me<br/>(RUNNING status)
Worker->>Docker: 22. docker run alpine:latest<br/>--network=none --read-only<br/>--memory=256m --cpus=0.5<br/>echo hello
Docker-->>Worker: 23. Output: "hello\n"<br/>Exit code: 0
Worker->>Queue1: 24. Update local job result<br/>Status: COMPLETED<br/>Output stored locally
Note over Worker,Raft1: Follower worker calls ForwardJobStatus → leader proposes to Raft
Worker->>Raft1: 25. ForwardJobStatus (job done, exit=0)
Raft1->>N2: 26. AppendEntries (UpdateJobStatus)
Raft1->>N3: 26. AppendEntries (UpdateJobStatus)
Note over Client,Docker: Job Status Query Flow (Read Operation)
Client->>N1: 27. GetJobStatus(job_id)
N1->>Queue1: 28. Query job queue
Queue1-->>N1: 29. Job metadata<br/>(executed_by: Node 2)
N1->>N2: 30. GetJobOutput RPC<br/>(fetch from executor)
N2-->>N1: 31. Output: "hello"
N1-->>Client: 32. Response: COMPLETED<br/>Output: "hello"
Note over Client,Docker: Leader Election Flow (Failure Scenario)
Note over Raft1: Leader crashes!
Note over N2,N3: Election timeout (150-300ms)<br/>No heartbeat received
N2->>N2: Election timeout triggered
N2->>N2: Increment term, become candidate
N2->>N3: RequestVote RPC (term=T+1)
N3-->>N2: VoteGranted
Note over N2: Won election with majority
N2->>N2: Become Leader
Note over N2: Scheduler loop now active<br/>(assigns jobs)
Note over N2,N3: Node 2 now leads cluster
Security
mTLS
All gRPC communication (node-to-node, client-to-node) can be secured with mutual TLS:
- Both parties authenticate via certificates signed by cluster CA
- All traffic encrypted with TLS 1.2+
- Generate certs:
./scripts/gen-test-certs.sh ./certs
Docker Sandboxing
Jobs run in isolated containers with:
| Restriction | Setting |
|---|---|
| Network | --network=none |
| Capabilities | --cap-drop=ALL |
| Filesystem | --read-only |
| Privileges | --security-opt=no-new-privileges |
| Memory | --memory=256m |
| CPU | --cpus=0.5 |
| Wall-clock timeout | 30s (hardcoded; jobs exceeding this are killed and marked Failed) |
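The table above can be read as the argument list the executor passes to `docker run`. The sketch below is illustrative only (the function name and structure are assumptions, not the actual nomad-lite executor code), but each flag matches the table:

```rust
// Illustrative sketch (not the actual executor code): the sandbox
// restrictions from the table above, expressed as `docker run` arguments.
fn sandbox_args(cmd: &str) -> Vec<String> {
    [
        "run", "--rm",
        "--network=none",                   // no network access
        "--cap-drop=ALL",                   // drop all Linux capabilities
        "--read-only",                      // read-only root filesystem
        "--security-opt=no-new-privileges", // no privilege escalation
        "--memory=256m",                    // memory cap
        "--cpus=0.5",                       // CPU cap
        "alpine:latest", "sh", "-c",
    ]
    .iter()
    .map(|s| s.to_string())
    .chain(std::iter::once(cmd.to_string()))
    .collect()
}

fn main() {
    // The 30s wall-clock timeout is enforced by the worker, not Docker,
    // so it does not appear in the argument list.
    println!("docker {}", sandbox_args("echo hello").join(" "));
}
```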
Raft Consensus
Timing
- Election timeout: 150-300ms (randomized)
- Heartbeat interval: 50ms
- Peer liveness window: 3s — a node is shown as `is_alive: false` in `GetClusterStatus` if no successful AppendEntries/heartbeat has been received from it within this window
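The liveness rule reduces to a single time comparison. A minimal sketch (names are illustrative, not the actual nomad-lite types):

```rust
use std::time::{Duration, Instant};

// Sketch of the peer-liveness rule: a peer counts as alive if its last
// successful heartbeat falls inside the 3-second window.
const LIVENESS_WINDOW: Duration = Duration::from_secs(3);

fn is_alive(last_heartbeat: Instant, now: Instant) -> bool {
    now.duration_since(last_heartbeat) <= LIVENESS_WINDOW
}

fn main() {
    let now = Instant::now();
    let fresh = now - Duration::from_millis(50); // heartbeat 50ms ago
    let stale = now - Duration::from_secs(5);    // heartbeat 5s ago
    println!("fresh alive: {}", is_alive(fresh, now)); // true
    println!("stale alive: {}", is_alive(stale, now)); // false
}
```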
Log Replication
- Client sends command to leader
- Leader appends to log and replicates via `AppendEntries`
- Majority acknowledgment → committed
- Applied to state machine
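The commit decision in step 3 is a simple majority test. A minimal sketch (the function name is illustrative):

```rust
// Sketch of the commit rule: an entry is committed once a strict majority
// of the cluster (leader included) has acknowledged it.
fn is_committed(acks_including_leader: usize, cluster_size: usize) -> bool {
    acks_including_leader > cluster_size / 2
}

fn main() {
    // In a 3-node cluster, the leader plus one follower ACK is a majority.
    println!("{}", is_committed(2, 3)); // true
    println!("{}", is_committed(1, 3)); // false
}
```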
Proposal backpressure
The leader’s proposal channel holds at most 256 pending commands. If the channel is full, SubmitJob returns RESOURCE_EXHAUSTED immediately — the caller should retry with exponential backoff. Additionally, if the Raft loop accepts the command but quorum is lost before the entry commits, SubmitJob returns DEADLINE_EXCEEDED after 5 seconds. Both guarantees ensure clients always get a bounded response rather than hanging indefinitely.
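The two failure modes can be sketched with a bounded standard-library channel standing in for the real proposal queue (capacity and timeout are shrunk here for illustration; the mapping to gRPC status codes is described above, the channel itself is an assumption of this sketch):

```rust
use std::sync::mpsc::{sync_channel, RecvTimeoutError, TrySendError};
use std::time::Duration;

fn main() {
    let (tx, rx) = sync_channel::<u64>(2); // real queue holds 256

    tx.try_send(1).unwrap();
    tx.try_send(2).unwrap();
    // Queue full: the server maps this to gRPC RESOURCE_EXHAUSTED,
    // returned immediately so the client can retry with backoff.
    match tx.try_send(3) {
        Err(TrySendError::Full(_)) => println!("RESOURCE_EXHAUSTED: retry with backoff"),
        _ => unreachable!(),
    }

    // Drain, then wait for a commit notification that never arrives:
    // after the timeout the server returns DEADLINE_EXCEEDED
    // (5 s in the real server, 10 ms here).
    while rx.try_recv().is_ok() {}
    match rx.recv_timeout(Duration::from_millis(10)) {
        Err(RecvTimeoutError::Timeout) => println!("DEADLINE_EXCEEDED: commit stalled"),
        _ => unreachable!(),
    }
}
```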
State Machine Commands
Each log entry carries one of the following commands applied to the job queue on every node:
| Command | Description |
|---|---|
| `SubmitJob` | Adds a new job in Pending state |
| `AssignJob` | Assigns a pending job to a specific worker, setting it to Running on all nodes |
| `RegisterWorker` | Records a worker in the cluster-wide registry |
| `UpdateJobStatus` / `BatchUpdateJobStatus` | Marks one or more jobs Completed or Failed |
AssignJob is proposed by the leader’s scheduler loop after it selects the least-loaded live worker. Because this command travels through the Raft log, every follower applies the same assignment atomically — workers discover their jobs by querying the local job queue rather than a node-local in-memory map.
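Because every node applies the same committed commands in log order, all replicas converge on identical queue contents. The sketch below illustrates that apply loop; the enum variants mirror the table, but the field names and `JobQueue` shape are assumptions, not the actual nomad-lite types:

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Status { Pending, Running, Completed, Failed }

// Illustrative subset of the state-machine commands from the table above.
enum Command {
    SubmitJob { job_id: u64 },
    AssignJob { job_id: u64, worker: u64 },
    UpdateJobStatus { job_id: u64, status: Status },
}

#[derive(Default)]
struct JobQueue {
    jobs: HashMap<u64, (Status, Option<u64>)>, // job_id -> (status, assigned worker)
}

impl JobQueue {
    // Applying commands in log order keeps every replica identical;
    // workers then discover their jobs by querying this local queue.
    fn apply(&mut self, cmd: Command) {
        match cmd {
            Command::SubmitJob { job_id } => {
                self.jobs.entry(job_id).or_insert((Status::Pending, None));
            }
            Command::AssignJob { job_id, worker } => {
                if let Some(j) = self.jobs.get_mut(&job_id) {
                    *j = (Status::Running, Some(worker));
                }
            }
            Command::UpdateJobStatus { job_id, status } => {
                if let Some(j) = self.jobs.get_mut(&job_id) {
                    j.0 = status;
                }
            }
        }
    }
}

fn main() {
    let mut q = JobQueue::default();
    q.apply(Command::SubmitJob { job_id: 1 });
    q.apply(Command::AssignJob { job_id: 1, worker: 2 });
    q.apply(Command::UpdateJobStatus { job_id: 1, status: Status::Completed });
    println!("{:?}", q.jobs[&1]); // (Completed, Some(2))
}
```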
Log Compaction
When the in-memory log exceeds 1000 entries, the committed prefix is replaced with a snapshot of the current state (job queue + worker registrations). This bounds memory usage regardless of throughput. Followers that fall too far behind receive the snapshot via InstallSnapshot RPC instead of replaying individual entries.
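The truncation step can be sketched as follows; the struct and its fields are illustrative stand-ins (the real snapshot holds the job queue and worker registrations, not a string vector):

```rust
// Sketch of the compaction rule: once the in-memory log exceeds the
// threshold, the committed prefix is folded into a snapshot.
const COMPACTION_THRESHOLD: usize = 1000;

struct Log {
    snapshot_last_index: u64, // highest log index covered by the snapshot
    entries: Vec<String>,     // entries after the snapshot
}

impl Log {
    fn maybe_compact(&mut self, commit_index: u64) {
        if self.entries.len() <= COMPACTION_THRESHOLD {
            return; // below threshold: keep replicating via AppendEntries
        }
        // Number of committed entries currently held in memory.
        let committed = (commit_index - self.snapshot_last_index) as usize;
        self.entries.drain(..committed); // drop the committed prefix
        self.snapshot_last_index = commit_index;
    }
}

fn main() {
    let mut log = Log {
        snapshot_last_index: 0,
        entries: (1..=1500).map(|i| i.to_string()).collect(),
    };
    log.maybe_compact(1200);
    println!("kept {} entries, snapshot covers up to index {}",
             log.entries.len(), log.snapshot_last_index);
}
```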
State Persistence
By default, all Raft state (log, term, voted-for, snapshot) is held in memory and lost on restart.
Pass --data-dir <path> to enable a local RocksDB store. On startup the node reloads its persisted
term, voted-for, and committed log entries and replays any stored snapshot into the job queue before
rejoining the cluster. Storage failures cause an immediate crash (fail-stop) rather than risk silent
data loss.
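The persistence contract can be sketched as a storage trait with two implementations: an in-memory one (the default, lost on restart) and a RocksDB-backed one (enabled by `--data-dir`). The trait and method names below are illustrative, not the actual nomad-lite API; only the in-memory stand-in is shown:

```rust
// Sketch of the persistence contract (names are assumptions): hard state
// is the Raft term plus the candidate this node voted for.
trait RaftStorage {
    fn save_hard_state(&mut self, term: u64, voted_for: Option<u64>);
    /// Returns None on a fresh store (first boot or in-memory mode).
    fn load_hard_state(&self) -> Option<(u64, Option<u64>)>;
}

// In-memory stand-in: state vanishes on restart, matching the default
// (no --data-dir) mode. The RocksDB impl would persist the same values.
#[derive(Default)]
struct MemStorage {
    hard_state: Option<(u64, Option<u64>)>,
}

impl RaftStorage for MemStorage {
    fn save_hard_state(&mut self, term: u64, voted_for: Option<u64>) {
        self.hard_state = Some((term, voted_for));
    }
    fn load_hard_state(&self) -> Option<(u64, Option<u64>)> {
        self.hard_state
    }
}

fn main() {
    let mut s = MemStorage::default();
    println!("fresh store: {:?}", s.load_hard_state()); // None
    s.save_hard_state(7, Some(2));
    println!("after save: {:?}", s.load_hard_state()); // Some((7, Some(2)))
}
```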
Safety Guarantees
- Election safety: One leader per term
- Leader append-only: Never overwrites log
- Log matching: Same index and term ⇒ identical entries and identical logs up to that point
- Leader completeness: Committed entries appear in the logs of all future leaders
Cluster Sizing
| Nodes | Majority | Fault Tolerance |
|---|---|---|
| 3 | 2 | 1 failure |
| 5 | 3 | 2 failures |
| 7 | 4 | 3 failures |
Use odd cluster sizes: an even count raises the majority requirement without improving fault tolerance (4 nodes tolerate 1 failure, same as 3).
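The table follows directly from majority quorum: majority = ⌊n/2⌋ + 1, and the tolerable failures are whatever remains. A minimal sketch:

```rust
// The sizing table above, derived from majority quorum.
fn majority(n: usize) -> usize {
    n / 2 + 1
}

fn fault_tolerance(n: usize) -> usize {
    n - majority(n)
}

fn main() {
    for n in [3, 4, 5, 7] {
        println!("nodes={} majority={} tolerates={}", n, majority(n), fault_tolerance(n));
    }
    // 4 nodes tolerate only 1 failure, same as 3; hence odd sizes.
}
```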
Testing
cargo test # Run all tests
cargo test --lib # Unit tests only
cargo test --test <name> # Specific test suite
Test Suites
| Suite | Description |
|---|---|
| `lib` (unit) | Config, Raft state machine, RPC serialization |
| `scheduler_tests` | Job queue, worker assignment, heartbeats |
| `raft_rpc_tests` | AppendEntries, RequestVote, InstallSnapshot, term handling, compaction boundary cases, re-election vote reset |
| `integration_tests` | Multi-node election, replication, consistency |
| `failover_tests` | Leader crash, re-election, quorum loss |
| `partition_tests` | Network partitions, split-brain prevention, healing |
| `chaos_tests` | Rapid leader churn, network flapping, cascading failures, full isolation recovery |
| `tls_tests` | mTLS certificate loading, encrypted cluster communication |
| `executor_tests` | Docker sandbox command execution |
| `internal_service_tests` | Internal node-to-node API, job output fetching, ForwardJobStatus to leader |
| `distribution_tests` | Jobs spread across nodes (not just leader), WorkerHeartbeat RPC, dead-worker exclusion |
| `dashboard_tests` | REST API endpoints |
| `leadership_transfer_tests` | Voluntary leadership transfer, auto-select, non-leader rejection |
| `drain_tests` | Node draining, job rejection during drain, leadership handoff |
| `compaction_tests` | Log compaction trigger, snapshot transfer to slow followers, multiple compaction rounds, AppendEntries fallback when below compaction threshold, state consistency |
| `backpressure_tests` | Full proposal channel returns RESOURCE_EXHAUSTED immediately; stalled Raft loop returns DEADLINE_EXCEEDED after 5 s |
| `persistence_tests` | RocksDB storage: fresh DB returns None, hard state/log reload, log truncation, snapshot compaction deletes old entries, post-compaction state reconstruction on restart, node term survives restart |