API Reference
REST API (Dashboard)
The web dashboard exposes a REST API for job and cluster management.
Get cluster status:
curl http://localhost:8081/api/cluster
# Response:
# {
# "node_id": 1,
# "role": "leader",
# "current_term": 5,
# "leader_id": 1,
# "commit_index": 3,
# "last_applied": 3,
# "log_length": 3,
# "nodes": [
# { "node_id": 1, "address": "0.0.0.0:50051", "is_alive": true },
# { "node_id": 2, "address": "127.0.0.1:50052", "is_alive": true },
# { "node_id": 3, "address": "127.0.0.1:50053", "is_alive": false }
# ]
# }
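A client will usually decode this response and act on only a few fields. The following is a minimal Python sketch, using only the standard library, that parses a payload shaped like the sample above and separates alive peers from dead ones. The helper name `summarize_cluster` is hypothetical and not part of the project.

```python
import json


def summarize_cluster(payload: str) -> dict:
    """Extract the leader and the alive/dead node IDs from an /api/cluster body."""
    status = json.loads(payload)
    nodes = status.get("nodes", [])
    return {
        "leader_id": status.get("leader_id"),
        "is_leader": status.get("role") == "leader",
        "alive": [n["node_id"] for n in nodes if n.get("is_alive")],
        "dead": [n["node_id"] for n in nodes if not n.get("is_alive")],
    }


# Trimmed copy of the sample response shown above.
sample = """
{
  "node_id": 1,
  "role": "leader",
  "leader_id": 1,
  "nodes": [
    {"node_id": 1, "address": "0.0.0.0:50051", "is_alive": true},
    {"node_id": 2, "address": "127.0.0.1:50052", "is_alive": true},
    {"node_id": 3, "address": "127.0.0.1:50053", "is_alive": false}
  ]
}
"""

summary = summarize_cluster(sample)
print(summary["alive"], summary["dead"])
```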
Submit a job:
curl -X POST http://localhost:8081/api/jobs \
-H "Content-Type: application/json" \
-d '{"command": "echo hello"}'
# Response:
# {
# "job_id": "ef319e40-c888-490d-8349-e9c05f78cf5a",
# "status": "pending"
# }
# With a specific Docker image (overrides the server default for this job)
curl -X POST http://localhost:8081/api/jobs \
-H "Content-Type: application/json" \
-d '{"command": "python3 -c '\''print(42)'\''", "image": "python:3.12-alpine"}'
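The same request can be issued from a script. The sketch below builds the POST with Python's standard `urllib`, which avoids the shell quoting needed in the second curl example. The base URL is taken from the examples above; the function names `build_submit_request` and `submit_job` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8081"  # assumption: dashboard address used in the curl examples


def build_submit_request(command, image=None, base_url=BASE_URL):
    """Build the POST /api/jobs request; `image` is sent only when provided."""
    body = {"command": command}
    if image is not None:
        body["image"] = image
    return urllib.request.Request(
        f"{base_url}/api/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(command, image=None):
    """Send the request and return the decoded JSON response (job_id, status)."""
    request = build_submit_request(command, image)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)
```

Calling `submit_job("python3 -c 'print(42)'", image="python:3.12-alpine")` against a running dashboard should correspond to the second curl example.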
Cancel a job:
curl -X DELETE http://localhost:8081/api/jobs/ef319e40-c888-490d-8349-e9c05f78cf5a
# Response (success):
# {
# "success": true,
# "error": null
# }
# Response (already terminal):
# HTTP 400
# {
# "success": false,
# "error": "job is already completed"
# }
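Because a failed cancellation arrives as HTTP 400 with a JSON body, a client has to read the body on the error path as well. The sketch below shows one way to do that with Python's `urllib`, which raises `HTTPError` for 4xx responses. It assumes a successful cancel returns a 2xx status, which the examples above do not state explicitly; the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request


def parse_cancel_response(http_status: int, body: bytes):
    """Return (cancelled, error_message) from a DELETE /api/jobs/<id> response."""
    payload = json.loads(body)
    # Assumption: any 2xx status with "success": true means the job was cancelled.
    if 200 <= http_status < 300 and payload.get("success"):
        return True, None
    return False, payload.get("error") or f"HTTP {http_status}"


def cancel_job(job_id, base_url="http://localhost:8081"):
    request = urllib.request.Request(f"{base_url}/api/jobs/{job_id}", method="DELETE")
    try:
        with urllib.request.urlopen(request, timeout=10) as resp:
            return parse_cancel_response(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the JSON body is still readable from the error.
        return parse_cancel_response(err.code, err.read())
```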
List all jobs:
curl http://localhost:8081/api/jobs
# Response:
# [
# {
# "id": "ef319e40-c888-490d-8349-e9c05f78cf5a",
# "command": "echo hello",
# "status": "completed",
# "executed_by": 1,
# "output": "hello\n",
# "error": null,
# "created_at": "2026-01-28T12:45:41.231558433+00:00",
# "completed_at": "2026-01-28T12:45:41.678341558+00:00"
# }
# ]
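The timestamps in this response carry nanosecond precision, which Python's `datetime` cannot represent and older versions of `datetime.fromisoformat` reject. The sketch below trims the fraction to microseconds before parsing and computes the time from creation to completion (which includes any time spent queued). The helper names are hypothetical.

```python
import re
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, truncating fractional seconds to 6 digits."""
    trimmed = re.sub(r"(\.\d{6})\d+", r"\1", value)
    return datetime.fromisoformat(trimmed)


def job_elapsed_seconds(job: dict):
    """Seconds between created_at and completed_at, or None if not finished."""
    if job.get("completed_at") is None:
        return None
    delta = parse_timestamp(job["completed_at"]) - parse_timestamp(job["created_at"])
    return delta.total_seconds()


job = {
    "created_at": "2026-01-28T12:45:41.231558433+00:00",
    "completed_at": "2026-01-28T12:45:41.678341558+00:00",
}
print(job_elapsed_seconds(job))
```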
Liveness probe:
curl http://localhost:8081/health/live
# Response (always 200 while the process is alive):
# {
# "status": "ok"
# }
Readiness probe:
curl http://localhost:8081/health/ready
# Response when a leader has been elected (200):
# {
# "status": "ok",
# "leader_id": 1
# }
# Response during startup or mid-election (503):
# {
# "status": "no_leader",
# "leader_id": null
# }
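A deployment script can poll the readiness probe until a leader exists before submitting work. The following Python sketch treats 200 as ready and anything else (including a refused connection during startup) as not ready yet. The function names, attempt count, and delay are illustrative assumptions.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_ready(base_url="http://localhost:8081"):
    """Return (http_status, parsed_body) for /health/ready; status 0 means unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health/ready", timeout=5) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # A 503 still carries the JSON body described above.
        return err.code, json.load(err)
    except urllib.error.URLError:
        # The process is not accepting connections yet.
        return 0, {}


def wait_for_leader(fetch=fetch_ready, attempts=30, delay_s=1.0, sleep=time.sleep):
    """Poll until the probe returns 200, then return the reported leader_id."""
    for _ in range(attempts):
        status, body = fetch()
        if status == 200:
            return body["leader_id"]
        sleep(delay_s)
    raise TimeoutError("no leader elected within the polling window")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running cluster.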
gRPC API
SchedulerService (client-facing)
| Method | Description | Leader Only |
|---|---|---|
| SubmitJob(command, image?) | Submit a job; image overrides the server-default Docker image for this job | Yes |
| CancelJob(job_id) | Cancel a pending or running job | Yes |
| GetJobStatus(job_id) | Get job status | No |
| ListJobs(page_size, page_token, status_filter, worker_id_filter, command_filter, created_after_ms, created_before_ms) | List jobs (paginated, filterable) | No |
| StreamJobs() | Stream jobs | No |
| GetClusterStatus() | Cluster info | Forwarded to leader |
| GetRaftLogEntries() | View Raft log entries | Forwarded to leader |
| TransferLeadership(target) | Transfer leadership | Yes |
| DrainNode() | Drain node for maintenance | No |
ListJobs request fields
| Field | Type | Default | Description |
|---|---|---|---|
| page_size | uint32 | 100 | Max results per page (capped at 1000) |
| page_token | string | "" | Token from the previous response for the next page |
| status_filter | JobStatus | UNSPECIFIED | Only return jobs with this status; 0/UNSPECIFIED = no filter |
| worker_id_filter | uint64 | 0 | Only return jobs whose assigned_worker or executed_by matches; 0 = no filter |
| command_filter | string | "" | Case-insensitive substring match on the command; empty = no filter |
| created_after_ms | int64 | 0 | Only return jobs created at or after this Unix timestamp (ms); 0 = no bound |
| created_before_ms | int64 | 0 | Only return jobs created at or before this Unix timestamp (ms); 0 = no bound |
total_count in the response is the number of jobs that match the filters, not the total queue size.
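Paging through ListJobs amounts to feeding each response's token back into the next request. The Python sketch below shows that loop against a caller-supplied `list_jobs` callable standing in for the RPC. Only `page_size`, `page_token`, and `total_count` come from this reference; the response keys `jobs` and `next_page_token`, and the convention that an empty token marks the last page, are assumptions that should be checked against the actual proto definition.

```python
from typing import Callable, Iterator


def iter_all_jobs(list_jobs: Callable[..., dict], page_size: int = 100, **filters) -> Iterator[dict]:
    """Yield every job matching `filters`, following page tokens until exhausted."""
    token = ""  # the documented default: empty token requests the first page
    while True:
        page = list_jobs(page_size=page_size, page_token=token, **filters)
        yield from page["jobs"]  # assumed response key
        token = page.get("next_page_token", "")  # assumed response key
        if not token:
            return
```

Filters such as `command_filter="echo"` pass through unchanged on every page, so the result set stays consistent with the first request.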
SubmitJob error codes
| gRPC status | Meaning | Client action |
|---|---|---|
| OK | Job accepted and committed | None |
| FAILED_PRECONDITION | Node is not the leader | Redirect to the node ID in the message |
| RESOURCE_EXHAUSTED | Leader proposal queue is full (>256 pending) | Retry with exponential backoff |
| DEADLINE_EXCEEDED | Raft did not commit the entry within 5 seconds | Retry; may indicate a degraded cluster |
| UNAVAILABLE | Node is draining, or the Raft loop has stopped | Retry on a different node |
| INVALID_ARGUMENT | Empty command string, or command exceeds 1024 bytes | Fix the request |
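The client actions in this table can be folded into a small decision function plus a backoff helper. The Python sketch below is one possible mapping, not the project's client. The action names, the base delay, and the cap are illustrative assumptions, and applying backoff to DEADLINE_EXCEEDED is a choice made here (the table only says to retry).

```python
import random

# Statuses for which this sketch waits before retrying on the same node.
RETRY_WITH_BACKOFF = {"RESOURCE_EXHAUSTED", "DEADLINE_EXCEEDED"}


def next_action(status: str) -> str:
    """Map a SubmitJob gRPC status name to a client action."""
    if status == "OK":
        return "done"
    if status == "FAILED_PRECONDITION":
        return "redirect_to_leader"
    if status == "UNAVAILABLE":
        return "try_other_node"
    if status in RETRY_WITH_BACKOFF:
        return "retry_with_backoff"
    return "fail"  # INVALID_ARGUMENT and anything unexpected: retrying will not help


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 5.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap_s, base_s * 2 ** attempt))
```

The jitter spreads retries from many clients so they do not all hit a full proposal queue at the same instant.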
CancelJob error codes
| gRPC status | Meaning | Client action |
|---|---|---|
| OK | Job cancelled and committed | None |
| FAILED_PRECONDITION | Node is not the leader, or job is already in a terminal state | Redirect to leader / check job status |
| NOT_FOUND | Job ID does not exist | None |
| RESOURCE_EXHAUSTED | Leader proposal queue is full | Retry with exponential backoff |
| DEADLINE_EXCEEDED | Raft did not commit within 5 seconds | Retry |
| INVALID_ARGUMENT | Malformed job UUID | Fix the request |
InternalService (node-to-node, not client-facing)
| Method | Description |
|---|---|
| GetJobOutput(job_id) | Fetch job output from the node that executed it |
| WorkerHeartbeat(node_id) | Worker liveness signal sent every 2 s to the leader; auto-registers on first call; workers not seen for 5 s are excluded from job assignment |
| ForwardJobStatus(updates) | Follower worker forwards completed job status to the leader for Raft replication |
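The heartbeat rule described for WorkerHeartbeat (register on first call, exclude workers silent for 5 s) can be illustrated with a small in-memory registry. This Python sketch is a conceptual model only, not the project's implementation. The class name is hypothetical, and treating exactly 5 s of silence as already excluded is an assumption about the boundary.

```python
import time

HEARTBEAT_TIMEOUT_S = 5.0  # from the table: workers not seen for 5 s are excluded


class WorkerRegistry:
    """Tracks the last heartbeat time per worker on the leader side."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_seen = {}

    def heartbeat(self, node_id: int) -> None:
        # The first call registers the worker; later calls refresh its timestamp.
        self._last_seen[node_id] = self._clock()

    def assignable_workers(self) -> list:
        """Return IDs of workers heard from within the timeout, in ascending order."""
        now = self._clock()
        return sorted(
            node_id
            for node_id, seen in self._last_seen.items()
            if now - seen < HEARTBEAT_TIMEOUT_S
        )
```

With heartbeats sent every 2 s, a worker can miss two consecutive heartbeats and still remain assignable under this model.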
RaftService (node-to-node, consensus protocol)
| Method | Description |
|---|---|
| AppendEntries | Log replication and heartbeats |
| RequestVote | Leader election voting |
| TimeoutNow | Trigger immediate election on the target node (used by TransferLeadership) |
| InstallSnapshot | Transfer compacted state to slow followers |