
Error Handling

The orchestrator uses a structured error hierarchy so that every failure mode has a clear type, category, and recovery path. Errors are never swallowed: each one triggers a retry, trips a circuit breaker, or terminates the run with a precise reason.

| Class | Module | Key Properties | When Thrown |
|---|---|---|---|
| BudgetExceededError | runner/errors | tokensUsed, budget | Token budget exceeded during workflow |
| WorkflowTimeoutError | runner/errors | workflowId, runId, elapsedMs | Wall-clock time exceeded |
| NodeConfigError | runner/errors | nodeId, nodeType, missingField | Required config missing from a node |
| CircuitBreakerOpenError | runner/errors | nodeId | Node circuit breaker is open |
| EventLogCorruptionError | runner/errors | runId | Missing/corrupt events during recovery |
| UnsupportedNodeTypeError | runner/errors | nodeType | Unknown node type encountered |
| PermissionDeniedError | agent-executor/errors | - | Agent writes to unauthorized keys |
| AgentTimeoutError | agent-executor/errors | - | Agent LLM call exceeds timeout |
| AgentExecutionError | agent-executor/errors | cause | Agent LLM call fails (non-timeout) |
| AgentNotFoundError | agent-factory/errors | - | Agent ID not in registry |
| AgentLoadError | agent-factory/errors | cause | Registry lookup fails (transient) |
| SupervisorConfigError | supervisor-executor/errors | supervisorId | Supervisor missing config |
| SupervisorRoutingError | supervisor-executor/errors | chosenNode, allowedNodes | Supervisor routes to invalid node |
| ArchitectError | architect/errors | - | Graph generation fails after retries |
| MCPGatewayError | mcp/errors | - | MCP gateway unreachable |
| MCPToolExecutionError | mcp/errors | toolName | MCP tool returns error |
| PersistenceUnavailableError | db/persistence-health | - | Consecutive persistence failures exceed threshold |

All errors extend Error and set this.name to their class name, enabling reliable switch(error.name) handling across module boundaries.
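A minimal sketch of this convention follows. The class names and the `tokensUsed` / `budget` properties come from the table above; the constructor signatures, messages, and the `describeError` helper are assumptions made for illustration, not the orchestrator's actual API.

```typescript
// Sketch only: constructor shapes and messages are assumed, not taken from the source.
class BudgetExceededError extends Error {
  readonly tokensUsed: number;
  readonly budget: number;

  constructor(tokensUsed: number, budget: number) {
    super(`Token budget exceeded: ${tokensUsed} used of ${budget}`);
    this.name = "BudgetExceededError"; // matches the class name
    this.tokensUsed = tokensUsed;
    this.budget = budget;
  }
}

class AgentTimeoutError extends Error {
  constructor(message: string = "Agent LLM call exceeded timeout") {
    super(message);
    this.name = "AgentTimeoutError";
  }
}

// Hypothetical helper: switching on error.name does not depend on which copy
// of a module created the error, unlike an instanceof check against a class.
function describeError(error: unknown): string {
  if (!(error instanceof Error)) return "unknown";
  switch (error.name) {
    case "BudgetExceededError":
      return "fatal: budget exhausted";
    case "AgentTimeoutError":
      return "retryable: agent timeout";
    default:
      return `unclassified: ${error.name}`;
  }
}
```

Matching on the `name` string keeps handlers working even when an error crosses a package boundary where two copies of the class may exist.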

  • NodeConfigError — A node is missing required configuration (e.g. agent_id, tool_id, approval_config).
  • SupervisorConfigError — Supervisor node is missing its supervisor_config.
  • UnsupportedNodeTypeError — The graph references a node type the runner doesn’t support.
  • BudgetExceededError — Token budget exhausted. Non-retryable within the same run.
  • WorkflowTimeoutError — Execution exceeded wall-clock limit.
  • CircuitBreakerOpenError — Node failures tripped the breaker. Automatically retries after timeout.
  • AgentTimeoutError — Individual LLM call timed out. Retryable per failure_policy.
  • AgentExecutionError — LLM call failed (API error, rate limit). Retryable per failure_policy.
  • MCPGatewayError — MCP gateway unreachable. Tool adapter falls back to built-in tools.
  • MCPToolExecutionError — Specific MCP tool failed. Retryable depending on tool.
  • EventLogCorruptionError — Event log is missing or corrupt. Cannot safely recover.
  • PersistenceUnavailableError — Database unreachable after consecutive failures. Halts to prevent data loss.

Agent permission errors — security boundary

  • PermissionDeniedError — Agent attempted to write to unauthorized memory keys.
  • SupervisorRoutingError — Supervisor routed to a node outside its managed_nodes.

| Error | Retryable? | Notes |
|---|---|---|
| AgentTimeoutError | Yes | Retried per failure_policy.max_retries |
| AgentExecutionError | Yes | With exponential backoff |
| MCPGatewayError | Yes | Tool adapter retries, then falls back |
| MCPToolExecutionError | Yes | Depends on tool semantics |
| CircuitBreakerOpenError | Auto | Transitions to half-open after timeout |
| NodeConfigError | No | Fix the graph definition |
| UnsupportedNodeTypeError | No | Fix the graph definition |
| BudgetExceededError | No | Budget is exhausted for the run |
| WorkflowTimeoutError | No | Max execution time reached |
| EventLogCorruptionError | No | Manual intervention required |
| PersistenceUnavailableError | No | Halts to prevent data loss |
| PermissionDeniedError | No | Security violation: fix agent permissions |
| SupervisorRoutingError | No | Supervisor bug: fix agent prompt or managed_nodes |
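
The retryability table can be encoded as a lookup keyed by `error.name`. This is a sketch only: the `Retryability` type, the `RETRYABILITY` map, and the `retryabilityOf` helper are hypothetical names, and treating unlisted errors as non-retryable is an assumption rather than documented behavior.

```typescript
type Retryability = "yes" | "auto" | "no";

// Mirrors the retryability table; keys are the values of error.name.
const RETRYABILITY = new Map<string, Retryability>([
  ["AgentTimeoutError", "yes"],
  ["AgentExecutionError", "yes"],
  ["MCPGatewayError", "yes"],
  ["MCPToolExecutionError", "yes"],
  ["CircuitBreakerOpenError", "auto"],
  ["NodeConfigError", "no"],
  ["UnsupportedNodeTypeError", "no"],
  ["BudgetExceededError", "no"],
  ["WorkflowTimeoutError", "no"],
  ["EventLogCorruptionError", "no"],
  ["PersistenceUnavailableError", "no"],
  ["PermissionDeniedError", "no"],
  ["SupervisorRoutingError", "no"],
]);

// Assumption: errors missing from the table are treated as non-retryable,
// which is the conservative default.
function retryabilityOf(error: Error): Retryability {
  return RETRYABILITY.get(error.name) ?? "no";
}
```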

GraphRunner.executeNodeWithRetry() applies the retry policy automatically:

  1. Catch error from node executor
  2. Check retry count against failure_policy.max_retries
  3. If retryable: backoff → retry
  4. If exhausted or fatal: dispatch _fail action
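
The four steps above can be sketched as a small retry loop. This is not the actual `executeNodeWithRetry()` implementation: the function name `executeWithRetry`, the `backoff_base_ms` field, the doubling schedule, and the `onFail` callback (standing in for the `_fail` dispatch) are all assumptions for illustration.

```typescript
interface FailurePolicy {
  max_retries: number;
  // Hypothetical field: base delay used for exponential backoff.
  backoff_base_ms: number;
}

type Sleep = (ms: number) => Promise<void>;

async function executeWithRetry<T>(
  run: () => Promise<T>,
  policy: FailurePolicy,
  isRetryable: (error: unknown) => boolean,
  onFail: (error: unknown) => void,
  sleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (error) {
      // Step 2: compare the retry count against the policy.
      const exhausted = attempt >= policy.max_retries;
      if (exhausted || !isRetryable(error)) {
        onFail(error); // Step 4: stands in for dispatching the _fail action.
        throw error;
      }
      // Step 3: back off, doubling the delay on each attempt, then retry.
      await sleep(policy.backoff_base_ms * 2 ** attempt);
    }
  }
}
```

With `max_retries: 2` the node runs at most three times, and the delays before the second and third attempts are one and two times the base delay. Injecting `sleep` keeps the loop testable without real timers.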

CircuitBreakerManager handles the state machine:

CLOSED → (failures ≥ threshold) → OPEN → (timeout) → HALF-OPEN → (success) → CLOSED

CircuitBreakerOpenError is thrown when the breaker is OPEN and the timeout has not yet elapsed. Once the timeout elapses, the breaker enters HALF-OPEN and allows one probe attempt.
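
The state machine can be illustrated with a small per-node breaker. This is a sketch under stated assumptions, not the `CircuitBreakerManager` code: the `CircuitBreaker` class, its method names, and the injected clock are hypothetical, and reopening the breaker when a probe fails is assumed because the diagram shows only the success path.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreakerOpenError extends Error {
  readonly nodeId: string;
  constructor(nodeId: string) {
    super(`Circuit breaker open for node ${nodeId}`);
    this.name = "CircuitBreakerOpenError";
    this.nodeId = nodeId;
  }
}

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private readonly nodeId: string;
  private readonly threshold: number;
  private readonly timeoutMs: number;
  private readonly now: () => number;

  constructor(nodeId: string, threshold: number, timeoutMs: number, now: () => number = () => Date.now()) {
    this.nodeId = nodeId;
    this.threshold = threshold;
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Call before executing the node.
  beforeCall(): void {
    if (this.state === "CLOSED") return;
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.timeoutMs) {
      this.state = "HALF_OPEN"; // timeout elapsed: let a single probe through
      return;
    }
    // OPEN within the timeout, or a probe is already in flight (assumed behavior).
    throw new CircuitBreakerOpenError(this.nodeId);
  }

  onSuccess(): void {
    this.state = "CLOSED";
    this.failures = 0;
  }

  onFailure(): void {
    this.failures++;
    // Assumption: a failed probe reopens the breaker immediately.
    if (this.state === "HALF_OPEN" || this.failures >= this.threshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```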

Persistence degradation — progressive failure


persistWorkflow() tracks consecutive failures:

  1. 1st failure: log warning, continue
  2. 2nd failure: log warning, continue
  3. 3rd failure (threshold): throw PersistenceUnavailableError → halt workflow

Any success resets the counter to 0.
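
The counter logic can be sketched as a small tracker that a persistence call reports into. The `PersistenceHealth` class, its method names, and the `warn` callback are hypothetical; only the threshold of three consecutive failures, the warnings before it, and the reset on success come from the description above.

```typescript
class PersistenceUnavailableError extends Error {
  constructor(consecutiveFailures: number) {
    super(`Persistence unavailable after ${consecutiveFailures} consecutive failures`);
    this.name = "PersistenceUnavailableError";
  }
}

class PersistenceHealth {
  private consecutiveFailures = 0;
  private readonly threshold: number;
  private readonly warn: (message: string) => void;

  constructor(threshold: number = 3, warn: (message: string) => void = () => {}) {
    this.threshold = threshold;
    this.warn = warn;
  }

  // Any successful write resets the counter to 0.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Failures below the threshold only log a warning; reaching it halts the workflow.
  recordFailure(): void {
    this.consecutiveFailures++;
    if (this.consecutiveFailures >= this.threshold) {
      throw new PersistenceUnavailableError(this.consecutiveFailures);
    }
    this.warn(`persistence failure ${this.consecutiveFailures} of ${this.threshold}`);
  }
}
```

A caller would wrap each write in `try`/`catch`, calling `recordSuccess()` on the happy path and `recordFailure()` in the handler, so the thrown `PersistenceUnavailableError` propagates only on the third consecutive failure.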

GraphRunner.recoverFromEventLog() replays events:

  1. Check for checkpoint (fast path)
  2. If no checkpoint: load all events
  3. If no events: throw EventLogCorruptionError
  4. If no _init event: throw EventLogCorruptionError
  5. Replay events through reducers to reconstruct state
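
The recovery steps can be sketched as a pure function over a checkpoint, an event loader, and a reducer. This is an illustration, not `recoverFromEventLog()` itself: the `WorkflowEvent` shape, the `recoverState` signature, and returning the checkpoint as the full state are assumptions.

```typescript
interface WorkflowEvent {
  type: string;
  payload?: unknown;
}

class EventLogCorruptionError extends Error {
  readonly runId: string;
  constructor(runId: string, reason: string) {
    super(`Event log for run ${runId} is unusable: ${reason}`);
    this.name = "EventLogCorruptionError";
    this.runId = runId;
  }
}

function recoverState<S>(
  runId: string,
  checkpoint: S | null,
  loadEvents: () => WorkflowEvent[],
  reducer: (state: S | undefined, event: WorkflowEvent) => S,
): S {
  // Step 1: fast path. Assumes the checkpoint already holds the full state.
  if (checkpoint !== null) return checkpoint;

  // Step 2: no checkpoint, so load every event for the run.
  const events = loadEvents();

  // Steps 3 and 4: refuse to recover from an empty log or one without _init.
  if (events.length === 0) throw new EventLogCorruptionError(runId, "no events");
  if (!events.some((event) => event.type === "_init")) {
    throw new EventLogCorruptionError(runId, "no _init event");
  }

  // Step 5: replay events through the reducer to rebuild state.
  const state = events.reduce<S | undefined>((acc, event) => reducer(acc, event), undefined);
  return state as S; // events is non-empty, so the reducer ran at least once
}
```
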

Node Executor throws
├─ NodeConfigError / UnsupportedNodeTypeError
│    → GraphRunner catches → dispatch _fail → workflow status = 'failed'
├─ AgentTimeoutError / AgentExecutionError
│    → GraphRunner catches → check retry policy
│       ├─ retries remaining → backoff → re-execute node
│       └─ retries exhausted → dispatch _fail → workflow status = 'failed'
├─ CircuitBreakerOpenError
│    → GraphRunner catches → skip node → advance to fallback edge
├─ PermissionDeniedError
│    → GraphRunner catches → dispatch _fail → workflow status = 'failed'
├─ BudgetExceededError
│    → GraphRunner catches → dispatch _budget_exceeded → workflow status = 'failed'
└─ PersistenceUnavailableError
     → Bubbles up to Worker → Worker marks job as failed

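
The flow above can be expressed as a pure mapping from an error name to the runner's next step. The `Outcome` type and `resolveOutcome` function are hypothetical names used for illustration, and sending unlisted errors to `_fail` is an assumption; only the branches themselves come from the tree.

```typescript
type Outcome =
  | { kind: "retry" }
  | { kind: "dispatch"; action: "_fail" | "_budget_exceeded"; status: "failed" }
  | { kind: "fallback_edge" }
  | { kind: "rethrow" };

const FAIL: Outcome = { kind: "dispatch", action: "_fail", status: "failed" };

function resolveOutcome(errorName: string, retriesRemaining: number): Outcome {
  switch (errorName) {
    case "AgentTimeoutError":
    case "AgentExecutionError":
      // Retry while the policy allows it, otherwise fail the workflow.
      return retriesRemaining > 0 ? { kind: "retry" } : FAIL;
    case "CircuitBreakerOpenError":
      // Skip the node and advance along the fallback edge.
      return { kind: "fallback_edge" };
    case "BudgetExceededError":
      return { kind: "dispatch", action: "_budget_exceeded", status: "failed" };
    case "PersistenceUnavailableError":
      // Not handled by the runner: bubbles up so the Worker can fail the job.
      return { kind: "rethrow" };
    case "NodeConfigError":
    case "UnsupportedNodeTypeError":
    case "PermissionDeniedError":
      return FAIL;
    default:
      // Assumption: errors not shown in the tree fail the workflow.
      return FAIL;
  }
}
```

Keeping the decision separate from the side effects (dispatching, sleeping, rethrowing) makes each branch easy to unit test.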
  • Workflow State — the shared state that errors affect
  • Security — how write_keys and taint tracking enforce zero trust
  • Tracing — correlating errors with distributed traces