
Evaluations

Unit tests check code (does the function crash?). Evals check behavior (did the agent solve the user’s problem?).

Unit tests cover:

  • Scope: reducers, graph routing logic, tool inputs/outputs
  • Tool: Vitest
  • Example: “Given State X + Action Y, does the reducer produce State Z?”
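The “State X + Action Y → State Z” pattern can be sketched as a plain reducer test. The reducer and its state shape below are illustrative, not taken from the MC-AI codebase:

```typescript
// Hypothetical workflow reducer; names (WorkflowState, ADD_ARTIFACT) are
// illustrative examples, not MC-AI's actual types.
interface WorkflowState {
  artifacts: string[];
  totalTokensUsed: number;
}

type Action =
  | { type: "ADD_ARTIFACT"; artifact: string; tokens: number }
  | { type: "RESET" };

function reducer(state: WorkflowState, action: Action): WorkflowState {
  switch (action.type) {
    case "ADD_ARTIFACT":
      // Pure update: append the artifact and accumulate token usage.
      return {
        artifacts: [...state.artifacts, action.artifact],
        totalTokensUsed: state.totalTokensUsed + action.tokens,
      };
    case "RESET":
      return { artifacts: [], totalTokensUsed: 0 };
  }
}

// Given State X + Action Y, does the reducer produce State Z?
const stateX: WorkflowState = { artifacts: [], totalTokensUsed: 0 };
const stateZ = reducer(stateX, {
  type: "ADD_ARTIFACT",
  artifact: "Button.tsx",
  tokens: 120,
});
console.log(stateZ.artifacts.length, stateZ.totalTokensUsed); // prints: 1 120
```

In a Vitest suite the final assertion would live inside an `it(...)` block with `expect(stateZ).toEqual(...)`; the logic under test is identical.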

MC-AI includes an eval framework for running assertions against workflow outputs.

The framework runs against a trusted set of inputs and expected outcomes:

  • Input: “Create a React Button component.”
  • Expected criteria: [“Has TypeScript interfaces”, “Uses proper props”, “No syntax errors”]

Each eval follows three steps:

  1. Run: the system processes the input and produces an output artifact
  2. Evaluate: a separate “Judge Agent” reads the output and the criteria
  3. Score: the judge returns pass/fail and its reasoning
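The run → evaluate → score loop can be sketched as below. Both the workflow and the judge are stubs here: in MC-AI the judge is a separate LLM agent, whereas this stub scores criteria with simple string matching. All names (EvalCase, runEval, judge) are illustrative, not the framework’s actual API:

```typescript
// A single entry in the trusted dataset: an input plus expected criteria.
interface EvalCase {
  input: string;
  criteria: string[];
}

// What the judge returns: pass/fail plus reasoning.
interface Verdict {
  pass: boolean;
  reasoning: string;
}

// Stub system-under-test: a real run would invoke the workflow graph.
function runWorkflow(input: string): string {
  return `interface ButtonProps { label: string }
export function Button({ label }: ButtonProps) { return label; }`;
}

// Stub judge: a real Judge Agent would be prompted with the output and
// criteria. This stand-in only checks the TypeScript-interface criterion.
function judge(output: string, criteria: string[]): Verdict {
  const unmet = criteria.filter(
    (c) => c.includes("TypeScript") && !output.includes("interface"),
  );
  return {
    pass: unmet.length === 0,
    reasoning:
      unmet.length > 0 ? `Unmet: ${unmet.join(", ")}` : "All criteria met",
  };
}

function runEval(testCase: EvalCase): Verdict {
  const artifact = runWorkflow(testCase.input); // 1. Run
  return judge(artifact, testCase.criteria);    // 2–3. Evaluate and score
}

const verdict = runEval({
  input: "Create a React Button component.",
  criteria: ["Has TypeScript interfaces", "Uses proper props", "No syntax errors"],
});
console.log(verdict.pass); // prints: true
```

Keeping the judge separate from the workflow is the important design point: the system that produces the artifact never grades its own output.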
Run the example eval suites from the orchestrator package:

```shell
cd packages/orchestrator
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/linear-completion.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/supervisor-routing.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/hitl-approval.ts
```

See the evals examples for complete eval suite implementations.

Every workflow run tracks token usage and cost:

  • total_tokens_used — cumulative tokens across all nodes
  • total_cost_usd — estimated cost based on model pricing
  • budget_usd — optional per-agent budget cap

Budget enforcement throws BudgetExceededError immediately if costs exceed the configured limit.
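The tracking and enforcement described above can be sketched as follows. The field names mirror the metrics listed (total_tokens_used, total_cost_usd, budget_usd), but the CostTracker class and the pricing figures are illustrative assumptions, not MC-AI’s actual implementation:

```typescript
// Thrown as soon as accumulated cost crosses the configured cap.
class BudgetExceededError extends Error {
  constructor(spent: number, budget: number) {
    super(`Spent $${spent.toFixed(4)} exceeds budget of $${budget.toFixed(4)}`);
    this.name = "BudgetExceededError";
  }
}

// Hypothetical per-run tracker mirroring total_tokens_used / total_cost_usd
// / budget_usd from the docs above.
class CostTracker {
  totalTokensUsed = 0;
  totalCostUsd = 0;

  constructor(private budgetUsd?: number) {}

  // Record one node's usage, then enforce the cap immediately.
  record(tokens: number, usdPerMillionTokens: number): void {
    this.totalTokensUsed += tokens;
    this.totalCostUsd += (tokens / 1_000_000) * usdPerMillionTokens;
    if (this.budgetUsd !== undefined && this.totalCostUsd > this.budgetUsd) {
      throw new BudgetExceededError(this.totalCostUsd, this.budgetUsd);
    }
  }
}

const tracker = new CostTracker(0.01); // cap the agent at one cent
tracker.record(500, 3.0); // 500 tokens at $3/Mtok → $0.0015, under budget
try {
  tracker.record(5_000_000, 3.0); // would cost $15, blows the cap
} catch (e) {
  console.log((e as Error).name); // prints: BudgetExceededError
}
```

Checking inside record, rather than at the end of the run, is what makes this a denial-of-wallet guardrail: the run halts at the first node that crosses the cap instead of finishing an expensive workflow and reporting the overrun afterward.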

Related:

  • Tracing — see workflow execution in real time
  • Security — economic guardrails and denial-of-wallet prevention