Evaluations
Unit tests check code (does the function crash?). Evals check behavior (did the agent solve the user’s problem?).
Unit tests (deterministic)
- Scope: Reducers, graph routing logic, tool inputs/outputs
- Tool: Vitest
- Example: “Given State X + Action Y, does the reducer produce State Z?”
Evals (probabilistic)
MC-AI includes an eval framework for running assertions against workflow outputs.
Dataset
A trusted set of inputs and expected outcomes:
- Input: “Create a React Button component.”
- Expected criteria: [“Has TypeScript interfaces”, “Uses proper props”, “No syntax errors”]
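The entry above might be modeled like this. The `EvalCase` shape is an assumption for illustration, not MC-AI's documented schema:

```typescript
// Illustrative dataset shape: one input plus the criteria a judge will check.
interface EvalCase {
  input: string;
  criteria: string[];
}

const dataset: EvalCase[] = [
  {
    input: "Create a React Button component.",
    criteria: ["Has TypeScript interfaces", "Uses proper props", "No syntax errors"],
  },
];
```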
The evaluator loop
- Run: The system processes the input → output artifact
- Evaluate: A separate “Judge Agent” reads the output and criteria
- Score: The judge returns pass/fail and reasoning
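The three steps above can be sketched as a plain async loop. `runEval`, `Runner`, and `Judge` are hypothetical names; in practice the judge would be an LLM call that reads the output against the criteria:

```typescript
// Illustrative eval-case shape (not MC-AI's actual schema).
interface EvalCase {
  input: string;
  criteria: string[];
}

// The judge returns pass/fail plus its reasoning.
interface Verdict {
  pass: boolean;
  reasoning: string;
}

type Runner = (input: string) => Promise<string>;
type Judge = (output: string, criteria: string[]) => Promise<Verdict>;

async function runEval(cases: EvalCase[], run: Runner, judge: Judge): Promise<Verdict[]> {
  const verdicts: Verdict[] = [];
  for (const c of cases) {
    const output = await run(c.input); // 1. Run: input → output artifact
    verdicts.push(await judge(output, c.criteria)); // 2–3. Evaluate and score
  }
  return verdicts;
}
```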
Running evals
```sh
cd packages/orchestrator
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/linear-completion.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/supervisor-routing.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/hitl-approval.ts
```

See the evals examples for complete eval suite implementations.
Cost tracking
Every workflow run tracks token usage and cost:
- `total_tokens_used` — cumulative tokens across all nodes
- `total_cost_usd` — estimated cost based on model pricing
- `budget_usd` — optional per-agent budget cap
Budget enforcement throws `BudgetExceededError` immediately if costs exceed the configured limit.
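A minimal sketch of that enforcement, assuming a simple cost accumulator. Only `BudgetExceededError` and the field names come from the docs above; `recordCost` and the check placement are illustrative:

```typescript
class BudgetExceededError extends Error {}

// Field names mirror the tracked values above; the struct itself is assumed.
interface CostState {
  total_cost_usd: number;
  budget_usd?: number;
}

function recordCost(state: CostState, deltaUsd: number): CostState {
  const total_cost_usd = state.total_cost_usd + deltaUsd;
  // Throw immediately once the optional per-agent cap is exceeded.
  if (state.budget_usd !== undefined && total_cost_usd > state.budget_usd) {
    throw new BudgetExceededError(
      `cost $${total_cost_usd.toFixed(4)} exceeds budget $${state.budget_usd}`,
    );
  }
  return { ...state, total_cost_usd };
}
```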