Evaluations
Unit tests check code — does the function crash? Evals check behavior — did the workflow produce the right result? cycgraph includes a built-in eval framework for defining test cases, running workflows, and asserting on the final state.
Quick start
Define a suite, run it, and inspect the report:
import { runEval, EvalSuite } from '@cycgraph/orchestrator';

const suite: EvalSuite = {
  name: 'My First Eval',
  cases: [
    {
      name: 'Research pipeline completes',
      graph: myGraph,
      input: { goal: 'Summarize recent AI news' },
      assertions: [
        { type: 'status_equals', expected: 'completed' },
        { type: 'node_visited', node_id: 'researcher' },
        { type: 'memory_contains', key: 'summary' },
      ],
    },
  ],
};

const report = await runEval(suite);

console.log(`Score: ${report.overall_score}`); // 0.0–1.0
console.log(`Passed: ${report.passed}/${report.total}`);

How it works
For each case in the suite:
- Build state: `goal`, `constraints`, and `max_token_budget` are extracted from `input`. The entire `input` object is seeded into `memory`.
- Run workflow: a `GraphRunner` executes the graph to completion (or failure/timeout).
- Assert: each assertion is checked against the final `WorkflowState`.
- Score: case score = passed assertions / total assertions. Overall score = mean of all case scores.
Cases run sequentially to avoid LLM provider contention. If a workflow crashes, the case gets a score of 0 and an error field — other cases continue unaffected.
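The per-case loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not cycgraph's actual implementation: `SketchCase`, `SketchResult`, `ExecuteFn`, and `runCasesSequentially` are hypothetical names, and the "run workflow, then check assertions" steps are collapsed into a caller-supplied function. The score for an empty suite is also an assumption, since the text does not define it.

```typescript
// Hypothetical, simplified shapes for illustration only (not the cycgraph types).
interface SketchCase {
  name: string;
  input: Record<string, unknown>;
}

interface SketchResult {
  name: string;
  passed: boolean;
  score: number;
  error?: string;
}

// Stand-in for "run the workflow, then check every assertion":
// resolves to one boolean per assertion, or rejects if the workflow crashes.
type ExecuteFn = (input: Record<string, unknown>) => Promise<boolean[]>;

async function runCasesSequentially(
  cases: SketchCase[],
  execute: ExecuteFn,
): Promise<{ results: SketchResult[]; overallScore: number }> {
  const results: SketchResult[] = [];

  // Awaiting inside a plain for...of runs one case at a time (no Promise.all).
  for (const c of cases) {
    try {
      const outcomes = await execute(c.input);
      const passedCount = outcomes.filter(Boolean).length;
      // A case with zero assertions counts as trivially passing (score 1.0).
      const score = outcomes.length === 0 ? 1 : passedCount / outcomes.length;
      results.push({ name: c.name, passed: passedCount === outcomes.length, score });
    } catch (err) {
      // A crash scores 0 for this case only; the loop moves on to the next case.
      const message = err instanceof Error ? err.message : String(err);
      results.push({ name: c.name, passed: false, score: 0, error: message });
    }
  }

  // Assumption: an empty suite scores 0.
  const overallScore =
    results.length === 0
      ? 0
      : results.reduce((sum, r) => sum + r.score, 0) / results.length;

  return { results, overallScore };
}
```

The `try`/`catch` inside the loop is what keeps one crashing workflow from affecting the remaining cases.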
Assertion types
status_equals
Check the workflow’s final status:
{ type: 'status_equals', expected: 'completed' }
{ type: 'status_equals', expected: 'waiting' } // for HITL workflows

node_visited
Verify a specific node executed:
{ type: 'node_visited', node_id: 'researcher' }

memory_contains
Check that a key exists in the final state memory:
{ type: 'memory_contains', key: 'summary' }

memory_matches
Inspect a memory value with three matching modes:
// Exact match (JSON equality)
{ type: 'memory_matches', key: 'count', mode: 'exact', expected: 42, pattern: '' }

// Substring match
{ type: 'memory_matches', key: 'output', mode: 'contains', expected: 'hello', pattern: '' }

// Regex match (against stringified value)
{ type: 'memory_matches', key: 'output', mode: 'regex', pattern: '^hello\\s\\w+$' }

token_budget_respected
Verify the workflow stayed within its token budget:
{ type: 'token_budget_respected' }

llm_judge
Use an LLM evaluator agent to score the output against criteria. This is the only probabilistic assertion; all others are deterministic.
{
  type: 'llm_judge',
  criteria: 'Is the summary accurate, well-structured, and under 300 words?',
  threshold: 0.75, // minimum passing score (0.0–1.0)
  evaluator_agent_id: EVALUATOR_ID, // UUID of a registered evaluator agent
}

The evaluator agent calls `generateText()` with a structured output schema and returns a score (0.0–1.0), reasoning, and optional suggestions. The assertion passes if `score >= threshold`.
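To make the deterministic assertion types above concrete, here is a minimal sketch of how such checks could be evaluated against a final state. It is not the library's implementation. The state shape (`status`, `visited_nodes`, `memory`) and the names `SketchState`, `DeterministicAssertion`, and `checkAssertion` are assumptions for illustration, as is the choice to stringify non-string values for `contains` mode. `token_budget_respected` and `llm_judge` are left out because they depend on runtime details not described here.

```typescript
// Hypothetical, simplified final-state shape (the real WorkflowState may differ).
interface SketchState {
  status: string;
  visited_nodes: string[];
  memory: Record<string, unknown>;
}

type DeterministicAssertion =
  | { type: 'status_equals'; expected: string }
  | { type: 'node_visited'; node_id: string }
  | { type: 'memory_contains'; key: string }
  | {
      type: 'memory_matches';
      key: string;
      mode: 'exact' | 'contains' | 'regex';
      expected?: unknown;
      pattern: string;
    };

// Own-property check, so inherited keys such as "toString" do not count.
function hasKey(memory: Record<string, unknown>, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(memory, key);
}

// Strings are used as-is; other values are JSON-encoded before text matching.
function asText(value: unknown): string {
  return typeof value === 'string' ? value : (JSON.stringify(value) ?? '');
}

function checkAssertion(a: DeterministicAssertion, state: SketchState): boolean {
  switch (a.type) {
    case 'status_equals':
      return state.status === a.expected;
    case 'node_visited':
      return state.visited_nodes.includes(a.node_id);
    case 'memory_contains':
      return hasKey(state.memory, a.key);
    case 'memory_matches': {
      if (!hasKey(state.memory, a.key)) return false;
      const actual = state.memory[a.key];
      if (a.mode === 'exact') {
        // JSON equality; note this comparison is sensitive to object key order.
        return JSON.stringify(actual) === JSON.stringify(a.expected);
      }
      if (a.mode === 'contains') {
        return asText(actual).includes(asText(a.expected));
      }
      // 'regex': test the pattern against the stringified value.
      return new RegExp(a.pattern).test(asText(actual));
    }
  }
}
```

Modeling the assertions as a discriminated union on `type` lets the compiler confirm that every variant is handled in the `switch`.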
EvalSuite structure
interface EvalSuite {
  name: string;
  cases: EvalCase[];
}

interface EvalCase {
  name: string;                    // Human-readable case name
  graph: Graph;                    // The graph to execute
  input: Record<string, unknown>;  // Initial memory (goal, constraints, etc.)
  assertions: EvalAssertion[];     // What to check
  timeout_ms?: number;             // Workflow timeout (default: 60000ms)
}

EvalReport structure
`runEval()` returns a detailed report:
interface EvalReport {
  suite_name: string;
  cases: EvalCaseResult[];
  overall_score: number;  // Mean of all case scores (0.0–1.0)
  total: number;          // Total cases
  passed: number;         // Cases where all assertions passed
  failed: number;         // Cases with at least one failure
  duration_ms: number;    // Wall-clock duration
}

interface EvalCaseResult {
  name: string;
  passed: boolean;        // All assertions passed?
  score: number;          // Fraction of assertions that passed
  duration_ms: number;
  assertions: AssertionResult[];
  error?: string;         // Set if workflow crashed
}

interface AssertionResult {
  assertion: EvalAssertion;
  passed: boolean;
  actual?: unknown;       // Observed value
  message?: string;       // Failure explanation
}

Example eval suites
cycgraph ships with three example suites that demonstrate common patterns.
Linear completion
Tests a 2-node tool pipeline (fetch → transform):
const suite: EvalSuite = {
  name: 'Linear Completion',
  cases: [
    {
      name: 'Two tool nodes complete successfully',
      graph: linearGraph,
      input: { goal: 'Fetch and transform data' },
      assertions: [
        { type: 'status_equals', expected: 'completed' },
        { type: 'node_visited', node_id: 'fetch' },
        { type: 'node_visited', node_id: 'transform' },
        { type: 'memory_contains', key: 'fetch_result' },
        { type: 'memory_contains', key: 'transform_result' },
      ],
    },
  ],
};

Supervisor routing
Tests a router dispatching to a worker:
assertions: [
  { type: 'status_equals', expected: 'completed' },
  { type: 'node_visited', node_id: 'router' },
  { type: 'node_visited', node_id: 'worker' },
  { type: 'memory_contains', key: 'worker_result' },
],

Human-in-the-loop approval
Tests that the workflow pauses at an approval gate (status is `waiting`, not `completed`):
assertions: [
  { type: 'status_equals', expected: 'waiting' },
  { type: 'node_visited', node_id: 'prepare' },
  { type: 'node_visited', node_id: 'review' },
  { type: 'memory_contains', key: 'prepare_result' },
],

Running the examples
cd packages/orchestrator
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/linear-completion.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/supervisor-routing.ts
ANTHROPIC_API_KEY=sk-ant-... npx tsx examples/evals/hitl-approval.ts

Scoring
- A case with 3/5 passing assertions scores 0.6 and is marked `passed: false`.
- A case with 0 assertions scores 1.0 (all assertions trivially pass).
- The suite’s `overall_score` is the mean of all case scores.
- A case that crashes before assertions are checked scores 0 with the error captured in `error`.
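Because a case only counts as passed when every assertion passes, `passed`/`total` and `overall_score` can tell different stories about the same run. One possible way to act on a report, for example in a CI script, is sketched below. `summarizeFailures` and `minScore` are hypothetical names, not part of cycgraph, and the `*Like` interfaces are local copies of just the report fields the helper reads.

```typescript
// Local, trimmed-down copies of the report fields this helper needs.
interface AssertionResultLike {
  passed: boolean;
  message?: string;
}

interface CaseResultLike {
  name: string;
  passed: boolean;
  score: number;
  assertions: AssertionResultLike[];
  error?: string;
}

interface ReportLike {
  overall_score: number;
  cases: CaseResultLike[];
}

// Returns whether the suite clears a minimum overall score, plus one
// human-readable line per failing case (and per failed assertion message).
function summarizeFailures(
  report: ReportLike,
  minScore: number,
): { ok: boolean; lines: string[] } {
  const lines: string[] = [];

  for (const c of report.cases) {
    if (c.passed) continue;

    if (c.error !== undefined) {
      // Crashed cases carry an error and a score of 0.
      lines.push(`${c.name}: crashed (${c.error})`);
      continue;
    }

    const failed = c.assertions.filter((a) => !a.passed);
    lines.push(`${c.name}: score ${c.score.toFixed(2)}, ${failed.length} failed assertion(s)`);
    for (const a of failed) {
      if (a.message) {
        lines.push(`  - ${a.message}`);
      }
    }
  }

  return { ok: report.overall_score >= minScore, lines };
}
```

A caller could print `lines` and exit non-zero when `ok` is false. The threshold itself is a project-level decision.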
Next steps
- Tracing: see workflow execution in real time with OpenTelemetry
- Cost & Budget Tracking: token and cost budgets
- Security: economic guardrails and denial-of-wallet prevention