# agent-eval-harness

> CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.

- Author: EdwardIrby
- Repository: plaited/agent-eval-harness
- Version: 20260122002133
- Stars: 2
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/plaited/agent-eval-harness
- Web: https://mule.run/skillshub/@@plaited/agent-eval-harness~agent-eval-harness:20260122002133

---

---
name: agent-eval-harness
description: CLI tool for capturing agent trajectories. Execute prompts against headless CLI agents via schema-driven adapters, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring.
compatibility: Bun >= 1.2.9
---

# Agent Eval Harness

## Purpose

CLI tool for capturing trajectories from headless CLI agents, optimized for TypeScript/JavaScript projects using Bun.

**The harness captures.
You score.**

| Harness Provides | You Provide |
|------------------|-------------|
| Prompt execution via headless adapters | Scoring logic (Braintrust, custom scripts) |
| Full trajectory capture (thoughts, tools, plans) | Pass/fail determination via graders |
| Structured JSONL output | LLM-as-judge prompts |
| Reproducible execution environment | CI integration, golden file comparison |

**Use this when:**

- Capturing trajectories for downstream evaluation
- Generating training data (SFT/DPO) with full context
- Building regression test fixtures for agent behavior
- Comparing agent responses across configurations

## Installation

```bash
# Run without installing (recommended)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Or install as project dependency
bun add @plaited/agent-eval-harness
```

## Core Principle: Capture Once, Derive Many Views

```mermaid
flowchart LR
    Prompts["prompts.jsonl"] --> Capture["capture/trials"]
    Schema["headless schema"] --> Capture
    Capture --> Results["results.jsonl (full trajectory)"]
    Results --> Summarize["summarize"]
    Results --> Calibrate["calibrate"]
    Results --> Custom["(your tools)"]
    Summarize --> Views["summary.jsonl / .md"]
    Calibrate --> Report["calibration.md"]
    Custom --> Pipeline["any scoring platform"]
```

**Single output format:** Full trajectory JSONL (always)

**No `--format` flag:** Derive views with separate commands

**Schema exports:** Zod schemas + JSON Schema for any tooling

## Commands

### Core Commands

| Command | Input | Output | Purpose |
|---------|-------|--------|---------|
| `capture` | prompts.jsonl + schema | results.jsonl | Trajectory capture (full) |
| `trials` | prompts.jsonl + schema | trials.jsonl | Multi-run + optional metrics |
| `summarize` | results.jsonl | summary.jsonl or .md | Derive compact views |
| `calibrate` | results.jsonl | calibration.md | Sample failures for review |
| `validate-refs` | prompts.jsonl | validation.jsonl | Check reference
solutions |
| `balance` | prompts.jsonl | balance.json | Analyze test set coverage |
| `schemas` | (none) | JSON Schema | Export schemas for non-TS users |

### Pipeline Commands (Unix-style)

| Command | Input | Output | Purpose |
|---------|-------|--------|---------|
| `run` | prompts.jsonl + schema | raw.jsonl | Execute prompts, raw output |
| `extract` | raw.jsonl + schema | extracted.jsonl | Parse trajectories |
| `grade` | extracted.jsonl + grader | graded.jsonl | Apply grader scoring |
| `format` | results.jsonl | jsonl/markdown/csv | Convert output format |
| `compare` | multiple results.jsonl | comparison.jsonl | Compare multiple runs |

All commands support optional `--grader ./grader.ts` for scoring.

## Capture Command

### Basic Usage

```bash
bunx @plaited/agent-eval-harness capture <prompts.jsonl> --schema <path> [options]
```

### Arguments

| Argument/Flag | Description | Default |
|------|-------------|---------|
| `prompts.jsonl` | Input file with prompts to execute | Required |
| `-s, --schema` | Path to headless adapter schema | Required |
| `-o, --output` | Output file/path | stdout |
| `-c, --cwd` | Working directory for agent | current |
| `-t, --timeout` | Request timeout in ms | `60000` |
| `--progress` | Show progress to stderr | false |
| `--append` | Append to output file | false |
| `-g, --grader` | Path to grader module | none |
| `--debug` | Show detailed CLI output for debugging | false |

### Examples

```bash
# Basic capture
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json -o results.jsonl

# Using a local adapter script
bunx @plaited/agent-eval-harness capture prompts.jsonl bun ./my-adapter.ts -o results.jsonl

# With grader (adds score to each result)
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts -o results.jsonl
```

## Trials Command

Run each prompt multiple times for pass@k/pass^k analysis.
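As a rough illustration of what the two metrics mean, here is a minimal sketch. It assumes the simple definitions pass@k = 1 - (1 - p)^k (at least one of k trials passes) and pass^k = p^k (all k trials pass), where p is the observed pass rate. The `passMetrics` helper is hypothetical and not part of the harness API, and the harness may compute these values differently.

```typescript
// Hypothetical helper, not part of the harness API.
// Assumes pass@k = 1 - (1 - p)^k and pass^k = p^k, with p = observed pass rate.
type PassMetrics = { passRate: number; passAtK: number; passExpK: number }

const passMetrics = (passes: boolean[]): PassMetrics => {
  const k = passes.length
  if (k === 0) throw new Error('at least one trial is required')
  const passRate = passes.filter(Boolean).length / k
  return {
    passRate,
    // Probability that at least one of k independent trials passes
    passAtK: 1 - (1 - passRate) ** k,
    // Probability that all k independent trials pass
    passExpK: passRate ** k,
  }
}

// Four of five trials passing
const metrics = passMetrics([true, true, false, true, true])
console.log(metrics)
```

Under these assumptions pass@k rewards an agent that succeeds at least once, while pass^k rewards consistency, so the two can diverge sharply for the same pass rate.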
```bash
# Capture only (no grader)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 -o trials.jsonl

# With grader (computes pass@k, pass^k)
bunx @plaited/agent-eval-harness trials prompts.jsonl --schema ./claude.json -k 5 --grader ./grader.ts -o trials.jsonl
```

### Output

Without grader:

```jsonl
{"id":"search-001","input":"Find the CEO","k":5,"trials":[{"trialNum":1,"output":"...","trajectory":[...],"duration":1234},...]}
```

With grader:

```jsonl
{"id":"search-001","input":"Find the CEO","k":5,"passRate":0.8,"passAtK":0.99,"passExpK":0.33,"trials":[{"trialNum":1,"output":"...","pass":true,"score":1.0},...]}
```

## Summarize Command

Derive compact views from full trajectory results.

```bash
# Summary JSONL (for jq analysis)
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Markdown (for LLM-as-judge)
bunx @plaited/agent-eval-harness summarize results.jsonl --markdown -o results.md
```

## Calibrate Command

Sample failures for grader review. Calibration helps you distinguish between **agent failures** (the agent did the wrong thing) and **grader bugs** (the agent was correct, but the grader was too strict).

```bash
# Sample failures for human review
bunx @plaited/agent-eval-harness calibrate results.jsonl --sample 10 -o calibration.md

# Re-score with different grader to compare
bunx @plaited/agent-eval-harness calibrate results.jsonl --grader ./loose-grader.ts --sample 10 -o comparison.md
```

See [eval-concepts.md](references/eval-concepts.md#grader-calibration) for why calibration matters.

## Validate-Refs Command

Check that reference solutions pass your grader before evaluating agents.

```bash
# Validate reference solutions
bunx @plaited/agent-eval-harness validate-refs prompts.jsonl --grader ./grader.ts -o validation.jsonl

# Check for failures
cat validation.jsonl | jq 'select(.pass == false)'
```

### Why Use This?
If your reference solution fails your own grader:

- The task definition is ambiguous
- The grader is too strict
- The hint is wrong

**Fix the eval before evaluating the agent.**

### Input Format

Prompts must include a `reference` field:

```jsonl
{"id":"test-001","input":"Create a button component","hint":""}
```

### Output Format

```jsonl
{"id":"test-001","input":"Create a button component","reference":"export const Button = () => ","pass":true,"score":1.0,"reasoning":"Contains hint content"}
```

## Balance Command

Analyze test set coverage to ensure balanced evaluation.

```bash
# Analyze prompt distribution
bunx @plaited/agent-eval-harness balance prompts.jsonl -o balance.json

# Pretty print
bunx @plaited/agent-eval-harness balance prompts.jsonl | jq .
```

### Why Use This?

An eval with only "make X work" misses "don't break Y". Balance analysis shows:

- **Category distribution** (from `metadata.category`)
- **Positive/negative case ratio**
- **Coverage gaps**

### Output Format

```json
{
  "totalCases": 50,
  "categories": [
    { "name": "ui", "count": 20, "percentage": 40 },
    { "name": "logic", "count": 15, "percentage": 30 },
    { "name": "api", "count": 10, "percentage": 20 },
    { "name": "edge-case", "count": 5, "percentage": 10 }
  ],
  "underrepresented": ["edge-case"],
  "suggestions": ["Consider adding more test cases for: edge-case"]
}
```

### Balanced Eval Design

Include both positive and negative cases:

| Type | Example | Purpose |
|------|---------|---------|
| Positive | "Add a login button" | Agent should succeed |
| Negative | "Add a button without breaking tests" | Agent should not break things |
| Edge case | "Handle empty input gracefully" | Agent should be robust |

See [eval-concepts.md](references/eval-concepts.md#test-set-balance) for more on balanced test sets.

## Pipeline Workflow

The pipeline commands enable Unix-style composition for flexible evaluation workflows.
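Because each stage reads and writes JSONL, a custom stage can in principle be slotted between the built-in commands. The following is a minimal sketch of such a stage's core logic. The `JsonlRecord` shape and the `parseJsonl`, `toJsonl`, and `filterStage` helpers are hypothetical names used for illustration only, not harness exports.

```typescript
// Hypothetical custom stage: keep only the records that satisfy a predicate.
// The record shape is an assumption; only `id` is relied on here.
type JsonlRecord = { id: string; [key: string]: unknown }

// Parse JSONL text into records, skipping blank lines
const parseJsonl = (text: string): JsonlRecord[] =>
  text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as JsonlRecord)

// Serialize records back to JSONL, one JSON object per line
const toJsonl = (records: JsonlRecord[]): string =>
  records.map((record) => JSON.stringify(record)).join('\n')

const filterStage = (text: string, keep: (record: JsonlRecord) => boolean): string =>
  toJsonl(parseJsonl(text).filter(keep))

const input = '{"id":"a-1","pass":true}\n{"id":"b-2","pass":false}\n'
const failures = filterStage(input, (record) => record.pass === false)
console.log(failures) // {"id":"b-2","pass":false}
```

In a real script the text would be read from stdin and the result written to stdout, so the stage composes with shell pipes like any other command.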
### Full Pipeline Example

```bash
# Execute → Extract → Grade → Format in one pipeline
cat prompts.jsonl | \
  bunx @plaited/agent-eval-harness run -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json | \
  bunx @plaited/agent-eval-harness grade -g ./grader.ts | \
  bunx @plaited/agent-eval-harness format -f markdown > report.md
```

### Run Command

Execute prompts and output raw results. Three modes are available:

```bash
# Schema mode (recommended)
bunx @plaited/agent-eval-harness run prompts.jsonl --schema claude.json

# Simple mode: {} placeholder substitution
bunx @plaited/agent-eval-harness run prompts.jsonl --simple "claude -p {} --output-format stream-json"

# Shell mode: $PROMPT environment variable
bunx @plaited/agent-eval-harness run prompts.jsonl --shell 'claude -p "$PROMPT" --output-format stream-json'
```

> **⚠️ Security Warning:** The `--simple` and `--shell` modes execute prompts via shell commands. Prompts are escaped but **do not use untrusted prompt content** with these modes. Malicious prompt text could potentially escape the quoting and execute arbitrary commands. Use `--schema` mode (headless adapter) for untrusted inputs.
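To make the risk concrete, here is a minimal sketch of POSIX single-quote escaping, the general technique for passing arbitrary text as one shell argument. The `shellQuote` helper is hypothetical and is not the harness's escaping code; it only illustrates why prompt text containing quotes needs careful handling before it reaches a shell.

```typescript
// Hypothetical helper illustrating POSIX single-quote escaping.
// This is NOT the harness's implementation.
const shellQuote = (value: string): string =>
  // Close the quote, emit an escaped single quote, then reopen the quote
  "'" + value.replace(/'/g, "'\\''") + "'"

// A prompt that would break out of naive '...' quoting
const prompt = "it's done'; echo injected #"
const naive = "claude -p '" + prompt + "'"
const quoted = 'claude -p ' + shellQuote(prompt)

console.log(naive)
console.log(quoted)
```

With naive interpolation, the first single quote in the prompt ends the quoted argument and the rest is parsed as shell syntax. Even with correct escaping, schema mode remains the safer choice for untrusted inputs, as the warning above states.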
### Extract Command

Parse raw output into structured trajectories:

```bash
# From file
bunx @plaited/agent-eval-harness extract raw.jsonl --schema claude.json -o extracted.jsonl

# Piped from run
bunx @plaited/agent-eval-harness run prompts.jsonl -s claude.json | \
  bunx @plaited/agent-eval-harness extract -s claude.json
```

### Grade Command

Apply grader to extracted results:

```bash
bunx @plaited/agent-eval-harness grade extracted.jsonl --grader ./grader.ts -o graded.jsonl
```

### Format Command

Convert results to different output formats:

```bash
# Markdown report
bunx @plaited/agent-eval-harness format results.jsonl --style markdown -o report.md

# CSV for spreadsheets
bunx @plaited/agent-eval-harness format results.jsonl --style csv -o results.csv

# JSONL (pass-through, default)
bunx @plaited/agent-eval-harness format results.jsonl --style jsonl
```

### Compare Command

Compare multiple runs of the same prompts:

```bash
# Compare multiple result files
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl run3.jsonl \
  --grader ./compare-grader.ts -o comparison.jsonl

# With explicit labels
bunx @plaited/agent-eval-harness compare \
  --run "with-mcp:results-mcp.jsonl" \
  --run "vanilla:results-vanilla.jsonl" \
  --grader ./compare-grader.ts
```

**Use cases for compare:**

- Same agent, different MCP servers
- Same agent, different skills enabled
- Same agent, different model versions
- Different agents entirely

### Comparison Grader Interface

```typescript
import type { ComparisonGrader } from '@plaited/agent-eval-harness/pipeline'

export const grade: ComparisonGrader = async ({ id, input, hint, runs }) => {
  // runs is Record
  // Return rankings from best to worst
  return {
    rankings: [
      { run: 'with-mcp', rank: 1, score: 0.9 },
      { run: 'vanilla', rank: 2, score: 0.7 },
    ],
    reasoning: 'MCP run produced more accurate output'
  }
}
```

### Pipeline Workflow Diagram

```mermaid
flowchart LR
    Prompts["prompts.jsonl"] --> Run["run"]
    Schema["headless schema"] --> Run
    Run -->
    Raw["raw.jsonl"]
    Raw --> Extract["extract"]
    Schema --> Extract
    Extract --> Extracted["extracted.jsonl"]
    Extracted --> Grade["grade"]
    Grader["grader.ts"] --> Grade
    Grade --> Graded["graded.jsonl"]
    Graded --> Format["format"]
    Format --> Output["report.md / .csv / .jsonl"]
    Graded --> Compare["compare"]
    Results2["other runs..."] --> Compare
    CompareGrader["compare-grader.ts"] --> Compare
    Compare --> Comparison["comparison.jsonl"]
```

## Schemas Command

Export JSON schemas for non-TypeScript tools.

```bash
# List available schemas
bunx @plaited/agent-eval-harness schemas

# Export all schemas as JSON
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Export specific schema
bunx @plaited/agent-eval-harness schemas CaptureResult --json
bunx @plaited/agent-eval-harness schemas TrialResult --json
bunx @plaited/agent-eval-harness schemas GraderResult --json
```

### Available Schemas

| Schema | Description |
|--------|-------------|
| `CaptureResult` | Single capture output (id, input, output, trajectory, timing) |
| `TrialResult` | Multi-run trial output (includes passAtK, passExpK) |
| `GraderResult` | Grader return value (pass, score, reasoning) |
| `PromptInput` | Input prompt format |
| `TrajectoryStep` | Single step in trajectory array |
| `SummaryResult` | Compact summary format |

### Usage in Other Languages

Export schemas for validation in Python, Go, etc.:

```bash
# Export all schemas
bunx @plaited/agent-eval-harness schemas --json -o schemas.json

# Use in Python with jsonschema
python -c "
import json
from jsonschema import validate

with open('schemas.json') as f:
    schemas = json.load(f)

with open('results.jsonl') as f:
    for line in f:
        result = json.loads(line)
        validate(result, schemas['CaptureResult'])
        print(f'{result[\"id\"]}: valid')
"
```

## Grader Interface

Graders provide semantic pass/fail scoring for captured trajectories. The harness supports graders written in **any language**.
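Whatever the language, a grader returns the three fields listed for `GraderResult` (pass, score, reasoning). The sketch below shows one way to check that shape when parsing a grader's JSON output. The type, the 0 to 1 score range, the optional `reasoning`, and the `isGraderResult` and `parseGraderOutput` helpers are all assumptions made for illustration. The authoritative definition is the `GraderResult` schema exported by the `schemas` command.

```typescript
// Assumed shape, based only on the GraderResult fields named in this document.
// The score range and the optional `reasoning` are assumptions.
type GraderResult = { pass: boolean; score: number; reasoning?: string }

// Hypothetical type guard, not a harness export
const isGraderResult = (value: unknown): value is GraderResult => {
  if (typeof value !== 'object' || value === null) return false
  const { pass, score, reasoning } = value as Record<string, unknown>
  return (
    typeof pass === 'boolean' &&
    typeof score === 'number' &&
    score >= 0 &&
    score <= 1 &&
    (reasoning === undefined || typeof reasoning === 'string')
  )
}

// Parse the JSON text a grader executable printed to stdout
const parseGraderOutput = (stdout: string): GraderResult => {
  const parsed: unknown = JSON.parse(stdout)
  if (!isGraderResult(parsed)) {
    throw new Error('grader output does not match the assumed shape')
  }
  return parsed
}

const result = parseGraderOutput('{"pass":true,"score":1.0,"reasoning":"Contains hint"}')
console.log(result.pass, result.score)
```

A check like this can surface a malformed grader early, before its output is mistaken for an agent failure.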
### TypeScript Grader

```typescript
// my-grader.ts
import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ input, output, hint, trajectory }) => {
  const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')
  return {
    pass,
    score: pass ? 1 : 0,
    reasoning: pass ? 'Contains hint content' : 'Missing hint content'
  }
}
```

**Note:** `input` can be `string` (single turn) or `string[]` (multi-turn). The `hint` field provides grader context (renamed from `expected`).

### Python/Executable Graders

Any executable can be a grader using a stdin/stdout JSON protocol:

```python
#!/usr/bin/env python3
import json, sys

data = json.load(sys.stdin)
output = data.get("output", "").lower()
hint = (data.get("hint") or "").lower()

pass_result = hint in output if hint else True
print(json.dumps({
    "pass": pass_result,
    "score": 1.0 if pass_result else 0.0,
    "reasoning": "Contains hint" if pass_result else "Missing hint"
}))
```

```bash
chmod +x ./grader.py
bunx @plaited/agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py -o results.jsonl
```

See [graders.md](references/graders.md) for complete polyglot grader documentation including shell scripts and LLM-as-judge patterns.

## Input Format

Each line in `prompts.jsonl`:

```jsonl
{"id":"test-001","input":"Create a button","hint":"should contain