# agent-evaluation-framework

> Three-layer evaluation for agentic systems (component testing, trajectory evaluation, and outcome evaluation with LLM-as-Judge). Includes golden trajectories, the Four C's, and the evaluation flywheel. Use when building test suites for agents.

- Author: Mahdi Khan
- Repository: mahdi-khaannn/agentic-templates
- Version: 20260207144042
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/mahdi-khaannn/agentic-templates
- Web: https://mule.run/skillshub/@@mahdi-khaannn/agentic-templates~agent-evaluation-framework:20260207144042

---

---
name: agent-evaluation-framework
description: Three-layer evaluation for agentic systems (component testing, trajectory evaluation, and outcome evaluation with LLM-as-Judge). Includes golden trajectories, the Four C's, and the evaluation flywheel. Use when building test suites for agents.
globs: ["**/*.py", "**/*.ts", "**/*.yaml"]
---

# Agent Evaluation Framework

## The Paradigm Shift

- Traditional: `assert f(x) == y` (deterministic, reproducible)
- Agents: non-deterministic, infinite surface area, same input → different outputs
- **New question**: Did the agent follow a valid methodology to achieve a good outcome?

## Three-Layer Evaluation (MECE)

### Layer 1: Component Testing (Deterministic)

- Standard unit/integration tests for tools, parsers, validators, state managers
- Test each tool: valid inputs → expected outputs, invalid → graceful errors
- **Necessary but insufficient**: Passing component tests ≠ agent uses them correctly

### Layer 2: Trajectory Evaluation (Procedural)

- Analyze sequence of (Thought → Action → Observation) steps
- **Tool Recall**: Correct tools called / All tools that should have been called
- **Tool Precision**: Correct tools called / All tools agent actually called
- **Parameter Accuracy**: Correct parameters / All parameters passed
- **Golden Trajectories**: Expert-defined ideal execution paths as "answer key"
- **LLM-as-Judge for Trajectories**: Evaluate actual vs golden trajectory (1-5 scoring)

### Layer 3: Outcome Evaluation (Semantic)

- The Four C's:
  1. **Correctness**: All claims grounded in sources? Hallucination detection?
  2. **Completeness**: Fully answers the question? No refusals?
  3. **Clarity**: Appropriate language, well-organized, on-brand?
  4. **Compliance**: Safe, unbiased, meets regulatory requirements?
- **LLM-as-Judge**: Powerful LLM evaluates against detailed rubric → JSON scores + justification
- **Faithfulness Test**: RAG responses use ONLY provided context?

## The Evaluation Flywheel

```
Capture (log all traces) → Surface (find failures) → Annotate (expert review)
→ Integrate (add to golden sets) → Test (regression) → Improve → repeat
```

- Every production bug becomes a permanent test case
- Future versions must pass all historical cases
- Gate deployments on evaluation thresholds

## The Four Pillars of Agent Quality

| Pillar | Definition | Metrics |
|--------|------------|---------|
| Effectiveness | Goal achieved? | Task success rate, accuracy |
| Efficiency | Without waste? | Tokens/task, cost/task, latency |
| Robustness | Handles errors? | Error recovery rate |
| Safety | Within bounds? | Guardrail trigger rate, PII leak rate |
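
The Layer 2 ratios (Tool Recall, Tool Precision, Parameter Accuracy) and the "gate deployments on evaluation thresholds" rule can be sketched in a few lines of Python. This is a minimal illustration, not the framework's own implementation: the `ToolCall` structure, the function names, the example tools, and the threshold values are all hypothetical. It also assumes that tools are compared by name only (order and repeats ignored) and that each tool appears at most once in the golden trajectory.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One action step in a trajectory (hypothetical structure)."""
    tool: str
    params: dict = field(default_factory=dict)


def trajectory_metrics(golden, actual):
    """Compare an actual trajectory against a golden one.

    Assumptions: tools are matched by name only, and each tool
    appears at most once in the golden trajectory.
    """
    golden_tools = {call.tool for call in golden}
    actual_tools = {call.tool for call in actual}
    correct_tools = golden_tools & actual_tools

    # Recall: correct tools called / all tools that should have been called.
    recall = len(correct_tools) / len(golden_tools) if golden_tools else 1.0
    # Precision: correct tools called / all tools the agent actually called.
    # An empty actual trajectory is treated as vacuously precise (a convention).
    precision = len(correct_tools) / len(actual_tools) if actual_tools else 1.0

    # Parameter accuracy: correct parameters / all parameters passed.
    # Parameters passed to tools outside the golden set count as incorrect.
    golden_params = {call.tool: call.params for call in golden}
    passed = 0
    matched = 0
    for call in actual:
        expected = golden_params.get(call.tool, {})
        for key, value in call.params.items():
            passed += 1
            if key in expected and expected[key] == value:
                matched += 1
    param_accuracy = matched / passed if passed else 1.0

    return {
        "tool_recall": recall,
        "tool_precision": precision,
        "parameter_accuracy": param_accuracy,
    }


def passes_gate(metrics, thresholds):
    """Return True only if every metric meets its minimum threshold."""
    return all(metrics[name] >= minimum for name, minimum in thresholds.items())


# Hypothetical golden trajectory for a refund task.
golden = [
    ToolCall("search_orders", {"customer_id": "c-42"}),
    ToolCall("issue_refund", {"order_id": "o-7", "amount": 20}),
]

# Hypothetical agent run: one extra tool call and one wrong parameter value.
actual = [
    ToolCall("search_orders", {"customer_id": "c-42"}),
    ToolCall("send_email", {"to": "x@example.com"}),
    ToolCall("issue_refund", {"order_id": "o-7", "amount": 25}),
]

# Illustrative thresholds only; real values depend on the product's risk profile.
thresholds = {"tool_recall": 1.0, "tool_precision": 0.9, "parameter_accuracy": 0.95}

metrics = trajectory_metrics(golden, actual)
print(metrics)
print("deploy" if passes_gate(metrics, thresholds) else "block")
```

In this run the agent calls both required tools, so recall is 2/2, but the extra `send_email` call lowers precision to 2/3, and two of the four passed parameters match the golden values. The gate therefore blocks the deployment. Matching by tool name alone is the simplest choice; a stricter variant could also compare call order or allow repeated calls.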