# agent-evaluation-framework

> Three-layer evaluation for agentic systems — component testing, trajectory evaluation, and outcome evaluation with LLM-as-Judge. Includes golden trajectories, the Four C's, and the evaluation flywheel. Use when building test suites for agents.

- Author: Mahdi Khan
- Repository: mahdi-khaannn/agentic-templates
- Version: 20260207144042
- Last Updated: 2026-02-07
- Source: https://github.com/mahdi-khaannn/agentic-templates
- Web: https://mule.run/skillshub/@@mahdi-khaannn/agentic-templates~agent-evaluation-framework:20260207144042

---

---
name: agent-evaluation-framework
description: Three-layer evaluation for agentic systems — component testing, trajectory evaluation, and outcome evaluation with LLM-as-Judge. Includes golden trajectories, the Four C's, and the evaluation flywheel. Use when building test suites for agents.
globs: ["**/*.py", "**/*.ts", "**/*.yaml"]
---

# Agent Evaluation Framework

## The Paradigm Shift
- Traditional testing: `assert f(x) == y` (deterministic, reproducible)
- Agent testing: non-deterministic, with an effectively unbounded input space; the same input can produce different outputs
- **New question**: Did the agent follow a valid methodology to achieve a good outcome?

## Three-Layer Evaluation (MECE)

### Layer 1: Component Testing (Deterministic)
- Standard unit/integration tests for tools, parsers, validators, state managers
- Test each tool: valid inputs → expected outputs, invalid inputs → graceful errors
- **Necessary but insufficient**: Passing component tests does not show that the agent uses those components correctly
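
As a minimal sketch of a Layer 1 test, the example below checks one tool-side component for both the valid and the invalid path. The parser `parse_tool_args` and its required `city` field are hypothetical, invented here for illustration; the point is only the shape of the tests (expected output on good input, a structured error instead of an exception on bad input).

```python
import json


def parse_tool_args(raw: str) -> dict:
    """Hypothetical tool-argument parser: returns a result dict instead of raising."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(args, dict) or "city" not in args:
        return {"ok": False, "error": "missing required field: city"}
    return {"ok": True, "args": args}


def test_valid_input_returns_expected_output():
    result = parse_tool_args('{"city": "Oslo"}')
    assert result == {"ok": True, "args": {"city": "Oslo"}}


def test_invalid_input_fails_gracefully():
    # Malformed JSON, wrong top-level type, and a missing field
    # should all produce an error result rather than an exception.
    for raw in ("not json", "[]", '{"country": "NO"}'):
        result = parse_tool_args(raw)
        assert result["ok"] is False
        assert result["error"]  # non-empty, human-readable message


if __name__ == "__main__":
    test_valid_input_returns_expected_output()
    test_invalid_input_fails_gracefully()
```

The test functions follow the `test_*` naming convention, so a runner such as pytest would collect them without changes.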

### Layer 2: Trajectory Evaluation (Procedural)
- Analyze the sequence of (Thought → Action → Observation) steps
- **Tool Recall**: Correct tools called / all tools that should have been called
- **Tool Precision**: Correct tools called / all tools the agent actually called
- **Parameter Accuracy**: Correct parameters / all parameters passed
- **Golden Trajectories**: Expert-defined ideal execution paths that serve as the "answer key"
- **LLM-as-Judge for Trajectories**: A judge model compares the actual trajectory against the golden one and scores it 1-5
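
The three ratio metrics above can be computed directly from an actual trajectory and a golden one. The sketch below is one possible reading, not a reference implementation: it assumes tools are compared by name and ignores call order, and it treats a parameter as correct only if the golden call to the same tool has the same key and value. The `ToolCall` type and the tool names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one Action step in a trajectory."""
    name: str
    params: dict = field(default_factory=dict)


def tool_metrics(actual: list[ToolCall], golden: list[ToolCall]) -> dict[str, float]:
    # Assumption: set-based comparison by tool name, order ignored.
    actual_names = {call.name for call in actual}
    golden_names = {call.name for call in golden}
    correct = actual_names & golden_names

    # Empty denominators are treated as vacuously perfect.
    recall = len(correct) / len(golden_names) if golden_names else 1.0
    precision = len(correct) / len(actual_names) if actual_names else 1.0

    # Assumption: at most one golden call per tool name.
    golden_params = {call.name: call.params for call in golden}
    passed = matched = 0
    for call in actual:
        expected = golden_params.get(call.name, {})
        for key, value in call.params.items():
            passed += 1
            if key in expected and expected[key] == value:
                matched += 1
    param_accuracy = matched / passed if passed else 1.0

    return {
        "tool_recall": recall,
        "tool_precision": precision,
        "parameter_accuracy": param_accuracy,
    }


golden = [
    ToolCall("search_flights", {"origin": "OSL", "dest": "LHR"}),
    ToolCall("book_flight", {"flight_id": "BA123"}),
]
actual = [
    ToolCall("search_flights", {"origin": "OSL", "dest": "LGW"}),  # wrong dest
    ToolCall("get_weather", {"city": "London"}),                   # not in golden
]
metrics = tool_metrics(actual, golden)
print(metrics)
```

Here recall and precision are both 0.5 (one of two expected tools was called, and one of two calls was expected), and parameter accuracy is 1/3 (only `origin` matches). A stricter variant could also compare call order or allow several golden calls per tool.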

### Layer 3: Outcome Evaluation (Semantic)
- The Four C's:
  1. **Correctness**: Are all claims grounded in sources, with no hallucinations?
  2. **Completeness**: Does the response fully answer the question, with no refusals?
  3. **Clarity**: Is the language appropriate, well organized, and on-brand?
  4. **Compliance**: Is the response safe, unbiased, and within regulatory requirements?
- **LLM-as-Judge**: A capable LLM evaluates the response against a detailed rubric and returns JSON scores plus a justification
- **Faithfulness Test**: Do RAG responses use ONLY the provided context?
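
One way to wire up an outcome judge is sketched below. It builds a rubric prompt over the Four C's, sends it to a judge model, and validates the JSON that comes back. Several things are assumptions made for the example: the prompt wording, the 1-5 scale for outcome scores, the JSON shape, and the `call_judge` callable, which stands in for whatever model client is in use (a canned `fake_judge` is used here so the code runs offline).

```python
import json
from typing import Callable

CRITERIA = ("correctness", "completeness", "clarity", "compliance")

# Hypothetical rubric prompt; doubled braces are literal braces for str.format.
RUBRIC_PROMPT = """You are grading an AI agent's answer.
Score each criterion from 1 (poor) to 5 (excellent): {criteria}.
Reply with JSON only, in the form:
{{"scores": {{"correctness": 1, ...}}, "justification": "..."}}

Question: {question}
Sources: {sources}
Answer: {answer}
"""


def judge_outcome(question: str, sources: str, answer: str,
                  call_judge: Callable[[str], str]) -> dict:
    """Ask the judge for scores and reject malformed verdicts."""
    prompt = RUBRIC_PROMPT.format(
        criteria=", ".join(CRITERIA),
        question=question, sources=sources, answer=answer,
    )
    verdict = json.loads(call_judge(prompt))
    scores = verdict.get("scores") or {}
    for name in CRITERIA:
        score = scores.get(name)
        # type(...) is int also rejects booleans and missing scores.
        if type(score) is not int or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {name}: {score!r}")
    if not verdict.get("justification"):
        raise ValueError("judge returned no justification")
    return verdict


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same verdict."""
    return json.dumps({
        "scores": {"correctness": 5, "completeness": 4,
                   "clarity": 4, "compliance": 5},
        "justification": "All claims trace to the provided source.",
    })


verdict = judge_outcome(
    question="When was the policy last updated?",
    sources="Policy v3, updated 2024-01-15.",
    answer="The policy was last updated on 2024-01-15.",
    call_judge=fake_judge,
)
print(verdict["scores"], verdict["justification"])
```

Validating the verdict before using it matters because the judge is itself an LLM and may return out-of-range scores or omit a criterion. A faithfulness check for RAG could reuse the same structure with a rubric that asks only whether every claim is supported by the provided context.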

## The Evaluation Flywheel
```
Capture (log all traces) → Surface (find failures) → Annotate (expert review)
→ Integrate (add to golden sets) → Test (regression) → Improve → repeat
```
- Every production bug becomes a permanent test case
- Future versions must pass all historical cases
- Gate deployments on evaluation thresholds
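
The last two bullets can be combined into a single deployment gate, sketched below under stated assumptions: the metric names, the threshold values, and the shape of the regression results (a mapping from case ID to pass/fail) are all hypothetical and would come from your own evaluation run.

```python
# Hypothetical thresholds; tune these to your own quality bar.
THRESHOLDS = {"task_success_rate": 0.90, "mean_judge_score": 4.0}


def gate(metrics: dict[str, float],
         regression_results: dict[str, bool],
         thresholds: dict[str, float] = THRESHOLDS) -> list[str]:
    """Return the reasons to block a deployment; an empty list means pass."""
    reasons = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            reasons.append(f"{name}={value} below threshold {minimum}")
    # Every historical case must pass, so any failure blocks the release.
    for case_id in sorted(regression_results):
        if not regression_results[case_id]:
            reasons.append(f"regression case failed: {case_id}")
    return reasons


reasons = gate(
    metrics={"task_success_rate": 0.93, "mean_judge_score": 3.8},
    regression_results={"bug-101-refund-loop": True, "bug-117-wrong-tz": False},
)
for reason in reasons:
    print("BLOCK:", reason)
```

In this example the gate reports two blocking reasons: the judge score is under its threshold and one historical case fails. In a CI job, a non-empty list would typically translate into a non-zero exit code.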

## The Four Pillars of Agent Quality
| Pillar | Definition | Metrics |
|--------|------------|---------|
| Effectiveness | Goal achieved? | Task success rate, accuracy |
| Efficiency | Without waste? | Tokens/task, cost/task, latency |
| Robustness | Handles errors? | Error recovery rate |
| Safety | Within bounds? | Guardrail trigger rate, PII leak rate |
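
The metrics in the table can be aggregated from per-run logs. The sketch below assumes a hypothetical `RunRecord` with one field per signal; real traces will likely need some extraction work to get to this shape, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical per-task log entry."""
    succeeded: bool
    tokens: int
    cost_usd: float
    latency_s: float
    errors_encountered: int
    errors_recovered: int
    guardrail_triggered: bool
    pii_leaked: bool


def pillar_metrics(runs: list[RunRecord]) -> dict[str, float]:
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to aggregate")
    total_errors = sum(r.errors_encountered for r in runs)
    recovered = sum(r.errors_recovered for r in runs)
    return {
        # Effectiveness
        "task_success_rate": sum(r.succeeded for r in runs) / n,
        # Efficiency
        "tokens_per_task": sum(r.tokens for r in runs) / n,
        "cost_per_task_usd": sum(r.cost_usd for r in runs) / n,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        # Robustness (vacuously 1.0 when no errors occurred)
        "error_recovery_rate": recovered / total_errors if total_errors else 1.0,
        # Safety
        "guardrail_trigger_rate": sum(r.guardrail_triggered for r in runs) / n,
        "pii_leak_rate": sum(r.pii_leaked for r in runs) / n,
    }


runs = [
    RunRecord(True, 1200, 0.02, 3.0, 1, 1, False, False),
    RunRecord(True, 800, 0.01, 2.0, 0, 0, False, False),
    RunRecord(False, 2000, 0.04, 6.0, 2, 1, True, False),
    RunRecord(True, 1000, 0.01, 1.0, 1, 1, False, False),
]
summary = pillar_metrics(runs)
print(summary)
```

Keeping all four pillars in one summary makes trade-offs visible, for example a change that raises task success rate while doubling tokens per task.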