# agent-validator

> Validate AI agents against production-level quality criteria with 0-100 scoring. Use when evaluating agent quality, identifying bugs/gaps, or improving agents to expert level. Evaluates agents across 9 categories: structure, role definition, methodology, user interaction, quality standards, context management, technical robustness, pedagogical effectiveness (tutors), and production readiness (operators). Returns actionable validation report with specific improvements.

- Author: Hafiz Naveed Uddin
- Repository: NAVEED261/Reusable-shop
- Version: 20260208190515
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-08
- Source: https://github.com/NAVEED261/Reusable-shop
- Web: https://mule.run/skillshub/@@NAVEED261/Reusable-shop~agent-validator:20260208190515

---

---
name: agent-validator
description: |
  Validate AI agents against production-level quality criteria with 0-100 scoring. Use when evaluating agent quality, identifying bugs/gaps, or improving agents to expert level. Evaluates agents across 9 categories: structure, role definition, methodology, user interaction, quality standards, context management, technical robustness, pedagogical effectiveness (tutors), and production readiness (operators). Returns actionable validation report with specific improvements.
---

# Agent Validator

Systematically evaluate AI agents against production-level criteria and generate actionable improvement roadmaps.

## Validation Workflow (4 Phases)

### Phase 1: Gather Context (5 min)

1. **Read agent file** completely
2. **Identify agent pattern** from content analysis:
   - **Tutor Pattern**: Teaching, progressive learning, practice exercises, encouraging tone
   - **Architect Pattern**: Design workflows, quality verification, trade-off analysis
   - **Operator Pattern**: Production procedures, operational readiness, runbooks
3. **Estimate line count** (target: 70-174 lines for agents)
4. **Note frontmatter** (name, description, model, color, skills)
5. **Identify expertise domains** from role description

### Phase 2: Apply Criteria (20 min)

Evaluate against **9 categories** with dynamic weighting based on agent pattern.

Each criterion scores **0-3 scale**:

- **0**: Missing/Absent
- **1**: Present but inadequate
- **2**: Adequate implementation
- **3**: Excellent implementation

### Phase 3: Calculate & Report (5 min)

- Calculate category scores: (Sum of criteria) / (Max possible) × 100
- Apply dynamic weights (redistribute unused weights)
- Calculate overall score: Σ(Category Score × Weight)
- Determine rating (Production/Good/Adequate/Developing/Incomplete)

### Phase 4: Generate Recommendations (10 min)

- Identify critical issues (blocks deployment)
- List high/medium/low priority improvements
- Suggest pattern-specific enhancements
- Provide strengths summary

---

## Evaluation Categories (9 Total)

### Category 1: Structure & Metadata (12%)

**Purpose**: Foundation quality (file organization, naming, frontmatter)

| Criterion | 0-3 Scoring |
|-----------|------------|
| **File structure** | 0: No frontmatter; 1: Incomplete YAML; 2: Valid YAML, missing fields; 3: Complete YAML (name, description, model, color) |
| **File size** | 0: >300 lines; 1: 200-300 lines; 2: 100-200 lines; 3: 70-150 lines (optimal for agents) |
| **Name constraints** | 0: Invalid format; 1: Multiple violations; 2: Minor issues; 3: Lowercase, hyphens/numbers only, ≤64 chars |
| **Description format** | 0: Absent/vague; 1: Incomplete; 2: Adequate [What/When]; 3: Excellent [What] + [When] + [Examples], ≤1024 chars |
| **Metadata completeness** | 0: Missing name/description; 1: Name or description absent; 2: Both present, minimal detail; 3: All fields present (name, description, model, color, skills) |

**Critical fail condition**: Missing frontmatter or >300 lines = automatic 0 score

---

### Category 2: Role Definition & Expertise (15-23%, dynamic)

**Purpose**: Clear positioning (what the agent does, expertise areas, specialization)

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Role clarity** | 0: Undefined/vague; 1: Stated but unclear; 2: Clear role statement; 3: Crystal-clear with specific expertise domain |
| **Expertise domains** | 0: None defined; 1: Vague list; 2: 3-5 domains listed; 3: 5+ domains clearly described with context |
| **Specialization** | 0: General/unfocused; 1: Broad focus; 2: Clear specialization; 3: Laser-focused expertise with unique value prop |
| **Anti-scope** | 0: Not mentioned; 1: Vague exclusions; 2: Some "must avoid" mentioned; 3: Clear "what we don't do" section |
| **Skill integration** | 0: No skills listed; 1: Skills listed, not explained; 2: Skills integrated into workflow; 3: Skills explicitly tied to use cases |

**Dynamic weight adjustment**: If agent has no pedagogical focus (not a tutor) → +8% to this category

---

### Category 3: Methodology & Workflow (14-22%, dynamic)

**Purpose**: Structured approach (how agent tackles problems, decision points, adaptation)

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Workflow structure** | 0: No workflow; 1: Mentions steps but vague; 2: 3-5 clear phases; 3: 4+ phases with numbered steps |
| **Context gathering** | 0: None mentioned; 1: Vague gathering; 2: Structured approach ("ask before acting"); 3: Clear protocols (what/when/how to gather) |
| **Progressive structure** | 0: All-at-once approach; 1: Some progression; 2: Clear progression; 3: Explicit layering (fundamentals → advanced) |
| **Decision points** | 0: No decisions shown; 1: Implicit decisions; 2: Some decision criteria; 3: Clear "when to do X" guidance |
| **Adaptation guidance** | 0: Rigid, no adaptation; 1: Mentions flexibility; 2: Some adaptation patterns; 3: Clear "if user needs X, then do Y" |
| **Feedback mechanisms** | 0: None; 1: Mentioned; 2: Structured feedback; 3: Clear feedback loops with verification steps |

---

### Category 4: User Interaction Patterns (12%)

**Purpose**: Quality of user engagement (how well agent communicates, clarifies, handles questions)

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Clarification strategy** | 0: No clarification; 1: Asks when stuck; 2: Proactive clarification; 3: Structured clarification protocol |
| **Tone & voice** | 0: Absent/robotic; 1: Generic; 2: Consistent persona; 3: Clear, encouraging, personable voice throughout |
| **Question quality** | 0: Asks obvious/vague questions; 1: Some unnecessary questions; 2: Generally good questions; 3: Targeted, necessary clarifications |
| **Error handling** | 0: No error protocol; 1: Vague handling; 2: Some error scenarios covered; 3: Clear "when this happens, do that" guidance |
| **Output communication** | 0: Undefined outputs; 1: Vague outputs; 2: General output style; 3: Clear output format with examples |

---

### Category 5: Quality Standards & Gates (13%)

**Purpose**: Verification (how agent ensures output quality, validates completeness, prevents bad outputs)

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Must Follow checklist** | 0: None; 1: Mentioned; 2: Partial checklist (3-4 items); 3: Complete checklist (5+ items) covering all critical aspects |
| **Must Avoid section** | 0: None; 1: Vague anti-patterns; 2: 2-3 specific anti-patterns; 3: 4+ specific, well-explained anti-patterns |
| **Verification steps** | 0: None; 1: Mentioned; 2: 2-3 verification steps; 3: Complete verification protocol with success criteria |
| **Quality gates** | 0: No gates; 1: Informal checks; 2: Structured quality check; 3: Explicit "before delivery" checklist |
| **Success criteria** | 0: Undefined; 1: Vague; 2: Stated; 3: Measurable, clear success definition |

---

### Category 6: Context Management (10-18%, dynamic)

**Purpose**: Efficiency (token optimization, tool usage, skill leverage, delegation decisions)

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Token efficiency awareness** | 0: Ignores context; 1: Mentioned; 2: Some optimization guidance; 3: Explicit token-saving strategies |
| **Tool usage strategy** | 0: No guidance; 1: Generic tool advice; 2: Tool selection criteria; 3: Clear "use tool X when Y" with rationale |
| **Skill leverage** | 0: No skills mentioned; 1: Skills listed; 2: Skills integrated into workflow; 3: Clear skill delegation strategy |
| **Sub-agent delegation** | 0: No delegation guidance; 1: Mentions delegation; 2: Some delegation criteria; 3: Clear "use Explore agent when X" patterns |
| **Context preservation** | 0: No mention; 1: Mentioned; 2: Some guidance; 3: Clear strategies for multi-turn context maintenance |

**Dynamic weight adjustment**: Central to all agents - increases to 18% if other categories need redistribution

---

### Category 7: Technical Robustness (8%)

**Purpose**: Reliability (error recovery, edge cases, dependencies, validation)

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Error recovery** | 0: No error handling; 1: Vague handling; 2: Some error scenarios; 3: Clear recovery strategies |
| **Edge case awareness** | 0: No mention; 1: Acknowledges complexity; 2: 2-3 edge cases addressed; 3: Common edge cases documented |
| **Dependency clarity** | 0: No dependencies noted; 1: Some mentioned; 2: Most dependencies clear; 3: All external dependencies documented |
| **Validation guidance** | 0: None; 1: Informal; 2: Structured validation; 3: Clear validation criteria for outputs |

---

### Category 8: Pedagogical Effectiveness (8%, tutors only)

**Purpose**: Teaching quality (TUTOR AGENTS ONLY): progressive learning, practice exercises, feedback

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Learning progression** | 0: All-at-once; 1: Vague progression; 2: Clear progression (basics → advanced); 3: Explicit layered approach with prerequisites |
| **Practice exercises** | 0: None; 1: Mentioned; 2: 2-3 exercises included; 3: Regular practice with difficulty levels |
| **Example quality** | 0: No examples; 1: Generic examples; 2: Domain-specific examples; 3: Runnable, well-explained examples |
| **Feedback mechanism** | 0: None; 1: Mentioned; 2: Some feedback guidance; 3: Clear "how to provide feedback" and "how to receive feedback" |

**Application**: Score if tutor pattern detected; redistribute 8% if not applicable

---

### Category 9: Production Readiness (8%, operators only)

**Purpose**: Operations quality (OPERATOR AGENTS ONLY): procedures, compliance, artifact completeness

| Criterion | 0-3 Scoring |
|-----------|------------|
| **Operational procedures** | 0: None; 1: Vague; 2: Some procedures; 3: Clear step-by-step operational procedures |
| **Compliance standards** | 0: None; 1: Mentioned; 2: Some standards; 3: Clear compliance requirements and verification |
| **Runbook completeness** | 0: No runbooks; 1: Mentioned; 2: Partial runbooks; 3: Complete runbooks with examples |
| **Production safeguards** | 0: None; 1: Mentioned; 2: Some safeguards; 3: Clear safeguards, validation, rollback procedures |

**Application**: Score if operator pattern detected; redistribute 8% if not applicable

---

## Agent Pattern Detection

### Tutor Pattern (Teaching-Focused)

**Keywords to detect**:

- "Learn", "teach", "instruct", "practice", "exercise", "beginner/intermediate/advanced", "explain", "understand"
- Progressive complexity mentioned
- Examples and code samples provided
- Encouragement and feedback mechanisms

**Required sections**:

- Learning progression
- Practice exercises (or clear method for practice)
- Examples (runnable or realistic)
- Feedback mechanisms

**Category 8 applies**: Pedagogical Effectiveness (8%)

**Category 9 redistributes**: +4% to Methodology, +4% to Role Definition

---

### Architect Pattern (Design-Focused)

**Keywords to detect**:

- "Design", "architecture", "quality", "best practices", "verification", "checklist", "trade-off", "pattern"
- Quality gates and verification steps
- Design workflows and decision frameworks
- Anti-patterns and must-avoid guidance

**Required sections**:

- Design workflow (phases/steps)
- Quality verification checklist
- Design trade-offs explained
- Best practices enforced

**Category 8 redistributes**: +8% to Role Definition

**Category 9 redistributes**: +8% to Context Management

---

### Operator Pattern (Operations-Focused)

**Keywords to detect**:

- "Production", "operational", "deploy", "monitor", "procedures", "runbook", "production-ready", "compliance"
- Clear operational steps
- Production safeguards and validation
- Deployment and rollback procedures

**Required sections**:

- Operational procedures (clear steps)
- Production safeguards
- Runbook(s) or deployment guide
- Compliance requirements

**Category 8 redistributes**: +8% to Role Definition

**Category 9 applies**: Production Readiness (8%)

---

## Scoring Methodology

### Category Score Calculation

```
Category Score = (Sum of criterion scores / Max possible points) × 100

Example:
- Criterion 1: 2/3
- Criterion 2: 3/3
- Criterion 3: 1/3
- Total: 6/9 = 0.667 × 100 = 66.7/100
```

### Overall Score Calculation

```
Overall Score = Σ(Category Score × Adjusted Weight)

With dynamic weight redistribution:
- Base weights = 100% total
- If category N/A → redistribute its weight to applicable categories
- Sum all weighted contributions
```

### Rating Thresholds

| Score Range | Rating | Meaning | Action |
|------------|--------|---------|--------|
| 90-100 | **Production** | Expert-level, ready for wide use | Deploy |
| 75-89 | **Good** | Solid functionality, minor improvements needed | Address High priority items |
| 60-74 | **Adequate** | Functional but needs work | Plan significant improvements |
| 40-59 | **Developing** | Significant gaps, not ready | Major rework required |
| 0-39 | **Incomplete** | Major issues, not deployable | Rebuild or retire |

---

## Output Format

Generate validation report using this structure:

```markdown
# Agent Validation Report: [agent-name]

**Pattern Detected**: [Tutor/Architect/Operator/Mixed/Unclear]
**Rating**: [Production/Good/Adequate/Developing/Incomplete]
**Overall Score**: [X]/100

## Summary

[2-3 sentence assessment of agent quality, pattern clarity, and main findings]

## Category Scores

| Category | Score | Weight | Weighted |
|----------|-------|--------|----------|
| Structure & Metadata | X/100 | 12% | X |
| Role Definition & Expertise | X/100 | 15-23% | X |
| Methodology & Workflow | X/100 | 14-22% | X |
| User Interaction Patterns | X/100 | 12% | X |
| Quality Standards & Gates | X/100 | 13% | X |
| Context Management | X/100 | 10-18% | X |
| Technical Robustness | X/100 | 8% | X |
| Pedagogical Effectiveness | X/100 | 0-8% | X |
| Production Readiness | X/100 | 0-8% | X |
| **Overall** | **X**/100 | - | - |

## Critical Issues (if any)

[If score < 60, list issues preventing deployment]

- [Issue 1 with impact]
- [Issue 2 with impact]

## Improvement Recommendations

### High Priority (Address first)

1. [Specific action with impact]
2. [Specific action with impact]

### Medium Priority (Address next)

1. [Specific action with impact]
2. [Specific action with impact]

### Low Priority (Nice to have)

1. [Specific action with benefit]

## Pattern Compliance Check

**Pattern Detected**: [Tutor/Architect/Operator]
**Pattern Requirements Met**: [X/Y]

- [ ] [Required section 1]
- [ ] [Required section 2]
- [ ] [Required section 3]

[List any missing pattern-specific requirements]

## Strengths

- [What the agent does well]
- [Key differentiator or strong point]
- [Technical or pedagogical strength]

## Weight Adjustments

[Document any dynamic weight redistribution applied]

- Category X weight adjusted: +Y% (reason)
- Category Z weight adjusted: -W% (reason)

---

**Recommendation**: [Next steps to reach Production level if not already there]
```

---

## Quick Validation Checklist

For rapid assessment, verify these critical items:

### Structure & Metadata (Must have)

- [ ] Frontmatter present (name, description)
- [ ] Agent length 70-200 lines (context-efficient)
- [ ] Name format valid (lowercase, hyphens, ≤64 chars)

### Pattern Clarity (Must have)

- [ ] Pattern identifiable (Tutor/Architect/Operator)
- [ ] Pattern requirements present
- [ ] Role clearly defined

### Core Content (Must have)

- [ ] Expertise domains listed
- [ ] Methodology/workflow described
- [ ] User interaction strategy present

### Quality Gates (Must have)

- [ ] Must Follow checklist present
- [ ] Must Avoid section present
- [ ] Verification steps defined

### Context Management (Must have)

- [ ] Tool/skill usage strategy mentioned
- [ ] Context efficiency addressed
- [ ] Delegation patterns present

**Scoring Quick Estimate**:

- All 5 must-haves present → Likely Good/Production (75+)
- 3-4 must-haves → Likely Adequate (60-74)
- <3 must-haves → Likely Developing (40-59)

---

## Reference Files

| File | Purpose | When to Read |
|------|---------|--------------|
| `references/detailed-criteria.md` | Full rubric with examples | Deep evaluation or uncertain scores |
| `references/agent-patterns.md` | Pattern definitions & requirements | Pattern identification or compliance check |
| `references/scoring-examples.md` | Calibration with real agents | Scoring consistency or calibration |
| `references/improvement-patterns.md` | Common issues & fixes | Generating actionable recommendations |

---

## Usage Examples

### Basic validation

```
Validate the database-skill-tutor agent against production criteria
```

### Pattern-focused review

```
Check if the prod-microservices-operator agent meets Operator pattern requirements
```

### Improvement planning

```
Validate frontend-ui-architect and generate a roadmap to reach 95+ score
```

---

## Agent Pattern Classification Summary

| Pattern | Use When | Key Characteristics |
|---------|----------|-------------------|
| **Tutor** | Teaching concepts, progressive learning | Progressive structure, practice exercises, encouragement |
| **Architect** | Design guidance, quality verification | Design workflows, checklists, best practices, trade-offs |
| **Operator** | Production operations, deployment | Procedures, runbooks, compliance, production safeguards |

See `references/agent-patterns.md` for detailed pattern requirements and detection guidelines.
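
---

## Scoring Sketch (Illustrative)

The arithmetic in the Scoring Methodology section (category scores, pattern-based weight redistribution, and rating thresholds) can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the function names, the dictionary keys, and the choice to treat each rating range as a lower bound (so a fractional score such as 89.5 rates as Good) are assumptions made here. The base weights and redistribution amounts are copied from the category headings and the Agent Pattern Detection section.

```python
from typing import Dict, List

# Base weights in percent, taken from the category headings (sum = 100).
# The short keys are hypothetical names chosen for this sketch.
BASE_WEIGHTS: Dict[str, int] = {
    "structure": 12, "role": 15, "methodology": 14, "interaction": 12,
    "quality": 13, "context": 10, "robustness": 8,
    "pedagogy": 8, "production": 8,
}

# For each pattern: which category is not applicable and where its 8% goes.
REDISTRIBUTION: Dict[str, Dict[str, Dict[str, int]]] = {
    "tutor": {"production": {"methodology": 4, "role": 4}},
    "architect": {"pedagogy": {"role": 8}, "production": {"context": 8}},
    "operator": {"pedagogy": {"role": 8}},
}


def category_score(criteria: List[int]) -> float:
    """(Sum of criterion scores / max possible points) x 100, criteria on 0-3."""
    if not criteria or any(not 0 <= c <= 3 for c in criteria):
        raise ValueError("each criterion must be scored 0-3")
    return sum(criteria) / (3 * len(criteria)) * 100


def adjusted_weights(pattern: str) -> Dict[str, int]:
    """Zero out non-applicable categories and move their weight elsewhere."""
    weights = dict(BASE_WEIGHTS)
    for dropped, targets in REDISTRIBUTION[pattern].items():
        weights[dropped] = 0
        for target, bonus in targets.items():
            weights[target] += bonus
    return weights


def overall_score(category_scores: Dict[str, float], pattern: str) -> float:
    """Sum of category score x adjusted weight over applicable categories."""
    weights = adjusted_weights(pattern)
    return sum(category_scores[name] * w / 100
               for name, w in weights.items() if w > 0)


def rating(score: float) -> str:
    """Map a 0-100 score to a rating; thresholds read as lower bounds (assumption)."""
    for floor, label in [(90, "Production"), (75, "Good"),
                         (60, "Adequate"), (40, "Developing")]:
        if score >= floor:
            return label
    return "Incomplete"
```

For example, `category_score([2, 3, 1])` reproduces the 66.7 worked example from the Category Score Calculation block, and `adjusted_weights("architect")` yields 23% for Role Definition and 18% for Context Management, with the two pattern-specific categories at 0%.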