# evaluate-skill > Evaluate Agent Skill design quality against specifications and best practices. Use when reviewing, auditing, or improving SKILL.md files. Provides multi-dimensional scoring and actionable improvements. - Author: Daniel Suazo Pavez - Repository: DanielSuazoPavez/claude-toolkit - Version: 20260127001543 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/DanielSuazoPavez/claude-toolkit - Web: https://mule.run/skillshub/@@DanielSuazoPavez/claude-toolkit~evaluate-skill:20260127001543 --- --- name: evaluate-skill description: Evaluate Agent Skill design quality against specifications and best practices. Use when reviewing, auditing, or improving SKILL.md files. Provides multi-dimensional scoring and actionable improvements. --- # Skill Judge Evaluate skill design quality against best practices. ## Core Philosophy **What is a Skill?** A knowledge externalization mechanism, not a tutorial. **The Formula:** `Good Skill = Expert-only Knowledge − What Claude Already Knows` Value = knowledge delta. Skills should contain decision trees, trade-offs, edge cases, domain frameworks—not basics Claude already understands. ## Three Knowledge Types | Type | Action | Example | |------|--------|---------| | **Expert** | Keep | Non-obvious decision trees, trade-offs | | **Activation** | Keep sparingly | Brief reminders of known concepts | | **Redundant** | Delete | Basic concepts Claude knows | ## Evaluation Dimensions (120 points) ### D1: Knowledge Delta (20 pts) - Most Critical Does it add genuine expert knowledge? - Red flags: "What is X" sections, generic best practices - Green flags: Non-obvious decisions, expert trade-offs ### D2: Mindset + Procedures (15 pts) Does it transfer expert thinking AND domain-specific workflows? ### D3: Anti-Pattern Quality (15 pts) Are anti-patterns specific with reasoning, not vague warnings? ### D4: Specification Compliance (15 pts) Is the description clear about WHAT, WHEN, and KEYWORDS for triggering? ### D5: Progressive Disclosure (15 pts) - Metadata: Always in memory - Body: Loaded when triggered - References: On-demand - Target: Under 500 lines #### Supporting Files Checklist When skill has companion files, also check: | Issue | Deduction | |-------|-----------| | Supporting file >500 lines | -3 per file | | No TOC when >100 lines | -2 per file | | Nested references (refs within refs) | -3 | | Bare reference (no context) | -1 each | | Orphaned file (never referenced) | -2 per file | ### D6: Freedom Calibration (15 pts) - Creative tasks → High freedom (principles) - Fragile operations → Low freedom (exact scripts) ### D7: Pattern Recognition (10 pts) Does it follow established patterns? - Mindset (~50 lines) - Navigation (~30 lines) - Philosophy (~150 lines) - Process (~200 lines) - Tool (~300 lines) ### D8: Practical Usability (15 pts) Decision trees, working examples, error handling, edge cases? ## Scoring Calibration ### D1 (Knowledge Delta) - 20 pts | Score | Criteria | |-------|----------| | 18-20 | Expert says "yes, this took years to learn" | | 14-17 | Useful but partially derivable from first principles | | 10-13 | Mostly activation knowledge, some expert bits | | 5-9 | Tutorial territory - explains basics | | 0-4 | Pure redundancy - Claude already knows this | ### D3 (Anti-Pattern Quality) - 15 pts | Score | Criteria | |-------|----------| | 13-15 | Specific anti-patterns with reasoning AND fixes | | 10-12 | Named anti-patterns with some reasoning | | 6-9 | Vague warnings ("avoid bad practices") | | 0-5 | No anti-patterns section | ## Edge Cases | Skill Type | Evaluation Adjustment | |------------|----------------------| | **Reset/Calibration** | D1 judged on "resets behavior effectively" not "adds knowledge" | | **Meta-skills** | Self-reference is fine if genuinely useful | | **Navigation** | Minimal is correct; penalize bloat, not brevity | | **Wrapper/Utility** | Lower D1 bar if procedural value is clear | ## Grading Scale | Grade | Score | Description | |-------|-------|-------------| | A | 108+ (90%) | Exemplary - reference quality | | A- | 102-107 (85-89%) | Excellent - minimal polish needed | | B+ | 96-101 (80-84%) | Solid - minor improvements | | B | 90-95 (75-79%) | Good - clear path forward | | B- | 84-89 (70-74%) | Functional - needs attention | | C+ | 78-83 (65-69%) | Adequate - notable gaps | | C | 72-77 (60-64%) | Needs work | | D | 60-71 (50-59%) | Significant issues | | F | <60 (<50%) | Needs redesign | ## Common Failures | Failure | How to Recognize | How to Fix | |---------|------------------|------------| | **The Tutorial** | "What is X" sections, explains basics | Delete basics, keep only expert delta | | **The Dump** | 800+ lines, everything included | Split into skill + references | | **The Invisible Skill** | Great content, vague description | Add WHEN and KEYWORDS to description | | **The Freedom Mismatch** | Rigid for creative, vague for fragile | Match freedom to task risk | ## Evaluation Protocol **Use a subagent** to run evaluations - avoids self-evaluation bias when reviewing your own work. 1. Read completely, mark sections as [E]xpert, [A]ctivation, [R]edundant 2. Analyze structure: frontmatter, line count, pattern 3. Score each dimension with evidence 4. Calculate total, assign grade 5. Generate report with critical issues and top 3 improvements ## Example Evaluation **Before (D - 62/120):** ```markdown # Git Workflow Use branches for features. Commit often. Write good messages. ``` - D1: 6/20 - Claude knows this - D3: 0/15 - No anti-patterns - D8: 4/15 - No decision trees **After (A - 112/120):** ```markdown # Git Workflow ## Branch Naming Decision Tree [specific tree based on team conventions] ## Commit Sizing | Change Type | Commit Strategy | [expert guidance on atomic commits] ## Anti-Patterns | Pattern | Why Bad | Fix | | Mega-commit | Unreviewable | [specific split strategy] ``` - D1: 18/20 - Team-specific expert knowledge - D3: 14/15 - Specific anti-patterns with fixes - D8: 14/15 - Decision trees, examples ## The Meta-Question > "Would an expert say this captures knowledge requiring years to learn?" If yes → genuine value. If no → it's compressing what Claude already knows.