# skill-judge

> Evaluate Agent Skill design quality against specifications and best practices. Use when reviewing, auditing, or improving SKILL.md files. Provides multi-dimensional scoring and actionable improvements.

- Author: Daniel Suazo Pavez
- Repository: DanielSuazoPavez/dotfiles
- Version: 20260124135100
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/DanielSuazoPavez/dotfiles
- Web: https://mule.run/skillshub/@@DanielSuazoPavez/dotfiles~skill-judge:20260124135100

---

---
name: skill-judge
description: Evaluate Agent Skill design quality against specifications and best practices. Use when reviewing, auditing, or improving SKILL.md files. Provides multi-dimensional scoring and actionable improvements.
disable-model-invocation: true
---

# Skill Judge

Evaluate skill design quality against best practices.

## Core Philosophy

**What is a Skill?** A knowledge externalization mechanism, not a tutorial.

**The Formula:** `Good Skill = Expert-only Knowledge − What Claude Already Knows`

Value = knowledge delta. Skills should contain decision trees, trade-offs, edge cases, domain frameworks—not basics Claude already understands.

## Three Knowledge Types

| Type | Action | Example |
|------|--------|---------|
| **Expert** | Keep | Non-obvious decision trees, trade-offs |
| **Activation** | Keep sparingly | Brief reminders of known concepts |
| **Redundant** | Delete | Basic concepts Claude knows |

## Evaluation Dimensions (120 points)

### D1: Knowledge Delta (20 pts) - Most Critical
Does it add genuine expert knowledge?
- Red flags: "What is X" sections, generic best practices
- Green flags: Non-obvious decisions, expert trade-offs

### D2: Mindset + Procedures (15 pts)
Does it transfer expert thinking AND domain-specific workflows?

### D3: Anti-Pattern Quality (15 pts)
Are anti-patterns specific with reasoning, not vague warnings?

### D4: Specification Compliance (15 pts)
Is the description clear about WHAT, WHEN, and KEYWORDS for triggering?

### D5: Progressive Disclosure (15 pts)
- Metadata: Always in memory
- Body: Loaded when triggered
- References: On-demand
- Target: Under 500 lines

### D6: Freedom Calibration (15 pts)
- Creative tasks → High freedom (principles)
- Fragile operations → Low freedom (exact scripts)

### D7: Pattern Recognition (10 pts)
Does it follow established patterns?
- Mindset (~50 lines)
- Navigation (~30 lines)
- Philosophy (~150 lines)
- Process (~200 lines)
- Tool (~300 lines)

### D8: Practical Usability (15 pts)
Decision trees, working examples, error handling, edge cases?

## Grading Scale

| Grade | Score | Status |
|-------|-------|--------|
| A | 90%+ (108+) | Production-ready |
| B | 80-89% | Minor improvements |
| C | 70-79% | Clear improvement path |
| D | 60-69% | Significant issues |
| F | <60% | Needs redesign |

## Common Failures

1. **The Tutorial** - Explains basics Claude knows
2. **The Dump** - 800+ lines, everything included
3. **The Invisible Skill** - Great content, vague description
4. **The Freedom Mismatch** - Rigid for creative, vague for fragile

## Evaluation Protocol

1. Read completely, mark sections as [E]xpert, [A]ctivation, [R]edundant
2. Analyze structure: frontmatter, line count, pattern
3. Score each dimension with evidence
4. Calculate total, assign grade
5. Generate report with critical issues and top 3 improvements

## The Meta-Question

> "Would an expert say this captures knowledge requiring years to learn?"

If yes → genuine value. If no → it's compressing what Claude already knows.