# evaluate-hook

> Evaluate Claude Code hook quality. Use when reviewing, auditing, or improving hooks before deployment.

- Author: Daniel Suazo Pavez
- Repository: DanielSuazoPavez/claude-toolkit
- Version: 20260127001543
- Last Updated: 2026-02-06
- Source: https://github.com/DanielSuazoPavez/claude-toolkit
- Web: https://mule.run/skillshub/@@DanielSuazoPavez/claude-toolkit~evaluate-hook:20260127001543

---

---
name: evaluate-hook
description: Evaluate Claude Code hook quality. Use when reviewing, auditing, or improving hooks before deployment.
---

# Hook Judge

Evaluate hook quality against hook-specific best practices.

## When to Use

- Reviewing a hook before deployment
- Auditing existing hooks for quality
- Improving a hook that's causing issues

## Core Philosophy

**What is a Hook?** A safety/automation gate, not application logic.

**The Formula:** `Good Hook = Correct Behavior + Testable + Maintainable`

Hooks guard critical operations, so they must behave reliably, be testable purely through stdin/stdout, and fail gracefully on bad input.

## Evaluation Dimensions (100 points)

### D1: Correctness (25 pts) - Most Critical

| Score | Criteria |
|-------|----------|
| 22-25 | Correct event, matcher, output format; handles edge cases |
| 17-21 | Mostly correct, minor edge case gaps |
| 10-16 | Works for happy path, misses important cases |
| 0-9 | Wrong event type, broken output format, or logic errors |

**Check:**
- Right hook event? (PreToolUse to block a tool call before it runs, PostToolUse to react after it completes)
- Output format correct? (`{"decision":"block","reason":"..."}` or empty)
- Early exit for non-matching tools?
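
The three checks above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the hook receives a JSON payload on stdin with `tool_name` and `tool_input.command` fields and that `jq` is on `PATH`. The function name `hook_main`, the pytest rule, and the reason text are hypothetical.

```shell
#!/bin/sh
# Sketch of a blocking PreToolUse hook.
# Assumptions: JSON payload on stdin with .tool_name and .tool_input.command;
# jq is installed. The pytest rule is only an illustration.

hook_main() {
  input=$(cat)

  # Early exit: any tool other than Bash is allowed (empty output).
  tool=$(printf '%s' "$input" | jq -r '.tool_name // empty')
  [ "$tool" = "Bash" ] || return 0

  cmd=$(printf '%s' "$input" | jq -r '.tool_input.command // empty')

  case "$cmd" in
    pytest*|*" pytest"*)
      # Let jq build the JSON so the reason string is always escaped correctly.
      jq -c -n --arg reason "Run tests with 'make test' instead of pytest." \
        '{decision: "block", reason: $reason}'
      ;;
  esac
  return 0
}
```

Saved as `hook.sh`, the file would end with a call to `hook_main`. An allowed call prints nothing, which matches the "or empty" half of the output format check.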

### D2: Testability (20 pts)

| Score | Criteria |
|-------|----------|
| 18-20 | Can test via stdin/stdout, clear block/allow cases documented |
| 13-17 | Testable but test cases not documented |
| 7-12 | Hard to test (external dependencies, side effects) |
| 0-6 | Untestable (hardcoded paths, no clear inputs/outputs) |

**Check:**
- Can you run `echo '{"tool_name":...}' | ./hook.sh`?
- Are block and allow test cases obvious?
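
One way to make block and allow cases obvious is a tiny harness that pipes a payload into the hook and classifies the output. The sketch below is an assumption about how such a harness might look; `demo_hook` is a stand-in so the example runs on its own, and the payload fields are illustrative. To test a real hook, set `HOOK` to its path, for example `./hook.sh`.

```shell
#!/bin/sh
# Sketch of a stdin/stdout test harness for a hook.
# demo_hook is a stand-in; replace HOOK with the path to a real hook script.

demo_hook() {
  # Stand-in behavior: block any payload that mentions pytest.
  case "$(cat)" in
    *pytest*) printf '%s\n' '{"decision":"block","reason":"use make test"}' ;;
  esac
}

HOOK=demo_hook
failures=0

# run_case EXPECTED PAYLOAD: EXPECTED is "block" or "allow".
run_case() {
  output=$(printf '%s' "$2" | "$HOOK")
  case "$output" in
    *'"decision"'*'"block"'*) actual=block ;;
    "") actual=allow ;;
    *) actual=unexpected ;;
  esac
  if [ "$actual" = "$1" ]; then
    echo "ok: expected $1"
  else
    echo "FAIL: expected $1, got $actual"
    failures=$((failures + 1))
  fi
}

run_case block '{"tool_name":"Bash","tool_input":{"command":"pytest tests/"}}'
run_case allow '{"tool_name":"Bash","tool_input":{"command":"make test"}}'
echo "failures: $failures"
```

Documenting one `run_case` line per expected behavior next to the hook covers the "test cases documented" criterion in the 18-20 band.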

### D3: Safety & Robustness (20 pts)

| Score | Criteria |
|-------|----------|
| 18-20 | Handles errors gracefully, logs failures, has allowlist |
| 13-17 | Basic error handling, some edge cases covered |
| 7-12 | Fails silently on errors, no allowlist |
| 0-6 | Crashes on bad input, blocks legitimate work |

**Check:**
- What happens if jq fails or input is malformed?
- Does it have an allowlist for safe exceptions?
- Is it overly strict (blocks legitimate operations)?
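
A hedged sketch of how a hook might answer these questions: it logs to stderr when `jq` is missing or the input cannot be parsed, and it consults an allowlist before blocking. Failing open (allowing the call on error) is an assumption made for this example; a stricter hook could fail closed instead. The names `safe_hook`, `log`, and `is_allowlisted`, the allowlisted commands, and the payload fields `tool_name` and `tool_input.command` are all hypothetical.

```shell
#!/bin/sh
# Sketch: logged, fail-open error handling plus an allowlist.
# Assumptions: JSON payload on stdin with .tool_name and .tool_input.command.

# Log to stderr so stdout stays reserved for the decision.
log() { printf 'hook: %s\n' "$*" >&2; }

# Hypothetical allowlist: exact commands that are always permitted.
is_allowlisted() {
  case "$1" in
    "pytest --version"|"pytest --help") return 0 ;;
    *) return 1 ;;
  esac
}

safe_hook() {
  input=$(cat)

  if ! command -v jq >/dev/null 2>&1; then
    log "jq not found; allowing"
    return 0
  fi

  # jq exits non-zero on malformed JSON, and -e also on a null tool_name.
  if ! tool=$(printf '%s' "$input" | jq -e -r '.tool_name' 2>/dev/null); then
    log "could not parse input; allowing"
    return 0
  fi
  [ "$tool" = "Bash" ] || return 0

  cmd=$(printf '%s' "$input" | jq -r '.tool_input.command // empty' 2>/dev/null)
  if is_allowlisted "$cmd"; then
    return 0
  fi

  case "$cmd" in
    pytest*) printf '%s\n' '{"decision":"block","reason":"Run tests with make test."}' ;;
  esac
  return 0
}
```

Feeding the function `not json` should produce an empty stdout and one log line on stderr, which is the difference between "handles errors gracefully" and "fails silently" in the table above.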

### D4: Maintainability (20 pts)

| Score | Criteria |
|-------|----------|
| 18-20 | Clear structure, configurable (safety levels), no hardcoded paths |
| 13-17 | Readable but some hardcoding |
| 7-12 | Works but hard to modify |
| 0-6 | Spaghetti logic, magic values everywhere |

**Check:**
- Uses `$HOME` or env vars instead of hardcoded paths?
- Safety level configurable via a single constant?
- Logic easy to extend with new patterns?
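
As an illustration of these three checks, the sketch below keeps every tunable at the top of the file. All names here are assumptions made for the example: the `SAFETY_LEVEL` values, the `HOOK_LOG_DIR` variable, the log location under `$HOME`, and the pattern lists.

```shell
#!/bin/sh
# Sketch: all tunables live at the top of the hook.
# SAFETY_LEVEL, HOOK_LOG_DIR, the log location, and the patterns are hypothetical.

SAFETY_LEVEL="${SAFETY_LEVEL:-strict}"          # strict | relaxed | off
LOG_DIR="${HOOK_LOG_DIR:-$HOME/.claude/logs}"   # derived from $HOME, not a fixed user path

# One shell glob per line; add a line to cover a new command.
STRICT_PATTERNS='pytest*
ruff*'
RELAXED_PATTERNS='pytest*'

patterns_for_level() {
  case "$SAFETY_LEVEL" in
    relaxed) printf '%s\n' "$RELAXED_PATTERNS" ;;
    off)     : ;;                                  # nothing is blocked
    *)       printf '%s\n' "$STRICT_PATTERNS" ;;   # strict, and any unknown value
  esac
}

# should_block COMMAND: succeeds when COMMAND matches a pattern for the level.
should_block() {
  while IFS= read -r pattern; do
    [ -n "$pattern" ] || continue
    case "$1" in
      $pattern) return 0 ;;
    esac
  done <<EOF
$(patterns_for_level)
EOF
  return 1
}

# log_block COMMAND: appends a line to the log file under LOG_DIR.
log_block() {
  mkdir -p "$LOG_DIR" && printf 'blocked: %s\n' "$1" >>"$LOG_DIR/hook.log"
}
```

Treating an unknown `SAFETY_LEVEL` as strict is a design choice for this sketch: a typo in the constant then tightens the hook rather than silently disabling it.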

### D5: Documentation (15 pts)

| Score | Criteria |
|-------|----------|
| 13-15 | Purpose clear, test commands documented, settings.json example |
| 9-12 | Purpose clear, minimal docs |
| 4-8 | Unclear what it does or how to configure |
| 0-3 | No documentation |

## Grading Scale

| Grade | Score | Description |
|-------|-------|-------------|
| A | 90+ | Production-ready |
| B | 75-89 | Good, minor improvements needed |
| C | 60-74 | Functional but notable gaps |
| D | 40-59 | Significant issues |
| F | <40 | Not safe to deploy |
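
The mapping from dimension scores to a grade is simple arithmetic. A minimal sketch, with hypothetical function names `grade_for` and `report`, using the scores from the Example Evaluation section of this document:

```shell
#!/bin/sh
# Sketch: total the five dimension scores and map the total to a letter grade.

# grade_for TOTAL: prints the letter grade from the grading scale.
grade_for() {
  if   [ "$1" -ge 90 ]; then echo A
  elif [ "$1" -ge 75 ]; then echo B
  elif [ "$1" -ge 60 ]; then echo C
  elif [ "$1" -ge 40 ]; then echo D
  else echo F
  fi
}

# report D1 D2 D3 D4 D5: prints the total and its grade.
report() {
  total=$(( $1 + $2 + $3 + $4 + $5 ))
  echo "Total: $total/100 - Grade $(grade_for "$total")"
}

report 22 16 15 17 10   # prints "Total: 80/100 - Grade B"
```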

## Evaluation Protocol

**Use a subagent** to run evaluations. This avoids self-evaluation bias when reviewing your own work.

1. Identify hook event and matcher
2. Verify output format matches spec
3. Check for early exit on non-matching tools
4. Look for error handling and allowlists
5. Verify testability via stdin/stdout
6. Score each dimension with evidence
7. Generate report with grade and top 3 improvements

## Anti-Patterns

| Pattern | Problem | Score Impact |
|---------|---------|--------------|
| **Wrong output format** | Hook doesn't block when it should | D1: -15 |
| **No early exit** | Processes every tool, wastes cycles | D1: -5, D4: -5 |
| **Silent failures** | Errors go unnoticed | D3: -10 |
| **Hardcoded paths** | Breaks on other machines | D4: -10 |
| **No allowlist** | Blocks legitimate work | D3: -8 |
| **Untestable** | Can't verify behavior | D2: -15 |

## Edge Cases

| Hook Type | Scoring Adjustment |
|-----------|-------------------|
| **Logging-only** | Lower D1 bar (no blocking logic); D2 and D5 still matter |
| **Simple passthrough** | Minimal is fine if purpose is clear |
| **Multi-tool** | Higher D4 bar (must handle all matched tools) |

## Example Evaluation

**Hook:** `enforce-make-commands.sh` (blocks direct pytest/ruff, suggests make targets)

| Dimension | Score | Evidence |
|-----------|-------|----------|
| D1: Correctness | 22/25 | Right event (PreToolUse), correct output format, early exit for non-Bash |
| D2: Testability | 16/20 | Testable via stdin, but no documented test cases |
| D3: Safety & Robustness | 15/20 | No error handling if jq fails, no allowlist |
| D4: Maintainability | 17/20 | Clear structure, but patterns could be configurable |
| D5: Documentation | 10/15 | Purpose clear from comments, no settings.json example |

**Total: 80/100 - Grade B**

**Top improvements:** Add jq error handling, document test cases, add settings.json example.