# spec-tests

> Intent-based specification tests evaluated by LLM-as-judge. Use when the user asks to "create spec tests", "write intent tests", "TDD with intent", "natural language tests", or wants tests that capture WHY, not just WHAT. NOT pytest/jest/unittest - natural language specs Claude evaluates.

- Author: Ian Philpot
- Repository: ianphil/my-skills
- Version: 20260118112016
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/ianphil/my-skills
- Web: https://mule.run/skillshub/@@ianphil/my-skills~spec-tests:20260118112016

---

---
name: spec-tests
version: 1.1.0
description: >
  Intent-based specification tests evaluated by LLM-as-judge. Use when the
  user asks to "create spec tests", "write intent tests", "TDD with intent",
  "natural language tests", or wants tests that capture WHY, not just WHAT.
  NOT pytest/jest/unittest - natural language specs Claude evaluates.
---

# Spec Tests: Intent-Based Testing for LLM Development

Spec tests are **intent-based specifications** that Claude evaluates as judge. They capture WHY something matters—making them cheat-proof for LLM-driven development.

## The TDD Flow

```
1. Plan         → Define what you're building
2. Spec (red)   → Write intent tests (they fail - no implementation yet)
3. Implement    → Build the feature
4. Spec (green) → Tests pass (Claude confirms intent is satisfied)
```

---

## Test File Format

```markdown
# Feature Name

## Test Group

### Test Case Name

Intent statement explaining WHY this test matters. What user need does it
serve? What breaks if this doesn't work?

\`\`\`
Given [precondition]
When [action]
Then [expected outcome]
\`\`\`
```

Structure: **H2** = test group, **H3** = test case, **intent** = required statement, **code block** = expected behavior.

**Critical:** The intent statement must appear **immediately above** the code block, between the H3 header and the assertion block. Section-level intent does not count—each test case needs its own WHY directly before its code block.

Each test must include a fenced code block. Missing code blocks fail with `[missing-assertion]`.

---

## Test Location & Targets

Spec tests live in `specs/tests/` and declare their target(s) via frontmatter.

**Single target:**

```markdown
---
target: src/auth.py
---
# Authentication Tests
```

**Multiple targets:**

```markdown
---
target:
  - src/auth.py
  - src/session.py
---
# Authentication Flow
```

**Directory structure** — name files by feature/spec, not by target path:

```
specs/tests/
  authentication.md      ← target: [src/auth.py, src/session.py]
  intent-requirement.md  ← target: [SKILL.md]
  api-validation.md      ← target: [src/api/validate.py]
```

**Frontmatter is required.** Missing `target:` causes immediate failure with `[missing-target]`.

---

## Running Tests

Copy the runner files to your project:

```bash
cp "${CLAUDE_PLUGIN_ROOT}/scripts/run_tests_claude.py" specs/tests/
cp "${CLAUDE_PLUGIN_ROOT}/scripts/judge_prompt.md" specs/tests/
```

Run tests:

```bash
python specs/tests/run_tests_claude.py specs/tests/authentication.md  # Single spec
python specs/tests/run_tests_claude.py specs/tests/                   # All specs
python specs/tests/run_tests_claude.py specs/tests/auth.md --test "Valid Credentials"  # Single test
```

Uses `claude -p` (your subscription, no API key needed).

**Options:**

| Flag | Purpose |
|------|---------|
| `--target FILE` | Override frontmatter target |
| `--model MODEL` | Claude model (default: sonnet) |
| `--test "Name"` | Run only named test |

**Timeout:** 60-300 seconds per test.
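
Putting the format, frontmatter, and runner together, a minimal spec might look like the sketch below. This is a hedged illustration, not part of the skill: the target path, feature, and `slugify()` function are hypothetical placeholders.

```markdown
---
target: src/slugify.py
---
<!-- Hypothetical example: the target path and slugify() are placeholders. -->
# Slug Generation

## URL Safety

### Strips Unsafe Characters

Slugs appear in shareable URLs. Unescaped punctuation breaks links when
pasted into chat clients and emails, so stripping it is a correctness
requirement, not a style preference.

\`\`\`
Given the title "Hello, World!"
When slugify() is called
Then the result is "hello-world" with no punctuation or spaces
\`\`\`
```

Saved as `specs/tests/slug-generation.md`, it would run with `python specs/tests/run_tests_claude.py specs/tests/slug-generation.md`.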
---

## Why Intent Matters

LLMs can "game" tests by changing them instead of fixing code.

**Without intent** (fails with `[missing-intent]`):

```markdown
### Completes Quickly

\`\`\`
elapsed < 50ms
\`\`\`
```

The LLM thinks: "50 seems arbitrary, change it to 100." The user gets a laggy editor.

**With intent:**

```markdown
### Completes Quickly

Users perceive delays over 50ms as laggy. This runs on every keystroke.
The 50ms target is a UX requirement, not negotiable.

\`\`\`
Given a keystroke event
When process_keystroke() is called
Then it completes in under 50ms
\`\`\`
```

Claude-as-judge evaluates: Does it satisfy the UX requirement? Relaxing the threshold → `[intent-violated]`.

**Intent properties:**

- **Required** — Missing intent → `[missing-intent]` before evaluation
- **Per-test** — Each test needs its own WHY above the code block
- **Business-focused** — Why users/product care, not technical details
- **Evaluative** — Catches "legal but wrong" solutions

---

## Reference Files

For detailed patterns, consult:

- **`references/evaluation.md`** — Error codes, response format, strictness rules, alternative runners, template variables
- **`references/multi-target.md`** — Writing tests for multiple targets, multi-file Given syntax
- **`references/examples.md`** — Complete examples, porting tests across languages
- **`references/meta-content.md`** — Testing prompt files and directive-like content

---

## Checklist

- [ ] Each test has an intent statement explaining WHY
- [ ] Intent is business/user focused
- [ ] Expected behavior is clear
- [ ] Each test includes a fenced assertion code block
- [ ] One behavior per test case
- [ ] Multi-target specs: each test starts with `Given the file` (see the sketch below)

> **Missing intent = immediate failure.** The runner rejects tests without intent statements before evaluating behavior.
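
To illustrate that last checklist item, here is a hedged sketch of a multi-target test case. The file names and `login()`/`create_session()` functions are assumptions made up for this example, and the exact `Given the file` phrasing should be checked against `references/multi-target.md`, which is the authoritative reference.

```markdown
---
target:
  - src/auth.py
  - src/session.py
---
<!-- Hypothetical sketch: login() and create_session() are placeholder names. -->
# Authentication Flow

## Login

### Valid Login Creates a Session

A login that authenticates but never establishes a session locks users out
silently. Because two files must cooperate here, the test names each file it
exercises.

\`\`\`
Given the file src/auth.py accepts valid credentials via login()
When login() succeeds
Then create_session() in src/session.py is called for that user
\`\`\`
```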