# session-title-eval

> Build and run evaluations for session title generation quality. Use after modifying prompts, when curating test cases, or for periodic quality audits.

- Author: Michael Fairchild
- Repository: fairchild/dotclaude
- Version: 20260121222431
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/fairchild/dotclaude
- Web: https://mule.run/skillshub/@@fairchild/dotclaude~session-title-eval:20260121222431

---

---
name: session-title-eval
description: Build and run evaluations for session title generation quality. Use after modifying prompts, when curating test cases, or for periodic quality audits.
license: Apache-2.0
status: wip
---

## Overview

This skill helps evaluate and improve session title generation by:
1. Extracting test cases from existing sessions
2. Maintaining a curated "golden" dataset with ideal titles
3. Running evaluations with multiple metrics
4. Generating reports to identify improvements

## Directory Structure

```
data/
├── candidates.jsonl   # Raw extracted (uncurated)
├── golden.jsonl       # Curated with ideal titles + scores
└── results/           # Eval run outputs
```

## Workflows

### 1. Extract Candidates

Pull test cases from existing session transcripts:

```bash
bun ~/.claude/skills/session-title-eval/scripts/extract-candidates.ts
```

Options:
- `--limit N` - Extract at most N candidates
- `--project NAME` - Filter to specific project

### 2. Curate Golden Dataset

Review `data/candidates.jsonl` and for each promising case:
1. Add `ideal_title` - what the title SHOULD be
2. Add `human_score` (1-5) - how good is the generated title
3. Add `notes` - why you scored it that way
4. Move to `data/golden.jsonl`

### 3. Run Evaluation

Test current title generation against golden dataset:

```bash
bun ~/.claude/skills/session-title-eval/scripts/run-eval.ts
```

Options:
- `--judge-model MODEL` - LLM for judging (default: haiku)

Outputs to `data/results/{timestamp}.jsonl`

### 4. Generate Report

Create readable summary of latest eval run:

```bash
bun ~/.claude/skills/session-title-eval/scripts/report.ts
```

## Metrics

| Metric | Description |
|--------|-------------|
| Fuzzy Match | Levenshtein similarity to ideal title (0-1) |
| LLM Judge | Qualitative score from AI reviewer (1-5) |
| Human Score | Your rating from golden dataset (1-5) |

## Scoring Rubric

See `references/scoring-rubric.md` for detailed criteria.

| Score | Meaning |
|-------|---------|
| 5 | Perfect - specific, actionable, concise |
| 4 | Good - minor issues |
| 3 | Acceptable - gets the gist |
| 2 | Poor - too vague or wrong focus |
| 1 | Bad - misses the point entirely |