# post-experiment-truth-extractor

> Extract defensible conclusions from experiment results without overclaiming. Use when reading test results, interpreting p-values or intervals, or deciding what can actually be said with confidence.

- Author: Yannik Pitcan
- Repository: pitcany/claude-config
- Version: 20260103063103
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/pitcany/claude-config
- Web: https://mule.run/skillshub/@@pitcany/claude-config~post-experiment-truth-extractor:20260103063103

---

---
name: post-experiment-truth-extractor
description: Extract defensible conclusions from experiment results without overclaiming. Use when reading test results, interpreting p-values or intervals, or deciding what can actually be said with confidence.
---

# Post-Experiment Truth Extractor

## Overview

Determine what can and cannot be legitimately claimed from experiment results. Prevents the common failure of overstating findings or missing important caveats.

## When to Use

- Interpreting A/B test results
- Writing up experiment conclusions
- Deciding whether to ship based on data
- Reviewing others' experimental claims
- Translating statistical results for stakeholders

## Quick Reference

| Result | What You Can Say | What You Cannot Say |
|--------|------------------|---------------------|
| p < 0.05, positive effect | "Evidence of an effect" | "Definitely works" |
| p > 0.05, positive point estimate | "No clear evidence" | "No effect" |
| Wide confidence interval | "Effect could range from X to Y" | "Effect is about Z" |
| Effect smaller than MDE | "May not be practically meaningful" | "Doesn't work" |
| Significant in one metric | "Improved metric A" | "Improved overall" |

## Interpretation Framework

### Statistical vs Practical Significance

```
Statistical significance: p < 0.05
  → Evidence that effect is non-zero

Practical significance: Effect > MDE
  → Effect is large enough to matter

Both required for confident action.
```

### Confidence Interval Interpretation

```
95% CI: [1.2%, 4.8%]
  ✓ "We expect the true effect is between 1.2% and 4.8%"
  ✓ "The effect is likely positive, ranging from small to moderate"
  ✗ "The effect is 3%" (this is just the point estimate)
  ✗ "95% chance the effect is in this range" (frequentist misinterpretation)
```

## Truth Extraction Checklist

### What the Data Shows

- [ ] What is the point estimate?
- [ ] What is the confidence interval?
- [ ] Is it statistically significant?
- [ ] Is the effect practically meaningful?
- [ ] Were guardrail metrics affected?

### What the Data Does NOT Show

- [ ] Does this generalize beyond the test population?
- [ ] Is this effect durable (not novelty)?
- [ ] What drove the effect (mechanism)?
- [ ] Will this work for different segments?
- [ ] Are there long-term effects we haven't measured?

## Common Overclaims

| Overclaim | Reality | Correction |
|-----------|---------|------------|
| "X increases conversion by 5%" | Point estimate with uncertainty | "X increases conversion by 2-8% (95% CI)" |
| "No effect found" | Failed to detect effect | "Unable to detect effect at this sample size" |
| "Works for all users" | Tested on sample | "Works for tested population" |
| "Proven to work" | Single experiment | "Evidence of effect in one test" |
| "Better than control" | Within confidence interval | "Directionally better, CI includes zero" |

## Honest Reporting Templates

### Conclusive Positive Result

```markdown
**Result**: Treatment improved [metric] by [X-Y%] (95% CI)
**Confidence**: High (p < 0.05, effect > MDE)
**Caveats**: [population, duration, novelty considerations]
**Recommendation**: Ship
```

### Inconclusive Result

```markdown
**Result**: Point estimate of [X%], but CI includes zero [-Y%, +Z%]
**Confidence**: Low (insufficient power or small effect)
**What this means**: Cannot conclude effect is non-zero
**Recommendation**: [Run longer / Accept uncertainty / Redesign]
```

### Negative Result

```markdown
**Result**: Treatment decreased [metric] by [X-Y%] (95% CI)
**Confidence**: High (p < 0.05 for negative effect)
**Learnings**: [What we learned from this failure]
**Recommendation**: Do not ship; investigate mechanism
```

## Subgroup Analysis Caveats

```markdown
If testing multiple segments:
- Pre-specified segments → more credible
- Post-hoc segments → exploratory only
- Multiple testing correction needed
- Winner's curse inflates effect sizes
```

## Questions to Ask Before Claiming

1. **Would I be comfortable if a skeptic reviewed this?**
2. **Am I reporting the full uncertainty, not just point estimate?**
3. **Am I acknowledging what I don't know?**
4. **Would I bet real money on this effect persisting?**
5. **Am I distinguishing "no evidence" from "evidence of no effect"?**

## Common Mistakes

| Mistake | Problem | Fix |
|---------|---------|-----|
| Reporting only point estimate | Hides uncertainty | Always report CI |
| "No effect" when p > 0.05 | Absence of evidence ≠ evidence of absence | Say "unable to detect" |
| Cherry-picking significant results | Multiple testing inflation | Report all pre-specified metrics |
| Ignoring effect size | Statistically significant but tiny | Report practical significance |
| Generalizing beyond sample | May not apply to all users | State limitations |
| Claiming causation without design | Correlation ≠ causation | Be precise about what's identified |

## Related Skills

- **experiment-design-optimizer** - Designing experiments for clear answers
- **causal-identification-validator** - Validating causal claims
- **executive-translation-layer** - Communicating results to stakeholders
- **reviewer2-emulator** - Anticipating challenges to claims