# systematic-debugging

> Use for ANY bug, test failure, or unexpected behavior. Use BEFORE proposing fixes. Four phases: investigate, analyze, hypothesize, implement. Ensures understanding before attempting solutions.

- Author: discountedcookie
- Repository: discountedcookie/10x-mapmaster
- Version: 20251206213515
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/discountedcookie/10x-mapmaster
- Web: https://mule.run/skillshub/@@discountedcookie/10x-mapmaster~systematic-debugging:20251206213515

---

---
name: systematic-debugging
description: >-
  Use for ANY bug, test failure, or unexpected behavior. Use BEFORE proposing
  fixes. Four phases: investigate, analyze, hypothesize, implement. Ensures
  understanding before attempting solutions.
---

# Systematic Debugging

Find root cause before fixing.

> **Announce:** "I'm using systematic-debugging to investigate this issue before proposing fixes."

## Iron Law

```
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
```

If you haven't completed Phase 1, you cannot propose fixes.

## The Four Phases

You MUST complete each phase before proceeding to the next.

### Phase 1: Root Cause Investigation

**1.1 Read Error Messages Carefully**
- Don't skip past errors
- Read complete stack traces
- Note file paths, line numbers, error codes

**1.2 Reproduce Consistently**
- Can you trigger it reliably?
- What are the exact steps?
- If not reproducible → gather more data, don't guess
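
One way to answer "can you trigger it reliably?" is to measure it. The sketch below is a minimal illustration, not part of the skill itself: `failure_rate` and `always_broken` are hypothetical names, and `always_broken` stands in for whatever reproduction steps the real bug needs.

```python
from typing import Callable


def failure_rate(scenario: Callable[[], None], runs: int = 20) -> float:
    """Run `scenario` repeatedly and return the fraction of runs that raised."""
    failures = 0
    for attempt in range(1, runs + 1):
        try:
            scenario()
        except Exception as exc:
            # Record every failure with its message instead of swallowing it.
            failures += 1
            print(f"run {attempt}: {type(exc).__name__}: {exc}")
    return failures / runs


def always_broken() -> None:
    """Hypothetical stand-in for the real reproduction steps."""
    raise ValueError("boom")


if __name__ == "__main__":
    rate = failure_rate(always_broken, runs=5)
    print(f"failure rate: {rate:.0%}")
```

A rate strictly between 0% and 100% suggests timing, ordering, or environment is involved, which is itself evidence worth recording before forming a hypothesis.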

**1.3 Check Recent Changes**
```bash
git log --oneline -10
git diff HEAD~5
```
- What changed that could cause this?
- New dependencies? Config changes?

**1.4 Gather Evidence**

For multi-component issues, trace the data flow:
```
Component A → Component B → Component C
     ↓             ↓             ↓
  Log input    Log input    Log input
  Log output   Log output   Log output
```

Add diagnostic logging at each boundary to find WHERE it breaks.
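
The boundary logging in the diagram above could be sketched as a decorator. This is an illustrative assumption about how one might instrument a pipeline: `log_boundary` and the three stage functions (`parse`, `clean`, `count`) are hypothetical and do not come from any real codebase.

```python
import functools
import logging
from typing import Any, Callable

logging.basicConfig(level=logging.DEBUG, format="%(message)s")
log = logging.getLogger("boundary")


def log_boundary(name: str) -> Callable:
    """Log the input and output of one component so the break point is visible."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            log.debug("%s input:  args=%r kwargs=%r", name, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                # Log the traceback at the boundary, then let the error propagate.
                log.exception("%s raised", name)
                raise
            log.debug("%s output: %r", name, result)
            return result
        return wrapper
    return decorator


# Hypothetical three-stage pipeline standing in for Components A, B, and C.
@log_boundary("A:parse")
def parse(raw: str) -> list[str]:
    return raw.split(",")


@log_boundary("B:clean")
def clean(items: list[str]) -> list[str]:
    return [item.strip() for item in items]


@log_boundary("C:count")
def count(items: list[str]) -> int:
    return len(items)


if __name__ == "__main__":
    count(clean(parse("a, b ,c")))
```

Reading the log top to bottom shows the first boundary where the output no longer matches what the next component expects.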

### Phase 2: Pattern Analysis

**2.1 Find Working Examples**
- Locate similar working code in the codebase
- What's different between working and broken?

**2.2 Compare Against References**
- Read relevant specs: `openspec show [capability]`
- Check documentation for expected behavior

**2.3 Identify Differences**
- List EVERY difference, however small
- Don't assume "that can't matter"
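
When the working and broken cases can be expressed as settings or inputs, listing every difference can be mechanical. The following is a minimal sketch under that assumption; `list_differences` and the example keys are hypothetical.

```python
from typing import Any


def list_differences(working: dict[str, Any], broken: dict[str, Any]) -> list[str]:
    """Return one line per key whose value differs or exists on only one side."""
    missing = object()  # sentinel so that explicit None values still compare correctly
    lines = []
    for key in sorted(working.keys() | broken.keys()):
        left = working.get(key, missing)
        right = broken.get(key, missing)
        if left is missing:
            lines.append(f"{key}: only in broken ({right!r})")
        elif right is missing:
            lines.append(f"{key}: only in working ({left!r})")
        elif left != right:
            lines.append(f"{key}: working={left!r} broken={right!r}")
    return lines


if __name__ == "__main__":
    working = {"timeout": 30, "retries": 3, "region": "eu"}
    broken = {"timeout": 30, "retries": 0, "debug": True}
    for line in list_differences(working, broken):
        print(line)
```

The point of the exhaustive list is that nothing is filtered out by intuition; each line is a candidate to rule in or out in Phase 3.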

### Phase 3: Hypothesis and Testing

**3.1 Form Single Hypothesis**
State clearly:
```
I think [X] is the root cause because [Y].
Evidence: [Z]
```

**3.2 Test Minimally**
- Make the SMALLEST possible change that tests the hypothesis
- ONE variable at a time
- Don't fix multiple things at once
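
"One variable at a time" can be made literal when there is a known-good baseline. The sketch below assumes the system can be described by a settings dictionary and checked by a predicate; `isolate_variable` and the `works` callback are hypothetical names for illustration.

```python
from typing import Any, Callable


def isolate_variable(
    baseline: dict[str, Any],
    changes: dict[str, Any],
    works: Callable[[dict[str, Any]], bool],
) -> list[str]:
    """Apply each change alone to the working baseline; return the keys that break it."""
    culprits = []
    for key, value in changes.items():
        trial = {**baseline, key: value}  # exactly one variable differs from baseline
        if not works(trial):
            culprits.append(key)
    return culprits


if __name__ == "__main__":
    baseline = {"retries": 3, "timeout": 30}
    changes = {"retries": 0, "timeout": 5}
    # Hypothetical check: the system only works when retries are enabled.
    print(isolate_variable(baseline, changes, lambda cfg: cfg["retries"] > 0))
```

Because each trial differs from the baseline in a single key, a failing trial points at exactly one variable instead of a bundle of changes.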

**3.3 Evaluate Result**
- Hypothesis confirmed? → Phase 4
- Hypothesis rejected? → Form NEW hypothesis, return to 3.1
- If 3+ hypotheses failed → Question the architecture (see "When 3+ Fixes Fail")

### Phase 4: Implementation

**4.1 Create Failing Test**
→ REQUIRED SUB-SKILL: Load `test-tdd`
- Write a test that reproduces the bug
- Verify it fails for the right reason

**4.2 Implement Fix**
- Address the ROOT CAUSE confirmed in Phase 3
- ONE change only
- No "while I'm here" improvements

**4.3 Verify Fix**
- New test passes
- All other tests pass
- Issue actually resolved

## Async/Queue System Investigation

When debugging async systems (pgmq, triggers, edge functions):

**1. Don't assume the queue is broken**
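```sql
-- Check if queue exists and has messages
SELECT * FROM pgmq.list_queues();

-- NOTE: read() is not a passive peek. It hides the returned messages from
-- other consumers for the visibility timeout (30 s here) and bumps read_ct.
SELECT * FROM pgmq.read('queue_name', 30, 10);

-- Check ARCHIVED messages (already processed!)
SELECT * FROM pgmq.a_queue_name ORDER BY archived_at DESC LIMIT 10;
```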

**2. Verify triggers are attached**
```sql
SELECT tgname, tgenabled, pg_get_triggerdef(oid)
FROM pg_trigger
WHERE tgrelid = 'schema.table'::regclass
  AND NOT tgisinternal;
```

**3. Check for race conditions**
- Multiple triggers firing for same entity?
- Concurrent executions overwriting each other?
- Look for DELETE statements that might cause data loss

**4. Trace the actual code path**
- Read the trigger function source
- Read the queue handler source
- Identify WHERE data is deleted/replaced

**Common pitfall:** "Inconsistent results" often means concurrent execution, not queue failure.
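
To see why concurrent execution produces "inconsistent results", the sketch below replays a read-modify-replace handler in a fixed interleaving. It is a deterministic toy model, not pgmq or trigger code: the in-memory `table`, the `handler` generator, and the entity id are all hypothetical.

```python
from typing import Generator

# In-memory stand-in for a table keyed by entity id (hypothetical).
table: dict[str, list[str]] = {"entity-1": []}


def handler(entity: str, tag: str) -> Generator[None, None, None]:
    """Read-modify-replace in two steps, yielding between them like a context switch."""
    snapshot = list(table[entity])     # step 1: read the current rows
    yield                              # another execution may run here
    table[entity] = snapshot + [tag]   # step 2: replace all rows with the new set


def run_serial() -> list[str]:
    """Run handler 'a' to completion, then handler 'b'."""
    table["entity-1"] = []
    for tag in ("a", "b"):
        for _ in handler("entity-1", tag):
            pass
    return table["entity-1"]


def run_interleaved() -> list[str]:
    """Let both handlers read before either one writes."""
    table["entity-1"] = []
    first, second = handler("entity-1", "a"), handler("entity-1", "b")
    next(first)         # 'a' reads an empty snapshot
    next(second)        # 'b' reads an empty snapshot
    next(first, None)   # 'a' writes its rows
    next(second, None)  # 'b' replaces them, silently dropping 'a'
    return table["entity-1"]


if __name__ == "__main__":
    print("serial:     ", run_serial())
    print("interleaved:", run_interleaved())
```

The queue delivered both messages in both runs; only the interleaving changed, which is why checking the handler's delete and replace steps comes before blaming the queue.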

## When 3+ Fixes Fail

Three or more failed fixes usually point to an architectural problem rather than an isolated bug. Stop and report:

```
STOP: 3+ fix attempts have failed.

This suggests the issue is architectural, not a simple bug.

Pattern observed:
- Fix 1 tried: [what]
- Fix 2 tried: [what]
- Fix 3 tried: [what]

Each fix revealed new issues in different places.

Question: Is this pattern fundamentally sound, or should we
refactor the architecture?

Awaiting guidance before attempting more fixes.
```

## Red Flags - STOP and Reset

If you catch yourself thinking:
- "Quick fix for now, investigate later"
- "Just try changing X and see if it works"
- "Add multiple changes, run tests"
- "Skip the test, I'll manually verify"
- "It's probably X, let me fix that"
- "I don't fully understand but this might work"

STOP. Return to Phase 1.

## Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "Issue is simple, skip investigation" | Simple issues have root causes too. |
| "Emergency, no time for process" | Systematic is FASTER than thrashing. |
| "Just try this first" | First fix sets the pattern. Do it right. |
| "I'll write test after fix works" | Untested fixes don't stick. Test first. |

## REQUIRED SUB-SKILL

For implementing the fix → Load `test-tdd`

## Output Format

After investigation:

```
## Root Cause Analysis

**Symptom:** [What was observed]

**Root Cause:** [What's actually wrong]

**Evidence:** [How I know this]

**Fix:** [What should change]

**Test:** [How to verify the fix]

Proceed with fix? (Will use test-tdd skill)
```