# systematic-debugging

> Use when fixing bugs, errors, or user mentions "debug", "erro", "não funciona", "bug", "quebrou", "failing". Four-phase root cause analysis.

- Author: Pedro Saubellona
- Repository: impeto-ai/claude-skills-impeto
- Version: 20260104192656
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/impeto-ai/claude-skills-impeto
- Web: https://mule.run/skillshub/@@impeto-ai/claude-skills-impeto~systematic-debugging:20260104192656

---

---
name: systematic-debugging
description: Use when fixing bugs, errors, or user mentions "debug", "erro", "não funciona", "bug", "quebrou", "failing". Four-phase root cause analysis.
---

# Systematic Debugging

Four-phase methodology for finding and fixing root causes, not symptoms.

## When to Use

- Error messages or exceptions
- Unexpected behavior
- Tests failing
- User says: debug, erro, bug, não funciona, quebrou, failing
- NOT when: adding features, refactoring working code

## The Four Phases

```
┌─────────────────────────────────────────────────┐
│  PHASE 1: REPRODUCE                             │
│  Can you make it fail consistently?             │
├─────────────────────────────────────────────────┤
│  PHASE 2: ISOLATE                               │
│  Where exactly does it fail?                    │
├─────────────────────────────────────────────────┤
│  PHASE 3: IDENTIFY                              │
│  WHY does it fail? Root cause.                  │
├─────────────────────────────────────────────────┤
│  PHASE 4: FIX & VERIFY                          │
│  Fix it and prove it's fixed.                   │
└─────────────────────────────────────────────────┘
```

## Phase 1: REPRODUCE

**Goal:** Consistent reproduction steps

```markdown
### Reproduction Checklist
- [ ] Can reproduce locally?
- [ ] Minimum steps to reproduce?
- [ ] What's the exact error message?
- [ ] What's the expected vs actual behavior?
- [ ] Environment details (versions, config)?
```

**Document:**
```
REPRODUCTION STEPS:
1. [Step 1]
2. [Step 2]
3. [Step 3]

EXPECTED: [What should happen]
ACTUAL: [What actually happens]
ERROR: [Exact error message]
```

## Phase 2: ISOLATE

**Goal:** Find the exact location

### Techniques

| Technique | When to Use |
|-----------|-------------|
| Binary search | Large codebase, unclear location |
| Stack trace | Exception with traceback |
| Git bisect | "It worked yesterday" |
| Print/logging | Flow unclear |
| Breakpoints | Need to inspect state |

### Binary Search Method

```
1. Find a known good state (test passes, feature works)
2. Find the bad state (current broken)
3. Test the middle point
4. Narrow down: which half has the bug?
5. Repeat until found
```

### For AI Agents

```python
# Add trace logging to agent steps
import logging
logging.basicConfig(level=logging.DEBUG)

# Log each step
logger.debug(f"Input: {input}")
logger.debug(f"Agent state: {agent.state}")
logger.debug(f"Tool called: {tool_name} with {args}")
logger.debug(f"Tool result: {result}")
logger.debug(f"Output: {output}")
```

### For Database

```sql
-- Check query plan
EXPLAIN ANALYZE SELECT ...;

-- Check recent errors
SELECT * FROM pg_stat_activity WHERE state = 'error';

-- Check locks
SELECT * FROM pg_locks WHERE NOT granted;
```

## Phase 3: IDENTIFY (Root Cause)

**Goal:** Understand WHY, not just WHERE

### 5 Whys Technique

```
Problem: User login fails
Why 1: Token validation returns false
Why 2: Token is expired
Why 3: Token expiry was set to 1 second
Why 4: Config loaded wrong environment
Why 5: .env file not in container → ROOT CAUSE
```

### Common Root Causes

| Symptom | Likely Cause |
|---------|--------------|
| Works locally, fails in prod | Environment/config difference |
| Intermittent failure | Race condition, timing |
| Fails after deploy | New code, dependency update |
| Fails at scale | Memory, connection limits |
| Fails randomly | Uninitialized state, random seed |

### For Pydantic AI

```python
# Common issues:
# 1. Schema validation errors
# 2. Token limits exceeded
# 3. Tool not returning expected format
# 4. State not persisting between steps

# Debug with:
from pydantic_ai import debug
debug.enable()
```

## Phase 4: FIX & VERIFY

**Goal:** Fix root cause, prove it's fixed

### Fix Checklist

```markdown
- [ ] Fix addresses ROOT CAUSE (not symptom)
- [ ] Write test that would have caught this
- [ ] Test passes now
- [ ] No regressions (other tests still pass)
- [ ] Document what was wrong and why
```

### Verification

```bash
# Run specific test
pytest tests/test_feature.py::test_that_was_failing -v

# Run related tests
pytest tests/test_feature.py -v

# Run full suite
pytest

# Check for regressions
git diff HEAD~1 | pytest --collect-only
```

## Defense in Depth

After fixing, add layers of protection:

```python
# 1. Input validation
def process_user(user_id: int) -> User:
    if user_id <= 0:
        raise ValueError(f"Invalid user_id: {user_id}")
    ...

# 2. Assertions for invariants
assert len(results) > 0, "Query returned no results"

# 3. Logging for future debugging
logger.info(f"Processed {len(items)} items in {elapsed}s")

# 4. Monitoring/alerting
metrics.increment("process_user.success")
```

## Output Format

When debugging, structure responses as:

```
## PHASE 1: REPRODUCE
- Steps: [numbered list]
- Error: `[exact message]`
- Consistent: YES/NO

## PHASE 2: ISOLATE
- Location: `[file:line]`
- Method used: [binary search/stack trace/etc]

## PHASE 3: IDENTIFY
- Root cause: [explanation]
- 5 Whys: [chain]

## PHASE 4: FIX & VERIFY
- Fix: `[file:line]` - [description]
- Test: `[test name]` - PASSING ✓
- Regression: NONE ✓
```

## Common Mistakes

- Fixing symptoms instead of root cause
- Not reproducing before fixing
- Changing multiple things at once
- Not writing regression test
- Skipping verification phase
- Assuming instead of proving