# ORCHESTRATION_DEBUGGING

> Troubleshoot agent & tool failures in scheduling orchestration. Use when MCP tools fail, agent communication breaks, constraint engines error, or database operations timeout. Provides systematic incident response and root cause analysis.

- Author: Euda1mon1a
- Repository: Euda1mon1a/Autonomous-Assignment-Program-Manager
- Version: 20260121142452
- Stars: 2
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Euda1mon1a/Autonomous-Assignment-Program-Manager
- Web: https://mule.run/skillshub/@@Euda1mon1a/Autonomous-Assignment-Program-Manager~ORCHESTRATION_DEBUGGING:20260121142452

---

---
name: ORCHESTRATION_DEBUGGING
description: Troubleshoot agent & tool failures in scheduling orchestration. Use when MCP tools fail, agent communication breaks, constraint engines error, or database operations timeout. Provides systematic incident response and root cause analysis.
---

# ORCHESTRATION_DEBUGGING

A comprehensive debugging skill for diagnosing and resolving failures in the AI-orchestrated scheduling system, including MCP tool integration, agent workflows, constraint engine, and database operations.

## When This Skill Activates

- **MCP Tool Failures**: Timeout, connection errors, or incorrect responses
- **Agent Communication Issues**: Multi-agent workflows failing to coordinate
- **Constraint Engine Errors**: OR-Tools solver failures, constraint conflicts
- **Database Operation Failures**: Deadlocks, connection pool exhaustion, slow queries
- **Schedule Generation Failures**: Validation errors, compliance violations, infeasible schedules
- **Background Task Issues**: Celery worker crashes, task timeouts, queue backlogs
- **API Integration Failures**: Backend API errors, authentication issues, rate limiting

## Overview

This skill provides structured workflows for:

1. **Incident Review**: Post-mortem analysis with root cause identification
2. **Log Analysis**: Systematic log parsing across services (backend, MCP, Celery, database)
3. **Root Cause Analysis**: 5-whys investigation methodology
4. **Common Failure Patterns**: Catalog of known issues with solutions
5. **Debugging Checklist**: Step-by-step troubleshooting for each component

## Architecture Context

### System Components

```
Claude Agent
    ↓ (MCP Protocol)
MCP Server (29+ tools)
    ↓ (HTTP API)
FastAPI Backend
    ↓ (SQLAlchemy)
PostgreSQL Database
    ↓ (Async Tasks)
Celery + Redis
```

### Common Failure Points

| Layer | Component | Failure Mode |
|-------|-----------|--------------|
| **Agent** | Claude Code | Token limits, context overflow, skill conflicts |
| **MCP** | Tool invocation | Timeout, serialization errors, auth failures |
| **API** | FastAPI routes | Validation errors, database session issues |
| **Service** | Business logic | Constraint violations, ACGME compliance failures |
| **Solver** | OR-Tools engine | Infeasible constraints, timeout, memory exhaustion |
| **Database** | PostgreSQL | Deadlocks, connection pool exhaustion, slow queries |
| **Tasks** | Celery workers | Task timeout, serialization errors, queue backlog |

## Core Debugging Phases

### Phase 1: DETECTION
**Goal:** Identify what failed and where

```
1. Check error visibility
   - User-facing error message
   - API response logs
   - Backend service logs
   - Database query logs
   - MCP server logs

2. Establish failure scope
   - Single request or systemic?
   - Reproducible or intermittent?
   - User-specific or system-wide?
```

### Phase 2: DIAGNOSIS
**Goal:** Understand why it failed

```
1. Trace request path
   - Agent → MCP → API → Service → Database
   - Identify where the chain breaks

2. Collect evidence
   - Error stack traces
   - Recent code changes (git log)
   - Database state (queries, locks)
   - System resources (CPU, memory, connections)
```

### Phase 3: RESOLUTION
**Goal:** Fix the issue

```
1. Implement fix
   - Code changes
   - Configuration updates
   - Database repairs

2. Verify fix
   - Reproduce original failure
   - Confirm fix resolves it
   - Check for regressions
```

### Phase 4: PREVENTION
**Goal:** Prevent recurrence

```
1. Document incident
   - Root cause
   - Fix applied
   - Lessons learned

2. Implement safeguards
   - Add tests
   - Add monitoring
   - Update documentation
```

## Workflow Files

### Workflows/incident-review.md
Post-mortem template for systematic incident analysis:
- Timeline reconstruction
- Impact assessment
- Root cause identification (5-whys)
- Remediation actions
- Prevention measures

**Use when:** After resolving a major incident or when debugging a complex failure

### Workflows/log-analysis.md
Log parsing and correlation across services:
- Log location discovery
- Error pattern extraction
- Cross-service correlation
- Timeline reconstruction
- Anomaly detection

**Use when:** Error is unclear or spans multiple services

### Workflows/root-cause-analysis.md
5-whys investigation methodology:
- Problem statement definition
- Iterative questioning
- Evidence gathering
- Root cause identification

**Use when:** Surface-level fix is clear but underlying cause is not

## Reference Files

### Reference/common-failure-patterns.md
Catalog of known issues with symptoms and fixes:
- Database connection failures
- MCP tool timeouts
- Constraint engine errors
- Agent communication failures
- Each with: Symptoms → Diagnosis → Fix

**Use when:** Encountering a familiar-looking error

### Reference/debugging-checklist.md
Step-by-step troubleshooting guide:
- Service health checks
- Log verification
- Database inspection
- MCP tool status
- Agent state verification

**Use when:** Starting investigation with no clear direction

## Key Files to Inspect

### Backend Logs
```bash
# Application logs
docker-compose logs backend --tail=200 --follow

# Uvicorn access logs
docker-compose logs backend | grep "POST\|GET\|PUT\|DELETE"

# Error-specific logs
docker-compose logs backend 2>&1 | grep -i "error\|exception\|failed"
```

### MCP Server Logs
```bash
# MCP server output
docker-compose logs mcp-server --tail=100 --follow

# Tool invocation logs
docker-compose logs mcp-server | grep "tool_call\|error"

# API connectivity
docker-compose exec mcp-server curl -s http://backend:8000/health
```

### Database Logs
```bash
# Connect to database
docker-compose exec db psql -U scheduler -d residency_scheduler

# Check active queries
SELECT pid, now() - query_start as duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

# Check locks
SELECT * FROM pg_locks WHERE NOT granted;
```

### Celery Logs
```bash
# Worker logs
docker-compose logs celery-worker --tail=100 --follow

# Beat scheduler logs
docker-compose logs celery-beat --tail=50 --follow

# Check queue status
docker-compose exec redis redis-cli LLEN celery
```

## Output Format

### Quick Status Check
```
SYSTEM HEALTH: [GREEN|YELLOW|ORANGE|RED]

Backend API: ✓ Responding (200ms avg)
MCP Server: ✓ Connected (29 tools available)
Database: ✓ 8/20 connections used
Celery: ✗ 3 failed tasks in queue
Redis: ✓ Connected

ISSUES DETECTED:
1. Celery worker timeout on schedule generation task
2. 2 database deadlocks in last hour

RECOMMENDED ACTION: Review celery worker logs and database lock contention
```

### Full Incident Report
```markdown
## INCIDENT REPORT: [Title]

**Date**: 2025-12-26 14:32 UTC
**Severity**: [LOW|MEDIUM|HIGH|CRITICAL]
**Status**: [INVESTIGATING|RESOLVED|MONITORING]
**Reporter**: [Agent/User/Automated]

### Summary
One-sentence description of what failed

### Timeline
- 14:30 - First error detected
- 14:31 - Service degraded
- 14:35 - Fix implemented
- 14:40 - Service restored

### Impact
- Users affected: [number or "all"]
- Data integrity: [preserved/compromised]
- ACGME compliance: [maintained/violated]
- Downtime: [duration]

### Root Cause
Detailed explanation using 5-whys methodology

### Resolution
What was done to fix the issue

### Prevention
How to prevent this in the future

### Action Items
- [ ] Add monitoring for [metric]
- [ ] Create test case for [scenario]
- [ ] Update documentation for [component]
```

## Error Handling Best Practices

### 1. Preserve Context
```python
# Bad - loses context
try:
    result = await some_operation()
except Exception:
    raise HTTPException(status_code=500, detail="Operation failed")

# Good - preserves stack trace
try:
    result = await some_operation()
except Exception as e:
    logger.error(f"Operation failed: {e}", exc_info=True)
    raise HTTPException(
        status_code=500,
        detail="Operation failed - check logs for details"
    )
```

### 2. Log Diagnostic Information
```python
logger.info(f"Starting operation with params: {params}")
logger.debug(f"Intermediate state: {state}")
logger.error(f"Operation failed at step {step}", exc_info=True)
```

### 3. Add Request IDs
```python
# For tracing requests across services
request_id = str(uuid.uuid4())
logger.info(f"[{request_id}] Processing schedule generation")
```

## Integration with Other Skills

### With systematic-debugger
For code-level debugging:
1. ORCHESTRATION_DEBUGGING identifies which component failed
2. systematic-debugger investigates the code

### With production-incident-responder
For production emergencies:
1. production-incident-responder handles immediate crisis
2. ORCHESTRATION_DEBUGGING performs post-mortem

### With automated-code-fixer
For automated fixes:
1. ORCHESTRATION_DEBUGGING identifies root cause
2. automated-code-fixer applies tested solution

## Escalation Criteria

**ALWAYS escalate to human when:**
1. Data corruption detected
2. Security vulnerability discovered
3. ACGME compliance violated
4. Multi-hour outage
5. Root cause unclear after investigation
6. Fix requires database migration or schema change

**Can handle automatically:**
1. Configuration issues
2. Known failure patterns with documented fixes
3. Resource exhaustion (restart services)
4. Transient network errors
5. Log analysis and report generation

## Monitoring Recommendations

After resolving incidents, add monitoring for:
- Error rate by endpoint
- Request latency (p50, p95, p99)
- Database connection pool usage
- Celery queue depth
- MCP tool success rate
- Schedule generation success rate

## References

- `/docs/development/DEBUGGING_WORKFLOW.md` - Overall debugging methodology
- `/docs/development/CI_CD_TROUBLESHOOTING.md` - CI/CD specific patterns
- `/mcp-server/RESILIENCE_MCP_INTEGRATION.md` - MCP tool documentation
- `/backend/app/core/logging.py` - Logging configuration
- `Workflows/` - Detailed workflow templates
- `Reference/` - Common patterns and checklists