# rag-debugger

> Diagnoses RAG pipeline failures in WaTaxDesk. This skill should be used when debugging why queries fail, analyzing test results, tracing retrieval issues, or identifying missing knowledge base documents.

- Author: DOR Monitor Bot
- Repository: jjsupreme7/WashingtonTaxDesk
- Version: 20260203084842
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/jjsupreme7/WashingtonTaxDesk
- Web: https://mule.run/skillshub/@@jjsupreme7/WashingtonTaxDesk~rag-debugger:20260203084842

---

---
name: rag-debugger
description: Diagnoses RAG pipeline failures in WaTaxDesk. This skill should be used when debugging why queries fail, analyzing test results, tracing retrieval issues, or identifying missing knowledge base documents.
---

# RAG Debugger Skill

This skill provides tools and guidance for diagnosing failures in the WaTaxDesk RAG (Retrieval-Augmented Generation) pipeline.

## When to Activate

- User asks "why did this query fail?"
- User asks to debug RAG or retrieval issues
- User asks about test failures or pass rates
- User asks what documents are missing from the knowledge base
- User wants to trace a query through the pipeline
- User asks why certain citations aren't being retrieved

## Available Scripts

### trace_query.py

Traces a single query through the RAG pipeline and outputs diagnostic information.

```bash
python skills/rag-debugger/scripts/trace_query.py "What is the B&O tax rate for retailers?"
```

**Output includes:**

- Query preprocessing and expansion
- Vector search results per variation
- Keyword search results per variation
- Corrective RAG validation scores
- Reranking position changes
- Final results with confidence breakdown

### batch_diagnose.py

Runs all test questions and groups failures by root cause.
```bash
python skills/rag-debugger/scripts/batch_diagnose.py
```

**Output includes:**

- Pass/fail summary
- Failures grouped by cause (missing docs, low relevance, reranking issues)
- Suggested fixes for each failure category

### find_gaps.py

Analyzes failures to identify missing knowledge base documents.

```bash
python skills/rag-debugger/scripts/find_gaps.py
```

**Output includes:**

- List of topics with no matching documents
- Suggested documents to add
- Priority ranking based on failure frequency

## Using the Debug API

The `/api/chat` endpoint accepts a `debug: true` parameter that returns detailed pipeline traces:

```python
import requests

response = requests.post(
    "http://localhost:5002/api/chat",
    json={"question": "What is the B&O tax rate?", "debug": True},
)
trace = response.json().get("debug_trace")
```

## Interpreting Debug Traces

### Trace Structure

```json
{
  "timestamp": "2024-12-24T...",
  "query_original": "...",
  "stages": {
    "query_expansion": {...},
    "hybrid_retrieval": {...},
    "corrective_rag": {...},
    "reranking": {...},
    "final": {...}
  }
}
```

### Stage: query_expansion

Shows the 4 query variations generated:

- Original query
- Legal/technical variation (RCW/WAC terms)
- Business terms variation
- Tax authority phrasing variation

**Red flag:** If variations are too similar, the query may need preprocessing fixes.
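The "variations are too similar" red flag can be checked mechanically rather than by eye. A minimal sketch, assuming the `query_expansion` stage exposes its generated queries under a `variations` list (that field name is an assumption about the trace schema, not a documented one):

```python
# Sketch: flag a query_expansion stage whose variations are near-duplicates.
# The "variations" field name is assumed; adjust to the real trace schema.

def variations_too_similar(stage: dict, threshold: float = 0.8) -> bool:
    """Return True if any pair of query variations shares more than
    `threshold` of its tokens (Jaccard overlap on lowercased words)."""
    variations = stage.get("variations", [])
    token_sets = [set(v.lower().split()) for v in variations]
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            union = token_sets[i] | token_sets[j]
            if union:
                jaccard = len(token_sets[i] & token_sets[j]) / len(union)
                if jaccard > threshold:
                    return True
    return False
```

A high pairwise overlap between, say, the legal/technical variation and the business-terms variation suggests the expansion step is just rephrasing rather than diversifying, which is the preprocessing fix the red flag points at.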
### Stage: hybrid_retrieval

Shows vector and keyword search results per variation:

- `vector_results`: Documents found via embedding similarity
- `keyword_results`: Documents found via full-text search
- `similarity`: Scores from 0-1; higher is better

**Red flags:**

- All similarity scores < 0.5 → Query doesn't match the knowledge base well
- No results for any variation → Topic may be missing entirely
- Only keyword hits, no vector hits → Embedding mismatch

### Stage: corrective_rag

Shows validation of each candidate chunk:

- `score`: Relevance score (0-1)
- `reason`: Why the chunk was judged relevant/irrelevant
- `passed`: Whether it exceeded the 0.4 threshold

**Red flags:**

- High similarity but low relevance score → Retrieved doc exists but doesn't answer the query
- Many chunks failing validation → Query is ambiguous or docs are tangential

### Stage: reranking

Shows position changes after AI reranking:

- `position_before`: Position after validation
- `position_after`: Position after reranking
- `position_change`: How far it moved (positive = moved up)

**Red flags:**

- Large position changes → Initial retrieval order was poor
- Best doc moved down → Reranking may be over-optimizing for the wrong criteria

### Stage: final

Summary of final results:

- `result_count`: Number of chunks returned
- `average_relevance`: Mean relevance score
- `top_citations`: The citations being returned

**Red flags:**

- `result_count` < requested `top_k` → Not enough relevant docs
- Low `average_relevance` → Docs are only marginally relevant
- Wrong citations → Knowledge base may have incorrect tagging

## Common Failure Patterns

### 1. Missing Documents

**Symptoms:**

- Low/no results in hybrid_retrieval
- All variations return the same few docs
- High-level topic queries fail

**Fix:** Ingest the missing documents (RCW sections, DOR guidance, rate tables).

### 2. Embedding Mismatch

**Symptoms:**

- Keyword search finds docs but vector search doesn't
- Technical terms not matching legal terms

**Fix:** Improve query preprocessing or add synonyms.

### 3. Over-Strict Validation

**Symptoms:**

- Many chunks with similarity > 0.6 fail validation
- Relevance reasons cite narrow interpretation

**Fix:** Lower the validation threshold or improve chunk content.

### 4. Incorrect Tagging

**Symptoms:**

- Wrong `law_version` or `tax_types` on relevant docs
- Filter excludes correct documents

**Fix:** Re-tag documents in the knowledge base.

### 5. Reranking Issues

**Symptoms:**

- Best doc has the highest similarity but a low final position
- Reranking reasons don't match query intent

**Fix:** Review the reranking prompt or reduce reranking influence.

## Pipeline Reference

See `references/pipeline_stages.md` for detailed documentation of each RAG pipeline stage.
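Several of the failure patterns above can be triaged automatically from a single debug trace. A minimal sketch, assuming hypothetical field names (`candidates` under `corrective_rag`, per-result `similarity` keys) that should be adjusted to match the real `debug_trace` payload:

```python
# Sketch: map one debug trace to likely failure patterns.
# Field names like "candidates" are assumptions about the trace schema.

def classify_failure(trace: dict) -> list[str]:
    """Return likely failure-pattern labels for a single debug trace."""
    findings = []
    stages = trace.get("stages", {})

    retrieval = stages.get("hybrid_retrieval", {})
    vector = retrieval.get("vector_results", [])
    keyword = retrieval.get("keyword_results", [])
    sims = [r.get("similarity", 0.0) for r in vector]

    if not vector and not keyword:
        findings.append("missing_documents")        # Pattern 1
    elif keyword and not vector:
        findings.append("embedding_mismatch")       # Pattern 2
    elif sims and max(sims) < 0.5:
        findings.append("low_similarity")           # hybrid_retrieval red flag

    # Pattern 3: high-similarity chunks failing the 0.4 validation threshold.
    validated = stages.get("corrective_rag", {}).get("candidates", [])
    strict = [c for c in validated
              if c.get("similarity", 0.0) > 0.6 and not c.get("passed")]
    if validated and len(strict) >= len(validated) / 2:
        findings.append("over_strict_validation")

    return findings or ["no_known_pattern"]
```

This mirrors the manual triage order above: check retrieval first (patterns 1-2), then validation (pattern 3); tagging and reranking issues need the `reason` and position fields and are left to manual review.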