# rag-debugger

> Diagnoses RAG pipeline failures in WaTaxDesk. This skill should be used when debugging why queries fail, analyzing test results, tracing retrieval issues, or identifying missing knowledge base documents.

- Author: DOR Monitor Bot
- Repository: jjsupreme7/WashingtonTaxDesk
- Version: 20260203084842
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/jjsupreme7/WashingtonTaxDesk
- Web: https://mule.run/skillshub/@@jjsupreme7/WashingtonTaxDesk~rag-debugger:20260203084842

---

---
name: rag-debugger
description: Diagnoses RAG pipeline failures in WaTaxDesk. This skill should be used when debugging why queries fail, analyzing test results, tracing retrieval issues, or identifying missing knowledge base documents.
---

# RAG Debugger Skill

This skill provides tools and guidance for diagnosing failures in the WaTaxDesk RAG (Retrieval-Augmented Generation) pipeline.

## When to Activate

- User asks "why did this query fail?"
- User asks to debug RAG or retrieval issues
- User asks about test failures or pass rates
- User asks what documents are missing from the knowledge base
- User wants to trace a query through the pipeline
- User asks why certain citations aren't being retrieved

## Available Scripts

### trace_query.py

Traces a single query through the RAG pipeline and outputs diagnostic information.

```bash
python skills/rag-debugger/scripts/trace_query.py "What is the B&O tax rate for retailers?"
```

**Output includes:**

- Query preprocessing and expansion
- Vector search results per variation
- Keyword search results per variation
- Corrective RAG validation scores
- Reranking position changes
- Final results with confidence breakdown

### batch_diagnose.py

Runs all test questions and groups failures by root cause.
```bash
python skills/rag-debugger/scripts/batch_diagnose.py
```

**Output includes:**

- Pass/fail summary
- Failures grouped by cause (missing docs, low relevance, reranking issues)
- Suggested fixes for each failure category

### find_gaps.py

Analyzes failures to identify missing knowledge base documents.

```bash
python skills/rag-debugger/scripts/find_gaps.py
```

**Output includes:**

- List of topics with no matching documents
- Suggested documents to add
- Priority ranking based on failure frequency

## Using the Debug API

The `/api/chat` endpoint accepts a `debug: true` parameter that returns detailed pipeline traces:

```python
import requests

response = requests.post(
    "http://localhost:5002/api/chat",
    json={"question": "What is the B&O tax rate?", "debug": True},
)
trace = response.json().get("debug_trace")
```

## Interpreting Debug Traces

### Trace Structure

```json
{
  "timestamp": "2024-12-24T...",
  "query_original": "...",
  "stages": {
    "query_expansion": {...},
    "hybrid_retrieval": {...},
    "corrective_rag": {...},
    "reranking": {...},
    "final": {...}
  }
}
```

### Stage: query_expansion

Shows the 4 query variations generated:

- Original query
- Legal/technical variation (RCW/WAC terms)
- Business terms variation
- Tax authority phrasing variation

**Red flag:** If variations are too similar, the query may need preprocessing fixes.
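The "variations are too similar" red flag can be checked mechanically rather than by eye. A minimal sketch, assuming the `query_expansion` stage exposes its generated queries under a `variations` list (that field name is an assumption about the trace schema, not a documented one):

```python
# Sketch: flag a query_expansion stage whose variations are near-duplicates.
# The "variations" field name is assumed; adjust to the real trace schema.

def variations_too_similar(stage: dict, threshold: float = 0.8) -> bool:
    """Return True if any pair of query variations shares more than
    `threshold` of its tokens (Jaccard overlap on lowercased words)."""
    variations = stage.get("variations", [])
    token_sets = [set(v.lower().split()) for v in variations]
    for i in range(len(token_sets)):
        for j in range(i + 1, len(token_sets)):
            union = token_sets[i] | token_sets[j]
            if union:
                jaccard = len(token_sets[i] & token_sets[j]) / len(union)
                if jaccard > threshold:
                    return True
    return False
```

A high pairwise overlap between, say, the legal/technical variation and the business-terms variation suggests the expansion step is just rephrasing rather than diversifying, which is the preprocessing fix the red flag points at.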
### Stage: hybrid_retrieval

Shows vector and keyword search results per variation:

- `vector_results`: Documents found via embedding similarity
- `keyword_results`: Documents found via full-text search
- `similarity`: Scores from 0-1; higher is better

**Red flags:**

- All similarity scores < 0.5 → Query doesn't match the knowledge base well
- No results for any variation → Topic may be missing entirely
- Only keyword hits, no vector hits → Embedding mismatch

### Stage: corrective_rag

Shows validation of each candidate chunk:

- `score`: Relevance score (0-1)
- `reason`: Why the chunk was judged relevant/irrelevant
- `passed`: Whether it exceeded the 0.4 threshold

**Red flags:**

- High similarity but low relevance score → Retrieved doc exists but doesn't answer the query
- Many chunks failing validation → Query is ambiguous or docs are tangential

### Stage: reranking

Shows position changes after AI reranking:

- `position_before`: Position after validation
- `position_after`: Position after reranking
- `position_change`: How far it moved (positive = moved up)

**Red flags:**

- Large position changes → Initial retrieval order was poor
- Best doc moved down → Reranking may be over-optimizing for the wrong criteria

### Stage: final

Summary of final results:

- `result_count`: Number of chunks returned
- `average_relevance`: Mean relevance score
- `top_citations`: The citations being returned

**Red flags:**

- `result_count` < requested `top_k` → Not enough relevant docs
- Low `average_relevance` → Docs are only marginally relevant
- Wrong citations → Knowledge base may have incorrect tagging

## Common Failure Patterns

### 1. Missing Documents

**Symptoms:**

- Low/no results in hybrid_retrieval
- All variations return the same few docs
- High-level topic queries fail

**Fix:** Ingest the missing documents (RCW sections, DOR guidance, rate tables).

### 2. Embedding Mismatch

**Symptoms:**

- Keyword search finds docs but vector search doesn't
- Technical terms not matching legal terms

**Fix:** Improve query preprocessing or add synonyms.

### 3. Over-Strict Validation

**Symptoms:**

- Many chunks with similarity > 0.6 fail validation
- Relevance reasons cite narrow interpretation

**Fix:** Lower the validation threshold or improve chunk content.

### 4. Incorrect Tagging

**Symptoms:**

- Wrong `law_version` or `tax_types` on relevant docs
- Filter excludes correct documents

**Fix:** Re-tag documents in the knowledge base.

### 5. Reranking Issues

**Symptoms:**

- Best doc has the highest similarity but a low final position
- Reranking reasons don't match query intent

**Fix:** Review the reranking prompt or reduce reranking influence.

## Pipeline Reference

See `references/pipeline_stages.md` for detailed documentation of each RAG pipeline stage.
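Several of the failure patterns above can be triaged automatically from a single debug trace. A minimal sketch, assuming hypothetical field names (`candidates` under `corrective_rag`, per-result `similarity` keys) that should be adjusted to match the real `debug_trace` payload:

```python
# Sketch: map one debug trace to likely failure patterns.
# Field names like "candidates" are assumptions about the trace schema.

def classify_failure(trace: dict) -> list[str]:
    """Return likely failure-pattern labels for a single debug trace."""
    findings = []
    stages = trace.get("stages", {})

    retrieval = stages.get("hybrid_retrieval", {})
    vector = retrieval.get("vector_results", [])
    keyword = retrieval.get("keyword_results", [])
    sims = [r.get("similarity", 0.0) for r in vector]

    if not vector and not keyword:
        findings.append("missing_documents")        # Pattern 1
    elif keyword and not vector:
        findings.append("embedding_mismatch")       # Pattern 2
    elif sims and max(sims) < 0.5:
        findings.append("low_similarity")           # hybrid_retrieval red flag

    # Pattern 3: high-similarity chunks failing the 0.4 validation threshold.
    validated = stages.get("corrective_rag", {}).get("candidates", [])
    strict = [c for c in validated
              if c.get("similarity", 0.0) > 0.6 and not c.get("passed")]
    if validated and len(strict) >= len(validated) / 2:
        findings.append("over_strict_validation")

    return findings or ["no_known_pattern"]
```

This mirrors the manual triage order above: check retrieval first (patterns 1-2), then validation (pattern 3); tagging and reranking issues need the `reason` and position fields and are left to manual review.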