# retrieval-judge > Evaluate and filter recall_search results for relevance - Author: John Yang - Repository: jyang234/ai-engineering-framework - Version: 20260129224245 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/jyang234/ai-engineering-framework - Web: https://mule.run/skillshub/@@jyang234/ai-engineering-framework~retrieval-judge:20260129224245 --- --- name: retrieval-judge description: Evaluate and filter recall_search results for relevance --- # Retrieval Judge Skill When using `recall_search`, apply critical judgment to results rather than treating them as authoritative. ## Query Construction Build specific queries that include conversation context: - **Good**: `"payment retry logic after provider timeout"`, `"Go error wrapping with sentinel errors"` - **Bad**: `"retry"`, `"error handling"` Include the problem domain, technology, and specific concern in your query. ## Result Evaluation After each `recall_search` call, evaluate every result before using it: 1. **Title/type match** — Does the entry's title and type (pattern, decision, failure, etc.) relate to what you're actually looking for? 2. **Content relevance** — Does the snippet address your specific question, or is it tangentially related? 3. **Applicability** — Does the result apply to the current technology, codebase area, or problem domain? Only reference results that **directly address** the query. Discard results that are merely keyword-adjacent. ## Anti-Patterns - **Don't trust rank order blindly.** RRF scoring produces narrow distributions — result #1 may not be meaningfully better than result #5. - **Don't use results just because they appeared.** An empty answer is better than citing irrelevant knowledge. - **Don't ignore low-ranked results.** A result at position 8 may be more relevant than position 2 if it matches the actual intent. - **Don't skip evaluation.** Every `recall_search` call should be followed by a mental relevance check before incorporating results into your response. ## Audit Trail After evaluating recall_search results, always: 1. **Log your judgment** — call `flight_recorder_log` with: - type: `retrieval_judgment` - content: One-line summary, e.g. `"3/7 results relevant for 'payment retry timeout'"` - metadata: ```json { "query": "", "kept": [{"id": "", "title": "", "reason": "directly addresses retry logic"}], "dropped": [{"id": "<id>", "title": "<title>", "reason": "about billing, not payments"}] } ``` 2. **Show a summary** to the user: ``` RECALL: 3/7 results kept for "payment retry timeout" ``` 3. **On request** — if the user asks for details on the judgment, output the full kept/dropped list with per-result reasoning. ## When Results Are Poor If no results are clearly relevant: - Say so — don't force-fit irrelevant knowledge - Try a rephrased query with different terms - Proceed without RECALL context rather than using noise