# evaluate

> Run comprehensive evaluation of the Financial RAG system to measure quality, performance, and cost metrics. Use when testing RAG performance or validating system quality.

- Author: Logan Liu
- Repository: JumpLogan/ai-financial-advisor
- Version: 20260125213233
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/JumpLogan/ai-financial-advisor
- Web: https://mule.run/skillshub/@@JumpLogan/ai-financial-advisor~evaluate:20260125213233

---

---
name: evaluate
description: Run comprehensive evaluation of the Financial RAG system to measure quality, performance, and cost metrics. Use when testing RAG performance or validating system quality.
---

# Evaluate RAG System

When the user invokes this skill, run a comprehensive evaluation of the Financial RAG system.

## Steps to Follow

### 1. Check Prerequisites

Before running evaluation, verify:
- `run_evaluation.py` exists
- `data/chroma_db/` directory exists (vector database)
- `.env` file exists with `OPENAI_API_KEY` configured

If any prerequisite is missing, inform the user what needs to be set up.
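The checks above could be sketched as follows. This is a minimal illustration, not part of the repository: the function name `missing_prerequisites` is hypothetical, and it assumes `.env` stores the key as a plain `OPENAI_API_KEY=value` line.

```python
from pathlib import Path


def missing_prerequisites(root: str = ".") -> list[str]:
    """Return human-readable messages for each missing prerequisite (empty if none)."""
    base = Path(root)
    missing = []

    if not (base / "run_evaluation.py").is_file():
        missing.append("run_evaluation.py not found")

    if not (base / "data" / "chroma_db").is_dir():
        missing.append("data/chroma_db/ directory not found (vector database)")

    env_file = base / ".env"
    if not env_file.is_file():
        missing.append(".env file not found")
    else:
        # Assumption: the key appears as a plain KEY=value line with a non-empty value.
        has_key = False
        for line in env_file.read_text(encoding="utf-8").splitlines():
            name, sep, value = line.strip().partition("=")
            if name.strip() == "OPENAI_API_KEY" and sep and value.strip():
                has_key = True
                break
        if not has_key:
            missing.append("OPENAI_API_KEY is not set in .env")

    return missing
```

An empty return value means evaluation can proceed; otherwise each message can be relayed to the user as a setup item.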

### 2. Parse User Arguments

The evaluation script supports these optional arguments:
- `--test-cases <file>`: Use custom test cases JSON file
- `--model <model>`: Use specific model (default: gpt-3.5-turbo)
- `--output-dir <dir>`: Specify output directory (default: evaluation_results)
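One way to turn the user's choices into a command line is sketched below. The helper `build_command` is hypothetical; only the three flags listed above come from the skill. Arguments the user did not supply are omitted so the script's own defaults apply.

```python
from typing import Optional


def build_command(test_cases: Optional[str] = None,
                  model: Optional[str] = None,
                  output_dir: Optional[str] = None) -> list[str]:
    """Build the argument list for run_evaluation.py, skipping unset options."""
    cmd = ["python", "run_evaluation.py"]
    if test_cases:
        cmd += ["--test-cases", test_cases]
    if model:
        cmd += ["--model", model]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd
```

Returning a list rather than a single string avoids shell quoting issues when a file path contains spaces.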

### 3. Run the Evaluation

Execute: `python run_evaluation.py [arguments]`

Monitor the output and show progress to the user.
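A minimal sketch of running the script while relaying its output line by line is shown below. The helper name `run_and_stream` is an assumption for illustration; it uses only the standard `subprocess` module and makes no assumption about what the script prints.

```python
import subprocess
from typing import Sequence


def run_and_stream(cmd: Sequence[str]) -> int:
    """Run cmd, echo each output line as it arrives, and return the exit code."""
    process = subprocess.Popen(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors appear in order
        text=True,
        bufsize=1,  # line-buffered reads on our side of the pipe
    )
    assert process.stdout is not None
    for line in process.stdout:
        print(line, end="", flush=True)
    return process.wait()
```

For example, `run_and_stream(["python", "run_evaluation.py"])` would print progress as it is produced; a non-zero return value indicates the evaluation did not complete cleanly.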

### 4. Display Results

After completion, provide a summary including:
- Pass rate (percentage of tests that passed)
- Hallucinations detected
- Average latency per query
- Total cost and cost per query
- Report file locations
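The arithmetic behind these summary figures can be sketched as below. The `QueryResult` record and its field names are hypothetical: the skill does not describe the report format that `run_evaluation.py` writes, so per-query results would first need to be mapped into a shape like this.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    # Hypothetical per-query record; field names are illustrative only.
    passed: bool
    hallucinated: bool
    latency_s: float
    cost_usd: float


def summarize(results: list[QueryResult]) -> dict[str, float]:
    """Compute pass rate, hallucination count, average latency, and cost figures."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    total_cost = sum(r.cost_usd for r in results)
    return {
        "pass_rate_pct": 100.0 * sum(r.passed for r in results) / n,
        "hallucinations": sum(r.hallucinated for r in results),
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "total_cost_usd": total_cost,
        "cost_per_query_usd": total_cost / n,
    }
```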

### 5. Provide Recommendations

Based on the results:
- If the pass rate is below 80%: suggest improvements
- If the pass rate is 80% or higher: acknowledge good performance
- Offer to analyze detailed results or failed test cases
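The threshold rule above can be expressed as a small function. This is a sketch under the assumption that the pass rate is available as a percentage; the function name and message wording are illustrative.

```python
PASS_RATE_THRESHOLD = 80.0  # percent, taken from the rule above


def recommend(pass_rate_pct: float) -> str:
    """Return a recommendation message based on the 80% pass-rate threshold."""
    if pass_rate_pct < PASS_RATE_THRESHOLD:
        return (f"Pass rate {pass_rate_pct:.1f}% is below 80%: "
                "improvements are suggested; consider reviewing the failed test cases.")
    return f"Pass rate {pass_rate_pct:.1f}% meets the 80% threshold: good performance."
```

Note that a pass rate of exactly 80% falls in the "good performance" branch, matching the `>= 80%` condition.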

### 6. Offer Next Steps

Suggest follow-up actions like analyzing the report, examining failures, or re-running with different settings.