# agent-test > Analyze aviation agent planner behavioral test results and apply improvement workflow. Use when: running planner tests, adding test cases to planner_test_cases.json, analyzing test failures, improving planner prompts, validating planner improvements, or working with tests/aviation_agent/ test infrastructure. - Author: vfdg344334 - Repository: vfdg344334/flyfun-apps - Version: 20260129010156 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/vfdg344334/flyfun-apps - Web: https://mule.run/skillshub/@@vfdg344334/flyfun-apps~agent-test:20260129010156 --- --- name: agent-test description: > Analyze aviation agent planner behavioral test results and apply improvement workflow. Use when: running planner tests, adding test cases to planner_test_cases.json, analyzing test failures, improving planner prompts, validating planner improvements, or working with tests/aviation_agent/ test infrastructure. allowed-tools: Read, Edit, Write, Bash, Glob, Grep --- # Aviation Agent Test Improvement Analyze planner behavioral test results and systematically enhance tool definitions, planner prompts, and test coverage. ## Critical: Use Existing Infrastructure Only **DO NOT CREATE NEW SCRIPTS.** All infrastructure exists: - **Test Runner**: `tests/aviation_agent/test_planner_behavior.py` - **Test Cases**: `tests/aviation_agent/fixtures/planner_test_cases.json` - **CSV Results**: `tests/aviation_agent/results/` ## Quick Start 1. **Read** existing test cases from `tests/aviation_agent/fixtures/planner_test_cases.json` 2. **Generate** new test cases and **append** to JSON file 3. **Run tests**: ```bash source ./venv/bin/activate export $(cat web/server/.env | grep -v '^#' | xargs) RUN_PLANNER_BEHAVIOR_TESTS=1 python -m pytest tests/aviation_agent/test_planner_behavior.py -v ``` 4. **Report** CSV file path and summary metrics 5. **If failures exist**, analyze CSV and propose code changes ## Three-Agent Workflow For detailed instructions on each agent, see: - [Ground Truth Generator](ground-truth.md) - Generate expected tool selection for new questions - [Failure Analysis](failure-analysis.md) - Analyze failures and propose fixes - [Validation](validation.md) - Compare before/after results ## Tool Selection Patterns | Query Type | Tool | Key Arguments | |------------|------|---------------| | Routes (from X to Y) | `find_airports_near_route` | `from_location`, `to_location`, `filters` | | Near location | `find_airports_near_location` | `location_query`, `filters` | | Airport details | `get_airport_details` | `icao_code` | | Country search | `search_airports` | `query`, `filters` | | Notification requirements | `get_notification_for_airport` | `icao`, `day_of_week` | | Rules question (ONE country) | `answer_rules_question` | `country_code`, `question`, `tags` | | Rules browsing (list all) | `browse_rules` | `country_code`, `tags`, `offset`, `limit` | | Rules comparison (2+ countries) | `compare_rules_between_countries` | `countries`, `tags`, `category` | ## Available Filters | Filter | Type | Description | |--------|------|-------------| | `fuel_type` | `'avgas'` \| `'jet_a'` | Preferred over legacy `has_avgas`/`has_jet_a` | | `has_avgas` | boolean | Legacy - still works | | `has_jet_a` | boolean | Legacy - still works | | `has_hard_runway` | boolean | Paved/hard surface runways | | `has_procedures` | boolean | IFR procedures available | | `point_of_entry` | boolean | Customs/border crossing | | `country` | string | ISO-2 country code | | `min_runway_length_ft` | number | Minimum runway length | | `max_runway_length_ft` | number | Maximum runway length | | `max_landing_fee` | number | Maximum landing fee | | `max_hours_notice` | number | Notification requirements | | `hotel` | boolean | On-site hotel | | `restaurant` | boolean | On-site restaurant | ## Test Case Format ```json { "question": "User question in natural language", "expected_tool": "tool_name_from_manifest", "expected_arguments": { "arg1": "value1", "filters": { "filter_key": true } }, "description": "Why this tool/args combination is correct" } ``` ## Critical Rules 1. **NEVER create new test scripts** - Use existing `test_planner_behavior.py` 2. **NEVER create analysis scripts** - Read CSV files directly 3. **ALWAYS edit existing files** - Append to `planner_test_cases.json` 4. **ALWAYS use venv** - `source ./venv/bin/activate` 5. **ALWAYS load environment** - `export $(cat web/server/.env | grep -v '^#' | xargs)` 6. **ALWAYS run tests and report** - Print CSV path and summary metrics ## Output Format ### After Running Tests ``` Tests completed Results saved to: tests/aviation_agent/results/planner_test_results_YYYYMMDD_HHMMSS.csv Summary: - Total tests: 21 - Passed: 21 (100%) - Failed: 0 (0%) - Tool match: 21/21 (100%) - Args match: 21/21 (100%) ``` ## Key Files - **Test Cases**: `tests/aviation_agent/fixtures/planner_test_cases.json` - **Test Runner**: `tests/aviation_agent/test_planner_behavior.py` - **Planner Prompt**: `shared/aviation_agent/planning.py` - **Tool Definitions**: `shared/aviation_agent/tools.py` - **Formatter**: `shared/aviation_agent/formatting.py`