# bat

> Run bot acceptance tests to validate MCP tools work correctly from a real AI agent's perspective. Use when testing PRs, detecting regressions, or verifying tool changes end-to-end with Claude/Gemini CLIs.

- Author: sergeykad
- Repository: sergeykad/ha-mcp
- Version: 20260209164502
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-09
- Source: https://github.com/sergeykad/ha-mcp
- Web: https://mule.run/skillshub/@@sergeykad/ha-mcp~bat:20260209164502

---

---
name: bat
description: Run bot acceptance tests to validate MCP tools work correctly from a real AI agent's perspective. Use when testing PRs, detecting regressions, or verifying tool changes end-to-end with Claude/Gemini CLIs.
disable-model-invocation: true
argument-hint: [scenario-description or --help]
allowed-tools: Bash, Read, Write
---

# BAT - Bot Acceptance Testing

Bot acceptance testing validates that MCP tools work correctly from a real AI agent's perspective. You design test scenarios dynamically, run them via `tests/uat/run_uat.py`, and evaluate results.

## When to Use BAT

- **PR validation**: Test that tool changes work correctly from an agent's perspective
- **Regression detection**: Compare behavior between branches
- **Integration verification**: Ensure MCP tools work end-to-end with real agent CLIs

## Workflow

1. **Analyze the change**: Read the diff, identify which tools are affected
2. **Design scenario**: Generate a scenario JSON with setup/test/teardown prompts
3. **Run the script**: Pipe the scenario to `python tests/uat/run_uat.py`
4. **Evaluate summary**: Check `all_passed` per agent. If true, you're done.
5. **Dig deeper on failure**: Read `results_file` for full output, stderr, raw JSON
6. **Regression check**: If test fails, re-run with `--branch master` to compare

## Output Structure

The runner returns a **concise summary** to stdout (saves context when all passes):

```json
{
  "results_file": "/tmp/bat_results_abc123.json",
  "agents": {
    "gemini": {
      "all_passed": true,
      "test": {
        "completed": true,
        "duration_ms": 8100,
        "exit_code": 0,
        "num_turns": 5,
        "tool_stats": { "totalCalls": 4, "totalSuccess": 4, "totalFail": 0 }
      },
      "aggregate": {
        "total_duration_ms": 15300,
        "total_turns": 12,
        "total_tool_calls": 9,
        "total_tool_success": 9,
        "total_tool_fail": 0
      }
    }
  }
}
```

- **Phase stats**: `num_turns`, `tool_stats` (per phase) for fine-grained comparison
- **Aggregate stats**: Total counts across all phases for overall efficiency comparison
- **On failure**: also includes `output` and `stderr` for diagnosis
- **Full results**: raw JSON, complete output always available at `results_file`

## Scenario Design Guidelines

- **setup_prompt**: Create any entities/state the test needs
- **test_prompt**: Exercise the tools being tested, ask the agent to report results clearly
- **teardown_prompt**: Clean up created entities
- Keep prompts focused - each scenario tests ONE behavior
- Ask the agent to report: what succeeded, what failed, any unexpected behavior

## Example: Testing Error Signaling

```bash
cat <<'EOF' | python tests/uat/run_uat.py --agents gemini
{
  "setup_prompt": "Create a test automation called 'bat_error_test' with a time trigger at 23:59 and action to turn on light.bed_light.",
  "test_prompt": "Try to get automation 'automation.nonexistent_xyz'. Report if the tool signaled an error or returned a normal response. Then get automation 'automation.bat_error_test' and report its structure.",
  "teardown_prompt": "Delete automation 'bat_error_test' if it exists."
}
EOF
```

## Regression Comparison Workflow

**Full BAT comparison** (recommended):

1. **Pull latest master**: `git fetch origin master && git checkout master && git pull`
2. **Run on master**: Save scenario to file, run and save results
3. **Switch to branch**: `git checkout feat/my-branch`
4. **Run on branch**: Run same scenario, compare stats

**Compare these metrics:**
- **Task completion**: Did both pass? Any new failures?
- **Accuracy**: Check agent output quality - did it understand the task correctly?
- **Efficiency**: Compare `aggregate.total_tool_calls`, `aggregate.total_turns`, `aggregate.total_duration_ms`
- **Variation testing**: Ask the same task in different ways to test robustness

**Quick comparison** (single command):

```bash
# Test the PR branch
echo '{"test_prompt":"..."}' | python tests/uat/run_uat.py --branch feat/tool-errors --agents gemini

# Compare against master
echo '{"test_prompt":"..."}' | python tests/uat/run_uat.py --branch master --agents gemini
```

## Cost Awareness

Each scenario invocation costs API credits (one per agent per phase). Design scenarios efficiently:

- Combine related checks in a single test_prompt when possible
- Only use setup/teardown when the test needs specific state
- Start with one agent, expand to both only when cross-agent comparison matters

## Handling Arguments

When `/bat` is invoked with arguments:

**If arguments contain a scenario description**, generate the JSON scenario and run it:
```
/bat test automation create with sunrise trigger then modify to sunset
```
→ Generate appropriate scenario JSON and execute

**If `--help` or no arguments**, show this help text.

**Otherwise**, treat `$ARGUMENTS` as instructions for what to test and design+run the scenario accordingly.

## Full Documentation

For complete CLI reference and output format, see `tests/uat/README.md`.