# skill-test

> Testing framework for evaluating Databricks skills. Use when building test cases for skills, running skill evaluations, comparing skill versions, or creating ground truth datasets with the Generate-Review-Promote (GRP) pipeline. Triggers include "test skill", "evaluate skill", "skill regression", "ground truth", "GRP pipeline", "skill quality", and "skill metrics".

- Author: Malcoln Dandaro
- Repository: jacksandom/ai-dev-kit
- Version: 20260205235044
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/jacksandom/ai-dev-kit
- Web: https://mule.run/skillshub/@@jacksandom/ai-dev-kit~skill-test:20260205235044

---

---
name: skill-test
description: Testing framework for evaluating Databricks skills. Use when building test cases for skills, running skill evaluations, comparing skill versions, or creating ground truth datasets with the Generate-Review-Promote (GRP) pipeline. Triggers include "test skill", "evaluate skill", "skill regression", "ground truth", "GRP pipeline", "skill quality", and "skill metrics".
command: skill-test
arguments: "[skill-name] [subcommand]"
---

# Databricks Skills Testing Framework

Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

## Quick References

- [Scorers](references/scorers.md) - Available scorers and quality gates
- [YAML Schemas](references/yaml-schemas.md) - Manifest and ground truth formats
- [Python API](references/python-api.md) - Programmatic usage examples
- [Workflows](references/workflows.md) - Detailed example workflows
- [Trace Evaluation](references/trace-eval.md) - Session trace analysis

## /skill-test Command

The `/skill-test` command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

### Basic Usage

```
/skill-test <skill-name> [subcommand]
```

The skill name is required. The subcommand is optional and defaults to `run`.

### Subcommands

| Subcommand | Description |
|------------|-------------|
| `run` | Run evaluation against ground truth (default) |
| `regression` | Compare current results against baseline |
| `init` | Initialize test scaffolding for a new skill |
| `add` | Interactive: prompt -> invoke skill -> test -> save |
| `add --trace` | Add test case with trace evaluation |
| `review` | Review pending candidates interactively |
| `review --batch` | Batch approve all pending candidates |
| `baseline` | Save current results as regression baseline |
| `mlflow` | Run full MLflow evaluation with LLM judges |
| `trace-eval` | Evaluate traces against skill expectations |
| `list-traces` | List available traces (MLflow or local) |
| `scorers` | List configured scorers for a skill |
| `scorers update` | Add/remove scorers or update default guidelines |
| `sync` | Sync YAML to Unity Catalog (Phase 2) |

### Quick Examples

```
/skill-test spark-declarative-pipelines run
/skill-test spark-declarative-pipelines add --trace
/skill-test spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
```

See [Workflows](references/workflows.md) for detailed examples of each subcommand.

## Execution Instructions

### Environment Setup

```bash
uv pip install -e .test/
```

Environment variables for Databricks MLflow:

- `DATABRICKS_CONFIG_PROFILE` - Databricks CLI profile (default: "DEFAULT")
- `MLFLOW_TRACKING_URI` - Set to "databricks" for Databricks MLflow
- `MLFLOW_EXPERIMENT_NAME` - Experiment path (e.g., "/Users/{user}/skill-test")

### Running Scripts

Every subcommand has a corresponding wrapper script in `.test/scripts/`. The script name does not always match the subcommand (for example, `run` maps to `run_eval.py`), so take `{script}` from the table below:

```bash
uv run python .test/scripts/{script} {skill_name} [options]
```

| Subcommand | Script |
|------------|--------|
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `review` | `review.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `scorers` | `scorers.py` |
| `scorers update` | `scorers_update.py` |
| `sync` | `sync.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `_routing mlflow` | `routing_eval.py` |

Use `--help` on any script for available options.

## Command Handler

When `/skill-test` is invoked, parse arguments and execute the appropriate command.

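One way to picture this step is as a small function that splits the invocation, applies the `run` default, and looks up the wrapper script from the Running Scripts table. The sketch below is illustrative only: `SCRIPTS`, `build_command`, and the choice to pass trailing flags straight through to the script are assumptions, not part of the framework. The routing table in this document calls functions such as `run(skill_name, ctx)` rather than building a shell command.

```python
import shlex

# Subcommand -> wrapper script, copied from the "Running Scripts" table.
SCRIPTS = {
    "run": "run_eval.py",
    "regression": "regression.py",
    "init": "init_skill.py",
    "add": "add.py",
    "review": "review.py",
    "baseline": "baseline.py",
    "mlflow": "mlflow_eval.py",
    "scorers": "scorers.py",
    "scorers update": "scorers_update.py",
    "sync": "sync.py",
    "trace-eval": "trace_eval.py",
    "list-traces": "list_traces.py",
}


def build_command(invocation: str) -> list[str]:
    """Hypothetical helper: map '/skill-test <skill> [subcommand] [flags]' to argv."""
    args = shlex.split(invocation)
    if args and args[0] == "/skill-test":
        args = args[1:]
    if not args:
        raise ValueError("skill name is required")
    skill_name, rest = args[0], args[1:]

    # Prefer the two-word form ("scorers update") over the one-word form.
    two_word = " ".join(rest[:2])
    if len(rest) >= 2 and two_word in SCRIPTS:
        subcommand, options = two_word, rest[2:]
    elif rest and rest[0] in SCRIPTS:
        subcommand, options = rest[0], rest[1:]
    elif rest and not rest[0].startswith("-"):
        raise ValueError(f"unknown subcommand: {rest[0]}")
    else:
        subcommand, options = "run", rest  # no subcommand given: default to run

    script = f".test/scripts/{SCRIPTS[subcommand]}"
    # Assumption: remaining flags are forwarded unchanged to the script.
    return ["uv", "run", "python", script, skill_name, *options]
```

The returned list can be handed to `subprocess.run` as-is. Checking the two-word key first keeps `scorers update` from being read as `scorers` followed by a stray `update` option.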
### Argument Parsing

- `args[0]` = skill_name (required)
- `args[1]` = subcommand (optional, default: "run")

### Subcommand Routing

| Subcommand | Action |
|------------|--------|
| `run` | Execute `run(skill_name, ctx)` and display results |
| `regression` | Execute `regression(skill_name, ctx)` and display comparison |
| `init` | Execute `init(skill_name, ctx)` to create scaffolding |
| `add` | Prompt for test input, invoke skill, run `interactive()` |
| `review` | Execute `review(skill_name, ctx)` to review pending candidates |
| `baseline` | Execute `baseline(skill_name, ctx)` to save as regression baseline |
| `mlflow` | Execute `mlflow_eval(skill_name, ctx)` with MLflow logging |
| `scorers` | Execute `scorers(skill_name, ctx)` to list configured scorers |
| `scorers update` | Execute `scorers_update(skill_name, ctx, ...)` to modify scorers |

### init Behavior

When running `/skill-test <skill-name> init`:

1. Read the skill's SKILL.md to understand its purpose
2. Create `manifest.yaml` with appropriate scorers and trace_expectations
3. Create empty `ground_truth.yaml` and `candidates.yaml` templates
4. Recommend test prompts based on documentation examples

Follow with `/skill-test <skill-name> add` using the recommended prompts.

### Context Setup

Create a `CLIContext` with MCP tools before calling any command. See [Python API](references/python-api.md#clicontext-setup) for details.

## File Locations

**Important:** All test files are stored at the **repository root** level, not relative to this skill's directory.

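As a rough illustration of resolving these files from the repository root, the sketch below walks upward until it finds a `.test/` directory and then builds the per-skill paths. Both `find_repo_root` and `skill_test_paths` are hypothetical helpers, and locating the root by searching for `.test/` is an assumption. The framework may resolve it differently.

```python
from pathlib import Path


def find_repo_root(start: Path) -> Path:
    """Walk upward from `start` until a directory containing `.test/` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / ".test").is_dir():
            return candidate
    raise FileNotFoundError(f"no .test/ directory at or above {start}")


def skill_test_paths(repo_root: Path, skill_name: str) -> dict[str, Path]:
    """Build the test file paths for one skill, all rooted at the repository."""
    test_dir = repo_root / ".test"
    skill_dir = test_dir / "skills" / skill_name
    return {
        "ground_truth": skill_dir / "ground_truth.yaml",
        "candidates": skill_dir / "candidates.yaml",
        "manifest": skill_dir / "manifest.yaml",
        "baseline": test_dir / "baselines" / skill_name / "baseline.yaml",
    }
```

Calling `skill_test_paths(find_repo_root(Path.cwd()), "spark-declarative-pipelines")` from anywhere inside the repository would yield paths under the root-level `.test/` directory, never under the skill definition directory.
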
| File Type | Path |
|-----------|------|
| Ground truth | `{repo_root}/.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `{repo_root}/.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `{repo_root}/.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `{repo_root}/.test/skills/_routing/ground_truth.yaml` |
| Baselines | `{repo_root}/.test/baselines/{skill-name}/baseline.yaml` |

For example, to test `spark-declarative-pipelines` in this repository:

```
/Users/.../ai-dev-kit/.test/skills/spark-declarative-pipelines/ground_truth.yaml
```

**Not** relative to the skill definition:

```
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...  # WRONG
```

## Directory Structure

```
.test/                          # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml              # Package config (pip install -e ".test/")
├── README.md                   # Contributor documentation
├── SKILL.md                    # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh       # Sync script
├── scripts/                    # Wrapper scripts
│   ├── _common.py              # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py           # Trace evaluation
│   ├── list_traces.py          # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/             # Python package
│       ├── cli/                # CLI commands module
│       ├── fixtures/           # Test fixture setup
│       ├── scorers/            # Evaluation scorers
│       ├── grp/                # Generate-Review-Promote pipeline
│       └── runners/            # Evaluation runners
├── skills/                     # Per-skill test definitions
│   ├── _routing/               # Routing test cases
│   └── {skill-name}/           # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                      # Unit tests
├── references/                 # Documentation references
└── baselines/                  # Regression baselines
```

## References

- [Scorers](references/scorers.md) - Available scorers and quality gates
- [YAML Schemas](references/yaml-schemas.md) - Manifest and ground truth formats
- [Python API](references/python-api.md) - Programmatic usage examples
- [Workflows](references/workflows.md) - Detailed example workflows
- [Trace Evaluation](references/trace-eval.md) - Session trace analysis