# prompt-evaluation

> Prompt testing, metrics, and A/B testing frameworks

- Author: pluginagentmarketplace
- Repository: pluginagentmarketplace/custom-plugin-prompt-engineering
- Version: 20260105115554
- Stars: 2
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/pluginagentmarketplace/custom-plugin-prompt-engineering
- Web: https://mule.run/skillshub/@@pluginagentmarketplace/custom-plugin-prompt-engineering~prompt-evaluation:20260105115554

---

---
name: prompt-evaluation
description: Prompt testing, metrics, and A/B testing frameworks
sasmp_version: "1.3.0"
bonded_agent: 06-evaluation-testing-agent
bond_type: PRIMARY_BOND
---

# Prompt Evaluation Skill

**Bonded to:** `evaluation-testing-agent`

---

## Quick Start

```bash
Skill("custom-plugin-prompt-engineering:prompt-evaluation")
```

---

## Parameter Schema

```yaml
parameters:
  evaluation_type:
    type: enum
    values: [accuracy, consistency, robustness, efficiency, ab_test]
    required: true
  test_cases:
    type: array
    items:
      input: string
      expected: string|object
      category: string
    min_items: 5
  metrics:
    type: array
    items: string
    default: [accuracy, consistency]
```

---

## Evaluation Metrics

| Metric | Formula | Target |
|--------|---------|--------|
| Accuracy | correct / total | > 0.90 |
| Consistency | identical_runs / total_runs | > 0.85 |
| Robustness | edge_cases_passed / edge_cases_total | > 0.80 |
| Format Compliance | valid_format / total | > 0.95 |
| Token Efficiency | quality_score / tokens_used | Maximize |

---

## Test Case Framework

### Test Case Schema

```yaml
test_case:
  id: "TC001"
  category: "happy_path|edge_case|adversarial|regression"
  description: "What this test verifies"
  priority: "critical|high|medium|low"

  input:
    user_message: "Test input text"
    context: {}  # Optional additional context

  expected:
    contains: ["required", "phrases"]
    not_contains: ["forbidden", "content"]
    format: "json|text|markdown"
    schema: {}  # Optional JSON schema

  evaluation:
    metrics: ["accuracy", "format_compliance"]
    threshold: 0.9
    timeout: 30
```

### Test Categories

```yaml
categories:
  happy_path:
    description: "Standard expected inputs"
    coverage_target: 40%
    examples:
      - typical_user_query
      - common_use_case
      - expected_format

  edge_cases:
    description: "Boundary conditions"
    coverage_target: 25%
    examples:
      - empty_input
      - very_long_input
      - special_characters
      - unicode_text
      - minimal_input

  adversarial:
    description: "Attempts to break prompt"
    coverage_target: 20%
    examples:
      - injection_attempts
      - conflicting_instructions
      - ambiguous_requests
      - malformed_input

  regression:
    description: "Previously failed cases"
    coverage_target: 15%
    examples:
      - fixed_bugs
      - known_edge_cases
```

---

## Scoring Rubric

```yaml
accuracy_rubric:
  1.0: "Exact match to expected output"
  0.8: "Semantically equivalent, minor differences"
  0.5: "Partially correct, major elements present"
  0.2: "Related but incorrect"
  0.0: "Completely wrong or off-topic"

consistency_rubric:
  1.0: "Identical across all runs"
  0.8: "Minor variations (punctuation, formatting)"
  0.5: "Significant variations, same meaning"
  0.2: "Different approaches, inconsistent"
  0.0: "Completely different each time"

robustness_rubric:
  1.0: "Handles all edge cases correctly"
  0.8: "Fails gracefully with clear error messages"
  0.5: "Some edge cases cause issues"
  0.2: "Many edge cases fail"
  0.0: "Breaks on most edge cases"
```

---

## A/B Testing Framework

```yaml
ab_test_config:
  name: "Prompt Variant Comparison"
  hypothesis: "New prompt improves accuracy by 10%"

  variants:
    control:
      prompt: "{original_prompt}"
      allocation: 50%
    treatment:
      prompt: "{new_prompt}"
      allocation: 50%

  metrics:
    primary:
      - name: accuracy
        min_improvement: 0.05
        significance: 0.05
    secondary:
      - token_count
      - response_time
      - user_satisfaction

  sample:
    minimum: 100
    power: 0.8

  stopping_rules:
    - condition: significant_regression
      action: stop_treatment
    - condition: clear_winner
      threshold: 0.99
      action: early_stop
```
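The `significance: 0.05` setting above implies a statistical test over the two variants' success counts. As a minimal, self-contained sketch (a pooled two-proportion z-test; the function name and example counts are illustrative, not part of this skill's API):

```python
import math

def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> float:
    """One-sided p-value that variant B's success rate exceeds
    variant A's, via a pooled two-proportion z-test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variance: cannot claim a difference
    z = (p_b - p_a) / se
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # 1 - Phi(z)

# Illustrative counts: control 82/100 correct vs. treatment 91/100
p_value = two_proportion_z_test(82, 100, 91, 100)
print(f"p = {p_value:.4f}")  # ~0.031 -> significant at the 0.05 level
```

With these counts the one-sided p-value is about 0.03, so the treatment would clear the `significance: 0.05` bar; the `stopping_rules` still apply if a significant regression appears before the minimum sample of 100 is reached.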
"{new_prompt}" allocation: 50% metrics: primary: - name: accuracy min_improvement: 0.05 significance: 0.05 secondary: - token_count - response_time - user_satisfaction sample: minimum: 100 power: 0.8 stopping_rules: - condition: significant_regression action: stop_treatment - condition: clear_winner threshold: 0.99 action: early_stop ``` --- ## Evaluation Report Template ```yaml report: metadata: prompt_name: "string" version: "string" date: "ISO8601" evaluator: "string" summary: overall_score: 0.87 status: "PASS|FAIL|REVIEW" recommendation: "string" metrics: accuracy: 0.92 consistency: 0.88 robustness: 0.79 format_compliance: 0.95 test_results: total: 50 passed: 43 failed: 7 pass_rate: 0.86 failures: - id: "TC023" category: "edge_case" issue: "Description of failure" severity: "high|medium|low" input: "..." expected: "..." actual: "..." recommendations: - priority: high action: "Specific improvement" - priority: medium action: "Another improvement" ``` --- ## Evaluation Workflow ```yaml workflow: 1_prepare: - Load prompt under test - Load test cases - Configure metrics 2_execute: - Run all test cases - Record outputs - Measure latency 3_score: - Compare against expected - Calculate metrics - Identify patterns 4_analyze: - Categorize failures - Find root causes - Prioritize issues 5_report: - Generate report - Make recommendations - Archive results ``` --- ## Troubleshooting | Issue | Cause | Solution | |-------|-------|----------| | All tests fail | Wrong expected outputs | Review test design | | Flaky tests | Non-determinism | Lower temperature | | False positives | Lenient matching | Tighten criteria | | False negatives | Strict matching | Use semantic similarity | | Slow evaluation | Too many tests | Sample strategically | --- ## Integration ```yaml integrates_with: - prompt-optimization: Improvement feedback loop - all skills: Quality gate before deployment automation: ci_cd: "Run on prompt changes" scheduled: "Weekly regression" triggered: "On demand" ``` --- ## References See `references/GUIDE.md` for evaluation methodology. See `scripts/helper.py` for automation utilities.