# benchmark-fetcher

> Fetch benchmark performance data from 6 leaderboard websites using Playwright MCP and update model manifests with the latest scores. Supports SWE-bench, TerminalBench, SciCode, LiveCodeBench, MMMU, MMMU Pro, and WebDevArena benchmarks.

- Author: dependabot[bot]
- Repository: aicodingstack/aicodingstack.io
- Version: 20260117041734
- Stars: 28
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/aicodingstack/aicodingstack.io
- Web: https://mule.run/skillshub/@@aicodingstack/aicodingstack.io~benchmark-fetcher:20260117041734

---

---
name: benchmark-fetcher
description: Fetch benchmark performance data from 6 leaderboard websites using Playwright MCP and update model manifests with the latest scores. Supports SWE-bench, TerminalBench, SciCode, LiveCodeBench, MMMU, MMMU Pro, and WebDevArena benchmarks.
---

# Benchmark Fetcher Skill

Automate the fetching of benchmark performance data from leaderboard websites and update model manifests with the latest scores using advanced browser automation.

## Overview

This skill extends benchmark data collection by automating visits to 6 major AI model leaderboard websites, extracting performance scores, and updating model manifests in `manifests/models/` with the latest benchmark data.

**Key Features:**

- **Automated Data Collection**: Uses Playwright MCP to visit and extract data from 6 leaderboard websites
- **Intelligent Model Mapping**: Maps website model names to manifest IDs using configurable mappings
- **Always Overwrite**: Updates manifests with latest benchmark values
- **Error Resilient**: Retry logic with exponential backoff and graceful degradation
- **Comprehensive Reporting**: Detailed completion reports with unmapped models and update statistics

## Supported Benchmarks

| Benchmark | Website | Manifest Field | Format |
|-----------|---------|----------------|--------|
| **SWE-bench** | https://www.swebench.com | `sweBench` | Percentage (0-100) |
| **TerminalBench** | https://www.tbench.ai/leaderboard/terminal-bench/2.0 | `terminalBench` | Decimal (0-1) |
| **MMMU** | https://mmmu-benchmark.github.io/#leaderboard | `mmmu`, `mmmuPro` | Percentage (0-100) |
| **SciCode** | https://scicode-bench.github.io/leaderboard/ | `sciCode` | Percentage (0-100) |
| **LiveCodeBench** | https://livecodebench.github.io/leaderboard.html | `liveCodeBench` | Percentage (0-100) |
| **WebDevArena** | https://web.lmarena.ai/leaderboard | `webDevArena` | Percentage (0-100) |

**Note:** TerminalBench uses a decimal format (0-1 scale), while all other benchmarks use percentage format (0-100 scale).

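
To illustrate the scale difference in the note above, here is a minimal sketch of how a displayed leaderboard value might be converted before it is written to a manifest. The function name `toManifestValue` and the `DECIMAL_FIELDS` set are hypothetical and are not taken from the skill's scripts; the real extractors may handle this differently.

```javascript
// Hypothetical: manifest fields stored on a 0-1 scale.
// Per the table above, only terminalBench uses the decimal format.
const DECIMAL_FIELDS = new Set(['terminalBench']);

// Convert a displayed leaderboard value such as "42.8%" into the
// number stored in the manifest for the given field.
function toManifestValue(field, displayed) {
  const percent = Number.parseFloat(displayed); // "42.8%" parses as 42.8
  if (!Number.isFinite(percent)) return null;   // unparseable value: store nothing
  if (DECIMAL_FIELDS.has(field)) {
    // Round to 4 decimals to avoid floating-point noise from the division.
    return Number((percent / 100).toFixed(4));
  }
  return percent;
}

console.log(toManifestValue('terminalBench', '42.8%')); // 0.428
console.log(toManifestValue('sweBench', '74.4%'));      // 74.4
```

Keeping the list of decimal-scale fields in one place means a future benchmark with the same convention would only need a new entry in that set.
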
## Usage

### Fetch All Benchmarks

Update all model manifests with latest benchmark data from all 6 websites:

```bash
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs
```

### Fetch Specific Benchmarks

Update only specific benchmarks:

```bash
# Fetch only SWE-bench and TerminalBench
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --benchmarks swebench,terminalBench

# Fetch only LiveCodeBench
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --benchmarks liveCodeBench
```

### Fetch for Specific Models

Update benchmarks for specific models only:

```bash
# Update only Claude Sonnet 4.5 and GPT-4o
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --models claude-sonnet-4-5,gpt-4o
```

### Dry Run Mode

Preview what would be updated without actually modifying manifests:

```bash
node .claude/skills/benchmark-fetcher/scripts/fetch-benchmarks.mjs --dry-run
```

## Model Name Mapping

### How Mapping Works

Each benchmark website uses different naming conventions for models. The `references/model-name-mappings.json` file maps website-specific model names to manifest IDs.

**Example mapping:**

```json
{
  "swebench": {
    "websiteModels": {
      "Claude Sonnet 4.5": "claude-sonnet-4-5",
      "GPT-4o": "gpt-4o",
      "Gemini 2.5 Pro": "gemini-2-5-pro"
    }
  }
}
```

### Mapping Strategy

The mapper uses a 3-tier fallback strategy:

1. **Exact match** (case-sensitive): "Claude Sonnet 4.5" → "claude-sonnet-4-5"
2. **Case-insensitive match**: "claude sonnet 4.5" → "claude-sonnet-4-5"
3. **Fuzzy match** (normalized): "Claude-Sonnet-4.5" → "claude-sonnet-4-5"

**Normalization:** Removes spaces, hyphens, and special characters for fuzzy matching.

### Adding New Mappings

When the script reports unmapped models, add them to `references/model-name-mappings.json`:

```json
{
  "swebench": {
    "websiteModels": {
      "New Model Name": "new-model-id"
    }
  }
}
```

## Data Extraction Process

### High-Level Workflow

1. **Load Configuration**: Read mappings, load all model manifests
2. **Initialize Browser**: Start Chrome DevTools MCP browser instance
3. **Visit Websites**: Sequentially visit each benchmark website
4. **Extract Data**: Parse leaderboard tables from page snapshots
5. **Map Models**: Match website model names to manifest IDs
6. **Update Manifests**: Overwrite benchmark values in manifest files
7. **Generate Report**: Show updates, failures, and unmapped models

### Website-Specific Extractors

Each benchmark has a dedicated extractor function in `scripts/lib/benchmark-extractors.mjs`:

- `extractSWEBench()` - Extracts SWE-bench Verified scores
- `extractTerminalBench()` - Extracts TerminalBench 2.0 accuracy (decimal format)
- `extractMMMU()` - Extracts both MMMU and MMMU Pro scores
- `extractSciCode()` - Extracts SciCode benchmark scores
- `extractLiveCodeBench()` - Extracts LiveCodeBench Pass@1 scores
- `extractWebDevArena()` - Extracts WebDevArena scores

### Special Cases

**TerminalBench Format:**

- Website displays percentages (42.8%)
- Must store as decimal: `0.428` (not `42.8`)
- Extractor handles conversion automatically

**MMMU Dual Benchmarks:**

- Single website has both MMMU and MMMU Pro leaderboards
- Extractor returns both in one visit:

```javascript
{ mmmu: Map, mmmuPro: Map }
```

## Update Strategy

### Always Overwrite Policy

The skill uses an **always overwrite** strategy for benchmark values:

- Existing benchmark values are replaced with latest data from websites
- Null values are populated if found on websites
- Non-null values are updated with latest scores
- No confirmation or comparison - latest data always wins

**Rationale:** Benchmark scores represent the latest model performance. Websites are the authoritative source.

### What Gets Preserved

Only benchmark fields are updated. All other manifest fields are preserved:

- ✅ Preserved: `id`, `name`, `description`, `vendor`, `size`, `contextWindow`, etc.
- 🔄 Updated: `benchmarks.sweBench`, `benchmarks.terminalBench`, etc.

### Atomic Updates

Manifests are updated using atomic file writes:

1. Validate JSON structure
2. Write to temporary file (`.tmp`)
3. Atomic rename to target file
4. No partial updates - all or nothing

## Error Handling

### Retry Logic

Each benchmark extraction uses a 3-attempt retry strategy with exponential backoff:

- **Attempt 1**: Direct extraction (immediate)
- **Attempt 2**: Retry after 2 seconds
- **Attempt 3**: Final retry after 4 seconds

After 3 failures:

- Take debug screenshot (`/tmp/benchmark-{id}-error.png`)
- Log error details
- Skip benchmark and continue with others

### Error Categories

**Website Access Errors:**

- Cause: Site down, network timeout, rate limiting
- Handling: Retry 3 times, then skip benchmark

**Extraction Errors:**

- Cause: Page structure changed, data not found
- Handling: Screenshot for debugging, skip benchmark

**Mapping Errors:**

- Cause: Model name not in mapping configuration
- Handling: Log unmapped model, continue with others

**Manifest Update Errors:**

- Cause: File write errors, invalid JSON
- Handling: Atomic write protects against corruption, rollback on error

### Graceful Degradation

The skill continues processing even when errors occur:

- If 1 benchmark fails, others still process
- If 1 model can't be mapped, others still update
- Partial success is better than no success
- Completion report shows exactly what succeeded/failed

## Completion Report

After execution, a detailed report shows:

### Summary Section

```
📊 Benchmark Fetch Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Successfully Fetched (5/6 benchmarks)
  ✓ SWE-bench (swebench.com)
  ✓ TerminalBench (tbench.ai)
  ✓ MMMU + MMMU Pro (mmmu-benchmark.github.io)
  ✓ SciCode (scicode-bench.github.io)
  ✓ LiveCodeBench (livecodebench.github.io)

❌ Failed to Fetch (1/6 benchmarks)
  ✗ WebDevArena (web.lmarena.ai)
    Reason: Timeout after 3 retries
```

### Manifest Updates

```
📝 Manifest Updates

✅ Updated: 15 manifests
  • claude-sonnet-4-5: 3 benchmarks updated
    - sweBench: null → 74.4
    - terminalBench: 0.428 → 0.604
    - liveCodeBench: 47.1 → 52.3
  • gpt-4o: 2 benchmarks updated
    - sweBench: 21.62 → 23.5
    - sciCode: 1.5 → 2.1
```

### Unmapped Models

```
⚠️ Unmapped Models (require manual mapping)

SWE-bench:
  • "Qwen-Coder-2.5" → Add to model-name-mappings.json
  • "DeepSeek-Coder-V2" → Add to model-name-mappings.json

Suggestion: Update references/model-name-mappings.json
```

### Statistics

```
📈 Statistics

Total benchmarks fetched: 247 values
Total manifests updated: 15 files
Execution time: 45.2s
Average time per benchmark: 7.5s
```

### Next Steps

```
✅ Complete!

Next steps:
1. Review updated manifests in manifests/models/
2. Add unmapped models to references/model-name-mappings.json
3. Retry failed benchmarks if needed
4. Run validation: npm run test:validate
5. Commit changes when satisfied
```

## Tool Integration

### Chrome DevTools MCP

The skill uses Chrome DevTools MCP tools for browser automation:

**Navigation:**

```javascript
await mcp__chrome-devtools__navigate_page({
  url: 'https://www.swebench.com',
  type: 'url'
})
```

**Wait for Content:**

```javascript
await mcp__chrome-devtools__wait_for({ text: 'Leaderboard' })
```

**Take Snapshot:**

```javascript
const snapshot = await mcp__chrome-devtools__take_snapshot()
// Parse snapshot.content for leaderboard data
```

**Debug Screenshots:**

```javascript
await mcp__chrome-devtools__take_screenshot({
  filePath: '/tmp/debug-screenshot.png'
})
```

## Best Practices

### Running the Skill

1. **Run during off-peak hours** to avoid rate limiting
2. **Review unmapped models** and update mappings before next run
3. **Validate manifests** after updates: `npm run test:validate`
4. **Check for website changes** if extraction fails repeatedly
5. **Keep mappings updated** as new models appear on leaderboards

### Maintaining Mappings

1. **Check completion reports** for unmapped models
2. **Add mappings immediately** after discovering new models
3. **Use canonical manifest IDs** as mapping targets
4. **Test mappings** with `--models` flag to verify
5. **Document special cases** in mapping file comments

### Troubleshooting

**Extraction fails for a benchmark:**

- Check if website structure changed
- Review debug screenshots in `/tmp/`
- Update extractor logic if needed

**Model not updating:**

- Verify model exists in `manifests/models/`
- Check mapping configuration
- Ensure model appears on leaderboard website

**TerminalBench shows wrong values:**

- Verify decimal format (0.428 not 42.8)
- Check extractor conversion logic
- Validate against website directly

## Files Modified

After running this skill:

1. **Model manifests**: `manifests/models/*.json` - Updated with latest benchmark scores
2. **No other files modified**: The skill only updates benchmark fields in manifests

## Validation

Always validate manifests after updates:

```bash
# Run schema validation
npm run test:validate

# Check that each manifest parses as valid JSON
# (node -c checks JavaScript syntax, so it cannot validate .json files)
for f in manifests/models/*.json; do
  node -e 'JSON.parse(require("fs").readFileSync(process.argv[1], "utf8"))' "$f" || echo "Invalid JSON: $f"
done
```

## Next Steps After Execution

1. **Review updates**: Check manifest changes make sense
2. **Update mappings**: Add newly discovered models to `model-name-mappings.json`
3. **Retry failures**: Re-run with `--benchmarks` for failed benchmarks
4. **Validate**: Run `npm run test:validate` to ensure schema compliance
5. **Commit changes**: Commit updated manifests to repository
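
When adding mappings in step 2, it can help to see how a new entry would resolve. The following is a minimal sketch of the 3-tier lookup described under Mapping Strategy, assuming the `websiteModels` object shape shown in the example mapping. The names `resolveModelId` and `normalize` are hypothetical and do not come from the skill's scripts, so the actual mapper may differ in detail.

```javascript
// Hypothetical normalizer: lowercases and drops everything except
// letters and digits (spaces, hyphens, dots, other special characters).
function normalize(name) {
  return name.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Resolve a website model name to a manifest ID, or null if unmapped.
function resolveModelId(websiteName, websiteModels) {
  // Tier 1: exact, case-sensitive match on the mapping key.
  if (Object.hasOwn(websiteModels, websiteName)) {
    return websiteModels[websiteName];
  }

  const entries = Object.entries(websiteModels);

  // Tier 2: case-insensitive match.
  const lower = websiteName.toLowerCase();
  const caseInsensitive = entries.find(([key]) => key.toLowerCase() === lower);
  if (caseInsensitive) return caseInsensitive[1];

  // Tier 3: fuzzy match on normalized names.
  const target = normalize(websiteName);
  const fuzzy = entries.find(([key]) => normalize(key) === target);
  return fuzzy ? fuzzy[1] : null;
}

// Usage with the example mapping from this document.
const websiteModels = {
  'Claude Sonnet 4.5': 'claude-sonnet-4-5',
  'GPT-4o': 'gpt-4o',
};

console.log(resolveModelId('Claude Sonnet 4.5', websiteModels)); // claude-sonnet-4-5
console.log(resolveModelId('claude sonnet 4.5', websiteModels)); // claude-sonnet-4-5
console.log(resolveModelId('Claude-Sonnet-4.5', websiteModels)); // claude-sonnet-4-5
console.log(resolveModelId('Qwen-Coder-2.5', websiteModels));    // null
```

A `null` result corresponds to a model that would be listed under Unmapped Models in the completion report.
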