# extract-sitemap > Map all links from a starting webpage recursively up to 2 hops deep (depth 0, 1, 2 only - depth 3 is descoped), check link accessibility, and generate a sitemap markdown file. Use when the user wants to preview what pages would be extracted, discover site structure, or identify dead links before running extract-webpage-content. - Author: JMBeh - Repository: mostly-coherent/Helpful-Prompts - Version: 20260204021131 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/mostly-coherent/Helpful-Prompts - Web: https://mule.run/skillshub/@@mostly-coherent/Helpful-Prompts~extract-sitemap:20260204021131 --- --- name: extract-sitemap description: Map all links from a starting webpage recursively up to 2 hops deep (depth 0, 1, 2 only - depth 3 is descoped), check link accessibility, and generate a sitemap markdown file. Use when the user wants to preview what pages would be extracted, discover site structure, or identify dead links before running extract-webpage-content. --- # Extract Sitemap Map all links from a starting webpage recursively up to 2 hops deep, check accessibility of each link, and generate a structured sitemap markdown file. ## When to Use Use this skill when: - User wants to preview what pages would be extracted before running `extract-webpage-content` - User needs to discover site structure and understand link relationships - User wants to identify dead links (404, 403, etc.) before extraction - User mentions "map links," "sitemap," "preview extraction," or "check dead links" **Do NOT use when:** - User wants to extract actual page content (use `extract-webpage-content` instead) - User only wants to check a single page's links (use browser tools instead) ## Execution Model **Execute autonomously** - Complete the entire workflow without user approval for each action. **CRITICAL - NO CONFIRMATION REQUESTS:** - ❌ **NEVER ask** "Should I continue?" or "Do you want me to pause?" - ❌ **NEVER ask** "There are X pages remaining, continue?" - ❌ **NEVER ask** for approval between pages or batches - ✅ **ALWAYS continue** until queue is empty and sitemap is generated - ✅ **ALWAYS complete** the entire workflow automatically - ✅ **ONLY report** progress updates, never ask for permission **Required tools:** Playwright MCP (`user-playwright`) - does NOT require per-action approval **Do NOT use:** cursor-ide-browser MCP - requires per-action approval (not suitable) ## Workflow 1. **Resume Logic (MANDATORY AT START)** - Check for existing state, resume if found (see Resume Logic section) 2. **Clean up Chrome processes** - Run `pkill -f "mcp-chrome-" && sleep 2` automatically before browser operations 3. **Initialize or resume** - Either start fresh or resume from saved state 4. **Navigate** - Install browser if needed, create tab, navigate to starting URL (or next URL in queue) 5. **Extract links** - Find all internal links using script from `references/link-extraction.js` 6. **Check accessibility** - Verify each link using script from `references/status-check.js` 7. **Process queue exhaustively** - Follow queue-based algorithm (see Recursive Mapping section) until queue is empty 8. **Progress tracking** - Log progress after every 10-20 pages (see Progress Tracking section) 9. **Completion gates** - Verify all gates pass before proceeding (see Completion Gates section) 10. **Generate sitemap** - Format as markdown table and save to workspace root (only after gates pass) 11. **Clean up** - Delete `sitemap-state.json` after successful completion ## Output Format The sitemap file must follow this exact structure: ```markdown # Sitemap: [Starting Page Title] **Starting URL:** [URL] **Generated:** [Date/Time] **Last Updated:** [Date] **Total Pages Mapped:** [count] **Total Links Found:** [count] **Dead Links:** [count] **Max Depth:** 2 hops ## Summary Statistics - ✅ Accessible Pages: [count] - ❌ Dead Links (404): [count] - 🔒 Forbidden (403): [count] - ⚠️ Server Errors (500): [count] - ⏱️ Timeouts: [count] - 📄 Files: [count] --- ## Pages | Title | URL | Depth | Status | Links Found | |-------|-----|-------|--------|-------------| | [Page Title] | [URL] | 0 | ✅ Accessible (200) | [count] | | [Page Title] | [URL] | 1 | ✅ Accessible (200) | [count] | | [Page Title] | [URL] | 2 | ❌ Dead Link (404) | N/A | ``` **Status values:** - `✅ Accessible (200)` - Link works, page loaded successfully - `❌ Dead Link (404)` - Page not found - `🔒 Forbidden (403)` - Access denied - `⚠️ Server Error (500)` - Server error - `⏱️ Timeout` - Navigation timed out - `📄 File` - Downloadable file (PDF, DOC, etc.) **Depth values:** - `0` - Starting page - `1` - Links from starting page - `2` - Links from depth 1 pages (maximum depth - depth 3 is descoped and not processed) **File naming:** `sitemap-[sanitized-starting-url]-[timestamp].md` saved to workspace root ## Requirements 1. **Autonomous execution** - Execute entire workflow automatically without user approval - **NEVER ask for confirmation** - Continue processing until complete - **NEVER pause for approval** - Process all pages autonomously - **ONLY report progress** - Inform user of status, don't ask permission 2. **Complete link discovery** - Find ALL internal links on each page 3. **Exhaustive recursive mapping** - Follow ALL internal HTML page links recursively up to 2 hops deep - DO NOT stop until queue is empty 4. **Accessibility checking** - Check HTTP status for every link discovered 5. **Track visited URLs** - Prevent duplicate checks and infinite loops (normalize URLs: remove hash fragments and trailing slashes) 6. **State persistence** - Save progress to `sitemap-state.json` after every 10-20 pages 7. **Completion verification** - Verify queue is empty and all links processed before generating sitemap 8. **Table output** - Format as markdown table with columns: Title, URL, Depth, Status, Links Found ## Link Extraction Use the script in `references/link-extraction.js` via `browser_evaluate`: - Extracts all internal links (same domain) - Identifies downloadable files (PDFs, DOCs, etc.) - mapped but not recursively followed - Normalizes URLs (removes hash fragments, trailing slashes) - Filters out anchors (#), javascript:, mailto:, tel: links ## Accessibility Checking Use the script in `references/status-check.js` via `browser_evaluate`: - Navigate to each link using `browser_navigate` - Wait for page load: `browser_wait_for(time: 3)` - Check for error indicators (404, 403, 500 in page content) - Record status code and accessibility ## Recursive Mapping **CRITICAL:** Execute recursive mapping for ALL internal HTML page links found. This MUST be completed exhaustively - do NOT stop until all discovered links have been processed. ### Queue-Based Algorithm (REQUIRED) **You MUST implement this exact algorithm to ensure completeness. See `references/QUEUE_ALGORITHM.md` for detailed pseudocode.** 1. **Initialize state tracking:** - Create `visitedUrls` Set (normalized URLs already processed) - Create `pages` Array (all processed pages with metadata) - Create `queue` Array/Deque (URLs to process: `{url, depth}`) - Create `sitemap-state.json` file to persist state between tool calls 2. **Process starting page (depth 0):** - Navigate to starting URL - Extract links using `link-extraction.js` - Check accessibility using `status-check.js` - Add to `pages` array - Add starting URL to `visitedUrls` - Add ALL discovered internal HTML links to `queue` with `depth: 1` - Save state to `sitemap-state.json` 3. **Process queue until empty (CRITICAL - DO NOT STOP EARLY):** ``` WHILE queue is not empty: a. Pop next item from queue: {url, depth} b. Normalize URL (remove hash, trailing slash) c. IF normalized URL in visitedUrls → skip (continue to next) d. IF depth >= 2 → skip (max depth reached - depth 3 is descoped and not processed) e. Add normalized URL to visitedUrls f. Navigate to URL g. Check accessibility (handle errors/timeouts) h. IF accessible: - Extract links using link-extraction.js - Add page to pages array with: {title, url, depth, status, linksFound: count} - IF depth < 2: - Add ALL discovered internal HTML links to queue with depth: depth + 1 i. IF error/404/403: - Add page to pages array with: {title, url, depth, status: error, linksFound: 0} - Do NOT add links to queue (dead/forbidden pages) j. Save state to sitemap-state.json (after every 10-20 pages) k. Continue to next item in queue l. **NEVER ask for confirmation** - Process automatically until queue is empty ``` 4. **Completion verification (BEFORE generating sitemap):** - Verify queue is empty: `queue.length === 0` - Verify all discovered links processed: Check that every URL in `pages` array has been fully processed - Count total links discovered vs processed: - Sum all `linksFound` values from `pages` - Verify all those links are either in `visitedUrls` or marked as dead/forbidden - If queue is NOT empty, continue processing until empty 5. **For file downloads:** - Check accessibility but do NOT add to queue - Add to `pages` array with status `📄 File` - Do NOT recursively follow links from files ### State Persistence **Save state after every batch (10-20 pages) to `sitemap-state.json`:** ```json { "startingUrl": "https://...", "startingTitle": "...", "visitedUrls": ["url1", "url2", ...], "pages": [ {"title": "...", "url": "...", "depth": 0, "status": "✅ Accessible (200)", "linksFound": 16}, ... ], "queue": [ {"url": "...", "depth": 1}, ... ], "lastUpdated": "2026-01-28T14:00:00Z" } ``` **If interrupted:** Load state from `sitemap-state.json` and continue from where you left off. ## Resume Logic (MANDATORY AT START) **ALWAYS execute this at the very start, before any processing:** ```python def start_workflow(): state_file = 'sitemap-state.json' # Step 1: Check for existing state if file_exists(state_file): state = load_state(state_file) print(f"Resuming from saved state: {len(state['pages'])} pages processed, {len(state['queue'])} URLs remaining") print(f"Last updated: {state['lastUpdated']}") # Verify state is valid if state['queue']: print(f"Continuing from queue: {len(state['queue'])} URLs to process") return state # Resume from here else: print("State file exists but queue empty. Verifying completion...") # Run completion verification if verify_completion(state): print("Previous run completed. Starting fresh.") delete_state_file(state_file) return None # Start fresh else: print("Previous run incomplete. Resuming...") return state # Resume # Step 2: Start fresh print("No saved state found. Starting fresh.") return None ``` **This logic MUST execute at the very start, before any processing. Always check for saved state first.** ### Progress Tracking (MANDATORY) **Log progress after every batch (10-20 pages) - This is MANDATORY, not optional:** ```python def log_progress(state): processed = len(state['pages']) remaining = len(state['queue']) depth_breakdown = { 0: sum(1 for p in state['pages'] if p['depth'] == 0), 1: sum(1 for p in state['pages'] if p['depth'] == 1), 2: sum(1 for p in state['pages'] if p['depth'] == 2) } status_breakdown = { 'accessible': sum(1 for p in state['pages'] if '✅' in p['status']), 'dead': sum(1 for p in state['pages'] if '❌' in p['status']), 'forbidden': sum(1 for p in state['pages'] if '🔒' in p['status']) } print(f"Progress: {processed} pages processed, {remaining} URLs remaining in queue") print(f"Depth breakdown: Depth 0: {depth_breakdown[0]}, Depth 1: {depth_breakdown[1]}, Depth 2: {depth_breakdown[2]}") print(f"Status breakdown: Accessible: {status_breakdown['accessible']}, Dead: {status_breakdown['dead']}, Forbidden: {status_breakdown['forbidden']}") # CRITICAL: If remaining > 0, explicitly state work is NOT complete if remaining > 0: print(f"⚠️ WARNING: {remaining} URLs still pending. Work is NOT complete. Continue processing.") else: print("✅ Queue empty. Proceeding to completion verification.") ``` **Progress logging is MANDATORY after every 10-20 pages. Never skip this step.** ### Common Mistakes to Avoid ❌ **DO NOT:** - Stop processing when you've "demonstrated the pattern" - process ALL links - Skip URLs because "there are too many" - process exhaustively - Mark URLs as "visited" before actually processing them - Generate sitemap while queue still has items - Forget to extract links from accessible pages before moving on ✅ **DO:** - Process queue until completely empty - Extract ALL links from every accessible page - Verify completion before generating final sitemap - Save state frequently to prevent data loss - Continue processing even if it takes many iterations ## Error Recovery (MANDATORY) **When errors occur, follow this recovery procedure. NEVER stop processing due to errors.** ### Navigation Errors 1. **Error:** "Failed to launch browser process" - **Action:** Run `pkill -f "mcp-chrome-" && sleep 2` automatically - **Action:** Retry navigation (up to 3 times) - **If still fails:** Mark URL as error, continue with next URL (DO NOT stop) - **Log:** "Navigation failed for [URL]. Marking as error and continuing." 2. **Error:** Navigation timeout - **Action:** Mark URL as timeout, continue with next URL - **Action:** Log error but continue processing queue - **Log:** "Timeout for [URL]. Marking as timeout and continuing." 3. **Error:** 404/403/500 - **Action:** Mark URL with appropriate status (❌ Dead Link (404), 🔒 Forbidden (403), ⚠️ Server Error (500)) - **Action:** Continue with next URL (DO NOT stop) - **Log:** "Status [code] for [URL]. Marking appropriately and continuing." ### State Corruption 1. **Error:** Cannot load state file - **Action:** Backup corrupted state file first: `mv sitemap-state.json sitemap-state.json.backup` - **Action:** Start fresh (backup corrupted state file first) - **Action:** Log warning but continue - **Log:** "State file corrupted. Backed up and starting fresh." ### Critical Rule **NEVER stop processing due to errors. Always continue with next item in queue. Errors are logged but do not halt execution.** ## Troubleshooting If navigation fails: 1. ✅ Chrome processes cleaned? → Run `pkill -f "mcp-chrome-" && sleep 2` FIRST 2. ✅ Browser installed? → Call `browser_install` 3. ✅ Tab exists? → Call `browser_tabs(action: "list")`, create if needed 4. ✅ URL valid? → Check includes `https://`, no typos 5. ✅ Authenticated? → For internal sites, ensure user is logged in **Priority:** Always check for conflicting Chrome processes FIRST before attempting browser operations. ## Completion Gates (MANDATORY) **BEFORE marking task complete, ALL gates must pass. If ANY gate fails, work is NOT complete.** ### Gate 1: State Verification ```python def verify_gate_1(state): # Load state if not file_exists('sitemap-state.json'): print("❌ Gate 1 FAILED: State file missing. Cannot verify completion.") return False state = load_state('sitemap-state.json') # Verify queue empty if len(state['queue']) > 0: print(f"❌ Gate 1 FAILED: Queue not empty: {len(state['queue'])} items remaining. Continue processing.") return False print("✅ Gate 1 PASSED: Queue is empty.") return True ``` **If Gate 1 fails:** Continue processing queue until empty. DO NOT mark complete. ### Gate 2: Progress Verification ```python def verify_gate_2(state): # Count processed pages processed = len(state['pages']) # Count discovered links total_discovered = sum(p['linksFound'] for p in state['pages']) # Count processed URLs total_processed = len(state['visitedUrls']) # Verify all discovered links processed # (Some links may be dead/forbidden, so processed count may be less than discovered) # But every discovered link should be either in visitedUrls or marked as dead/forbidden dead_forbidden_count = sum(1 for p in state['pages'] if '❌' in p['status'] or '🔒' in p['status'] or '⚠️' in p['status'] or '⏱️' in p['status']) if total_processed + dead_forbidden_count < total_discovered: print(f"❌ Gate 2 FAILED: Not all links processed. Discovered: {total_discovered}, Processed: {total_processed}, Dead/Forbidden: {dead_forbidden_count}. Continue processing.") return False print(f"✅ Gate 2 PASSED: All discovered links processed ({total_discovered} discovered, {total_processed} processed, {dead_forbidden_count} dead/forbidden).") return True ``` **If Gate 2 fails:** Process remaining links. DO NOT mark complete. ### Gate 3: Output Verification ```python def verify_gate_3(state, output_file): # Verify output file exists if not file_exists(output_file): print(f"❌ Gate 3 FAILED: Output file missing: {output_file}. Generate output before marking complete.") return False # Verify output file is non-empty if file_size(output_file) == 0: print(f"❌ Gate 3 FAILED: Output file empty: {output_file}. Generate output before marking complete.") return False # Verify output contains all processed pages output_content = read_file(output_file) pages_in_output = output_content.count('|') // 6 # Approximate count (6 columns per row) if pages_in_output < len(state['pages']): print(f"❌ Gate 3 FAILED: Output incomplete. Pages processed: {len(state['pages'])}, Pages in output: {pages_in_output}. Regenerate output.") return False print(f"✅ Gate 3 PASSED: Output file exists, is non-empty, and contains all processed pages.") return True ``` **If Gate 3 fails:** Generate/fix output. DO NOT mark complete. ### Completion Gate Execution ```python def run_completion_gates(state, output_file): gates = [ ('Gate 1: State Verification', verify_gate_1, state), ('Gate 2: Progress Verification', verify_gate_2, state), ('Gate 3: Output Verification', verify_gate_3, state, output_file) ] all_passed = True for gate_name, gate_func, *args in gates: print(f"\nRunning {gate_name}...") if not gate_func(*args): all_passed = False print(f"{gate_name} FAILED. Fix issues and re-run verification.") if all_passed: print("\n✅ ALL COMPLETION GATES PASSED. Work is complete.") return True else: print("\n❌ ONE OR MORE GATES FAILED. Work is NOT complete. Fix issues and re-run verification.") return False ``` **If ANY gate fails:** Fix issues, re-run verification, DO NOT mark complete. ## Never Stop Early (MANDATORY) **CRITICAL RULES to prevent early stopping:** 1. **Queue Check:** - Before marking complete → Verify `queue.length === 0` - If queue NOT empty → Continue processing (DO NOT mark complete) - Log: "Queue has X items remaining. Continuing processing." 2. **Progress Check:** - Before marking complete → Verify all discovered links processed - If gaps found → Continue processing (DO NOT mark complete) - Log: "X links discovered but Y not processed. Continuing." 3. **State Check:** - Before marking complete → Load state file - Verify state matches current progress - If mismatch → Fix state, continue processing 4. **Output Check:** - Before marking complete → Verify output file exists and is complete - If incomplete → Continue processing (DO NOT mark complete) **If ANY check fails, work is NOT complete. Continue processing.** ## Completion Criteria (ALL MUST BE TRUE) **Work is complete ONLY when ALL criteria are true:** 1. ✅ **Queue empty:** `queue.length === 0` 2. ✅ **All links processed:** Every discovered link either processed OR marked as dead/forbidden 3. ✅ **State saved:** State file reflects current progress 4. ✅ **Output generated:** Output file exists, is non-empty, contains all processed pages 5. ✅ **No pending work:** No URLs remaining to process 6. ✅ **Verification passed:** All completion gates passed **If ANY criterion is false → Work is NOT complete. Continue processing.** **Before marking complete, verify ALL criteria explicitly:** ```python def is_complete(state, output_file): criteria = { 'queue_empty': len(state['queue']) == 0, 'all_links_processed': verify_gate_2(state), 'state_saved': file_exists('sitemap-state.json') and state_matches_progress(state), 'output_generated': file_exists(output_file) and file_size(output_file) > 0, 'no_pending_work': len(state['queue']) == 0, 'verification_passed': run_completion_gates(state, output_file) } all_passed = all(criteria.values()) if not all_passed: failed = [k for k, v in criteria.items() if not v] print(f"❌ Completion criteria failed: {failed}") print("Continue processing until all criteria pass.") return False print("✅ All completion criteria passed. Work is complete.") return True ``` ## If Verification Fails **When completion verification fails, follow this procedure:** 1. **Identify failure reason:** - Queue not empty → Continue processing queue - Links not processed → Process remaining links - Output missing/incomplete → Generate/fix output - State corrupted → Fix state or start fresh 2. **Take corrective action:** - If queue not empty → Process remaining URLs - If links not processed → Extract/process missing links - If output incomplete → Regenerate output - If state corrupted → Fix state or start fresh 3. **Re-run verification:** - After fixes → Re-run all completion gates - If still fails → Repeat corrective action - Continue until ALL verification passes 4. **Log actions:** - Log what failed - Log corrective action taken - Log verification result after fix **NEVER mark complete if verification fails. Always fix and re-verify.** ## Handoff to extract-webpage-content The sitemap file can be used as input to `extract-webpage-content`. Users can edit the sitemap file to remove URLs they don't want extracted before running content extraction.