# extract-page-shallow > Extract complete content from a single webpage and its direct links only (depth 0 + depth 1). Does NOT recursively follow links beyond the first level. Use when you need focused extraction without deep crawling. - Author: JMBeh - Repository: mostly-coherent/Helpful-Prompts - Version: 20260204021131 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/mostly-coherent/Helpful-Prompts - Web: https://mule.run/skillshub/@@mostly-coherent/Helpful-Prompts~extract-page-shallow:20260204021131 --- # Extract Page Shallow Extract complete content from a single webpage and its direct links only (depth 0 + depth 1). Does NOT recursively follow links beyond the first level. Use when you need focused extraction without deep crawling. ## When to Use Use this skill when: - User requests extracting a webpage and its immediate linked pages - User wants shallow extraction (page + direct links only) - User mentions "extract this page and links" or "shallow extraction" - User wants to avoid deep recursive crawling **Do NOT use when:** - User needs deep recursive extraction (use extract-webpage-content instead) - User wants to browse or navigate pages manually ## Execution Model **Execute autonomously** - Complete the entire workflow without user approval for each action. **Required tools:** Playwright MCP (`user-playwright`) ## Scope **Depth 0:** Starting page (the URL provided by user) **Depth 1:** Direct links found on the starting page **Depth 2+:** NOT EXTRACTED (this is the key difference from extract-webpage-content) ## Workflow 1. **Navigate to starting page** (Depth 0) - Clean up Chrome processes: `pkill -f "mcp-chrome-" && sleep 2` - Navigate to target URL - Expand all dynamic content (accordions, tabs, etc.) 2. **Extract starting page content** (Depth 0) - Extract all text content (headings, paragraphs, lists) - Capture full-page screenshot - Extract all internal links found on this page - Save to folder: `[Page_Title]/[Page_Title]_Full_Content.md` 3. **Extract direct linked pages** (Depth 1) - For each internal link found on starting page: - Navigate to linked page - Expand dynamic content - Extract text content - Capture screenshot if needed - Save to subfolder: `[Page_Title]/[Linked_Page_Title]/[Linked_Page_Title]_Full_Content.md` - **DO NOT extract links from these depth 1 pages** (no depth 2) 4. **Progress tracking** - Report: "Extracted depth 0: [starting page]" - Report: "Extracting depth 1: [X] direct links found" - Report progress every 5-10 pages 5. **Save output** - Markdown files in nested folder structure - Images saved alongside markdown files - No state file needed (small scope, runs quickly) ## Output Format **Directory Structure:** ``` [Page_Title]/ ├── [Page_Title]_Full_Content.md ├── [page-title]-image-1.png ├── [Linked_Page_1]/ │ └── [Linked_Page_1]_Full_Content.md ├── [Linked_Page_2]/ │ └── [Linked_Page_2]_Full_Content.md └── ... ``` **Markdown Format:** - Images inserted at position in content flow - Images saved as files in same folder - Document references: `![alt](filename.png)` (relative path) ## Key Differences from extract-webpage-content | Feature | extract-page-shallow | extract-webpage-content | |---------|---------------------|-------------------------| | **Depth** | Depth 0 + 1 only | Depth 0 + 1 + 2 | | **Link following** | Direct links only | Links + links-from-links | | **State persistence** | Not needed (quick) | Required (long-running) | | **Use case** | Focused extraction | Comprehensive site crawl | | **Completion time** | Minutes | Hours (potentially) | ## Error Handling **Navigation errors:** - Retry up to 3 times - Mark page as error, continue with next page - DO NOT stop entire extraction **Authentication required:** - Log warning: "Page requires authentication" - Skip page, continue with remaining pages **404/403/500 errors:** - Log error with status code - Skip page, continue with remaining pages ## Requirements 1. **Autonomous execution** - No user approval needed between pages 2. **Complete extraction** - Expand all dynamic elements 3. **Filter decorative images** - Only content images (skip logos, icons, nav) 4. **Shallow only** - NEVER extract beyond depth 1 5. **Structured output** - Nested folders with descriptive names 6. **Progress reporting** - Log every 5-10 pages ## Completion Criteria Work is complete when: 1. ✅ Starting page (depth 0) extracted 2. ✅ All direct links (depth 1) extracted or marked as error 3. ✅ All output files created and non-empty 4. ✅ NO depth 2 pages extracted ## Example Usage **User:** "Extract this page and its direct links: https://example.com/course" **Agent:** 1. Extracts https://example.com/course (depth 0) 2. Finds 15 internal links on the page 3. Extracts each of those 15 pages (depth 1) 4. **STOPS** - does not follow links from those 15 pages 5. Reports: "Extracted 1 page at depth 0, 15 pages at depth 1. Total: 16 pages." ## Technical Implementation See [REFERENCE.md](REFERENCE.md) for: - Complete extraction algorithm - Dynamic content expansion logic - Link filtering and normalization - Error recovery procedures